In iptables mode, kube-proxy programs DNAT rules that map ClusterIP → Pod IP. In IPVS mode, kube-proxy creates virtual servers and real servers inside the kernel’s IPVS table.
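To see what either mode actually programmed on a node, you can inspect the kernel state directly. A quick check (the per-Service chain suffix is a hash and is shown here as a placeholder):

```bash
# iptables mode: Service VIPs are matched in the KUBE-SERVICES chain of the nat table
sudo iptables -t nat -L KUBE-SERVICES -n | head

# Follow a specific Service chain to see its DNAT targets (chain name is per-Service, placeholder here)
sudo iptables -t nat -L KUBE-SVC-XXXXXXXXXXXXXXXX -n

# IPVS mode: each ClusterIP:port is a virtual server with Pod IPs as real servers
sudo ipvsadm -Ln
```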
With an eBPF-based dataplane, the same Service model is implemented differently:

- Service lookups happen in BPF maps instead of iptables
- Faster updates and better observability
- Requires kernel support and CNI integration
```mermaid
flowchart LR
    Client[Pod] --> BPF[BPF Service Map]
    BPF --> EP[Endpoint Pod IP]
    EP --> App[Container process]
```
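If the cluster happens to run Cilium as its eBPF dataplane (an assumption here; other CNIs expose this differently), the service map can be inspected from the agent Pod:

```bash
# Assumes Cilium; 'ds/cilium' and the kube-system namespace are the Cilium defaults
# Lists the BPF load-balancer map: frontend VIP:port -> backend Pod IPs
kubectl -n kube-system exec ds/cilium -- cilium bpf lb list
```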
## How eBPF works in the kernel (short, practical view)
At a high level, eBPF programs are small, verified bytecode programs that run safely inside the Linux kernel. They are attached to specific hook points (network ingress/egress, socket ops, tracepoints) and can:
- Read packet metadata
- Look up values in BPF maps (shared key/value data structures); a minimal sketch follows below
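As a sketch of that basic shape (all names here are illustrative, and the program just counts packets per IP protocol rather than doing anything Service-specific): a map is declared in the object, the section name picks the hook point, and packet reads must be bounds-checked for the verifier.

```c
// count_proto.c -- illustrative only: per-protocol packet counter at the XDP hook
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);   // shared key/value store
    __uint(max_entries, 256);
    __type(key, __u8);                 // IP protocol number
    __type(value, __u64);              // packet count
} proto_count SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)      // verifier requires explicit bounds checks
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    __u8 key = ip->protocol;
    __u64 one = 1, *count = bpf_map_lookup_elem(&proto_count, &key);
    if (count)
        __sync_fetch_and_add(count, 1);
    else
        bpf_map_update_elem(&proto_count, &key, &one, BPF_ANY);

    return XDP_PASS;                       // let the packet continue up the stack
}

char LICENSE[] SEC("license") = "GPL";
```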
XDP runs at the earliest point (driver RX path) and can drop/redirect packets before the kernel allocates socket buffers. TC runs later (qdisc layer) with richer context but more overhead.
| Aspect | XDP | TC |
| --- | --- | --- |
| Hook point | NIC driver RX | qdisc (ingress/egress) |
| Latency | Lowest | Low (but higher than XDP) |
| Access to skb | No (raw packet) | Yes (skb metadata) |
| Actions | Drop, pass, redirect | Drop, pass, redirect, modify |
| Use cases | DDoS filtering, fast LB | Policy, service routing, observability |
Rule of thumb: use XDP for ultra-fast filtering and TC when you need skb metadata or more complex actions.
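The skb row is the practical difference you feel when writing programs: each hook hands you a different context type. A schematic pair of entry points (names are illustrative, logic omitted):

```c
// Illustrative only: the context type is the main practical difference between the hooks.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int at_driver_rx(struct xdp_md *ctx)
{
    // Raw packet pointers only: ctx->data .. ctx->data_end, no skb exists yet.
    return XDP_PASS;
}

SEC("tc")
int at_qdisc(struct __sk_buff *skb)
{
    // skb metadata is available here (skb->mark, skb->priority, ...),
    // along with helpers for rewriting and redirecting packets.
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
```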
Below is a deliberately small example that shows a map lookup and a simple rewrite. It is simplified and omits full parsing, checksum updates, and error handling.
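A minimal sketch of what such an `xdp_svc.c` could look like (the map layout, the single-backend assumption, and all names are illustrative; a `tc_svc.c` variant would follow the same pattern against `struct __sk_buff`):

```c
// xdp_svc.c -- sketch: look up the destination VIP in a BPF map and rewrite it
// to a backend Pod IP. Checksum updates and full parsing are deliberately omitted.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);        // Service ClusterIP (network byte order)
    __type(value, __u32);      // backend Pod IP (network byte order)
} svc_map SEC(".maps");

SEC("xdp")
int xdp_svc(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // Map lookup: is this destination a Service VIP we know about?
    __u32 vip = ip->daddr;
    __u32 *backend = bpf_map_lookup_elem(&svc_map, &vip);
    if (!backend)
        return XDP_PASS;       // not a Service VIP, leave the packet alone

    // Simple rewrite: point the packet at the backend Pod IP.
    // A real dataplane would also update IP/L4 checksums and pick among
    // multiple endpoints instead of a single backend.
    ip->daddr = *backend;

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```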
```bash
# Build and attach the XDP program
clang -O2 -g -target bpf -c xdp_svc.c -o xdp_svc.o
ip link set dev eth0 xdp obj xdp_svc.o sec xdp

# Build and attach the TC program
clang -O2 -g -target bpf -c tc_svc.c -o tc_svc.o
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf da obj tc_svc.o sec tc
```
**Tail calls (program chaining):** tail calls let one eBPF program jump to another without returning, enabling modular pipelines (e.g., parsing → service lookup → policy). They keep each program within verifier limits and allow large logic to be split across programs.
Key constraints:

- Requires a BPF map of programs (`BPF_MAP_TYPE_PROG_ARRAY`)
- Limited by `MAX_TAIL_CALL_CNT` (32 on most kernels)
- If the target slot is empty, the tail call fails and execution falls through to the instruction after the call (see the sketch below)
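As a rough illustration (the slot index, names, and the empty program bodies are made up here; a loader would populate the array with the target program's fd):

```c
// tail_call_sketch.c -- illustrative: a parse stage tail-calls into a policy stage
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define POLICY_PROG_IDX 1        // arbitrary slot number for this sketch

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 8);
    __type(key, __u32);
    __type(value, __u32);        // program fd, filled in by the loader
} prog_chain SEC(".maps");

SEC("xdp")
int policy_stage(struct xdp_md *ctx)
{
    // ... policy / service lookup logic would live here ...
    return XDP_PASS;
}

SEC("xdp")
int parse_stage(struct xdp_md *ctx)
{
    // ... parsing logic ...

    // Jump to the next stage; on success this call never returns.
    bpf_tail_call(ctx, &prog_chain, POLICY_PROG_IDX);

    // Reached only if the slot is empty or the tail-call limit was hit:
    // execution falls through to the code after the call.
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```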
By default, maps live only as long as the creating process. Pinning stores them in bpffs (usually /sys/fs/bpf) so they survive process exit and can be shared across programs.
```bash
# Mount bpffs (if it is not already mounted)
sudo mount -t bpf bpf /sys/fs/bpf
# List loaded maps and note the id
sudo bpftool map list
# Pin the map so it survives process exit
sudo bpftool map pin id <map-id> /sys/fs/bpf/svc_map
# Inspect the pinned map
sudo bpftool map show pinned /sys/fs/bpf/svc_map
```
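Once pinned, any process with access to bpffs can reopen the map by path. A small userspace sketch using libbpf's syscall wrappers (the path is the one pinned above, and the key/value layout is assumed to match the `__u32 → __u32` service map sketched earlier):

```c
// read_pinned.c -- illustrative: open a pinned map by path and look up a key
// Build with: gcc read_pinned.c -lbpf -o read_pinned   (assumes libbpf is installed)
#include <stdio.h>
#include <bpf/bpf.h>

int main(void)
{
    int fd = bpf_obj_get("/sys/fs/bpf/svc_map");   // path from the pin step above
    if (fd < 0) {
        perror("bpf_obj_get");
        return 1;
    }

    __u32 key = 0;       // key/value types must match the pinned map's definition
    __u32 value = 0;
    if (bpf_map_lookup_elem(fd, &key, &value) == 0)
        printf("key %u -> value %u\n", key, value);
    else
        printf("key %u not found\n", key);

    return 0;
}
```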
- `externalTrafficPolicy: Cluster` → load balance across nodes, client IP may be SNAT'd
- `externalTrafficPolicy: Local` → preserve client IP, but only nodes with local Pods receive traffic (see the commands below)
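To check or change the policy on an existing Service (the Service name `web` is a placeholder; the field only applies to NodePort/LoadBalancer Services):

```bash
# Show the current policy on a Service
kubectl get svc web -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}'

# Switch to Local to preserve the client IP
kubectl patch svc web -p '{"spec": {"externalTrafficPolicy": "Local"}}'
```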
**Endpoint readiness and probe effects:**
Only Ready endpoints receive traffic. A failing readiness probe removes a Pod from Service backends without killing the container. This is why readiness governs routing and liveness governs restart.
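You can watch this happen: when a readiness probe fails, the Pod's address drops out of the ready set in the EndpointSlice while the Pod itself keeps running (`web` and the `app=web` label are placeholders):

```bash
# Ready vs not-ready addresses for a Service's backends
kubectl get endpointslices -l kubernetes.io/service-name=web -o yaml | grep -A3 'conditions:'

# The Pod stays Running even while it is out of rotation
kubectl get pods -l app=web
```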
Kubernetes Service networking is just virtual IPs plus backend lists. The dataplane can be iptables/IPVS or eBPF, but the model stays the same: ClusterIP points to ready endpoints. Once you understand where that translation happens, Service debugging becomes straightforward.
Series: Kubernetes Internals: How the Cluster Actually Works