K3s and kube-vip with Cilium's Egress Gateway feature

New year, new blog post! This time it’s more magic with Cilium, which has a useful new feature that lets you specify a particular IP address from which to egress traffic from your cluster. This is really handy in tightly constrained environments; however, the official guide suggests that you statically configure an IP address on the designated egress gateway node. This is a bit limiting: if that node happens to go down, you might need to wait five minutes for a Pod to be scheduled elsewhere to re-plumb that IP address. In this post we’ll work around that by using the venerable kube-vip instead.

NB: The enterprise version of Cilium 1.11 has an HA feature for the Egress Gateway functionality; see this blog post.

We’ll also heavily lean into Cilium’s support for eBPF by doing away with kube-proxy entirely, but note that this does come with some limitations.

Install K3s

First, let’s set some common options for K3s. We disable the in-built CNI and Klipper (the Service LB), disable kube-proxy and the network policy controller (since the functionality will be handled by Cilium), and also specify an additional IP address - that of a VIP which we’ll configure shortly - as a SAN to be able to access our Kubernetes API:

export K3S_VERSION="v1.22.4+k3s1"
export K3S_OPTIONS="--flannel-backend=none --no-flannel --disable-kube-proxy --disable servicelb --disable-network-policy --tls-san=192.168.20.200"

I’ve got three VMs running openSUSE Leap created on vSphere - cilium{0..2}. Note that I use govc extensively during this article, and I’ll be making use of k3sup to bootstrap my cluster:

k3sup install --cluster --ip $(govc vm.ip /42can/vm/cilium0) --user nick --local-path ~/.kube/cilium.yaml --context cilium --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
k3sup join --ip $(govc vm.ip /42can/vm/cilium1) --server-ip $(govc vm.ip /42can/vm/cilium0) --server --server-user nick --user nick --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
k3sup join --ip $(govc vm.ip /42can/vm/cilium2) --server-ip $(govc vm.ip /42can/vm/cilium0) --server --server-user nick --user nick --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
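
The remaining kubectl commands assume your kubeconfig is pointed at this new cluster; one way to arrange that is to export the file k3sup just wrote and select the context it created:

export KUBECONFIG=~/.kube/cilium.yaml
kubectl config use-context cilium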

At this point the nodes will be in a NotReady state and no Pods will have started, as we have no functioning CNI:

% kubectl get nodes -o wide
NAME      STATUS     ROLES                       AGE     VERSION        INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
cilium0   NotReady   control-plane,etcd,master   2m53s   v1.22.4+k3s1   192.168.20.49    <none>        openSUSE Leap 15.3   5.3.18-57-default   containerd://1.5.8-k3s1
cilium1   NotReady   control-plane,etcd,master   65s     v1.22.4+k3s1   192.168.20.23    <none>        openSUSE Leap 15.3   5.3.18-57-default   containerd://1.5.8-k3s1
cilium2   NotReady   control-plane,etcd,master   24s     v1.22.4+k3s1   192.168.20.119   <none>        openSUSE Leap 15.3   5.3.18-57-default   containerd://1.5.8-k3s1
% kubectl get pods -A                                                                                                            
NAMESPACE     NAME                                     READY   STATUS    RESTARTS   AGE
kube-system   coredns-85cb69466-n7z42                  0/1     Pending   0          2m50s
kube-system   helm-install-traefik--1-xv7q2            0/1     Pending   0          2m50s
kube-system   helm-install-traefik-crd--1-w7w5w        0/1     Pending   0          2m50s
kube-system   local-path-provisioner-64ffb68fd-mfqsr   0/1     Pending   0          2m50s
kube-system   metrics-server-9cf544f65-kqbj4           0/1     Pending   0          2m50s

Add a VIP for the Kubernetes API

Before we go any further and install Cilium, let’s add a VIP for our control plane via kube-vip. Note that we do this as a static Pod (as opposed to a DaemonSet) since we don’t yet have a functioning CNI. To make this happen we need to do a couple of things: first, create the various RBAC-related resources, either via kubectl or by dropping the contents of this file into /var/lib/rancher/k3s/server/manifests; second, generate the manifest for our static Pod, customised slightly for our environment and for K3s. To create the manifest in my case, I ran these commands:

% alias kube-vip="docker run --network host --rm ghcr.io/kube-vip/kube-vip:v0.4.0"
% kube-vip manifest pod \
    --interface eth0 \
    --vip 192.168.20.200 \
    --controlplane \
    --services \
    --arp \
    --leaderElection > kube-vip.yaml

This created a kube-vip.yaml with the IP I want to use for my VIP (192.168.20.200) as well as the interface on my nodes to which it should be bound (eth0). The file then needs a small edit: change the hostPath path to point to /etc/rancher/k3s/k3s.yaml instead of the default /etc/kubernetes/admin.conf, since K3s keeps its kubeconfig in a different location.
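
One quick way to make that edit is with sed (assuming the generated manifest contains the default admin.conf path in its hostPath volume):

% sed -i 's|path: /etc/kubernetes/admin.conf|path: /etc/rancher/k3s/k3s.yaml|' kube-vip.yaml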

With those changes made, we can apply the RBAC manifest and then copy kube-vip.yaml into the default directory for static Pod manifests on each of our nodes, /var/lib/rancher/k3s/agent/pod-manifests:

% kubectl apply -f https://kube-vip.io/manifests/rbac.yaml
serviceaccount/kube-vip created
clusterrole.rbac.authorization.k8s.io/system:kube-vip-role created
clusterrolebinding.rbac.authorization.k8s.io/system:kube-vip-binding created

% for node in cilium{0..2} ;  do cat kube-vip.yaml | ssh $(govc vm.ip $node) 'cat - | sudo tee /var/lib/rancher/k3s/agent/pod-manifests/kube-vip.yaml' ; done

If everything’s working, after a few seconds you should see the kube-vip Pods in the Running state and you should be able to ping your VIP:

% kubectl get pods -n kube-system | grep -i vip
kube-vip-cilium0                         1/1     Running     0          107m
kube-vip-cilium1                         1/1     Running     0          5m9s
kube-vip-cilium2                         1/1     Running     0          2m15s

% ping -c 1 192.168.20.200
PING 192.168.20.200 (192.168.20.200) 56(84) bytes of data.
64 bytes from 192.168.20.200: icmp_seq=1 ttl=63 time=1.78 ms

--- 192.168.20.200 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.781/1.781/1.781/0.000 ms
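
Before installing Cilium, make sure the Cilium Helm chart repository is available (skip this if you’ve already added it):

helm repo add cilium https://helm.cilium.io
helm repo update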

Now we can install Cilium via Helm, specifying the VIP as the Kubernetes API service host to which Cilium should connect:

helm install cilium cilium/cilium --version 1.11 \
  --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=192.168.20.200 \
  --set k8sServicePort=6443 \
  --set egressGateway.enabled=true \
  --set bpf.masquerade=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true
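
Once the Cilium Pods are up and the nodes have gone Ready, we can optionally double-check that the kube-proxy replacement is active by asking one of the agents for its status:

% kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement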

Deploying the kube-vip Cloud Controller

The Traefik Ingress controller deployed as part of K3s creates a Service of type LoadBalancer. Since our install options meant that Klipper isn’t deployed, we need something else to handle these resources and to surface a VIP on our cluster’s behalf. Given we’re using kube-vip for the control plane, we’ll go ahead and use it for this as well. First, install the kube-vip cloud controller component, which will watch for and handle Services of this type:

kubectl apply -f https://raw.githubusercontent.com/kube-vip/kube-vip-cloud-provider/main/manifest/kube-vip-cloud-controller.yaml

Now create a ConfigMap resource with a range that kube-vip will allocate an IP address from by default:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevip
  namespace: kube-system
data:
  range-global: 192.168.20.220-192.168.20.225
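
Save that ConfigMap to a file (I’m calling it kubevip-cm.yaml here; any name will do) and apply it:

kubectl apply -f kubevip-cm.yaml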

Once the kube-vip-cloud-provider-0 Pod in the kube-system namespace is Running, you should see that the LoadBalancer Service for Traefik now has an IP address allocated and is reachable from outside the cluster:

% kubectl get svc traefik -n kube-system
NAME      TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)                      AGE
traefik   LoadBalancer   10.43.21.231   192.168.20.220   80:31186/TCP,443:31184/TCP   16h

% curl 192.168.20.220
404 page not found

NB: The 404 is expected and a valid response

Using and testing Cilium’s Egress Gateway feature

As mentioned, the example in Cilium’s Egress Gateway documentation suggests that you create a Deployment which launches a container somewhere in your cluster and plumbs an IP address onto an interface; this IP then becomes the nominated egress IP. If that node goes down, it might take some time before kube-scheduler reschedules that Pod onto another node and the IP address is reconfigured.

However, we have other options - for example, we can make use of an existing IP address within our cluster, and if it’s one being managed by kube-vip then we can trust that it’ll be failed over and reassigned in a timely and graceful manner. Let’s test using the IP that kube-vip has assigned to our Traefik LoadBalancer Service - 192.168.20.220.

To test this we need an external service running somewhere that we can connect to. I’ve spun up another VM, imaginatively titled test, and this machine has an IP of 192.168.20.70. I’m just going to launch NGINX via Docker:

% echo 'it works' > index.html
% docker run -d --name nginx -p 80:80 -v $(pwd):/usr/share/nginx/html nginx
% docker logs -f nginx

If we attempt to connect from a client in our cluster to this server, we should see something like the following:

% kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
If you don't see a command prompt, try pressing enter.
bash-5.1# curl 192.168.20.70
it works

And from NGINX:

192.168.20.82 - - [07/Dec/2021:12:35:43 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"

Predictably, the source IP is that of the node on which my netshoot container is running (cilium0). Now let’s add a CiliumEgressNATPolicy, which will put configuration in place to set the source IP address to that of my Traefik LoadBalancer for any traffic destined for my external test server’s IP:

% cat egress.yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumEgressNATPolicy
metadata:
  name: egress-sample
spec:
  egress:
  - podSelector:
      matchLabels:
        io.kubernetes.pod.namespace: default
  destinationCIDRs:
  - 192.168.20.70/32
  egressSourceIP: "192.168.20.220"
% kubectl apply -f egress.yaml
ciliumegressnatpolicy.cilium.io/egress-sample created

We can verify the NAT that’s been put in place by running the cilium bpf egress list command from within one of our Cilium Pods:

% kubectl -n kube-system get pods -l k8s-app=cilium
NAME           READY   STATUS    RESTARTS   AGE
cilium-28gdd   1/1     Running   0          17h
cilium-n5wv7   1/1     Running   0          17h
cilium-vfvkd   1/1     Running   0          16h

% kubectl exec -it cilium-vfvkd -n kube-system -- cilium bpf egress list
SRC IP & DST CIDR            EGRESS INFO
10.0.0.80 192.168.20.70/32   192.168.20.220 192.168.20.220   
10.0.2.11 192.168.20.70/32   192.168.20.220 192.168.20.220   

Now let’s run that curl command again from our netshoot container and observe what happens in the NGINX logs:

bash-5.1# curl 192.168.20.70
it works
192.168.20.220 - - [07/Dec/2021:12:47:53 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"

There we can see that the source IP of the request is now the VIP assigned to our Traefik LoadBalancer. This works because behind the scenes, kube-vip is handling the IP address assignment to a specific physical interface on our behalf. For example, I can ssh to this VIP and validate that it’s been assigned to eth0:

% ssh 192.168.20.220
Warning: Permanently added '192.168.20.220' (ED25519) to the list of known hosts.
nick@cilium0:~> ip -br a li eth0
eth0             UP             192.168.20.163/24 192.168.20.200/32 192.168.20.220/32 

As it turns out, it’s currently assigned to the first node in my cluster (which also happens to be the one holding the VIP for the control plane). Let’s see what happens when we shut this node down, and get some timings for how long the failover takes. My tmp-shell Pod is running on cilium2, so I’ll kick off my curl command in a loop running every second, shut cilium0 down, and keep an eye on NGINX:

bash-5.1# while true ; do curl 192.168.20.70 ; sleep 1; done
it works
it works

[..]
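
One way to take the node down is to power the VM off via govc (the same tool used earlier); a graceful shutdown from inside the guest would work just as well:

% govc vm.power -off /42can/vm/cilium0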

And from the NGINX logs, I can see it took a little over two minutes from the time I shut down cilium0 before any further requests arrived:

192.168.20.220 - - [18/Jan/2022:08:31:48 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"
192.168.20.220 - - [18/Jan/2022:08:34:11 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"

Not bad!