K3s and kube-vip with Cilium's Egress Gateway feature
4 January, 2022

New year, new blog post! This time it’s more magic with Cilium, which has a useful new feature that lets you egress traffic from your cluster using a specific source IP address. This is really handy in tightly controlled network environments; however, the official guide suggests that you statically configure an IP address on the designated egress gateway node. This is a bit limiting: if that node happens to go down, you might need to wait five minutes for a Pod to be scheduled elsewhere to re-plumb that IP address. In this post we’ll work around that by using the venerable kube-vip instead.
NB: The enterprise version of Cilium 1.11 has a HA feature for the Egress Gateway functionality, see this blog post.
We’ll also heavily lean into Cilium’s support for eBPF by doing away with kube-proxy entirely, but note that this does come with some limitations.
Install K3s
First, let’s set some common options for K3s. We disable the in-built CNI and Klipper (the Service LB), disable kube-proxy and the network policy controller (since the functionality will be handled by Cilium), and also specify an additional IP address - that of a VIP which we’ll configure shortly - as a SAN to be able to access our Kubernetes API:
export K3S_VERSION="v1.22.4+k3s1"
export K3S_OPTIONS="--flannel-backend=none --no-flannel --disable-kube-proxy --disable servicelb --disable-network-policy --tls-san=192.168.20.200"
I’ve got three VMs running openSUSE Leap, created on vSphere - cilium{0..2}. Note that I use govc extensively throughout this article, and I’ll be making use of k3sup to bootstrap my cluster:
k3sup install --cluster --ip $(govc vm.ip /42can/vm/cilium0) --user nick --local-path ~/.kube/cilium.yaml --context cilium --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
k3sup join --ip $(govc vm.ip /42can/vm/cilium1) --server-ip $(govc vm.ip /42can/vm/cilium0) --server --server-user nick --user nick --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
k3sup join --ip $(govc vm.ip /42can/vm/cilium2) --server-ip $(govc vm.ip /42can/vm/cilium0) --server --server-user nick --user nick --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
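k3sup writes the kubeconfig to ~/.kube/cilium.yaml under a context named cilium (per the --local-path and --context flags above), so the kubectl commands that follow assume a small, implied step along these lines:

% export KUBECONFIG=~/.kube/cilium.yaml
% kubectl config use-context cilium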
At this point nodes will be in NotReady status and no Pods will have started, as we have no functioning CNI:
% kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
cilium0 NotReady control-plane,etcd,master 2m53s v1.22.4+k3s1 192.168.20.49 <none> openSUSE Leap 15.3 5.3.18-57-default containerd://1.5.8-k3s1
cilium1 NotReady control-plane,etcd,master 65s v1.22.4+k3s1 192.168.20.23 <none> openSUSE Leap 15.3 5.3.18-57-default containerd://1.5.8-k3s1
cilium2 NotReady control-plane,etcd,master 24s v1.22.4+k3s1 192.168.20.119 <none> openSUSE Leap 15.3 5.3.18-57-default containerd://1.5.8-k3s1
% kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-85cb69466-n7z42 0/1 Pending 0 2m50s
kube-system helm-install-traefik--1-xv7q2 0/1 Pending 0 2m50s
kube-system helm-install-traefik-crd--1-w7w5w 0/1 Pending 0 2m50s
kube-system local-path-provisioner-64ffb68fd-mfqsr 0/1 Pending 0 2m50s
kube-system metrics-server-9cf544f65-kqbj4 0/1 Pending 0 2m50s
Add a VIP for the Kubernetes API
Before we go any further and install Cilium, let’s add a VIP for our control plane via kube-vip. Note that we do this as a static Pod (as opposed to a DaemonSet) since we don’t yet have a functioning CNI. To make this happen we need to do a couple of things: first, create the various RBAC-related resources, either via kubectl or by dropping the contents of this file into /var/lib/rancher/k3s/server/manifests; second, generate the manifest for our static Pod, customised slightly for our environment and for K3s. To create the manifest in my case, I ran these commands:
% alias kube-vip="docker run --network host --rm ghcr.io/kube-vip/kube-vip:v0.4.0"
% kube-vip manifest pod \
--interface eth0 \
--vip 192.168.20.200 \
--controlplane \
--services \
--arp \
--leaderElection > kube-vip.yaml
This created a kube-vip.yaml with the IP I want to use for my VIP (192.168.20.200) as well as the interface on my nodes to which it should be bound (eth0). The file then needs to be edited so that the hostPath path points to /etc/rancher/k3s/k3s.yaml instead of the default /etc/kubernetes/admin.conf, since the path to the kubeconfig that kube-vip should use is different with K3s.
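If you’d rather not edit the file by hand, a one-liner such as this does the job (a sketch, assuming the default path appears verbatim in the generated manifest):

% sed -i 's|/etc/kubernetes/admin.conf|/etc/rancher/k3s/k3s.yaml|' kube-vip.yaml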
With those changes made, the manifest needs to be copied into the default directory for static Pod manifests on all of our nodes, /var/lib/rancher/k3s/agent/pod-manifests:
% kubectl apply -f https://kube-vip.io/manifests/rbac.yaml
serviceaccount/kube-vip created
clusterrole.rbac.authorization.k8s.io/system:kube-vip-role created
clusterrolebinding.rbac.authorization.k8s.io/system:kube-vip-binding created
% for node in cilium{0..2} ; do cat kube-vip.yaml | ssh $(govc vm.ip $node) 'cat - | sudo tee /var/lib/rancher/k3s/agent/pod-manifests/kube-vip.yaml' ; done
If everything’s working, after a few seconds you should see the kube-vip Pods in the Running state and you should be able to ping your VIP:
% kubectl get pods -n kube-system | grep -i vip
kube-vip-cilium0 1/1 Running 0 107m
kube-vip-cilium1 1/1 Running 0 5m9s
kube-vip-cilium2 1/1 Running 0 2m15s
% ping -c 1 192.168.20.200
PING 192.168.20.200 (192.168.20.200) 56(84) bytes of data.
64 bytes from 192.168.20.200: icmp_seq=1 ttl=63 time=1.78 ms
--- 192.168.20.200 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.781/1.781/1.781/0.000 ms
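Next up is Cilium itself. If you haven’t already added the Cilium Helm chart repository, do so first - an assumed prerequisite, using the standard upstream repository:

% helm repo add cilium https://helm.cilium.io/
% helm repo update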
Now we can install Cilium via Helm, specifying the VIP for the service host to which Cilium should connect:
helm install cilium cilium/cilium --version 1.11 \
--namespace kube-system \
--set kubeProxyReplacement=strict \
--set k8sServiceHost=192.168.20.200 \
--set k8sServicePort=6443 \
--set egressGateway.enabled=true \
--set bpf.masquerade=true \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true
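Once the agent Pods are up, it’s worth confirming that the kube-proxy replacement is active. A quick check (the label selector and cilium status output shown here are what I’d expect from a Cilium 1.11 install):

% kubectl -n kube-system get pods -l k8s-app=cilium
% kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement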
Deploying the kube-vip Cloud Controller
The Traefik Ingress controller deployed as part of K3s creates a Service of type LoadBalancer. Since our install options meant that Klipper isn’t deployed, we need something else to handle these resources and surface a VIP on our cluster’s behalf. Given we’re using kube-vip for the control plane, we’ll go ahead and use it for this as well. First, install the kube-vip cloud controller component, which will watch for and handle Services of this type:
kubectl apply -f https://raw.githubusercontent.com/kube-vip/kube-vip-cloud-provider/main/manifest/kube-vip-cloud-controller.yaml
Now create a ConfigMap resource with a range from which kube-vip will allocate IP addresses by default:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevip
  namespace: kube-system
data:
  range-global: 192.168.20.220-192.168.20.225
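Drop that into a file and apply it (the filename here is just my choice):

% kubectl apply -f kubevip-configmap.yaml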
Once the kube-vip-cloud-provider-0 Pod in the kube-system namespace is Running, you should see that the LoadBalancer Service for Traefik now has an IP address allocated and is reachable from outside the cluster:
% kubectl get svc traefik -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
traefik LoadBalancer 10.43.21.231 192.168.20.220 80:31186/TCP,443:31184/TCP 16h
% curl 192.168.20.220
404 page not found
NB: The 404 is expected and a valid response
Using and testing Cilium’s Egress Gateway feature
As mentioned, Cilium’s example documentation for the Egress Gateway feature suggests that you create a Deployment which launches a container somewhere in your cluster and plumbs an IP address onto an interface, and this IP becomes the nominated egress IP. If that node goes down, it might take some time before kube-scheduler reschedules that Pod onto another node and reconfigures this IP address.
However, we have other options - for example, we can make use of an existing IP address within our cluster, and if it’s one being managed by kube-vip then we can trust that it’ll be failed over and reassigned in a timely and graceful manner. Let’s test using the IP that kube-vip has assigned to our Traefik LoadBalancer Service - 192.168.20.220.
To test this we need an external service running somewhere that we can connect to. I’ve spun up another VM, imaginatively titled test, which has an IP of 192.168.20.70. I’m just going to launch NGINX via Docker:
% echo 'it works' > index.html
% docker run -d --name nginx -p 80:80 -v $(pwd):/usr/share/nginx/html nginx
% docker logs -f nginx
If we attempt to connect from a client in our cluster to this server, we should see something like the following:
% kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
If you don't see a command prompt, try pressing enter.
bash-5.1# curl 192.168.20.70
it works
And from NGINX:
192.168.20.82 - - [07/Dec/2021:12:35:43 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"
Predictably, the source IP is that of the node on which my netshoot container is running (cilium0). Now let’s add a CiliumEgressNATPolicy which will put in place the configuration to set the source IP address to that of my Traefik LoadBalancer for any traffic destined for my external test server’s IP:
% cat egress.yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumEgressNATPolicy
metadata:
  name: egress-sample
spec:
  egress:
  - podSelector:
      matchLabels:
        io.kubernetes.pod.namespace: default
  destinationCIDRs:
  - 192.168.20.70/32
  egressSourceIP: "192.168.20.220"
% kubectl apply -f egress.yaml
ciliumegressnatpolicy.cilium.io/egress-sample created
We can verify the NAT that’s been put in place by running the cilium bpf egress list command from within one of our Cilium Pods:
% kubectl -n kube-system get pods -l k8s-app=cilium
NAME READY STATUS RESTARTS AGE
cilium-28gdd 1/1 Running 0 17h
cilium-n5wv7 1/1 Running 0 17h
cilium-vfvkd 1/1 Running 0 16h
% kubectl exec -it cilium-vfvkd -n kube-system -- cilium bpf egress list
SRC IP & DST CIDR EGRESS INFO
10.0.0.80 192.168.20.70/32 192.168.20.220 192.168.20.220
10.0.2.11 192.168.20.70/32 192.168.20.220 192.168.20.220
Now let’s run that curl command from our netshoot container again and observe what happens in NGINX:
bash-5.1# curl 192.168.20.70
it works
192.168.20.220 - - [07/Dec/2021:12:47:53 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"
There we can see that the source IP of the request is now the VIP assigned to our Traefik LoadBalancer. This works because, behind the scenes, kube-vip is handling the IP address assignment to a specific physical interface on our behalf. For example, I can ssh to this VIP and validate that it’s been assigned to eth0:
% ssh 192.168.20.220
Warning: Permanently added '192.168.20.240' (ED25519) to the list of known hosts.
nick@cilium0:~> ip -br a li eth0
eth0 UP 192.168.20.163/24 192.168.20.200/32 192.168.20.220/32
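To see at a glance which node currently holds which address, a quick loop over the nodes using the same govc and ssh approach as before does the trick (just a convenience sketch):

% for node in cilium{0..2} ; do ssh $(govc vm.ip $node) ip -br a li eth0 ; done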
As it turns out, it’s currently assigned to the first node in my cluster (which also happens to be the one holding the VIP for the control plane). Let’s see what happens when we shut this node down, and get some timings for how long the failover takes. My tmp-shell Pod is running on cilium2, so I’ll kick off my curl command in a loop running every second, shut cilium0 down, and keep an eye on NGINX:
bash-5.1# while true ; do curl 192.168.20.70 ; sleep 1; done
it works
it works
[..]
And from the NGINX logs, I can see it took about two minutes from the time I shut down cilium0 before it saw any further requests:
192.168.20.220 - - [18/Jan/2022:08:31:48 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"
192.168.20.220 - - [18/Jan/2022:08:34:11 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"
Not bad!