K3s / RKE2 and GPU PCI-passthrough

Here are some notes on getting GPUs working with K3s or RKE2. They’re pulled together from a few places to save folks the trouble I had, i.e. hunting through PRs and GitHub issues to work out what needs to be configured to make this work.

The first step is to get yourself a node with an NVIDIA GPU. In my case I have an RTX 3080 configured using PCI passthrough to a VM running Ubuntu 20.04:

nick@gpu0:~$ sudo lspci -v | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation Device 2206 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device 1467
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
06:00.0 Audio device: NVIDIA Corporation Device 1aef (rev a1)
	Subsystem: NVIDIA Corporation Device 1467

Node OS configuration

Within this VM, we need to install the NVIDIA drivers and also the container toolkit:

$ sudo apt -y install nvidia-driver-510
$ curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
$ echo 'deb https://nvidia.github.io/libnvidia-container/stable/ubuntu20.04/$(ARCH) /' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt update
$ sudo apt install -y nvidia-container-toolkit
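
Before moving on, it’s also worth a quick sanity check that the toolkit side made it in. For example (just a couple of illustrative checks, the exact output will vary):

$ which nvidia-container-runtime
$ nvidia-container-cli info

The first should point at /usr/bin/nvidia-container-runtime, which is the same path the containerd config will reference later on, and the second should print the driver version and the GPU it can see.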

With the necessary binary blobs installed, we can verify that the GPU is working at least as far as the host operating system is concerned by running nvidia-smi:

root@gpu0:~# nvidia-smi
Mon May 16 09:28:25 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
|  0%   28C    P8     3W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Looks good, but that’s the easy bit. Now let’s sort out Kubernetes!

Configuring Kubernetes

The first thing I want to draw your attention to is this PR. It landed in K3s in 1.22, so you need to be running at least that version. It saves the hassle of manually crafting a containerd config template: K3s generates the right runtime section automatically if it detects the NVIDIA drivers and container toolkit on the node. So all we have to do is install K3s with no special options, and when it starts you should see the following section in /var/lib/rancher/k3s/agent/etc/containerd/config.toml (if you’re using RKE2 then the path is /var/lib/rancher/rke2/agent/etc/containerd/config.toml):

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
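
For reference, there’s nothing GPU-specific about installing K3s itself on this node. Joining gpu0 to the cluster is just the usual agent one-liner; the URL and token below are placeholders for your own setup (the token lives in /var/lib/rancher/k3s/server/node-token on the server):

$ curl -sfL https://get.k3s.io | \
    INSTALL_K3S_CHANNEL="v1.23" \
    K3S_URL="https://control0:6443" \
    K3S_TOKEN="<contents of node-token>" sh -

Once the agent is up, a quick grep of the generated config confirms the nvidia runtime was picked up:

$ grep -A2 '"nvidia"' /var/lib/rancher/k3s/agent/etc/containerd/config.toml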

With everything started and your Kubernetes cluster up and running, it should look something like this:

$ kubectl get nodes
NAME       STATUS   ROLES                       AGE   VERSION
control0   Ready    control-plane,etcd,master   12d   v1.23.6+k3s1
gpu0       Ready    <none>                      86m   v1.23.6+k3s1
worker0    Ready    <none>                      61m   v1.23.6+k3s1

Runtime Classes

Although K3s (and RKE2) will have set up the container runtime (containerd) options for us automatically, what we now have on that node is two available runtimes: the default (runc, via io.containerd.runc.v2) and nvidia. When a container is scheduled on that node we need a way of saying which runtime should be used, and this is what Runtime Classes are for. So we need to create a RuntimeClass for the nvidia runtime:

$ kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: "nvidia"
EOF
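
A quick kubectl get should show it registered; the handler here has to match the runtime name K3s wrote into the containerd config earlier. Expect something along these lines:

$ kubectl get runtimeclass nvidia
NAME     HANDLER   AGE
nvidia   nvidia    10s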

Node Feature Discovery

The next thing to install, so that our nodes advertise the availability of a GPU resource to the scheduler, is the NVIDIA device plugin. By default its DaemonSet will target every node in the cluster, and in my case not every node has a GPU. So we need a way of selecting the right nodes, and for this we’ll make use of a handy project called Node Feature Discovery. It’s an add-on that detects the hardware in each node and adds labels describing it, which we can then use to target the nodes with a GPU. It’s available as a Helm chart, or we can install it directly via kubectl:

$ kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.0"
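
The default overlay deploys everything into its own namespace (node-feature-discovery, at least in v0.11.0), so a quick check that the master and per-node workers came up looks something like:

$ kubectl -n node-feature-discovery get pods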

Once deployed, you’ll notice a bunch of extra labels like:

$ kubectl describe node gpu0 | grep -i pci
    feature.node.kubernetes.io/pci-0300_10de.present=true
    feature.node.kubernetes.io/pci-0300_1b36.present=true

Looking on the host itself, the label above encodes the PCI class (0300) and vendor ID (10de) of the NVIDIA GPU:

root@gpu0:~# lspci -nn | grep -i nvidia
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2206] (rev a1)
06:00.0 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)

So we can use a nodeSelector to target nodes with this particular label.
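
As a quick check that the label does what we want, we can ask for nodes carrying it - only gpu0 should come back:

$ kubectl get nodes -l feature.node.kubernetes.io/pci-0300_10de.present=true
NAME   STATUS   ROLES    AGE   VERSION
gpu0   Ready    <none>   86m   v1.23.6+k3s1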

NVIDIA Device Plugin

Now, the Helm chart provided by NVIDIA for the device plugin is missing a couple of things that we’ll need to make sure the workload lands in the right place and with the right options:

  • It won’t have the runtimeClassName set;
  • It won’t have a nodeSelector targeting our GPU nodes.

So instead, we’ll template out the manifests ourselves and add those options in.

$ helm repo add nvidia-device-plugin https://nvidia.github.io/k8s-device-plugin
$ helm template \
    nvidia-device-plugin \
    --version=0.11.0 \
    --set runtimeClassName=nvidia \
    nvidia-device-plugin/nvidia-device-plugin > ~/nvidia-device-plugin.yml

Then edit nvidia-device-plugin.yml and add those two things to the DaemonSet’s Pod template spec. It should end up looking something like this:

spec:
  [..]
  template:
    [..]
    spec:
      [..]
      nodeSelector:
        feature.node.kubernetes.io/pci-0300_10de.present: "true"
      runtimeClassName: nvidia

With those changes made we can instantiate those resources in our cluster:

$ kubectl apply -f ~/nvidia-device-plugin.yml
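
To find the resulting Pods and pull their logs, something like the following does the job (the exact names and namespace depend on how you templated and applied the chart):

$ kubectl get pods -A -o wide | grep nvidia-device-plugin
$ kubectl logs -n <namespace> <nvidia-device-plugin-pod>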

If everything’s lined up and goes according to plan, you should see the right number of Pods being scheduled as part of the DaemonSet, and there should be some useful bits of information in the logs:

2022/05/30 13:51:50 Loading NVML
2022/05/30 13:51:50 Starting FS watcher.
2022/05/30 13:51:50 Starting OS watcher.
2022/05/30 13:51:50 Retreiving plugins.
2022/05/30 13:51:50 Starting GRPC server for 'nvidia.com/gpu'
2022/05/30 13:51:50 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/05/30 13:51:50 Registered device plugin for 'nvidia.com/gpu' with kubelet

And examining the node itself should show that we do in fact have an nvidia.com/gpu resource available:

$ kubectl get node gpu0 -o jsonpath="{.status.allocatable}" | jq
{
  "cpu": "8",
  "ephemeral-storage": "99891578802",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "32882948Ki",
  "nvidia.com/gpu": "1",
  "pods": "110"
}

Testing

To test, we can use a CUDA vector-add example image, which should get scheduled to our GPU node if nothing’s amiss. Save the following as gputest.yaml, and note the runtimeClassName added to the Pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: compute,utility
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1

$ kubectl apply -f gputest.yaml
pod/cuda-vector-add created
$ kubectl get pod cuda-vector-add -o jsonpath="{.spec.nodeName}"
gpu0
$ kubectl get pod cuda-vector-add -o jsonpath="{.status.phase}"
Running
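
The container runs a short vector-add and then exits, so the real confirmation is in its logs - they should end with a line reporting that the test passed:

$ kubectl logs cuda-vector-add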

Looks good!