K3s / RKE2 and GPU PCI-passthrough
16 May, 2022
Here are some notes on getting GPUs working with K3s or RKE2. It's pulled together from a few places to save folks the same trouble I had, i.e. hunting through PRs and issues on GitHub to work out what needs to be configured in order for this to work.
The first step is to get yourself a node with an NVIDIA GPU. In my case I've a 3080 configured using PCI passthrough to a VM running Ubuntu 20.04:
nick@gpu0:~$ sudo lspci -v | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation Device 2206 (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1467
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
06:00.0 Audio device: NVIDIA Corporation Device 1aef (rev a1)
Subsystem: NVIDIA Corporation Device 1467
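As an aside, if the passthrough itself ever needs checking, then on a typical KVM/vfio hypervisor (an assumption - your virtualisation stack may differ) the GPU should show up on the host bound to vfio-pci rather than the nvidia driver. The PCI address below is a placeholder; substitute the GPU's address on your hypervisor:
# Run on the hypervisor, not inside the VM. 0a:00.0 is a placeholder host PCI address.
$ lspci -nnk -s 0a:00.0 | grep -i 'kernel driver in use'
	Kernel driver in use: vfio-pci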
Node OS configuration
Within this VM, we need to install the NVIDIA drivers and also the container toolkit:
$ sudo apt -y install nvidia-driver-510
$ curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
$ echo 'deb https://nvidia.github.io/libnvidia-container/stable/ubuntu20.04/$(ARCH) /' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt update
$ sudo apt install -y nvidia-container-toolkit
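Since the containerd configuration we'll generate later points at /usr/bin/nvidia-container-runtime, it's worth checking that binary actually exists before installing K3s. Depending on the toolkit version it may live in a separate nvidia-container-runtime package (an assumption - check NVIDIA's packaging docs if it's missing):
$ command -v nvidia-container-runtime
/usr/bin/nvidia-container-runtime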
With the necessary binary blobs installed, we can verify that the GPU is working, at least as far as the host operating system is concerned, by running nvidia-smi:
root@gpu0:~# nvidia-smi
Mon May 16 09:28:25 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:05:00.0 Off | N/A |
| 0% 28C P8 3W / 320W | 0MiB / 10240MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Looks good, but that’s the easy bit. Now let’s sort out Kubernetes!
Configuring Kubernetes
The first thing I want to draw your attention to is this PR. It landed in K3s in 1.22, so you need to be installing this version at the very least. This saves the hassle of having to manually craft a containerd config template - it automatically generates the right section if it detects the presence of the drivers and the toolkit. So all we have to do is install K3s with no special options, and when K3s starts you should see the following section in /var/lib/rancher/k3s/agent/etc/containerd/config.toml (if you're using RKE2 then the path is /var/lib/rancher/rke2/agent/etc/containerd/config.toml):
[plugins.cri.containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime"
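For completeness, there's nothing GPU-specific about joining the node itself - a stock K3s agent install is all that's needed. The server URL and token below are placeholders for your own cluster's values:
# Standard K3s agent install - no extra flags are needed for the GPU runtime to be detected.
$ curl -sfL https://get.k3s.io | K3S_URL=https://control0:6443 K3S_TOKEN=<node-token> sh -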
With everything started and your Kubernetes cluster up and running, it should look something like this:
NAME       STATUS   ROLES                       AGE   VERSION
control0   Ready    control-plane,etcd,master   12d   v1.23.6+k3s1
gpu0       Ready    <none>                      86m   v1.23.6+k3s1
worker0    Ready    <none>                      61m   v1.23.6+k3s1
Runtime Classes
Although K3s (and RKE2) will have set up the container runtime (containerd) options for us automatically, in practice what we have on that node now is two available runtimes - the default (runc, via io.containerd.runc.v2) and also nvidia. When a container is scheduled on that node, we need a way of telling the scheduler which runtime should be used - this is what Runtime Classes are for. So, we need to create a RuntimeClass for the nvidia runtime:
$ kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: "nvidia"
EOF
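A quick get confirms the RuntimeClass is in place:
$ kubectl get runtimeclass nvidia
NAME     HANDLER   AGE
nvidia   nvidia    5s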
Node Feature Discovery
The next thing we need to install, to update our nodes and advertise the availability of a GPU resource to the scheduler, is the NVIDIA device plugin. By default this will attempt to create a DaemonSet on all nodes in the cluster, and in my case not every node has a GPU. So what we actually need is a way of selecting the right nodes, and for this we'll make use of a handy project called Node Feature Discovery. It's an add-on which detects the hardware present in each node and adds labels to the node describing it. We can then use these labels to target the nodes with a GPU. It's available as a Helm chart, or we can install it directly via kubectl:
$ kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.0"
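Before looking for labels, it's worth making sure the NFD pods have come up cleanly - the namespace below assumes the default overlay:
$ kubectl -n node-feature-discovery get pods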
Once deployed, you’ll notice a bunch of extra labels like:
$ kubectl describe node gpu0 | grep -i pci
feature.node.kubernetes.io/pci-0300_10de.present=true
feature.node.kubernetes.io/pci-0300_1b36.present=true
Looking on our host, the NVIDIA GPU corresponds with the PCI ID in the label above:
root@gpu0:~# lspci -nn | grep -i nvidia
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2206] (rev a1)
06:00.0 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
So we can use a nodeSelector to target nodes with this particular label.
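As a quick sanity check, listing nodes by that label should return just the GPU node:
$ kubectl get nodes -l feature.node.kubernetes.io/pci-0300_10de.present=true
NAME   STATUS   ROLES    AGE   VERSION
gpu0   Ready    <none>   87m   v1.23.6+k3s1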
NVIDIA Device Plugin
Now, the Helm chart provided by NVIDIA for the device plugin is missing a couple of things that we'll need to make sure the workload lands in the right place and with the right options:
- It won't have the runtimeClassName set;
- It won't have a nodeSelector for our GPU.
So instead, we’ll need to grab and template out the manifests and add in our options.
$ helm repo add nvidia-device-plugin https://nvidia.github.io/k8s-device-plugin
$ helm template \
nvidia-device-plugin \
--version=0.11.0 \
--set runtimeClassName=nvidia \
nvidia-device-plugin/nvidia-device-plugin > ~/nvidia-device-plugin.yml
Then edit nvidia-device-plugin.yml and add those two things to the template spec section. It should look something like this:
spec:
  [..]
  template:
    [..]
    spec:
      [..]
      nodeSelector:
        feature.node.kubernetes.io/pci-0300_10de.present: "true"
      runtimeClassName: nvidia
With those changes made we can instantiate those resources in our cluster:
$ kubectl apply -f ~/nvidia-device-plugin.yml
If everything’s lined up and goes according to plan, you should see the right number of Pods being scheduled as part of the DaemonSet, and there should be some useful bits of information in the logs:
2022/05/30 13:51:50 Loading NVML
2022/05/30 13:51:50 Starting FS watcher.
2022/05/30 13:51:50 Starting OS watcher.
2022/05/30 13:51:50 Retreiving plugins.
2022/05/30 13:51:50 Starting GRPC server for 'nvidia.com/gpu'
2022/05/30 13:51:50 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/05/30 13:51:50 Registered device plugin for 'nvidia.com/gpu' with kubelet
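You can also check that the DaemonSet Pods only landed on the GPU node - the label selector below is an assumption and may differ between chart versions:
$ kubectl get pods -A -o wide -l app.kubernetes.io/name=nvidia-device-plugin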
And examining the node itself should show that we do in fact have a nvidia.com/gpu resource available:
$ kubectl get node gpu0 -o jsonpath="{.status.allocatable}" | jq
{
  "cpu": "8",
  "ephemeral-storage": "99891578802",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "32882948Ki",
  "nvidia.com/gpu": "1",
  "pods": "110"
}
Testing
To test we can use a CUDA vector add example image, which should get scheduled to our GPU node if nothing's amiss. Note the runtimeClassName added to the Pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
$ kubectl apply -f gputest.yaml
pod/cuda-vector-add created
$ kubectl get pod cuda-vector-add -o jsonpath="{.spec.nodeName}"
gpu0
$ kubectl get pod cuda-vector-add -o jsonpath="{.status.phase}"
Running
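Finally, the Pod's logs should confirm the vector addition actually ran on the GPU - the test image prints something along these lines on success:
$ kubectl logs cuda-vector-add | tail -n 2
Test PASSED
Done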
Looks good!