RKE2 and NVIDIA GPUs with the GRID Operator

Similar to my earlier entry on enabling GPUs with K3s or RKE2, this post builds on that process for when you're working with the NVIDIA GRID Operator, and specifically with RKE2.

Being an Operator, it introduces a custom controller into your cluster which encapsulates the business logic required to automatically set up and provision GPU nodes. In fact, I believe it's your only option if you want to take advantage of features like vGPUs, which require a licensing server elsewhere on your network (the provisioning of which is outside the scope of this post).

Building the OS image

By default, the GRID Operator assumes that any node booted into the cluster with a GPU doesn't have the driver installed to initialise the hardware. Therefore, it first launches a custom image and, if necessary, builds the kernel module before loading it on the host. The steps for doing this are well documented here and work fine as long as you're using one of the operating systems covered in the linked repo.

In practice I found this process a little slow when it comes to initialising a new node, and cloud being the name of the game, you might find yourself scaling your cluster up and down to meet workload demands, so you don't want to add any extra, avoidable delays into the whole process. Instead, I decided to bake a custom OS image with the prerequisites already installed; there's then a flag you can set when you install the Helm chart to tell the Operator not to bother with driver installation. More on that in a minute.

The other problem is that K3s and RKE2 do a lot of the work of configuring the container runtime for us, as long as the NVIDIA container toolkit and the driver are already installed. This doesn’t work as expected if we’re initialising the GPU after we’ve already launched RKE2 on a node.

So, to bake the driver and the toolkit into the OS image, you want a step along these lines:

export DEBIAN_FRONTEND=noninteractive

# Add the NVIDIA container toolkit repository
# (note: $(ARCH) is expanded by apt itself, not the shell, so the single quotes are deliberate)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
echo 'deb https://nvidia.github.io/libnvidia-container/stable/ubuntu20.04/$(ARCH) /' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install build tooling (needed for the kernel module) and the container toolkit
sudo apt-get update
sudo apt-get install -y build-essential nvidia-container-toolkit

# Fetch the GRID driver from your own repository and run the installer silently
curl -s -o /tmp/NVIDIA-Linux-x86_64-510.47.03-grid.run http://your.repo/NVIDIA-Linux-x86_64-510.47.03-grid.run
sudo sh /tmp/NVIDIA-Linux-x86_64-510.47.03-grid.run -s

# Tidy up
rm -f /tmp/NVIDIA-Linux-x86_64-510.47.03-grid.run
rm -f /tmp/script.sh
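
As a quick sanity check, either late in the image build or on first boot of a node built from it, you can confirm the module is present and the driver can talk to the hardware. This only does anything useful on a machine with a GPU or vGPU actually attached, so it may not be meaningful on the build host itself:

# Check the kernel module is loaded and that the driver can see the GPU
lsmod | grep nvidia
nvidia-smi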

I use Packer to build custom OS images, and it's trivial to include shell snippets such as these as part of that process.

The other thing you need to do is include the configuration so that the node registers itself with your licensing server. If you're not pre-baking an image and are instead relying on the Operator to handle the driver, then this step is managed for you. However, if you're following along and pre-installing the driver yourself, then the licensing config needs baking into the OS image too, by including something like the following after the driver has been installed:

cat > /etc/nvidia/gridd.conf << EOF
ServerAddress=192.168.1.123
EOF
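
Once a node built from this image boots with a vGPU attached, it's worth checking that licensing has actually kicked in. A quick check along these lines does the trick (the exact output varies between driver versions):

# Restart the licensing daemon so it re-reads gridd.conf, then query the licence state
sudo systemctl restart nvidia-gridd
nvidia-smi -q | grep -i -A 1 'license'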

As an aside, if you don't include the licensing information then your GPU-resourced Pods will run fine for a brief while (<1 hour), but then all of a sudden performance will completely drop off a cliff. If you log into the node itself and check the logs for the gridd systemd unit, you'll see something along the lines of notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Restricted). This is because the driver drops the GPU into limp mode once the grace period for contacting the licensing server has expired. If you haven't come across this behaviour before, it can be very confusing to troubleshoot.
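
For reference, checking those logs is just a case of querying the unit (typically named nvidia-gridd) on the node:

# Tail the licensing daemon's logs on the affected node
journalctl -u nvidia-gridd --no-pager | tail -n 20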

Configuring RKE2

In this example, I’m also going to make the NVIDIA container runtime the default on our GPU nodes. Create a file called /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl with the following contents:

[plugins.opt]
  path = "/var/lib/rancher/rke2/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  sandbox_image = "index.docker.io/rancher/pause:3.6"

[plugins.cri.containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true
  default_runtime_name = "nvidia"

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

If you do this via cloud-init as part of node provisioning, then it’ll be automatically picked up by RKE2 when it first launches.
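
A quick way to confirm the template has been picked up is to check the rendered config on the node once RKE2 is running; something like:

# The rendered file (without the .tmpl suffix) should now contain the nvidia runtime stanza
grep -A 2 'nvidia' /var/lib/rancher/rke2/agent/etc/containerd/config.toml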

Installing the Operator

Now we can go ahead and install the Operator, with a few specific options to account for our pre-installed driver and to point it at the various bits that RKE2 ships with:

helm install gpu-operator \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=false \
     --set toolkit.enabled=false \
     --set driver.licensingConfig.configMapName=licensing-config \
     --set toolkit.env[0].name=CONTAINERD_CONFIG \
     --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml \
     --set toolkit.env[1].name=CONTAINERD_SOCKET \
     --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
     --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
     --set toolkit.env[2].value=nvidia \
     --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
     --set-string toolkit.env[3].value=true
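
Note that this assumes the NVIDIA Helm repository has already been added locally; if it hasn't, something along these lines (repo URL as per NVIDIA's published instructions) sorts that out first:

# Add NVIDIA's Helm repository and refresh the local chart index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update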

After a brief period, you should end up with something looking like this:

nick@deadline ~ % k get pods -n gpu-operator
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-cs2qx                                   1/1     Running     0          45s
gpu-operator-7bfc5f55-hmqhk                                   1/1     Running     0          60s
gpu-operator-node-feature-discovery-master-6598566c8c-7rlbc   1/1     Running     0          60s
gpu-operator-node-feature-discovery-worker-jv5q6              1/1     Running     0          60s
gpu-operator-node-feature-discovery-worker-qkm77              1/1     Running     0          60s
nvidia-cuda-validator-ctzzz                                   0/1     Completed   0          41s
nvidia-dcgm-exporter-rl4bj                                    1/1     Running     0          45s
nvidia-device-plugin-daemonset-wc5bm                          1/1     Running     0          45s
nvidia-device-plugin-validator-xvcp7                          0/1     Completed   0          29s
nvidia-operator-validator-zqj9t                               1/1     Running     0          46s

And of course nodes should now have the nvidia.com/gpu resource listed:

nick@deadline ~ % k get node worker-gpu-53d95674-787gx -o jsonpath="{.status.allocatable}" | jq
{
  "cpu": "6",
  "ephemeral-storage": "197546904422",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "32883428Ki",
  "nvidia.com/gpu": "1",
  "pods": "110"
}
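
To give everything a quick end-to-end test, you can schedule a Pod that requests a GPU. A minimal sketch follows; the sample image is just an example, so substitute whichever CUDA workload you prefer:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    # Example CUDA sample image; swap in your preferred test workload
    image: nvidia/samples:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once it has completed, the logs should report the vector addition test passing
kubectl logs cuda-vectoradd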

If anything fails, it's usually fairly clear from the Pod logs what went awry, in particular from the logs of the various 'validator' Pods.
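
If you need to dig further, describing the validator Pod and pulling logs from all of its containers (using the Pod name from the listing above) is usually the quickest route:

# Inspect the operator validator using the name from the earlier output
kubectl -n gpu-operator describe pod nvidia-operator-validator-zqj9t
kubectl -n gpu-operator logs nvidia-operator-validator-zqj9t --all-containers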