Nova and libvirt 6.x

This one’s a heads-up in case you’ve an OpenStack deployment that’s been around for a while and which hosts instances spawned prior to the Train release. Pets, if you will. You’re in for a bit of a shock if one of these instances is stopped and then started again - libvirt will refuse to create the domain, with a Nova backtrace along these lines:

2020-08-19 16:53:20.533 6 ERROR oslo_messaging.rpc.server libvirt.libvirtError: Requested operation is not valid: format of backing image '/var/lib/nova/instances/_base/c3395c4245b7573c83342d68a0d0ea675b7a1722' of image '/var/lib/nova/instances/947df0d3-5aab-456d-a200-63b055934a43/disk' was not specified in the image metadata (See https://libvirt.org/kbase/backing_chains.html for troubleshooting)

The error stems from a change introduced in libvirt 6.0: it now refuses to launch a domain if the disk’s backing store doesn’t have its format explicitly recorded in the image metadata.
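If you want to check which side of that change a given hypervisor is on, a rough sketch along these lines might help. The function name is made up for illustration, and it assumes you feed it a plain dotted version string such as 6.0.0, which is what virsh --version prints.

```shell
# Hypothetical helper: succeeds (exit 0) if the given libvirt version is
# 6.0 or newer, i.e. new enough to enforce the backing-format check.
# Exits 2 if the argument doesn't look like a version string.
libvirt_enforces_backing_format() (
    major=${1%%.*}                 # text before the first dot, e.g. "6"
    case "$major" in
        ''|*[!0-9]*) exit 2 ;;     # empty or non-numeric: not understood
    esac
    [ "$major" -ge 6 ]
)

# Example usage, assuming virsh is on the PATH wherever you run this:
#   libvirt_enforces_backing_format "$(virsh --version)" && echo "affected"
```

The body runs in a subshell (note the parentheses), so the scratch variable doesn’t leak into your session.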

The good news is that this was spotted - and fixed - in Nova in the Train release. There’s an associated bug on Launchpad along with additional background on the problem here. The bad news is that the fix only applies to instances created after it was in place. Older instances which were created without specifying the backing file format will fail with the above error once you’ve upgraded libvirt on your hypervisors.

I hit this when upgrading from Train to Ussuri via Kolla-Ansible. The Kolla-built Ubuntu-based Docker images for Ussuri include the newer version of libvirt, and so when an older VM was stopped (i.e. powered off) and then started again I saw this problem.

The fix is relatively straightforward and is mostly a case of following the documentation linked in the error message, with a few OpenStack-specific twists that are worth being aware of especially if you’re using Kolla. You should also take a backup (if possible) of the disk file as well as the backing file. In my case, the instance’s disks are hosted on the hypervisors themselves. You’ll need to adjust this process if you’re presenting block storage via some other means.
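As a minimal sketch of that backup step, something like the following would copy both files into a separate directory. The function name and the destination path in the usage line are placeholders of my choosing, and it assumes the instance is powered off so the disk isn’t changing underneath the copy.

```shell
# Hypothetical helper: copy the given files into a backup directory.
# Usage: backup_images DEST_DIR FILE...
# Intended for one instance at a time, since every Nova overlay is named
# "disk" and copies from different instances would overwrite each other.
backup_images() (
    dest=$1
    shift
    mkdir -p -- "$dest" || exit 1
    for image in "$@"; do
        # -p preserves mode, ownership and timestamps where permitted.
        cp -p -- "$image" "$dest/" || exit 1
    done
)

# Example usage (the destination is a placeholder):
#   backup_images /root/backup-947df0d3 \
#       /var/lib/nova/instances/947df0d3-5aab-456d-a200-63b055934a43/disk \
#       /var/lib/nova/instances/_base/c3395c4245b7573c83342d68a0d0ea675b7a1722
```

Writing the copies outside /var/lib/nova/instances keeps stray files out of the directories Nova manages.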

For a Kolla-based deployment, the commands need to be run in the nova_compute container so we can guarantee we’re using the right version of the QEMU tools and also to make our lives a bit easier when referring to disk file paths. If you docker exec straight into this container you’ll be dropped in as the nova user, which doesn’t have the permissions necessary to update the disk file, so instead we need to use nsenter:

root@compute2:~# PID=$(docker inspect --format '{{.State.Pid}}' nova_compute)
root@compute2:~# nsenter --target "$PID" --mount --uts --ipc --net --pid
()[root@compute2 /]#

Now we need to navigate to the folder hosting our instance’s disk. You’ll need the UUID of the instance to do that (947df0d3-5aab-456d-a200-63b055934a43 in my example), then we can use qemu-img info to find a bit more about it:

# cd /var/lib/nova/instances/947df0d3-5aab-456d-a200-63b055934a43
# qemu-img info disk
image: disk 
file format: qcow2
virtual size: 80 GiB (85899345920 bytes)
disk size: 21.3 GiB
cluster_size: 65536
backing file: /var/lib/nova/instances/_base/c3395c4245b7573c83342d68a0d0ea675b7a1722
Format specific information:
  compat: 1.1 
  lazy refcounts: false 
  refcount bits: 16 
  corrupt: false

In this case, we’re missing a field - backing file format. If we examine another instance booted using Ussuri, we can see that’s present:

image: disk
file format: qcow2
virtual size: 10 GiB (10737418240 bytes)
disk size: 291 MiB
cluster_size: 65536
backing file: /var/lib/nova/instances/_base/c3395c4245b7573c83342d68a0d0ea675b7a1722
backing file format: raw
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
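If you have more than a handful of older instances, it may be worth finding all the affected disks up front rather than waiting for them to fail. The following is only a sketch: both function names are invented for illustration, and it assumes the default Nova layout of one disk file per instance directory, as in the paths above.

```shell
# Hypothetical helper: reads `qemu-img info` text output on stdin and
# succeeds (exit 0) only if a backing file is listed but no backing
# file format is.
missing_backing_format() {
    awk '
        /^backing file:/        { has_backing = 1 }
        /^backing file format:/ { has_format = 1 }
        END { exit !(has_backing && !has_format) }
    '
}

# Hypothetical helper: print every <dir>/<uuid>/disk that is affected.
list_affected_disks() (
    for disk in "$1"/*/disk; do
        [ -f "$disk" ] || continue
        # -U (--force-share) lets us inspect disks of running instances.
        if qemu-img info -U "$disk" | missing_backing_format; then
            echo "$disk"
        fi
    done
)

# Example usage, from inside the nova_compute container:
#   list_affected_disks /var/lib/nova/instances
```

Nothing here modifies an image. It only reads the metadata and prints the paths that would need the fix.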

To fix the problem, we need to record the backing file’s format in the image with qemu-img rebase. We should validate the format of the backing file first, and then, armed with the right info, we can update the image:

# qemu-img info /var/lib/nova/instances/_base/c3395c4245b7573c83342d68a0d0ea675b7a1722
image: /var/lib/nova/instances/_base/c3395c4245b7573c83342d68a0d0ea675b7a1722
file format: raw
virtual size: 2.2 GiB (2361393152 bytes)
disk size: 1.02 GiB

# qemu-img rebase -f qcow2 -F raw \
-b /var/lib/nova/instances/_base/c3395c4245b7573c83342d68a0d0ea675b7a1722 \
/var/lib/nova/instances/947df0d3-5aab-456d-a200-63b055934a43/disk

# qemu-img info disk
image: disk 
file format: qcow2
virtual size: 80 GiB (85899345920 bytes)
disk size: 21.3 GiB
cluster_size: 65536
backing file: /var/lib/nova/instances/_base/c3395c4245b7573c83342d68a0d0ea675b7a1722
backing file format: raw
Format specific information:
  compat: 1.1 
  lazy refcounts: false 
  refcount bits: 16 
  corrupt: false

The second command updated the image, and the last command validated that we now see the backing file format specified correctly.
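If you have several instances to fix, the three steps above can be sketched as a small helper. Treat this as an illustration rather than a polished tool: the function names are hypothetical, it hardcodes qcow2 for the overlay as in my example, and it trusts qemu-img’s own detection of the backing file’s format in the same way the manual check does.

```shell
# Hypothetical helper: print the value of a "key: value" field from
# `qemu-img info` text output on stdin (first match only).
info_field() {
    awk -v key="$1" '
        index($0, key ": ") == 1 { print substr($0, length(key) + 3); exit }
    '
}

# Hypothetical helper: record the backing file format in an overlay.
# Assumes the overlay is qcow2 with an absolute backing file path, and
# that the instance is stopped.
fix_backing_format() (
    overlay=$1
    backing=$(qemu-img info "$overlay" | info_field "backing file")
    [ -n "$backing" ] || { echo "no backing file: $overlay" >&2; exit 1; }
    backing_fmt=$(qemu-img info "$backing" | info_field "file format")
    [ -n "$backing_fmt" ] || { echo "unknown format: $backing" >&2; exit 1; }
    qemu-img rebase -f qcow2 -F "$backing_fmt" -b "$backing" "$overlay"
)

# Example usage, from inside the nova_compute container:
#   fix_backing_format /var/lib/nova/instances/947df0d3-5aab-456d-a200-63b055934a43/disk
```

If you already know the backing format from your Nova configuration, it’s safer to pass it to -F yourself than to rely on detection.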

If you’ve done everything right then you should now be able to start the instance up again without Nova erroring.