mediated device pass-through: fix race condition on VM reboot

When rebooting a VM from PVE (via CLI/API), the reboot code is called under a guest lock, which creates a reboot request, shuts down the VM and then calls the regular cleanup code, which includes the mdev cleanup. In parallel, the qmeventd observes that the VM process has gone, and starts 'qm cleanup' which is (among other tasks) also starts the VM again if a reboot from the PVE side is pending. The qmeventd synchronizes this through a lock on the guest, with a default timeout of 10 seconds. Since we currently also always wait 10 seconds for the NVIDIA driver to clean up the mdev, this creates a race condition for the cleanup lock. IOW., when the call to `qm cleanup` starts before we started to sleep for 10 seconds, it will not be able to acquire its lock and not start the vm again. To avoid the race condition in practice, do two things: * increase the timeout in `qm cleanup` to 60 seconds. Technically this still might run into a timeout, as we can configure up to 16 mediated devices with each delaying 10 seconds in the worst case, but realistically most users won't configure more than two or three of them, if even that. * change the hard-coded `sleep 10` to a loop sleeping for 1 second each before checking the state again. This shortens the timeout when the NVIDIA driver did not require the full 10s to finish the clean-up. Further, add a bit of logging, so one can properly see in the task log what is happening at which point in time. Fixes: 49c51a60 (pci: workaround nvidia driver issue on mdev cleanup) Signed-off-by: Dominik Csapak <d.csapak@proxmox.com> Reviewed-by: Mira Limbeck <m.limbeck@proxmox.com> [ TL: change warn to print, reword commit message ] Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2025-01-07 17:17:57 +03:00 · 2024-03-07 10:33:37 +01:00 · 2024-03-07 10:33:37 +01:00 · a672c578e0
commit a672c578e0
parent 04736ecbd5
2 changed files with 14 additions and 5 deletions
--- a/PVE/CLI/qm.pm
+++ b/PVE/CLI/qm.pm
@ -915,7 +915,8 @@ __PACKAGE__->register_method({
 	my $storecfg = PVE::Storage::config();
 	warn "Starting cleanup for $vmid\n";

-	PVE::QemuConfig->lock_config($vmid, sub {
+	# mdev cleanup can take a while, so wait up to 60 seconds
+	PVE::QemuConfig->lock_config_full($vmid, 60, sub {
 	    my $conf = PVE::QemuConfig->load_config ($vmid);
 	    my $pid = PVE::QemuServer::check_running ($vmid);
 	    die "vm still running\n" if $pid;
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@ -6133,12 +6133,20 @@ sub cleanup_pci_devices {
 	    my $dev_sysfs_dir = "/sys/bus/mdev/devices/$uuid";

 	    # some nvidia vgpu driver versions want to clean the mdevs up themselves, and error
-	    # out when we do it first. so wait for 10 seconds and then try it
-	    if ($d->{ids}->[0]->[0]->{vendor} =~ m/^(0x)?10de$/) {
-		sleep 10;
+	    # out when we do it first. so wait for up to 10 seconds and then try it manually
+	    if ($d->{ids}->[0]->[0]->{vendor} =~ m/^(0x)?10de$/ && -e $dev_sysfs_dir) {
+		my $count = 0;
+		while (-e $dev_sysfs_dir && $count < 10) {
+		    sleep 1;
+		    $count++;
+		}
+		print "waited $count seconds for mediated device driver finishing clean up\n";
 	    }

-	    PVE::SysFSTools::file_write("$dev_sysfs_dir/remove", "1") if -e $dev_sysfs_dir;
+	    if (-e $dev_sysfs_dir) {
+		print "actively clean up mediated device with UUID $uuid\n";
+		PVE::SysFSTools::file_write("$dev_sysfs_dir/remove", "1");
+	    }
 	}
    }
    PVE::QemuServer::PCI::remove_pci_reservation($vmid);