virt: handle capacity=0 in getNextVolumeSize #426
dupondje wants to merge 1 commit into oVirt:master
Conversation
We saw a case where domstats returned capacity=0 just after taking a snapshot. This causes an ImprobableAllocationError to be triggered, permanently keeping the VM in a hanging paused state. As writes can't be flushed to disk, the qcow2 image can also become corrupted when the VM is stopped. So we add a fallback for capacity=0 so that the disk is at least resized and we don't end up with a broken VM. Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
@nirs: Might be useful to know if you've seen any other such cases? As I can't simulate it at the libvirt level, it's hard to find the cause in libvirt. But as the issue has quite some impact, I think it's better to have some failsafe in the code.
| """ | ||
| nextSize = utils.round(curSize + self.volExtensionChunk, MiB) | ||
| if capacity == 0: | ||
| return nextSize |
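For context, a minimal sketch of the patched helper as proposed here, assuming a `utils.round`-style rounding helper and a 1 GiB extension chunk (everything outside the quoted diff lines is an assumption, not vdsm's actual code):

```python
MiB = 1024 * 1024

def round_up(size, unit):
    # Round size up to the next multiple of unit (vdsm's utils.round equivalent).
    return ((size + unit - 1) // unit) * unit

def get_next_volume_size(cur_size, capacity, vol_extension_chunk=1024 * MiB):
    """Return the next allocation size for a thin-provisioned volume.

    The fallback proposed in this PR: if libvirt reported capacity=0,
    still extend by one chunk instead of raising, so the VM is not left
    permanently paused.
    """
    next_size = round_up(cur_size + vol_extension_chunk, MiB)
    if capacity == 0:
        return next_size
    # Normal path: never allocate beyond the volume's virtual capacity.
    return min(next_size, round_up(capacity, MiB))
```

With the values from the logs below (apparentsize 2684354560, capacity reported as 0), this returns one chunk more instead of raising.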
This may cause a bad allocation to a size that the VM can never use. If this state continues, we may keep extending the disk until all storage space is consumed. So this is not a valid workaround.
We need to understand why capacity is zero, and find another way to get the capacity, so we never allocate more than we should.
If we need a workaround, it should be at another level. Why do we try to extend? Is this caused by libvirt returning a wrong capacity?
When the issue happens, does it continue or is it a temporary condition? If we skip the extend when libvirt fails to report the capacity, the VM will pause, and at that point we will try to extend again and succeed if libvirt now reports the right value.
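The skip-and-retry alternative suggested here could be sketched roughly as follows; the function and the 90% high-watermark threshold are hypothetical illustrations, not vdsm's actual thinp API:

```python
import logging

log = logging.getLogger("thinp-sketch")

def maybe_extend(drive_name, allocation, capacity, physical, extend_fn):
    """Decide whether to extend a drive, skipping when capacity is unknown.

    Rather than raising ImprobableAllocationError on capacity=0, skip this
    monitoring cycle; if the VM later pauses on ENOSPC, the next cycle
    retries and succeeds once libvirt reports a real capacity.
    """
    if capacity == 0:
        log.warning("Drive %s: libvirt reported capacity=0, skipping extend "
                    "until a valid capacity is available", drive_name)
        return False
    if allocation > capacity:
        # Genuinely impossible state: more data written than the virtual size.
        raise RuntimeError(
            "Improbable allocation for %s: allocation=%d > capacity=%d"
            % (drive_name, allocation, capacity))
    if allocation >= physical * 0.9:  # assumed high-watermark threshold
        extend_fn(drive_name)
        return True
    return False
```

The point of the sketch is only the first branch: a zero capacity is treated as "no data this cycle" rather than as an improbable allocation.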
Thanks for the quick response, it's always good to have somebody with more history in the code take a look :)
So what happened: we made a backup of the VM, and a snapshot was created.
2025-04-01 04:13:45,885+0200 INFO (jsonrpc/4) [api.virt] START snapshot(snapDrives=[{'imageID': 'c990286f-a6cc-451e-bd75-e9463f7514c9', 'baseVolumeID': '744952ac-0b5d-4c20-b72a-1f6428cd99f2', 'volumeID': '88fe8d06-be9c-46a6-9223-59206fc97b63', 'domainID': 'e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362'}, {'imageID': '65a631ba-f7c5-4be1-88ce-16f59c2c8640', 'baseVolumeID': 'efbd8056-720a-47b6-bede-4325d576ffb9', 'volumeID': 'c92414d9-307a-4e2a-be53-dade8ec6cdc9', 'domainID': 'e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362'}], snapMemory=None, frozen=False, jobUUID='bb511d9b-9fbe-40bc-a64b-a637279d690a', timeout=30, freeze_timeout=8) from=::ffff:x,54882, flow_id=00cc9982-075f-403f-87a0-1f4cab61f08c, vmId=eb9b0a1c-7659-4482-980f-7e5b9ac9336a (api:31)
2025-04-01 04:13:52,237+0200 INFO (virt/bb511d9b) [virt.vm] (vmId='eb9b0a1c-7659-4482-980f-7e5b9ac9336a') Clearing block threshold for drive 'sda[2]' (thinp:172)
2025-04-01 04:13:52,238+0200 INFO (virt/bb511d9b) [vdsm.api] START getVolumeSize(sdUUID='e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362', spUUID='997839a4-c790-4dc4-9c5f-8a3de81a4794', imgUUID='c990286f-a6cc-451e-bd75-e9463f7514c9', volUUID='88fe8d06-be9c-46a6-9223-59206fc97b63') from=internal, task_id=a87b3e6d-8312-4e74-b4a8-2a0e97b579ec (api:31)
2025-04-01 04:13:52,238+0200 INFO (virt/bb511d9b) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '2684354560', 'truesize': '2684354560'} from=internal, task_id=a87b3e6d-8312-4e74-b4a8-2a0e97b579ec (api:37)
But then something goes wrong:
2025-04-01 04:13:54,065+0200 INFO (periodic/2) [virt.vm] (vmId='eb9b0a1c-7659-4482-980f-7e5b9ac9336a') CPU stopped: pause (vm:6071)
2025-04-01 04:13:54,066+0200 ERROR (periodic/2) [Executor] Unhandled exception in <Task discardable <VolumeWatermarkMonitor vm=eb9b0a1c-7659-4482-980f-7e5b9ac9336a at 0x7fd7cafc0d30> timeout=1.0, duration=0.07 at 0x7fd7cafe7cd0> (executor:300)
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/vdsm/executor.py", line 298, in _execute_task
task()
File "/usr/lib/python3.9/site-packages/vdsm/executor.py", line 374, in __call__
self._callable()
File "/usr/lib/python3.9/site-packages/vdsm/virt/periodic.py", line 314, in __call__
self._execute()
File "/usr/lib/python3.9/site-packages/vdsm/virt/periodic.py", line 382, in _execute
self._vm.volume_monitor.monitor_volumes()
File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 278, in monitor_volumes
self._extend_drive_if_needed(drive)
File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 310, in _extend_drive_if_needed
self._handle_unset(drive)
File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 325, in _handle_unset
if self._drive_needs_extend(drive):
File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 404, in _drive_needs_extend
raise ImprobableAllocationError(
vdsm.virt.thinp.ImprobableAllocationError: Improbable allocation, pausing the VM to avoid corruption: BlockInfo(index=4, name='sda', path='/rhev/data-center/mnt/blockSD/e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362/images/c990286f-a6cc-451e-bd75-e9463f7514c9/88fe8d06-be9c-46a6-9223-59206fc97b63', allocation=328192, capacity=0, physical=2684354560, threshold=0)
2025-04-01 04:13:54,164+0200 INFO (libvirt/events) [virt.vm] (vmId='eb9b0a1c-7659-4482-980f-7e5b9ac9336a') CPU stopped: onSuspend (vm:6071)
As you can see, libvirt seems to return capacity=0, causing the extend to fail. So the snapshot image could not be extended, and the VM went into a paused state.
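From the BlockInfo in the traceback, the check trips because the allocation (328192) exceeds the reported capacity (0); a rough reconstruction of the guard, using the values from the log (the exact vdsm condition may differ):

```python
from collections import namedtuple

BlockInfo = namedtuple(
    "BlockInfo", "index name path allocation capacity physical threshold")

class ImprobableAllocationError(Exception):
    pass

def check_allocation(block):
    # An allocation larger than the reported virtual capacity is impossible
    # for a healthy image, so the VM is paused to avoid corruption.
    if block.allocation > block.capacity:
        raise ImprobableAllocationError(
            "Improbable allocation, pausing the VM to avoid corruption: "
            "%r" % (block,))

# Values from the traceback above: with capacity=0, any non-zero
# allocation looks improbable, even though the image itself is fine.
sample = BlockInfo(4, "sda", "/rhev/...", 328192, 0, 2684354560, 0)
```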
But this was not a one-time error; it kept going:
2025-04-01 04:14:22,011+0200 INFO (periodic/0) [virt.vm] (vmId='eb9b0a1c-7659-4482-980f-7e5b9ac9336a') CPU stopped: pause (vm:6071)
2025-04-01 04:14:22,012+0200 ERROR (periodic/0) [Executor] Unhandled exception in <Task discardable <VolumeWatermarkMonitor vm=eb9b0a1c-7659-4482-980f-7e5b9ac9336a at 0x7fd7caf43160> timeout=1.0, duration=0.01 at 0x7fd7c87d1400> (executor:300)
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/vdsm/executor.py", line 298, in _execute_task
task()
File "/usr/lib/python3.9/site-packages/vdsm/executor.py", line 374, in __call__
self._callable()
File "/usr/lib/python3.9/site-packages/vdsm/virt/periodic.py", line 314, in __call__
self._execute()
File "/usr/lib/python3.9/site-packages/vdsm/virt/periodic.py", line 382, in _execute
self._vm.volume_monitor.monitor_volumes()
File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 278, in monitor_volumes
self._extend_drive_if_needed(drive)
File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 310, in _extend_drive_if_needed
self._handle_unset(drive)
File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 325, in _handle_unset
if self._drive_needs_extend(drive):
File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 404, in _drive_needs_extend
raise ImprobableAllocationError(
vdsm.virt.thinp.ImprobableAllocationError: Improbable allocation, pausing the VM to avoid corruption: BlockInfo(index=4, name='sda', path='/rhev/data-center/mnt/blockSD/e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362/images/c990286f-a6cc-451e-bd75-e9463f7514c9/88fe8d06-be9c-46a6-9223-59206fc97b63', allocation=328192, capacity=0, physical=2684354560, threshold=0)
This continued until the VM was manually stopped (corrupting the qcow2 file, I guess because it couldn't flush to disk).
Maybe some additional useful info about the image:
2025-04-01 04:14:04,505+0200 INFO (jsonrpc/0) [vdsm.api] START getQemuImageInfo(sdUUID='e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362', spUUID='997839a4-c790-4dc4-9c5f-8a3de81a4794', imgUUID='c990286f-a6cc-451e-bd75-e9463f7514c9', volUUID='88fe8d06-be9c-46a6-9223-59206fc97b63') from=::ffff:x,54882, flow_id=00cc9982-075f-403f-87a0-1f4cab61f08c, task_id=6218e687-c0f6-4710-9608-36a813e0db35 (api:31)
2025-04-01 04:14:04,528+0200 INFO (jsonrpc/0) [vdsm.api] FINISH getQemuImageInfo return={'info': {'virtualsize': 1920383410176, 'format': 'qcow2', 'clustersize': 65536, 'backingfile': '744952ac-0b5d-4c20-b72a-1f6428cd99f2', 'compat': '1.1', 'bitmaps': [Qcow2BitmapInfo(name='96276f29-8014-437b-8f0e-035866e0afb7', granularity=65536, flags=['in-use', 'auto'])]}} from=::ffff:x,54882, flow_id=00cc9982-075f-403f-87a0-1f4cab61f08c, task_id=6218e687-c0f6-4710-9608-36a813e0db35 (api:37)
2025-04-01 04:17:23,439+0200 INFO (jsonrpc/1) [vdsm.api] START getVolumeInfo(sdUUID='e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362', spUUID='997839a4-c790-4dc4-9c5f-8a3de81a4794', imgUUID='c990286f-a6cc-451e-bd75-e9463f7514c9', volUUID='88fe8d06-be9c-46a6-9223-59206fc97b63') from=::ffff:x,54882, flow_id=00cc9982-075f-403f-87a0-1f4cab61f08c, task_id=565da4db-c8d3-4e50-9596-dd42c7867b6f (api:31)
2025-04-01 04:17:23,520+0200 INFO (jsonrpc/1) [storage.volumemanifest] Info request: sdUUID=e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362 imgUUID=c990286f-a6cc-451e-bd75-e9463f7514c9 volUUID = 88fe8d06-be9c-46a6-9223-59206fc97b63 (volume:228)
2025-04-01 04:17:23,536+0200 INFO (jsonrpc/1) [storage.volumemanifest] e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362/c990286f-a6cc-451e-bd75-e9463f7514c9/88fe8d06-be9c-46a6-9223-59206fc97b63 info is {'uuid': '88fe8d06-be9c-46a6-9223-59206fc97b63', 'type': 'SPARSE', 'format': 'COW', 'disktype': 'DATA', 'voltype': 'LEAF', 'capacity': '1920383410176', 'parent': '744952ac-0b5d-4c20-b72a-1f6428cd99f2', 'description': '', 'pool': '', 'domain': 'e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362', 'image': 'c990286f-a6cc-451e-bd75-e9463f7514c9', 'ctime': '1743473609', 'mtime': '0', 'legality': 'LEGAL', 'generation': 0, 'sequence': 2, 'apparentsize': '2684354560', 'truesize': '2684354560', 'status': 'OK', 'lease': {'path': '/dev/e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362/leases', 'offset': 113246208, 'owners': [], 'version': None}, 'children': []} (volume:267)
2025-04-01 04:17:23,536+0200 INFO (jsonrpc/1) [vdsm.api] FINISH getVolumeInfo return={'info': {'uuid': '88fe8d06-be9c-46a6-9223-59206fc97b63', 'type': 'SPARSE', 'format': 'COW', 'disktype': 'DATA', 'voltype': 'LEAF', 'capacity': '1920383410176', 'parent': '744952ac-0b5d-4c20-b72a-1f6428cd99f2', 'description': '', 'pool': '', 'domain': 'e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362', 'image': 'c990286f-a6cc-451e-bd75-e9463f7514c9', 'ctime': '1743473609', 'mtime': '0', 'legality': 'LEGAL', 'generation': 0, 'sequence': 2, 'apparentsize': '2684354560', 'truesize': '2684354560', 'status': 'OK', 'lease': {'path': '/dev/e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362/leases', 'offset': 113246208, 'owners': [], 'version': None}, 'children': []}} from=::ffff:x,54882, flow_id=00cc9982-075f-403f-87a0-1f4cab61f08c, task_id=565da4db-c8d3-4e50-9596-dd42c7867b6f (api:37)
I've noticed in the code that the extension limits were put in place because libvirt/qemu sometimes sent invalid info and kept on extending.
But is that issue still valid? Or might it have been fixed in qemu/libvirt, making the check useless?
Looks to be caused by corruption in the qcow2 image (caused by a bug? https://gitlab.com/qemu-project/qemu/-/issues/2912)