
virt: handle capacity=0 in getNextVolumeSize#426

Closed
dupondje wants to merge 1 commit intooVirt:masterfrom
teamdotblue:fix_capacity_zero

Conversation

Member

@dupondje dupondje commented Apr 1, 2025

We saw a case where domstats returned capacity=0 just after taking a snapshot.
This causes an ImprobableAllocationError to be raised, which leaves the VM permanently stuck in a paused state.

Because writes can't be flushed to disk, the qcow2 image can also become corrupted when the VM is stopped.

So we add a fallback for capacity=0 so the disk is at least resized and we don't end up with a broken VM.


Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
Member Author

dupondje commented Apr 1, 2025

@nirs: Might be useful to know if you've seen any other such cases? As I can't simulate it at the libvirt level, it's hard to find the cause in libvirt. But as the issue has quite some impact, I think it's better to have a failsafe in the code.

nextSize = utils.round(curSize + self.volExtensionChunk, MiB)
if capacity == 0:
    return nextSize
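The proposed fallback can be sketched standalone. This is a hypothetical, simplified model: round_up stands in for vdsm's utils.round, and the default chunk size and the clamping against capacity on the normal path are assumptions for illustration, not vdsm's exact logic.

```python
MiB = 1024 ** 2

def round_up(n, size):
    # Stand-in for vdsm's utils.round: round n up to a multiple of size.
    return ((n + size - 1) // size) * size

def next_volume_size(cur_size, capacity, vol_extension_chunk=1024 * MiB):
    next_size = round_up(cur_size + vol_extension_chunk, MiB)
    if capacity == 0:
        # Fallback from this patch: libvirt reported no capacity (e.g.
        # right after a snapshot), so extend by one chunk instead of
        # raising and leaving the VM paused.
        return next_size
    # Normal path (assumed here): never extend beyond the virtual size.
    return min(next_size, round_up(capacity, MiB))

print(next_volume_size(2684354560, 0))              # one chunk past current size
print(next_volume_size(2684354560, 3 * 1024 ** 3))  # clamped to capacity
```

With capacity=0 the function degrades to plain chunked growth, which is exactly the trade-off the review below pushes back on.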
Member

This may cause a bad allocation to a size that the VM can never use. If this state continues, we may keep extending the disk until all storage space is consumed. So this is not a valid workaround.

We need to understand why capacity is zero, and find another way to get the capacity so we never allocate more than we should.

If we need a workaround it should be at another level. Why do we try to extend? Is this caused by libvirt returning the wrong capacity?

When the issue happens, does it persist or is it a temporary condition? If we skip the extend when libvirt fails to report the capacity, the VM will pause, and at that point we will try to extend again and succeed if libvirt now reports the right value.
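The skip-and-retry alternative suggested above can be sketched as follows. BlockInfo, the chunk size, and the free-space threshold are simplified stand-ins for vdsm's thinp internals, not the actual implementation.

```python
from collections import namedtuple

# Simplified stand-in for the BlockInfo seen in the tracebacks below.
BlockInfo = namedtuple("BlockInfo", "allocation capacity physical")

def drive_needs_extend(block_info, chunk=1024 ** 3, free_ratio=0.5):
    if block_info.capacity == 0:
        # Temporary bad report from libvirt: skip this cycle entirely.
        # If the VM later pauses on ENOSPC, the next monitoring cycle
        # retries and succeeds once libvirt reports a sane capacity.
        return False
    # Assumed heuristic: extend when free physical space falls below
    # half a chunk.
    free = block_info.physical - block_info.allocation
    return free < chunk * free_ratio

# The failing case from the logs: allocation=328192, capacity=0.
print(drive_needs_extend(BlockInfo(328192, 0, 2684354560)))  # False: skip, retry later
```

The key difference from the patch: no storage is allocated on bad data; the cost is that the VM may pause until libvirt recovers.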

Member Author

Thanks for the quick response, always good to have somebody with more history in the code having a look at it :)

So what happened: we made a backup of the VM, and a snapshot was created.

2025-04-01 04:13:45,885+0200 INFO  (jsonrpc/4) [api.virt] START snapshot(snapDrives=[{'imageID': 'c990286f-a6cc-451e-bd75-e9463f7514c9', 'baseVolumeID': '744952ac-0b5d-4c20-b72a-1f6428cd99f2', 'volumeID': '88fe8d06-be9c-46a6-9223-59206fc97b63', 'domainID': 'e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362'}, {'imageID': '65a631ba-f7c5-4be1-88ce-16f59c2c8640', 'baseVolumeID': 'efbd8056-720a-47b6-bede-4325d576ffb9', 'volumeID': 'c92414d9-307a-4e2a-be53-dade8ec6cdc9', 'domainID': 'e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362'}], snapMemory=None, frozen=False, jobUUID='bb511d9b-9fbe-40bc-a64b-a637279d690a', timeout=30, freeze_timeout=8) from=::ffff:x,54882, flow_id=00cc9982-075f-403f-87a0-1f4cab61f08c, vmId=eb9b0a1c-7659-4482-980f-7e5b9ac9336a (api:31)
2025-04-01 04:13:52,237+0200 INFO  (virt/bb511d9b) [virt.vm] (vmId='eb9b0a1c-7659-4482-980f-7e5b9ac9336a') Clearing block threshold for drive 'sda[2]' (thinp:172)
2025-04-01 04:13:52,238+0200 INFO  (virt/bb511d9b) [vdsm.api] START getVolumeSize(sdUUID='e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362', spUUID='997839a4-c790-4dc4-9c5f-8a3de81a4794', imgUUID='c990286f-a6cc-451e-bd75-e9463f7514c9', volUUID='88fe8d06-be9c-46a6-9223-59206fc97b63') from=internal, task_id=a87b3e6d-8312-4e74-b4a8-2a0e97b579ec (api:31)
2025-04-01 04:13:52,238+0200 INFO  (virt/bb511d9b) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '2684354560', 'truesize': '2684354560'} from=internal, task_id=a87b3e6d-8312-4e74-b4a8-2a0e97b579ec (api:37)

But then something goes wrong:

2025-04-01 04:13:54,065+0200 INFO  (periodic/2) [virt.vm] (vmId='eb9b0a1c-7659-4482-980f-7e5b9ac9336a') CPU stopped: pause (vm:6071)
2025-04-01 04:13:54,066+0200 ERROR (periodic/2) [Executor] Unhandled exception in <Task discardable <VolumeWatermarkMonitor vm=eb9b0a1c-7659-4482-980f-7e5b9ac9336a at 0x7fd7cafc0d30> timeout=1.0, duration=0.07 at 0x7fd7cafe7cd0> (executor:300)
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/vdsm/executor.py", line 298, in _execute_task
    task()
  File "/usr/lib/python3.9/site-packages/vdsm/executor.py", line 374, in __call__
    self._callable()
  File "/usr/lib/python3.9/site-packages/vdsm/virt/periodic.py", line 314, in __call__
    self._execute()
  File "/usr/lib/python3.9/site-packages/vdsm/virt/periodic.py", line 382, in _execute
    self._vm.volume_monitor.monitor_volumes()
  File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 278, in monitor_volumes
    self._extend_drive_if_needed(drive)
  File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 310, in _extend_drive_if_needed
    self._handle_unset(drive)
  File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 325, in _handle_unset
    if self._drive_needs_extend(drive):
  File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 404, in _drive_needs_extend
    raise ImprobableAllocationError(
vdsm.virt.thinp.ImprobableAllocationError: Improbable allocation, pausing the VM to avoid corruption: BlockInfo(index=4, name='sda', path='/rhev/data-center/mnt/blockSD/e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362/images/c990286f-a6cc-451e-bd75-e9463f7514c9/88fe8d06-be9c-46a6-9223-59206fc97b63', allocation=328192, capacity=0, physical=2684354560, threshold=0)
2025-04-01 04:13:54,164+0200 INFO  (libvirt/events) [virt.vm] (vmId='eb9b0a1c-7659-4482-980f-7e5b9ac9336a') CPU stopped: onSuspend (vm:6071)
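The values in the traceback (allocation=328192, capacity=0) show why the check fires: any nonzero allocation exceeds a zero capacity. A hypothetical minimal model of such a sanity check (not vdsm's actual _drive_needs_extend):

```python
class ImprobableAllocationError(Exception):
    pass

def check_allocation(allocation, capacity):
    # Hypothetical model: treat allocation beyond the reported virtual
    # capacity as improbable and pause the VM rather than extend.
    if allocation > capacity:
        raise ImprobableAllocationError(
            f"allocation={allocation} exceeds capacity={capacity}")

try:
    check_allocation(328192, 0)  # values from the traceback above
except ImprobableAllocationError as exc:
    print("would pause VM:", exc)
```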

As you can see, libvirt returned capacity=0, causing the extend to fail.
So the snapshot image was not extended, and the VM went into a paused state.

This was not a one-time error; it kept happening:

2025-04-01 04:14:22,011+0200 INFO  (periodic/0) [virt.vm] (vmId='eb9b0a1c-7659-4482-980f-7e5b9ac9336a') CPU stopped: pause (vm:6071)
2025-04-01 04:14:22,012+0200 ERROR (periodic/0) [Executor] Unhandled exception in <Task discardable <VolumeWatermarkMonitor vm=eb9b0a1c-7659-4482-980f-7e5b9ac9336a at 0x7fd7caf43160> timeout=1.0, duration=0.01 at 0x7fd7c87d1400> (executor:300)
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/vdsm/executor.py", line 298, in _execute_task
    task()
  File "/usr/lib/python3.9/site-packages/vdsm/executor.py", line 374, in __call__
    self._callable()
  File "/usr/lib/python3.9/site-packages/vdsm/virt/periodic.py", line 314, in __call__
    self._execute()
  File "/usr/lib/python3.9/site-packages/vdsm/virt/periodic.py", line 382, in _execute
    self._vm.volume_monitor.monitor_volumes()
  File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 278, in monitor_volumes
    self._extend_drive_if_needed(drive)
  File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 310, in _extend_drive_if_needed
    self._handle_unset(drive)
  File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 325, in _handle_unset
    if self._drive_needs_extend(drive):
  File "/usr/lib/python3.9/site-packages/vdsm/virt/thinp.py", line 404, in _drive_needs_extend
    raise ImprobableAllocationError(
vdsm.virt.thinp.ImprobableAllocationError: Improbable allocation, pausing the VM to avoid corruption: BlockInfo(index=4, name='sda', path='/rhev/data-center/mnt/blockSD/e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362/images/c990286f-a6cc-451e-bd75-e9463f7514c9/88fe8d06-be9c-46a6-9223-59206fc97b63', allocation=328192, capacity=0, physical=2684354560, threshold=0)

This continued until the VM was manually stopped (corrupting the qcow2 file because it couldn't flush writes to disk, I guess).

Some additional info about the image that may be useful:

2025-04-01 04:14:04,505+0200 INFO  (jsonrpc/0) [vdsm.api] START getQemuImageInfo(sdUUID='e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362', spUUID='997839a4-c790-4dc4-9c5f-8a3de81a4794', imgUUID='c990286f-a6cc-451e-bd75-e9463f7514c9', volUUID='88fe8d06-be9c-46a6-9223-59206fc97b63') from=::ffff:x,54882, flow_id=00cc9982-075f-403f-87a0-1f4cab61f08c, task_id=6218e687-c0f6-4710-9608-36a813e0db35 (api:31)
2025-04-01 04:14:04,528+0200 INFO  (jsonrpc/0) [vdsm.api] FINISH getQemuImageInfo return={'info': {'virtualsize': 1920383410176, 'format': 'qcow2', 'clustersize': 65536, 'backingfile': '744952ac-0b5d-4c20-b72a-1f6428cd99f2', 'compat': '1.1', 'bitmaps': [Qcow2BitmapInfo(name='96276f29-8014-437b-8f0e-035866e0afb7', granularity=65536, flags=['in-use', 'auto'])]}} from=::ffff:x,54882, flow_id=00cc9982-075f-403f-87a0-1f4cab61f08c, task_id=6218e687-c0f6-4710-9608-36a813e0db35 (api:37)
2025-04-01 04:17:23,439+0200 INFO  (jsonrpc/1) [vdsm.api] START getVolumeInfo(sdUUID='e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362', spUUID='997839a4-c790-4dc4-9c5f-8a3de81a4794', imgUUID='c990286f-a6cc-451e-bd75-e9463f7514c9', volUUID='88fe8d06-be9c-46a6-9223-59206fc97b63') from=::ffff:x,54882, flow_id=00cc9982-075f-403f-87a0-1f4cab61f08c, task_id=565da4db-c8d3-4e50-9596-dd42c7867b6f (api:31)
2025-04-01 04:17:23,520+0200 INFO  (jsonrpc/1) [storage.volumemanifest] Info request: sdUUID=e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362 imgUUID=c990286f-a6cc-451e-bd75-e9463f7514c9 volUUID = 88fe8d06-be9c-46a6-9223-59206fc97b63  (volume:228)
2025-04-01 04:17:23,536+0200 INFO  (jsonrpc/1) [storage.volumemanifest] e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362/c990286f-a6cc-451e-bd75-e9463f7514c9/88fe8d06-be9c-46a6-9223-59206fc97b63 info is {'uuid': '88fe8d06-be9c-46a6-9223-59206fc97b63', 'type': 'SPARSE', 'format': 'COW', 'disktype': 'DATA', 'voltype': 'LEAF', 'capacity': '1920383410176', 'parent': '744952ac-0b5d-4c20-b72a-1f6428cd99f2', 'description': '', 'pool': '', 'domain': 'e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362', 'image': 'c990286f-a6cc-451e-bd75-e9463f7514c9', 'ctime': '1743473609', 'mtime': '0', 'legality': 'LEGAL', 'generation': 0, 'sequence': 2, 'apparentsize': '2684354560', 'truesize': '2684354560', 'status': 'OK', 'lease': {'path': '/dev/e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362/leases', 'offset': 113246208, 'owners': [], 'version': None}, 'children': []} (volume:267)
2025-04-01 04:17:23,536+0200 INFO  (jsonrpc/1) [vdsm.api] FINISH getVolumeInfo return={'info': {'uuid': '88fe8d06-be9c-46a6-9223-59206fc97b63', 'type': 'SPARSE', 'format': 'COW', 'disktype': 'DATA', 'voltype': 'LEAF', 'capacity': '1920383410176', 'parent': '744952ac-0b5d-4c20-b72a-1f6428cd99f2', 'description': '', 'pool': '', 'domain': 'e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362', 'image': 'c990286f-a6cc-451e-bd75-e9463f7514c9', 'ctime': '1743473609', 'mtime': '0', 'legality': 'LEGAL', 'generation': 0, 'sequence': 2, 'apparentsize': '2684354560', 'truesize': '2684354560', 'status': 'OK', 'lease': {'path': '/dev/e1d2ff33-c3fd-4c1a-bcd1-2047e4efc362/leases', 'offset': 113246208, 'owners': [], 'version': None}, 'children': []}} from=::ffff:x,54882, flow_id=00cc9982-075f-403f-87a0-1f4cab61f08c, task_id=565da4db-c8d3-4e50-9596-dd42c7867b6f (api:37)

I've noticed in the code that the extension limitations were added because libvirt/qemu sometimes sent invalid info and the disk kept on extending.
But is that issue still valid? Or might it have been fixed in qemu/libvirt, making the check useless?

Member Author

Looks to be caused by corruption in the qcow2 image (caused by a bug? https://gitlab.com/qemu-project/qemu/-/issues/2912)

Contributor

@JasperB-TeamBlue JasperB-TeamBlue left a comment

LGTM

@dupondje dupondje closed this Apr 4, 2025