Fix: Graceful QEMU shutdown escalation to prevent disk corruption by odesenfans · Pull Request #925 · aleph-im/aleph-vm

odesenfans · 2026-03-20T09:25:53Z

QemuVM.stop() previously sent an ACPI powerdown and returned immediately, leaving the 30s systemd SIGKILL as the only fallback. A SIGKILL terminates QEMU without flushing disk caches, which can corrupt qcow2 metadata and guest filesystems (e.g. missing kernel files after an in-guest apt upgrade).

The new shutdown sequence:
t=0s ACPI system_powerdown (guest handles clean shutdown)
t=50s QMP "quit" (QEMU flushes block device caches and exits)
t=60s systemd SIGKILL (last resort)

Explain what problem this PR is resolving

Related ClickUp, GitHub or Jira tickets : ALEPH-XXX

Self proofreading checklist

The new code clear, easy to read and well commented.
New code does not duplicate the functions of builtin or popular libraries.
An LLM was used to review the new code and look for simplifications.
New classes and functions contain docstrings explaining what they provide.
All new code is covered by relevant tests.
Documentation has been updated regarding these changes.
Dependencies update in the project.toml have been mirrored in the Debian package build script packaging/Makefile

Changes

Explain the changes that were made. The idea is not to list exhaustively all the changes made (GitHub already provides a full diff), but to help the reviewers better understand:

which specific file changes go together, e.g: when creating a table in the front-end, there usually is a config file that goes with it
the reasoning behind some changes, e.g: deleted files because they are now redundant
the behaviour to expect, e.g: tooltip has purple background color because the client likes it so, changed a key in the API response to be consistent with other endpoints

How to test

Explain how to test your PR.
If a specific config is required explain it here (account, data entry, ...)

Print screen / video

Upload here screenshots or videos showing the changes if relevant.

Notes

Things that the reviewers should know: known bugs that are out of the scope of the PR, other trade-offs that were made.
If the PR depends on a PR in another repo, or merges into another PR (i.o. main), it should also be mentioned here

foxpatch-aleph

This PR correctly addresses an important issue: preventing qcow2 disk corruption by implementing a graceful shutdown escalation (ACPI → QMP quit → SIGKILL). The code logic is sound, timeout values are appropriate (50s ACPI wait + 10s buffer before 60s systemd SIGKILL), and the systemd service file is properly updated. However, the PR checklist claims 'All new code is covered by relevant tests' but no tests exist for the new shutdown escalation logic. Tests should be added for the stop() method's timeout handling and QMP quit fallback before merge.

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 256): After sending QMP quit, the code doesn't wait for the process to exit before returning. While systemd's TimeoutStopSec will eventually clean up, consider adding a short wait (e.g., 5-10s) after QMP quit to verify the process actually terminated. This would provide better observability and earlier detection of issues.

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 226): The check self.journal_stdout != asyncio.subprocess.DEVNULL is always true since journal_stdout is set to journal.stream(...) in start(), never DEVNULL. This check can be simplified to just if self.journal_stdout:.

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 225): Add a docstring to _close_journals() explaining its purpose, consistent with the docstring on _send_qmp_quit().

tests/supervisor/test_qemu_instance.py (line 1): REQUIRED: Add unit tests for the new shutdown escalation logic. Specifically test: (1) graceful shutdown via ACPI powerdown completes within timeout, (2) timeout triggers QMP quit fallback, (3) QMP quit is sent when ACPI fails. Use pytest-asyncio with mocked subprocess and QMP client to test these paths without requiring actual QEMU instances.

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 18): Consider moving GRACEFUL_SHUTDOWN_TIMEOUT to a configuration setting rather than a hardcoded constant. This would allow operators to tune the timeout based on their VM workloads (e.g., VMs with heavy I/O may need longer to shut down cleanly).

codecov · 2026-03-20T09:35:46Z

Codecov Report

❌ Patch coverage is 13.33333% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.56%. Comparing base (a0bdae7) to head (c4ecad2).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/aleph/vm/hypervisors/qemu/qemuvm.py	13.33%	26 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #925      +/-   ##
==========================================
- Coverage   68.65%   68.56%   -0.10%     
==========================================
  Files         104      104              
  Lines       11924    11952      +28     
  Branches     1016     1019       +3     
==========================================
+ Hits         8187     8195       +8     
- Misses       3472     3493      +21     
+ Partials      265      264       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

QemuVM.stop() previously sent an ACPI powerdown and returned immediately, leaving the 30s systemd SIGKILL as the only fallback. A SIGKILL terminates QEMU without flushing disk caches, which can corrupt qcow2 metadata and guest filesystems (e.g. missing kernel files after an in-guest apt upgrade). The new shutdown sequence: t=0s ACPI system_powerdown (guest handles clean shutdown) t=50s QMP "quit" (QEMU flushes block device caches and exits) t=60s systemd SIGKILL (last resort)

foxpatch-aleph

This PR correctly addresses the disk corruption risk from SIGKILL by implementing a graceful shutdown escalation sequence (ACPI powerdown → QMP quit → SIGKILL). The implementation properly waits for QEMU to exit after QMP quit, preventing race conditions with network cleanup. Code is well-documented with clear logging at each stage. The systemd TimeoutStopSec increase from 30s to 60s is appropriate.

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 286): The 60 is hardcoded here, matching TimeoutStopSec in the systemd service file. Consider defining a constant like SYSTEMD_TIMEOUT_SEC = 60 to keep these values synchronized and avoid maintenance issues if either changes.

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 291): Minor style: the string concatenation "VM %s still running %ds after QMP quit, " "systemd SIGKILL will handle it" works but could be clearer as a single string with an explicit space: "VM %s still running %ds after QMP quit, systemd SIGKILL will handle it".

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 255): The PR checklist mentions tests for new code, but I don't see unit tests specifically covering the graceful shutdown escalation (ACPI timeout → QMP quit). Consider adding tests that mock the process wait timeouts to verify the escalation path works correctly.

foxpatch-aleph requested changes Mar 20, 2026

View reviewed changes

odesenfans and others added 2 commits April 10, 2026 14:41

qemu: wait for process exit after QMP quit before returning from stop()

c4ecad2

aliel force-pushed the fix/qemu-graceful-shutdown branch from 6497b74 to c4ecad2 Compare April 10, 2026 12:49

foxpatch-aleph approved these changes Apr 10, 2026

View reviewed changes

aliel merged commit 21b1d44 into main Apr 10, 2026
16 of 18 checks passed

aliel deleted the fix/qemu-graceful-shutdown branch April 10, 2026 13:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Graceful QEMU shutdown escalation to prevent disk corruption#925

Fix: Graceful QEMU shutdown escalation to prevent disk corruption#925
aliel merged 2 commits intomainfrom
fix/qemu-graceful-shutdown

odesenfans commented Mar 20, 2026

Uh oh!

foxpatch-aleph left a comment

Uh oh!

codecov Bot commented Mar 20, 2026 •

edited

Loading

Uh oh!

foxpatch-aleph left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

odesenfans commented Mar 20, 2026

Self proofreading checklist

Changes

How to test

Print screen / video

Notes

Uh oh!

foxpatch-aleph left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

foxpatch-aleph left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Mar 20, 2026 •

edited

Loading