Fix: Graceful QEMU shutdown escalation to prevent disk corruption#925
Fix: Graceful QEMU shutdown escalation to prevent disk corruption#925
Conversation
foxpatch-aleph
left a comment
There was a problem hiding this comment.
This PR correctly addresses an important issue: preventing qcow2 disk corruption by implementing a graceful shutdown escalation (ACPI → QMP quit → SIGKILL). The code logic is sound, timeout values are appropriate (50s ACPI wait + 10s buffer before 60s systemd SIGKILL), and the systemd service file is properly updated. However, the PR checklist claims 'All new code is covered by relevant tests' but no tests exist for the new shutdown escalation logic. Tests should be added for the stop() method's timeout handling and QMP quit fallback before merge.
src/aleph/vm/hypervisors/qemu/qemuvm.py (line 256): After sending QMP quit, the code doesn't wait for the process to exit before returning. While systemd's TimeoutStopSec will eventually clean up, consider adding a short wait (e.g., 5-10s) after QMP quit to verify the process actually terminated. This would provide better observability and earlier detection of issues.
src/aleph/vm/hypervisors/qemu/qemuvm.py (line 226): The check self.journal_stdout != asyncio.subprocess.DEVNULL is always true since journal_stdout is set to journal.stream(...) in start(), never DEVNULL. This check can be simplified to just if self.journal_stdout:.
src/aleph/vm/hypervisors/qemu/qemuvm.py (line 225): Add a docstring to _close_journals() explaining its purpose, consistent with the docstring on _send_qmp_quit().
tests/supervisor/test_qemu_instance.py (line 1): REQUIRED: Add unit tests for the new shutdown escalation logic. Specifically test: (1) graceful shutdown via ACPI powerdown completes within timeout, (2) timeout triggers QMP quit fallback, (3) QMP quit is sent when ACPI fails. Use pytest-asyncio with mocked subprocess and QMP client to test these paths without requiring actual QEMU instances.
src/aleph/vm/hypervisors/qemu/qemuvm.py (line 18): Consider moving GRACEFUL_SHUTDOWN_TIMEOUT to a configuration setting rather than a hardcoded constant. This would allow operators to tune the timeout based on their VM workloads (e.g., VMs with heavy I/O may need longer to shut down cleanly).
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #925 +/- ##
==========================================
- Coverage 68.65% 68.56% -0.10%
==========================================
Files 104 104
Lines 11924 11952 +28
Branches 1016 1019 +3
==========================================
+ Hits 8187 8195 +8
- Misses 3472 3493 +21
+ Partials 265 264 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
QemuVM.stop() previously sent an ACPI powerdown and returned immediately, leaving the 30s systemd SIGKILL as the only fallback. A SIGKILL terminates QEMU without flushing disk caches, which can corrupt qcow2 metadata and guest filesystems (e.g. missing kernel files after an in-guest apt upgrade). The new shutdown sequence: t=0s ACPI system_powerdown (guest handles clean shutdown) t=50s QMP "quit" (QEMU flushes block device caches and exits) t=60s systemd SIGKILL (last resort)
6497b74 to
c4ecad2
Compare
foxpatch-aleph
left a comment
There was a problem hiding this comment.
This PR correctly addresses the disk corruption risk from SIGKILL by implementing a graceful shutdown escalation sequence (ACPI powerdown → QMP quit → SIGKILL). The implementation properly waits for QEMU to exit after QMP quit, preventing race conditions with network cleanup. Code is well-documented with clear logging at each stage. The systemd TimeoutStopSec increase from 30s to 60s is appropriate.
src/aleph/vm/hypervisors/qemu/qemuvm.py (line 286): The 60 is hardcoded here, matching TimeoutStopSec in the systemd service file. Consider defining a constant like SYSTEMD_TIMEOUT_SEC = 60 to keep these values synchronized and avoid maintenance issues if either changes.
src/aleph/vm/hypervisors/qemu/qemuvm.py (line 291): Minor style: the string concatenation "VM %s still running %ds after QMP quit, " "systemd SIGKILL will handle it" works but could be clearer as a single string with an explicit space: "VM %s still running %ds after QMP quit, systemd SIGKILL will handle it".
src/aleph/vm/hypervisors/qemu/qemuvm.py (line 255): The PR checklist mentions tests for new code, but I don't see unit tests specifically covering the graceful shutdown escalation (ACPI timeout → QMP quit). Consider adding tests that mock the process wait timeouts to verify the escalation path works correctly.
QemuVM.stop() previously sent an ACPI powerdown and returned immediately, leaving the 30s systemd SIGKILL as the only fallback. A SIGKILL terminates QEMU without flushing disk caches, which can corrupt qcow2 metadata and guest filesystems (e.g. missing kernel files after an in-guest apt upgrade).
The new shutdown sequence:
t=0s ACPI system_powerdown (guest handles clean shutdown)
t=50s QMP "quit" (QEMU flushes block device caches and exits)
t=60s systemd SIGKILL (last resort)
Explain what problem this PR is resolving
Related ClickUp, GitHub or Jira tickets : ALEPH-XXX
Self proofreading checklist
packaging/MakefileChanges
Explain the changes that were made. The idea is not to list exhaustively all the changes made (GitHub already provides a full diff), but to help the reviewers better understand:
How to test
Explain how to test your PR.
If a specific config is required explain it here (account, data entry, ...)
Print screen / video
Upload here screenshots or videos showing the changes if relevant.
Notes
Things that the reviewers should know: known bugs that are out of the scope of the PR, other trade-offs that were made.
If the PR depends on a PR in another repo, or merges into another PR (i.o. main), it should also be mentioned here