Fix: Graceful QEMU shutdown escalation to prevent disk corruption #925

Merged
aliel merged 2 commits into main from fix/qemu-graceful-shutdown on Apr 10, 2026

@odesenfans
Contributor

QemuVM.stop() previously sent an ACPI powerdown and returned immediately, leaving the 30s systemd SIGKILL as the only fallback. A SIGKILL terminates QEMU without flushing disk caches, which can corrupt qcow2 metadata and guest filesystems (e.g. missing kernel files after an in-guest apt upgrade).

The new shutdown sequence:
t=0s ACPI system_powerdown (guest handles clean shutdown)
t=50s QMP "quit" (QEMU flushes block device caches and exits)
t=60s systemd SIGKILL (last resort)
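
The first two stages of this timeline can be sketched as follows. This is a minimal illustration, not the code merged in this PR: `stop_with_escalation` and its three callables are hypothetical names, and the final SIGKILL at t=60s is left to systemd rather than modelled here.

```python
import asyncio
from typing import Awaitable, Callable

# Value taken from the PR description; the constant name follows the review comments.
GRACEFUL_SHUTDOWN_TIMEOUT = 50  # seconds to wait after ACPI powerdown


async def stop_with_escalation(
    wait_for_exit: Callable[[], Awaitable[object]],
    send_powerdown: Callable[[], Awaitable[None]],
    send_quit: Callable[[], Awaitable[None]],
    acpi_timeout: float = GRACEFUL_SHUTDOWN_TIMEOUT,
) -> str:
    """Return the stage that was reached: "acpi" or "qmp_quit"."""
    # t=0s: ask the guest to shut itself down cleanly.
    await send_powerdown()
    try:
        # Give the guest up to acpi_timeout seconds to power off.
        await asyncio.wait_for(wait_for_exit(), timeout=acpi_timeout)
        return "acpi"
    except asyncio.TimeoutError:
        # t=50s: the guest did not stop, so ask QEMU itself to exit.
        await send_quit()
        return "qmp_quit"
```

Passing the three operations as callables keeps the sketch independent of any particular QMP client or process handle.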

Explain what problem this PR is resolving

Related ClickUp, GitHub or Jira tickets : ALEPH-XXX

Self proofreading checklist

  • The new code is clear, easy to read and well commented.
  • New code does not duplicate the functions of builtin or popular libraries.
  • An LLM was used to review the new code and look for simplifications.
  • New classes and functions contain docstrings explaining what they provide.
  • All new code is covered by relevant tests.
  • Documentation has been updated regarding these changes.
  • Dependency updates in the project.toml have been mirrored in the Debian package build script packaging/Makefile.

Changes

Explain the changes that were made. The idea is not to list exhaustively all the changes made (GitHub already provides a full diff), but to help the reviewers better understand:

  • which specific file changes go together, e.g. when creating a table in the front-end, there is usually a config file that goes with it
  • the reasoning behind some changes, e.g. files deleted because they are now redundant
  • the behaviour to expect, e.g. a tooltip has a purple background because the client likes it so, or a key in the API response was changed to be consistent with other endpoints

How to test

Explain how to test your PR.
If a specific config is required explain it here (account, data entry, ...)

Print screen / video

Upload here screenshots or videos showing the changes if relevant.

Notes

Things that the reviewers should know: known bugs that are out of the scope of the PR, other trade-offs that were made.
If the PR depends on a PR in another repo, or merges into another PR (instead of main), it should also be mentioned here.

@foxpatch-aleph left a comment

This PR correctly addresses an important issue: preventing qcow2 disk corruption by implementing a graceful shutdown escalation (ACPI → QMP quit → SIGKILL). The code logic is sound, timeout values are appropriate (50s ACPI wait + 10s buffer before 60s systemd SIGKILL), and the systemd service file is properly updated. However, the PR checklist claims 'All new code is covered by relevant tests' but no tests exist for the new shutdown escalation logic. Tests should be added for the stop() method's timeout handling and QMP quit fallback before merge.

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 256): After sending QMP quit, the code doesn't wait for the process to exit before returning. While systemd's TimeoutStopSec will eventually clean up, consider adding a short wait (e.g., 5-10s) after QMP quit to verify the process actually terminated. This would provide better observability and earlier detection of issues.
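
A minimal sketch of the wait suggested in this comment, assuming `proc` behaves like an `asyncio.subprocess.Process` (an awaitable `wait()` and a `returncode` attribute). The helper name `wait_after_quit` and the 10s default are illustrative, not taken from the PR.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def wait_after_quit(proc, vm_id: str, quit_wait: float = 10.0) -> bool:
    """Wait up to quit_wait seconds for the process to exit after QMP quit.

    Returns True if the process exited in time, False if it is still running.
    """
    try:
        await asyncio.wait_for(proc.wait(), timeout=quit_wait)
    except asyncio.TimeoutError:
        # Nothing more to do here: systemd's stop timeout is the last resort.
        logger.warning("VM %s still running %ds after QMP quit", vm_id, quit_wait)
        return False
    logger.info("VM %s exited with code %s after QMP quit", vm_id, proc.returncode)
    return True
```

Returning a boolean lets the caller log or report a failed quit before systemd steps in.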

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 226): The check self.journal_stdout != asyncio.subprocess.DEVNULL is always true since journal_stdout is set to journal.stream(...) in start(), never DEVNULL. This check can be simplified to just if self.journal_stdout:.

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 225): Add a docstring to _close_journals() explaining its purpose, consistent with the docstring on _send_qmp_quit().

tests/supervisor/test_qemu_instance.py (line 1): REQUIRED: Add unit tests for the new shutdown escalation logic. Specifically test: (1) graceful shutdown via ACPI powerdown completes within timeout, (2) timeout triggers QMP quit fallback, (3) QMP quit is sent when ACPI fails. Use pytest-asyncio with mocked subprocess and QMP client to test these paths without requiring actual QEMU instances.
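
The first two cases could be tested roughly as below. This is a sketch under assumptions: `stop` is a hypothetical stand-in for `QemuVM.stop()`, the `system_powerdown` and `quit` method names on the mocked QMP client are invented for illustration, and the tests use `asyncio.run` with the standard library's `AsyncMock` instead of pytest-asyncio so the block has no third-party dependency.

```python
import asyncio
from unittest.mock import AsyncMock


async def stop(proc, qmp, acpi_timeout: float) -> None:
    """Hypothetical stand-in: ACPI powerdown first, QMP quit if the wait times out."""
    await qmp.system_powerdown()
    try:
        await asyncio.wait_for(proc.wait(), timeout=acpi_timeout)
    except asyncio.TimeoutError:
        await qmp.quit()


def test_acpi_powerdown_completes_within_timeout():
    proc = AsyncMock()  # proc.wait() returns immediately, as if the guest shut down
    qmp = AsyncMock()
    asyncio.run(stop(proc, qmp, acpi_timeout=0.1))
    qmp.system_powerdown.assert_awaited_once()
    qmp.quit.assert_not_awaited()


def test_timeout_triggers_qmp_quit():
    async def never_exits():
        await asyncio.sleep(3600)

    proc = AsyncMock()
    proc.wait.side_effect = never_exits  # simulate a guest that ignores ACPI
    qmp = AsyncMock()
    asyncio.run(stop(proc, qmp, acpi_timeout=0.01))
    qmp.system_powerdown.assert_awaited_once()
    qmp.quit.assert_awaited_once()
```

Short timeouts keep the tests fast; no real QEMU process or QMP socket is involved.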

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 18): Consider moving GRACEFUL_SHUTDOWN_TIMEOUT to a configuration setting rather than a hardcoded constant. This would allow operators to tune the timeout based on their VM workloads (e.g., VMs with heavy I/O may need longer to shut down cleanly).
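
One way to make the timeout tunable, sketched with the standard library only. The `ShutdownSettings` class and the `QEMU_GRACEFUL_SHUTDOWN_TIMEOUT` variable name are hypothetical; the project's actual settings mechanism is not shown in this PR.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownSettings:
    # Default mirrors the 50s ACPI wait described in this PR.
    graceful_shutdown_timeout: int = 50

    @classmethod
    def from_env(cls, environ=os.environ) -> "ShutdownSettings":
        """Build settings from an environment mapping, falling back to the default."""
        raw = environ.get("QEMU_GRACEFUL_SHUTDOWN_TIMEOUT")  # hypothetical variable name
        if raw is None:
            return cls()
        timeout = int(raw)
        if timeout <= 0:
            raise ValueError("graceful shutdown timeout must be positive")
        return cls(graceful_shutdown_timeout=timeout)
```

Any value an operator chooses would still need to stay below the systemd stop timeout (60s in this PR), otherwise the SIGKILL arrives before QMP quit is attempted.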

@codecov

codecov Bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 13.33333% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.56%. Comparing base (a0bdae7) to head (c4ecad2).
⚠️ Report is 1 commits behind head on main.

Files with missing lines                    Patch %   Lines
src/aleph/vm/hypervisors/qemu/qemuvm.py     13.33%    26 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #925      +/-   ##
==========================================
- Coverage   68.65%   68.56%   -0.10%     
==========================================
  Files         104      104              
  Lines       11924    11952      +28     
  Branches     1016     1019       +3     
==========================================
+ Hits         8187     8195       +8     
- Misses       3472     3493      +21     
+ Partials      265      264       -1     

odesenfans and others added 2 commits April 10, 2026 14:41
QemuVM.stop() previously sent an ACPI powerdown and returned immediately,
leaving the 30s systemd SIGKILL as the only fallback. A SIGKILL terminates
QEMU without flushing disk caches, which can corrupt qcow2 metadata and
guest filesystems (e.g. missing kernel files after an in-guest apt upgrade).

The new shutdown sequence:
  t=0s   ACPI system_powerdown (guest handles clean shutdown)
  t=50s  QMP "quit" (QEMU flushes block device caches and exits)
  t=60s  systemd SIGKILL (last resort)
@aliel force-pushed the fix/qemu-graceful-shutdown branch from 6497b74 to c4ecad2 on April 10, 2026 12:49

@foxpatch-aleph left a comment

This PR correctly addresses the disk corruption risk from SIGKILL by implementing a graceful shutdown escalation sequence (ACPI powerdown → QMP quit → SIGKILL). The implementation properly waits for QEMU to exit after QMP quit, preventing race conditions with network cleanup. Code is well-documented with clear logging at each stage. The systemd TimeoutStopSec increase from 30s to 60s is appropriate.

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 286): The 60 is hardcoded here, matching TimeoutStopSec in the systemd service file. Consider defining a constant like SYSTEMD_TIMEOUT_SEC = 60 to keep these values synchronized and avoid maintenance issues if either changes.
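
A sketch of the constant suggested here, assuming the 60 is used to work out how long QMP quit has before systemd intervenes. The helper `qmp_quit_budget` is hypothetical, and the Python constant still has to be kept in step with `TimeoutStopSec` in the service file by hand.

```python
# Shutdown budget, with values taken from this PR.
SYSTEMD_TIMEOUT_SEC = 60        # assumed to mirror TimeoutStopSec in the service file
GRACEFUL_SHUTDOWN_TIMEOUT = 50  # ACPI powerdown wait before QMP quit


def qmp_quit_budget(
    systemd_timeout: int = SYSTEMD_TIMEOUT_SEC,
    acpi_timeout: int = GRACEFUL_SHUTDOWN_TIMEOUT,
) -> int:
    """Seconds left for QMP quit to finish before systemd sends SIGKILL."""
    budget = systemd_timeout - acpi_timeout
    if budget <= 0:
        raise ValueError("ACPI timeout must be shorter than the systemd stop timeout")
    return budget
```

Deriving the remaining wait from the two constants means a change to either one is reflected in the log message and the wait duration together.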

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 291): Minor style: the string concatenation "VM %s still running %ds after QMP quit, " "systemd SIGKILL will handle it" works but could be clearer as a single string with an explicit space: "VM %s still running %ds after QMP quit, systemd SIGKILL will handle it".

src/aleph/vm/hypervisors/qemu/qemuvm.py (line 255): The PR checklist mentions tests for new code, but I don't see unit tests specifically covering the graceful shutdown escalation (ACPI timeout → QMP quit). Consider adding tests that mock the process wait timeouts to verify the escalation path works correctly.

@aliel merged commit 21b1d44 into main on Apr 10, 2026
16 of 18 checks passed
@aliel deleted the fix/qemu-graceful-shutdown branch on April 10, 2026 13:04

3 participants