@olljanat
Contributor

@olljanat olljanat commented Jan 5, 2026

Description

There seem to be several bugs in the QEMU driver's graceful_shutdown feature, which have most likely been there since it was added in #4800 (at least that PR's description says it was not tested). Those bugs are:

  • Use of -monitor instead of -qmp; the socket created by -monitor is read-only.
  • Missing capabilities negotiation.
  • Commands are not sent as JSON, which the QMP protocol expects.
  • Not reading responses from QEMU.
  • Not waiting for the VM to shut down before sending kill_signal to it.
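
For illustration, the exchange these points add up to can be sketched in Python. This is a minimal sketch, not the driver's code: the helper names read_reply and qmp_powerdown are hypothetical, and it assumes a -qmp socket in the default non-pretty mode, where QEMU ends each message with a newline.

```python
import json
import socket


def read_reply(reader):
    """Read newline-delimited QMP messages until a command reply arrives.

    Asynchronous events (objects carrying an "event" key) are skipped.
    """
    while True:
        line = reader.readline()
        if not line:
            raise ConnectionError("QMP socket closed before a reply arrived")
        msg = json.loads(line)
        if "return" in msg or "error" in msg:
            return msg


def qmp_powerdown(sock):
    """Negotiate capabilities, then ask the guest to power down."""
    reader = sock.makefile("rb")
    # QEMU speaks first: a greeting object with a top-level "QMP" key.
    greeting = json.loads(reader.readline())
    if "QMP" not in greeting:
        raise RuntimeError("unexpected greeting: %r" % (greeting,))
    replies = []
    # Capabilities negotiation must succeed before other commands are accepted.
    for command in ("qmp_capabilities", "system_powerdown"):
        sock.sendall(json.dumps({"execute": command}).encode() + b"\n")
        reply = read_reply(reader)
        if "error" in reply:
            raise RuntimeError("%s failed: %r" % (command, reply["error"]))
        replies.append(reply)
    return replies
```

Against a real VM, sock would be a socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) connected to the path passed to -qmp. A successful system_powerdown reply only means QEMU accepted the request; the guest still has to act on it.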

Testing & Reproduction steps

If you check the job's error log after you stop the job, it contains a message like qemu-system-x86_64: terminating on signal 2 from pid 63315 (/usr/bin/nomad), even when graceful_shutdown is configured.

That is why you need to use kill_signal = "SIGUSR1" to really see that the guest shutdown does not work and that the VM process gets killed by Nomad.

You can also try a Python script like this to see that the socket created by the -monitor flag is read-only.

import json, socket
sock_path = "/path/to/socket"
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect(sock_path)
s.sendall(json.dumps({"execute":"qmp_capabilities"}).encode())
s.sendall(json.dumps({"execute":"system_powerdown"}).encode())
s.close()

and that it only works after you enable another monitor with parameters like:

 "-mon", "chardev=mon1,mode=control,pretty=on",
 "-qmp", "unix:/tmp/qmp-sock,server,nowait"

Here is also a working Go client:

package main

import (
	"fmt"
	"net"
)

const (
	qemuGracefulShutdownMsg = `{"execute": "system_powerdown"}`
	qemuQmpCapabilitiesMsg  = `{"execute":"qmp_capabilities"}`
)

func main() {
	monitorPath := "/tmp/qmp-sock"
	monitorSocket, err := net.Dial("unix", monitorPath)
	if err != nil {
		fmt.Println("could not connect to qemu monitor", "monitorPath", monitorPath, "error", err)
		return
	}
	defer monitorSocket.Close()

	// Read the greeting QEMU sends on connect; the result is ignored for brevity.
	buf := make([]byte, 512)
	monitorSocket.Read(buf)
	fmt.Println("sending qmp_capabilities command to qemu monitor socket", "monitor_path", monitorPath)
	_, err = monitorSocket.Write([]byte(qemuQmpCapabilitiesMsg))
	if err != nil {
		fmt.Println("failed to send qmp_capabilities", "qmp_capabilities", qemuQmpCapabilitiesMsg, "monitorPath", monitorPath, "error", err)
		return
	}
	// Read the reply to qmp_capabilities before sending the next command.
	monitorSocket.Read(buf)

	fmt.Println("sending graceful shutdown command to qemu monitor socket", "monitor_path", monitorPath)
	_, err = monitorSocket.Write([]byte(qemuGracefulShutdownMsg))
	if err != nil {
		fmt.Println("failed to send shutdown message", "shutdown message", qemuGracefulShutdownMsg, "monitorPath", monitorPath, "error", err)
		return
	}
	// Read the reply to system_powerdown before closing the socket.
	monitorSocket.Read(buf)
}

but unlike with Python, reading QEMU's responses seems to be needed here; otherwise the commands are not effective.
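
Why the Go client needs the reads is not established here. Independent of that, a single fixed-size read can return a partial reply, and with pretty=on a reply spans several lines. As a hedged sketch of one way to read replies robustly, the hypothetical helper iter_qmp_messages below buffers incoming bytes and yields each JSON object only once it is complete. It uses only the Python standard library and makes no claim about how the driver reads the socket.

```python
import codecs
import json


def iter_qmp_messages(chunks):
    """Yield complete JSON objects from an iterable of byte chunks.

    Works for both compact and pretty-printed (multi-line) messages,
    and for messages split across chunk boundaries.
    """
    decoder = json.JSONDecoder()
    utf8 = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += utf8.decode(chunk)
        while True:
            # raw_decode does not accept leading whitespace, so strip it.
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # incomplete object: wait for more data
            yield obj
            buffer = buffer[end:]
```

With a connected socket s, the chunks could come from iter(lambda: s.recv(4096), b""), which stops when the peer closes the connection.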

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad product documentation, which is stored in the
    web-unified-docs repo. Refer to the web-unified-docs contributor guide for docs guidelines.
    Please also consider whether the change requires notes within the upgrade guide. If you would like help with the docs, tag the nomad-docs team in this PR.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@mismithhisler
Member

Hi @olljanat! Thanks for this contribution. I'll take a swing at reproducing this today.

@mismithhisler mismithhisler moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jan 5, 2026
@mismithhisler mismithhisler self-assigned this Jan 5, 2026
Member

@mismithhisler mismithhisler left a comment

Thanks for the great contribution! I left a couple comments, let me know what you think.

@olljanat olljanat force-pushed the fix-qemu-guest-shutdown branch from ecea7e9 to 860331c Compare January 7, 2026 11:59
@mismithhisler
Member

Last thing, do you mind adding a changelog entry via make cl with a description of something like "qemu: fixes graceful_shutdown to wait kill_timeout before signalling process"?
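
The behaviour that changelog text describes can be sketched as a bounded polling loop. This is only an illustration in Python with hypothetical names (wait_for_exit, stop_vm, and the callables passed in), not the driver's Go code: request a powerdown, poll until the process exits or the timeout elapses, and signal the process only on timeout.

```python
import time


def wait_for_exit(is_running, timeout, interval=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll is_running() until it returns False or timeout seconds pass.

    Returns True if the process exited in time, False on timeout.
    """
    deadline = clock() + timeout
    while is_running():
        if clock() >= deadline:
            return False
        sleep(interval)
    return True


def stop_vm(request_powerdown, is_running, kill, timeout):
    """Try a graceful guest shutdown first; kill only if it times out."""
    request_powerdown()
    if wait_for_exit(is_running, timeout):
        return "graceful"
    kill()
    return "killed"
```

Here timeout plays the role of kill_timeout, and kill stands in for delivering the configured kill_signal.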

@mismithhisler mismithhisler added backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/ent/1.10.x+ent backport to 1.10.x+ent release line backport/1.11.x backport to 1.11.x release line labels Jan 7, 2026
@olljanat olljanat force-pushed the fix-qemu-guest-shutdown branch 2 times, most recently from e9eea0e to 7b668ae Compare January 7, 2026 13:17
@olljanat
Contributor Author

olljanat commented Jan 7, 2026

Sure, changelog is included now

d.logger.Error("graceful shutdown", "pid", handle.pid, "timeout after", timeout)
break out
case <-ticker.C:
handle, _ = d.tasks.Get(taskID)
Member

Can we remove this line because we already have the task handle from line 718?

Contributor Author

Ah, yes, if that is the actual handle, whose status updates without needing to fetch it again. Updated (but not tested).

@olljanat olljanat force-pushed the fix-qemu-guest-shutdown branch from 7b668ae to 844cded Compare January 7, 2026 15:35
Member

@mismithhisler mismithhisler left a comment

LGTM. Thank you for this contribution!

@mismithhisler mismithhisler merged commit 535888a into hashicorp:main Jan 7, 2026
29 of 31 checks passed
