
[sled-agent] VMM graceful shutdown timeout #7548

Merged
hawkw merged 8 commits into main from eliza/give-instance-manager-an-even-bigger-gun
Feb 24, 2025

Conversation

@hawkw (Member) commented Feb 14, 2025

Presently, sled-agent's InstanceRunner has two mechanisms for shutting down a VMM: sending an instance state PUT request to the propolis-server process for the Stopped state, or forcibly terminating the propolis-server and tearing down the zone. When a request to stop an instance is sent to the sled-agent, it uses the first mechanism, in which Propolis is politely asked to stop the instance --- which I'll refer to as "graceful shutdown". The forceful termination path is used when asked to unregister an instance whose VMM has not started up yet, when encountering an unrecoverable VMM error, or when killing an instance that was using an expunged disk. Currently, these two paths don't really overlap: when Nexus asks a sled-agent to stop an instance, all it will do is politely ask Propolis to stop the instance gracefully, and it will only fall back to violently shooting the zone in the face if Propolis returns the error indicating it never knew about that instance in the first place.

This means that, should a VMM get stuck while shutting down the instance, stopping it will never complete successfully, and the Propolis zone won't get cleaned up. This can happen due to e.g. a Crucible activation that will never complete. Thus, the sled-agent should attempt to violently terminate a Propolis zone when a graceful shutdown of the VMM fails to complete in a timely manner.

This commit introduces a timeout for the graceful shutdown process. Now, when we send a PUT request to Propolis with the Stopped instance state, the sled-agent will start a 10-minute timer. If no update from Propolis that indicates the instance has transitioned to Stopped is received before the timer completes, the sled-agent will proceed with the forceful termination of the Propolis zone.
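The shape of that fallback can be sketched in a few lines of std-only Rust. This is a deliberately minimal model, not the actual sled-agent implementation (which is async and tracks many kinds of VMM updates); the names `stop_vmm` and `VmmState` are invented for illustration, and the update stream is collapsed into a single channel.

```rust
use std::sync::mpsc;
use std::time::Duration;

// Hypothetical stand-in for the state updates sled-agent receives from
// Propolis; the real types live in sled-agent and propolis-client.
enum VmmState {
    Stopped,
}

// Wait for the VMM to report `Stopped`; if no such update arrives before
// the grace period elapses, fall back to forcefully terminating the zone.
fn stop_vmm(updates: mpsc::Receiver<VmmState>, grace: Duration) -> &'static str {
    match updates.recv_timeout(grace) {
        Ok(VmmState::Stopped) => "graceful",
        // Timed out (or the channel closed): shoot the zone in the face.
        Err(_) => "forceful",
    }
}

fn main() {
    // Simulate a stuck Propolis: the sender never reports `Stopped`.
    let (_tx, rx) = mpsc::channel::<VmmState>();
    println!("{}", stop_vmm(rx, Duration::from_millis(50)));
}
```

With a wedged sender as in `main`, `recv_timeout` returns an error once the grace period elapses and the forceful path is taken; in the real code the grace period is the 10-minute timer.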

Fixes #4004.
Closes #6795

@smklein (Collaborator) left a comment

Looks good to me, but as you said on the hypervisor channel, it would be nice to get a test for this before merging, if we can. This seems like the sorta feature that's nice-to-have, but which can break easily.

Comment on lines +635 to +639
// Only start the stop timeout if there
// isn't one already, so that additional
// requests to stop coming in don't reset
// the clock.
@smklein (Collaborator):
It does seem that we'd call put_state again in this case though -- is that desirable?

(a valid answer may be: "Sure!" - I'm just confirming)

@hawkw (Member, Author):
🤷‍♀️ we would call it twice before this change, so I didn't mess with it. I don't see any obvious reason why it would be bad to do it again, and maybe it's useful for e.g. a higher-level component retrying?

@gjcolombo (Contributor):
FWIW my general philosophy here has been to have sled agent pass all state change requests right through to Propolis and let Propolis decide how to deal with any duplicates. (In this case it should just decide the second stop request can be ignored because it's already put one on its queue.)

@hawkw (Member, Author):
Yeah, that makes sense to me. I thought not resetting the timeout was worthwhile, though: you can imagine a scenario where an instance fails to stop, and a frustrated user or retrying client keeps sending more stop requests, each of which resets the grace period and stops the stuck VMM from being killed.
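The arm-once behavior described here can be sketched with an `Option<Instant>` deadline. This is an illustrative model only; `StopTimer` and its methods are invented names, not the sled-agent's actual types.

```rust
use std::time::{Duration, Instant};

// Sketch of the "arm once" rule: the deadline is set on the first stop
// request and left alone on repeats, so retries cannot keep pushing
// forceful termination further into the future.
struct StopTimer {
    deadline: Option<Instant>,
}

impl StopTimer {
    fn new() -> Self {
        StopTimer { deadline: None }
    }

    // Called on every incoming stop request; only the first arms the timer.
    fn on_stop_request(&mut self, grace: Duration) {
        if self.deadline.is_none() {
            self.deadline = Some(Instant::now() + grace);
        }
    }

    fn expired(&self) -> bool {
        self.deadline.map_or(false, |d| Instant::now() >= d)
    }
}

fn main() {
    let mut timer = StopTimer::new();
    timer.on_stop_request(Duration::from_secs(600)); // arms the 10-minute clock
    timer.on_stop_request(Duration::from_secs(600)); // no-op: already armed
    println!("expired: {}", timer.expired());
}
```

The stop request itself can still be forwarded to Propolis every time; only the deadline is immune to repeats.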

@gjcolombo (Contributor):
Oh, I definitely agree--once the sled agent finds out about the control plane's intent to stop the instance, it should arm the timer and not reset it if the same intent is expressed again. Mostly I just wanted to provide a bit of color about calling Propolis twice; the Propolis state machine is designed to handle that kind of thing precisely so that sled agent doesn't have to worry about it.

@hawkw (Member, Author):
Yup, thanks for that! I figured we were on the same page about the timer but felt like it was worth clarifying further.

@hawkw (Member, Author) commented Feb 14, 2025

@smklein:

as you said on the hypervisor channel, it would be nice to get a test for this before merging, if we can. This seems like the sorta feature that's nice-to-have, but which can break easily.

Yeah — this is mostly intended to protect against bugs in other components. So, in an ideal world, it's not going to be exercised that much unless we have a test for it.

@gjcolombo (Contributor) left a comment

Thanks for putting this together!

@hawkw force-pushed the eliza/give-instance-manager-an-even-bigger-gun branch from 3e536d9 to 452dda0 (February 21, 2025 19:44)
@hawkw (Member, Author) commented Feb 21, 2025

Okay, I've added a test for this which I'm pretty satisfied with. This depends on my Propolis PR oxidecomputer/propolis#869, which adds the ability to single-step the mock server's instance state machine. This is necessary to simulate a scenario in which Propolis gets stuck while shutting down. If @smklein or @gjcolombo are interested in reviewing the test, be my guest --- but I'm going to leave this PR in draft until the Propolis PR merges so that we can point our Git dep at propolis master rather than the PR.

hawkw added a commit to oxidecomputer/propolis that referenced this pull request Feb 21, 2025
For certain test scenarios, the `propolis-mock-server` ought to have a mechanism
for manual control of the mocked instance's progress through the state machine.
In particular, this is necessary for testing changes like
oxidecomputer/omicron#7548, which adds a timeout tracked by the sled-agent when
an instance is stopped. If Propolis is stuck and cannot progress, the sled-agent
will forcefully terminate it after that timeout...but testing this requires a
way to make the mock Propolis pretend to be stuck.

This commit adds the following new endpoints to the mock server which are not
part of the real `propolis-server` API:

- `PUT /mock/mode`: sets a mock mode, either `Run` (the normal behavior), or
  `SingleStep`, where state transitions only occur when asked for by the test.
- `GET /mock/mode`: returns whether or not we are single-steppy
- `PUT /mock/step`: advances to the next queued generation
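The single-step mode those endpoints control can be modeled in a few lines. This is a sketch of the behavior only; `MockMode`, `MockVmm`, and their methods are invented names, while the real mock server wires the equivalent logic to the HTTP endpoints above.

```rust
use std::collections::VecDeque;

#[derive(PartialEq)]
enum MockMode {
    Run,        // normal behavior: transitions drain immediately
    SingleStep, // transitions wait until the test steps them
}

struct MockVmm {
    mode: MockMode,
    queued: VecDeque<&'static str>, // pending state transitions
    current: &'static str,
}

impl MockVmm {
    // Queue a transition; in `Run` mode it (and anything behind it)
    // takes effect at once, in `SingleStep` mode it just sits there.
    fn enqueue(&mut self, state: &'static str) {
        self.queued.push_back(state);
        if self.mode == MockMode::Run {
            while self.step() {}
        }
    }

    // Advance one queued transition; false if nothing is queued.
    fn step(&mut self) -> bool {
        match self.queued.pop_front() {
            Some(s) => {
                self.current = s;
                true
            }
            None => false,
        }
    }
}

fn main() {
    let mut vmm = MockVmm {
        mode: MockMode::SingleStep,
        queued: VecDeque::new(),
        current: "Running",
    };
    // Wedged mid-shutdown until the test explicitly steps the machine:
    vmm.enqueue("Stopping");
    println!("before step: {}", vmm.current);
    vmm.step();
    println!("after step: {}", vmm.current);
}
```

A test that wants a "stuck" Propolis puts the mock in `SingleStep`, issues the stop, and simply never calls step, letting the sled-agent's grace timer fire.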

Testing: I've written [a test in Omicron][1] that uses this, and I can make
the mock propolis get wedged in the correct place. So that's nice.

Closes #858 

[1]: https://github.com/oxidecomputer/omicron/compare/eliza/single-steppy?expand=1
@hawkw hawkw marked this pull request as ready for review February 21, 2025 20:14
@hawkw (Member, Author) commented Feb 21, 2025

Okay, as we are now depending on the mock-server from oxidecomputer/propolis@2652487 (the master branch), I'm moving this out of draft. If y'all don't want to review the test, I'll auto-merge --- let me know?

Co-authored-by: Greg Colombo <greg@oxidecomputer.com>
@hawkw hawkw enabled auto-merge (squash) February 21, 2025 20:26
@hawkw (Member, Author) commented Feb 24, 2025

I'm guessing this helios / deploy failure that couldn't start Nexus is a flake but will look closer.

@hawkw (Member, Author) commented Feb 24, 2025

> I'm guessing this helios / deploy failure that couldn't start Nexus is a flake but will look closer.

Looks like sled-agent was still waiting for NTP sync.

@hawkw hawkw disabled auto-merge February 24, 2025 18:15
@hawkw hawkw enabled auto-merge (squash) February 24, 2025 18:15
@hawkw hawkw merged commit 459fca4 into main Feb 24, 2025
18 checks passed
@hawkw hawkw deleted the eliza/give-instance-manager-an-even-bigger-gun branch February 24, 2025 19:59


Development

Successfully merging this pull request may close these issues.

want sled-agent timeout on completion of requests to stop a Propolis VM

3 participants