
[sled-agent] VMM graceful shutdown timeout #7548

Merged
hawkw merged 8 commits into main from eliza/give-instance-manager-an-even-bigger-gun
Feb 24, 2025

Conversation

@hawkw (Member) commented Feb 14, 2025

Presently, sled-agent's InstanceRunner has two mechanisms for shutting down a VMM: sending an instance state PUT request to the propolis-server process for the Stopped state, or forcibly terminating the propolis-server and tearing down the zone. When a request to stop an instance is sent to the sled-agent, it uses the first mechanism, in which Propolis is politely asked to stop the instance --- which I'll refer to as "graceful shutdown". The forceful termination path is used when asked to unregister an instance whose VMM has not started up yet, when encountering an unrecoverable VMM error, or when killing an instance that was using an expunged disk. Currently, these two paths don't really overlap: when Nexus asks a sled-agent to stop an instance, all it will do is politely ask Propolis to stop the instance gracefully, and it will only fall back to violently shooting the zone in the face if Propolis returns the error indicating it never knew about that instance in the first place.

This means that, should a VMM get stuck while shutting down the instance, stopping it will never complete successfully, and the Propolis zone won't get cleaned up. This can happen due to e.g. a Crucible activation that will never complete. Thus, the sled-agent should attempt to violently terminate a Propolis zone when a graceful shutdown of the VMM fails to complete in a timely manner.

This commit introduces a timeout for the graceful shutdown process. Now, when we send a PUT request to Propolis with the Stopped instance state, the sled-agent will start a 10-minute timer. If no update from Propolis that indicates the instance has transitioned to Stopped is received before the timer completes, the sled-agent will proceed with the forceful termination of the Propolis zone.
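The shape of that fallback can be sketched in a few lines of std-only Rust. This is a deliberately minimal model, not the actual sled-agent implementation (which is async and tracks many kinds of VMM updates); the names `stop_vmm` and `VmmState` are invented for illustration, and the update stream is collapsed into a single channel.

```rust
use std::sync::mpsc;
use std::time::Duration;

// Hypothetical stand-in for the state updates sled-agent receives from
// Propolis; the real types live in sled-agent and propolis-client.
enum VmmState {
    Stopped,
}

// Wait for the VMM to report `Stopped`; if no such update arrives before
// the grace period elapses, fall back to forcefully terminating the zone.
fn stop_vmm(updates: mpsc::Receiver<VmmState>, grace: Duration) -> &'static str {
    match updates.recv_timeout(grace) {
        Ok(VmmState::Stopped) => "graceful",
        // Timed out (or the channel closed): shoot the zone in the face.
        Err(_) => "forceful",
    }
}

fn main() {
    // Simulate a stuck Propolis: the sender never reports `Stopped`.
    let (_tx, rx) = mpsc::channel::<VmmState>();
    println!("{}", stop_vmm(rx, Duration::from_millis(50)));
}
```

With a wedged sender as in `main`, `recv_timeout` returns an error once the grace period elapses and the forceful path is taken; in the real code the grace period is the 10-minute timer.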

Fixes #4004.
Closes #6795

@smklein (Collaborator) left a comment

Looks good to me, but as you said on the hypervisor channel, it would be nice to get a test for this before merging, if we can. This seems like the sorta feature that's nice-to-have, but which can break easily.

Comment on lines +635 to +639
// Only start the stop timeout if there
// isn't one already, so that additional
// requests to stop coming in don't reset
// the clock.
@smklein (Collaborator):
It does seem that we'd call put_state again in this case though -- is that desirable?

(a valid answer may be: "Sure!" - I'm just confirming)

@hawkw (Member, Author):
🤷‍♀️ we would call it twice before this change, so I didn't mess with it. I don't see any obvious reason why it would be bad to do it again, and maybe it's useful for e.g. a higher-level component retrying?

@gjcolombo (Contributor):
FWIW my general philosophy here has been to have sled agent pass all state change requests right through to Propolis and let Propolis decide how to deal with any duplicates. (In this case it should just decide the second stop request can be ignored because it's already put one on its queue.)

@hawkw (Member, Author):
Yeah, that makes sense to me. I thought not resetting the timeout was worthwhile, though: you can imagine a scenario where an instance fails to stop, and a frustrated user or retrying client keeps sending more stop requests, each of which resets the grace period and stops the stuck VMM from being killed.
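The arm-once behavior described here can be sketched with an `Option<Instant>` deadline. This is an illustrative model only; `StopTimer` and its methods are invented names, not the sled-agent's actual types.

```rust
use std::time::{Duration, Instant};

// Sketch of the "arm once" rule: the deadline is set on the first stop
// request and left alone on repeats, so retries cannot keep pushing
// forceful termination further into the future.
struct StopTimer {
    deadline: Option<Instant>,
}

impl StopTimer {
    fn new() -> Self {
        StopTimer { deadline: None }
    }

    // Called on every incoming stop request; only the first arms the timer.
    fn on_stop_request(&mut self, grace: Duration) {
        if self.deadline.is_none() {
            self.deadline = Some(Instant::now() + grace);
        }
    }

    fn expired(&self) -> bool {
        self.deadline.map_or(false, |d| Instant::now() >= d)
    }
}

fn main() {
    let mut timer = StopTimer::new();
    timer.on_stop_request(Duration::from_secs(600)); // arms the 10-minute clock
    timer.on_stop_request(Duration::from_secs(600)); // no-op: already armed
    println!("expired: {}", timer.expired());
}
```

The stop request itself can still be forwarded to Propolis every time; only the deadline is immune to repeats.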

@gjcolombo (Contributor):
Oh, I definitely agree--once the sled agent finds out about the control plane's intent to stop the instance, it should arm the timer and not reset it if the same intent is expressed again. Mostly I just wanted to provide a bit of color about calling Propolis twice; the Propolis state machine is designed to handle that kind of thing precisely so that sled agent doesn't have to worry about it.

@hawkw (Member, Author):
Yup, thanks for that! I figured we were on the same page about the timer but felt like it was worth clarifying further.

@hawkw (Member, Author) commented Feb 14, 2025

@smklein:

as you said on the hypervisor channel, it would be nice to get a test for this before merging, if we can. This seems like the sorta feature that's nice-to-have, but which can break easily.

Yeah — this is mostly intended to protect against bugs in other components. So, in an ideal world, it's not going to be exercised that much unless we have a test for it.

@gjcolombo (Contributor) left a comment

Thanks for putting this together!

@hawkw force-pushed the eliza/give-instance-manager-an-even-bigger-gun branch from 3e536d9 to 452dda0 (February 21, 2025 19:44)
@hawkw (Member, Author) commented Feb 21, 2025

Okay, I've added a test for this which I'm pretty satisfied with. This depends on my Propolis PR oxidecomputer/propolis#869, which adds the ability to single-step the mock server's instance state machine. This is necessary to simulate a scenario in which Propolis gets stuck while shutting down. If @smklein or @gjcolombo are interested in reviewing the test, be my guest --- but I'm going to leave this PR in draft until the Propolis PR merges so that we can point our Git dep at propolis master rather than the PR.

hawkw added a commit to oxidecomputer/propolis that referenced this pull request Feb 21, 2025
For certain test scenarios, the `propolis-mock-server` ought to have a mechanism
for manual control of the mocked instance's progress through the state machine.
In particular, this is necessary for testing changes like
oxidecomputer/omicron#7548, which adds a timeout tracked by the sled-agent when
an instance is stopped. If Propolis is stuck and cannot progress, the sled-agent
will forcefully terminate it after that timeout...but testing this requires a
way to make the mock Propolis pretend to be stuck.

This commit adds the following new endpoints to the mock server which are not
part of the real `propolis-server` API:

- `PUT /mock/mode`: sets a mock mode, either `Run` (the normal behavior), or
  `SingleStep`, where state transitions only occur when asked for by the test.
- `GET /mock/mode`: returns whether or not we are single-steppy
- `PUT /mock/step`: advances to the next queued generation
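The single-step mode those endpoints control can be modeled in a few lines. This is a sketch of the behavior only; `MockMode`, `MockVmm`, and their methods are invented names, while the real mock server wires the equivalent logic to the HTTP endpoints above.

```rust
use std::collections::VecDeque;

#[derive(PartialEq)]
enum MockMode {
    Run,        // normal behavior: transitions drain immediately
    SingleStep, // transitions wait until the test steps them
}

struct MockVmm {
    mode: MockMode,
    queued: VecDeque<&'static str>, // pending state transitions
    current: &'static str,
}

impl MockVmm {
    // Queue a transition; in `Run` mode it (and anything behind it)
    // takes effect at once, in `SingleStep` mode it just sits there.
    fn enqueue(&mut self, state: &'static str) {
        self.queued.push_back(state);
        if self.mode == MockMode::Run {
            while self.step() {}
        }
    }

    // Advance one queued transition; false if nothing is queued.
    fn step(&mut self) -> bool {
        match self.queued.pop_front() {
            Some(s) => {
                self.current = s;
                true
            }
            None => false,
        }
    }
}

fn main() {
    let mut vmm = MockVmm {
        mode: MockMode::SingleStep,
        queued: VecDeque::new(),
        current: "Running",
    };
    // Wedged mid-shutdown until the test explicitly steps the machine:
    vmm.enqueue("Stopping");
    println!("before step: {}", vmm.current);
    vmm.step();
    println!("after step: {}", vmm.current);
}
```

A test that wants a "stuck" Propolis puts the mock in `SingleStep`, issues the stop, and simply never calls step, letting the sled-agent's grace timer fire.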

Testing: I've written [a test in Omicron][1] that uses this, and I can make
the mock propolis get wedged in the correct place. So that's nice.

Closes #858 

[1]: https://github.com/oxidecomputer/omicron/compare/eliza/single-steppy?expand=1
@hawkw hawkw marked this pull request as ready for review February 21, 2025 20:14
@hawkw (Member, Author) commented Feb 21, 2025

Okay, as we are now depending on the mock-server from oxidecomputer/propolis@2652487 (the master branch), I'm moving this out of draft. If y'all don't want to review the test, I'll auto-merge --- let me know?

Co-authored-by: Greg Colombo <greg@oxidecomputer.com>
@hawkw hawkw enabled auto-merge (squash) February 21, 2025 20:26
@hawkw (Member, Author) commented Feb 24, 2025

I'm guessing this helios / deploy failure that couldn't start Nexus is a flake but will look closer.

@hawkw (Member, Author) commented Feb 24, 2025

> I'm guessing this helios / deploy failure that couldn't start Nexus is a flake but will look closer.

Looks like sled-agent was still waiting for NTP sync.

@hawkw hawkw disabled auto-merge February 24, 2025 18:15
@hawkw hawkw enabled auto-merge (squash) February 24, 2025 18:15
@hawkw hawkw merged commit 459fca4 into main Feb 24, 2025
18 checks passed
@hawkw hawkw deleted the eliza/give-instance-manager-an-even-bigger-gun branch February 24, 2025 19:59


Development

Successfully merging this pull request may close these issues.

want sled-agent timeout on completion of requests to stop a Propolis VM

3 participants