[sled-agent] VMM graceful shutdown timeout #7548
smklein
left a comment
Looks good to me, but as you said on the hypervisor channel, it would be nice to get a test for this before merging, if we can. This seems like the sorta feature that's nice-to-have, but which can break easily.
// Only start the stop timeout if there
// isn't one already, so that additional
// requests to stop coming in don't reset
// the clock.
It does seem that we'd call put_state again in this case though -- is that desirable?
(a valid answer may be: "Sure!" - I'm just confirming)
🤷‍♀️ we would call it twice before this change, so I didn't mess with it. I don't see any obvious reason why it would be bad to do it again, and maybe it's useful for e.g. a higher-level component retrying?
FWIW my general philosophy here has been to have sled agent pass all state change requests right through to Propolis and let Propolis decide how to deal with any duplicates. (In this case it should just decide the second stop request can be ignored because it's already put one on its queue.)
Yeah, that makes sense to me. I thought not resetting the timeout was worthwhile, though: you can imagine a scenario where an instance fails to stop, and a frustrated user or retrying client keeps sending more stop requests, each of which resets the grace period and stops the stuck VMM from being killed.
Oh, I definitely agree--once the sled agent finds out about the control plane's intent to stop the instance, it should arm the timer and not reset it if the same intent is expressed again. Mostly I just wanted to provide a bit of color about calling Propolis twice; the Propolis state machine is designed to handle that kind of thing precisely so that sled agent doesn't have to worry about it.
Yup, thanks for that! I figured we were on the same page about the timer but felt like it was worth clarifying further.
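For readers skimming this thread, the arm-once behavior agreed on above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `StopTimeout`, `arm_if_unset`, `is_expired`, and `STOP_GRACE_PERIOD` are hypothetical names, and the 10-minute value is taken from the PR description.

```rust
use std::time::{Duration, Instant};

/// Hypothetical constant; the PR description mentions a 10-minute timer.
const STOP_GRACE_PERIOD: Duration = Duration::from_secs(10 * 60);

/// Hypothetical holder for the stop deadline (not the real `InstanceRunner` field).
#[derive(Debug, Default)]
struct StopTimeout {
    deadline: Option<Instant>,
}

impl StopTimeout {
    /// Arms the timer on the first stop request. Later stop requests leave
    /// the existing deadline untouched, so repeated requests cannot extend
    /// the grace period. Returns the deadline currently in effect.
    fn arm_if_unset(&mut self, now: Instant) -> Instant {
        *self.deadline.get_or_insert(now + STOP_GRACE_PERIOD)
    }

    /// True once a deadline has been armed and `now` has reached it.
    fn is_expired(&self, now: Instant) -> bool {
        self.deadline.map_or(false, |deadline| now >= deadline)
    }
}
```

The stop request itself would still be forwarded to Propolis every time, as discussed above; only the deadline is sticky.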
Yeah, this is mostly intended to protect against bugs in other components. So, in an ideal world, it's not going to be exercised that much unless we have a test for it.
gjcolombo
left a comment
Thanks for putting this together!
Presently, sled-agent's `InstanceRunner` has two mechanisms for shutting down a VMM: sending an instance state PUT request to the `propolis-server` process for the `Stopped` state, or forcibly terminating the `propolis-server` and tearing down the zone. At present, when a request to stop an instance is sent to the sled-agent, it uses the first mechanism, where Propolis is politely asked to stop the instance --- which I'll refer to as "graceful shutdown". The forceful termination path is used when asked to unregister an instance where the VMM has not started up yet, when encountering an unrecoverable VMM error, or when killing an instance that was making use of an expunged disk.

Currently, these two paths don't really overlap: when Nexus asks a sled-agent to stop an instance, all it will do is politely ask Propolis to please stop the instance gracefully, and will only fall back to violently shooting the zone in the face if Propolis returns the error that indicates it never knew about that instance in the first place. This means that, should a VMM get *stuck* while shutting down the instance, stopping it will never complete successfully, and the Propolis zone won't get cleaned up. This can happen due to e.g. [a Crucible activation that will never complete][1]. Thus, the sled-agent should attempt to violently terminate a Propolis zone when a graceful shutdown of the VMM fails to complete in a timely manner.

This commit introduces a timeout for the graceful shutdown process. Now, when we send a PUT request to Propolis with the `Stopped` instance state, the sled-agent will start a 10-minute timer. If no update from Propolis that indicates the instance has transitioned to `Stopped` is received before the timer completes, the sled-agent will proceed with the forceful termination of the Propolis zone.

Fixes #4004.

[1]: #4004 (comment)
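As a rough illustration of the decision described in this commit message (not the actual sled-agent code), the choice between continuing to wait, finishing a graceful stop, and forcefully terminating could be modeled as a pure function. `VmmState`, `Action`, and `next_action` are hypothetical names invented for this sketch, and the real `InstanceRunner` is presumably asynchronous rather than polled like this.

```rust
use std::time::Instant;

/// Simplified, hypothetical view of the state Propolis last reported.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmmState {
    Running,
    Stopping,
    Stopped,
}

/// Hypothetical outcomes for the sled-agent's bookkeeping.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    /// No deadline has passed and the VMM has not stopped yet.
    KeepWaiting,
    /// Propolis reported `Stopped` on its own; graceful shutdown succeeded.
    FinishStop,
    /// The grace period elapsed without a `Stopped` update.
    ForceTerminate,
}

/// Decides what to do given the last observed state, the deadline armed when
/// the `Stopped` PUT was sent (if any), and the current time.
fn next_action(state: VmmState, stop_deadline: Option<Instant>, now: Instant) -> Action {
    match (state, stop_deadline) {
        // A reported `Stopped` state wins even if the deadline has also passed.
        (VmmState::Stopped, _) => Action::FinishStop,
        (_, Some(deadline)) if now >= deadline => Action::ForceTerminate,
        _ => Action::KeepWaiting,
    }
}
```

Checking the `Stopped` state before the deadline is a design choice made for this sketch: a VMM that stops right as the timer fires is treated as a graceful stop.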
This depends on oxidecomputer/propolis#869
3e536d9 to 452dda0
Okay, I've added a test for this which I'm pretty satisfied with. This depends on my Propolis PR oxidecomputer/propolis#869, which adds the ability to single-step the mock server's instance state machine. This is necessary to simulate a scenario in which Propolis gets stuck while shutting down. If @smklein or @gjcolombo are interested in reviewing the test, be my guest --- but I'm going to leave this PR in draft until the Propolis PR merges so that we can point our Git dep at propolis
For certain test scenarios, the `propolis-mock-server` ought to have a mechanism for manual control of the mocked instance's progress through the state machine. In particular, this is necessary for testing changes like oxidecomputer/omicron#7548, which adds a timeout tracked by the sled-agent when an instance is stopped. If Propolis is stuck and cannot progress, the sled-agent will forcefully terminate it after that timeout...but testing this requires a way to make the mock Propolis pretend to be stuck.

This commit adds the following new endpoints to the mock server which are not part of the real `propolis-server` API:

- `PUT /mock/mode`: sets a mock mode, either `Run` (the normal behavior), or `SingleStep`, where state transitions only occur when asked for by the test.
- `GET /mock/mode`: returns whether or not we are single-steppy
- `PUT /mock/step`: advances to the next queued generation

Testing: I've written [a test in Omicron][1] that uses this, and I can make the mock propolis get wedged in the correct place. So that's nice.

Closes #858

[1]: https://github.com/oxidecomputer/omicron/compare/eliza/single-steppy?expand=1
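To make the `Run` versus `SingleStep` distinction concrete, here is a minimal in-memory sketch of a mock state machine with a queue of pending transitions. It is not the `propolis-mock-server` implementation: `MockInstance`, `State`, and every method name are hypothetical, the HTTP layer is omitted, and draining the queue when switching back to `Run` is an assumption made for this sketch.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MockMode {
    /// Queued transitions are applied immediately.
    Run,
    /// Queued transitions are applied only when the test asks for a step.
    SingleStep,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Running,
    Stopping,
    Stopped,
}

/// Hypothetical mock instance; starts out `Running` in `Run` mode.
struct MockInstance {
    mode: MockMode,
    current: State,
    queued: VecDeque<State>,
}

impl MockInstance {
    fn new() -> Self {
        Self { mode: MockMode::Run, current: State::Running, queued: VecDeque::new() }
    }

    /// Loosely corresponds to `PUT /mock/mode`.
    fn set_mode(&mut self, mode: MockMode) {
        self.mode = mode;
        // Assumption: returning to `Run` applies anything still queued.
        if mode == MockMode::Run {
            self.drain();
        }
    }

    /// Queues the stop transitions; a duplicate stop request is ignored.
    fn request_stop(&mut self) {
        if self.current == State::Stopped || self.queued.contains(&State::Stopped) {
            return;
        }
        self.queued.extend([State::Stopping, State::Stopped]);
        if self.mode == MockMode::Run {
            self.drain();
        }
    }

    /// Loosely corresponds to `PUT /mock/step`: applies one queued transition.
    /// Returns `false` when nothing was queued.
    fn step(&mut self) -> bool {
        match self.queued.pop_front() {
            Some(next) => {
                self.current = next;
                true
            }
            None => false,
        }
    }

    fn drain(&mut self) {
        while self.step() {}
    }
}
```

A test that wants a "stuck" Propolis would switch to `SingleStep`, request a stop, and then simply never call `step`, leaving the instance in its pre-stop state until the sled-agent's timeout fires.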
this changes the git dep to oxidecomputer/propolis@2652487
Okay, as we are now depending on the mock-server from oxidecomputer/propolis@2652487 (the
Co-authored-by: Greg Colombo <greg@oxidecomputer.com>
I'm guessing this
Looks like sled-agent was still waiting for NTP sync.
Closes #6795