should instance-stop *really* move instances to failed when they are discovered to already be gone? #6809

@hawkw

The PUT /instances/{instance}/stop API will send a request to sled-agent asking it to terminate a running instance. If sled-agent responds with something indicating that it actually didn't know about that instance in the first place, Nexus will then transition it to Failed:

if let Err(e) = self
    .instance_request_state(
        opctx,
        state.instance(),
        state.vmm(),
        InstanceStateChangeRequest::Stop,
    )
    .await
{
    if let (InstanceStateChangeError::SledAgent(inner), Some(vmm)) =
        (&e, state.vmm())
    {
        if inner.vmm_gone() {
            let _ = self
                .mark_vmm_failed(opctx, authz_instance, vmm, inner)
                .await;
        }
    }
    return Err(e);
}

This will not happen when the instance is gone because another concurrent attempt to stop it has succeeded: the racing stop attempt will have advanced the VMM's generation number whilst moving it to Destroyed, so we won't mark it as Failed. However, in the event of a sled-agent crash, we may encounter an already-gone VMM here and move it to Failed.
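To make the generation-number argument concrete, here is a minimal sketch of a generation-guarded write. `VmmState`, `VmmRecord`, and `update_if_generation_matches` are hypothetical names invented for illustration, not the real Nexus types or queries, and I'm assuming the guard amounts to "only write if the generation is still the one you observed":

```rust
/// Hypothetical, simplified VMM states; not the real Nexus types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmmState {
    Running,
    Destroyed,
    Failed,
}

/// Hypothetical stand-in for a VMM record with a generation number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct VmmRecord {
    state: VmmState,
    generation: u64,
}

/// Writes `new_state` only if the record is still at the generation the
/// caller observed, bumping the generation on success.
/// Returns `true` if the write took effect.
fn update_if_generation_matches(
    record: &mut VmmRecord,
    observed_generation: u64,
    new_state: VmmState,
) -> bool {
    if record.generation != observed_generation {
        // Someone else already advanced this record; leave it alone.
        return false;
    }
    record.state = new_state;
    record.generation += 1;
    true
}

fn main() {
    let mut vmm = VmmRecord { state: VmmState::Running, generation: 1 };

    // Two racing stop attempts both observed generation 1.
    let observed = vmm.generation;

    // The winner moves the VMM to Destroyed and advances the generation.
    assert!(update_if_generation_matches(&mut vmm, observed, VmmState::Destroyed));

    // The loser then sees "VMM gone" and tries to mark it Failed, but its
    // observed generation is stale, so the write is rejected.
    assert!(!update_if_generation_matches(&mut vmm, observed, VmmState::Failed));
    assert_eq!(vmm.state, VmmState::Destroyed);
}
```

In the sled-agent crash case nothing has advanced the generation, so the equivalent of the second write would succeed, which is how the VMM ends up Failed.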

This seems a bit wacky to me, since Failed instances (which is what the instance will eventually become as a result of its VMM being marked Failed) are eligible to be auto-restarted, while Stopped instances are not, because the user actually wanted them to be stopped. And, in this case, the user is expressing intent to have an instance stop running, and we just happened to discover that we had already anticipated their desire to stop it and had gone ahead and stopped it for them before they even asked us to. Admittedly, we weren't supposed to have done that! But, in this case, the requested state is "instance is not running", and it's not running, so it seems a bit unfortunate to go "oh no, i was supposed to make the instance not be running, and when i tried to do that, i discovered that it was not running because we made a mistake, so now i'm actually going to...make it be running again?"

Imagine a scenario where a user goes to stop an instance so that they can change its boot disk or something, and while doing so, we discover that sled-agent has crashed and the instance isn't there. Moving it to Failed results in the instance being restarted, so now the user has to stop the instance a second time before they can actually do what they were trying to do originally.

Maybe we should just always move the instance to Stopped when an instance-stop attempt encounters such an error. Obviously, we would still go to Failed when attempting to do other things to the instance.
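Roughly, the policy I have in mind looks like the sketch below. Everything here (`Operation`, `Outcome`, `outcome_when_vmm_gone`) is a hypothetical name made up to illustrate the idea, not existing Nexus code, and auto-restart eligibility is simplified to "Failed yes, Stopped no" as described above:

```rust
/// Hypothetical operations Nexus might be attempting when it discovers
/// that sled-agent no longer knows about the VMM.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Operation {
    Stop,
    Reboot,
}

/// Hypothetical resulting instance states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    Stopped,
    Failed,
}

impl Outcome {
    /// Simplified: only Failed instances are candidates for auto-restart.
    fn auto_restart_eligible(self) -> bool {
        matches!(self, Outcome::Failed)
    }
}

/// Proposed policy: a missing VMM discovered during a stop request already
/// satisfies the user's intent, so it resolves to Stopped. Any other
/// operation still treats a missing VMM as a failure.
fn outcome_when_vmm_gone(op: Operation) -> Outcome {
    match op {
        Operation::Stop => Outcome::Stopped,
        Operation::Reboot => Outcome::Failed,
    }
}

fn main() {
    let on_stop = outcome_when_vmm_gone(Operation::Stop);
    assert_eq!(on_stop, Outcome::Stopped);
    assert!(!on_stop.auto_restart_eligible());

    let on_reboot = outcome_when_vmm_gone(Operation::Reboot);
    assert_eq!(on_reboot, Outcome::Failed);
    assert!(on_reboot.auto_restart_eligible());
}
```

The point is that the decision depends on what the caller was trying to do: with this in place, the user in the boot disk scenario would not have to stop the instance twice.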

Labels

nexus: Related to nexus