should `instance-stop` *really* move instances to failed when they are discovered to already be gone? #6809

The `PUT /instances/{instance}/stop` API will send a request to sled-agent asking it to terminate a running instance. If sled-agent responds with something indicating that it actually didn't know about that instance in the first place, Nexus will then transition it to `Failed`: https://github.com/oxidecomputer/omicron/blob/0640bb277df110f3e740464bf4bacdf2bd24c897/nexus/src/app/instance.rs#L755-L775

This will not happen when the instance is gone because another, concurrent attempt to stop it has succeeded: the racing stop attempt will have advanced the VMM's generation number whilst moving it to `Destroyed`, so we won't mark it as `Failed`. However, in the event of a sled-agent crash, we _may_ encounter an already-gone VMM here, and may move it to `Failed`.
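
To make the generation-number argument concrete, here is a minimal toy model of a generation-guarded state write. It is only a sketch: `VmmRecord`, `try_update`, and the field names are hypothetical stand-ins, not the actual Nexus types or database queries.

```rust
/// Hypothetical, simplified VMM states (not the real Nexus enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmmState {
    Running,
    Destroyed,
    Failed,
}

/// Hypothetical record pairing a state with a generation number.
#[derive(Debug)]
struct VmmRecord {
    state: VmmState,
    generation: u64,
}

impl VmmRecord {
    /// Apply `new_state` only if the record is still at the generation the
    /// caller observed. Returns `true` if the write was applied.
    fn try_update(&mut self, observed_generation: u64, new_state: VmmState) -> bool {
        if self.generation != observed_generation {
            // Someone else advanced the record first; this write is stale.
            return false;
        }
        self.state = new_state;
        self.generation += 1;
        true
    }
}
```

In this model, if two stop attempts both observe generation 1 and the first one moves the record to `Destroyed`, the second attempt's write of `Failed` is rejected because the generation has already advanced. A sled-agent crash involves no such racing writer, so the `Failed` write would go through.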

This seems a bit wacky to me, since `Failed` instances (which is what the instance will eventually become as a result of its VMM being marked `Failed`) are eligible to be auto-restarted, while `Stopped` instances are not --- because the user actually _wanted_ them to be stopped. And, in this case, the user is expressing intent to have an instance stop running, and we just happened to discover that we had already anticipated their desire to stop it and went ahead and stopped it for them before they even asked us to. Admittedly, we weren't supposed to have done that! But, in this case, the requested state is "instance is not running", and it's not running, so it seems a bit unfortunate to go "oh no, i was supposed to make the instance not be running, and when i tried to do that, i discovered that it was not running because we made a mistake, so now i'm actually going to...make it be running again?"

Imagine a scenario where a user goes to stop an instance so that they can change its boot disk or something, and while doing so, we discover that sled-agent has crashed and the instance isn't there. Moving it to `Failed` results in the instance being restarted, so now the user has to stop the instance a _second_ time before they can actually do what they were trying to do originally.

Maybe we should just always move the instance to `Stopped` when such an error is encountered by an instance-stop attempt. Obviously we would still go to `Failed` when we discover the VMM is gone while attempting to do other things to the instance.

	if let Err(e) = self
	    .instance_request_state(
	        opctx,
	        state.instance(),
	        state.vmm(),
	        InstanceStateChangeRequest::Stop,
	    )
	    .await
	{
	    if let (InstanceStateChangeError::SledAgent(inner), Some(vmm)) =
	        (&e, state.vmm())
	    {
	        if inner.vmm_gone() {
	            let _ = self
	                .mark_vmm_failed(opctx, authz_instance, vmm, inner)
	                .await;
	        }
	    }

	    return Err(e);
	}
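
The proposed behavior could be sketched roughly like this. It is a toy model, not a patch against Nexus: `InstanceState`, `Request`, `state_after_vmm_gone`, and `eligible_for_auto_restart` are all hypothetical names, and the only rules it encodes are the ones described in this issue (`Failed` is eligible for auto-restart, `Stopped` is not).

```rust
/// Hypothetical, simplified instance states (not the real Nexus enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Stopped,
    Failed,
}

/// What the caller was trying to do when sled-agent reported the VMM as gone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Request {
    /// The user asked for the instance to stop.
    Stop,
    /// Any other operation on the instance (illustrative catch-all).
    Other,
}

/// Decide where the instance should end up once we learn its VMM is gone.
fn state_after_vmm_gone(request: Request) -> InstanceState {
    match request {
        // The user wanted "not running", and the instance is not running.
        Request::Stop => InstanceState::Stopped,
        // For everything else, a vanished VMM is still a failure.
        Request::Other => InstanceState::Failed,
    }
}

/// Per the issue text: `Failed` instances may be auto-restarted,
/// `Stopped` instances may not.
fn eligible_for_auto_restart(state: InstanceState) -> bool {
    matches!(state, InstanceState::Failed)
}
```

With this rule, a stop request that finds the VMM already gone leaves the instance in a state that is not auto-restarted, so the user in the boot-disk scenario would not need to stop it a second time.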
