[Feature]: Detect orphan instances #3453

### Problem

In rare cases, dstack may lose track of a cloud instance and fail to terminate it. This can happen, for example, when:

- dstack crashes (OOM, `kill -9`, etc.) after calling `Compute.create_instance` or `Compute.run_job`, but before committing the provisioning data to the database.
- The cloud provider API or SDK reports an instance creation error, even though the instance was actually created.
- dstack gives up attempting to terminate an instance (for example, due to provider API downtime).
- There is a bug in dstack.

Although these situations are uncommon, their impact can be significant due to unexpected cloud charges.

### Solution

Periodically retrieve the list of active instances from each backend, compare it with the list of instances tracked by dstack, and terminate any unexpected instances and/or notify the administrator.

### Workaround

### Implementation notes

#### Identifying orphan instances

Many users share the same cloud accounts across multiple dstack projects, multiple dstack servers, or even use the accounts both with and without dstack. Therefore, it is critical to accurately identify truly orphaned instances, rather than terminating every instance that does not belong to the current server or project.

One possible approach is to encode a globally unique instance reference string in the instance name or labels. dstack must commit this reference to the database **before** calling `Compute.create_instance` or `Compute.run_job`, to avoid losing it in the event of a crash.
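A minimal sketch of this idea, assuming the reference is a random UUID appended to the instance name. The `dstack-ref` prefix, the helper names, and the dict standing in for the database are all hypothetical and only illustrate the ordering: the reference is committed first, and the backend is called second.

```python
import re
import uuid
from typing import Callable, Optional

# Hypothetical naming scheme: "<base name>-dstack-ref-<32 hex chars>".
REF_PREFIX = "dstack-ref"
_REF_RE = re.compile(rf"{REF_PREFIX}-([0-9a-f]{{32}})$")


def new_instance_ref() -> str:
    """Generate a globally unique reference (a random UUID4 in hex form)."""
    return uuid.uuid4().hex


def encode_ref(base_name: str, ref: str) -> str:
    """Append the reference to an instance name."""
    return f"{base_name}-{REF_PREFIX}-{ref}"


def extract_ref(instance_name: str) -> Optional[str]:
    """Return the reference, or None if the name does not follow the scheme."""
    match = _REF_RE.search(instance_name)
    return match.group(1) if match else None


def provision(db: dict, base_name: str, create_instance: Callable[[str], str]) -> str:
    """Commit the reference, then create the instance, then store its cloud ID."""
    ref = new_instance_ref()
    # Stands in for a committed DB transaction (assumption: a plain dict).
    db[ref] = {"status": "provisioning", "cloud_instance_id": None}
    # If the process dies during or after this call, the reference is already stored.
    cloud_id = create_instance(encode_ref(base_name, ref))
    db[ref]["cloud_instance_id"] = cloud_id
    return ref
```

Cloud providers often limit the length and character set of names and labels, so a real encoding would likely need to be adapted per backend.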

Using this approach, the orphan detection algorithm would look roughly as follows:

1. Request the list of active instances from a backend.
2. For each instance:
   1. Extract the instance reference string from the instance name or labels.  
      - If extraction fails, ignore the instance — it does not belong to dstack.
   2. Look up an instance record in the database using the reference, scoped to the backend’s project.  
      - If no record is found, ignore the instance — it belongs to a different dstack server or project.
   3. Verify that the instance record’s status implies that a cloud instance should exist.  
      - If it does not, the cloud instance is orphaned.
   4. Verify that the cloud instance ID matches the ID stored in the provisioning data.  
      - If it does not, the cloud instance is orphaned.
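The steps above could be sketched as follows. This is an illustration under stated assumptions, not dstack's implementation: `CloudInstance`, `InstanceRecord`, `LIVE_STATUSES`, and `find_orphans` are hypothetical names, the status values are placeholders, and `records` stands in for a database lookup already scoped to the backend's project.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Mapping, Optional

# Hypothetical statuses for which a cloud instance is expected to exist.
LIVE_STATUSES = {"provisioning", "idle", "busy", "terminating"}


@dataclass
class CloudInstance:
    instance_id: str  # ID assigned by the cloud provider
    name: str         # instance name, possibly carrying a reference


@dataclass
class InstanceRecord:
    status: str
    cloud_instance_id: Optional[str]  # None until provisioning data is committed


def find_orphans(
    cloud_instances: Iterable[CloudInstance],
    records: Mapping[str, InstanceRecord],
    extract_ref: Callable[[str], Optional[str]],
) -> List[CloudInstance]:
    """Return cloud instances that the tracked records do not account for."""
    orphans = []
    for inst in cloud_instances:
        ref = extract_ref(inst.name)
        if ref is None:
            continue  # step 2.1: not created by dstack
        record = records.get(ref)
        if record is None:
            continue  # step 2.2: belongs to another server or project
        if record.status not in LIVE_STATUSES:
            orphans.append(inst)  # step 2.3: no cloud instance should exist
        elif record.cloud_instance_id != inst.instance_id:
            orphans.append(inst)  # step 2.4: ID differs from provisioning data
    return orphans
```

Note that step 2.4 as written also flags an instance whose record has no provisioning data yet. That is the intended result after a crash, but a real implementation would presumably need a grace period or similar guard so that an instance still being provisioned is not treated as orphaned.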

Some backends (for example, `hotaisle`) do not allow setting instance names or labels. Such backends may need to opt out of orphan detection or implement it using an alternative mechanism.

#### Other orphan resources

Some backends create additional resources per instance. For example, the `nebius` backend creates a boot disk for each instance. These related resources can also become orphaned and can be detected using the same reference-based mechanism.

For instance, the `nebius` backend can encode the instance reference in the boot disk name. If a boot disk is found without a corresponding instance, the backend can report it as if the instance were still present, so that the same detection algorithm handles it.
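One way to picture this is a backend listing that merges instances and leftover disks into a single list keyed by reference. The sketch below is hypothetical throughout: `ListedResource`, `list_tracked_resources`, and the `extract_ref` callable are illustrative names and do not reflect the `nebius` backend's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class ListedResource:
    ref: str   # instance reference extracted from the resource name
    kind: str  # "instance" or "disk"
    name: str  # resource name as reported by the cloud


def list_tracked_resources(
    instance_names: Iterable[str],
    disk_names: Iterable[str],
    extract_ref: Callable[[str], Optional[str]],
) -> List[ListedResource]:
    """List instances, plus boot disks whose instance no longer exists."""
    listed = []
    instance_refs = set()
    for name in instance_names:
        ref = extract_ref(name)
        if ref is not None:
            instance_refs.add(ref)
            listed.append(ListedResource(ref, "instance", name))
    for name in disk_names:
        ref = extract_ref(name)
        # A disk is reported only when no instance carries the same reference.
        if ref is not None and ref not in instance_refs:
            listed.append(ListedResource(ref, "disk", name))
    return listed
```

Each returned entry can then go through the same database lookup and status check as a regular instance.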

#### Related issues

Listing backend instances may also help address issue #2732.

### Would you like to help us implement this feature by sending a PR?

Yes
