Description
Problem
In rare cases, dstack may lose track of a cloud instance and fail to terminate it. This can happen, for example, when:
- dstack crashes (OOM, kill -9, etc.) after calling Compute.create_instance or Compute.run_job, but before committing the provisioning data to the database.
- The cloud provider API or SDK reports an instance creation error, even though the instance is actually created successfully.
- dstack gives up attempting to terminate an instance (for example, due to provider API downtime).
- There is a bug in dstack.
Although these situations are uncommon, their impact can be significant due to unexpected cloud charges.
Solution
Periodically retrieve the list of active instances from each backend, compare it with the list of instances tracked by dstack, and terminate any unexpected instances and/or notify the administrator.
Workaround
Implementation notes
Identifying orphan instances
Many users share the same cloud accounts across multiple dstack projects, multiple dstack servers, or even use the accounts both with and without dstack. Therefore, it is critical to accurately identify truly orphaned instances, rather than terminating every instance that does not belong to the current server or project.
One possible approach is to encode a globally unique instance reference string in the instance name or labels. dstack must commit this reference to the database before calling Compute.create_instance or Compute.run_job, to avoid losing it in the event of a crash.
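As a rough illustration of the reference idea, the sketch below generates a unique reference and round-trips it through an instance name. The `dstack-ref` prefix, the name format, and all function names are hypothetical and are not part of dstack; a real implementation would also have to respect each provider's naming limits.

```python
import re
import uuid
from typing import Optional

# Hypothetical naming scheme: "<prefix>-<32 hex chars>". Not dstack's actual format.
REF_PREFIX = "dstack-ref"
_REF_PATTERN = re.compile(rf"^{re.escape(REF_PREFIX)}-([0-9a-f]{{32}})$")


def new_instance_ref() -> str:
    """Generate a globally unique reference (a random UUID as 32 hex chars)."""
    return uuid.uuid4().hex


def encode_instance_name(ref: str) -> str:
    """Embed the reference in the name passed to the cloud provider."""
    return f"{REF_PREFIX}-{ref}"


def extract_instance_ref(name: str) -> Optional[str]:
    """Return the reference, or None if the name does not follow the scheme."""
    match = _REF_PATTERN.match(name)
    return match.group(1) if match else None
```

In this sketch, the reference returned by `new_instance_ref()` would be committed to the database first, and only then passed to the provider via `encode_instance_name()`.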
Using this approach, the orphan detection algorithm would look roughly as follows:
- Request the list of active instances from a backend.
- For each instance:
  - Extract the instance reference string from the instance name or labels.
    - If extraction fails, ignore the instance: it does not belong to dstack.
  - Look up an instance record in the database using the reference, scoped to the backend’s project.
    - If no record is found, ignore the instance: it belongs to a different dstack server or project.
  - Verify that the instance record’s status implies that a cloud instance should exist.
    - If it does not, the cloud instance is orphaned.
  - Verify that the cloud instance ID matches the ID stored in the provisioning data.
    - If it does not, the cloud instance is orphaned.
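The steps above could be sketched roughly as follows. This is a minimal illustration, not dstack code: `CloudInstance`, `InstanceRecord`, the status names, and the `records_by_ref` mapping (standing in for a database lookup scoped to the backend's project) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class CloudInstance:
    """An instance as reported by the backend (hypothetical shape)."""
    instance_id: str
    name: str


@dataclass
class InstanceRecord:
    """An instance record as stored by the server (hypothetical shape)."""
    status: str
    provisioned_instance_id: Optional[str]


# Assumed set of statuses for which a cloud instance is expected to exist.
STATUSES_WITH_CLOUD_INSTANCE = {"provisioning", "idle", "busy", "terminating"}


def find_orphans(
    cloud_instances: Iterable[CloudInstance],
    records_by_ref: Dict[str, InstanceRecord],
    extract_ref: Callable[[str], Optional[str]],
) -> List[CloudInstance]:
    orphans = []
    for instance in cloud_instances:
        ref = extract_ref(instance.name)
        if ref is None:
            continue  # No reference: not created by dstack.
        record = records_by_ref.get(ref)
        if record is None:
            continue  # Unknown reference: another server or project.
        if record.status not in STATUSES_WITH_CLOUD_INSTANCE:
            orphans.append(instance)  # Record says no instance should exist.
        elif record.provisioned_instance_id != instance.instance_id:
            orphans.append(instance)  # Record points at a different instance.
    return orphans
```

Note that this sketch treats a record with no stored instance ID as a mismatch. Whether an instance that is still being provisioned should be handled differently is left open here.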
Some backends (for example, hotaisle) do not allow setting instance names or labels. Such backends may need to opt out of orphan detection or implement it using an alternative mechanism.
Other orphan resources
Some backends create additional resources per instance. For example, the nebius backend creates a boot disk for each instance. These related resources can also become orphaned and can be detected using the same reference-based mechanism.
For instance, the nebius backend can encode the instance reference in the boot disk name. If a boot disk is found without a corresponding instance, it can be reported as if the instance were still present.
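A minimal sketch of the disk check, assuming the same reference can be extracted from both disk names and instance names. The function and parameter names are hypothetical, and the real nebius backend may expose disks differently.

```python
from typing import Callable, Iterable, List, Optional


def find_disks_without_instance(
    disk_names: Iterable[str],
    instance_names: Iterable[str],
    extract_ref: Callable[[str], Optional[str]],
) -> List[str]:
    """Return names of disks that carry a reference but have no matching instance."""
    live_refs = set()
    for name in instance_names:
        ref = extract_ref(name)
        if ref is not None:
            live_refs.add(ref)

    leftover = []
    for disk_name in disk_names:
        ref = extract_ref(disk_name)
        # Disks without a reference were not created by dstack and are skipped.
        if ref is not None and ref not in live_refs:
            leftover.append(disk_name)
    return leftover
```

Disks returned here could then go through the same database checks as instances, so that a disk owned by another server or project is still ignored.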
Related issues
Listing backend instances may also help address issue #2732.
Would you like to help us implement this feature by sending a PR?
Yes