[Spot] Detect spot preemption via VPC API polling #535
Open
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
Description
Parent: #501
Design Doc: #523
Depends on: #530, #532
Summary
When IBM Cloud preempts a spot instance, it stops the VM with status reason stopped_by_preemption.
The interruption controller must detect this via VPC API polling, then:
- Mark the offering as unavailable
- Delete the stopped VM
- Delete the NodeClaim to trigger replacement
- Emit a Kubernetes event
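The first step, marking the offering unavailable, amounts to an expiring set keyed by capacity type, instance type, and zone. The following is a minimal sketch of such a cache, not the project's actual implementation: the `unavailableOfferings` type, its methods, `offeringKey`, and `runExample` are hypothetical names, and the profile and zone values are only illustrative. The key layout `spot:<instanceType>:<zone>` and the one-hour expiry follow the code proposed in this issue.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// unavailableOfferings is a hypothetical expiring set of offering keys.
type unavailableOfferings struct {
	mu     sync.Mutex
	expiry map[string]time.Time
}

func newUnavailableOfferings() *unavailableOfferings {
	return &unavailableOfferings{expiry: make(map[string]time.Time)}
}

// offeringKey builds a key of the form spot:<instanceType>:<zone>.
func offeringKey(instanceType, zone string) string {
	return "spot:" + instanceType + ":" + zone
}

// Add marks key as unavailable until expiresAt.
func (u *unavailableOfferings) Add(key string, expiresAt time.Time) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.expiry[key] = expiresAt
}

// IsUnavailable reports whether key is still marked unavailable at time now.
// Expired entries are removed lazily on lookup.
func (u *unavailableOfferings) IsUnavailable(key string, now time.Time) bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	exp, ok := u.expiry[key]
	if !ok {
		return false
	}
	if !now.Before(exp) {
		delete(u.expiry, key)
		return false
	}
	return true
}

// runExample marks one offering unavailable for an hour and checks it
// immediately and again two hours later.
func runExample() (activeNow, activeLater bool) {
	cache := newUnavailableOfferings()
	now := time.Now()
	key := offeringKey("bx2-2x8", "us-south-1") // illustrative values
	cache.Add(key, now.Add(time.Hour))
	return cache.IsUnavailable(key, now), cache.IsUnavailable(key, now.Add(2*time.Hour))
}

func main() {
	activeNow, activeLater := runExample()
	fmt.Println(activeNow, activeLater)
}
```

Passing the current time into `IsUnavailable` keeps the expiry logic deterministic under test, which may help with the unit tests requested in the acceptance criteria.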
Goal
Modify:
pkg/controllers/interruption/controller.go
Required Changes
1. Call spot preemption check from Reconcile()
In Reconcile() (after the existing node loop), add:
    if err := c.checkSpotPreemptions(ctx); err != nil {
        log.FromContext(ctx).Error(err, "checking spot preemptions failed")
    }
2. Implement checkSpotPreemptions()
    func (c *Controller) checkSpotPreemptions(ctx context.Context) error {
        vpcClient := c.getVPCClient()
        instances, err := vpcClient.ListInstances(&vpcv1.ListInstancesOptions{
            AvailabilityClass: core.StringPtr("spot"),
        })
        if err != nil {
            return fmt.Errorf("listing spot instances: %w", err)
        }
        for _, instance := range instances.Instances {
            // Idempotency guard: only act on fully stopped instances.
            if instance.Status == nil || *instance.Status != "stopped" {
                continue
            }
            // Skip instances missing any field that is dereferenced below.
            if instance.ID == nil || instance.Profile == nil || instance.Profile.Name == nil ||
                instance.Zone == nil || instance.Zone.Name == nil {
                continue
            }
            for _, reason := range instance.StatusReasons {
                if reason.Code == nil || *reason.Code != "stopped_by_preemption" {
                    continue
                }
                nodeClaim := c.matchNodeClaimByProviderID(*instance.ID)
                if nodeClaim == nil {
                    // No NodeClaim owns this instance; nothing to do.
                    break
                }
                instanceType := *instance.Profile.Name
                zone := *instance.Zone.Name
                key := "spot:" + instanceType + ":" + zone
                // 1. Mark offering unavailable for one hour
                c.unavailableOfferings.Add(key, time.Now().Add(time.Hour))
                // 2. Delete stopped VM
                if err := vpcClient.DeleteInstance(&vpcv1.DeleteInstanceOptions{
                    ID: instance.ID,
                }); err != nil {
                    log.FromContext(ctx).Error(err, "failed deleting preempted instance")
                }
                // 3. Delete NodeClaim
                if err := c.kubeClient.Delete(ctx, nodeClaim); err != nil {
                    log.FromContext(ctx).Error(err, "failed deleting nodeclaim")
                }
                // 4. Emit event
                c.recorder.Event(
                    nodeClaim,
                    corev1.EventTypeWarning,
                    "SpotPreemption",
                    fmt.Sprintf("Spot instance %s preempted in zone %s", instanceType, zone),
                )
                // Handle each instance at most once per polling cycle.
                break
            }
        }
        return nil
    }
Idempotency
Only instances with Status == "stopped" are processed.
Instances already in the "deleting" state are skipped, which avoids duplicate processing across polling cycles.
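The guard can also be isolated as a small predicate, which makes it straightforward to cover in unit tests. This is a minimal sketch under stated assumptions: `instance` and `statusReason` are simplified stand-ins for the VPC SDK structs, and `isPreempted` and `strPtr` are hypothetical helper names that do not appear in the controller.

```go
package main

import "fmt"

// statusReason and instance are simplified stand-ins (assumptions) for the
// VPC SDK types; only the fields the guard needs are modelled.
type statusReason struct {
	Code *string
}

type instance struct {
	Status        *string
	StatusReasons []statusReason
}

// isPreempted reports whether the instance is fully stopped and carries the
// stopped_by_preemption status reason. Any other status, including
// "deleting", is rejected.
func isPreempted(inst instance) bool {
	if inst.Status == nil || *inst.Status != "stopped" {
		return false
	}
	for _, r := range inst.StatusReasons {
		if r.Code != nil && *r.Code == "stopped_by_preemption" {
			return true
		}
	}
	return false
}

// strPtr returns a pointer to s, mirroring the pointer-heavy SDK structs.
func strPtr(s string) *string { return &s }

func main() {
	preempted := instance{
		Status:        strPtr("stopped"),
		StatusReasons: []statusReason{{Code: strPtr("stopped_by_preemption")}},
	}
	deleting := instance{
		Status:        strPtr("deleting"),
		StatusReasons: []statusReason{{Code: strPtr("stopped_by_preemption")}},
	}
	fmt.Println(isPreempted(preempted), isPreempted(deleting))
}
```

Because the predicate takes no client or context, a table-driven test can enumerate status and reason combinations without mocking the VPC API.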
Correlating VPC instances to NodeClaims
Preferred method:
Use ProviderID.
ProviderID format:
ibm:///<region>/<instanceID>
Extract instanceID and match against instance.ID.
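The extraction step could look roughly like the sketch below. It assumes the ProviderID always follows the `ibm:///<region>/<instanceID>` layout given above; `instanceIDFromProviderID` is a hypothetical helper name, and the region and instance ID in `main` are made-up example values.

```go
package main

import (
	"fmt"
	"strings"
)

// providerIDPrefix is the scheme portion of the ProviderID format.
const providerIDPrefix = "ibm:///"

// instanceIDFromProviderID extracts <instanceID> from a ProviderID of the
// form ibm:///<region>/<instanceID>. It returns an error for any other shape.
func instanceIDFromProviderID(providerID string) (string, error) {
	if !strings.HasPrefix(providerID, providerIDPrefix) {
		return "", fmt.Errorf("provider ID %q does not start with %q", providerID, providerIDPrefix)
	}
	rest := strings.TrimPrefix(providerID, providerIDPrefix)
	parts := strings.SplitN(rest, "/", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" || strings.Contains(parts[1], "/") {
		return "", fmt.Errorf("provider ID %q is not of the form ibm:///<region>/<instanceID>", providerID)
	}
	return parts[1], nil
}

func main() {
	// Example values only; real instance IDs come from the VPC API.
	id, err := instanceIDFromProviderID("ibm:///us-south/0717_example")
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

A matcher such as matchNodeClaimByProviderID could then compare the extracted ID against instance.ID for each NodeClaim, skipping NodeClaims whose ProviderID fails to parse.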
Acceptance Criteria
- Implement checkSpotPreemptions()
- Only spot instances are listed
- Only "stopped" instances processed (idempotency guard)
- NodeClaim correlated via ProviderID
- Preempted VM and NodeClaim deleted
- Add unit tests