
[Spot] Detect spot preemption via VPC API polling #535

@meomnzak

Description


Parent: #501
Design Doc: #523
Depends on: #530, #532

Summary

When IBM Cloud preempts a spot instance, it stops the VM with status reason stopped_by_preemption.

The interruption controller must detect this via VPC API polling, then:

  1. Mark the offering as unavailable
  2. Delete the stopped VM
  3. Delete the NodeClaim to trigger replacement
  4. Emit a Kubernetes event

Goal

Modify:

pkg/controllers/interruption/controller.go

Required Changes

1. Call spot preemption check from Reconcile()

In Reconcile() (after the existing node loop), add:

if err := c.checkSpotPreemptions(ctx); err != nil {
    log.FromContext(ctx).Error(err, "checking spot preemptions failed")
}

2. Implement checkSpotPreemptions()

func (c *Controller) checkSpotPreemptions(ctx context.Context) error {
    vpcClient := c.getVPCClient()

    // IBM VPC SDK methods return (result, detailed response, error).
    // NOTE: ListInstances is paginated; a complete implementation should
    // follow the collection's Next link until all pages are consumed.
    instances, _, err := vpcClient.ListInstances(&vpcv1.ListInstancesOptions{
        AvailabilityClass: core.StringPtr("spot"),
    })
    if err != nil {
        return fmt.Errorf("listing spot instances: %w", err)
    }

    for _, instance := range instances.Instances {

        // Idempotency guard: only act on fully stopped instances.
        // Instances in "deleting" or any transitional state are skipped.
        if instance.Status == nil || *instance.Status != "stopped" {
            continue
        }

        for _, reason := range instance.StatusReasons {
            if reason.Code == nil || *reason.Code != "stopped_by_preemption" {
                continue
            }

            nodeClaim := c.matchNodeClaimByProviderID(*instance.ID)
            if nodeClaim == nil {
                break
            }

            instanceType := *instance.Profile.Name
            zone := *instance.Zone.Name

            key := "spot:" + instanceType + ":" + zone

            // 1. Mark the offering unavailable for one hour
            c.unavailableOfferings.Add(key, time.Now().Add(time.Hour))

            // 2. Delete the stopped VM (DeleteInstance returns (response, error))
            if _, err := vpcClient.DeleteInstance(&vpcv1.DeleteInstanceOptions{
                ID: instance.ID,
            }); err != nil {
                log.FromContext(ctx).Error(err, "failed deleting preempted instance")
            }

            // 3. Delete the NodeClaim to trigger replacement
            if err := c.kubeClient.Delete(ctx, nodeClaim); err != nil {
                log.FromContext(ctx).Error(err, "failed deleting nodeclaim")
            }

            // 4. Emit a Kubernetes event on the NodeClaim
            c.recorder.Event(
                nodeClaim,
                corev1.EventTypeWarning,
                "SpotPreemption",
                fmt.Sprintf("Spot instance %s preempted in zone %s", instanceType, zone),
            )

            break // this instance is handled; move on to the next one
        }
    }

    return nil
}

Idempotency

Only instances with Status == "stopped" are processed.

Instances already in the "deleting" state are skipped, which avoids duplicate processing across polling cycles.
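The guard above can be expressed as a small predicate. A minimal sketch (shouldProcessInstance is a hypothetical helper name, not part of the controller):

```go
package main

import "fmt"

// shouldProcessInstance illustrates the idempotency guard: only fully
// stopped instances whose status reasons include "stopped_by_preemption"
// are acted on. Instances in "running", "stopping", or "deleting" fall
// through untouched, so repeated polling cycles never process the same
// preemption twice.
func shouldProcessInstance(status string, reasonCodes []string) bool {
	if status != "stopped" {
		return false
	}
	for _, code := range reasonCodes {
		if code == "stopped_by_preemption" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldProcessInstance("stopped", []string{"stopped_by_preemption"}))  // true
	fmt.Println(shouldProcessInstance("deleting", []string{"stopped_by_preemption"})) // false
	fmt.Println(shouldProcessInstance("stopped", []string{"stopped_by_user"}))        // false
}
```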


Correlating VPC instances to NodeClaims

Preferred method: match on the NodeClaim's ProviderID.

ProviderID format:

ibm:///<region>/<instanceID>

Extract the instanceID segment and match it against instance.ID from the VPC listing.
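A minimal parser for this format might look as follows (parseInstanceID is a hypothetical helper; the actual matching lives in matchNodeClaimByProviderID):

```go
package main

import (
	"fmt"
	"strings"
)

// parseInstanceID extracts the VPC instance ID from a ProviderID of the
// form "ibm:///<region>/<instanceID>". Anything that does not match that
// shape is rejected, so malformed ProviderIDs surface as errors instead
// of silently failing to correlate.
func parseInstanceID(providerID string) (string, error) {
	rest, ok := strings.CutPrefix(providerID, "ibm:///")
	if !ok {
		return "", fmt.Errorf("unexpected ProviderID format: %q", providerID)
	}
	region, instanceID, ok := strings.Cut(rest, "/")
	if !ok || region == "" || instanceID == "" {
		return "", fmt.Errorf("unexpected ProviderID format: %q", providerID)
	}
	return instanceID, nil
}

func main() {
	id, err := parseInstanceID("ibm:///us-south/0717_1a2b3c")
	fmt.Println(id, err) // 0717_1a2b3c <nil>
}
```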


Acceptance Criteria

  • Implement checkSpotPreemptions()
  • Only spot instances are listed
  • Only "stopped" instances are processed (idempotency guard)
  • NodeClaim correlated via ProviderID
  • Preempted VM and NodeClaim deleted
  • Add unit tests

Metadata

Labels: kind/feature (Categorizes issue or PR as related to a new feature.)