Automatically scale StaticCapacity replicas when future ODCRs/Capacity Blocks become active #9001

@chrismld

Description

What problem are you trying to solve?

Karpenter supports StaticCapacity (the replicas field) and Capacity Reservations (capacityReservationSelectorTerms), but the two features don't integrate for future-dated reservations. The replicas value is static and doesn't scale automatically when a future On-Demand Capacity Reservation (ODCR) or Capacity Block becomes active.

Current behavior:

  • Purchase future ODCR or Capacity Block (e.g., 7 days from now, 8am-4pm)
  • Configure NodePool with replicas: 0 and capacityReservationSelectorTerms
  • When reservation becomes active at 8am: NodePool still has replicas: 0
  • You must either manually update to replicas: 8 or wait for pending pods to trigger provisioning (2-5 minute cold start)
  • When reservation expires: Karpenter detects expiration and updates node labels, but doesn't scale replicas back to 0

The problem: You're paying for the reservation whether nodes exist or not, but there's no automation to provision StaticCapacity nodes when the reservation becomes active.

Real-world impact: For distributed ML training with Ray (gang scheduling pattern):

  • Need head node (1x c7g.2xlarge) + workers (8x p5.48xlarge) ready simultaneously
  • Capacity Block cost: ~$600-800 for 8 hours
  • Current options:
    1. Wait for pods → 5-minute cold start per job → wasted reservation time
    2. Manual replicas management → error-prone, requires intervention (even if scripted, it adds complexity for the user)
    3. External automation (Lambda + EventBridge) → maintenance burden
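
As a rough illustration of option 3, the glue code such automation needs is small but has to be owned and maintained. The sketch below is not a real Lambda handler: the event shape (`state`, `instance_count`) is hypothetical, and it assumes the replicas field sits at `spec.replicas` on the NodePool. It only builds the JSON merge patch that would then be applied to the NodePool.

```python
import json


def replicas_patch(event: dict) -> str:
    """Build a JSON merge patch for NodePool replicas from a reservation event.

    The event shape is hypothetical; the real lifecycle event payload is not
    described in this issue.
    """
    if event["state"] == "active":
        replicas = event["instance_count"]
    elif event["state"] == "expired":
        replicas = 0
    else:
        raise ValueError(f"unhandled reservation state: {event['state']!r}")
    # Assumption: the StaticCapacity replicas field lives at spec.replicas.
    return json.dumps({"spec": {"replicas": replicas}})


print(replicas_patch({"state": "active", "instance_count": 8}))
print(replicas_patch({"state": "expired", "instance_count": 8}))
```

Everything around this function (event subscription, cluster credentials, retries) is the maintenance burden referred to above.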

Proposed solution:

Enable Karpenter to automatically adjust StaticCapacity replicas when reservations become active.

Example API:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
spec:
  capacityReservationSelectorTerms:
    - tags:
        workload-type: training
  capacityReservationBehavior:
    provisionOnActivation: true  # Auto-scale replicas when reservation becomes active
    deprovisionOnExpiration: true # Auto-scale to 0 when reservation expires

How it could work: Karpenter already evaluates whether a future ODCR or Capacity Block is within its valid time window, so it could:

  1. When reservation transitions to active state → set NodePool replicas to match the reservation's instance count
  2. Provision nodes immediately (StaticCapacity behavior)
  3. When reservation expires → set replicas: 0 and drain gracefully
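
The steps above reduce to a small decision function. The following is a minimal sketch of that logic, not Karpenter code: the `Reservation` type and `desired_replicas` function are hypothetical names, and it assumes the reservation exposes a start time, an end time, and an instance count.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Reservation:
    """Hypothetical view of a future ODCR or Capacity Block."""
    start: datetime
    end: datetime
    instance_count: int


def desired_replicas(reservation: Reservation, now: datetime) -> int:
    """Return the replicas a NodePool should have at time `now`.

    Inside the reservation window the NodePool matches the reserved
    instance count; before or after it, replicas drop to 0.
    """
    if reservation.start <= now < reservation.end:
        return reservation.instance_count
    return 0


# Example: an 8-instance reservation active from 08:00 to 16:00 UTC.
block = Reservation(
    start=datetime(2025, 1, 8, 8, 0, tzinfo=timezone.utc),
    end=datetime(2025, 1, 8, 16, 0, tzinfo=timezone.utc),
    instance_count=8,
)
for hour in (7, 8, 15, 16):
    now = datetime(2025, 1, 8, hour, 0, tzinfo=timezone.utc)
    print(hour, desired_replicas(block, now))
```

A controller would call something like this on each reconcile and update replicas only when the result differs from the current value. Graceful draining on expiration is a separate concern that this sketch does not cover.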

Benefits:

  • Eliminates cold starts for reserved capacity
  • No manual replicas management
  • No external automation required
  • Works with gang scheduling patterns
  • Maximizes utilization of pre-paid capacity

How important is this feature to you?

Critical for ML/AI workloads using future-dated Capacity Blocks and ODCRs.

Why:

  1. Cost efficiency: 5-min cold start × 10 jobs = 50 minutes of wasted capacity (~$60-80 for p5 instances)
  2. Gang scheduling: Distributed training needs all nodes ready simultaneously (Ray, Horovod, PyTorch DDP)
  3. Operational complexity: Teams build Lambda + EventBridge automation to watch reservation lifecycle events and update NodePool replicas - this could be handled natively in Karpenter instead
  4. User experience: "I reserved capacity" should mean "capacity is ready," not "capacity will provision when I submit a job"
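
The cost figure in point 1 can be checked with a short calculation. This sketch uses only numbers quoted in this issue (a $600-800 Capacity Block over 8 hours, a 5-minute cold start, 10 jobs) and assumes the block cost is spread evenly over its duration; `wasted_cost` is an illustrative helper, not part of any tool.

```python
def wasted_cost(block_cost: float, block_hours: float,
                cold_start_minutes: float, jobs: int) -> tuple[float, float]:
    """Return (idle minutes, cost of those minutes) for a reservation.

    Assumes the reservation cost is spread evenly across its duration.
    """
    wasted_minutes = cold_start_minutes * jobs
    cost_per_minute = block_cost / (block_hours * 60)
    return wasted_minutes, wasted_minutes * cost_per_minute


for cost in (600, 800):
    minutes, wasted = wasted_cost(cost, block_hours=8, cold_start_minutes=5, jobs=10)
    print(f"${cost} block: {minutes} min idle, about ${wasted:.2f} wasted")
# $600 block: 50 min idle, about $62.50 wasted
# $800 block: 50 min idle, about $83.33 wasted
```

That is roughly 10% of the reservation spent waiting for nodes, in line with the ~$60-80 estimate above.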

Willing to collaborate on design/implementation if maintainers are interested.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Labels: feature (New feature or request), triage/accepted (Indicates that the issue has been accepted as a valid issue)
