Description
What problem are you trying to solve?
Karpenter supports StaticCapacity (the `replicas` field) and Capacity Reservations (`capacityReservationSelectorTerms`), but these features don't integrate for future-dated reservations. The `replicas` field is static and doesn't automatically scale when a future ODCR or Capacity Block becomes active.
Current behavior:
- Purchase a future ODCR or Capacity Block (e.g., starting 7 days from now, 8am-4pm)
- Configure a NodePool with `replicas: 0` and `capacityReservationSelectorTerms`
- When the reservation becomes active at 8am: the NodePool still has `replicas: 0`
- Must manually update to `replicas: 8` or wait for pods (2-5 minute cold start)
- When the reservation expires: Karpenter detects expiration and updates node labels, but doesn't scale replicas back to 0
The problem: You're paying for the reservation whether nodes exist or not, but there's no automation to provision StaticCapacity nodes when the reservation becomes active.
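For concreteness, a minimal sketch of the current manual setup. This is illustrative, not authoritative: the `replicas` field assumes the StaticCapacity feature described above, and names/values are hypothetical.

```yaml
# NodePool pinned to reserved capacity; replicas must be bumped by hand
# (or by external automation) once the reservation becomes active.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: training-reserved
spec:
  replicas: 0   # manually changed to 8 when the Capacity Block activates
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: training-reserved
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["reserved"]
```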
Real-world impact: For distributed ML training with Ray (gang scheduling pattern):
- Need head node (1x c7g.2xlarge) + workers (8x p5.48xlarge) ready simultaneously
- Capacity Block cost: ~$600-800 for 8 hours
- Current options:
- Wait for pods → 5-minute cold start per job → wasted reservation time
- Manual `replicas` management → error-prone, requires intervention (and even if automated, it adds complexity for the user)
- External automation (Lambda + EventBridge) → maintenance burden
Proposed solution:
Enable Karpenter to automatically adjust StaticCapacity replicas when reservations become active.
Example API:
```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
spec:
  capacityReservationSelectorTerms:
    - tags:
        workload-type: training
  capacityReservationBehavior:
    provisionOnActivation: true    # Auto-scale replicas when reservation becomes active
    deprovisionOnExpiration: true  # Auto-scale to 0 when reservation expires
```
How it could work: Since Karpenter already evaluates whether an FCR/CB is within its valid time window:
- When the reservation transitions to the active state → set NodePool replicas to match the reserved instance count
- Provision nodes immediately (StaticCapacity behavior)
- When the reservation expires → set `replicas: 0` and drain nodes gracefully
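Putting the pieces together, a hedged end-to-end sketch of the desired lifecycle. The `capacityReservationBehavior` block is the proposed (not yet existing) API from the example above; the timeline comments describe the intended controller behavior, not current Karpenter semantics.

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: training-reserved
spec:
  capacityReservationSelectorTerms:
    - tags:
        workload-type: training
  capacityReservationBehavior:    # proposed field
    provisionOnActivation: true
    deprovisionOnExpiration: true
---
# Expected lifecycle under the proposal:
#   t0 (reservation pending): NodePool replicas stay at 0
#   t1 (reservation active):  controller sets replicas = reserved instance count,
#                             nodes provision immediately (StaticCapacity path)
#   t2 (reservation expired): controller sets replicas = 0, nodes drain gracefully
```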
Benefits:
- Eliminates cold starts for reserved capacity
- No manual `replicas` management
- No external automation required
- Works with gang scheduling patterns
- Maximizes utilization of pre-paid capacity
How important is this feature to you?
Critical for ML/AI workloads using future-dated Capacity Blocks and ODCRs.
Why:
- Cost efficiency: 5-min cold start × 10 jobs = 50 minutes wasted capacity (~$60-80 for p5 instances)
- Gang scheduling: Distributed training needs all nodes ready simultaneously (Ray, Horovod, PyTorch DDP)
- Operational complexity: Teams build Lambda + EventBridge automation to watch reservation lifecycle events and update NodePool replicas - this could be handled natively in Karpenter instead
- User experience: "I reserved capacity" should mean "capacity is ready," not "capacity will provision when I submit a job"
Willing to collaborate on design/implementation if maintainers are interested.
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment