
Conversation

@maxcao13
Contributor

WIP

This commit allows prices to be discovered for cluster-api instance type offerings. This is done by consuming an annotation from a MachineDeployment.

Signed-off-by: Max Cao <[email protected]>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: maxcao13
Once this PR has been reviewed and has the lgtm label, please assign neolit123 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 31, 2025
@maxcao13
Contributor Author

maxcao13 commented Oct 31, 2025

I've tested provisioning with this using the kubemark cluster-api provider.

For example, with these two different MachineDeployments:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    capacity.cluster-autoscaler.kubernetes.io/memory: "4G"
    capacity.cluster-autoscaler.kubernetes.io/cpu: "2"
    capacity.cluster-autoscaler.kubernetes.io/ephemeral-disk: "100Gi"
    capacity.cluster-autoscaler.kubernetes.io/maxPods: "110"
    capacity.cluster-autoscaler.kubernetes.io/labels: kubernetes.io/arch=amd64,karpenter.sh/capacity-type=on-demand,node.kubernetes.io/instance-type=kubemark-0,topology.kubernetes.io/zone=us-east-1a
    cluster.x-k8s.io/machine-current-price: "0.50" # <-- the new price annotation
  labels:
    node.cluster.x-k8s.io/karpenter-member: ""
  name: kubemark-workload-karpenter-md-0
  namespace: default
spec:
...omitted
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    capacity.cluster-autoscaler.kubernetes.io/memory: "8G"
    capacity.cluster-autoscaler.kubernetes.io/cpu: "4"
    capacity.cluster-autoscaler.kubernetes.io/ephemeral-disk: "200Gi"
    capacity.cluster-autoscaler.kubernetes.io/maxPods: "110"
    capacity.cluster-autoscaler.kubernetes.io/labels: kubernetes.io/arch=amd64,karpenter.sh/capacity-type=on-demand,node.kubernetes.io/instance-type=kubemark-1,topology.kubernetes.io/zone=us-east-1a
    cluster.x-k8s.io/machine-current-price: "10.00" # <-- this is more expensive!
  labels:
    node.cluster.x-k8s.io/karpenter-member: ""
  name: kubemark-workload-karpenter-md-1
  namespace: default
spec:
...omitted

and a Deployment that triggers Karpenter scale-up, Karpenter picks the MachineDeployment with the lower cost based on the annotation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up
  namespace: default
spec:
  ...omitted
  replicas: 1
  template:
    spec:
      containers:
      - resources:
          requests:
            cpu: 500m
            memory: 2G
      nodeSelector:
        node.cluster.x-k8s.io: workload-target
...omitted...
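Under the hood, the selection comes down to parsing the price annotation and preferring the cheaper offering. A minimal sketch of that idea, with hypothetical helper names (not the provider's actual code):

```go
package main

import (
	"fmt"
	"strconv"
)

// Annotation key introduced by this PR on a MachineDeployment.
const priceAnnotation = "cluster.x-k8s.io/machine-current-price"

// priceFromAnnotations is a hypothetical helper: it parses the price
// annotation, reporting ok=false when it is absent or malformed.
func priceFromAnnotations(annotations map[string]string) (float64, bool) {
	raw, found := annotations[priceAnnotation]
	if !found {
		return 0, false
	}
	price, err := strconv.ParseFloat(raw, 64)
	if err != nil || price < 0 {
		return 0, false
	}
	return price, true
}

// cheapest returns the name of the lowest-priced MachineDeployment among
// those carrying a valid price annotation (empty string if none do).
func cheapest(annotationsByName map[string]map[string]string) string {
	bestName, bestPrice := "", 0.0
	for name, anns := range annotationsByName {
		if p, ok := priceFromAnnotations(anns); ok && (bestName == "" || p < bestPrice) {
			bestName, bestPrice = name, p
		}
	}
	return bestName
}

func main() {
	mds := map[string]map[string]string{
		"kubemark-workload-karpenter-md-0": {priceAnnotation: "0.50"},
		"kubemark-workload-karpenter-md-1": {priceAnnotation: "10.00"},
	}
	fmt.Println(cheapest(mds)) // prints "kubemark-workload-karpenter-md-0"
}
```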

I also sort of got consolidation to work. The catch is that you have to force the MachineDeployments to use the on-demand capacity type and force the NodePool to have that as a requirement; otherwise some weird stuff happens, since we don't support spot [1].

# my NodePool requirements
        - key: karpenter.sh/capacity-type
          operator: In
          values:
          - on-demand
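In context, that requirement sits under the NodePool's template spec. An illustrative fragment (only the capacity-type requirement and the `default` name are from this thread; the rest is assumed boilerplate):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
          - on-demand
```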

Here's the events after consolidation:

8m21s       Normal    Unconsolidatable             nodeclaim/default-xjpsc                             Can't replace with a cheaper node
8m21s       Normal    Unconsolidatable             node/kubemark-workload-karpenter-md-0-xtmzg-cl4p2   Can't replace with a cheaper node
0s          Normal    Scheduled                    pod/dummy-trigger                                   Successfully assigned default/dummy-trigger to kubemark-workload-karpenter-md-0-xtmzg-cl4p2
0s          Normal    DisruptionLaunching          nodeclaim/default-txrbq                             Launching NodeClaim: Underutilized/Replace
0s          Normal    DisruptionWaitingReadiness   nodeclaim/default-txrbq                             Waiting on readiness to continue disruption
0s          Normal    NodeRegistrationHealthy      nodepool/default                                    Status condition transitioned, Type: NodeRegistrationHealthy, Status: Unknown -> True, Reason: NodeRegistrationHealthy
0s          Normal    NodeRegistrationHealthy      nodepool/default                                    Status condition transitioned, Type: NodeRegistrationHealthy, Status: True -> Unknown, Reason: AwaitingReconciliation, Message: object is awaiting reconciliation
0s          Normal    Launched                     nodeclaim/default-txrbq                             Status condition transitioned, Type: Launched, Status: Unknown -> True, Reason: Launched
0s          Normal    Registered                   nodeclaim/default-txrbq                             Status condition transitioned, Type: Registered, Status: Unknown -> True, Reason: Registered
0s          Normal    Initialized                  nodeclaim/default-txrbq                             Status condition transitioned, Type: Initialized, Status: Unknown -> True, Reason: Initialized
0s          Normal    Ready                        nodeclaim/default-txrbq                             Status condition transitioned, Type: Ready, Status: Unknown -> True, Reason: Ready
0s          Normal    DisruptionTerminating        node/kubemark-workload-karpenter-md-0-xtmzg-cl4p2   Disrupting Node: Underutilized/Replace
0s          Normal    DisruptionTerminating        nodeclaim/default-xjpsc                             Disrupting NodeClaim: Underutilized/Replace
0s          Warning   FailedDraining               node/kubemark-workload-karpenter-md-0-xtmzg-cl4p2   Failed to drain node, 2 pods are waiting to be evicted
0s          Normal    Ready                        node/kubemark-workload-karpenter-md-1-x65s4-2c2rd   Status condition transitioned, Type: Ready, Status: False -> True, Reason: KubeletReady, Message: kubelet is posting ready status
0s          Normal    DisruptionBlocked            node/kubemark-workload-karpenter-md-0-xtmzg-cl4p2   Node is deleting or marked for deletion
0s          Normal    DisruptionBlocked            nodeclaim/default-xjpsc                             Node is deleting or marked for deletion
0s          Normal    Nominated                    pod/dummy-trigger                                   Pod should schedule on: nodeclaim/default-txrbq, node/kubemark-workload-karpenter-md-1-x65s4-q2png
0s          Normal    SuccessfulCreate             replicaset/scale-up-866ff5dfc                       Created pod: scale-up-866ff5dfc-vhj2l
0s          Normal    Evicted                      pod/scale-up-866ff5dfc-89cnf                        Evicted pod: Underutilized
0s          Normal    Scheduled                    pod/scale-up-866ff5dfc-vhj2l                        Successfully assigned default/scale-up-866ff5dfc-vhj2l to kubemark-workload-karpenter-md-1-x65s4-q2png
0s          Normal    Evicted                      pod/dummy-trigger                                   Evicted pod: Underutilized
0s          Normal    RegisteredNode               node/kubemark-workload-karpenter-md-1-x65s4-2c2rd   Node kubemark-workload-karpenter-md-1-x65s4-2c2rd event: Registered Node kubemark-workload-karpenter-md-1-x65s4-2c2rd in Controller
0s          Normal    Finalized                    node                                                Finalized karpenter.sh/termination
0s          Normal    Finalized                    nodeclaim                                           Finalized karpenter.sh/termination
0s          Normal    DisruptionBlocked            node/kubemark-workload-karpenter-md-1-x65s4-q2png   Node is nominated for a pending pod
0s          Normal    DisruptionBlocked            nodeclaim/default-txrbq                             Node is nominated for a pending pod
0s          Normal    RemovingNode                 node/kubemark-workload-karpenter-md-0-xtmzg-cl4p2   Node kubemark-workload-karpenter-md-0-xtmzg-cl4p2 event: Removing Node kubemark-workload-karpenter-md-0-xtmzg-cl4p2 from Controller
0s          Normal    Unconsolidatable             node/kubemark-workload-karpenter-md-1-x65s4-q2png   Can't replace with a cheaper node
0s          Normal    Unconsolidatable             nodeclaim/default-txrbq                             Can't replace with a cheaper node

All I had to do was edit the annotation on the currently scheduled node's MachineDeployment to be more expensive than the other, so consolidation would eventually happen. Karpenter waits about 5 minutes before consolidating when there's no external trigger, so I ran a random pod to kick it off (you can see default/dummy-trigger in the events).
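A trigger pod like default/dummy-trigger can be as simple as the following; the image and resource requests are illustrative assumptions, not taken from the PR:

```yaml
# hypothetical minimal pod used only to nudge the disruption controller
apiVersion: v1
kind: Pod
metadata:
  name: dummy-trigger
  namespace: default
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
  nodeSelector:
    node.cluster.x-k8s.io: workload-target
```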

[1] (The reserved capacity type is straight up not allowed to be consolidated, since it is considered "no-cost"; that was something I ran into.)

@maxcao13
Contributor Author

/cc @elmiko

@k8s-ci-robot k8s-ci-robot requested a review from elmiko October 31, 2025 02:37
@elmiko
Collaborator

elmiko commented Oct 31, 2025

this is awesome @maxcao13 !

i don't think we want to merge yet, as there is more discussion that needs to happen about how cluster-api will want this feature.

I also sort of got consolidation to work. The thing is that you have to force the MachineDeployments to use capacity type on-demand and force the NodePool to have that as a requirement, otherwise some weird stuff happens since we don't support spot[1].

i'm very curious if you think there is any followup we should do related to this?

i'm putting a hold, just to make sure we don't merge early.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2025
@elmiko
Collaborator

elmiko commented Oct 31, 2025

oh, looks like some failure in the unit tests too

@maxcao13
Contributor Author

maxcao13 commented Oct 31, 2025

i'm very curious if you think there is any followup we should do related to this?

I think we need to figure out how to handle spot (assuming that I'm not just missing some configuration/manifest setup), or at least investigate what's going wrong if it's not related to that.

Right now, if we just have a NodePool without a capacity-type requirement and Karpenter tries to consolidate an on-demand node, it will default the new node to a spot instance: https://github.com/kubernetes-sigs/karpenter/blob/565c33c673d28f5f5a442182dcc14db6cb6b44ea/pkg/controllers/disruption/consolidation.go#L215-L223

When testing, the replacement spot instance nodes start to churn and get deleted over and over for some reason. If we can figure out why, I would feel more confident about consolidation, at least for single-node consolidation.

We may also want to think about introducing the karpenter-core "conformance" e2e tests as a GitHub Actions workflow to start validating this.

@elmiko
Collaborator

elmiko commented Nov 4, 2025

I think we need to figure out how to handle spot (assuming that I'm not just missing some configuration/manifest setup), or at least investigate what's going wrong if it's not related to that.

heard.

i think this is going to be complex due to cluster-api. there isn't a singular "on-demand/spot" field across cluster-api providers. we might have to push for more status to be exposed on those objects.
