Commit 97d7857

Add Task to scale down replicas for addons and karpenter in order to fix the aiml-load Pipeline
Description / Motivation: Add-ons such as CoreDNS and the EBS CSI controller have PodDisruptionBudgets (PDBs) set, which prevent Karpenter from scaling down the node pools. This commit adds a Task that scales these add-ons to 0 replicas before the node pools are scaled down in the aiml-load pipeline. It also adds a second Task that sets the Karpenter replicas to 0 once the node pools have scaled down. This forces the Karpenter logs Task to stop; otherwise it keeps running after the job is done and prevents the teardown step from running.

Desktop Testing: Tested by triggering a Tekton test run.
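
The ordering described above can be sketched as a small shell script. This is an illustration under assumptions, not the pipeline's implementation: the function names (`run`, `scale_down_addons`, `stop_karpenter`, `teardown_order`) and the `DRY_RUN` switch are hypothetical, and the node pool step is only a placeholder comment because the `scale-nodepool` Task is not part of this commit. Only the `kubectl scale` commands are taken from the new Tasks.

```shell
#!/bin/sh
# Sketch of the scale-down ordering. DRY_RUN defaults to 1, so commands are
# only printed; set DRY_RUN=0 to run them against a configured cluster.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# With zero replicas there are no add-on pods left for the PDBs to protect.
scale_down_addons() {
  run kubectl scale deployment coredns -n kube-system --replicas=0
  run kubectl scale deployment ebs-csi-controller -n kube-system --replicas=0
}

# Assumes the deployment is named "karpenter"; the namespace defaults to
# "karpenter", matching the default of the scale-karpenter Task.
stop_karpenter() {
  run kubectl scale deployment karpenter -n "${1:-karpenter}" --replicas=0
}

teardown_order() {
  scale_down_addons
  echo "# (node pools are scaled to 0 and awaited by the existing tasks)"
  stop_karpenter
}
```

Calling `teardown_order` with the default `DRY_RUN=1` prints the commands in the order the pipeline runs them, which makes it easy to review the sequence without touching a cluster.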
1 parent 3ec214d commit 97d7857

3 files changed, +121 -5 lines changed

tests/tekton-resources/pipelines/eks/awscli-eks-aiml-load.yaml

Lines changed: 41 additions & 5 deletions
@@ -281,6 +281,7 @@ spec:
       kind: Task
       name: helm-karpenter-install
   - name: get-karp-logs
+    timeout: "4h"
     params:
     - name: cluster-name
       value: $(params.cluster-name)
@@ -602,6 +603,24 @@ spec:
     taskRef:
       kind: Task
       name: load-aiml-multiple-fine-tuning
+  - name: scale-down-addons
+    params:
+    - name: cluster-name
+      value: $(params.cluster-name)
+    - name: coredns-replicas
+      value: 0
+    - name: ebs-csi-replicas
+      value: 0
+    - name: endpoint
+      value: $(params.endpoint)
+    runAfter:
+    - load-aiml-multiple-fine-tuning
+    taskRef:
+      kind: Task
+      name: scale-addons
+    workspaces:
+    - name: config
+      workspace: config
   - name: scale-down-training
     params:
     - name: cluster-name
@@ -613,7 +632,7 @@ spec:
     - name: replicas
       value: 0
     runAfter:
-    - load-aiml-multiple-fine-tuning
+    - scale-down-addons
     taskRef:
       kind: Task
       name: scale-nodepool
@@ -643,7 +662,7 @@ spec:
     - name: replicas
       value: 0
     runAfter:
-    - load-aiml-multiple-fine-tuning
+    - scale-down-addons
     taskRef:
       kind: Task
       name: scale-nodepool
@@ -673,7 +692,7 @@ spec:
     - name: replicas
       value: 0
     runAfter:
-    - load-aiml-multiple-fine-tuning
+    - scale-down-addons
     taskRef:
       kind: Task
       name: scale-nodepool
@@ -703,7 +722,7 @@ spec:
     - name: replicas
       value: 0
     runAfter:
-    - load-aiml-multiple-fine-tuning
+    - scale-down-addons
     taskRef:
       kind: Task
       name: scale-nodepool
@@ -733,7 +752,7 @@ spec:
     - name: replicas
       value: 0
     runAfter:
-    - load-aiml-multiple-fine-tuning
+    - scale-down-addons
     taskRef:
       kind: Task
       name: scale-nodepool
@@ -752,6 +771,23 @@ spec:
     taskRef:
       kind: Task
       name: nodepool-replicas-wait
+  - name: stop-karpenter-logs
+    params:
+    - name: cluster-name
+      value: $(params.cluster-name)
+    - name: endpoint
+      value: $(params.endpoint)
+    - name: replicas
+      value: 0
+    runAfter:
+    - wait-for-scale-down-training
+    - wait-for-scale-down-inference
+    - wait-for-scale-down-operator
+    - wait-for-scale-down-monitoring
+    - wait-for-scale-down-titan-pool
+    taskRef:
+      kind: Task
+      name: scale-karpenter
   finally:
   - name: teardown
     retries: 10 # To deal with throttling during deletion
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+apiVersion: tekton.dev/v1beta1
+kind: Task
+metadata:
+  name: scale-addons
+  namespace: scalability
+spec:
+  description: |
+    Scales the CoreDNS and EBS CSI Controller deployments to the specified number of replicas.
+    This task configures kubectl access to the EKS cluster, scales the deployments,
+    and waits for the scaling operations to complete.
+  params:
+    - name: coredns-replicas
+      description: Number of replicas to scale CoreDNS to (target replica count)
+    - name: ebs-csi-replicas
+      description: Number of replicas to scale EBS CSI Controller to (target replica count)
+    - name: cluster-name
+      description: The name of the EKS cluster
+    - name: endpoint
+      description: EKS cluster endpoint URL (optional)
+      default: ""
+    - name: aws-region
+      description: AWS region where the cluster is located
+      default: "us-west-2"
+    - name: timeout
+      description: Timeout for rollout status in seconds
+      default: "1800"
+  workspaces:
+    - name: config
+      mountPath: /config/
+  stepTemplate:
+    env:
+      - name: KUBECONFIG
+        value: /config/kubeconfig
+  steps:
+    - name: update-kubeconfig
+      image: alpine/k8s:1.35.0
+      script: |
+        ENDPOINT_FLAG=""
+        if [ -n "$(params.endpoint)" ]; then
+          ENDPOINT_FLAG="--endpoint $(params.endpoint)"
+        fi
+        aws eks $ENDPOINT_FLAG update-kubeconfig --name $(params.cluster-name) --region $(params.aws-region)
+    - name: scale-coredns
+      image: alpine/k8s:1.35.0
+      script: |
+        kubectl scale deployment coredns -n kube-system --replicas=$(params.coredns-replicas)
+        kubectl rollout status deployment/coredns -n kube-system --timeout=$(params.timeout)s
+    - name: scale-ebs-csi-controller
+      image: alpine/k8s:1.35.0
+      script: |
+        kubectl scale deployment ebs-csi-controller -n kube-system --replicas=$(params.ebs-csi-replicas)
+        kubectl rollout status deployment/ebs-csi-controller -n kube-system --timeout=$(params.timeout)s
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+---
+apiVersion: tekton.dev/v1beta1
+kind: Task
+metadata:
+  name: scale-karpenter
+  namespace: scalability
+spec:
+  description: "Scale Karpenter deployment"
+  params:
+    - name: cluster-name
+      description: The name of the cluster
+    - name: endpoint
+      description: eks endpoint to use
+    - name: aws-region
+      description: AWS region where the cluster is located
+      default: us-west-2
+    - name: namespace
+      description: Namespace where karpenter is installed
+      default: karpenter
+    - name: replicas
+      description: Number of replicas to scale to
+  steps:
+    - name: scale-karpenter
+      image: alpine/k8s:1.35.0
+      script: |
+        aws eks update-kubeconfig --name $(params.cluster-name) --endpoint $(params.endpoint) --region $(params.aws-region)
+        kubectl scale deployment karpenter -n $(params.namespace) --replicas=$(params.replicas)
+        echo "Karpenter scaled to $(params.replicas) replicas"
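
The two new Tasks handle the endpoint differently: `scale-addons` adds `--endpoint` only when the parameter is non-empty, while `scale-karpenter` always passes it and declares no default. A minimal sketch of the conditional form as a shared helper might look like the following. The function name `kubeconfig_cmd` and its argument order are hypothetical and not part of this commit; the `--endpoint` spelling is copied from the Tasks as written.

```shell
#!/bin/sh
# Hypothetical helper (not in the commit): prints the update-kubeconfig
# command, adding --endpoint only when a non-empty endpoint is supplied.
kubeconfig_cmd() {
  cluster="$1"
  region="$2"
  endpoint="${3:-}"

  endpoint_flag=""
  if [ -n "$endpoint" ]; then
    # Same flag placement as the scale-addons Task: before the subcommand.
    endpoint_flag="--endpoint $endpoint "
  fi

  echo "aws eks ${endpoint_flag}update-kubeconfig --name $cluster --region $region"
}
```

The function only prints the command, so it can be inspected or piped to `sh` by the caller; whether sharing this logic between the Tasks is worthwhile is a design choice left to the pipeline authors.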
