
Conversation

@chithreshazad (Contributor) commented Jan 26, 2026

Description / Motivation:
Addons like coredns and the ebs-csi-controller have PodDisruptionBudgets (PDBs) set, which prevents Karpenter from scaling down the nodepools. Added a Task that scales these addons' replicas to 0 before scaling down the nodepools in the aiml-load pipeline.

To stop the karpenter-logs Task after the nodepools are scaled down, I am adding another step to the Task that sets the Karpenter replicas to zero. This force-stops the karpenter log streaming, which otherwise keeps running even after the job is done and prevents the Teardown step from running.

Desktop Testing: Tested by triggering a Tekton test run.
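For context, the scale-down step described above could look roughly like the following. This is an illustrative sketch, not the actual Task from this PR; the step name, image, namespaces, and deployment names are assumptions:

```yaml
# Hypothetical Tekton Task step: scale PDB-protected addons to 0 so
# Karpenter can drain and remove the nodes. Names are illustrative.
- name: scale-down-addons
  image: bitnami/kubectl:latest
  script: |
    #!/usr/bin/env bash
    set -euo pipefail
    # coredns and the EBS CSI controller carry PodDisruptionBudgets,
    # so their pods block node drain until their replicas reach zero.
    kubectl scale deployment coredns --replicas=0 -n kube-system
    kubectl scale deployment ebs-csi-controller --replicas=0 -n kube-system
```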

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

kind: Task
name: helm-karpenter-install
- name: get-karp-logs
  timeout: "4h"
Contributor

Why the explicit 4h timeout?

Contributor Author

This is for cases where the pipeline fails in one of the steps before the stop-karpenter-logs step. If we don't set a timeout, this step keeps running and blocks the Teardown step. 4h is more than enough time for the pipeline to finish successfully, based on my analysis of past runs.
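For reference, a per-task timeout in a Tekton Pipeline is set on the PipelineTask entry. The sketch below is illustrative; only the task and step names are taken from the diff above, the rest is an assumption:

```yaml
# Hypothetical Pipeline excerpt: capping the log-streaming task so a
# failure earlier in the pipeline cannot block Teardown indefinitely.
tasks:
  - name: get-karp-logs
    taskRef:
      name: helm-karpenter-install
    # Hard cap on this task's run time; without it, only the
    # pipeline-level default timeout bounds the log streaming.
    timeout: "4h"
```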

Contributor

Gotcha!

If we don't timeout then this step will keep running which blocks the Teardown step

The pipeline's default timeout will kick in if one isn't set here.

Contributor

I think instead of an arbitrary timeout, given that we scale down the Karpenter pods, the get-karp-logs task could check whether the Karpenter pods have been deleted and then exit.
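The check being suggested here could be sketched as follows. This is purely illustrative; the step name, image, label selector, and namespace are assumptions:

```yaml
# Hypothetical alternative: instead of a fixed timeout, block until the
# Karpenter pods are gone and then let the log task exit cleanly.
- name: wait-for-karpenter-gone
  image: bitnami/kubectl:latest
  script: |
    #!/usr/bin/env bash
    set -euo pipefail
    # `kubectl wait --for=delete` blocks until the matched pods are
    # deleted (or the wait timeout expires).
    kubectl wait --for=delete pod \
      -l app.kubernetes.io/name=karpenter \
      -n karpenter --timeout=30m
```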

Contributor

Looks like that is already embedded here:

# Follow logs continuously - will exit when pod is deleted
kubectl logs "$pod" -n $(params.namespace) -f &

Were you not seeing the task exit even after the scale-down-karp step executed?

Contributor Author

Were you not seeing the task exit even after the scale-down-karp step executed?

Yes, it does exit after the scale-down-karp step executes. As mentioned previously, the timeout is only for the scenario where we never reach the scale-down-karp step (e.g. the pipeline failed somewhere before it), though I haven't seen that happen in my testing so far.

I will remove the timeout for now. We can deal with it if we see problems with this step when running this Pipeline in prod.

@hakuna-matatah hakuna-matatah merged commit 3c5dfb8 into awslabs:main Jan 28, 2026
4 checks passed
