Add Task to scale down replicas for addons and karpenter in order to fix the aiml-load Pipeline #576
kind: Task
name: helm-karpenter-install
- name: get-karp-logs
timeout: "4h"
Why the explicit 4h timeout?
This covers cases where the pipeline fails in a step before the stop-karpenter-logs step. Without a timeout, this step keeps running and blocks the Teardown step. Based on my analysis of past runs, 4h is more than enough time for the pipeline to finish successfully.
Gotcha!
"If we don't timeout then this step will keep running which blocks the Teardown step"
The pipeline's default timeout will kick in if no timeout is set.
Instead of an arbitrary timeout, since we scale down the Karpenter pods, the get-karp-logs task could check whether the Karpenter pods have been deleted and then exit.
Looks like this is already built in here:
Lines 48 to 49 in 3ec214d
# Follow logs continuously - will exit when pod is deleted
kubectl logs "$pod" -n $(params.namespace) -f &
Were you not seeing the task exit even after the scale-down Karpenter step executed?
"Were you not seeing the task exit even after the scale-down Karpenter step executed?"
Yes, it does exit after the scale-down Karpenter step executes. As mentioned previously, the timeout only covers the scenario where we never reach that step (for example, because an earlier step failed), though I haven't seen that in my testing so far.
I will remove the timeout for now. We can revisit it if we see problems with this step when running this Pipeline in prod.
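For reference, a minimal sketch of one way the log follow could be bounded inside the step script itself, should the "scale-down step never runs" case show up later. This is not code from the PR: `follow_logs_bounded` is a hypothetical helper, the 4h value is only the figure discussed in this thread, and it assumes GNU coreutils `timeout` is available in the step image.

```shell
# Sketch only: the helper name is hypothetical and GNU `timeout` is assumed
# to be present in the step image.
KUBECTL="${KUBECTL:-kubectl}"

# Follow a pod's logs, but stop after <limit> (e.g. 4h) even if the pod
# is never deleted.
# Usage: follow_logs_bounded <namespace> <pod> <limit>
follow_logs_bounded() {
  namespace="$1"
  pod="$2"
  limit="$3"
  rc=0
  # GNU coreutils `timeout` exits with status 124 when the limit is reached.
  timeout "$limit" "$KUBECTL" logs "$pod" -n "$namespace" -f || rc=$?
  if [ "$rc" -eq 124 ]; then
    # Hitting the limit is treated as a normal stop so Teardown can proceed.
    echo "log follow stopped after $limit" >&2
    return 0
  fi
  return "$rc"
}
```

With this shape, a call such as `follow_logs_bounded "$namespace" "$pod" 4h` would return either when the pod is deleted (the existing behavior of `kubectl logs -f`) or when the limit expires, whichever comes first.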
tests/tekton-resources/tasks/teardown/karpenter/kubectl-karpenter-scale.yaml
Description / Motivation:
Addons such as coredns and ebs-csi-controllers have PodDisruptionBudgets (PDBs) set, which prevent Karpenter from scaling down the nodepools. This PR adds a Task that scales these addons' replicas to 0 before the nodepools are scaled down in the aiml-load pipeline.
To stop the Karpenter logs Task after the nodepools are scaled down, the Task gets an additional step that sets the Karpenter replicas to zero. This forces the Karpenter log collection to stop; otherwise it keeps running after the job is done and prevents the Teardown step from running.
Desktop Testing: Tested by triggering a Tekton test run.
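A minimal sketch of the scale-down order described above, not the Task added in this PR. The deployment names (`coredns`, `ebs-csi-controller`, `karpenter`) and namespaces (`kube-system`, `karpenter`) used in the usage note are assumptions, and `scale_to_zero` is a hypothetical helper; only the `kubectl scale ... --replicas=0` call itself is standard.

```shell
# Sketch only: the helper name is hypothetical, and any deployment or
# namespace names passed to it are assumptions, not values from the PR.
KUBECTL="${KUBECTL:-kubectl}"

# Scale every named deployment in one namespace to zero replicas.
# Usage: scale_to_zero <namespace> <deployment>...
scale_to_zero() {
  namespace="$1"
  shift
  for deployment in "$@"; do
    "$KUBECTL" scale deployment "$deployment" -n "$namespace" --replicas=0
  done
}
```

Calling `scale_to_zero kube-system coredns ebs-csi-controller` before the nodepool scale-down, and `scale_to_zero karpenter karpenter` after it, would mirror the order in the description: first remove the PDB-protected addon pods, then stop Karpenter so the log-follow step exits.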
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.