-
Notifications
You must be signed in to change notification settings - Fork 104
Add sidecar container mode to the in-place restart agent #1140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add sidecar container mode to the in-place restart agent #1140
Conversation
✅ Deploy Preview for kubernetes-sigs-jobset ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: GiuseppeTT The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| for _, pod := range associatedPods.Items { | ||
| // Skip it if the pod is failed | ||
| // Failed Pods might persist while their new copy already exists | ||
| if pod.Status.Phase == corev1.PodFailed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this a bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I am not sure how I didn't catch this in earlier tests 😅 .
Example of how the bug might manifest
Starting case. Workload has only 2 Pods.
pod-0: Runningpod-1: Running
pod-0 fails.
pod-0: Failedpod-1: Running
Job controller creates the replacement pod-0-new because backoffLimit > 0.
pod-0: Failedpod-0-new: Waiting at the barrierpod-1: Running
pod-0 is fully terminated but the object might persist for a long time. The JobSet controller should ignore pod-0 and consider only pod-0-new and pod-1 for in-place restart. This change makes sure this is true.
…e" to "sidecar container mode"
What type of PR is this?
/kind feature
What this PR does / why we need it:
tl;dr: Allow the in-place restart agent to run as a sidecar (sidecar container mode) instead of the main container (main container mode).
Up until now, the in-place restart agent could only be used in the "main container container" mode (see
site/static/examples/in-place-restart/jobset-main-container-mode.yaml) in which the worker process runs in the container of the agent. In-place restarts are done by restarting only the agent-worker container. The barrier is done by making the agent start the worker process only when the barrier is lifted.This PR adds the "sidecar" mode (see
site/static/examples/in-place-restart/jobset-sidecar-container-mode.yaml) in which the agent runs as a sidecar while the worker runs as the main container. In-place restarts are done using theRestartAllContainersfeature from upstream (see https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-all-containers). This feature was introduced in 1.35 and its feature gate will be enabled by default in 1.36. The barrier is done by making the agent succeed a start up probe when the barrier is lifted, which allows the worker container to start running. The main advantage of the sidecar container mode is that it is easier to set up (add a new sidecar vs figure out how to build an image that contains both the agent and the worker).API-wise, the main container mode is used when the env variable
WORKER_COMMANDis used. Otherwise, the agent runs in sidecar container mode. The start up probe has defaults but can be configured by the env varsSTARTUP_PROBE_PATHandSTARTUP_PROBE_PORT.The agent image (for the sidecar container mode) can be built with
IN_PLACE_RESTART_AGENT_IMAGE_REGISTRY=<CHANGE ME> make in-place-restart-agent-image-pushwhich uses the Dockerfilecmd/in-place-restart-agent/Dockerfile. I plan to add the required pipeline to automatically build and push the agent image to the JobSet registry in a follow up PR.I also made sure to update the examples in
site/static/examples/in-place-restart/. Documentation will be added in a follow up PR.Which issue(s) this PR fixes:
Part of #467
Special notes for your reviewer:
None
Does this PR introduce a user-facing change?