Contributor

@GiuseppeTT GiuseppeTT commented Jan 27, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

tl;dr: Allow the in-place restart agent to run as a sidecar (sidecar container mode) instead of the main container (main container mode).

Up until now, the in-place restart agent could only be used in "main container" mode (see site/static/examples/in-place-restart/jobset-main-container-mode.yaml), in which the worker process runs inside the agent's container. In-place restarts are performed by restarting only this agent-worker container. The barrier is implemented by having the agent start the worker process only once the barrier is lifted.
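To illustrate the main container mode barrier described above, here is a minimal sketch, not the agent's actual code. The function name `startAfterBarrier`, the channel-based barrier signal, and the `echo` worker command are all hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// startAfterBarrier blocks until the lifted channel is closed, then calls start.
// Both the name and the channel-based signal are assumptions for illustration.
func startAfterBarrier(lifted <-chan struct{}, start func() error) error {
	<-lifted
	return start()
}

func main() {
	lifted := make(chan struct{})
	go func() {
		// Stand-in for whatever event lifts the barrier in the real agent.
		time.Sleep(100 * time.Millisecond)
		close(lifted)
	}()

	err := startAfterBarrier(lifted, func() error {
		// Placeholder worker command; the real agent would run WORKER_COMMAND.
		cmd := exec.Command("echo", "worker started")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		return cmd.Run()
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "worker failed:", err)
		os.Exit(1)
	}
}
```

Because the worker is a child process of the agent, restarting the single agent-worker container restarts both, which is why this mode needs an image that contains the agent and the worker.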

This PR adds the "sidecar container" mode (see site/static/examples/in-place-restart/jobset-sidecar-container-mode.yaml), in which the agent runs as a sidecar while the worker runs as the main container. In-place restarts are performed using the upstream RestartAllContainers feature (see https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-all-containers). This feature was introduced in Kubernetes 1.35 and its feature gate will be enabled by default in 1.36. The barrier is implemented by having the agent pass its startup probe only once the barrier is lifted, which allows the worker container to start running. The main advantage of the sidecar container mode is that it is easier to set up: you add a new sidecar instead of figuring out how to build an image that contains both the agent and the worker.
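A minimal sketch of the startup-probe barrier idea, assuming the probe is an HTTP GET: the endpoint answers 503 while the barrier is down and 200 once it is lifted. The `barrier` type, the `probe` helper, and the `/barrier` path are hypothetical; the agent's real handler may look different.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// barrier tracks whether the worker container is allowed to start.
type barrier struct {
	lifted atomic.Bool
}

// Lift marks the barrier as lifted; later probes succeed.
func (b *barrier) Lift() { b.lifted.Store(true) }

// ServeHTTP answers the startup probe: 503 while waiting, 200 once lifted.
func (b *barrier) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if !b.lifted.Load() {
		http.Error(w, "barrier not lifted", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// probe issues one in-memory GET against the handler and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	b := &barrier{}
	fmt.Println(probe(b, "/barrier")) // failing probe: worker stays blocked
	b.Lift()
	fmt.Println(probe(b, "/barrier")) // passing probe: worker may start
}
```

The kubelet does not start the containers that follow a sidecar until the sidecar's startup probe succeeds, so a failing probe is enough to hold the worker back without any change to the worker image.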

API-wise, main container mode is used when the env variable WORKER_COMMAND is set. Otherwise, the agent runs in sidecar container mode. The startup probe has defaults, but its path and port can be configured with the env vars STARTUP_PROBE_PATH and STARTUP_PROBE_PORT.
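The selection rule can be sketched roughly as follows. Only the three env variable names come from this PR; the `Mode` and `Config` types, the `loadConfig` function, and the default values `/barrier` and `8080` are hypothetical placeholders. Treating an empty WORKER_COMMAND as unset is also an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// Mode names are illustrative; they are not taken from the agent's source.
type Mode string

const (
	MainContainerMode    Mode = "main-container"
	SidecarContainerMode Mode = "sidecar-container"
)

// Config holds the settings derived from the environment.
type Config struct {
	Mode          Mode
	WorkerCommand string // only meaningful in main container mode
	ProbePath     string
	ProbePort     string
}

// loadConfig picks the mode from WORKER_COMMAND and reads the probe settings,
// falling back to the caller-supplied defaults when a variable is unset or empty.
func loadConfig(lookup func(string) (string, bool), defaultPath, defaultPort string) Config {
	cfg := Config{Mode: SidecarContainerMode, ProbePath: defaultPath, ProbePort: defaultPort}
	if cmd, ok := lookup("WORKER_COMMAND"); ok && cmd != "" {
		cfg.Mode = MainContainerMode
		cfg.WorkerCommand = cmd
	}
	if v, ok := lookup("STARTUP_PROBE_PATH"); ok && v != "" {
		cfg.ProbePath = v
	}
	if v, ok := lookup("STARTUP_PROBE_PORT"); ok && v != "" {
		cfg.ProbePort = v
	}
	return cfg
}

func main() {
	// "/barrier" and "8080" are placeholder defaults, not the agent's real ones.
	cfg := loadConfig(os.LookupEnv, "/barrier", "8080")
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in as a parameter keeps the rule easy to exercise with a plain map instead of the process environment.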

The agent image (for the sidecar container mode) can be built and pushed with IN_PLACE_RESTART_AGENT_IMAGE_REGISTRY=<CHANGE ME> make in-place-restart-agent-image-push, which uses the Dockerfile cmd/in-place-restart-agent/Dockerfile. I plan to add the pipeline required to automatically build and push the agent image to the JobSet registry in a follow-up PR.

I also updated the examples in site/static/examples/in-place-restart/. Documentation will be added in a follow-up PR.

Which issue(s) this PR fixes:

Part of #467

Special notes for your reviewer:

None

Does this PR introduce a user-facing change?

Add the sidecar container mode to the in-place restart agent.

@k8s-ci-robot k8s-ci-robot added the kind/feature (categorizes issue or PR as related to a new feature) label on Jan 27, 2026

netlify bot commented Jan 27, 2026

Deploy Preview for kubernetes-sigs-jobset ready!

Name Link
🔨 Latest commit 7149f4d
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-jobset/deploys/6980f4fd24de67000876afc8
😎 Deploy Preview https://deploy-preview-1140--kubernetes-sigs-jobset.netlify.app

@k8s-ci-robot k8s-ci-robot requested a review from ahg-g January 27, 2026 21:51
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: GiuseppeTT

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved (PR has been approved by an approver from all required OWNERS files), cncf-cla: yes (the PR's author has signed the CNCF CLA), and size/XL (PR changes 500-999 lines, ignoring generated files) labels on Jan 27, 2026
@GiuseppeTT
Contributor Author

cc @ahg-g @kannon92 @andreyvelich

@GiuseppeTT GiuseppeTT changed the title from "In place restart sidecar" to "Add sidecar mode to the in-place restart agent" on Jan 28, 2026
for _, pod := range associatedPods.Items {
	// Skip it if the pod is failed
	// Failed Pods might persist while their new copy already exists
	if pod.Status.Phase == corev1.PodFailed {
Contributor

was this a bug?

Contributor Author

Yes. I am not sure how I didn't catch this in earlier tests 😅.

Example of how the bug might manifest

Starting case. Workload has only 2 Pods.

  • pod-0 : Running
  • pod-1 : Running

pod-0 fails.

  • pod-0 : Failed
  • pod-1 : Running

The Job controller creates the replacement pod-0-new because backoffLimit > 0.

  • pod-0 : Failed
  • pod-0-new : Waiting at the barrier
  • pod-1 : Running

pod-0 is fully terminated, but its object might persist for a long time. The JobSet controller should ignore pod-0 and consider only pod-0-new and pod-1 for in-place restart. This change ensures that it does.
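The scenario above can be reproduced with a small self-contained sketch. The `Pod` and `PodPhase` types here are simplified stand-ins for the corev1 types, and `activePods` is a hypothetical helper, not the controller's actual function.

```go
package main

import "fmt"

// PodPhase and Pod are simplified stand-ins for corev1.PodPhase and corev1.Pod.
type PodPhase string

const (
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
)

type Pod struct {
	Name  string
	Phase PodPhase
}

// activePods drops failed Pods, whose objects may outlive their replacements.
func activePods(pods []Pod) []Pod {
	var active []Pod
	for _, pod := range pods {
		if pod.Phase == PodFailed {
			continue // a failed Pod must not count toward in-place restart
		}
		active = append(active, pod)
	}
	return active
}

func main() {
	pods := []Pod{
		{Name: "pod-0", Phase: PodFailed},
		{Name: "pod-0-new", Phase: PodRunning}, // waiting at the barrier
		{Name: "pod-1", Phase: PodRunning},
	}
	for _, p := range activePods(pods) {
		fmt.Println(p.Name)
	}
}
```

Without the filter, the stale pod-0 object would be counted alongside pod-0-new, so the controller would see three Pods for a workload that has only two.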

@GiuseppeTT GiuseppeTT changed the title from "Add sidecar mode to the in-place restart agent" to "Add sidecar container mode to the in-place restart agent" on Feb 2, 2026