feat: Implement SandboxWarmPool recreate on template updates #347

Open
shrutiyam-glitch wants to merge 13 commits into kubernetes-sigs:main from shrutiyam-glitch:swp-rollout

Conversation

@shrutiyam-glitch
Contributor

@shrutiyam-glitch shrutiyam-glitch commented Feb 26, 2026

Fixes #323

This PR implements the rollout logic for SandboxWarmPool when its associated SandboxTemplate is updated.

This change adds two update strategies—Recreate and OnReplenish—allowing users to control how stale pods are handled.

Changes:

1. Update Strategy Definition
Added a new UpdateStrategy field to the SandboxWarmPool specification.

  • Recreate (Default): Ensures the pool contains only fresh sandboxes by immediately deleting stale sandboxes when the template is updated.
  • OnReplenish: Retains existing sandboxes even if they are stale; a stale sandbox is replaced only after it is manually deleted or claimed from the pool. This applies both to changes in the referenced SandboxTemplate and to a change of the sandboxTemplateRef name in the SandboxWarmPool.
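
As a rough illustration of the two strategies and the Recreate default, here is a minimal sketch of how the API type could look. The type, constant, and function names are hypothetical, loosely modelled on StatefulSet naming; the PR's actual field layout may differ (for example, the field could be a plain string rather than a struct).

```go
package main

// SandboxWarmPoolUpdateStrategyType names how a warm pool reacts when its
// template changes. Type and constant names here are hypothetical.
type SandboxWarmPoolUpdateStrategyType string

const (
	// RecreateSandboxWarmPoolStrategyType deletes stale sandboxes as soon as
	// the template changes.
	RecreateSandboxWarmPoolStrategyType SandboxWarmPoolUpdateStrategyType = "Recreate"
	// OnReplenishSandboxWarmPoolStrategyType keeps stale sandboxes until they
	// are manually deleted or claimed.
	OnReplenishSandboxWarmPoolStrategyType SandboxWarmPoolUpdateStrategyType = "OnReplenish"
)

// SandboxWarmPoolUpdateStrategy is an assumed shape for the new
// UpdateStrategy field in the SandboxWarmPool spec.
type SandboxWarmPoolUpdateStrategy struct {
	Type SandboxWarmPoolUpdateStrategyType `json:"type,omitempty"`
}

// effectiveStrategy applies the default: an unset strategy means Recreate.
func effectiveStrategy(s *SandboxWarmPoolUpdateStrategy) SandboxWarmPoolUpdateStrategyType {
	if s == nil || s.Type == "" {
		return RecreateSandboxWarmPoolStrategyType
	}
	return s.Type
}
```

Resolving the default in one helper keeps the reconcile path free of repeated nil and empty-string checks.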

2. Template Versioning and Tracking

  • Template Hashing: Implemented a hash over the SandboxTemplate's name and spec.podTemplate to detect changes accurately.
  • Semantic Equality Check: If the current template's hash does not match the hash label on the pod, the controller compares the spec.podTemplate.spec of the current template with the spec.podTemplate.spec of the current sandbox for semantic equality. The boolean result is cached with the hash as the key, so the comparison is not repeated for every sandbox in the warm pool.
  • Pod Labeling: Every pod created by the warm pool is now labeled with a hash of the specific template version used during its creation.
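
The hash-then-compare flow above can be sketched as follows. This is a simplified illustration, not the PR's code: `PodSpec`, `Template`, and `stalenessChecker` are hypothetical stand-ins for the real API types, FNV-1a over JSON is an assumed hash choice, and `reflect.DeepEqual` stands in for a semantic comparison such as `equality.Semantic.DeepEqual` from k8s.io/apimachinery. Only `computeTemplateHash` is a name taken from the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
	"reflect"
)

// PodSpec is a simplified stand-in for spec.podTemplate.spec.
type PodSpec struct {
	Image string   `json:"image"`
	Args  []string `json:"args,omitempty"`
}

// Template is a simplified stand-in for a SandboxTemplate.
type Template struct {
	Name    string  `json:"name"`
	PodSpec PodSpec `json:"podSpec"`
}

// computeTemplateHash hashes the template name and pod spec with FNV-1a.
func computeTemplateHash(t Template) string {
	data, err := json.Marshal(t)
	if err != nil {
		return "" // not reachable for these plain struct types
	}
	h := fnv.New32a()
	h.Write(data) // hash.Hash.Write never returns an error
	return fmt.Sprintf("%x", h.Sum32())
}

// stalenessChecker caches the comparison result per hash label so the
// deep comparison runs at most once per distinct label value.
type stalenessChecker struct {
	current     Template
	currentHash string
	staleByHash map[string]bool
}

func newStalenessChecker(current Template) *stalenessChecker {
	return &stalenessChecker{
		current:     current,
		currentHash: computeTemplateHash(current),
		staleByHash: map[string]bool{},
	}
}

// isStale reports whether a sandbox labelled with podHash is out of date.
// A hash mismatch alone is not conclusive; the specs are compared as well.
func (c *stalenessChecker) isStale(podHash string, podSpec PodSpec) bool {
	if podHash == c.currentHash {
		return false
	}
	if stale, ok := c.staleByHash[podHash]; ok {
		return stale
	}
	stale := !reflect.DeepEqual(c.current.PodSpec, podSpec)
	c.staleByHash[podHash] = stale
	return stale
}
```

The cache assumes that all sandboxes carrying the same hash label share the same pod spec, which is what makes one comparison per label value sufficient.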

3. Controller Logic Updates

  • Template Watches: The SandboxWarmPool controller now watches SandboxTemplate resources. It uses EnqueueRequestsFromMapFunc to find every SandboxWarmPool that references a modified template and enqueue it for reconciliation.
  • Staleness Detection: During reconciliation, the controller identifies "stale" pods by comparing their template hash label against the current template's hash.
  • Automated Rollout: If the strategy is set to Recreate, the controller deletes identified stale pods, allowing the standard reconciliation loop to replace them with pods based on the new template.
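
The mapping and rollout steps above can be sketched in plain Go without the controller-runtime plumbing. The struct types and `stalePodsToDelete` are hypothetical simplifications; the sketch also assumes a warm pool references a template in its own namespace and that claimed sandboxes are skipped, as the testing notes in this PR describe. In the real controller the mapping function would be passed to `handler.EnqueueRequestsFromMapFunc`.

```go
package main

// NamespacedName identifies an object to reconcile (simplified stand-in
// for types.NamespacedName).
type NamespacedName struct {
	Namespace string
	Name      string
}

// WarmPool is a simplified stand-in for a SandboxWarmPool.
type WarmPool struct {
	Namespace   string
	Name        string
	TemplateRef string
	Strategy    string // "Recreate" or "OnReplenish"; empty means Recreate
}

// Pod is a simplified stand-in for a warm pool pod.
type Pod struct {
	Name         string
	TemplateHash string // value of the template hash label
	Claimed      bool   // true once bound to a SandboxClaim
}

// findWarmPoolsForTemplate returns a reconcile request for every warm pool
// in the template's namespace that references the modified template.
func findWarmPoolsForTemplate(namespace, templateName string, pools []WarmPool) []NamespacedName {
	var requests []NamespacedName
	for _, p := range pools {
		if p.Namespace == namespace && p.TemplateRef == templateName {
			requests = append(requests, NamespacedName{Namespace: p.Namespace, Name: p.Name})
		}
	}
	return requests
}

// stalePodsToDelete returns the names of pods the Recreate strategy would
// delete: unclaimed pods whose hash label differs from the current hash.
// Under OnReplenish nothing is deleted proactively.
func stalePodsToDelete(pool WarmPool, currentHash string, pods []Pod) []string {
	if pool.Strategy == "OnReplenish" {
		return nil
	}
	var names []string
	for _, pod := range pods {
		if pod.Claimed || pod.TemplateHash == currentHash {
			continue
		}
		names = append(names, pod.Name)
	}
	return names
}
```

Once the stale pods are gone, the existing replica-count reconciliation creates replacements from the new template, so no separate "create" step is needed in the rollout path.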

Testing Performed:

  • Verified that updating a SandboxTemplate image triggers the deletion of idle sandboxes in the associated SandboxWarmPool.
  • Verified that sandboxes already bound to a SandboxClaim are NOT deleted during a template update.
  • Verified that changing the templateRef (with new spec) on the SandboxWarmPool itself triggers a full pool rotation.
  • Verified that changing the templateRef (with old spec) on the SandboxWarmPool itself triggers a full pool rotation.

Unit tests added:

  • TestReconcilePool_TemplateUpdateRollout
  • TestFindWarmPoolsForTemplate

Steps followed:

  • Created a SandboxTemplate, a SandboxWarmPool with the default Recreate strategy, and a SandboxClaim.
$ kubectl get sandbox,sandboxclaim,sandboxtemplate -n sandbox-test
NAME                                                AGE
sandbox.agents.x-k8s.io/python-sdk-warmpool-kvjrz   3m
sandbox.agents.x-k8s.io/python-sdk-warmpool-wltr6   3m11s
sandbox.agents.x-k8s.io/python-sdk-warmpool-xhxwg   3m11s

NAME                                                          AGE
sandboxclaim.extensions.agents.x-k8s.io/sandbox-claim-new-1   3m2s

NAME                                                                 AGE
sandboxtemplate.extensions.agents.x-k8s.io/pct                       111m
sandboxtemplate.extensions.agents.x-k8s.io/python-counter-template   23h

Sandbox python-sdk-warmpool-wltr6 is adopted by the claim sandbox-claim-new-1.

  • Updated the spec of python-counter-template and applied.
$ kubectl get sandbox -n sandbox-test
NAME                        AGE
python-sdk-warmpool-wk89c   22s
python-sdk-warmpool-wltr6   4m24s
python-sdk-warmpool-zt7mk   21s

The unclaimed warm pool sandboxes are recreated from the updated template spec; the claimed sandbox python-sdk-warmpool-wltr6 is left untouched.

  • Updated spec.sandboxTemplateRef in the SandboxWarmPool manifest to point to the sandbox template pct. With the default Recreate strategy, all unclaimed sandboxes are recreated.
$ kubectl get sandbox -n sandbox-test
NAME                        AGE
python-sdk-warmpool-2pdfp   5s
python-sdk-warmpool-5cqbv   4s
python-sdk-warmpool-wltr6   5m59s

@netlify

netlify bot commented Feb 26, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 1af3e86
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69bda4a74b99310008cfe0e5

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 26, 2026
@k8s-ci-robot
Contributor

Hi @shrutiyam-glitch. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 26, 2026
@shrutiyam-glitch shrutiyam-glitch changed the title feat: Implement SandboxWarmPool rollout on template updates feat: Implement SandboxWarmPool recreate on template updates Feb 26, 2026

@dhenkel92 dhenkel92 left a comment

Thank you for working on this feature. It's painful to roll out changes to a warm pool right now 🙂

@janetkuo janetkuo added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 27, 2026
@shrutiyam-glitch
Contributor Author

/retest

Contributor

@igooch igooch left a comment

Overall good work automating the warmpool pod recreation via the new spec hash label.

A few minor suggestions for performance and edge cases in the inline comments.

@lyj7890

lyj7890 commented Mar 4, 2026

This feature would be really helpful for our use case! Looking forward to the merge. 👍

@janetkuo janetkuo self-assigned this Mar 9, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shrutiyam-glitch
Once this PR has been reviewed and has the lgtm label, please ask for approval from janetkuo. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 10, 2026
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 10, 2026
@shrutiyam-glitch shrutiyam-glitch requested a review from igooch March 10, 2026 17:53
Contributor

@igooch igooch left a comment

Good work! A few comments, but the bulk of the logic looks good.

It would be good to have an e2e test for this, although that does not need to be a part of this PR.

Contributor

Since the contents of this file are only used within the Warmpool controller I'd recommend moving them back into that file.


// SandboxWarmPool update strategies
RecreateStrategy = "Recreate"
OnDeleteStrategy = "OnDelete"
Contributor

Recommend putting these in the api/v1alpha1 type definition, with similar naming pattern as StatefulSet, so that API consumers can use the values.

https://github.com/kubernetes/kubernetes/blob/802b3f744b2378b8dc4e880e326b8ad620ffc2fe/pkg/apis/apps/types.go#L77-L92

}

if isOrphan {
// Pod has no controller - adopt it
Contributor

This only happens if the strategy is OnDelete, correct? Could you update the comment to reflect this?

Contributor Author

So, if the strategy is Recreate and if the orphaned pod is not stale, the controller will go ahead and adopt the pod.

}
}

func TestReconcilePool_TemplateUpdateRollout(t *testing.T) {
Contributor

Could you update these to use the testCases := []struct{} ... for _, tc := range testCases {} pattern here?
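
For readers unfamiliar with the table-driven pattern requested here, a minimal sketch follows. `shouldDeleteSandbox` and `TestShouldDeleteSandbox` are hypothetical and deliberately much smaller than TestReconcilePool_TemplateUpdateRollout; they only show the shape of the `testCases` slice and the loop. To run it with `go test`, the test function would live in a `_test.go` file.

```go
package main

import "testing"

// shouldDeleteSandbox is a hypothetical helper: a sandbox is deleted only
// under the Recreate strategy, when it is stale and not claimed.
func shouldDeleteSandbox(strategy string, stale, claimed bool) bool {
	return strategy == "Recreate" && stale && !claimed
}

func TestShouldDeleteSandbox(t *testing.T) {
	testCases := []struct {
		name     string
		strategy string
		stale    bool
		claimed  bool
		want     bool
	}{
		{name: "recreate deletes stale idle sandbox", strategy: "Recreate", stale: true, want: true},
		{name: "recreate keeps claimed sandbox", strategy: "Recreate", stale: true, claimed: true, want: false},
		{name: "recreate keeps fresh sandbox", strategy: "Recreate", stale: false, want: false},
		{name: "on-replenish keeps stale sandbox", strategy: "OnReplenish", stale: true, want: false},
	}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			got := shouldDeleteSandbox(tc.strategy, tc.stale, tc.claimed)
			if got != tc.want {
				t.Errorf("shouldDeleteSandbox(%q, %t, %t) = %t, want %t",
					tc.strategy, tc.stale, tc.claimed, got, tc.want)
			}
		})
	}
}
```

Each scenario becomes one row, so adding a new strategy or edge case means adding a line to the table rather than another copy of the setup code.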

}

// reconcilePool ensures the correct number of pods exist in the pool
func (r *SandboxWarmPoolReconciler) reconcilePool(ctx context.Context, warmPool *extensionsv1alpha1.SandboxWarmPool) error {
Member

@janetkuo janetkuo Mar 13, 2026

This function is getting really long and hard to review/maintain. Break it up into smaller helper functions. See go/small-functions

template, tmplErr := r.getTemplate(ctx, warmPool)
var currentTemplateHash string
if tmplErr == nil {
currentTemplateHash = computeTemplateHash(template)
Member

@janetkuo janetkuo Mar 13, 2026

It's not safe to directly compute template hash for comparison, because pod schema changes in upstream will affect the value even if the template hasn't changed. This will end up deleting all warmed resources after a cluster upgrade. The value should only be computed once and added as a label, and later get the hash value from labels to compare. The same pattern is used in Deployment with pod-template-hash label.

Member

Per discussion, if the Warmpool can create sandboxes directly (#390), we can use sandbox spec to compare directly.

@brandonroyal
Contributor

This is a relatively small nit but I'd suggest we reconsider the name OnDelete here as it implies that manual deletion is the event that triggers the new SandboxTemplate version to be used. Instead, we might consider something like OnReplenish which means that the new SandboxTemplate pods are deployed in the pool as they're replenished. That could be from a delete event or from the pod being claimed.

@shrutiyam-glitch
Contributor Author

/hold
dependent on #395

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 19, 2026
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2026
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 20, 2026
@shrutiyam-glitch
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 20, 2026

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] SandboxWarmPool rollout on template updates

8 participants