fix: verify pod ownership before operating on annotated pods #419

ArmandoHerra wants to merge 8 commits into kubernetes-sigs:main
Conversation
Welcome @ArmandoHerra!

Hi @ArmandoHerra. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected.

/ok-to-test
controllers/sandbox_controller.go
Outdated
    if pod.Labels == nil {
        pod.Labels = make(map[string]string)
    }
    pod.Labels[sandboxLabel] = nameHash
nit: consider applying the label after the switch statement
Comment addressed, please re-review
        "Owner.Kind", controllerRef.Kind, "Owner.Name", controllerRef.Name)

    if _, exists := sandbox.Annotations[SandboxPodNameAnnotation]; exists {
        patch := client.MergeFrom(sandbox.DeepCopy())
Please consider adding an informational log here, e.g., log.Info("Removing pod name annotation from sandbox", "Sandbox.Name", sandbox.Name). This would keep the logging behavior consistent with how the annotation is removed in the replicas=0 path above.
Comment addressed, please re-review
        wantSandboxAnnotations: map[string]string{"other-annotation": "keep-me"},
    },
    {
        name: "refuses to adopt annotated pod owned by a different controller",
Since this test asserts the behavior of refusing to adopt an annotated pod owned by a different controller, we should also verify that the malicious SandboxPodNameAnnotation was successfully stripped.
Would you mind adding a wantSandboxAnnotations: map[string]string{} to this test case to enforce that the controller correctly cleans up the annotation?
Comment addressed, please re-review
…ntefb

- Deduplicate label init and assignment by moving after switch block
- Add log.Info before annotation removal in podOwnedByOther for consistency
- Assert wantSandboxAnnotations in "refuses to adopt" test case
- Normalize nil/empty annotation map comparison in test harness
Comments were addressed, @vicentefb. Please re-review and let me know if there is anything else to fix. Thanks!
controllers/sandbox_controller.go
Outdated
    // checkPodOwnership determines whether a Pod is owned by the given Sandbox,
    // has no controller, or is owned by a different controller.
    func checkPodOwnership(pod *corev1.Pod, sandbox *sandboxv1alpha1.Sandbox) podOwnership {
Thinking more about this method...
In handleSandboxExpiry(), the controller issues a Delete request for a pod and service named sandbox.Name without verifying ownership. Someone could name their sandbox after a "victim" pod and set ShutdownTime to trigger an immediate unauthorized deletion.
WDYT of changing the signature to func checkOwnership(obj client.Object, sandbox *sandboxv1alpha1.Sandbox) resourceOwnership so you can reuse this verification logic in handleSandboxExpiry() to safely check ownership of both the pod and service before issuing any Delete requests?
Great catch, @vicentefb. I agree completely, and I think this is the right call.
The current handleSandboxExpiry() blindly deletes by name, which means someone could create a Sandbox named after a victim pod, set an expired ShutdownTime, and trigger an unauthorized deletion on the next reconcile. That's a real escalation vector in multi-tenant namespaces.
Here's what I'm planning:
- Generalize the ownership check
Rename podOwnership → resourceOwnership and change the signature to:
func checkOwnership(obj client.Object, sandbox *sandboxv1alpha1.Sandbox) (resourceOwnership, *metav1.OwnerReference)
Using client.Object instead of *corev1.Pod lets us reuse this for both Pods and Services. Returning the *metav1.OwnerReference as the second value also addresses your optimization comments: we can use the returned controllerRef directly in the podOwnedByOther log messages instead of calling metav1.GetControllerOf() again (covers both the replicas=0 path and the adoption path).
- Harden handleSandboxExpiry()
Before each r.Delete(), do an r.Get() on the live object, run checkOwnership(), and only proceed if resourceOwnedBySandbox. For unowned or foreign-owned resources, log a warning and skip the deletion.
Same pattern for the Service deletion.
- Tests
I'll add expiry-specific test cases covering the attack scenario (foreign-owned pod survives expiry), unowned resources, and the happy path. The existing expiry tests will need OwnerReferences added to their initialObjs so they still pass under the stricter ownership check.
I will push a follow-up commit shortly.
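To make the shape of that refactor concrete, here is a minimal, self-contained sketch of the proposed checkOwnership() decision logic. OwnerReference and Object below are simplified stand-ins for metav1.OwnerReference and client.Object (the real controller would use the Kubernetes API types directly), and the UIDs are illustrative:

```go
package main

import "fmt"

// Simplified stand-ins for metav1.OwnerReference and client.Object so the
// sketch is self-contained; the real controller uses the Kubernetes API types.
type OwnerReference struct {
	Kind       string
	Name       string
	UID        string
	Controller bool
}

type Object struct {
	OwnerReferences []OwnerReference
}

type resourceOwnership int

const (
	resourceOwnedBySandbox resourceOwnership = iota
	resourceUnowned
	resourceOwnedByOther
)

// getControllerOf mirrors metav1.GetControllerOf: it returns the owner
// reference flagged as the managing controller, or nil if there is none.
func getControllerOf(obj *Object) *OwnerReference {
	for i := range obj.OwnerReferences {
		if obj.OwnerReferences[i].Controller {
			return &obj.OwnerReferences[i]
		}
	}
	return nil
}

// checkOwnership classifies an object relative to a sandbox UID and also
// returns the controllerRef so callers can log it without re-fetching.
func checkOwnership(obj *Object, sandboxUID string) (resourceOwnership, *OwnerReference) {
	controllerRef := getControllerOf(obj)
	if controllerRef == nil {
		return resourceUnowned, nil
	}
	if controllerRef.UID == sandboxUID {
		return resourceOwnedBySandbox, controllerRef
	}
	return resourceOwnedByOther, controllerRef
}

func main() {
	// A "victim" pod controlled by some other workload must never be deleted.
	victim := &Object{OwnerReferences: []OwnerReference{
		{Kind: "ReplicaSet", Name: "web", UID: "other-uid", Controller: true},
	}}
	ownership, ref := checkOwnership(victim, "sandbox-uid")
	fmt.Println(ownership == resourceOwnedByOther, ref.Kind) // true ReplicaSet
}
```

Because the verdict and the controllerRef come back together, both the expiry path and the adoption path can log Owner.Kind/Owner.Name from the returned reference without a second GetControllerOf lookup.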
controllers/sandbox_controller.go
Outdated
        log.Info("Refusing to delete pod: pod has no controllerRef pointing to this sandbox",
            "Pod.Name", pod.Name, "Sandbox.Name", sandbox.Name)
    case podOwnedByOther:
        controllerRef := metav1.GetControllerOf(pod)
optimization: metav1.GetControllerOf(pod) is re-fetched here after checkPodOwnership() already extracted it. You can update checkPodOwnership() to return (podOwnership, *metav1.OwnerReference) to supply the controllerRef directly, avoiding redundant lookups.
controllers/sandbox_controller.go
Outdated
    ownership := checkPodOwnership(pod, sandbox)
    switch ownership {
    case podOwnedByOther:
        controllerRef := metav1.GetControllerOf(pod)
optimization: Similar to the comment above, metav1.GetControllerOf(pod) is invoked again. Returning the controllerRef directly from the checkPodOwnership() helper would streamline these paths.
vicentefb
left a comment
Thanks @ArmandoHerra, I just left a couple of optimization/perf comments; happy to lgtm afterwards.
Comments addressed, @vicentefb. Let me know if any other concerns or optimizations come to your mind. Thanks!
controllers/sandbox_controller.go
Outdated
    },
    // Delete pod only if owned by this sandbox
    pod := &corev1.Pod{}
    if err := r.Get(ctx, types.NamespacedName{Name: sandbox.Name, Namespace: sandbox.Namespace}, pod); err != nil {
The pod lookup strictly uses sandbox.Name here; however, if the sandbox had successfully adopted a warm pool pod, its name would be tracked via the agents.x-k8s.io/pod-name annotation. Because this annotation is not evaluated here, the adopted pod will not be correctly deleted on expiry when ShutdownPolicy is Retain (leading to a compute resource leak).
Try extracting the pod name resolution logic into a shared helper function (e.g. resolvePodName(sandbox) string) and use it here
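The suggested helper could look like the following minimal sketch. The Sandbox struct here is a stand-in for the CRD type carrying only the fields the helper reads, and the pod names are illustrative:

```go
package main

import "fmt"

// SandboxPodNameAnnotation is the annotation key discussed in this PR.
const SandboxPodNameAnnotation = "agents.x-k8s.io/pod-name"

// Sandbox is a minimal stand-in for the CRD type: only the fields read here.
type Sandbox struct {
	Name        string
	Annotations map[string]string
}

// resolvePodName prefers the adopted warm pool pod's name from the annotation
// and falls back to the sandbox's own name when the annotation is absent.
func resolvePodName(sandbox *Sandbox) string {
	if name, ok := sandbox.Annotations[SandboxPodNameAnnotation]; ok && name != "" {
		return name
	}
	return sandbox.Name
}

func main() {
	adopted := &Sandbox{
		Name:        "sb-1",
		Annotations: map[string]string{SandboxPodNameAnnotation: "warm-pod-7"},
	}
	fmt.Println(resolvePodName(adopted))                // warm-pod-7
	fmt.Println(resolvePodName(&Sandbox{Name: "sb-2"})) // sb-2
}
```

With this helper, both reconcilePod() and handleSandboxExpiry() resolve the pod name the same way, so an adopted warm pool pod is found and deleted on expiry instead of leaking.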
Hey @vicentefb, I just pushed the changes addressing your latest round of feedback. Here's a summary of everything in the last two commits.

Good catch on the pod lookup: that's a silent compute resource leak when ShutdownPolicy is Retain, since an adopted warm pool pod's name is tracked in the agents.x-k8s.io/pod-name annotation and never matches sandbox.Name on expiry. I extracted the pod name resolution logic into a shared resolvePodName() helper that falls back to sandbox.Name, switched handleSandboxExpiry() over to it, and refactored reconcilePod() to use the same helper. I added TestResolvePodName and a warm pool expiry integration test.

While working through your feedback, I did a security pass across all the resource reconciliation paths to make sure we weren't leaving any other confused-deputy gaps open. It turns out reconcileService() had the same class of issue. The attack surface is narrower than the pod case since it doesn't lead to unauthorized deletion, but it does allow a sandbox to claim a foreign service in its status. Same CWE-863 class as Issue #265. The fix mirrors reconcilePod(): reject services owned by a different controller, adopt unowned ones via SetControllerReference + Update.

Added TestReconcileService with 4 table-driven test cases. All 36 tests pass, build and vet clean. With these two commits, every resource path in the controller now verifies ownership before acting.
            service.Name, controllerRef.Kind, controllerRef.Name, sandbox.Name)

    case resourceUnowned:
        log.Info("Adopting unowned service", "Service.Name", service.Name, "Sandbox.Name", sandbox.Name)
Not sure if I'm overthinking this, but when the service adoption happens we aren't verifying the selector or ports... should we enforce the intended spec so that it doesn't lead to hijacks of the sandbox's traffic?
controllers/sandbox_controller.go
Outdated
    case resourceOwnedByOther:
        log.Info("Refusing to delete pod: pod is owned by a different controller",
            "Pod.Name", pod.Name, "Sandbox.Name", sandbox.Name,
            "Owner.Kind", controllerRef.Kind, "Owner.Name", controllerRef.Name)
Should we log controllerRef.UID here as well?
/assign @janetkuo
Hey @vicentefb, I pushed two more commits addressing the remaining feedback and a proactive security hardening.

You raised a good point about the service adoption path not verifying the selector or ports. I agree: adopting a service as-is without enforcing the intended spec could lead to traffic being routed somewhere unexpected. The adoption path now validates that ClusterIP is headless before adopting an unowned service (refusing with an error if the immutable ClusterIP mismatches) and enforces the sandbox label and selector on adopted services to prevent traffic hijack.

Also applied your suggestion to log Owner.UID: it's now included in all resourceOwnedByOther log.Info and fmt.Errorf messages across reconcilePod, reconcileService, and handleSandboxExpiry.
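The spec-enforcement idea can be sketched as follows. This is a simplified stand-in for corev1.Service (just labels and selector), and the sandboxLabel key and nameHash value are hypothetical placeholders for whatever the controller actually uses:

```go
package main

import "fmt"

// sandboxLabel is a hypothetical label key standing in for the controller's
// real one; the enforcement pattern is what matters here.
const sandboxLabel = "agents.x-k8s.io/sandbox"

// Service is a minimal stand-in for corev1.Service.
type Service struct {
	Labels   map[string]string
	Selector map[string]string
}

// enforceSandboxSpec sets the sandbox label and, crucially, replaces (rather
// than merges) the selector on an adopted service, so traffic can only reach
// the sandbox's own pod and not whatever a pre-created service pointed at.
func enforceSandboxSpec(svc *Service, nameHash string) {
	if svc.Labels == nil {
		svc.Labels = map[string]string{}
	}
	svc.Labels[sandboxLabel] = nameHash
	svc.Selector = map[string]string{sandboxLabel: nameHash}
}

func main() {
	// A maliciously pre-created service pointing at an attacker's pod.
	hijack := &Service{Selector: map[string]string{"app": "attacker-pod"}}
	enforceSandboxSpec(hijack, "abc123")
	fmt.Println(hijack.Selector[sandboxLabel], len(hijack.Selector)) // abc123 1
}
```

Replacing the selector wholesale (instead of merging keys into it) is the design choice that closes the hijack: any attacker-supplied selector entries are dropped rather than left in place alongside the sandbox's own.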
While performing another security pass across the controller to ensure all resource paths had consistent ownership checks, I found that reconcilePVCs() had the same confused deputy gap. The naming scheme is deterministic, so an attacker with namespace access could pre-create a PVC with the right name before the sandbox controller runs. The sandbox pod would then mount attacker-controlled storage, opening the door for data injection (pre-populate the volume with malicious configs or binaries the workload trusts) or data exfiltration (read back whatever the sandbox writes). Same confused deputy class as the original Issue #265, just applied to PVCs instead of Pods. The fix follows the exact same pattern: reject PVCs owned by a different controller, adopt unowned ones via SetControllerReference + Update.

Added TestReconcilePVCs with 4 table-driven test cases. All three resource types (Pod, Service, PVC) now have consistent ownership verification.
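The PVC decision described above reduces to a three-way switch on the controllerRef. The sketch below is self-contained (PVC and OwnerReference are simplified stand-ins for the Kubernetes types, and decidePVCAction is a hypothetical name), with the real Kubernetes actions noted in comments:

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for metav1.OwnerReference.
type OwnerReference struct {
	UID        string
	Controller bool
}

// PVC is a minimal stand-in for corev1.PersistentVolumeClaim.
type PVC struct {
	Name            string
	OwnerReferences []OwnerReference
}

// decidePVCAction sketches the adopt-or-reject decision: a pre-created,
// foreign-owned claim is refused so the sandbox never mounts attacker storage.
func decidePVCAction(pvc *PVC, sandboxUID string) string {
	var controllerRef *OwnerReference
	for i := range pvc.OwnerReferences {
		if pvc.OwnerReferences[i].Controller {
			controllerRef = &pvc.OwnerReferences[i]
			break
		}
	}
	switch {
	case controllerRef == nil:
		return "adopt" // SetControllerReference + Update in the real controller
	case controllerRef.UID == sandboxUID:
		return "reuse" // already owned by this sandbox, nothing to do
	default:
		return "reject" // log Owner.UID and surface an error
	}
}

func main() {
	// An attacker pre-created a PVC with the sandbox's deterministic name.
	preCreated := &PVC{
		Name:            "sandbox-data",
		OwnerReferences: []OwnerReference{{UID: "attacker-uid", Controller: true}},
	}
	fmt.Println(decidePVCAction(preCreated, "sandbox-uid")) // reject
}
```

The same switch shape applies to Pods and Services, which is what makes the shared checkOwnership() helper pay off across all three resource types.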
@ArmandoHerra: The following test failed.

Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
The Sandbox controller's reconcilePod() used the user-controlled agents.x-k8s.io/pod-name annotation as the sole source of truth for pod operations. This allowed privilege escalation where a malicious Sandbox could delete or adopt any pod in the namespace. Add checkPodOwnership() helper that validates the pod's controllerRef UID against the Sandbox UID before any pod operation. Pods owned by a different controller are rejected. Pods with no controllerRef are only adopted (not deleted). Only pods with a controllerRef UID matching the requesting Sandbox can be deleted.
Generated during e2e test runs via `make test-e2e` and Kind cluster deployments. These directories contain test logs, pod dumps, and Python virtual environments that should not be tracked.
…ntefb

- Deduplicate label init and assignment by moving after switch block
- Add log.Info before annotation removal in podOwnedByOther for consistency
- Assert wantSandboxAnnotations in "refuses to adopt" test case
- Normalize nil/empty annotation map comparison in test harness

- Rename podOwnership → resourceOwnership and generalize checkPodOwnership() → checkOwnership(client.Object), returning (resourceOwnership, *metav1.OwnerReference) to eliminate redundant GetControllerOf() calls in reconcilePod()
- Harden handleSandboxExpiry() to Get + checkOwnership before every Delete, preventing unauthorized deletion of pods/services not owned by the sandbox
- Add TestCheckOwnership covering Pod and Service in all ownership states, 3 new expiry attack-prevention test cases, and wantSurvivingObjs field in TestReconcile

- Add resolvePodName() that resolves pod name from the agents.x-k8s.io/pod-name annotation, falling back to sandbox.Name
- Fix handleSandboxExpiry() to use resolvePodName() instead of hard-coded sandbox.Name, preventing adopted warm pool pods from leaking when ShutdownPolicy is Retain
- Refactor reconcilePod() to use the shared helper
- Add TestResolvePodName and warm pool expiry integration test

- Add checkOwnership() call in reconcileService() when an existing service is found, mirroring the pattern in reconcilePod()
- Reject services owned by a different controller with logged warning
- Adopt unowned services via SetControllerReference + Update
- Add TestReconcileService with 4 table-driven test cases

…logs

- Validate ClusterIP is headless before adopting unowned services, refuse with error if immutable ClusterIP mismatches
- Enforce sandbox label and selector on adopted services to prevent traffic hijack
- Add Owner.UID to all resourceOwnedByOther log.Info and fmt.Errorf messages across reconcilePod, reconcileService, and handleSandboxExpiry for forensic debuggability

- Add checkOwnership() call in reconcilePVCs() when an existing PVC is found, closing the last confused deputy gap
- Reject PVCs owned by a different controller with Owner.UID logging
- Adopt unowned PVCs via SetControllerReference + Update
- Add TestReconcilePVCs with 4 table-driven test cases
- All three resource types (Pod, Service, PVC) now have consistent ownership verification
Force-pushed from 3d2418c to 985a58d
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ArmandoHerra, vicentefb. Needs approval from an approver in each of the affected files.
@vicentefb Rebased onto current main to pick up the 30 commits that landed since this branch was created, notably PR #395 (warm pool refactor) and PR #438 (kubeapilinter fix).
PR needs rebase.
codebot-robot
left a comment
Overall, the PR looks excellent. It systematically addresses the vulnerability by introducing centralized ownership verification (checkOwnership), ensuring that pods, services, and PVCs are only deleted or adopted when appropriate. The modifications to handleSandboxExpiry perfectly prevent unauthorized deletions when an expired Sandbox points to a foreign Pod.
I've left several comments focusing primarily on observability (logging), minor edge cases (like trailing spaces, handling of empty uids, and mismatched service/PVC specs during adoption), and potential test suite enhancements. The core logic is sound and heavily fortified.
(This review was generated by Overseer)
    const (
        // resourceOwnedBySandbox indicates the resource's controllerRef points to this Sandbox.
        resourceOwnedBySandbox resourceOwnership = iota
Consider initializing the first enum value (e.g., resourceOwnershipUnknown = iota) or start resourceOwnedBySandbox at 1 to prevent an uninitialized variable from defaulting to a valid state.
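The zero-value hazard the reviewer points at is easy to demonstrate in a self-contained sketch (the names follow the PR's enum, with a hypothetical resourceOwnershipUnknown added at position zero):

```go
package main

import "fmt"

type resourceOwnership int

// Reserving the zero value for an explicit Unknown state means a variable
// that was never assigned can never be mistaken for a legitimate verdict
// such as "owned by this sandbox".
const (
	resourceOwnershipUnknown resourceOwnership = iota
	resourceOwnedBySandbox
	resourceUnowned
	resourceOwnedByOther
)

func main() {
	var o resourceOwnership // declared but never assigned: defaults to 0
	fmt.Println(o == resourceOwnershipUnknown, o == resourceOwnedBySandbox)
}
```

With the original ordering (resourceOwnedBySandbox = iota), the same unassigned variable would compare equal to resourceOwnedBySandbox, silently defaulting to the most permissive state.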
    if controllerRef == nil {
        return resourceUnowned, nil
    }
    if controllerRef.UID == sandbox.UID {
For robustness, consider adding a check to ensure sandbox.UID != "" to prevent mistakenly matching an empty controllerRef.UID in edge cases where the sandbox is malformed.
    // If the sandbox has adopted a warm pool pod, the pod name is tracked in the
    // agents.x-k8s.io/pod-name annotation and may differ from sandbox.Name.
    func resolvePodName(sandbox *sandboxv1alpha1.Sandbox) string {
        if name, ok := sandbox.Annotations[SandboxPodNameAnnotation]; ok && name != "" {
Using strings.TrimSpace(name) might be helpful to safeguard against trailing whitespace if the annotation was set manually.
    case resourceUnowned:
        // ClusterIP is immutable — refuse adoption if the service is not headless.
        if service.Spec.ClusterIP != corev1.ClusterIPNone && service.Spec.ClusterIP != "" {
This is a great defensive check. An existing unowned Service might have had its IP assigned by the API server if it wasn't explicitly set to None upon creation.
    log.Info("Adopting unowned service", "Service.Name", service.Name, "Sandbox.Name", sandbox.Name)

    // Enforce intended labels and selector to prevent traffic hijack.
    if service.Labels == nil {
If the unowned Service was created maliciously or accidentally by a user, it might contain unexpected labels or annotations. Consider explicitly clearing them or creating a fresh map containing only the sandboxLabel.
        allErrors = errors.Join(allErrors, fmt.Errorf("failed to delete pod: %w", err))
    }
    case resourceUnowned:
        log.Info("Skipping pod deletion during expiry: pod has no controllerRef pointing to this sandbox",
Skipping pod deletion here safely neutralizes the unauthorized deletion vector. As a defense-in-depth measure, consider also stripping the malicious pod annotation here so the Sandbox object state is fully sanitized before it enters the Expired phase.
    // Check if there's an annotation with a non-empty value
    if annotatedPod, exists := tc.sandbox.Annotations[SandboxPodNameAnnotation]; exists && annotatedPod != "" {
        podName = annotatedPod
        if tc.wantPodSurvives != "" {
This is a great, robust assertion. To be completely thorough, you could also assert that livePod.GetOwnerReferences() matches its initial state, proving the Sandbox controller didn't accidentally adopt it.
    wantStatusServiceFQDN: sandboxName + "." + sandboxNs + ".svc.cluster.local",
    },
    {
        name: "uses existing service owned by this sandbox",
It would be ideal to provide wantService for this test case. Even though the controller just returns the existing service, asserting its full state ensures that properties like labels and selector haven't been inadvertently clobbered.
    }
    }

    func TestCheckOwnership(t *testing.T) {
The test coverage is solid. You might want to include an edge case where the sandbox.UID is "" or controllerRef.UID is "" to ensure checkOwnership handles malformed inputs gracefully.
    }
    }

    func TestReconcilePVCs(t *testing.T) {
This test does a fantastic job covering all PVC ownership paths. For maximum coverage, consider adding a test case containing multiple VolumeClaimTemplates, where one PVC is already owned and the next is unowned, validating that the loop properly evaluates each PVC.
Summary
Fixes #265: privilege escalation via the agents.x-k8s.io/pod-name annotation.

Adds a checkPodOwnership() helper that validates the pod's controllerRef UID against the Sandbox UID before any pod operation.

Follows maintainer guidance from @janetkuo: the annotation cannot be the sole source of truth; controllerRef must always be checked.
Test Plan
- make build succeeds
- make lint-go passes
- make test-unit passes (12/12 TestReconcilePod subtests)