Skip to content

feat: implement agent_sandboxes point-in-time metrics#410

Open
chw120 wants to merge 9 commits intokubernetes-sigs:mainfrom
chw120:agent-sandboxes-point-in-time-metrics-12557458534412660193
Open

feat: implement agent_sandboxes point-in-time metrics#410
chw120 wants to merge 9 commits intokubernetes-sigs:mainfrom
chw120:agent-sandboxes-point-in-time-metrics-12557458534412660193

Conversation

@chw120
Copy link
Contributor

@chw120 chw120 commented Mar 13, 2026

Overview

This PR implements a new Prometheus metric, agent_sandboxes, to monitor the point-in-time state of sandboxes within the cluster. Stated in #245

Key Changes

  • Custom Prometheus Collector: Introduced SandboxCollector in controllers/sandbox_controller.go. This collector dynamically fetches and counts sandboxes during a metrics scrape, which avoids the overhead of updating a GaugeVec during every reconcile loop.
  • Rich Labeling: The agent_sandboxes gauge includes labels for namespace, ready_condition, expired, launch_type, and sandbox_template.
    • namespace: the namespace of the sandbox
    • ready_condition: "true" or "false"
    • expired: "true" or "false"
    • launch_type: "warm" or "cold"
    • sandbox_template: The name of the template used to create the sandbox.
  • Annotation Tracking: Extracted constants for sandbox annotations, specifically SandboxPodNameAnnotation and SandboxTemplateRefAnnotation.
  • Controller Integration: Updated createSandbox in the SandboxClaimReconciler to explicitly inject the template's name into the Sandbox object's annotations at creation time, ensuring accurate metric reporting.

Testing

  • Added table-driven tests in controllers/sandbox_controller_test.go to verify the correct formatting and grouping of metrics across various label combinations.
  • Also tested in the local kind cluster.

- Added , a custom Prometheus  inside  to fetch and count sandboxes dynamically during a metrics scrape. This avoids the heavy  overhead that would be caused by updating a GaugeVec on every Reconcile loop.
- The metrics gauge exposes labels: , , , and  to accurately track Sandboxes states over time.
- Extracted constants for sandbox annotations (, ).
- Updated  to explicitly inject the template's name via the newly created  into the Sandbox object at creation time.
- Added exhaustive table-driven tests to verify correct formatting and grouping of metrics by label combinations.
@netlify
Copy link

netlify bot commented Mar 13, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit f159f11
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69bddfbc960c870008332b59

@k8s-ci-robot k8s-ci-robot requested review from igooch and janetkuo March 13, 2026 20:08
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 13, 2026
@k8s-ci-robot
Copy link
Contributor

Hi @chw120. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 13, 2026
@igooch
Copy link
Contributor

igooch commented Mar 13, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 13, 2026
Copy link
Contributor

@igooch igooch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code itself looks good.

Main recommendation is to refactor the code out of the sandbox_controller.go.

@chw120 chw120 force-pushed the agent-sandboxes-point-in-time-metrics-12557458534412660193 branch from c0675c1 to 75153eb Compare March 16, 2026 23:57
Copy link
Contributor

@igooch igooch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 19, 2026
@igooch
Copy link
Contributor

igooch commented Mar 19, 2026

@janetkuo PR looks good to me. Trade offs on the custom collector vs a gauge seem reasonable.

Could you please review?

@yongruilin
Copy link
Contributor

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 19, 2026
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 19, 2026
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 19, 2026
@chw120
Copy link
Contributor Author

chw120 commented Mar 20, 2026

/retest

ch <- c.agentSandboxesDesc
}

// Collect fetches sandboxes, calculates labels, and sends metrics to the channel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be used to DDOS the reconciler ?
Like have a scraper collect the metric in a tight loop an cause the controller to run out of mem ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO in sandbox_collector.go to acknowledge the O(N) list concern and the potential for DDoS/OOM at scale, and should be replaced with a better implementation.

Copy link
Member

@janetkuo janetkuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Approving only the annotation addition to _types.go
Please address my comment before merging, and avoid adding other API changes

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aditya-shantanu, chw120, igooch, janetkuo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 20, 2026
@barney-s
Copy link
Contributor

My biggest concern is having a List call in a metrics Collect method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants