feat: implement agent_sandboxes point-in-time metrics by chw120 · Pull Request #410 · kubernetes-sigs/agent-sandbox

chw120 · 2026-03-13T20:08:43Z

Overview

This PR implements a new Prometheus metric, agent_sandboxes, to monitor the point-in-time state of sandboxes within the cluster, as described in #245.

Key Changes

Custom Prometheus Collector: Introduced SandboxCollector in controllers/sandbox_controller.go. This collector dynamically fetches and counts sandboxes during a metrics scrape, which avoids the overhead of updating a GaugeVec during every reconcile loop.
Rich Labeling: The agent_sandboxes gauge includes labels for namespace, ready_condition, expired, launch_type, and sandbox_template.
- namespace: the namespace of the sandbox
- ready_condition: "true" or "false"
- expired: "true" or "false"
- launch_type: "warm" or "cold"
- sandbox_template: The name of the template used to create the sandbox.
Annotation Tracking: Extracted constants for sandbox annotations, specifically SandboxPodNameAnnotation and SandboxTemplateRefAnnotation.
Controller Integration: Updated createSandbox in the SandboxClaimReconciler to explicitly inject the template's name into the Sandbox object's annotations at creation time, ensuring accurate metric reporting.
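The scrape-time counting described above can be sketched as follows. This is a minimal, standard-library-only illustration, not the code from this PR: the Sandbox struct, the Warm field, and the templateRefKey annotation key are hypothetical stand-ins, and a real collector would emit each count through the Prometheus client library instead of printing it.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// Sandbox is a simplified, hypothetical stand-in for the real Sandbox object.
// Only the fields needed to derive the five labels are modelled.
type Sandbox struct {
	Namespace   string
	Ready       bool
	Expired     bool
	Warm        bool // assumed meaning of launch_type: true = "warm", false = "cold"
	Annotations map[string]string
}

// templateRefKey is a placeholder annotation key, not the value used in the PR.
const templateRefKey = "example.com/sandbox-template-ref"

// labelKey is the tuple of label values that identifies one gauge series.
type labelKey struct {
	Namespace, ReadyCondition, Expired, LaunchType, SandboxTemplate string
}

// keyFor derives the label tuple for a single sandbox.
func keyFor(s Sandbox) labelKey {
	launch := "cold"
	if s.Warm {
		launch = "warm"
	}
	return labelKey{
		Namespace:       s.Namespace,
		ReadyCondition:  strconv.FormatBool(s.Ready),
		Expired:         strconv.FormatBool(s.Expired),
		LaunchType:      launch,
		SandboxTemplate: s.Annotations[templateRefKey], // reading a nil map yields ""
	}
}

// countByLabels groups sandboxes by label tuple. It runs once per scrape,
// so nothing has to be updated during reconciliation.
func countByLabels(sandboxes []Sandbox) map[labelKey]int {
	counts := make(map[labelKey]int)
	for _, s := range sandboxes {
		counts[keyFor(s)]++
	}
	return counts
}

func main() {
	sandboxes := []Sandbox{
		{Namespace: "team-a", Ready: true, Warm: true, Annotations: map[string]string{templateRefKey: "python"}},
		{Namespace: "team-a", Ready: true, Warm: true, Annotations: map[string]string{templateRefKey: "python"}},
		{Namespace: "team-b", Expired: true},
	}
	counts := countByLabels(sandboxes)

	// Sort the keys so the printed order is stable between runs.
	keys := make([]labelKey, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return fmt.Sprint(keys[i]) < fmt.Sprint(keys[j]) })

	for _, k := range keys {
		fmt.Printf("agent_sandboxes{namespace=%q,ready_condition=%q,expired=%q,launch_type=%q,sandbox_template=%q} %d\n",
			k.Namespace, k.ReadyCondition, k.Expired, k.LaunchType, k.SandboxTemplate, counts[k])
	}
}
```

Sandboxes that share all five label values collapse into one series, so the number of series grows with the number of distinct label combinations, not with the number of sandboxes.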

Testing

Added table-driven tests in controllers/sandbox_controller_test.go to verify the correct formatting and grouping of metrics across various label combinations.
Also tested in a local kind cluster.
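For readers unfamiliar with the table-driven style mentioned above, here is a small sketch of the pattern. It does not reproduce the tests in controllers/sandbox_controller_test.go: the sandbox and seriesKey types and the groupSandboxes function are hypothetical, and the loop is written as a plain function so the example runs on its own. In a real _test.go file the loop body would normally sit inside t.Run.

```go
package main

import (
	"fmt"
	"reflect"
)

// sandbox and seriesKey are simplified, hypothetical types for illustration.
type sandbox struct {
	namespace string
	ready     bool
}

type seriesKey struct {
	namespace      string
	readyCondition string
}

// groupSandboxes is the function under test: it counts sandboxes per label tuple.
func groupSandboxes(in []sandbox) map[seriesKey]int {
	out := map[seriesKey]int{}
	for _, s := range in {
		out[seriesKey{s.namespace, fmt.Sprint(s.ready)}]++
	}
	return out
}

// runCases walks the table and returns one message per failing case.
func runCases() []string {
	cases := []struct {
		name string
		in   []sandbox
		want map[seriesKey]int
	}{
		{
			name: "no sandboxes",
			in:   nil,
			want: map[seriesKey]int{},
		},
		{
			name: "identical labels are grouped",
			in:   []sandbox{{"a", true}, {"a", true}},
			want: map[seriesKey]int{{"a", "true"}: 2},
		},
		{
			name: "different labels are split",
			in:   []sandbox{{"a", true}, {"a", false}, {"b", true}},
			want: map[seriesKey]int{{"a", "true"}: 1, {"a", "false"}: 1, {"b", "true"}: 1},
		},
	}

	var failures []string
	for _, tc := range cases {
		if got := groupSandboxes(tc.in); !reflect.DeepEqual(got, tc.want) {
			failures = append(failures, fmt.Sprintf("%s: got %v, want %v", tc.name, got, tc.want))
		}
	}
	return failures
}

func main() {
	failures := runCases()
	for _, f := range failures {
		fmt.Println("FAIL:", f)
	}
	if len(failures) == 0 {
		fmt.Println("all cases passed")
	}
}
```

Adding a new label combination to the suite is then a matter of appending one entry to the table.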

netlify · 2026-03-13T20:08:49Z

✅ Deploy Preview for agent-sandbox canceled.

Name	Link
🔨 Latest commit	`f159f11`
🔍 Latest deploy log	https://app.netlify.com/projects/agent-sandbox/deploys/69bddfbc960c870008332b59

k8s-ci-robot · 2026-03-13T20:08:53Z

Hi @chw120. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

igooch · 2026-03-13T20:51:28Z

/ok-to-test

igooch

The code itself looks good.

My main recommendation is to refactor the code out of sandbox_controller.go.

igooch

/lgtm

igooch · 2026-03-19T04:23:37Z

@janetkuo The PR looks good to me. The trade-offs of the custom collector versus a gauge seem reasonable.

Could you please review?

yongruilin · 2026-03-19T20:31:03Z

/retest

k8s-ci-robot · 2026-03-19T21:05:07Z

New changes are detected. LGTM label has been removed.

chw120 · 2026-03-20T07:18:54Z

/retest

barney-s · 2026-03-20T18:16:23Z

internal/metrics/sandbox_collector.go

+	ch <- c.agentSandboxesDesc
+}
+
+// Collect fetches sandboxes, calculates labels, and sends metrics to the channel.


Can this be used to DDoS the reconciler?
For example, could a scraper collect the metric in a tight loop and cause the controller to run out of memory?

chw120 · Mar 21, 2026

Added a TODO in sandbox_collector.go acknowledging the O(N) List concern and the potential for DDoS/OOM at scale; the current approach should be replaced with a better implementation.

janetkuo

/approve

Approving only the annotation addition to _types.go.
Please address my comment before merging, and avoid adding other API changes.

k8s-ci-robot · 2026-03-20T18:17:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aditya-shantanu, chw120, igooch, janetkuo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [janetkuo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

barney-s · 2026-03-20T18:42:41Z

My biggest concern is having a List call in a metrics Collect method.

k8s-ci-robot requested review from igooch and janetkuo March 13, 2026 20:08

k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Mar 13, 2026

k8s-ci-robot added the size/L (denotes a PR that changes 100-499 lines, ignoring generated files) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels Mar 13, 2026

k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Mar 13, 2026

igooch reviewed Mar 16, 2026

aditya-shantanu reviewed Mar 16, 2026

address reviewer feedback: refactor metrics and fix dependency coupling

75153eb

chw120 force-pushed the agent-sandboxes-point-in-time-metrics-12557458534412660193 branch from c0675c1 to 75153eb Compare March 16, 2026 23:57

aditya-shantanu approved these changes Mar 17, 2026

Merge branch 'kubernetes-sigs:main' into agent-sandboxes-point-in-time-metrics-12557458534412660193

e258221

igooch approved these changes Mar 19, 2026

k8s-ci-robot assigned igooch Mar 19, 2026

k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Mar 19, 2026

k8s-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Mar 19, 2026

Merge branch 'main' into agent-sandboxes-point-in-time-metrics-12557458534412660193

0f3e5f0

k8s-ci-robot removed the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Mar 19, 2026

k8s-ci-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Mar 19, 2026

chw120 added 3 commits March 19, 2026 23:57

Merge branch 'kubernetes-sigs:main' into agent-sandboxes-point-in-time-metrics-12557458534412660193

ed3fe25

address reviewer feedback: refactor metrics and fix dependency coupling

8ca3332

Resolve conflicts.

dbc38f0

janetkuo reviewed Mar 20, 2026

barney-s reviewed Mar 20, 2026

janetkuo reviewed Mar 20, 2026

k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Mar 20, 2026

chw120 added 2 commits March 20, 2026 21:59

moved annotations to types.go

a0db936

added a TODO in SandboxCollector to address the List call concern

f159f11

7 participants
