feat: implement agent_sandboxes point-in-time metrics #410
chw120 wants to merge 9 commits into kubernetes-sigs:main
- Added `SandboxCollector`, a custom Prometheus collector inside `controllers/sandbox_controller.go`, to fetch and count sandboxes dynamically during a metrics scrape. This avoids the heavy overhead of updating a GaugeVec on every Reconcile loop.
- The metrics gauge exposes labels that track Sandbox states over time.
- Extracted constants for sandbox annotations (`SandboxPodNameAnnotation`, `SandboxTemplateRefAnnotation`).
- Updated `createSandbox` to explicitly inject the template's name, via the newly created `SandboxTemplateRefAnnotation`, into the Sandbox object at creation time.
- Added exhaustive table-driven tests to verify correct formatting and grouping of metrics by label combinations.
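The annotation-injection step can be sketched in isolation. This is a minimal illustration, not the PR's code: the `sandbox` struct is a simplified stand-in for the real object, the annotation key string is hypothetical (the actual constant value is not shown in this thread), and the "unknown" fallback is an assumption.

```go
package main

import "fmt"

// Hypothetical annotation key; the PR's real constant value is not shown here.
const sandboxTemplateRefAnnotation = "example.invalid/sandbox-template-ref"

// sandbox is a simplified stand-in for the Sandbox object.
type sandbox struct {
	Name        string
	Annotations map[string]string
}

// newSandboxFromTemplate records the template name on the object at creation time,
// so later readers do not have to resolve the template again.
func newSandboxFromTemplate(name, templateName string) sandbox {
	return sandbox{
		Name:        name,
		Annotations: map[string]string{sandboxTemplateRefAnnotation: templateName},
	}
}

// templateLabel returns the value to use for a template label.
// The "unknown" fallback for missing annotations is an assumption of this sketch.
func templateLabel(s sandbox) string {
	if v, ok := s.Annotations[sandboxTemplateRefAnnotation]; ok && v != "" {
		return v
	}
	return "unknown"
}

func main() {
	claimed := newSandboxFromTemplate("sb-1", "python-runtime")
	bare := sandbox{Name: "sb-2"} // created without a template annotation
	fmt.Println(templateLabel(claimed))
	fmt.Println(templateLabel(bare))
}
```

Recording the name once at creation keeps the scrape path to a plain map lookup per sandbox.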
/ok-to-test
igooch left a comment:
The code itself looks good.
The main recommendation is to move the collector code out of sandbox_controller.go.
@janetkuo PR looks good to me. The trade-offs of the custom collector versus a gauge seem reasonable. Could you please review?
/retest
New changes are detected. LGTM label has been removed.
/retest
        ch <- c.agentSandboxesDesc
    }

    // Collect fetches sandboxes, calculates labels, and sends metrics to the channel.
Can this be used to DDoS the reconciler? For example, could a scraper collect the metric in a tight loop and cause the controller to run out of memory?
Added a TODO in sandbox_collector.go acknowledging the O(N) list concern and the potential for DDoS/OOM at scale; the current approach should be replaced with a better implementation.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: aditya-shantanu, chw120, igooch, janetkuo.
My biggest concern is having a List call in a metrics Collect method.
Overview
This PR implements a new Prometheus metric, `agent_sandboxes`, to monitor the point-in-time state of sandboxes within the cluster. Stated in #245.

Key Changes
- Added `SandboxCollector` in `controllers/sandbox_controller.go`. This collector dynamically fetches and counts sandboxes during a metrics scrape, which avoids the overhead of updating a `GaugeVec` during every reconcile loop.
- The `agent_sandboxes` gauge includes labels for `namespace`, `ready_condition`, `expired`, `launch_type`, and `sandbox_template`:
  - `namespace`: the namespace of the sandbox
  - `ready_condition`: "true" or "false"
  - `expired`: "true" or "false"
  - `launch_type`: "warm" or "cold"
  - `sandbox_template`: the name of the template used to create the sandbox
- Extracted the annotation constants `SandboxPodNameAnnotation` and `SandboxTemplateRefAnnotation`.
- Updated `createSandbox` in the `SandboxClaimReconciler` to explicitly inject the template's name into the `Sandbox` object's annotations at creation time, ensuring accurate metric reporting.

Testing
- Added table-driven tests in `controllers/sandbox_controller_test.go` to verify the correct formatting and grouping of metrics across various label combinations.
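
The grouping by label combination can be sketched with the standard library alone. This is an illustrative sketch, not the PR's collector: `sandboxInfo`, `labelKey`, and `countByLabels` are hypothetical names, and representing warm versus cold launch as a boolean is an assumption, since the thread does not show how the PR determines it.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// sandboxInfo is a hypothetical, flattened view of the fields the labels need.
type sandboxInfo struct {
	Namespace  string
	Ready      bool
	Expired    bool
	WarmLaunch bool // assumption: how warm/cold is detected is not shown in the thread
	Template   string
}

// labelKey holds one combination of the five label values described above.
type labelKey struct {
	Namespace, ReadyCondition, Expired, LaunchType, SandboxTemplate string
}

// keyFor converts a sandbox into its label combination.
func keyFor(s sandboxInfo) labelKey {
	launch := "cold"
	if s.WarmLaunch {
		launch = "warm"
	}
	return labelKey{
		Namespace:       s.Namespace,
		ReadyCondition:  strconv.FormatBool(s.Ready),
		Expired:         strconv.FormatBool(s.Expired),
		LaunchType:      launch,
		SandboxTemplate: s.Template,
	}
}

// countByLabels yields one gauge value per distinct label combination.
func countByLabels(sandboxes []sandboxInfo) map[labelKey]int {
	counts := make(map[labelKey]int)
	for _, s := range sandboxes {
		counts[keyFor(s)]++
	}
	return counts
}

func main() {
	counts := countByLabels([]sandboxInfo{
		{Namespace: "default", Ready: true, WarmLaunch: true, Template: "python"},
		{Namespace: "default", Ready: true, WarmLaunch: true, Template: "python"},
		{Namespace: "default", Ready: false, Template: "python"},
	})
	// Sort keys so the output order is stable across runs.
	keys := make([]labelKey, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		return fmt.Sprint(keys[i]) < fmt.Sprint(keys[j])
	})
	for _, k := range keys {
		fmt.Printf("%+v %d\n", k, counts[k])
	}
}
```

Using a comparable struct as the map key keeps the grouping to a single O(N) pass, and it is the kind of function that table-driven tests can exercise directly.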