Implements agent_sandbox_warmpool_size gauge metrics #383

Open
Oneimu wants to merge 2 commits into kubernetes-sigs:main from Oneimu:warmpool-metrics

@Oneimu (Contributor) commented Mar 10, 2026

Implements the agent_sandbox_warmpool_size gauge metric to monitor the status of warm pools in agent-sandbox and to track how many warm pool pods are available (ready).

Changes include:

  • internal/metrics/metrics.go: Added WarmPoolSize gauge metric, UpdateWarmPoolMetrics, and DeleteWarmPoolMetrics functions.
  • internal/metrics/metrics_test.go: Added unit tests for the new metrics and registration.
  • extensions/controllers/sandboxwarmpool_controller.go:
    • Integrated metrics cleanup using a finalizer (metrics.agents.x-k8s.io/cleanup).
    • Updated reconcilePool to call UpdateWarmPoolMetrics with the final set of active pods.
    • Refactored createPoolPod to return the created pod for accurate metric tracking.
  • extensions/controllers/sandboxwarmpool_controller_test.go: Added TestReconcilePoolMetrics to verify metric updates during reconciliation.

@k8s-ci-robot requested a review from janetkuo Mar 10, 2026 00:58
netlify bot commented Mar 10, 2026

Deploy Preview for agent-sandbox ready!

Name | Link
🔨 Latest commit | 4e8ac65
🔍 Latest deploy log | https://app.netlify.com/projects/agent-sandbox/deploys/69b09ee26915e500088506ac
😎 Deploy Preview | https://deploy-preview-383--agent-sandbox.netlify.app

@k8s-ci-robot requested a review from justinsb Mar 10, 2026 00:58
@k8s-ci-robot added labels Mar 10, 2026: cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.), needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.)
@k8s-ci-robot
Contributor

Hi @Oneimu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) label Mar 10, 2026
@Oneimu (Contributor Author) commented Mar 10, 2026

/assign @igooch

@aditya-shantanu
Contributor

/ok-to-test

@k8s-ci-robot added ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.) and removed needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) labels Mar 10, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Oneimu
Once this PR has been reviewed and has the lgtm label, please ask for approval from igooch. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Name: "agent_sandbox_warmpool_size",
Help: "Monitor the point-in-time status of the warmpool. Purpose is to be able to alert on WarmPool exhaustion.",
},
[]string{"pod_status", "warmpool_name", "sandbox_template"},

What is the cardinality of this metric? It seems like the cardinality of warmpool_name and sandbox_template, for example, could be very high.

@k8s-ci-robot
Contributor

PR needs rebase.

@k8s-ci-robot added the needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) label Mar 14, 2026

prometheus.GaugeOpts{
	Name: "agent_sandbox_warmpool_size",
	Help: "Monitor the point-in-time status of the warmpool. Purpose is to be able to alert on WarmPool exhaustion.",
},
Contributor

CRITICAL: The SandboxWarmPool is a namespaced resource, but this metric doesn't include a namespace label. If two namespaces have a WarmPool with the exact same name, their metrics will clash and randomly overwrite each other.
Please add "namespace" to the label keys here, and update all WithLabelValues and DeleteLabelValues calls to include wp.Namespace.
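
To illustrate the clash described above, here is a minimal stdlib-only sketch. It models the gauge as a plain Go map keyed by label values rather than using client_golang, so seriesKey, gaugeVec, and the "team-a"/"team-b" namespaces are hypothetical names chosen for the example, not code from this PR.

```go
package main

import "fmt"

// seriesKey stands in for the label values of one gauge series.
// Hypothetical model type; it is not part of client_golang.
type seriesKey struct {
	Namespace string // left empty to model the metric without a namespace label
	PodStatus string
	WarmPool  string
	Template  string
}

// gaugeVec models a labeled gauge: one stored value per distinct label set.
type gaugeVec map[seriesKey]float64

// set records the latest value for a series, overwriting any previous one.
func (g gaugeVec) set(k seriesKey, v float64) { g[k] = v }

// withoutNamespace publishes ready counts for two same-named pools that live
// in different namespaces, but drops the namespace from the label set.
func withoutNamespace() gaugeVec {
	g := gaugeVec{}
	g.set(seriesKey{PodStatus: "ready", WarmPool: "pool", Template: "tmpl"}, 3) // pool in team-a
	g.set(seriesKey{PodStatus: "ready", WarmPool: "pool", Template: "tmpl"}, 7) // pool in team-b overwrites it
	return g
}

// withNamespace publishes the same counts with the namespace included.
func withNamespace() gaugeVec {
	g := gaugeVec{}
	g.set(seriesKey{Namespace: "team-a", PodStatus: "ready", WarmPool: "pool", Template: "tmpl"}, 3)
	g.set(seriesKey{Namespace: "team-b", PodStatus: "ready", WarmPool: "pool", Template: "tmpl"}, 7)
	return g
}

func main() {
	fmt.Println("series without namespace label:", len(withoutNamespace()))
	fmt.Println("series with namespace label:", len(withNamespace()))
}
```

In the model, the two pools collapse into a single series when the namespace is missing, and whichever pool reconciled last determines the exported value.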

asmetrics "sigs.k8s.io/agent-sandbox/internal/metrics"
)

const (
Contributor

Using a finalizer solely to clean up in-memory Prometheus metrics is an anti-pattern. Finalizers block API object deletion, and if the controller is ever uninstalled or unavailable, the object will be stuck in a terminating state forever.
Instead, you can remove the finalizer completely. To clean up metrics when the object is fully deleted, check for apierrors.IsNotFound(err) at the beginning of Reconcile (after r.Get) and call asmetrics.WarmPoolSize.DeletePartialMatch(...) using req.Name and req.Namespace.
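
The control flow suggested here can be sketched without controller-runtime. The example below is a stdlib-only model under stated assumptions: errNotFound stands in for an error that apierrors.IsNotFound would match, and fakeCluster, poolKey, get, and reconcile are hypothetical names, not the controller's real types.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the error that apierrors.IsNotFound would match.
var errNotFound = errors.New("warm pool not found")

// poolKey plays the role of req.Namespace and req.Name.
type poolKey struct{ Namespace, Name string }

// fakeCluster is a hypothetical stand-in for the API server plus the metric registry.
type fakeCluster struct {
	pools  map[poolKey]bool    // objects that still exist
	series map[poolKey]float64 // exported gauge value per pool
}

// get models r.Get: it fails with errNotFound once the object is gone.
func (c *fakeCluster) get(k poolKey) error {
	if !c.pools[k] {
		return errNotFound
	}
	return nil
}

// reconcile drops the pool's series when the object no longer exists,
// so no finalizer is needed to trigger the cleanup.
func (c *fakeCluster) reconcile(k poolKey) error {
	if err := c.get(k); err != nil {
		if errors.Is(err, errNotFound) {
			delete(c.series, k) // object fully deleted: clean up metrics and stop
			return nil
		}
		return err
	}
	c.series[k] = 1 // placeholder for the normal metric update
	return nil
}

func main() {
	key := poolKey{Namespace: "team-a", Name: "pool"}
	c := &fakeCluster{
		pools:  map[poolKey]bool{key: true},
		series: map[poolKey]float64{},
	}
	_ = c.reconcile(key)
	fmt.Println("series while the pool exists:", len(c.series))

	delete(c.pools, key) // the object is removed from the API server
	_ = c.reconcile(key)
	fmt.Println("series after the pool is deleted:", len(c.series))
}
```

Because the cleanup only needs the namespace and name from the request, it still works after the object itself can no longer be read.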


// Handle deletion
if !warmPool.DeletionTimestamp.IsZero() {
	if controllerutil.ContainsFinalizer(warmPool, metricsFinalizer) {
Contributor

If you drop the finalizer approach as suggested, this entire finalizer removal block can be deleted. If you decide to keep it, DeleteWarmPoolMetrics is currently vulnerable to label leaks if the TemplateRef.Name was changed before deletion.

}

// Add finalizer if it doesn't exist
if !controllerutil.ContainsFinalizer(warmPool, metricsFinalizer) {
Contributor

If you end up keeping the finalizer addition (though it is recommended to remove it), it's generally safer to return early (e.g., return ctrl.Result{Requeue: true}, nil) after a successful r.Update. This ensures Reconcile is invoked again with the newly updated ResourceVersion from the cache, avoiding conflicts.


templateName := wp.Spec.TemplateRef.Name
for status, count := range counts {
	WarmPoolSize.WithLabelValues(status, wp.Name, templateName).Set(count)
Contributor

If a user updates the TemplateRef.Name of an existing WarmPool, UpdateWarmPoolMetrics will start publishing metrics with the new sandbox_template label, but the old metrics with the previous sandbox_template label will leak indefinitely.
Consider calling WarmPoolSize.DeletePartialMatch(prometheus.Labels{"warmpool_name": wp.Name, "namespace": wp.Namespace}) at the beginning of this function to clear out all stale labels before setting the new metric values.
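
A small stdlib-only model of the leak and of the clear-then-set fix. The deletePartialMatch helper below only imitates the idea of removing every series whose labels match a subset; it is a hypothetical function on a map, not the client_golang method, and the labels type and update signature are likewise assumptions made for the sketch.

```go
package main

import "fmt"

// labels stands in for one series' label values (hypothetical model type).
type labels struct{ Namespace, WarmPool, Template, PodStatus string }

// gaugeVec models a labeled gauge as a map from label values to the last value set.
type gaugeVec map[labels]float64

// deletePartialMatch removes every series for the given namespace and pool,
// whatever its template or status, and returns how many were removed.
func (g gaugeVec) deletePartialMatch(namespace, pool string) int {
	removed := 0
	for k := range g {
		if k.Namespace == namespace && k.WarmPool == pool {
			delete(g, k) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

// update publishes counts under the pool's current template.
// With clearFirst false, series for a previous template are left behind.
func (g gaugeVec) update(namespace, pool, template string, counts map[string]float64, clearFirst bool) {
	if clearFirst {
		g.deletePartialMatch(namespace, pool)
	}
	for status, n := range counts {
		g[labels{Namespace: namespace, WarmPool: pool, Template: template, PodStatus: status}] = n
	}
}

func main() {
	counts := map[string]float64{"ready": 2}

	leaky := gaugeVec{}
	leaky.update("ns", "pool", "old-template", counts, false)
	leaky.update("ns", "pool", "new-template", counts, false) // TemplateRef.Name changed
	fmt.Println("series without clearing:", len(leaky))

	clean := gaugeVec{}
	clean.update("ns", "pool", "old-template", counts, true)
	clean.update("ns", "pool", "new-template", counts, true)
	fmt.Println("series with clearing:", len(clean))
}
```

Clearing by namespace and pool name before setting values means the function never needs to know which template name was used previously.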


for _, status := range statuses {
	WarmPoolSize.DeleteLabelValues(status, wpName, templateName)
}
Contributor

Instead of manually iterating over a hardcoded statuses array, you can use WarmPoolSize.DeletePartialMatch(prometheus.Labels{"warmpool_name": wpName}) (and namespace if added). This is cleaner, atomic, and guarantees all statuses and templates are cleared even if new custom phases are added in the future.

if err != nil {
	log.Error(err, "Failed to create pod")
	allErrors = errors.Join(allErrors, err)
} else if pod != nil {
Contributor

Because r.Create only populates the pod's metadata and not its apiserver-defaulted Status, the pod object here will have an empty Status.Phase.
UpdateWarmPoolMetrics translates "" to unknown, meaning newly created pods will temporarily spike the unknown bucket for one reconcile loop until they are fetched again. This is functionally fine, but worth being aware of.
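
The effect can be reproduced with a small stdlib-only sketch. It assumes the status label is derived by lowercasing the pod phase and mapping the empty phase to "unknown", as described above; podPhase, statusLabel, and countByStatus are hypothetical names, and the real function may also consider readiness, which this model ignores.

```go
package main

import (
	"fmt"
	"strings"
)

// podPhase mirrors the string-typed corev1.PodPhase for this sketch.
type podPhase string

const (
	podPending podPhase = "Pending"
	podRunning podPhase = "Running"
)

// statusLabel lowercases a phase and maps the empty phase to "unknown".
func statusLabel(p podPhase) string {
	if p == "" {
		return "unknown"
	}
	return strings.ToLower(string(p))
}

// countByStatus tallies pods per status label, as a gauge update would need.
func countByStatus(phases []podPhase) map[string]float64 {
	counts := map[string]float64{}
	for _, p := range phases {
		counts[statusLabel(p)]++
	}
	return counts
}

func main() {
	// Two pods read back from the cache plus one just created,
	// whose Status.Phase has not been populated yet.
	phases := []podPhase{podRunning, podPending, ""}
	fmt.Println(countByStatus(phases))
}
```

The freshly created pod lands in the "unknown" bucket until a later reconcile reads it back with a populated phase.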

activePods = remainingPods
}

// Update metrics at the very end with the final set of active pods
Contributor

Calling UpdateWarmPoolMetrics at the very end works well because the current logic doesn't return early on errors (it joins them into allErrors). If early returns are ever introduced in this function, wrapping this in a defer block might be necessary to ensure metrics are always accurate.
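
A minimal sketch of the defer variant mentioned here, using only the standard library. The reconcilePool below is a hypothetical stand-in with an invented signature (desired, failEarly, publish); it only shows that a deferred closure observes the final pod count even when the function returns early.

```go
package main

import (
	"errors"
	"fmt"
)

// reconcilePool is a hypothetical stand-in for the controller function.
// It "creates" pods up to desired and always reports the final count via publish.
func reconcilePool(desired int, failEarly bool, publish func(active int)) error {
	active := 0
	// The closure reads active when the function returns,
	// so it sees the final value on every return path.
	defer func() { publish(active) }()

	for i := 0; i < desired; i++ {
		if failEarly && i == 1 {
			return errors.New("create failed") // early return still publishes
		}
		active++
	}
	return nil
}

func main() {
	publish := func(n int) { fmt.Println("published active pods:", n) }

	fmt.Println("error:", reconcilePool(3, false, publish))
	fmt.Println("error:", reconcilePool(3, true, publish))
}
```

Wrapping the call in a closure matters: writing defer publish(active) directly would evaluate active at the defer statement and always publish the initial value.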

ctx := context.Background()
err := r.reconcilePool(ctx, warmPool)
require.NoError(t, err)

Contributor

To stay consistent if the casing of the constants ever changes, consider using strings.ToLower(string(corev1.PodPending)) instead of hardcoding the "pending" string here.


if poolName == "test-pool" && template == "test-template" {
	if status == PodStatusReady {
		assert.Equal(t, 1.0, metric.GetGauge().GetValue())
Contributor

For future tests, consider using the testutil package (github.com/prometheus/client_golang/prometheus/testutil) to compare metrics directly against expected outputs. It significantly reduces boilerplate around collecting and parsing metric channels.

@k8s-ci-robot
Contributor

@Oneimu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
presubmit-agent-sandbox-lint-api | 4e8ac65 | link | true | /test presubmit-agent-sandbox-lint-api

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Labels

  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

5 participants