Implements agent_sandbox_warmpool_size gauge metrics#383
Implements agent_sandbox_warmpool_size gauge metrics#383Oneimu wants to merge 2 commits intokubernetes-sigs:mainfrom
Conversation
✅ Deploy Preview for agent-sandbox ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
Hi @Oneimu. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/assign @igooch |
|
/ok-to-test |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Oneimu The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| Name: "agent_sandbox_warmpool_size", | ||
| Help: "Monitor the point-in-time status of the warmpool. Purpose is to be able to alert on WarmPool exhaustion.", | ||
| }, | ||
| []string{"pod_status", "warmpool_name", "sandbox_template"}, |
There was a problem hiding this comment.
what is the cardinality of this metric? seems like the cardinality of warmpool_name and sandbox_template for example could be very high?
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
| prometheus.GaugeOpts{ | ||
| Name: "agent_sandbox_warmpool_size", | ||
| Help: "Monitor the point-in-time status of the warmpool. Purpose is to be able to alert on WarmPool exhaustion.", | ||
| }, |
There was a problem hiding this comment.
CRITICAL: The SandboxWarmPool is a namespaced resource, but this metric doesn't include a namespace label. If two namespaces have a WarmPool with the exact same name, their metrics will clash and randomly overwrite each other.
Please add "namespace" to the label keys here, and update all WithLabelValues and DeleteLabelValues calls to include wp.Namespace.
| asmetrics "sigs.k8s.io/agent-sandbox/internal/metrics" | ||
| ) | ||
|
|
||
| const ( |
There was a problem hiding this comment.
Using a finalizer solely to clean up in-memory Prometheus metrics is an anti-pattern. Finalizers block API object deletion, and if the controller is ever uninstalled or unavailable, the object will be stuck in a terminating state forever.
Instead, you can remove the finalizer completely. To clean up metrics when the object is fully deleted, check for apierrors.IsNotFound(err) at the beginning of Reconcile (after r.Get) and call asmetrics.WarmPoolSize.DeletePartialMatch(...) using req.Name and req.Namespace.
|
|
||
| // Handle deletion | ||
| if !warmPool.DeletionTimestamp.IsZero() { | ||
| if controllerutil.ContainsFinalizer(warmPool, metricsFinalizer) { |
There was a problem hiding this comment.
If you drop the finalizer approach as suggested, this entire finalizer removal block can be deleted. If you decide to keep it, DeleteWarmPoolMetrics is currently vulnerable to label leaks if the TemplateRef.Name was changed before deletion.
| } | ||
|
|
||
| // Add finalizer if it doesn't exist | ||
| if !controllerutil.ContainsFinalizer(warmPool, metricsFinalizer) { |
There was a problem hiding this comment.
If you end up keeping the finalizer addition (though it is recommended to remove it), it's generally safer to return early (e.g., return ctrl.Result{Requeue: true}, nil) after a successful r.Update. This ensures Reconcile is invoked again with the newly updated ResourceVersion from the cache, avoiding conflicts.
|
|
||
| templateName := wp.Spec.TemplateRef.Name | ||
| for status, count := range counts { | ||
| WarmPoolSize.WithLabelValues(status, wp.Name, templateName).Set(count) |
There was a problem hiding this comment.
If a user updates the TemplateRef.Name of an existing WarmPool, UpdateWarmPoolMetrics will start publishing metrics with the new sandbox_template label, but the old metrics with the previous sandbox_template label will leak indefinitely.
Consider calling WarmPoolSize.DeletePartialMatch(prometheus.Labels{"warmpool_name": wp.Name, "namespace": wp.Namespace}) at the beginning of this function to clear out all stale labels before setting the new metric values.
|
|
||
| for _, status := range statuses { | ||
| WarmPoolSize.DeleteLabelValues(status, wpName, templateName) | ||
| } |
There was a problem hiding this comment.
Instead of manually iterating over a hardcoded statuses array, you can use WarmPoolSize.DeletePartialMatch(prometheus.Labels{"warmpool_name": wpName}) (and namespace if added). This is cleaner, atomic, and guarantees all statuses and templates are cleared even if new custom phases are added in the future.
| if err != nil { | ||
| log.Error(err, "Failed to create pod") | ||
| allErrors = errors.Join(allErrors, err) | ||
| } else if pod != nil { |
There was a problem hiding this comment.
Because r.Create only populates the pod's metadata and not its apiserver-defaulted Status, the pod object here will have an empty Status.Phase.
UpdateWarmPoolMetrics translates "" to unknown, meaning newly created pods will temporarily spike the unknown bucket for one reconcile loop until they are fetched again. This is functionally fine, but worth being aware of.
| activePods = remainingPods | ||
| } | ||
|
|
||
| // Update metrics at the very end with the final set of active pods |
There was a problem hiding this comment.
Calling UpdateWarmPoolMetrics at the very end works well because the current logic doesn't return early on errors (it joins them into allErrors). If early returns are ever introduced in this function, wrapping this in a defer block might be necessary to ensure metrics are always accurate.
| ctx := context.Background() | ||
| err := r.reconcilePool(ctx, warmPool) | ||
| require.NoError(t, err) | ||
|
|
There was a problem hiding this comment.
To ensure consistency if the constant cases ever change, consider using strings.ToLower(string(corev1.PodPending)) instead of hardcoding the "pending" string here.
|
|
||
| if poolName == "test-pool" && template == "test-template" { | ||
| if status == PodStatusReady { | ||
| assert.Equal(t, 1.0, metric.GetGauge().GetValue()) |
There was a problem hiding this comment.
For future tests, consider using the testutil package (github.com/prometheus/client_golang/prometheus/testutil) to compare metrics directly against expected outputs. It significantly reduces boilerplate around collecting and parsing metric channels.
|
@Oneimu: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Implements agent_sandbox_warmpool_size metric to monitor the status of warm pools in the agent-sandbox, and track the available (ready) warmpool.
changes includes:
internal/metrics/metrics.go: Added WarmPoolSize gauge metric, UpdateWarmPoolMetrics, and DeleteWarmPoolMetrics functions.internal/metrics/metrics_test.go: Added unit tests for the new metrics and registration.extensions/controllers/sandboxwarmpool_controller.go:extensions/controllers/sandboxwarmpool_controller_test.go: Added TestReconcilePoolMetrics to verify metric updates during reconciliation.