
[Feature] Add cleanup for terminated RayJob/RayCluster metrics #3923


Open — wants to merge 2 commits into base: master

Conversation

@phantom5125 commented Aug 7, 2025

Why are these changes needed?

Some of our metrics are stored permanently in Prometheus, which might cause the /metrics endpoint to become slow or time out, so we need a lifecycle-based cleanup.

Quick-Check

Related issue number

#3820

End-to-end test example

$ kubectl apply -f ray-operator/config/samples/ray-job.sample.yaml

# $ kubectl port-forward <kuberay-operator-pod-name> 8080:8080

$ curl -s 127.0.0.1:8080/metrics | grep kuberay_
# HELP kuberay_cluster_condition_provisioned Indicates whether the RayCluster is provisioned
# TYPE kuberay_cluster_condition_provisioned gauge
kuberay_cluster_condition_provisioned{condition="true",name="rayjob-sample-clwvk",namespace="default"} 1
# HELP kuberay_cluster_info Metadata information about RayCluster custom resources
# TYPE kuberay_cluster_info gauge
kuberay_cluster_info{name="rayjob-sample-clwvk",namespace="default",owner_kind="RayJob"} 1
# HELP kuberay_cluster_provisioned_duration_seconds The time, in seconds, when a RayCluster's `RayClusterProvisioned` status transitions from false (or unset) to true
# TYPE kuberay_cluster_provisioned_duration_seconds gauge
kuberay_cluster_provisioned_duration_seconds{name="rayjob-sample-clwvk",namespace="default"} 1259.406597953
...

After the CR is deleted, there are no more metrics:

$ kubectl delete rayjob rayjob-sample
$ curl -s 127.0.0.1:8080/metrics | grep kuberay_

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@phantom5125 marked this pull request as draft August 7, 2025 19:02
@troychiu (Contributor) commented Aug 7, 2025

Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing the PR. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics once the CR is deleted.

@phantom5125 (Author) commented

> Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing the PR. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics once the CR is deleted.

Thanks for pointing this out!

From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In that case, the RayJob CR is deleted immediately after the job finishes, so metrics like kuberay_job_execution_duration_seconds may never be collected, because they will likely be deleted as soon as they are produced.

@troychiu (Contributor) commented Aug 9, 2025

>> Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing the PR. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics once the CR is deleted.
>
> Thanks for pointing this out!
>
> From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In that case, the RayJob CR is deleted immediately after the job finishes, so metrics like kuberay_job_execution_duration_seconds may never be collected, because they will likely be deleted as soon as they are produced.

I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I just think we can start with a simpler implementation. What do you think?

@phantom5125 (Author) commented

> I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I just think we can start with a simpler implementation. What do you think?

Ok, I will take your suggestion and update the PR soon!

@phantom5125 changed the title from "[Feature] Add TTL-based cleanup for terminated RayJob/RayCluster metrics" to "[Feature] Add cleanup for terminated RayJob/RayCluster metrics" Aug 9, 2025
@phantom5125 marked this pull request as ready for review August 9, 2025 19:56
@phantom5125 (Author) commented

@troychiu PTAL, thanks!

@troychiu (Contributor) left a comment


thank you for the contribution!

@@ -89,6 +89,12 @@ func (r *RayClusterMetricsManager) ObserveRayClusterProvisionedDuration(name, na
    r.rayClusterProvisionedDurationSeconds.WithLabelValues(name, namespace).Set(duration)
}

// DeleteRayClusterMetrics removes metrics that belong to the specified RayCluster.
func (r *RayClusterMetricsManager) DeleteRayClusterMetrics(name, namespace string) {
    numCleanedUpMetrics := r.rayClusterProvisionedDurationSeconds.DeletePartialMatch(prometheus.Labels{"name": name, "namespace": namespace})
Contributor comment:

Is there a reason we use DeletePartialMatch instead of Delete?
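
For context on this question (not part of the PR diff): in prometheus/client_golang, Delete removes a single series and only succeeds when given the complete label set of the metric vector, while DeletePartialMatch removes every series whose labels contain the supplied subset and returns how many it removed. Below is a minimal sketch of the difference, using a standalone gauge that mirrors kuberay_cluster_condition_provisioned (which, per the sample output above, carries a condition label in addition to name and namespace); the variable names are illustrative only.

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // Standalone gauge mirroring kuberay_cluster_condition_provisioned:
    // it has a "condition" label besides "name" and "namespace".
    provisioned := prometheus.NewGaugeVec(
        prometheus.GaugeOpts{Name: "kuberay_cluster_condition_provisioned"},
        []string{"name", "namespace", "condition"},
    )
    provisioned.WithLabelValues("rayjob-sample-clwvk", "default", "true").Set(1)

    // Delete requires the full label set; with only name and namespace it matches nothing.
    deleted := provisioned.Delete(prometheus.Labels{"name": "rayjob-sample-clwvk", "namespace": "default"})
    fmt.Println(deleted) // false

    // DeletePartialMatch removes every series whose labels include the given pairs,
    // whatever the value of the remaining "condition" label, and reports the count.
    n := provisioned.DeletePartialMatch(prometheus.Labels{"name": "rayjob-sample-clwvk", "namespace": "default"})
    fmt.Println(n) // 1
}

Judging from the sample metrics above, kuberay_cluster_provisioned_duration_seconds only carries name and namespace labels, so for that vector the two calls behave the same; the choice mainly matters for vectors with extra labels such as condition or owner_kind.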

@@ -135,6 +135,7 @@ func (r *RayClusterReconciler) Reconcile(ctx context.Context, request ctrl.Reque
    if errors.IsNotFound(err) {
        // Clear all related expectations
        r.rayClusterScaleExpectation.Delete(instance.Name, instance.Namespace)
        r.options.RayClusterMetricsManager.DeleteRayClusterMetrics(request.Name, request.Namespace)
Contributor comment:

RayClusterMetricsManager may be nil, so a check would be required. I would suggest an approach similar to the one used for metric emission; that is, have a helper function in the controller that (a rough sketch follows the referenced signature below):

  1. checks whether the metrics manager is nil
  2. calls the metrics manager to actually delete the metrics

see

func emitRayClusterMetrics(rayClusterMetricsManager *metrics.RayClusterMetricsManager, clusterName, namespace string, oldStatus, newStatus rayv1.RayClusterStatus, creationTimestamp time.Time) {
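
A minimal sketch of the suggested helper, mirroring the emitRayClusterMetrics pattern referenced above; the name cleanupRayClusterMetrics is hypothetical and not part of the PR, and the snippet assumes the controller package's existing metrics import.

// cleanupRayClusterMetrics is a hypothetical controller helper that guards the
// nil case before delegating to the metrics manager, mirroring emitRayClusterMetrics.
func cleanupRayClusterMetrics(rayClusterMetricsManager *metrics.RayClusterMetricsManager, clusterName, namespace string) {
    // 1. Check whether the metrics manager is nil (it is optional and may not be configured).
    if rayClusterMetricsManager == nil {
        return
    }
    // 2. Call the metrics manager to actually delete the metrics for this RayCluster.
    rayClusterMetricsManager.DeleteRayClusterMetrics(clusterName, namespace)
}

The Reconcile call site shown above would then invoke cleanupRayClusterMetrics(r.options.RayClusterMetricsManager, request.Name, request.Namespace) instead of calling the manager directly.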
