[Feature] Add cleanup for terminated RayJob/RayCluster metrics #3923
Conversation
Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing it. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics as soon as the CR is deleted.
Thanks for pointing that out! From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In that case, the RayJob CR is deleted immediately after the job finishes, so its metrics could be removed before Prometheus has a chance to scrape them.
I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I just think we should start with a simpler implementation. What do you think?
Ok, I will take your suggestion and update the PR soon!
@troychiu PTAL, thanks!
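The approach agreed above — delete a CR's metrics as soon as the CR itself is deleted, with no TTL — can be sketched roughly as follows. Note that the manager type, the map-based metric store, and the method bodies here are hypothetical stand-ins for illustration, not KubeRay's actual implementation:

```go
package main

import "fmt"

// metricKey identifies a per-CR metric series by its label values.
type metricKey struct {
	name      string
	namespace string
}

// MetricsManager is a hypothetical stand-in for RayClusterMetricsManager:
// it tracks one gauge value per RayCluster CR.
type MetricsManager struct {
	provisionedDuration map[metricKey]float64
}

func NewMetricsManager() *MetricsManager {
	return &MetricsManager{provisionedDuration: map[metricKey]float64{}}
}

// Observe records a value for the given CR's series.
func (m *MetricsManager) Observe(name, namespace string, seconds float64) {
	m.provisionedDuration[metricKey{name, namespace}] = seconds
}

// DeleteRayClusterMetrics drops every series belonging to the deleted CR,
// mirroring the cleanup-on-deletion approach (no TTL involved).
func (m *MetricsManager) DeleteRayClusterMetrics(name, namespace string) {
	delete(m.provisionedDuration, metricKey{name, namespace})
}

func main() {
	mgr := NewMetricsManager()
	mgr.Observe("raycluster-a", "default", 12.5)
	mgr.Observe("raycluster-b", "default", 7.0)

	// Reconcile observes that raycluster-a's CR is gone: clean up immediately.
	mgr.DeleteRayClusterMetrics("raycluster-a", "default")

	fmt.Println(len(mgr.provisionedDuration)) // 1
	_, stillThere := mgr.provisionedDuration[metricKey{"raycluster-b", "default"}]
	fmt.Println(stillThere) // true
}
```

In the real controller, the deletion hook would run from the reconcile loop when the CR is found to be absent, so metric lifetime tracks CR lifetime exactly.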
thank you for the contribution!
LGTM Thank you! cc @kevin85421
cc @rueian could you review this PR? Thanks!
@rueian PTAL, thanks!
// NOTE: Uses Delete() because this metric has only "name" and "namespace" labels.
// If more labels are added, switch to DeletePartialMatch(); otherwise it may not clean up all metrics correctly.
func (r *RayClusterMetricsManager) DeleteRayClusterMetrics(name, namespace string) {
	numCleanedUpMetrics := r.rayClusterProvisionedDurationSeconds.Delete(prometheus.Labels{"name": name, "namespace": namespace})
Hi @phantom5125, I think we should use your original proposal, DeletePartialMatch, now. Having a comment here doesn't help us switch to DeletePartialMatch in the future, I believe.
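For context on the Delete vs. DeletePartialMatch discussion above: in prometheus/client_golang, Delete removes only the series whose label set matches the given labels exactly, while DeletePartialMatch removes every series whose labels are a superset of them. The stdlib-only sketch below illustrates that matching difference; the map-backed store is an illustration of the semantics, not how the client library stores series internally:

```go
package main

import "fmt"

// series is one flattened label set, standing in for a single metric series.
type series map[string]string

// vec holds all series of one metric, standing in for a prometheus MetricVec.
type vec struct {
	data []series
}

// matchExact reports whether the series' labels equal the target labels exactly,
// mirroring Delete's matching rule.
func matchExact(s, labels map[string]string) bool {
	if len(s) != len(labels) {
		return false
	}
	for k, v := range labels {
		if s[k] != v {
			return false
		}
	}
	return true
}

// matchPartial reports whether the series carries all target labels (it may
// have extras), mirroring DeletePartialMatch's matching rule.
func matchPartial(s, labels map[string]string) bool {
	for k, v := range labels {
		if s[k] != v {
			return false
		}
	}
	return true
}

// deleteWhere removes all series satisfying match and reports how many were dropped.
func (v *vec) deleteWhere(labels map[string]string, match func(s, l map[string]string) bool) int {
	kept := v.data[:0]
	removed := 0
	for _, s := range v.data {
		if match(s, labels) {
			removed++
		} else {
			kept = append(kept, s)
		}
	}
	v.data = kept
	return removed
}

func main() {
	// Two series for the same CR; the second carries an extra "condition" label.
	mk := func() *vec {
		return &vec{data: []series{
			{"name": "rc", "namespace": "ns"},
			{"name": "rc", "namespace": "ns", "condition": "true"},
		}}
	}
	target := map[string]string{"name": "rc", "namespace": "ns"}

	fmt.Println(mk().deleteWhere(target, matchExact))   // 1: only the exact-label series
	fmt.Println(mk().deleteWhere(target, matchPartial)) // 2: both series match partially
}
```

This is why an exact-match Delete silently stops cleaning up series the moment a new label is added to the metric, which is the risk the reviewer is flagging.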
This PR has passed CI and received approval; I believe it is ready to merge.
Thank you @phantom5125
Why are these changes needed?
Since some of our metrics are stored permanently in Prometheus, the /metrics endpoint might become slow or time out, so we need a lifecycle-based cleanup.
Related issue number
Closes #3820
End-to-end test example
After the CR is deleted, the corresponding metrics are no longer exposed.
Checks