
[Feature] Add cleanup for terminated RayJob/RayCluster metrics #3923


Open — wants to merge 2 commits into base: master

Conversation

@phantom5125 commented Aug 7, 2025

Why are these changes needed?

Some of our metrics are stored permanently in Prometheus, which might cause the /metrics endpoint to become slow or time out, so we need a lifecycle-based cleanup.

Quick-Check

Related issue number

#3820

End-to-end test example

$ kubectl apply -f ray-operator/config/samples/ray-job.sample.yaml

# $ kubectl port-forward <kuberay-operator-pod-name> 8080:8080

$ curl -s 127.0.0.1:8080/metrics | grep kuberay_
# HELP kuberay_cluster_condition_provisioned Indicates whether the RayCluster is provisioned
# TYPE kuberay_cluster_condition_provisioned gauge
kuberay_cluster_condition_provisioned{condition="true",name="rayjob-sample-clwvk",namespace="default"} 1
# HELP kuberay_cluster_info Metadata information about RayCluster custom resources
# TYPE kuberay_cluster_info gauge
kuberay_cluster_info{name="rayjob-sample-clwvk",namespace="default",owner_kind="RayJob"} 1
# HELP kuberay_cluster_provisioned_duration_seconds The time, in seconds, when a RayCluster's `RayClusterProvisioned` status transitions from false (or unset) to true
# TYPE kuberay_cluster_provisioned_duration_seconds gauge
kuberay_cluster_provisioned_duration_seconds{name="rayjob-sample-clwvk",namespace="default"} 1259.406597953
...

After the CR is deleted, there are no more metrics:

$ kubectl delete rayjob rayjob-sample
$ curl -s 127.0.0.1:8080/metrics | grep kuberay_

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@phantom5125 marked this pull request as draft August 7, 2025 19:02
@troychiu (Contributor) commented Aug 7, 2025

Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing the PR. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics once the CR is deleted.

@phantom5125 (Author) commented

> Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing the PR. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics once the CR is deleted.

Thanks for pointing this out!

From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In that case, the RayJob CR is deleted immediately after the job finishes, so metrics like kuberay_job_execution_duration_seconds may never be collected, because they will likely be deleted as soon as they are produced.

@troychiu (Contributor) commented Aug 9, 2025

>> Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing the PR. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics once the CR is deleted.
>
> Thanks for pointing this out!
>
> From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In that case, the RayJob CR is deleted immediately after the job finishes, so metrics like kuberay_job_execution_duration_seconds may never be collected, because they will likely be deleted as soon as they are produced.

I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I just think we can start with a simpler implementation. What do you think?

@phantom5125 (Author) commented

> I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I just think we can start with a simpler implementation. What do you think?

Ok, I will take your suggestion and update the PR soon!

@phantom5125 changed the title from "[Feature] Add TTL-based cleanup for terminated RayJob/RayCluster metrics" to "[Feature] Add cleanup for terminated RayJob/RayCluster metrics" Aug 9, 2025
@phantom5125 marked this pull request as ready for review August 9, 2025 19:56
@phantom5125 (Author) commented

@troychiu PTAL, thanks!

@troychiu (Contributor) left a comment


thank you for the contribution!

@@ -89,6 +89,12 @@ func (r *RayClusterMetricsManager) ObserveRayClusterProvisionedDuration(name, na
    r.rayClusterProvisionedDurationSeconds.WithLabelValues(name, namespace).Set(duration)
}

// DeleteRayClusterMetrics removes metrics that belong to the specified RayCluster.
func (r *RayClusterMetricsManager) DeleteRayClusterMetrics(name, namespace string) {
    numCleanedUpMetrics := r.rayClusterProvisionedDurationSeconds.DeletePartialMatch(prometheus.Labels{"name": name, "namespace": namespace})
Contributor comment:

Is there a reason we use DeletePartialMatch instead of Delete?
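
For context on this question (not part of the PR diff): in prometheus/client_golang, Delete removes a single series and only succeeds when given the complete label set of the metric vector, while DeletePartialMatch removes every series whose labels contain the supplied subset and returns how many it removed. Below is a minimal sketch of the difference, using a standalone gauge that mirrors kuberay_cluster_condition_provisioned (which, per the sample output above, carries a condition label in addition to name and namespace); the variable names are illustrative only.

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // Standalone gauge mirroring kuberay_cluster_condition_provisioned:
    // it has a "condition" label besides "name" and "namespace".
    provisioned := prometheus.NewGaugeVec(
        prometheus.GaugeOpts{Name: "kuberay_cluster_condition_provisioned"},
        []string{"name", "namespace", "condition"},
    )
    provisioned.WithLabelValues("rayjob-sample-clwvk", "default", "true").Set(1)

    // Delete requires the full label set; with only name and namespace it matches nothing.
    deleted := provisioned.Delete(prometheus.Labels{"name": "rayjob-sample-clwvk", "namespace": "default"})
    fmt.Println(deleted) // false

    // DeletePartialMatch removes every series whose labels include the given pairs,
    // whatever the value of the remaining "condition" label, and reports the count.
    n := provisioned.DeletePartialMatch(prometheus.Labels{"name": "rayjob-sample-clwvk", "namespace": "default"})
    fmt.Println(n) // 1
}

Judging from the sample metrics above, kuberay_cluster_provisioned_duration_seconds only carries name and namespace labels, so for that vector the two calls behave the same; the choice mainly matters for vectors with extra labels such as condition or owner_kind.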

@@ -135,6 +135,7 @@ func (r *RayClusterReconciler) Reconcile(ctx context.Context, request ctrl.Reque
    if errors.IsNotFound(err) {
        // Clear all related expectations
        r.rayClusterScaleExpectation.Delete(instance.Name, instance.Namespace)
        r.options.RayClusterMetricsManager.DeleteRayClusterMetrics(request.Name, request.Namespace)
Contributor comment:

RayClusterMetricsManager may be nil, so a check would be required. I would suggest an approach similar to the one used for metric emission; that is, have a helper function in the controller that (a rough sketch follows the referenced signature below):

  1. checks whether the metrics manager is nil
  2. calls the metrics manager to actually delete the metrics

see

func emitRayClusterMetrics(rayClusterMetricsManager *metrics.RayClusterMetricsManager, clusterName, namespace string, oldStatus, newStatus rayv1.RayClusterStatus, creationTimestamp time.Time) {
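
A minimal sketch of the suggested helper, mirroring the emitRayClusterMetrics pattern referenced above; the name cleanupRayClusterMetrics is hypothetical and not part of the PR, and the snippet assumes the controller package's existing metrics import.

// cleanupRayClusterMetrics is a hypothetical controller helper that guards the
// nil case before delegating to the metrics manager, mirroring emitRayClusterMetrics.
func cleanupRayClusterMetrics(rayClusterMetricsManager *metrics.RayClusterMetricsManager, clusterName, namespace string) {
    // 1. Check whether the metrics manager is nil (it is optional and may not be configured).
    if rayClusterMetricsManager == nil {
        return
    }
    // 2. Call the metrics manager to actually delete the metrics for this RayCluster.
    rayClusterMetricsManager.DeleteRayClusterMetrics(clusterName, namespace)
}

The Reconcile call site shown above would then invoke cleanupRayClusterMetrics(r.options.RayClusterMetricsManager, request.Name, request.Namespace) instead of calling the manager directly.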
