[Feature] Add cleanup for terminated RayJob/RayCluster metrics #3923
Conversation
Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing it. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics as soon as the CR is deleted.
Thanks for pointing that out! From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In that case, the RayJob CR is deleted immediately after the job finishes, so its metrics could be removed before Prometheus has a chance to scrape them.
I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I just think we should start with a simpler implementation. What do you think?
Ok, I will take your suggestion and update the PR soon!
@troychiu PTAL, thanks!
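The approach agreed above — delete a CR's metrics as soon as the CR itself is deleted, with no TTL — can be sketched roughly as follows. Note that the manager type, the map-based metric store, and the method bodies here are hypothetical stand-ins for illustration, not KubeRay's actual implementation:

```go
package main

import "fmt"

// metricKey identifies a per-CR metric series by its label values.
type metricKey struct {
	name      string
	namespace string
}

// MetricsManager is a hypothetical stand-in for RayClusterMetricsManager:
// it tracks one gauge value per RayCluster CR.
type MetricsManager struct {
	provisionedDuration map[metricKey]float64
}

func NewMetricsManager() *MetricsManager {
	return &MetricsManager{provisionedDuration: map[metricKey]float64{}}
}

// Observe records a value for the given CR's series.
func (m *MetricsManager) Observe(name, namespace string, seconds float64) {
	m.provisionedDuration[metricKey{name, namespace}] = seconds
}

// DeleteRayClusterMetrics drops every series belonging to the deleted CR,
// mirroring the cleanup-on-deletion approach (no TTL involved).
func (m *MetricsManager) DeleteRayClusterMetrics(name, namespace string) {
	delete(m.provisionedDuration, metricKey{name, namespace})
}

func main() {
	mgr := NewMetricsManager()
	mgr.Observe("raycluster-a", "default", 12.5)
	mgr.Observe("raycluster-b", "default", 7.0)

	// Reconcile observes that raycluster-a's CR is gone: clean up immediately.
	mgr.DeleteRayClusterMetrics("raycluster-a", "default")

	fmt.Println(len(mgr.provisionedDuration)) // 1
	_, stillThere := mgr.provisionedDuration[metricKey{"raycluster-b", "default"}]
	fmt.Println(stillThere) // true
}
```

In the real controller, the deletion hook would run from the reconcile loop when the CR is found to be absent, so metric lifetime tracks CR lifetime exactly.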
thank you for the contribution!
LGTM Thank you! cc @kevin85421
cc @rueian could you review this PR? Thanks!
@rueian PTAL, thanks!
// NOTE: Uses Delete() because this metric has only "name" and "namespace" labels.
// If more labels are added, switch to DeletePartialMatch(); otherwise it may not clean up all metrics correctly.
func (r *RayClusterMetricsManager) DeleteRayClusterMetrics(name, namespace string) {
	numCleanedUpMetrics := r.rayClusterProvisionedDurationSeconds.Delete(prometheus.Labels{"name": name, "namespace": namespace})
Hi @phantom5125, I think we should use your original proposal, DeletePartialMatch, now. Having a comment here doesn't help us switch to DeletePartialMatch in the future, I believe.
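For context on the Delete vs. DeletePartialMatch discussion above: in prometheus/client_golang, Delete removes only the series whose label set matches the given labels exactly, while DeletePartialMatch removes every series whose labels are a superset of them. The stdlib-only sketch below illustrates that matching difference; the map-backed store is an illustration of the semantics, not how the client library stores series internally:

```go
package main

import "fmt"

// series is one flattened label set, standing in for a single metric series.
type series map[string]string

// vec holds all series of one metric, standing in for a prometheus MetricVec.
type vec struct {
	data []series
}

// matchExact reports whether the series' labels equal the target labels exactly,
// mirroring Delete's matching rule.
func matchExact(s, labels map[string]string) bool {
	if len(s) != len(labels) {
		return false
	}
	for k, v := range labels {
		if s[k] != v {
			return false
		}
	}
	return true
}

// matchPartial reports whether the series carries all target labels (it may
// have extras), mirroring DeletePartialMatch's matching rule.
func matchPartial(s, labels map[string]string) bool {
	for k, v := range labels {
		if s[k] != v {
			return false
		}
	}
	return true
}

// deleteWhere removes all series satisfying match and reports how many were dropped.
func (v *vec) deleteWhere(labels map[string]string, match func(s, l map[string]string) bool) int {
	kept := v.data[:0]
	removed := 0
	for _, s := range v.data {
		if match(s, labels) {
			removed++
		} else {
			kept = append(kept, s)
		}
	}
	v.data = kept
	return removed
}

func main() {
	// Two series for the same CR; the second carries an extra "condition" label.
	mk := func() *vec {
		return &vec{data: []series{
			{"name": "rc", "namespace": "ns"},
			{"name": "rc", "namespace": "ns", "condition": "true"},
		}}
	}
	target := map[string]string{"name": "rc", "namespace": "ns"}

	fmt.Println(mk().deleteWhere(target, matchExact))   // 1: only the exact-label series
	fmt.Println(mk().deleteWhere(target, matchPartial)) // 2: both series match partially
}
```

This is why an exact-match Delete silently stops cleaning up series the moment a new label is added to the metric, which is the risk the reviewer is flagging.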
This PR has passed CI and received approval; I believe it is ready to merge.
Thank you @phantom5125
Why are these changes needed?
Since some of our metrics are stored permanently in Prometheus, the /metrics endpoint might become slow or time out, so we need a lifecycle-based cleanup.
Related issue number
Closes #3820
End-to-end test example
After the CR is deleted, the corresponding metrics are no longer exposed.
Checks