[Feature] Add cleanup for terminated RayJob/RayCluster metrics #3923
Conversation
Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing the PR. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metric as soon as the CR is deleted.
Thanks for pointing this out! From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In this case, the RayJob CR is deleted immediately after the job finishes, so its metrics could be removed before Prometheus has a chance to scrape them.
I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I just think we can start with a simpler implementation. What do you think?
Ok, I will take your suggestion and update the PR soon!
@troychiu PTAL, thanks!
thank you for the contribution!
@@ -89,6 +89,12 @@ func (r *RayClusterMetricsManager) ObserveRayClusterProvisionedDuration(name, na
	r.rayClusterProvisionedDurationSeconds.WithLabelValues(name, namespace).Set(duration)
}

// DeleteRayClusterMetrics removes metrics that belong to the specified RayCluster.
func (r *RayClusterMetricsManager) DeleteRayClusterMetrics(name, namespace string) {
	numCleanedUpMetrics := r.rayClusterProvisionedDurationSeconds.DeletePartialMatch(prometheus.Labels{"name": name, "namespace": namespace})
Is there a reason we use DeletePartialMatch instead of Delete?
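For context, a minimal client_golang sketch of the difference being asked about; the metric and label names here are illustrative, and the vec is assumed to be keyed only by "name" and "namespace" like rayClusterProvisionedDurationSeconds:

package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	vec := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "ray_cluster_provisioned_duration_seconds"},
		[]string{"name", "namespace"},
	)
	vec.WithLabelValues("my-cluster", "default").Set(42)

	// Delete requires an exact match on every variable label and removes at most one series.
	vec.Delete(prometheus.Labels{"name": "my-cluster", "namespace": "default"})

	vec.WithLabelValues("my-cluster", "default").Set(42)

	// DeletePartialMatch removes every series whose labels contain the given pairs,
	// so it only differs from Delete when the vec carries labels beyond these two.
	vec.DeletePartialMatch(prometheus.Labels{"name": "my-cluster", "namespace": "default"})
}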
@@ -135,6 +135,7 @@ func (r *RayClusterReconciler) Reconcile(ctx context.Context, request ctrl.Reque
	if errors.IsNotFound(err) {
		// Clear all related expectations
		r.rayClusterScaleExpectation.Delete(instance.Name, instance.Namespace)
		r.options.RayClusterMetricsManager.DeleteRayClusterMetrics(request.Name, request.Namespace)
RayClusterMetricsManager may be nil, so a check would be required. I would suggest an approach similar to metric emission, that is, have a helper function in the controller to
- check if metric manager is nil
- call metric manager to actually delete the metric
see
func emitRayClusterMetrics(rayClusterMetricsManager *metrics.RayClusterMetricsManager, clusterName, namespace string, oldStatus, newStatus rayv1.RayClusterStatus, creationTimestamp time.Time) {
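A minimal sketch of such a helper, mirroring emitRayClusterMetrics; the function name is a placeholder, not the final implementation:

// Hypothetical helper in the RayCluster controller: skip cleanup when the
// metrics manager is not configured, otherwise delegate to the manager.
func cleanUpRayClusterMetrics(rayClusterMetricsManager *metrics.RayClusterMetricsManager, clusterName, namespace string) {
	if rayClusterMetricsManager == nil {
		return
	}
	rayClusterMetricsManager.DeleteRayClusterMetrics(clusterName, namespace)
}

The Reconcile call in the diff above would then become cleanUpRayClusterMetrics(r.options.RayClusterMetricsManager, request.Name, request.Namespace), which stays safe even when metrics are disabled.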
Why are these changes needed?
Since some of our metrics are stored in Prometheus permanently, the /metrics endpoint might become slow or time out, so we need a lifecycle-based cleanup.
Related issue number
#3820
End-to-end test example
After the CR is cleaned up, its metrics are no longer exposed.
Checks