[Feature] Include CR UID in kuberay metrics #4003

YuxiaoWang-520 · 2025-08-28T07:22:32Z

Why are these changes needed?

This PR adds the Custom Resource (CR) UID field to KubeRay metrics to distinguish between custom resources with the same name. Previously, metrics only included name and namespace labels, which could cause ambiguity when a RayCluster or RayJob is deleted and later recreated with the same name in the same namespace.

The key improvements include:

- Added uid label to all KubeRay metrics (RayCluster, RayJob and RayService metrics)
- Updated metric collection methods to include the CR UID parameter
- Enhanced metric uniqueness for better observability and monitoring

This change lets users distinguish between different instances of resources that share the same name.
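To illustrate the ambiguity described above, here is a minimal sketch in plain Go. It is a toy model, not KubeRay or Prometheus client code: the `Labels` type and `seriesKey` helper are hypothetical names introduced only for this example, and the UID values are placeholders. It shows that two label sets with the same name and namespace collapse into one series, while adding a uid label keeps them apart.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Labels is a toy stand-in for a metric's label set (hypothetical type).
type Labels map[string]string

// seriesKey renders a label set in a canonical, sorted form, so two label
// sets identify the same series exactly when every label value matches.
func seriesKey(l Labels) string {
	keys := make([]string, 0, len(l))
	for k := range l {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, l[k]))
	}
	return "{" + strings.Join(parts, ",") + "}"
}

func main() {
	// A resource before and after delete-and-recreate, without a uid label.
	before := Labels{"name": "demo", "namespace": "default"}
	after := Labels{"name": "demo", "namespace": "default"}
	fmt.Println(seriesKey(before) == seriesKey(after)) // true: indistinguishable

	// The same two instances, each carrying its own (placeholder) uid.
	beforeUID := Labels{"name": "demo", "namespace": "default", "uid": "uid-a"}
	afterUID := Labels{"name": "demo", "namespace": "default", "uid": "uid-b"}
	fmt.Println(seriesKey(beforeUID) == seriesKey(afterUID)) // false: distinct series
}
```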

Related issue number

Closes #3754

Testing Results

I have tested this feature with actual RayCluster and RayJob resources. Here are some examples from the output of:

    curl http://0.0.0.0:8080/metrics | grep uid

RayCluster metrics with UID:

kuberay_cluster_info{name="test-metrics-v5",namespace="yanquan-test",owner_kind="None",uid="f1878f09-c183-4f51-b7a5-c4ebb9d19d94"} 1

RayJob metrics with UID:

kuberay_job_info{name="dc-samplev5",namespace="yanquan-test",uid="551484de-b938-4608-abf7-b019db0cd692"} 1

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

kevin85421 · 2025-08-28T18:01:39Z

cc @troychiu @win5923 @owenowenisme Would you mind reviewing this PR? Thanks.

Future-Outlier

LGTM, it looks pretty nice!

owenowenisme

LGTM, Thanks!

Future-Outlier

Can you provide a screenshot with UID in kuberay metrics? This will help a lot!

YuxiaoWang-520 · 2025-08-29T11:23:45Z

> Can you provide a screenshot with UID in kuberay metrics? This will help a lot!

Hi there! I've captured 3 screenshots that correspond to three different metrics, and I can confirm that the UID is displaying correctly across all of them.

win5923

Thanks! I will open a follow-up PR for the Ray docs and the Grafana dashboard.

rueian · 2025-08-30T18:29:39Z

Ping @troychiu for review.

troychiu

Thank you!

troychiu

Since #3923 has been merged, I think it would be better to include uid when we reset metrics.

kuberay/ray-operator/controllers/ray/metrics/ray_cluster_metrics.go

Line 94 in d56356b

    
    numCleanedUpMetrics := r.rayClusterProvisionedDurationSeconds.DeletePartialMatch(prometheus.Labels{"name": name, "namespace": namespace})

YuxiaoWang-520 · 2025-09-01T05:53:00Z

> Since #3923 has been merged, I think it would be better to include uid when we reset metrics.
>
> kuberay/ray-operator/controllers/ray/metrics/ray_cluster_metrics.go
>
> Line 94 in d56356b
>
> numCleanedUpMetrics := r.rayClusterProvisionedDurationSeconds.DeletePartialMatch(prometheus.Labels{"name": name, "namespace": namespace})

Hi! I've reviewed the metrics cleanup PR, and I'd like to discuss whether modifications are necessary.

In the metrics cleanup PR, whenever a CR finishes, it cleans up its metrics using the name and namespace. This approach seems sufficient since only one CR with the same name and namespace can exist, so there shouldn't be a need for additional UID-based differentiation.

My PR simply appends the UID value to each metric for distinction purposes and should be decoupled from the metrics cleanup feature. Perhaps we could address this in a follow-up PR to keep the concerns separated.

troychiu · 2025-09-01T17:49:24Z

I think what you are saying is correct, but I am not sure if there is any difference in performance. Are you aware of any?

IMO, it won't be too much work, but I am fine with a follow-up PR.

YuxiaoWang-520 · 2025-09-02T03:38:01Z

> I think what you are saying is correct, but I am not sure if there is any difference in performance. Are you aware of any?
>
> IMO, it won't be too much work, but I am fine with a follow-up PR.

Hi! Thanks for your feedback! 👍 Here are my thoughts:

Performance Impact: I agree that the name + namespace combination is typically unique, and the metrics will be cleaned up immediately when the CR is terminated. Unless we have clear evidence showing issues with the current approach, I'd suggest keeping the existing code unchanged for now. The current implementation using DeletePartialMatch should be sufficient for most use cases.

Implementation Complexity: If we were to make changes, I think it wouldn't be straightforward to implement cleanly. Do you happen to have a simpler approach in mind?

The main issue is that in the Reconcile function, we call cleanUpRayClusterMetrics, but the request parameter only contains name and namespace. To get the uid, we'd need some workarounds like caching a name/namespace → uid mapping in the RayClusterReconciler struct, which feels unnecessary and adds complexity.

kuberay/ray-operator/controllers/ray/raycluster_controller.go

Lines 113 to 133 in 3858146

    
    func (r *RayClusterReconciler) Reconcile(ctx context.Context, request ctrl.Request) (ctrl.Result, error) {
    	logger := ctrl.LoggerFrom(ctx)
    	var err error

    	// Try to fetch the RayCluster instance
    	instance := &rayv1.RayCluster{}
    	if err = r.Get(ctx, request.NamespacedName, instance); err == nil {
    		return r.rayClusterReconcile(ctx, instance)
    	}

    	// No match found
    	if errors.IsNotFound(err) {
    		// Clear all related expectations
    		r.rayClusterScaleExpectation.Delete(instance.Name, instance.Namespace)
    		cleanUpRayClusterMetrics(r.options.RayClusterMetricsManager, request.Name, request.Namespace)
    	} else {
    		logger.Error(err, "Read request instance error!")
    	}

    	// Error reading the object - requeue the request.
    	return ctrl.Result{}, client.IgnoreNotFound(err)
    }

I'm totally fine with a follow-up PR approach if you think there's value in exploring this further. What do you think? 😊
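For reference, the name/namespace → uid cache workaround mentioned above could look roughly like the following. This is a minimal sketch only: `uidCache`, `Remember`, and `Forget` are hypothetical names that do not exist in KubeRay, and wiring such a cache into the reconciler is left out. The sketch records the UID while the object can still be fetched and hands it back once the object is gone.

```go
package main

import (
	"fmt"
	"sync"
)

// key mirrors the namespace/name pair carried by a reconcile request.
type key struct {
	Namespace string
	Name      string
}

// uidCache is a hypothetical helper, not part of KubeRay. It remembers the
// last UID seen for each namespace/name pair.
type uidCache struct {
	mu   sync.Mutex
	uids map[key]string
}

func newUIDCache() *uidCache {
	return &uidCache{uids: make(map[key]string)}
}

// Remember records the UID while the object can still be fetched.
func (c *uidCache) Remember(namespace, name, uid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.uids[key{Namespace: namespace, Name: name}] = uid
}

// Forget returns the last known UID and drops the entry. It is meant to be
// called when the object is no longer found.
func (c *uidCache) Forget(namespace, name string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := key{Namespace: namespace, Name: name}
	uid, ok := c.uids[k]
	delete(c.uids, k)
	return uid, ok
}

func main() {
	c := newUIDCache()

	// On a successful fetch, remember the (placeholder) UID.
	c.Remember("default", "demo", "uid-a")

	// On NotFound, look up the UID to use in the cleanup labels.
	uid, ok := c.Forget("default", "demo")
	fmt.Println(uid, ok) // uid-a true

	// A second lookup finds nothing, since the entry was dropped.
	_, ok = c.Forget("default", "demo")
	fmt.Println(ok) // false
}
```

The extra state and locking shown here is the added complexity the comment refers to. A cache like this would also be empty after an operator restart, so a name-and-namespace cleanup would still be needed as a fallback.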

YuxiaoWang-520 added 3 commits August 26, 2025 16:26

[Feature] Include CR UID in kuberay metrics

6f5df00

Merge branch 'ray-project:master' into feature/AddUID

a0d5005

Merge branch 'ray-project:master' into feature/AddUID

c221778

YuxiaoWang-520 requested review from kevin85421, andrewsykim, rueian and MortalHappiness as code owners August 28, 2025 07:22

Future-Outlier approved these changes Aug 29, 2025

owenowenisme approved these changes Aug 29, 2025

Future-Outlier reviewed Aug 29, 2025

win5923 approved these changes Aug 29, 2025

rueian approved these changes Aug 30, 2025

troychiu approved these changes Aug 31, 2025

troychiu suggested changes Aug 31, 2025

Merge branch 'ray-project:master' into feature/AddUID

e06cd73

YuxiaoWang-520 added 2 commits September 1, 2025 14:11

fix test bug

21f7bc3

fix test bug

12e09a5

YuxiaoWang-520 added 2 commits September 2, 2025 10:45

fix test bug

7848f6d

Merge branch 'ray-project:master' into feature/AddUID

4ab7d58

YuxiaoWang-520 closed this Sep 2, 2025

YuxiaoWang-520 reopened this Sep 2, 2025
