Skip to content

Conversation

YuxiaoWang-520
Copy link

@YuxiaoWang-520 YuxiaoWang-520 commented Aug 28, 2025

Why are these changes needed?

This PR adds the Custom Resource (CR) UID field to KubeRay metrics to distinguish between custom resources with the same name. Previously, metrics only included name and namespace labels, which could cause ambiguity when a RayCluster or RayJob is deleted and later recreated with the same name in the same namespace.

The key improvements include:

  • Added uid label to all KubeRay metrics (RayCluster, RayJob and RayService metrics)
  • Updated metric collection methods to include the CR UID parameter
  • Enhanced metric uniqueness for better observability and monitoring

This change enables users to:

  • Distinguish between different instances of resources with identical names

Related issue number

Closes #3754

Testing Results

I have tested this feature with actual RayCluster and RayJob resources. Here are some examples from curl http://0.0.0.0:8080/metrics | grep uid:
RayCluster metrics with UID:

  • kuberay_cluster_info{name="test-metrics-v5",namespace="yanquan-test",owner_kind="None",uid="f1878f09-c183-4f51-b7a5-c4ebb9d19d94"} 1

RayJob metrics with UID:

  • kuberay_job_info{name="dc-samplev5",namespace="yanquan-test",uid="551484de-b938-4608-abf7-b019db0cd692"} 1

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421
Copy link
Member

cc @troychiu @win5923 @owenowenisme Would you mind reviewing this PR? Thanks.

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, It looks pretty nice!

Copy link
Contributor

@owenowenisme owenowenisme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks!

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide a screenshot with UID in kuberay metrics? this will help a lot!

@YuxiaoWang-520
Copy link
Author

Can you provide a screenshot with UID in kuberay metrics? this will help a lot!

Hi there! I've captured 3 screenshots that correspond to three different metrics, and I can confirm that the UID is displaying correctly across all of them.
image
image
image

Copy link
Contributor

@win5923 win5923 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I will open a follow up PR for ray doc and Grafana Dashboard.

@rueian
Copy link
Contributor

rueian commented Aug 30, 2025

Ping @troychiu for review.

Copy link
Contributor

@troychiu troychiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Copy link
Contributor

@troychiu troychiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since #3923 has been merged, I think it would be better to include uid when we reset metrics.

numCleanedUpMetrics := r.rayClusterProvisionedDurationSeconds.DeletePartialMatch(prometheus.Labels{"name": name, "namespace": namespace})

@YuxiaoWang-520
Copy link
Author

YuxiaoWang-520 commented Sep 1, 2025

Since #3923 has been merged, I think it would be better to include uid when we reset metrics.

numCleanedUpMetrics := r.rayClusterProvisionedDurationSeconds.DeletePartialMatch(prometheus.Labels{"name": name, "namespace": namespace})

Hi! I've reviewed the metrics cleanup PR, and I'd like to discuss whether modifications are necessary.

In the metrics cleanup PR, whenever a CR finishes, it cleans up its metrics using the name and namespace. This approach seems sufficient since only one CR with the same name and namespace can exist, so there shouldn't be a need for additional UID-based differentiation.

My PR simply appends the UID value to each metric for distinction purposes and should be decoupled from the metrics cleanup feature. Perhaps we could address this in a follow-up PR to keep the concerns separated.

@troychiu
Copy link
Contributor

troychiu commented Sep 1, 2025

I think what you are saying is correct, but I am not sure if there is any difference in performance. Are you aware of any?

IMO, It won't be too much work but I am fine with a follow-up PR.

@YuxiaoWang-520
Copy link
Author

I think what you are saying is correct, but I am not sure if there is any difference in performance. Are you aware of any?

IMO, It won't be too much work but I am fine with a follow-up PR.

Hi! Thanks for your feedback! 👍 Here are my thoughts:

  1. Performance Impact: I agree that the name + namespace combination is typically unique, and the metrics will be cleaned up immediately when the CR is terminated. Unless we have clear evidence showing issues with the current approach, I'd suggest keeping the existing code unchanged for now. The current implementation using DeletePartialMatch should be sufficient for most use cases.

  2. Implementation Complexity: If we were to make changes, I think it wouldn't be straightforward to implement cleanly. Do you happen to have a simpler approach in mind?

    The main issue is that in the Reconcile function, we call cleanUpRayClusterMetrics, but the request parameter only contains name and namespace. To get the uid, we'd need some workarounds like caching a name/namespace → uid mapping in the RayClusterReconciler struct, which feels unnecessary and adds complexity.

    func (r *RayClusterReconciler) Reconcile(ctx context.Context, request ctrl.Request) (ctrl.Result, error) {
    logger := ctrl.LoggerFrom(ctx)
    var err error
    // Try to fetch the RayCluster instance
    instance := &rayv1.RayCluster{}
    if err = r.Get(ctx, request.NamespacedName, instance); err == nil {
    return r.rayClusterReconcile(ctx, instance)
    }
    // No match found
    if errors.IsNotFound(err) {
    // Clear all related expectations
    r.rayClusterScaleExpectation.Delete(instance.Name, instance.Namespace)
    cleanUpRayClusterMetrics(r.options.RayClusterMetricsManager, request.Name, request.Namespace)
    } else {
    logger.Error(err, "Read request instance error!")
    }
    // Error reading the object - requeue the request.
    return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    I'm totally fine with a follow-up PR approach if you think there's value in exploring this further. What do you think? 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Include CR UID in kuberay metrics
7 participants