feat: Add placement status metrics #1077

Closed
britaniar wants to merge 19 commits into Azure:main from britaniar:addEdgeMetrics

@britaniar britaniar commented Mar 12, 2025

Description of your changes

Fixes #

I have added a metric to emit CRP condition statuses, along with a test.

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

@britaniar britaniar marked this pull request as ready for review March 12, 2025 23:28
@britaniar britaniar force-pushed the addEdgeMetrics branch 4 times, most recently from c25f067 to 42093bd Compare March 13, 2025 21:03
Contributor

@zhiying-lin zhiying-lin left a comment

Should we emit only the last condition if the CRP is not completed? Since the metric type is a gauge, we only need to know the current status.

@ryanzhang-oss
Contributor

@britaniar britaniar force-pushed the addEdgeMetrics branch 3 times, most recently from d7683e2 to 6f85e45 Compare March 29, 2025 00:32
kaito-pr-agent bot commented Apr 3, 2025

Title

(Describe updated until commit d3e408d)

feat: Emit detailed metrics for workload placements


Description

  • Added new metrics to track the completion status and last timestamp of workload placements.

  • Updated test cases to validate the newly added metrics.

  • Refactored metric emission logic into separate functions for better readability and maintainability.

  • Added support for emitting metrics for different generations of CRP specs.

  • Enhanced test coverage to include scenarios with different CRP states and conditions.


Changes walkthrough 📝

Relevant files

Enhancement

controller.go: Add metric emission and deletion logic
pkg/controllers/clusterresourceplacement/controller.go (+34/-3)

  • Added new functions emitPlacementStatusMetric and checkPlacementCompleteMetric
  • Updated metric deletion logic in handleDelete and handleUpdate
  • Refactored metric emission in handleUpdate

metrics.go: Add new Prometheus metrics
pkg/utils/controller/metrics/metrics.go (+13/-5)

  • Added new Prometheus metrics FleetPlacementCompleteLastTimeStampSeconds and FleetPlacementStatusLastTimeStampSeconds
  • Updated metric registration in init

metrics.go: Add utility package for metrics comparison
test/utils/metrics/metrics.go (+30/-0)

  • Created new package metrics with utility functions for comparing Prometheus metrics

Tests

controller_integration_test.go: Add integration tests for new metrics
pkg/controllers/clusterresourceplacement/controller_integration_test.go (+1310/-8)

  • Added new test cases for different CRP states and conditions
  • Updated existing test cases to validate new metrics
  • Created helper functions createAvailableClusterResourceBinding, updateClusterResourceBindingWithReportDiff, and checkPlacementStatusMetric

    kaito-pr-agent bot commented Apr 3, 2025

    PR Reviewer Guide 🔍

    (Review updated until commit d3e408d)

    Here are some key observations to aid the review process:

    🎫 Ticket compliance analysis 🔶

    500 - Partially compliant

    Compliant requirements:

    • Emit metrics for CRP status changes
    • Ensure metrics are emitted correctly based on CRP conditions

    Non-compliant requirements:

    • None

    Requires further human verification:

    • None
     Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Potential Metric Emission Issue

    The code adds new metrics but lacks proper initialization and registration. Ensure these metrics are properly registered in the init function.

    	klog.ErrorS(controller.NewUnexpectedBehaviorError(err), "We have encountered a fatal error that can't be retried, requeue after a day")
    	return ctrl.Result{}, nil // ignore this unexpected error
    }
    startTime := time.Now()
    klog.V(2).InfoS("ClusterResourcePlacement reconciliation starts", "clusterResourcePlacement", name)
    defer func() {
    	latency := time.Since(startTime).Milliseconds()
    	klog.V(2).InfoS("ClusterResourcePlacement reconciliation ends", "clusterResourcePlacement", name, "latency", latency)
    }()
    
    crp := fleetv1beta1.ClusterResourcePlacement{}
    if err := r.Client.Get(ctx, types.NamespacedName{Name: name}, &crp); err != nil {
    	if apierrors.IsNotFound(err) {
    		klog.V(4).InfoS("Ignoring NotFound clusterResourcePlacement", "clusterResourcePlacement", name)
    		return ctrl.Result{}, nil
    	}
    	klog.ErrorS(err, "Failed to get clusterResourcePlacement", "clusterResourcePlacement", name)
    	return ctrl.Result{}, controller.NewAPIServerError(true, err)

    Missing Metric Initialization

    The new metrics are declared but not initialized. Ensure they are properly initialized in the init function.

    // of active workers per controller.
    FleetActiveWorkers = prometheus.NewGaugeVec(prometheus.GaugeOpts{
    	Name: "fleet_workload_active_workers",
    	Help: "Number of currently used workers per controller",
    }, []string{"controller"})
    
    // FleetPlacementCompleteLastTimeStampSeconds is a prometheus metric which keeps track if the placement is complete.
    FleetPlacementCompleteLastTimeStampSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{
    	Name: "fleet_workload_placement_complete_last_timestamp_seconds",
    	Help: "Timestamp in seconds of the current completion status of crp. The 'isCompleted' label indicates whether CRP completion is true or false.",
    }, []string{"name", "isCompleted"})
    
    // FleetPlacementStatusLastTimeStampSeconds is a prometheus metric which keeps track of the last placement status.
    FleetPlacementStatusLastTimeStampSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{
    	Name: "fleet_workload_placement_status_last_timestamp_seconds",
    	Help: "Timestamp in seconds of the last current placement status condition of crp.",
    }, []string{"name", "generation", "conditionType", "status"})

    kaito-pr-agent bot commented Apr 3, 2025

    Persistent review updated to latest commit 8a89ce8

    kaito-pr-agent bot commented Apr 3, 2025

    Persistent review updated to latest commit 0200684

    kaito-pr-agent bot commented Apr 3, 2025

    Persistent review updated to latest commit c6a8490

    kaito-pr-agent bot commented Apr 4, 2025

    Persistent review updated to latest commit 222867e

    jwtty previously approved these changes Apr 7, 2025
    Contributor

    @jwtty jwtty left a comment

    LGTM

    kaito-pr-agent bot commented Apr 9, 2025

    Persistent review updated to latest commit c74a770

    kaito-pr-agent bot commented Apr 9, 2025

    Persistent review updated to latest commit 4ed96c6

    kaito-pr-agent bot commented Apr 9, 2025

    Persistent review updated to latest commit d446d00

    Persistent review updated to latest commit fef4bb6

    Persistent review updated to latest commit 07ffce0

    Persistent review updated to latest commit 1cec5f4

    Persistent review updated to latest commit 9bbdf16

    		metrics.FleetPlacementStatusLastTimeStampSeconds.WithLabelValues(crp.Name, strconv.FormatInt(crp.Generation, 10), string(condType.ClusterResourcePlacementConditionType()), status).SetToCurrentTime()
    		return
    	}
    }
    Contributor

    @zhiying-lin zhiying-lin Apr 11, 2025

    There are two options:

    1. Two metrics, as in the current change: a completed metric plus an incomplete placement-status metric.
    2. One metric: a PlacementStatus metric that records the last condition of the CRP, with isCompleted = ReportDiff == true || Available condition is true.

    Personally, I prefer option 2.

    Contributor Author

    I personally prefer option 1. The completed metric makes it much easier for dashboards to query whether the CRP is completed. If we later add more final conditions like Available and ReportDiff (I am not sure we will), we would have to keep changing the alert to include those cases. The second metric was only meant to provide more information in the alert, while the first helps with dashboards and makes it easier to filter through the logs.

    Contributor

    The current status metric never has a "good" status. How are we going to use this metric? I feel this metric is not independent: we always need to use it along with the completed metric. We first have to verify whether the CRP is still incomplete, and only then refer to this metric to see why it is still incomplete.

    Contributor

    I agree with Wantong: they seem coupled. I am not sure how to join them, though, since they carry different timestamps.

    Contributor Author

    Talked to Ryan. We will couple them.
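
A minimal sketch of what a coupled metric (option 2 in this thread) could compute, assuming a fixed evaluation order for conditions. The `Condition` struct, the `orderedTypes` list, and `lastStatus` are hypothetical stand-ins and not code from this PR; the condition names are illustrative, and the ReportDiff branch of isCompleted is left out for brevity.

```go
package main

import "fmt"

// Condition is a hypothetical, trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	ObservedGeneration int64
}

// orderedTypes is an assumed evaluation order with illustrative names;
// the real condition types are defined in the Fleet API package.
var orderedTypes = []string{"Scheduled", "Applied", "Available"}

// lastStatus returns the first condition (in order) that is missing, stale,
// or not "True" for the given generation. If every condition is "True", it
// returns the final condition and completed == true.
func lastStatus(conds []Condition, generation int64) (condType, status string, completed bool) {
	byType := make(map[string]Condition, len(conds))
	for _, c := range conds {
		byType[c.Type] = c
	}
	for _, t := range orderedTypes {
		c, ok := byType[t]
		if !ok || c.ObservedGeneration != generation {
			return t, "Unknown", false
		}
		if c.Status != "True" {
			return t, c.Status, false
		}
	}
	return orderedTypes[len(orderedTypes)-1], "True", true
}

func main() {
	conds := []Condition{
		{Type: "Scheduled", Status: "True", ObservedGeneration: 2},
		{Type: "Applied", Status: "False", ObservedGeneration: 2},
	}
	condType, status, completed := lastStatus(conds, 2)
	// These three values would become the labels of a single gauge series.
	fmt.Println(condType, status, completed)
}
```

With this shape, a dashboard could filter on the completed label alone, while an alert could read the condition type and status from the same series.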

    gotCRP = retrieveAndValidateClusterResourcePlacement(testCRPName, wantCRP)

    By("Ensure placement complete metric was emitted with isCompleted True")
    checkPlacementCompleteMetric(customRegistry, testCRPName, true, 2)
    Contributor

    We also need to validate the CRP status metrics.

    Contributor Author

    @britaniar britaniar Apr 11, 2025

    For a complete CRP, the test is very flaky, because which statuses get emitted depends on how fast the conditions update, and not all of the controllers work properly in the integration test. Originally, we wanted the placement status metric to record the last status of an incomplete CRP, so that the alert could show where it stopped.

    Contributor

    I see. Could you please add a comment there to clarify why we don't validate the status metrics here?

    Persistent review updated to latest commit 43250ec

    Persistent review updated to latest commit d3e408d

    if a.Name == nil || b.Name == nil {
    	return a.Name == nil
    }
    return *a.Name < *b.Name // Sort by label
    Contributor

    We could use a.GetName() < b.GetName() to avoid the nil comparison. I did that in #1107.
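
A small sketch of the getter-based comparison suggested here. The `LabelPair` type below is a hypothetical stand-in with a hand-written `GetName`, modeled on the nil-safe getters that protobuf-generated Go types provide; it is not the Prometheus client model type itself. Note that the quoted comparator returns true when both names are nil, so an element would compare as "less than" itself, which the getter form avoids.

```go
package main

import (
	"fmt"
	"sort"
)

// LabelPair is a hypothetical stand-in for a protobuf-generated label type
// whose fields are optional and therefore pointers.
type LabelPair struct {
	Name  *string
	Value *string
}

// GetName returns the empty string for a nil receiver or an unset name,
// mirroring the behavior of protobuf-generated getters.
func (l *LabelPair) GetName() string {
	if l == nil || l.Name == nil {
		return ""
	}
	return *l.Name
}

// sortLabels orders labels by name without any explicit nil checks.
func sortLabels(labels []*LabelPair) {
	sort.Slice(labels, func(i, j int) bool {
		return labels[i].GetName() < labels[j].GetName()
	})
}

func main() {
	b, a := "b", "a"
	labels := []*LabelPair{{Name: &b}, {Name: nil}, {Name: &a}}
	sortLabels(labels)
	for _, l := range labels {
		fmt.Printf("%q\n", l.GetName())
	}
}
```

Labels with an unset name sort first because the empty string compares lower than any non-empty name.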

    Contributor

    Is this function a duplicate?

    Contributor

    Yea, I just don't know whether this PR or #1107 will merge first. The latter one can always reuse this util.

    klog.ErrorS(updateErr, "Failed to update the status", "clusterResourcePlacement", crpKObj)
    return ctrl.Result{}, controller.NewUpdateIgnoreConflictError(updateErr)
    }
    metrics.FleetPlacementStatusLastTimeStampSeconds.DeletePartialMatch(prometheus.Labels{"name": crp.Name})
    Contributor

    Here you can call emitPlacementStatusMetric too.

    r.Recorder.Event(crp, corev1.EventTypeNormal, i.EventReasonForTrue(), i.EventMessageForTrue())
    }
    }
    emitPlacementStatusMetric(crp)
    Contributor

    I think it's better to call this metrics emit function in a defer function because we have so many returns in handleUpdate. We may easily miss cases if we call this function ad-hoc.
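
The defer suggestion could look roughly like the sketch below. Everything here is a simplified assumption: `placement` stands in for the ClusterResourcePlacement object, `emitPlacementStatusMetric` is a stub that only records calls instead of setting a Prometheus gauge, and the body of `handleUpdate` is invented to show several return paths.

```go
package main

import (
	"errors"
	"fmt"
)

// placement is a hypothetical stand-in for the ClusterResourcePlacement object.
type placement struct {
	Name      string
	Scheduled bool
}

// emitted records calls so the example can show that every path emits.
var emitted []string

// emitPlacementStatusMetric is a stub here; in the PR it sets a Prometheus gauge.
func emitPlacementStatusMetric(p *placement) {
	emitted = append(emitted, p.Name)
}

// handleUpdate registers the emit call once; it then runs on every return path.
func handleUpdate(p *placement) error {
	defer emitPlacementStatusMetric(p)

	if p.Name == "" {
		return errors.New("placement has no name") // error path still emits
	}
	if !p.Scheduled {
		return nil // early return still emits
	}
	return nil
}

func main() {
	_ = handleUpdate(&placement{})
	_ = handleUpdate(&placement{Name: "crp-1"})
	_ = handleUpdate(&placement{Name: "crp-2", Scheduled: true})
	fmt.Println(len(emitted))
}
```

Because the deferred call is evaluated when the function returns, it sees whatever state the object has at that point, so a new return added later does not need its own emit call.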

    retrieveAndValidateClusterResourcePlacement(testCRPName, wantCRP)

    By("Ensure placement complete metric was emitted with isCompleted False")
    wantCompleteMetrics := []*prometheusclientmodel.Metric{
    Contributor

    This incomplete metric can be refactored into a function. There's code duplication.

    }
    Expect(k8sClient.Status().Update(ctx, gotPolicySnapshot)).Should(Succeed(), "Failed to update the policy snapshot status")

    By("By creating clusterResourceBinding on member-1")
    Contributor

    Suggested change
    - By("By creating clusterResourceBinding on member-1")
    + By("Create clusterResourceBinding on member-1")

    Contributor

    I actually prefer using "Creating", but I see that none of the other code uses "ing".

    By("By creating clusterResourceBinding on member-1")
    member1Binding = createOverriddenClusterResourceBinding(member1Name, gotPolicySnapshot, gotResourceSnapshot)

    By("By validating the CRP status")
    Contributor

    Suggested change
    - By("By validating the CRP status")
    + By("Validate the CRP status")

    }
    checkPlacementStatusMetric(customRegistry, wantStatusMetrics)

    By("By creating a synchronized clusterResourceBinding on member-2")
    Contributor

    Suggested change
    - By("By creating a synchronized clusterResourceBinding on member-2")
    + By("Create a synchronized clusterResourceBinding on member-2")

    Expect(k8sClient.Update(ctx, crp)).Should(Succeed(), "Failed to update crp")
    Expect(k8sClient.Get(ctx, types.NamespacedName{Name: testCRPName}, crp)).Should(BeNil(), "Get() clusterResourcePlacement mismatch")

    By("By validating the CRP status with new spec")
    Contributor

    Suggested change
    - By("By validating the CRP status with new spec")
    + By("Validate the CRP status with new spec")

    Spec: crp.Spec,
    Status: placementv1beta1.ClusterResourcePlacementStatus{
    ObservedResourceIndex: "0",
    Conditions: []metav1.Condition{
    Contributor

    I think this piece of code can also be refactored.

    Comment on lines +97 to +98
    metrics.FleetPlacementCompleteLastTimeStampSeconds.DeletePartialMatch(prometheus.Labels{"name": crp.Name})
    metrics.FleetPlacementStatusLastTimeStampSeconds.DeletePartialMatch(prometheus.Labels{"name": crp.Name})
    Contributor

    I wonder why we moved the deletes earlier instead of replacing them at line 105?

    Contributor

    I asked to move these earlier. If the pod stops after the finalizer is removed but before the metrics are removed, the metrics would never be deleted.

    @michaelawyu
    Contributor

    Hi Britania! I am closing this PR as part of the CNCF repo migration process; please consider moving (re-creating) this PR in the new repo once the sync PR is merged. If there's any question/concern, please let me know. Thanks 🙏
