
feat: add updaterun metrics and update Progressing condition#1107

Closed
jwtty wants to merge 10 commits into Azure:main from jwtty:updaterun-metrics

Conversation

@jwtty
Contributor

@jwtty jwtty commented Apr 7, 2025

Description of your changes

  1. When an updateRun is stuck waiting for a cluster update to finish or waiting for after-stage tasks to complete, we mark the Progressing condition as false and add a corresponding reason.
  2. Add updateRun status metrics based on the updateRun conditions. The metric carries name, generation, condition, status, and reason labels, covering states such as progressing, stuck, waiting, completed, and failed. The value is the timestamp at which the status metric is emitted (see the declaration sketch after the sample output below).

Generated metrics look like:

fleet_workload_update_run_status_last_timestamp_seconds{condition="Progressing",generation="1",name="example-run",reason="UpdateRunStarted",status="True"} 1.744335068438203e+09
fleet_workload_update_run_status_last_timestamp_seconds{condition="Progressing",generation="1",name="example-run",reason="UpdateRunWaiting",status="False"} 1.7443350222811694e+09
fleet_workload_update_run_status_last_timestamp_seconds{condition="Succeeded",generation="1",name="example-run",reason="UpdateRunSucceeded",status="True"} 1.7443350684599657e+09
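
For reference, here is a minimal sketch of how such a gauge vector could be declared with prometheus/client_golang. The metric name and label set are taken from the sample output above and from the WithLabelValues calls quoted later in the review; the PR's actual metrics.go may differ.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// FleetUpdateRunStatusLastTimestampSeconds records the last time each updateRun
// condition/status/reason combination was observed. Label order is assumed to match
// the WithLabelValues(name, generation, condition, status, reason) calls.
var FleetUpdateRunStatusLastTimestampSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "fleet_workload_update_run_status_last_timestamp_seconds",
	Help: "Timestamp (in seconds) of the last observed updateRun status condition.",
}, []string{"name", "generation", "condition", "status", "reason"})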

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

@kaito-pr-agent

kaito-pr-agent bot commented Apr 7, 2025

Failed to generate code suggestions for PR

@jwtty jwtty force-pushed the updaterun-metrics branch from 41a344e to c2efbf4 on April 7, 2025 18:33
@kaito-pr-agent

kaito-pr-agent bot commented Apr 7, 2025

Title

(Description updated until commit 2ab2781)

feat: add updaterun metrics and update Progressing condition


Description

  • Added updaterun metrics to track the status of update runs.

  • Implemented logic to emit metrics based on the status conditions of update runs.

  • Created integration tests to validate the emission and removal of update run status metrics.

  • Updated condition reasons to include new statuses such as stuck and waiting.

  • Refactored code into separate modules for better organization and maintainability.


Changes walkthrough 📝

Relevant files

Enhancement (5 files)
  • controller.go: Add updaterun status metric emission logic (+41/-1)
  • execution.go: Implement logic to mark update run as stuck or waiting (+36/-0)
  • condition.go: Add new condition reasons for stuck and waiting states (+6/-0)
  • metrics.go: Add new metric for update run status timestamps (+8/-0)
  • metrics.go: Add utility functions for comparing Prometheus metrics (+29/-0)

Tests (2 files)
  • controller_integration_test.go: Add integration tests for update run status metrics (+126/-0)
  • execution_integration_test.go: Add tests to validate update run status metrics emission (+123/-0)

Bug fix (2 files)
  • initialization_integration_test.go: Ensure initialization failure metrics are emitted (+50/-0)
  • validation_integration_test.go: Ensure validation failure metrics are emitted (+31/-0)

Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
    @kaito-pr-agent

    kaito-pr-agent bot commented Apr 7, 2025

    PR Reviewer Guide 🔍

    (Review updated until commit 2ab2781)

    Here are some key observations to aid the review process:

    🎫 Ticket compliance analysis 🔶

    500 - Partially compliant

    Compliant requirements:

    • Add updaterun metrics
    • Update Progressing condition

    Non-compliant requirements:

    • None

    Requires further human verification:

    • None
     Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Potential Bug

    The emitUpdateRunStatusMetric function may not correctly handle cases where multiple conditions are simultaneously true, leading to incorrect metric emission.

    	}
    	// enqueue to the updaterun controller queue.
    	q.Add(reconcile.Request{
    		NamespacedName: types.NamespacedName{Name: updateRun},
    	})
    }
    
    // emitUpdateRunStatusMetric emits the update run status metric based on status conditions in the updateRun.
    func emitUpdateRunStatusMetric(updateRun *placementv1beta1.ClusterStagedUpdateRun) {
    	generation := updateRun.Generation
    	genStr := strconv.FormatInt(generation, 10)
    
    	succeedCond := meta.FindStatusCondition(updateRun.Status.Conditions, string(placementv1beta1.StagedUpdateRunConditionSucceeded))
    	if condition.IsConditionStatusTrue(succeedCond, generation) || condition.IsConditionStatusFalse(succeedCond, generation) {
    		metrics.FleetUpdateRunStatusLastTimestampSeconds.WithLabelValues(updateRun.Name, genStr,
    			string(placementv1beta1.StagedUpdateRunConditionSucceeded), string(succeedCond.Status), succeedCond.Reason).SetToCurrentTime()
    		return
    	}
    
    	progressingCond := meta.FindStatusCondition(updateRun.Status.Conditions, string(placementv1beta1.StagedUpdateRunConditionProgressing))
    	if condition.IsConditionStatusTrue(progressingCond, generation) || condition.IsConditionStatusFalse(progressingCond, generation) {
    		metrics.FleetUpdateRunStatusLastTimestampSeconds.WithLabelValues(updateRun.Name, genStr,
    			string(placementv1beta1.StagedUpdateRunConditionProgressing), string(progressingCond.Status), progressingCond.Reason).SetToCurrentTime()
    		return
    	}
    
    	initializedCond := meta.FindStatusCondition(updateRun.Status.Conditions, string(placementv1beta1.StagedUpdateRunConditionInitialized))
    	if condition.IsConditionStatusTrue(initializedCond, generation) || condition.IsConditionStatusFalse(initializedCond, generation) {
    		metrics.FleetUpdateRunStatusLastTimestampSeconds.WithLabelValues(updateRun.Name, genStr,
    			string(placementv1beta1.StagedUpdateRunConditionInitialized), string(initializedCond.Status), initializedCond.Reason).SetToCurrentTime()
    		return
    	}
    
    	// We should rarely reach here, it can only happen when updating updateRun status fails.
    	klog.V(2).InfoS("There's no valid status condition on updateRun, status updating failed possibly", "updateRun", klog.KObj(updateRun))
    }
    Potential Bug

    The validateUpdateRunMetricsEmitted function may not accurately verify the expected metrics if the order of metric emission varies slightly.

    			validateApprovalRequestCount(ctx, 1)
    		})
    
    	})
    })
    
    func initializeUpdateRunMetricsRegistry() *prometheus.Registry {
    	// Create a test registry
    	customRegistry := prometheus.NewRegistry()
    	Expect(customRegistry.Register(metrics.FleetUpdateRunStatusLastTimestampSeconds)).Should(Succeed())
    	// Reset metrics before each test
    	metrics.FleetUpdateRunStatusLastTimestampSeconds.Reset()
    	return customRegistry
    }
    
    func unregisterUpdateRunMetrics(registry *prometheus.Registry) {
    	Expect(registry.Unregister(metrics.FleetUpdateRunStatusLastTimestampSeconds)).Should(BeTrue())
    }
    
    // validateUpdateRunMetricsEmitted validates the update run status metrics are emitted and are emitted in the correct order.
    func validateUpdateRunMetricsEmitted(registry *prometheus.Registry, wantMetrics ...*prometheusclientmodel.Metric) {
    	Eventually(func() error {
    		metricFamilies, err := registry.Gather()
    		if err != nil {
    			return fmt.Errorf("failed to gather metrics: %w", err)
    		}
    		var gotMetrics []*prometheusclientmodel.Metric
    		for _, mf := range metricFamilies {
    			if mf.GetName() == "fleet_workload_update_run_status_last_timestamp_seconds" {
    				gotMetrics = mf.GetMetric()
    			}
    		}
    
    		if diff := cmp.Diff(gotMetrics, wantMetrics, metricsutils.MetricsCmpOptions...); diff != "" {
    			return fmt.Errorf("update run status metrics mismatch (-got, +want):\n%s", diff)
    		}
    
    		return nil
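
    For illustration, one wantMetrics entry for this helper could be assembled from prometheus client_model types roughly as below. This is a sketch with values copied from the sample output in the PR description; it assumes the test file's prometheusclientmodel alias for github.com/prometheus/client_model/go plus a proto helper (e.g. google.golang.org/protobuf/proto) for pointer construction, and it is not the PR's actual test code.

    wantProgressing := &prometheusclientmodel.Metric{
    	// Label names mirror the fleet_workload_update_run_status_last_timestamp_seconds sample output.
    	Label: []*prometheusclientmodel.LabelPair{
    		{Name: proto.String("condition"), Value: proto.String("Progressing")},
    		{Name: proto.String("generation"), Value: proto.String("1")},
    		{Name: proto.String("name"), Value: proto.String("example-run")},
    		{Name: proto.String("reason"), Value: proto.String("UpdateRunStarted")},
    		{Name: proto.String("status"), Value: proto.String("True")},
    	},
    	// The gauge value is a SetToCurrentTime timestamp, so the comparison options are
    	// expected to ignore (or tolerance-compare) it rather than match it exactly.
    	Gauge: &prometheusclientmodel.Gauge{Value: proto.Float64(0)},
    }
    validateUpdateRunMetricsEmitted(registry, wantProgressing)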

    @kaito-pr-agent

    kaito-pr-agent bot commented Apr 9, 2025

    Persistent review updated to latest commit 24feb17

    @jwtty jwtty force-pushed the updaterun-metrics branch from 24feb17 to 87adbf7 on April 9, 2025 00:36
    @kaito-pr-agent

    kaito-pr-agent bot commented Apr 9, 2025

    Persistent review updated to latest commit 87adbf7

    @jwtty jwtty force-pushed the updaterun-metrics branch from 87adbf7 to 04012b1 on April 9, 2025 03:45
    @kaito-pr-agent

    kaito-pr-agent bot commented Apr 9, 2025

    Persistent review updated to latest commit 04012b1

    @kaito-pr-agent

    Persistent review updated to latest commit 5d3a27a

    @jwtty jwtty changed the title from "feat: add updaterun metrics" to "feat: add updaterun metrics and update Progressing condition" on Apr 10, 2025
    @jwtty jwtty force-pushed the updaterun-metrics branch from 145a4c0 to 9daeb77 on April 11, 2025 00:43
    @kaito-pr-agent

    Persistent review updated to latest commit 9daeb77


    @kaito-pr-agent

    Persistent review updated to latest commit 934db55

    }

    // We should not reach here, as reconcile should NOT return when the updateRun is still initializing or initialized but not progressing.
    klog.V(2).ErrorS(controller.NewUnexpectedBehaviorError(fmt.Errorf("updateRun does not have valid conditions when emitting updateRun status metric")),
    Contributor

    It is possible, for example, that init has failed but the controller fails to update the condition because of an API server error. In this case, we need to emit unknown.

    Contributor Author

    Yeah, this can happen when the status update fails. I do not want to emit an unknown metric, as it could be confused with an actual Unknown status on the updateRun. I want to log first, since status update failures should be retried.

    @kaito-pr-agent

    Persistent review updated to latest commit 86451f9

    @jwtty jwtty force-pushed the updaterun-metrics branch from 86451f9 to 997f183 on April 12, 2025 00:00
    @jwtty jwtty force-pushed the updaterun-metrics branch from 997f183 to a11c447 on April 12, 2025 00:00
    @kaito-pr-agent

    Persistent review updated to latest commit a11c447


    @jwtty jwtty mentioned this pull request Apr 14, 2025
    @kaito-pr-agent

    Persistent review updated to latest commit 10783fa

    @kaito-pr-agent

    Persistent review updated to latest commit f48e547

    @jwtty jwtty force-pushed the updaterun-metrics branch from f48e547 to 2ab2781 on April 14, 2025 17:55
    @kaito-pr-agent

    Persistent review updated to latest commit 2ab2781

    	continue
    } else {
    	// If cluster update has been running for more than 1 minute, mark the update run as stuck.
    	timeElapsed := time.Since(clusterStartedCond.LastTransitionTime.Time)
    Contributor

    checkClusterUpdateResult waits until all the resources become available, which depends on the "timeToReady" setting chosen by the user. Therefore, updateRunStuckThreshold has to be the greater of 60 seconds and that value.

    Contributor Author

    I realized one thing: "timeToReady" is rolling-update specific. It's under RollingUpdateConfig and cannot be set when the rollout type is set to External. The wait is done in the rollingUpdate controller, not in the workApplier. For untrackable resources, the workApplier simply marks them as Available with the reason set to "Availability not trackable".

    Contributor

    Good point. Does it mean that we need to add this to the update run API too, or create some wait between clusters when there are untrackable resources?

    Contributor
    @zhiying-lin zhiying-lin Apr 15, 2025

    I noticed this problem when reviewing the code. Should we keep the default value at 1 min to stay consistent with the rollingUpdate config? Later, we can support a customized value.

    Contributor Author

    Yeah, it looks like we need to do a similar thing in updateRun. That will be a new feature.

    stageUpdatingWaitTime = 60 * time.Second

    // updateRunStuckThreshold is the time to wait on a single cluster update before marking update run as stuck.
    updateRunStuckThreshold = 60 * time.Second
    Contributor

    This seems a bit too aggressive; a normal deployment with many replicas can take a few minutes to get ready.

    Contributor Author

    Let me set it to 5 minutes for now. It will not cause the updateRun to stop; it just marks it in the status so users can investigate the CRP and see if there's an issue. If the issue is fixed or recovers automatically, the updateRun can continue and the status is removed. We also add metrics to track this so that we can get a better understanding and tweak it.
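
    For reference, the adjustment described here amounts to relaxing the constant quoted above, roughly as follows (a sketch of the discussed value, not necessarily the final code):

    // updateRunStuckThreshold is the time to wait on a single cluster update before
    // marking the update run as stuck; relaxed from 1 minute to 5 minutes per this thread.
    updateRunStuckThreshold = 5 * time.Minute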

    @michaelawyu
    Contributor

    Hi Wantong! I am closing this PR as part of the CNCF repo migration process; please consider moving (re-creating) this PR in the new repo once the sync PR is merged. If there's any question/concern, please let me know. Thanks 🙏

    @jwtty jwtty deleted the updaterun-metrics branch August 27, 2025 22:44