
feat: add updaterun metrics and update Progressing condition#1107

Closed
jwtty wants to merge 10 commits into Azure:main from jwtty:updaterun-metrics

Conversation

@jwtty
Contributor

@jwtty jwtty commented Apr 7, 2025

Description of your changes

  1. When an updateRun is stuck waiting for a cluster update to finish or waiting for after-stage tasks to complete, we mark the Progressing condition as false and add a corresponding reason.
  2. Add updateRun status metrics based on the updateRun conditions. The metric carries name, generation, condition, status, and reason labels, covering states such as progressing, stuck, waiting, completed, and failed. The value is the timestamp at which the status metric is emitted (see the declaration sketch after the sample output below).

Generated metrics look like:

fleet_workload_update_run_status_last_timestamp_seconds{condition="Progressing",generation="1",name="example-run",reason="UpdateRunStarted",status="True"} 1.744335068438203e+09
fleet_workload_update_run_status_last_timestamp_seconds{condition="Progressing",generation="1",name="example-run",reason="UpdateRunWaiting",status="False"} 1.7443350222811694e+09
fleet_workload_update_run_status_last_timestamp_seconds{condition="Succeeded",generation="1",name="example-run",reason="UpdateRunSucceeded",status="True"} 1.7443350684599657e+09
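
For reference, here is a minimal sketch of how such a gauge vector could be declared with prometheus/client_golang. The metric name and label set are taken from the sample output above and from the WithLabelValues calls quoted later in the review; the PR's actual metrics.go may differ.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// FleetUpdateRunStatusLastTimestampSeconds records the last time each updateRun
// condition/status/reason combination was observed. Label order is assumed to match
// the WithLabelValues(name, generation, condition, status, reason) calls.
var FleetUpdateRunStatusLastTimestampSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "fleet_workload_update_run_status_last_timestamp_seconds",
	Help: "Timestamp (in seconds) of the last observed updateRun status condition.",
}, []string{"name", "generation", "condition", "status", "reason"})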

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

@kaito-pr-agent

kaito-pr-agent bot commented Apr 7, 2025

Failed to generate code suggestions for PR

@jwtty jwtty force-pushed the updaterun-metrics branch from 41a344e to c2efbf4 on April 7, 2025 18:33
@kaito-pr-agent

kaito-pr-agent bot commented Apr 7, 2025

Title

(Description updated until commit 2ab2781)

feat: add updaterun metrics and update Progressing condition


Description

  • Added updaterun metrics to track the status of update runs.

  • Implemented logic to emit metrics based on the status conditions of update runs.

  • Created integration tests to validate the emission and removal of update run status metrics.

  • Updated condition reasons to include new statuses such as stuck and waiting.

  • Refactored code into separate modules for better organization and maintainability.


Changes walkthrough 📝

Relevant files

Enhancement (5 files)
  • controller.go: Add updaterun status metric emission logic (+41/-1)
  • execution.go: Implement logic to mark update run as stuck or waiting (+36/-0)
  • condition.go: Add new condition reasons for stuck and waiting states (+6/-0)
  • metrics.go: Add new metric for update run status timestamps (+8/-0)
  • metrics.go: Add utility functions for comparing Prometheus metrics (+29/-0)

Tests (2 files)
  • controller_integration_test.go: Add integration tests for update run status metrics (+126/-0)
  • execution_integration_test.go: Add tests to validate update run status metrics emission (+123/-0)

Bug fix (2 files)
  • initialization_integration_test.go: Ensure initialization failure metrics are emitted (+50/-0)
  • validation_integration_test.go: Ensure validation failure metrics are emitted (+31/-0)

Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
    @kaito-pr-agent

    kaito-pr-agent bot commented Apr 7, 2025

    PR Reviewer Guide 🔍

    (Review updated until commit 2ab2781)

    Here are some key observations to aid the review process:

    🎫 Ticket compliance analysis 🔶

    500 - Partially compliant

    Compliant requirements:

    • Add updaterun metrics
    • Update Progressing condition

    Non-compliant requirements:

    • None

    Requires further human verification:

    • None
     Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Potential Bug

    The emitUpdateRunStatusMetric function may not correctly handle cases where multiple conditions are simultaneously true, leading to incorrect metric emission.

    	}
    	// enqueue to the updaterun controller queue.
    	q.Add(reconcile.Request{
    		NamespacedName: types.NamespacedName{Name: updateRun},
    	})
    }
    
    // emitUpdateRunStatusMetric emits the update run status metric based on status conditions in the updateRun.
    func emitUpdateRunStatusMetric(updateRun *placementv1beta1.ClusterStagedUpdateRun) {
    	generation := updateRun.Generation
    	genStr := strconv.FormatInt(generation, 10)
    
    	succeedCond := meta.FindStatusCondition(updateRun.Status.Conditions, string(placementv1beta1.StagedUpdateRunConditionSucceeded))
    	if condition.IsConditionStatusTrue(succeedCond, generation) || condition.IsConditionStatusFalse(succeedCond, generation) {
    		metrics.FleetUpdateRunStatusLastTimestampSeconds.WithLabelValues(updateRun.Name, genStr,
    			string(placementv1beta1.StagedUpdateRunConditionSucceeded), string(succeedCond.Status), succeedCond.Reason).SetToCurrentTime()
    		return
    	}
    
    	progressingCond := meta.FindStatusCondition(updateRun.Status.Conditions, string(placementv1beta1.StagedUpdateRunConditionProgressing))
    	if condition.IsConditionStatusTrue(progressingCond, generation) || condition.IsConditionStatusFalse(progressingCond, generation) {
    		metrics.FleetUpdateRunStatusLastTimestampSeconds.WithLabelValues(updateRun.Name, genStr,
    			string(placementv1beta1.StagedUpdateRunConditionProgressing), string(progressingCond.Status), progressingCond.Reason).SetToCurrentTime()
    		return
    	}
    
    	initializedCond := meta.FindStatusCondition(updateRun.Status.Conditions, string(placementv1beta1.StagedUpdateRunConditionInitialized))
    	if condition.IsConditionStatusTrue(initializedCond, generation) || condition.IsConditionStatusFalse(initializedCond, generation) {
    		metrics.FleetUpdateRunStatusLastTimestampSeconds.WithLabelValues(updateRun.Name, genStr,
    			string(placementv1beta1.StagedUpdateRunConditionInitialized), string(initializedCond.Status), initializedCond.Reason).SetToCurrentTime()
    		return
    	}
    
    	// We should rarely reach here, it can only happen when updating updateRun status fails.
    	klog.V(2).InfoS("There's no valid status condition on updateRun, status updating failed possibly", "updateRun", klog.KObj(updateRun))
    }
    Potential Bug

    The validateUpdateRunMetricsEmitted function may not accurately verify the expected metrics if the order of metric emission varies slightly.

    			validateApprovalRequestCount(ctx, 1)
    		})
    
    	})
    })
    
    func initializeUpdateRunMetricsRegistry() *prometheus.Registry {
    	// Create a test registry
    	customRegistry := prometheus.NewRegistry()
    	Expect(customRegistry.Register(metrics.FleetUpdateRunStatusLastTimestampSeconds)).Should(Succeed())
    	// Reset metrics before each test
    	metrics.FleetUpdateRunStatusLastTimestampSeconds.Reset()
    	return customRegistry
    }
    
    func unregisterUpdateRunMetrics(registry *prometheus.Registry) {
    	Expect(registry.Unregister(metrics.FleetUpdateRunStatusLastTimestampSeconds)).Should(BeTrue())
    }
    
    // validateUpdateRunMetricsEmitted validates the update run status metrics are emitted and are emitted in the correct order.
    func validateUpdateRunMetricsEmitted(registry *prometheus.Registry, wantMetrics ...*prometheusclientmodel.Metric) {
    	Eventually(func() error {
    		metricFamilies, err := registry.Gather()
    		if err != nil {
    			return fmt.Errorf("failed to gather metrics: %w", err)
    		}
    		var gotMetrics []*prometheusclientmodel.Metric
    		for _, mf := range metricFamilies {
    			if mf.GetName() == "fleet_workload_update_run_status_last_timestamp_seconds" {
    				gotMetrics = mf.GetMetric()
    			}
    		}
    
    		if diff := cmp.Diff(gotMetrics, wantMetrics, metricsutils.MetricsCmpOptions...); diff != "" {
    			return fmt.Errorf("update run status metrics mismatch (-got, +want):\n%s", diff)
    		}
    
    		return nil
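
    For illustration, one wantMetrics entry for this helper could be assembled from prometheus client_model types roughly as below. This is a sketch with values copied from the sample output in the PR description; it assumes the test file's prometheusclientmodel alias for github.com/prometheus/client_model/go plus a proto helper (e.g. google.golang.org/protobuf/proto) for pointer construction, and it is not the PR's actual test code.

    wantProgressing := &prometheusclientmodel.Metric{
    	// Label names mirror the fleet_workload_update_run_status_last_timestamp_seconds sample output.
    	Label: []*prometheusclientmodel.LabelPair{
    		{Name: proto.String("condition"), Value: proto.String("Progressing")},
    		{Name: proto.String("generation"), Value: proto.String("1")},
    		{Name: proto.String("name"), Value: proto.String("example-run")},
    		{Name: proto.String("reason"), Value: proto.String("UpdateRunStarted")},
    		{Name: proto.String("status"), Value: proto.String("True")},
    	},
    	// The gauge value is a SetToCurrentTime timestamp, so the comparison options are
    	// expected to ignore (or tolerance-compare) it rather than match it exactly.
    	Gauge: &prometheusclientmodel.Gauge{Value: proto.Float64(0)},
    }
    validateUpdateRunMetricsEmitted(registry, wantProgressing)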

    @kaito-pr-agent

    kaito-pr-agent bot commented Apr 9, 2025

    Persistent review updated to latest commit 24feb17

    @jwtty jwtty force-pushed the updaterun-metrics branch from 24feb17 to 87adbf7 on April 9, 2025 00:36
    @kaito-pr-agent

    kaito-pr-agent bot commented Apr 9, 2025

    Persistent review updated to latest commit 87adbf7

    @jwtty jwtty force-pushed the updaterun-metrics branch from 87adbf7 to 04012b1 on April 9, 2025 03:45
    @kaito-pr-agent

    kaito-pr-agent bot commented Apr 9, 2025

    Persistent review updated to latest commit 04012b1

    @kaito-pr-agent

    Persistent review updated to latest commit 5d3a27a

    @jwtty jwtty changed the title from "feat: add updaterun metrics" to "feat: add updaterun metrics and update Progressing condition" on Apr 10, 2025
    @jwtty jwtty force-pushed the updaterun-metrics branch from 145a4c0 to 9daeb77 on April 11, 2025 00:43
    @kaito-pr-agent

    Persistent review updated to latest commit 9daeb77


    @kaito-pr-agent

    Persistent review updated to latest commit 934db55

    }

    // We should not reach here, as reconcile should NOT return when the updateRun is still initializing or initialized but not progressing.
    klog.V(2).ErrorS(controller.NewUnexpectedBehaviorError(fmt.Errorf("updateRun does not have valid conditions when emitting updateRun status metric")),
    Contributor

    It is possible, for example, that init has failed but the controller fails to update the condition because of an API server error. In this case, we need to emit unknown.

    Contributor Author

    Yeah, this can happen when the status update fails. I do not want to emit an unknown metric, as it could be confused with an actual Unknown status on the updateRun. I want to log first, since status update failures should be retried.

    @kaito-pr-agent

    Persistent review updated to latest commit 86451f9

    @jwtty jwtty force-pushed the updaterun-metrics branch from 86451f9 to 997f183 on April 12, 2025 00:00
    @jwtty jwtty force-pushed the updaterun-metrics branch from 997f183 to a11c447 on April 12, 2025 00:00
    @kaito-pr-agent

    Persistent review updated to latest commit a11c447


    @jwtty jwtty mentioned this pull request Apr 14, 2025
    @kaito-pr-agent

    Persistent review updated to latest commit 10783fa

    @kaito-pr-agent

    Persistent review updated to latest commit f48e547

    @jwtty jwtty force-pushed the updaterun-metrics branch from f48e547 to 2ab2781 on April 14, 2025 17:55
    @kaito-pr-agent

    Persistent review updated to latest commit 2ab2781

    	continue
    } else {
    	// If cluster update has been running for more than 1 minute, mark the update run as stuck.
    	timeElapsed := time.Since(clusterStartedCond.LastTransitionTime.Time)
    Contributor

    checkClusterUpdateResult waits until all the resources become available, which depends on the "timeToReady" setting chosen by the user. Therefore, updateRunStuckThreshold has to be the greater of 60 seconds and that value.

    Contributor Author

    I realized one thing: "timeToReady" is rolling-update specific. It's under RollingUpdateConfig and cannot be set when the rollout type is set to External. The wait is done in the rollingUpdate controller, not in the workApplier. For untrackable resources, the workApplier simply marks them as Available with the reason set to "Availability not trackable".

    Contributor

    Good point. Does it mean that we need to add this to the update run API too, or create some wait between clusters when there are untrackable resources?

    Contributor
    @zhiying-lin zhiying-lin Apr 15, 2025

    I noticed this problem when reviewing the code. Should we keep the default value at 1 min to stay consistent with the rollingUpdate config? Later, we can support a customized value.

    Contributor Author

    Yeah, it looks like we need to do a similar thing in updateRun. That will be a new feature.

    stageUpdatingWaitTime = 60 * time.Second

    // updateRunStuckThreshold is the time to wait on a single cluster update before marking update run as stuck.
    updateRunStuckThreshold = 60 * time.Second
    Contributor

    This seems a bit too aggressive; a normal deployment with many replicas can take a few minutes to get ready.

    Contributor Author

    Let me set it to 5 minutes for now. It will not cause the updateRun to stop; it just marks it in the status so users can investigate the CRP and see if there's an issue. If the issue is fixed or recovers automatically, the updateRun can continue and the status is removed. We also add metrics to track this so that we can get a better understanding and tweak it.
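
    For reference, the adjustment described here amounts to relaxing the constant quoted above, roughly as follows (a sketch of the discussed value, not necessarily the final code):

    // updateRunStuckThreshold is the time to wait on a single cluster update before
    // marking the update run as stuck; relaxed from 1 minute to 5 minutes per this thread.
    updateRunStuckThreshold = 5 * time.Minute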

    @michaelawyu
    Contributor

    Hi Wantong! I am closing this PR as part of the CNCF repo migration process; please consider moving (re-creating) this PR in the new repo once the sync PR is merged. If there's any question/concern, please let me know. Thanks 🙏

    @jwtty jwtty deleted the updaterun-metrics branch August 27, 2025 22:44