feat: add updaterun metrics and update Progressing condition #1107
jwtty wants to merge 10 commits into Azure:main
Conversation
```go
	}

	// We should not reach here, as reconcile should NOT return when the updateRun is still initializing or initialized but not progressing.
	klog.V(2).ErrorS(controller.NewUnexpectedBehaviorError(fmt.Errorf("updateRun does not have valid conditions when emitting updateRun status metric")),
```
It is possible, for example, that init has failed but the controller fails to update the condition because of an api server error. In this case, we need to emit unknown.
Yeah, this can happen when a status update fails. I do not want to emit an unknown metric, as that could be confused with an actual Unknown status on the updateRun. I want to log first, since status update failures should be retried.
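The decision above can be sketched as follows. This is a minimal, self-contained illustration of the guard being discussed, not the PR's actual code: the type and function names (`updateRunStatus`, `emitStatusMetric`) and the label values are hypothetical stand-ins for the real API conditions.

```go
package main

import (
	"errors"
	"fmt"
)

// updateRunStatus is a hypothetical, simplified view of the updateRun conditions.
type updateRunStatus struct {
	initialized bool
	progressing bool
	succeeded   bool
	failed      bool
}

// emitStatusMetric returns the metric label to emit, or an error when the
// conditions are in an unexpected state. Per the discussion above, the
// unexpected case (e.g. a failed status update) is surfaced as an error to be
// logged and retried, rather than emitted as "unknown", so it cannot be
// confused with a genuine Unknown status on the updateRun.
func emitStatusMetric(s updateRunStatus) (string, error) {
	switch {
	case s.succeeded:
		return "succeeded", nil
	case s.failed:
		return "failed", nil
	case s.initialized && s.progressing:
		return "progressing", nil
	default:
		// Reachable when a condition update failed (e.g. api server error);
		// log and rely on the reconcile retry instead of emitting a metric.
		return "", errors.New("updateRun does not have valid conditions when emitting updateRun status metric")
	}
}

func main() {
	label, err := emitStatusMetric(updateRunStatus{initialized: true, progressing: true})
	fmt.Println(label, err) // progressing <nil>
	_, err = emitStatusMetric(updateRunStatus{})
	fmt.Println(err != nil) // true
}
```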
```go
		continue
	} else {
		// If cluster update has been running for more than 1 minute, mark the update run as stuck.
		timeElapsed := time.Since(clusterStartedCond.LastTransitionTime.Time)
```
checkClusterUpdateResult waits until all the resources become available, which depends on the "timeToReady" setting from the user. Therefore, `updateRunStuckThreshold` has to be the greater of 60 seconds or that value.
I realized one thing: "timeToReady" is rolling-update specific. It's under RollingUpdateConfig and cannot be set when the rollout type is set to External. The wait is done in the rollingUpdate controller, not in the workApplier. For untrackable resources, the workApplier simply sets them as Available with the reason "Availability not trackable".
Good point. Does that mean we need to add this to the update run API too, or create some wait between clusters when there are untrackable resources?
I noticed this problem when reviewing the code. Should we keep the default value at 1 minute to stay consistent with the rollingUpdate config? Later, we can support a customized value.
Yeah, it looks like we need to do a similar thing in updateRun. That will be a new feature.
```go
	stageUpdatingWaitTime = 60 * time.Second

	// updateRunStuckThreshold is the time to wait on a single cluster update before marking update run as stuck.
	updateRunStuckThreshold = 60 * time.Second
```
This seems a bit too aggressive; a normal deployment with many replicas can take a few minutes to get ready.
Let me set it to 5 minutes for now. It will not cause the updateRun to stop; it just marks the status so users can investigate the CRP and see if there's an issue. If the issue is fixed or recovers automatically, the updateRun can continue and the status is removed. We also add metrics to track this so that we can better understand and tweak it.
|
Hi Wantong! I am closing this PR as part of the CNCF repo migration process; please consider moving (re-creating) this PR in the new repo once the sync PR is merged. If there's any question/concern, please let me know. Thanks 🙏 |
Description of your changes
Generated metrics look like:
Fixes #
I have:
- Run `make reviewable` to ensure this PR is ready for review.

How has this code been tested
Special notes for your reviewer