Conversation
f0ded74 to
fc22495
Compare
c25f067 to
42093bd
Compare
zhiying-lin
left a comment
There was a problem hiding this comment.
Should we emit the last condition if the crp is not completed? since your metrics type is gauge. WE only need to know the current status.
d7683e2 to
6f85e45
Compare
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
6f85e45 to
8486603
Compare
Title(Describe updated until commit d3e408d)feat: Emit detailed metrics for workload placements Description
Changes walkthrough 📝
|
PR Reviewer Guide 🔍(Review updated until commit d3e408d)Here are some key observations to aid the review process:
|
|
Persistent review updated to latest commit 8a89ce8 |
|
Persistent review updated to latest commit 0200684 |
|
Persistent review updated to latest commit c6a8490 |
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
|
Persistent review updated to latest commit 222867e |
|
Persistent review updated to latest commit c74a770 |
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
|
Persistent review updated to latest commit 4ed96c6 |
|
Persistent review updated to latest commit d446d00 |
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
|
Persistent review updated to latest commit fef4bb6 |
|
Persistent review updated to latest commit 07ffce0 |
|
Persistent review updated to latest commit 1cec5f4 |
|
Persistent review updated to latest commit 9bbdf16 |
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
| metrics.FleetPlacementStatusLastTimeStampSeconds.WithLabelValues(crp.Name, strconv.FormatInt(crp.Generation, 10), string(condType.ClusterResourcePlacementConditionType()), status).SetToCurrentTime() | ||
| return | ||
| } | ||
| } |
There was a problem hiding this comment.
There are two options:
- two metrics, like the current one, completedMetrics + incompletedPlacementStatus metrics
- one metrics, PlacementStatus metrics to record the last condition of CPR. isCompleted = ReportDiff == true || available condition true
Personally i prefer option 2
There was a problem hiding this comment.
I personally prefer option 1. The completed metrics makes it a lot easier to query whether the CRP is completed or not for dashboards. If in the future, not sure if we will, we add more final conditions (like Available and ReportDiff) then we will have to continue changing the alert to include these cases. The second metric was just to provide more information in the alert, while the first helps with dashboard and easier to filter out through the logs.
There was a problem hiding this comment.
The current status metric never has a "good" status. How are we going to use this metric? I feel this metric is not independent: we always need to use it along with the completed metric, we always need to verify if crp is still incomplete first and then refer to this metric to see why it's still incomplete.
There was a problem hiding this comment.
agree with Wantong, they seem coupled, not sure how to join them but with different timestamp.
There was a problem hiding this comment.
Talked to Ryan. We will couple them.
| gotCRP = retrieveAndValidateClusterResourcePlacement(testCRPName, wantCRP) | ||
|
|
||
| By("Ensure placement complete metric was emitted with isCompleted True") | ||
| checkPlacementCompleteMetric(customRegistry, testCRPName, true, 2) |
There was a problem hiding this comment.
we also need to validate the crp status metrics
There was a problem hiding this comment.
For a complete crp, the test is very flaky on figuring out what statuses will be emitted depending on how fast the conditions update since we don't have all the controllers working properly in the integration test. Originally, we wanted the placementstatus to record the last status for incomplete crp to see where they stopped to include in the alert.
There was a problem hiding this comment.
i see, could you please add a comment there to clarify why we don't validate the status metrics here.
pkg/controllers/clusterresourceplacement/controller_integration_test.go
Outdated
Show resolved
Hide resolved
|
Persistent review updated to latest commit 43250ec |
43250ec to
d3e408d
Compare
|
Persistent review updated to latest commit d3e408d |
| if a.Name == nil || b.Name == nil { | ||
| return a.Name == nil | ||
| } | ||
| return *a.Name < *b.Name // Sort by label |
There was a problem hiding this comment.
We could use a.GetName() < b.GetName() to avoid nil comparison. I did in #1107
There was a problem hiding this comment.
is this function a dup?
There was a problem hiding this comment.
Yea, I just don't know whether this PR or #1107 will merge first. The latter one can always reuse this util.
| metrics.FleetPlacementStatusLastTimeStampSeconds.WithLabelValues(crp.Name, strconv.FormatInt(crp.Generation, 10), string(condType.ClusterResourcePlacementConditionType()), status).SetToCurrentTime() | ||
| return | ||
| } | ||
| } |
There was a problem hiding this comment.
The current status metric never has a "good" status. How are we going to use this metric? I feel this metric is not independent: we always need to use it along with the completed metric, we always need to verify if crp is still incomplete first and then refer to this metric to see why it's still incomplete.
| klog.ErrorS(updateErr, "Failed to update the status", "clusterResourcePlacement", crpKObj) | ||
| return ctrl.Result{}, controller.NewUpdateIgnoreConflictError(updateErr) | ||
| } | ||
| metrics.FleetPlacementStatusLastTimeStampSeconds.DeletePartialMatch(prometheus.Labels{"name": crp.Name}) |
There was a problem hiding this comment.
Here you can call emitPlacementStatusMetric too.
| r.Recorder.Event(crp, corev1.EventTypeNormal, i.EventReasonForTrue(), i.EventMessageForTrue()) | ||
| } | ||
| } | ||
| emitPlacementStatusMetric(crp) |
There was a problem hiding this comment.
I think it's better to call this metrics emit function in a defer function because we have so many returns in handleUpdate. We may easily miss cases if we call this function ad-hoc.
| retrieveAndValidateClusterResourcePlacement(testCRPName, wantCRP) | ||
|
|
||
| By("Ensure placement complete metric was emitted with isCompleted False") | ||
| wantCompleteMetrics := []*prometheusclientmodel.Metric{ |
There was a problem hiding this comment.
This incomplete metric can be refactored into a function. There's code duplication.
| } | ||
| Expect(k8sClient.Status().Update(ctx, gotPolicySnapshot)).Should(Succeed(), "Failed to update the policy snapshot status") | ||
|
|
||
| By("By creating clusterResourceBinding on member-1") |
There was a problem hiding this comment.
| By("By creating clusterResourceBinding on member-1") | |
| By("Create clusterResourceBinding on member-1") |
There was a problem hiding this comment.
I actually prefer using "Creating", but I see none of the other code use "ing".
| By("By creating clusterResourceBinding on member-1") | ||
| member1Binding = createOverriddenClusterResourceBinding(member1Name, gotPolicySnapshot, gotResourceSnapshot) | ||
|
|
||
| By("By validating the CRP status") |
There was a problem hiding this comment.
| By("By validating the CRP status") | |
| By("Validate the CRP status") |
| } | ||
| checkPlacementStatusMetric(customRegistry, wantStatusMetrics) | ||
|
|
||
| By("By creating a synchronized clusterResourceBinding on member-2") |
There was a problem hiding this comment.
| By("By creating a synchronized clusterResourceBinding on member-2") | |
| By("Create a synchronized clusterResourceBinding on member-2") |
| Expect(k8sClient.Update(ctx, crp)).Should(Succeed(), "Failed to update crp") | ||
| Expect(k8sClient.Get(ctx, types.NamespacedName{Name: testCRPName}, crp)).Should(BeNil(), "Get() clusterResourcePlacement mismatch") | ||
|
|
||
| By("By validating the CRP status with new spec") |
There was a problem hiding this comment.
| By("By validating the CRP status with new spec") | |
| By("Validate the CRP status with new spec") |
| Spec: crp.Spec, | ||
| Status: placementv1beta1.ClusterResourcePlacementStatus{ | ||
| ObservedResourceIndex: "0", | ||
| Conditions: []metav1.Condition{ |
There was a problem hiding this comment.
I think this piece of code can also be refactored.
| if a.Name == nil || b.Name == nil { | ||
| return a.Name == nil | ||
| } | ||
| return *a.Name < *b.Name // Sort by label |
There was a problem hiding this comment.
is this function a dup?
| metrics.FleetPlacementCompleteLastTimeStampSeconds.DeletePartialMatch(prometheus.Labels{"name": crp.Name}) | ||
| metrics.FleetPlacementStatusLastTimeStampSeconds.DeletePartialMatch(prometheus.Labels{"name": crp.Name}) |
There was a problem hiding this comment.
I wonder why move the delete earlier instead of replacing them at line 105?
There was a problem hiding this comment.
I asked to move these earlier. If the pod stops when finalizer is removed but before metrics are removed, they would be left never deleted.
|
Hi Britania! I am closing this PR as part of the CNCF repo migration process; please consider moving (re-creating) this PR in the new repo once the sync PR is merged. If there's any question/concern, please let me know. Thanks 🙏 |
Description of your changes
Fixes #
I have: added a metric to emit crp condition status' along with a test.
make reviewableto ensure this PR is ready for review.How has this code been tested
Special notes for your reviewer