37 changes: 34 additions & 3 deletions pkg/controllers/clusterresourceplacement/controller.go
@@ -94,14 +94,15 @@ func (r *Reconciler) handleDelete(ctx context.Context, crp *fleetv1beta1.Cluster
return ctrl.Result{}, err
}

metrics.FleetPlacementCompleteLastTimeStampSeconds.DeletePartialMatch(prometheus.Labels{"name": crp.Name})
metrics.FleetPlacementStatusLastTimeStampSeconds.DeletePartialMatch(prometheus.Labels{"name": crp.Name})
Comment on lines +97 to +98

Contributor:

I wonder why the deletes were moved earlier instead of replacing them at line 105?

Contributor:

I asked to move these earlier. If the pod stops after the finalizer is removed but before the metrics are deleted, the metric series would be left behind and never cleaned up.
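The ordering rationale can be sketched outside the controller. This is a minimal illustration, not the fleet codebase: `cleanupOnDelete`, `deleteMetrics`, and `removeFinalizer` are hypothetical stand-ins for the metric `DeletePartialMatch` calls and the finalizer removal in `handleDelete`.

```go
package main

import "fmt"

// cleanupOnDelete sketches the ordering argued for in the comment above:
// metric series are deleted first, and the finalizer is removed only once
// that succeeds. If the process dies between the two steps, the finalizer
// is still on the object, so the next reconcile re-runs the metric cleanup
// instead of leaving orphaned series behind.
func cleanupOnDelete(deleteMetrics, removeFinalizer func() error) error {
	if err := deleteMetrics(); err != nil {
		return err // finalizer untouched; the whole cleanup is retried
	}
	return removeFinalizer()
}

func main() {
	var order []string
	err := cleanupOnDelete(
		func() error { order = append(order, "metrics"); return nil },
		func() error { order = append(order, "finalizer"); return nil },
	)
	fmt.Println(order, err)
}
```

With the reversed order (finalizer first), a crash after the `Update` call would leave the object deletable while its metric series stay registered forever.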

controllerutil.RemoveFinalizer(crp, fleetv1beta1.ClusterResourcePlacementCleanupFinalizer)
if err := r.Client.Update(ctx, crp); err != nil {
klog.ErrorS(err, "Failed to remove crp finalizer", "clusterResourcePlacement", crpKObj)
return ctrl.Result{}, err
}
klog.V(2).InfoS("Removed crp-cleanup finalizer", "clusterResourcePlacement", crpKObj)
r.Recorder.Event(crp, corev1.EventTypeNormal, "PlacementCleanupFinalizerRemoved", "Deleted the snapshots and removed the placement cleanup finalizer")
metrics.FleetPlacementStatus.Delete(prometheus.Labels{"name": crp.Name})
return ctrl.Result{}, nil
}

@@ -177,6 +178,8 @@ func (r *Reconciler) handleUpdate(ctx context.Context, crp *fleetv1beta1.Cluster
klog.ErrorS(updateErr, "Failed to update the status", "clusterResourcePlacement", crpKObj)
return ctrl.Result{}, controller.NewUpdateIgnoreConflictError(updateErr)
}
metrics.FleetPlacementStatusLastTimeStampSeconds.DeletePartialMatch(prometheus.Labels{"name": crp.Name})
Contributor:

Here you can call emitPlacementStatusMetric too.

metrics.FleetPlacementStatusLastTimeStampSeconds.WithLabelValues(crp.Name, strconv.FormatInt(crp.Generation, 10), scheduleCondition.Type, string(scheduleCondition.Status)).SetToCurrentTime()
return ctrl.Result{}, err
}

@@ -214,6 +217,7 @@ func (r *Reconciler) handleUpdate(ctx context.Context, crp *fleetv1beta1.Cluster
r.Recorder.Event(crp, corev1.EventTypeNormal, i.EventReasonForTrue(), i.EventMessageForTrue())
}
}
emitPlacementStatusMetric(crp)
Contributor:

I think it's better to call this metric-emitting function in a defer, because handleUpdate has so many returns. We may easily miss cases if we call it ad hoc.
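A minimal sketch of the defer suggestion, using stub names rather than the actual controller code (`emitStatus` stands in for `emitPlacementStatusMetric`, and the boolean parameters compress the real condition checks):

```go
package main

import "fmt"

// emitStatus stands in for emitPlacementStatusMetric; the real function
// records the CRP's current condition to a Prometheus gauge.
func emitStatus(name string) { fmt.Println("status emitted for", name) }

// handleUpdate mimics the controller's shape: several return paths, one
// deferred metric emission. The defer fires on every return (and even on
// a panic unwinding through the function), so a newly added early return
// cannot silently skip the metric.
func handleUpdate(name string, scheduled, rolledOut bool) error {
	defer emitStatus(name)
	if !scheduled {
		return fmt.Errorf("placement %s is not scheduled", name)
	}
	if !rolledOut {
		return nil // rollout still in progress; metric emitted anyway
	}
	return nil
}

func main() {
	_ = handleUpdate("crp-a", true, true)
	_ = handleUpdate("crp-b", false, false)
}
```

Since the deferred function reads the CRP's conditions at return time, it observes whatever status the reconcile pass ended with, which is what a status metric wants.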


// Rollout is considered to be completed when all the expected condition types are set to the
// True status.
@@ -222,12 +226,13 @@ func (r *Reconciler) handleUpdate(ctx context.Context, crp *fleetv1beta1.Cluster
klog.V(2).InfoS("Placement has finished the rollout process and reached the desired status", "clusterResourcePlacement", crpKObj, "generation", crp.Generation)
r.Recorder.Event(crp, corev1.EventTypeNormal, "PlacementRolloutCompleted", "Placement has finished the rollout process and reached the desired status")
}
metrics.FleetPlacementStatus.WithLabelValues(crp.Name).Set(1)
metrics.FleetPlacementCompleteLastTimeStampSeconds.WithLabelValues(crp.Name, "true").SetToCurrentTime()
// We don't need to requeue any request now by watching the binding changes
return ctrl.Result{}, nil
}

metrics.FleetPlacementStatus.WithLabelValues(crp.Name).Set(0)
metrics.FleetPlacementCompleteLastTimeStampSeconds.WithLabelValues(crp.Name, "false").SetToCurrentTime()

if !isClusterScheduled {
// Note:
// If the scheduledCondition is failed, it means the placement requirement cannot be satisfied fully. For example,
@@ -1035,3 +1040,29 @@ func isRolloutCompleted(crp *fleetv1beta1.ClusterResourcePlacement) bool {
func isCRPScheduled(crp *fleetv1beta1.ClusterResourcePlacement) bool {
return condition.IsConditionStatusTrue(crp.GetCondition(string(fleetv1beta1.ClusterResourcePlacementScheduledConditionType)), crp.Generation)
}

func emitPlacementStatusMetric(crp *fleetv1beta1.ClusterResourcePlacement) {
// Check CRP Scheduled condition
status := "nil"
cond := crp.GetCondition(string(fleetv1beta1.ClusterResourcePlacementScheduledConditionType))
if !condition.IsConditionStatusTrue(cond, crp.Generation) {
if cond != nil {
status = string(cond.Status)
}
metrics.FleetPlacementStatusLastTimeStampSeconds.WithLabelValues(crp.Name, strconv.FormatInt(crp.Generation, 10), string(fleetv1beta1.ClusterResourcePlacementScheduledConditionType), status).SetToCurrentTime()
return
}

// Check CRP expected conditions
expectedCondTypes := determineExpectedCRPAndResourcePlacementStatusCondType(crp)
for _, condType := range expectedCondTypes {
cond = crp.GetCondition(string(condType.ClusterResourcePlacementConditionType()))
if !condition.IsConditionStatusTrue(cond, crp.Generation) {
if cond != nil {
status = string(cond.Status)
}
metrics.FleetPlacementStatusLastTimeStampSeconds.WithLabelValues(crp.Name, strconv.FormatInt(crp.Generation, 10), string(condType.ClusterResourcePlacementConditionType()), status).SetToCurrentTime()
return
}
}
Contributor (@zhiying-lin), Apr 11, 2025:

There are two options:

  1. Two metrics, like the current one: a completed metric plus an incomplete PlacementStatus metric.
  2. One metric: a PlacementStatus metric that records the last condition of the CRP, with isCompleted = (ReportDiff == true || Available condition true).

Personally, I prefer option 2.

Contributor Author:

I personally prefer option 1. The completed metric makes it a lot easier to query whether the CRP is completed for dashboards. If in the future (not sure if we will) we add more final conditions, like Available and ReportDiff, we would have to keep changing the alert to include those cases. The second metric just provides more information for the alert, while the first helps with dashboards and makes it easier to filter through the logs.

Contributor:

The current status metric never has a "good" status. How are we going to use it? This metric doesn't feel independent: we would always use it together with the completed metric, first verifying that the CRP is still incomplete, then referring to this metric to see why.

Contributor:

Agree with Wantong; they seem coupled. Not sure how to join them, though, since they carry different timestamps.

Contributor Author:

Talked to Ryan. We will couple them.

}