test(e2e): check panic count #6652
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

    @@           Coverage Diff           @@
    ##             main    #6652   +/- ##
    =======================================
      Coverage   38.31%   38.31%
    =======================================
      Files         368      368
      Lines       21422    21422
    =======================================
      Hits         8207     8207
      Misses      13215    13215

Flags with carried forward coverage won't be shown.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fgksgf

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.

[LGTM Timeline notifier] Timeline:
Pull request overview
This PR adds end-to-end test validation for panic metrics in the TiDB operator. After all e2e tests complete, the test suite now checks that no panics occurred during the test run by querying the operator's Prometheus metrics endpoint.
Changes:
- New metrics utility package to fetch and parse Prometheus metrics from operator pods
- Integration of panic count check in the e2e test suite's SynchronizedAfterSuite hook
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/e2e/utils/metrics/metrics.go | New utility package that port-forwards to operator pods and parses panic_total metrics from Prometheus endpoints |
| tests/e2e/e2e.go | Adds SynchronizedAfterSuite hook to check operator panic metrics after all tests complete |
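For orientation, here is a minimal sketch of how such a hook can be wired up with Ginkgo v2. The helper name `metrics.CheckPanicMetrics`, the namespace and deployment strings, and the `clientset`/`restConfig` variables are illustrative assumptions, not the PR's exact code.

```go
package e2e

import (
	"context"
	"fmt"

	ginkgo "github.com/onsi/ginkgo/v2"
)

// Sketch only: assumes the e2e suite already exposes `clientset`, `restConfig`,
// and a metrics helper package providing CheckPanicMetrics.
var _ = ginkgo.SynchronizedAfterSuite(func() {
	// First body runs on every parallel process; nothing panic-related to do here.
}, func() {
	// Second body runs exactly once, after every parallel test process has finished.
	ctx := context.Background()
	// The namespace and deployment name below are placeholders for the operator install.
	if err := metrics.CheckPanicMetrics(ctx, clientset, restConfig, "tidb-admin", "tidb-operator"); err != nil {
		ginkgo.Fail(fmt.Sprintf("operator panic check failed: %v", err))
	}
})
```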
tests/e2e/utils/metrics/metrics.go
Outdated
    	// OperatorPodLabelSelector is the label selector for operator pods
    	OperatorPodLabelSelector = "app.kubernetes.io/component=controller"
The constant OperatorPodLabelSelector is defined but never used in the code. The implementation correctly uses the deployment's label selector instead (line 57). Consider removing this unused constant, or documenting it if it is intended for future use.
Suggested change:

    -	// OperatorPodLabelSelector is the label selector for operator pods
    -	OperatorPodLabelSelector = "app.kubernetes.io/component=controller"
tests/e2e/utils/metrics/metrics.go
Outdated
    	}

    	// If metric not found, it means no panics have occurred (metric not yet emitted)
    	return 0, fmt.Errorf("no metrics emitted")
The function returns an error "no metrics emitted" when the panic_total metric is not found. However, this is expected behavior when no panics have occurred (the metric hasn't been emitted yet). According to the comment on line 146, this should return 0 without an error. The calling code on line 76 will incorrectly fail the test when no panics have occurred but the metric hasn't been emitted yet.
Suggested change:

    -	return 0, fmt.Errorf("no metrics emitted")
    +	return 0, nil
    	// Look for tidb_operator_controller_panic_total metric
    	for name, mf := range metricFamilies {
    		if strings.Contains(name, "panic_total") {
    			for _, metric := range mf.GetMetric() {
    				if metric.Counter != nil {
    					return metric.Counter.GetValue(), nil
    				}
    				if metric.Gauge != nil {
    					return metric.Gauge.GetValue(), nil
    				}
    			}
    		}
    	}
The loop only returns the value of the first metric that contains "panic_total" in its name. If there are multiple metrics matching this pattern (e.g., from different label combinations), only the first one will be checked. Consider iterating through all matching metrics and summing their values, or documenting why only the first metric is checked.
Suggested change:

    -	// Look for tidb_operator_controller_panic_total metric
    -	for name, mf := range metricFamilies {
    -		if strings.Contains(name, "panic_total") {
    -			for _, metric := range mf.GetMetric() {
    -				if metric.Counter != nil {
    -					return metric.Counter.GetValue(), nil
    -				}
    -				if metric.Gauge != nil {
    -					return metric.Gauge.GetValue(), nil
    -				}
    -			}
    -		}
    -	}
    +	// Look for tidb_operator_controller_panic_total metric and sum all matching series.
    +	var panicTotal float64
    +	found := false
    +	for name, mf := range metricFamilies {
    +		if strings.Contains(name, "panic_total") {
    +			for _, metric := range mf.GetMetric() {
    +				if metric.Counter != nil {
    +					panicTotal += metric.Counter.GetValue()
    +					found = true
    +				} else if metric.Gauge != nil {
    +					panicTotal += metric.Gauge.GetValue()
    +					found = true
    +				}
    +			}
    +		}
    +	}
    +	if found {
    +		return panicTotal, nil
    +	}
    		return fmt.Errorf("failed to list operator pods: %w", err)
    	}
    	if len(pods.Items) == 0 {
    		return fmt.Errorf("no operator pod found in namespace %s with label %s", ns, metav1.FormatLabelSelector(d.Spec.Selector))
The error message includes the deployment selector but refers to it as "label" which is slightly misleading. Consider changing to "no operator pod found in namespace %s with selector %s" for clarity.
Suggested change:

    -	return fmt.Errorf("no operator pod found in namespace %s with label %s", ns, metav1.FormatLabelSelector(d.Spec.Selector))
    +	return fmt.Errorf("no operator pod found in namespace %s with selector %s", ns, metav1.FormatLabelSelector(d.Spec.Selector))
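For context, a hedged sketch of listing pods by reusing a Deployment's label selector, which is the approach the review describes. The identifiers `clientset`, `ns`, `deploy`, and `ctx` are assumed to match the quoted snippets and are not verified against the PR.

```go
// Sketch: find the operator pods that belong to a Deployment by converting its
// LabelSelector back into a selector string for the pod list call.
d, err := clientset.AppsV1().Deployments(ns).Get(ctx, deploy, metav1.GetOptions{})
if err != nil {
	return fmt.Errorf("failed to get deployment %s in namespace %s: %w", deploy, ns, err)
}
pods, err := clientset.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
	LabelSelector: metav1.FormatLabelSelector(d.Spec.Selector),
})
if err != nil {
	return fmt.Errorf("failed to list operator pods: %w", err)
}
if len(pods.Items) == 0 {
	return fmt.Errorf("no operator pod found in namespace %s with selector %s", ns, metav1.FormatLabelSelector(d.Spec.Selector))
}
```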
    	OperatorMetricsPort = 8080
    )

    // CheckPanicMetrics checks the operator panic metrics and returns the panic count.
The comment mentions "returns the panic count" but the function actually returns an error, not the panic count. The panic count is checked internally and an error is returned if it's greater than 0. Consider updating the comment to "checks the operator panic metrics and returns an error if any panics are detected" for accuracy.
Suggested change:

    -	// CheckPanicMetrics checks the operator panic metrics and returns the panic count.
    +	// CheckPanicMetrics checks the operator panic metrics and returns an error if any panics are detected.
    	}

    	if panicCount > 0 {
    		return fmt.Errorf("panic count %v is greater than 0", panicCount)
The error message uses %v format specifier for a float64 value. Consider using %f or %.0f for better clarity when displaying the panic count as a numeric value.
Suggested change:

    -	return fmt.Errorf("panic count %v is greater than 0", panicCount)
    +	return fmt.Errorf("panic count %.0f is greater than 0", panicCount)
    	d, err := clientset.AppsV1().Deployments(ns).Get(
    		ctx, deploy, metav1.GetOptions{})
    	if err != nil {
    		return err
The error returned when getting the deployment is not wrapped with context. Consider using fmt.Errorf with %w to provide more context about what operation failed, similar to the error handling pattern used in lines 47, 60, and other places in this function.
Suggested change:

    -		return err
    +		return fmt.Errorf("failed to get deployment %s in namespace %s: %w", deploy, ns, err)
    	metricsURL := fmt.Sprintf("http://localhost:%d/metrics", localPort)
    	req, err := http.NewRequestWithContext(ctx, "GET", metricsURL, nil)
    	if err != nil {
    		return 0, err
The error returned when creating the HTTP request is not wrapped with context. Consider using fmt.Errorf with %w to provide more information about what operation failed, consistent with the error handling pattern used elsewhere in this file.
Suggested change:

    -		return 0, err
    +		return 0, fmt.Errorf("failed to create metrics request for %s: %w", metricsURL, err)
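For reference, a minimal sketch of fetching and parsing the text-format metrics once the port-forward is in place. The use of prometheus/common/expfmt to produce `metricFamilies` is an assumption about how the helper could work, not something confirmed from the PR; `ctx` and `localPort` are taken from the quoted snippet.

```go
// Sketch: fetch Prometheus text-format metrics from the forwarded local port
// and parse them into metric families keyed by metric name.
metricsURL := fmt.Sprintf("http://localhost:%d/metrics", localPort)
req, err := http.NewRequestWithContext(ctx, http.MethodGet, metricsURL, nil)
if err != nil {
	return 0, fmt.Errorf("failed to create metrics request for %s: %w", metricsURL, err)
}
resp, err := http.DefaultClient.Do(req)
if err != nil {
	return 0, fmt.Errorf("failed to fetch metrics from %s: %w", metricsURL, err)
}
defer resp.Body.Close()

// expfmt.TextParser understands the Prometheus exposition format.
var parser expfmt.TextParser
metricFamilies, err := parser.TextToMetricFamilies(resp.Body)
if err != nil {
	return 0, fmt.Errorf("failed to parse metrics: %w", err)
}
_ = metricFamilies // the panic_total lookup shown earlier would follow here
```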
Signed-off-by: liubo02 <liubo02@pingcap.com>
122a9c8 to 7108d54
@liubog2008: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
check panic count after all e2e cases are done.