test(e2e): check panic count #6652

Merged
ti-chi-bot[bot] merged 2 commits into pingcap:main from liubog2008:liubo02/add-panic-check on Jan 14, 2026

liubog2008 (Member) commented Jan 13, 2026

Check the operator's panic count after all e2e cases are done.

@ti-chi-bot ti-chi-bot bot requested a review from shonge January 13, 2026 02:13
@github-actions github-actions bot added the v2 for operator v2 label Jan 13, 2026
@ti-chi-bot ti-chi-bot bot added the size/L label Jan 13, 2026
@fgksgf fgksgf requested a review from Copilot January 13, 2026 02:15
codecov-commenter commented Jan 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 38.31%. Comparing base (c195c68) to head (7108d54).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6652   +/-   ##
=======================================
  Coverage   38.31%   38.31%           
=======================================
  Files         368      368           
  Lines       21422    21422           
=======================================
  Hits         8207     8207           
  Misses      13215    13215           
Flag       Coverage Δ
unittest   38.31% <ø> (ø)

fgksgf (Member) commented Jan 13, 2026

/lgtm

ti-chi-bot bot (Contributor) commented Jan 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fgksgf

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot bot (Contributor) commented Jan 13, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-01-13 02:20:52.32660759 +0000 UTC m=+324096.388472499: ☑️ agreed by fgksgf.

Copilot AI left a comment

Pull request overview

This PR adds end-to-end test validation for panic metrics in the TiDB operator. After all e2e tests complete, the test suite now checks that no panics occurred during the test run by querying the operator's Prometheus metrics endpoint.

Changes:

  • New metrics utility package to fetch and parse Prometheus metrics from operator pods
  • Integration of panic count check in the e2e test suite's SynchronizedAfterSuite hook

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

Files:
  • tests/e2e/utils/metrics/metrics.go: New utility package that port-forwards to operator pods and parses panic_total metrics from Prometheus endpoints
  • tests/e2e/e2e.go: Adds SynchronizedAfterSuite hook to check operator panic metrics after all tests complete
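
As a rough illustration of the check described in this overview, the sketch below reads Prometheus text-format output and fails when any panic_total sample is above zero. It is a standard-library approximation, not the code in metrics.go: the PR itself port-forwards to the operator pod and reads parsed metric families, whereas here the scrape is passed in as a string. The helper names (sumPanicTotal, checkPanics, checkText) and the controller label are assumptions made for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// sumPanicTotal scans Prometheus text-format output and adds up every sample
// whose metric name contains "panic_total". An absent metric yields 0.
// Simplified sketch: it takes the first field after the last '}' as the value.
func sumPanicTotal(r io.Reader) (float64, error) {
	var total float64
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name runs up to the label block or the first whitespace.
		name := line
		if i := strings.IndexAny(line, "{ \t"); i >= 0 {
			name = line[:i]
		}
		if !strings.Contains(name, "panic_total") {
			continue
		}
		rest := line[len(name):]
		if j := strings.LastIndex(rest, "}"); j >= 0 {
			rest = rest[j+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return 0, fmt.Errorf("malformed sample line %q", line)
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("failed to parse value in %q: %w", line, err)
		}
		total += v
	}
	if err := sc.Err(); err != nil {
		return 0, fmt.Errorf("failed to read metrics: %w", err)
	}
	return total, nil
}

// checkPanics returns an error if the scraped metrics report any panic.
func checkPanics(r io.Reader) error {
	count, err := sumPanicTotal(r)
	if err != nil {
		return err
	}
	if count > 0 {
		return fmt.Errorf("panic count %.0f is greater than 0", count)
	}
	return nil
}

// checkText is a convenience wrapper for scrape output held in a string.
func checkText(body string) error {
	return checkPanics(strings.NewReader(body))
}

func main() {
	clean := "# TYPE workqueue_depth gauge\nworkqueue_depth 0\n"
	panicked := "# TYPE tidb_operator_controller_panic_total counter\n" +
		"tidb_operator_controller_panic_total{controller=\"pd\"} 2\n" +
		"tidb_operator_controller_panic_total{controller=\"tikv\"} 1\n"
	fmt.Println("clean scrape:", checkText(clean))
	fmt.Println("panicked scrape:", checkText(panicked))
}
```

In a suite hook, the string would presumably be replaced by the HTTP response body from the port-forwarded /metrics endpoint.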

Comment on lines 34 to 35
    // OperatorPodLabelSelector is the label selector for operator pods
    OperatorPodLabelSelector = "app.kubernetes.io/component=controller"

Copilot AI Jan 13, 2026

The constant OperatorPodLabelSelector is defined but never used in the code. The implementation correctly uses the deployment's label selector instead (line 57). Consider removing this unused constant or document if it's intended for future use.

Suggested change
- // OperatorPodLabelSelector is the label selector for operator pods
- OperatorPodLabelSelector = "app.kubernetes.io/component=controller"

Copilot uses AI. Check for mistakes.
}

// If metric not found, it means no panics have occurred (metric not yet emitted)
return 0, fmt.Errorf("no metrics emitted")

Copilot AI Jan 13, 2026

The function returns an error "no metrics emitted" when the panic_total metric is not found. However, this is expected behavior when no panics have occurred (the metric hasn't been emitted yet). According to the comment on line 146, this should return 0 without an error. The calling code on line 76 will incorrectly fail the test when no panics have occurred but the metric hasn't been emitted yet.

Suggested change
- return 0, fmt.Errorf("no metrics emitted")
+ return 0, nil

Comment on lines +132 to +145
// Look for tidb_operator_controller_panic_total metric
for name, mf := range metricFamilies {
    if strings.Contains(name, "panic_total") {
        for _, metric := range mf.GetMetric() {
            if metric.Counter != nil {
                return metric.Counter.GetValue(), nil
            }
            if metric.Gauge != nil {
                return metric.Gauge.GetValue(), nil
            }
        }
    }
}

Copilot AI Jan 13, 2026

The loop only returns the value of the first metric that contains "panic_total" in its name. If there are multiple metrics matching this pattern (e.g., from different label combinations), only the first one will be checked. Consider iterating through all matching metrics and summing their values, or document why only the first metric is checked.

Suggested change
- // Look for tidb_operator_controller_panic_total metric
- for name, mf := range metricFamilies {
-     if strings.Contains(name, "panic_total") {
-         for _, metric := range mf.GetMetric() {
-             if metric.Counter != nil {
-                 return metric.Counter.GetValue(), nil
-             }
-             if metric.Gauge != nil {
-                 return metric.Gauge.GetValue(), nil
-             }
-         }
-     }
- }
+ // Look for tidb_operator_controller_panic_total metric and sum all matching series.
+ var panicTotal float64
+ found := false
+ for name, mf := range metricFamilies {
+     if strings.Contains(name, "panic_total") {
+         for _, metric := range mf.GetMetric() {
+             if metric.Counter != nil {
+                 panicTotal += metric.Counter.GetValue()
+                 found = true
+             } else if metric.Gauge != nil {
+                 panicTotal += metric.Gauge.GetValue()
+                 found = true
+             }
+         }
+     }
+ }
+ if found {
+     return panicTotal, nil
+ }
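
The summing behaviour suggested above can be tried in isolation with the sketch below. The sample type is a hypothetical stand-in for a parsed Prometheus metric that carries either a counter or a gauge value; it is not the real client type, and the family names and values are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// sample is a hypothetical stand-in for one parsed series: at most one of
// counter or gauge is expected to be set.
type sample struct {
	counter *float64
	gauge   *float64
}

// sumPanics adds up every counter or gauge sample in families whose name
// contains "panic_total". found reports whether any such sample existed, so
// the caller can tell "metric absent" apart from "metric present and zero".
func sumPanics(families map[string][]sample) (total float64, found bool) {
	for name, samples := range families {
		if !strings.Contains(name, "panic_total") {
			continue
		}
		for _, s := range samples {
			switch {
			case s.counter != nil:
				total += *s.counter
				found = true
			case s.gauge != nil:
				total += *s.gauge
				found = true
			}
		}
	}
	return total, found
}

// ptr returns a pointer to v, mirroring optional fields in parsed metrics.
func ptr(v float64) *float64 { return &v }

// demoFamilies builds illustrative data: two labeled panic series plus an
// unrelated gauge that must not be counted.
func demoFamilies() map[string][]sample {
	return map[string][]sample{
		"tidb_operator_controller_panic_total": {
			{counter: ptr(2)},
			{counter: ptr(1)},
		},
		"workqueue_depth": {
			{gauge: ptr(7)},
		},
	}
}

func main() {
	total, found := sumPanics(demoFamilies())
	fmt.Println(total, found)
}
```

Returning the found flag separately leaves the choice of what to do about an absent metric to the caller, which is the question raised in the earlier comment about returning 0, nil.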

    return fmt.Errorf("failed to list operator pods: %w", err)
}
if len(pods.Items) == 0 {
    return fmt.Errorf("no operator pod found in namespace %s with label %s", ns, metav1.FormatLabelSelector(d.Spec.Selector))

Copilot AI Jan 13, 2026

The error message includes the deployment selector but refers to it as "label" which is slightly misleading. Consider changing to "no operator pod found in namespace %s with selector %s" for clarity.

Suggested change
- return fmt.Errorf("no operator pod found in namespace %s with label %s", ns, metav1.FormatLabelSelector(d.Spec.Selector))
+ return fmt.Errorf("no operator pod found in namespace %s with selector %s", ns, metav1.FormatLabelSelector(d.Spec.Selector))

    OperatorMetricsPort = 8080
)

// CheckPanicMetrics checks the operator panic metrics and returns the panic count.

Copilot AI Jan 13, 2026

The comment mentions "returns the panic count" but the function actually returns an error, not the panic count. The panic count is checked internally and an error is returned if it's greater than 0. Consider updating the comment to "checks the operator panic metrics and returns an error if any panics are detected" for accuracy.

Suggested change
- // CheckPanicMetrics checks the operator panic metrics and returns the panic count.
+ // CheckPanicMetrics checks the operator panic metrics and returns an error if any panics are detected.

}

if panicCount > 0 {
    return fmt.Errorf("panic count %v is greater than 0", panicCount)

Copilot AI Jan 13, 2026

The error message uses %v format specifier for a float64 value. Consider using %f or %.0f for better clarity when displaying the panic count as a numeric value.

Suggested change
- return fmt.Errorf("panic count %v is greater than 0", panicCount)
+ return fmt.Errorf("panic count %.0f is greater than 0", panicCount)
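
To compare the two verbs on a float64, the short sketch below formats the same message both ways. The function names withV and withF are illustrative only. For small whole numbers the two outputs match; the difference shows up with large values, where %v switches to exponent notation.

```go
package main

import "fmt"

// withV formats the count using the default verb, as in the original line.
func withV(count float64) string {
	return fmt.Sprintf("panic count %v is greater than 0", count)
}

// withF formats the count with no fractional digits, as in the suggestion.
func withF(count float64) string {
	return fmt.Sprintf("panic count %.0f is greater than 0", count)
}

func main() {
	for _, c := range []float64{3, 1000000} {
		fmt.Println(withV(c))
		fmt.Println(withF(c))
	}
}
```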

d, err := clientset.AppsV1().Deployments(ns).Get(
    ctx, deploy, metav1.GetOptions{})
if err != nil {
    return err

Copilot AI Jan 13, 2026

The error returned when getting the deployment is not wrapped with context. Consider using fmt.Errorf with %w to provide more context about what operation failed, similar to the error handling pattern used in lines 47, 60, and other places in this function.

Suggested change
- return err
+ return fmt.Errorf("failed to get deployment %s in namespace %s: %w", deploy, ns, err)
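
The sketch below shows why wrapping with %w is a safe way to add context: the message gains the operation and its arguments, while errors.Is still matches the underlying error. getDeployment, errNotFound, and the namespace and deployment names are hypothetical stand-ins, not the clientset call or any real API error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for an API error.
var errNotFound = errors.New("not found")

// getDeployment is a hypothetical stand-in for the clientset call; it always
// fails so the wrapping behaviour can be observed.
func getDeployment(ns, name string) error {
	return errNotFound
}

// checkWrapped adds context with %w, keeping the original error in the chain.
func checkWrapped(ns, name string) error {
	if err := getDeployment(ns, name); err != nil {
		return fmt.Errorf("failed to get deployment %s in namespace %s: %w", name, ns, err)
	}
	return nil
}

// isNotFound reports whether err wraps the sentinel anywhere in its chain.
func isNotFound(err error) bool {
	return errors.Is(err, errNotFound)
}

func main() {
	err := checkWrapped("tidb-admin", "tidb-operator")
	fmt.Println(err)
	fmt.Println(isNotFound(err))
}
```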

metricsURL := fmt.Sprintf("http://localhost:%d/metrics", localPort)
req, err := http.NewRequestWithContext(ctx, "GET", metricsURL, nil)
if err != nil {
    return 0, err

Copilot AI Jan 13, 2026

The error returned when creating the HTTP request is not wrapped with context. Consider using fmt.Errorf with %w to provide more information about what operation failed, consistent with the error handling pattern used elsewhere in this file.

Suggested change
- return 0, err
+ return 0, fmt.Errorf("failed to create metrics request for %s: %w", metricsURL, err)

Signed-off-by: liubo02 <liubo02@pingcap.com>
Signed-off-by: liubo02 <liubo02@pingcap.com>
@liubog2008 liubog2008 force-pushed the liubo02/add-panic-check branch from 122a9c8 to 7108d54 on January 14, 2026 12:09
@ti-chi-bot ti-chi-bot bot merged commit f8da624 into pingcap:main Jan 14, 2026
10 checks passed
ti-chi-bot bot (Contributor) commented Jan 14, 2026

@liubog2008: The following test failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name   Commit    Details   Required   Rerun command
pull-e2e    7108d54   link      unknown    /test pull-e2e
