test(e2e): check panic count #6652
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

    @@           Coverage Diff           @@
    ##             main    #6652   +/- ##
    =======================================
      Coverage   38.31%   38.31%
    =======================================
      Files         368      368
      Lines       21422    21422
    =======================================
      Hits         8207     8207
      Misses      13215    13215

Flags with carried forward coverage won't be shown.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fgksgf

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.

[LGTM Timeline notifier] Timeline:
Pull request overview
This PR adds end-to-end test validation for panic metrics in the TiDB operator. After all e2e tests complete, the test suite now checks that no panics occurred during the test run by querying the operator's Prometheus metrics endpoint.
Changes:
- New metrics utility package to fetch and parse Prometheus metrics from operator pods
- Integration of panic count check in the e2e test suite's SynchronizedAfterSuite hook
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/e2e/utils/metrics/metrics.go | New utility package that port-forwards to operator pods and parses panic_total metrics from Prometheus endpoints |
| tests/e2e/e2e.go | Adds SynchronizedAfterSuite hook to check operator panic metrics after all tests complete |
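For orientation, here is a minimal sketch of how such a hook can be wired up with Ginkgo v2. The helper name `metrics.CheckPanicMetrics`, the namespace and deployment strings, and the `clientset`/`restConfig` variables are illustrative assumptions, not the PR's exact code.

```go
package e2e

import (
	"context"
	"fmt"

	ginkgo "github.com/onsi/ginkgo/v2"
)

// Sketch only: assumes the e2e suite already exposes `clientset`, `restConfig`,
// and a metrics helper package providing CheckPanicMetrics.
var _ = ginkgo.SynchronizedAfterSuite(func() {
	// First body runs on every parallel process; nothing panic-related to do here.
}, func() {
	// Second body runs exactly once, after every parallel test process has finished.
	ctx := context.Background()
	// The namespace and deployment name below are placeholders for the operator install.
	if err := metrics.CheckPanicMetrics(ctx, clientset, restConfig, "tidb-admin", "tidb-operator"); err != nil {
		ginkgo.Fail(fmt.Sprintf("operator panic check failed: %v", err))
	}
})
```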
tests/e2e/utils/metrics/metrics.go
Outdated
    	// OperatorPodLabelSelector is the label selector for operator pods
    	OperatorPodLabelSelector = "app.kubernetes.io/component=controller"
The constant OperatorPodLabelSelector is defined but never used in the code. The implementation correctly uses the deployment's label selector instead (line 57). Consider removing this unused constant, or documenting it if it is intended for future use.
Suggested change:

    -	// OperatorPodLabelSelector is the label selector for operator pods
    -	OperatorPodLabelSelector = "app.kubernetes.io/component=controller"
tests/e2e/utils/metrics/metrics.go
Outdated
    	}

    	// If metric not found, it means no panics have occurred (metric not yet emitted)
    	return 0, fmt.Errorf("no metrics emitted")
The function returns an error "no metrics emitted" when the panic_total metric is not found. However, this is expected behavior when no panics have occurred (the metric hasn't been emitted yet). According to the comment on line 146, this should return 0 without an error. The calling code on line 76 will incorrectly fail the test when no panics have occurred but the metric hasn't been emitted yet.
Suggested change:

    -	return 0, fmt.Errorf("no metrics emitted")
    +	return 0, nil
    	// Look for tidb_operator_controller_panic_total metric
    	for name, mf := range metricFamilies {
    		if strings.Contains(name, "panic_total") {
    			for _, metric := range mf.GetMetric() {
    				if metric.Counter != nil {
    					return metric.Counter.GetValue(), nil
    				}
    				if metric.Gauge != nil {
    					return metric.Gauge.GetValue(), nil
    				}
    			}
    		}
    	}
The loop only returns the value of the first metric that contains "panic_total" in its name. If there are multiple metrics matching this pattern (e.g., from different label combinations), only the first one will be checked. Consider iterating through all matching metrics and summing their values, or documenting why only the first metric is checked.
Suggested change:

    -	// Look for tidb_operator_controller_panic_total metric
    -	for name, mf := range metricFamilies {
    -		if strings.Contains(name, "panic_total") {
    -			for _, metric := range mf.GetMetric() {
    -				if metric.Counter != nil {
    -					return metric.Counter.GetValue(), nil
    -				}
    -				if metric.Gauge != nil {
    -					return metric.Gauge.GetValue(), nil
    -				}
    -			}
    -		}
    -	}
    +	// Look for tidb_operator_controller_panic_total metric and sum all matching series.
    +	var panicTotal float64
    +	found := false
    +	for name, mf := range metricFamilies {
    +		if strings.Contains(name, "panic_total") {
    +			for _, metric := range mf.GetMetric() {
    +				if metric.Counter != nil {
    +					panicTotal += metric.Counter.GetValue()
    +					found = true
    +				} else if metric.Gauge != nil {
    +					panicTotal += metric.Gauge.GetValue()
    +					found = true
    +				}
    +			}
    +		}
    +	}
    +	if found {
    +		return panicTotal, nil
    +	}
    		return fmt.Errorf("failed to list operator pods: %w", err)
    	}
    	if len(pods.Items) == 0 {
    		return fmt.Errorf("no operator pod found in namespace %s with label %s", ns, metav1.FormatLabelSelector(d.Spec.Selector))
The error message includes the deployment selector but refers to it as "label" which is slightly misleading. Consider changing to "no operator pod found in namespace %s with selector %s" for clarity.
Suggested change:

    -	return fmt.Errorf("no operator pod found in namespace %s with label %s", ns, metav1.FormatLabelSelector(d.Spec.Selector))
    +	return fmt.Errorf("no operator pod found in namespace %s with selector %s", ns, metav1.FormatLabelSelector(d.Spec.Selector))
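For context, a hedged sketch of listing pods by reusing a Deployment's label selector, which is the approach the review describes. The identifiers `clientset`, `ns`, `deploy`, and `ctx` are assumed to match the quoted snippets and are not verified against the PR.

```go
// Sketch: find the operator pods that belong to a Deployment by converting its
// LabelSelector back into a selector string for the pod list call.
d, err := clientset.AppsV1().Deployments(ns).Get(ctx, deploy, metav1.GetOptions{})
if err != nil {
	return fmt.Errorf("failed to get deployment %s in namespace %s: %w", deploy, ns, err)
}
pods, err := clientset.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
	LabelSelector: metav1.FormatLabelSelector(d.Spec.Selector),
})
if err != nil {
	return fmt.Errorf("failed to list operator pods: %w", err)
}
if len(pods.Items) == 0 {
	return fmt.Errorf("no operator pod found in namespace %s with selector %s", ns, metav1.FormatLabelSelector(d.Spec.Selector))
}
```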
    	OperatorMetricsPort = 8080
    )

    // CheckPanicMetrics checks the operator panic metrics and returns the panic count.
The comment mentions "returns the panic count" but the function actually returns an error, not the panic count. The panic count is checked internally and an error is returned if it's greater than 0. Consider updating the comment to "checks the operator panic metrics and returns an error if any panics are detected" for accuracy.
Suggested change:

    -	// CheckPanicMetrics checks the operator panic metrics and returns the panic count.
    +	// CheckPanicMetrics checks the operator panic metrics and returns an error if any panics are detected.
    	}

    	if panicCount > 0 {
    		return fmt.Errorf("panic count %v is greater than 0", panicCount)
The error message uses %v format specifier for a float64 value. Consider using %f or %.0f for better clarity when displaying the panic count as a numeric value.
Suggested change:

    -	return fmt.Errorf("panic count %v is greater than 0", panicCount)
    +	return fmt.Errorf("panic count %.0f is greater than 0", panicCount)
    	d, err := clientset.AppsV1().Deployments(ns).Get(
    		ctx, deploy, metav1.GetOptions{})
    	if err != nil {
    		return err
The error returned when getting the deployment is not wrapped with context. Consider using fmt.Errorf with %w to provide more context about what operation failed, similar to the error handling pattern used in lines 47, 60, and other places in this function.
Suggested change:

    -		return err
    +		return fmt.Errorf("failed to get deployment %s in namespace %s: %w", deploy, ns, err)
    	metricsURL := fmt.Sprintf("http://localhost:%d/metrics", localPort)
    	req, err := http.NewRequestWithContext(ctx, "GET", metricsURL, nil)
    	if err != nil {
    		return 0, err
The error returned when creating the HTTP request is not wrapped with context. Consider using fmt.Errorf with %w to provide more information about what operation failed, consistent with the error handling pattern used elsewhere in this file.
Suggested change:

    -		return 0, err
    +		return 0, fmt.Errorf("failed to create metrics request for %s: %w", metricsURL, err)
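For reference, a minimal sketch of fetching and parsing the text-format metrics once the port-forward is in place. The use of prometheus/common/expfmt to produce `metricFamilies` is an assumption about how the helper could work, not something confirmed from the PR; `ctx` and `localPort` are taken from the quoted snippet.

```go
// Sketch: fetch Prometheus text-format metrics from the forwarded local port
// and parse them into metric families keyed by metric name.
metricsURL := fmt.Sprintf("http://localhost:%d/metrics", localPort)
req, err := http.NewRequestWithContext(ctx, http.MethodGet, metricsURL, nil)
if err != nil {
	return 0, fmt.Errorf("failed to create metrics request for %s: %w", metricsURL, err)
}
resp, err := http.DefaultClient.Do(req)
if err != nil {
	return 0, fmt.Errorf("failed to fetch metrics from %s: %w", metricsURL, err)
}
defer resp.Body.Close()

// expfmt.TextParser understands the Prometheus exposition format.
var parser expfmt.TextParser
metricFamilies, err := parser.TextToMetricFamilies(resp.Body)
if err != nil {
	return 0, fmt.Errorf("failed to parse metrics: %w", err)
}
_ = metricFamilies // the panic_total lookup shown earlier would follow here
```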
Signed-off-by: liubo02 <liubo02@pingcap.com>
122a9c8 to 7108d54
@liubog2008: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
check panic count after all e2e cases are done.