e2e test for metrics #667
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (no approvers yet). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Welcome @blenbot! It looks like this is your first PR to volcano-sh/kthena 🎉
Summary of Changes
Hello @blenbot, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the testing infrastructure by adding robust end-to-end tests for the router's metrics collection and rate limiting functionality. The new tests ensure that critical operational metrics, such as request counts, latencies, error rates, and rate limit exceedances, are accurately captured and reported, providing better observability and reliability for the routing layer.
Code Review
This PR introduces e2e tests for router metrics and rate limit metrics. The implementation uses shared test functions for both standard and Gateway API setups, which is a good pattern. My review focuses on improving the robustness and maintainability of the new test code. I've suggested replacing fixed time.Sleep calls with polling mechanisms to avoid flaky tests, removing a redundant sleep, and refactoring duplicated code into a helper function.
fetchMetrics := func() string {
	resp, err := http.Get("http://127.0.0.1:9090/metrics")
	require.NoError(t, err, "Failed to fetch metrics")
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	require.NoError(t, err, "Failed to read metrics response")
	return string(body)
}
The fetchMetrics function is duplicated in TestMetricsShared and TestRateLimitMetricsShared (lines 964-971). The only difference is the port number. To improve maintainability and reduce code duplication, consider extracting this into a file-level helper function that accepts the port as a parameter. Also, http.Get does not have a timeout, which can make tests hang. Using an http.Client with a timeout is more robust.
Example:
func fetchMetricsFromPort(t *testing.T, port string) string {
client := http.Client{Timeout: 5 * time.Second}
resp, err := client.Get(fmt.Sprintf("http://127.0.0.1:%s/metrics", port))
require.NoError(t, err, "Failed to fetch metrics from port %s", port)
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
require.NoError(t, err, "Failed to read metrics response")
return string(body)
}
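Call sites then shrink to, e.g., metricsBody := fetchMetricsFromPort(t, "9090"), with each test passing its own port.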
test/e2e/router/shared.go
Outdated
time.Sleep(2 * time.Second)

metricsBody := fetchMetrics()

assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
Using time.Sleep to wait for metrics to become available can lead to flaky tests. If the system is slow, the test might fail. If the system is fast, the test waits longer than necessary. A better approach is to use a polling mechanism like require.Eventually to wait for the desired metrics to appear. This makes the test more robust and efficient.
This advice applies to other time.Sleep calls in this file that wait for metrics (e.g., lines 903, 916, and 997).
Suggested change, replacing the fixed sleep and assertions:

time.Sleep(2 * time.Second)
metricsBody := fetchMetrics()
assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")

with:

require.Eventually(t, func() bool {
	metricsBody := fetchMetrics()
	return strings.Contains(metricsBody, "kthena_router_requests_total") &&
		strings.Contains(metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName)) &&
		strings.Contains(metricsBody, `status_code="200"`)
}, 15*time.Second, time.Second, "Expected metrics for successful requests were not found in time")
@blenbot This is a good suggestion. You should fix this.
Also, there are too many sleeps; we should use the pattern the AI suggested above.
Alright, I'll add that.
test/e2e/router/shared.go
Outdated
})

// Wait for rate limiter to be configured
time.Sleep(3 * time.Second)
This time.Sleep appears to be redundant. The subsequent call to utils.CheckChatCompletions at line 979 likely has a retry mechanism to ensure the route is ready (as suggested by the comment on line 978), which makes this fixed sleep unnecessary. Removing it will make the test faster and more reliable by waiting for the actual condition instead of a fixed duration.
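As a sketch of the polling alternative, reusing the non-retrying utils.SendChatRequest helper from this file (treating a 200 response as the readiness signal is an assumption; the rate-limiter configuration itself might need a different probe):

// Poll until the route answers instead of sleeping for a fixed duration.
require.Eventually(t, func() bool {
	resp := utils.SendChatRequest(t, modelRoute.Spec.ModelName, messages)
	defer resp.Body.Close()
	return resp.StatusCode == 200
}, 30*time.Second, time.Second, "route never became ready")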
Pull request overview
This PR adds end-to-end tests for metrics collection in the kthena router ecosystem. The tests verify routing metrics (request count, latency, error rate) and rate limiting metrics accuracy.
Changes:
- Added two shared test functions, TestMetricsShared and TestRateLimitMetricsShared, in test/e2e/router/shared.go
- Added corresponding test wrappers in both the router and gateway-api test suites to execute the shared tests with and without Gateway API configuration
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/e2e/router/shared.go | Implements shared test functions for metrics verification: TestMetricsShared validates request count, latency, and error metrics; TestRateLimitMetricsShared validates rate limiting metrics |
| test/e2e/router/e2e_test.go | Adds TestMetrics and TestRateLimitMetrics wrappers that execute shared tests without Gateway API (no ParentRefs) |
| test/e2e/router/gateway-api/e2e_test.go | Adds TestMetrics and TestRateLimitMetrics wrappers that execute shared tests with Gateway API enabled (with ParentRefs) |
test/e2e/router/shared.go
Outdated
assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
Copilot AI · Jan 21, 2026
The metric verification only checks if the metric name exists in the response string. This is a weak assertion that could pass even if the metric has a value of 0 or is not properly tracking the requests. Consider parsing the metrics response and verifying that the counter value is at least 3 (since 3 requests were made in the loop above).
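As an illustration, a hedged sketch of such a check. metricValue is a hypothetical helper (not part of the existing test utils); it assumes the standard Prometheus text exposition format and needs the strings and strconv imports:

func metricValue(t *testing.T, body, metric, labelSubstr string) float64 {
	t.Helper()
	// Scan for a sample line such as:
	// kthena_router_requests_total{model="m",status_code="200"} 3
	for _, line := range strings.Split(body, "\n") {
		if strings.HasPrefix(line, metric+"{") && strings.Contains(line, labelSubstr) {
			fields := strings.Fields(line)
			v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
			require.NoError(t, err, "failed to parse value from %q", line)
			return v
		}
	}
	t.Fatalf("metric %s with label %s not found", metric, labelSubstr)
	return 0
}

// Usage after the three requests in the loop above:
count := metricValue(t, metricsBody, "kthena_router_requests_total",
	fmt.Sprintf(`model=%q`, modelRoute.Spec.ModelName))
assert.GreaterOrEqual(t, count, float64(3), "counter should reflect the 3 requests made")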
test/e2e/router/shared.go
Outdated
assert.Contains(t, metricsBody, "kthena_router_rate_limit_exceeded_total", "Rate limit exceeded metric should exist")
assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
Copilot AI · Jan 21, 2026
The metric verification only checks if the metric name and model label exist in the response. This doesn't verify that rate limiting actually occurred or that the metric value increased. Consider parsing the metrics response to verify that the kthena_router_rate_limit_exceeded_total counter value is greater than 0, since rate limiting should have been triggered based on the requests made above.
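With the same hypothetical metricValue helper sketched in the earlier comment, this could become:

exceeded := metricValue(t, metricsBody, "kthena_router_rate_limit_exceeded_total",
	fmt.Sprintf(`model=%q`, modelRoute.Spec.ModelName))
assert.Greater(t, exceeded, float64(0), "rate limiting should have been triggered at least once")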
Force-pushed from 073b3bc to 0952862.
Force-pushed from 0952862 to 81beff3.
}

t.Run("VerifyRequestCountMetrics", func(t *testing.T) {
	for i := 0; i < 3; i++ {
I would not suggest using hardcoded values.
	assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
})

t.Run("VerifyLatencyMetrics", func(t *testing.T) {
/cc @blenbot
@YaoZengzeng Sir, I would really appreciate your guidance on how this test is actually checking or verifying the latency metrics. To me, this test and the one above it look mostly the same. Could you please point out what I might be missing?
Makes sense. @blenbot could merge this subtest with VerifyRequestCountMetrics and only check that these metrics exist. Please check that they're not zero as well.
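A minimal sketch of the merged subtest, assuming the require.Eventually pattern suggested earlier, the hypothetical metricValue helper from the Copilot thread above, and the standard _count series that Prometheus histograms export:

t.Run("VerifyRequestCountAndLatencyMetrics", func(t *testing.T) {
	for i := 0; i < 3; i++ {
		resp := utils.CheckChatCompletions(t, modelRoute.Spec.ModelName, messages)
		assert.Equal(t, 200, resp.StatusCode)
	}

	// Poll instead of sleeping until both metric families show up.
	require.Eventually(t, func() bool {
		body := fetchMetrics()
		return strings.Contains(body, "kthena_router_requests_total") &&
			strings.Contains(body, "kthena_router_request_duration_seconds")
	}, 15*time.Second, time.Second, "metrics did not appear in time")

	body := fetchMetrics()
	model := fmt.Sprintf(`model=%q`, modelRoute.Spec.ModelName)
	assert.Greater(t, metricValue(t, body, "kthena_router_requests_total", model),
		float64(0), "request counter should be non-zero")
	assert.Greater(t, metricValue(t, body, "kthena_router_request_duration_seconds_count", model),
		float64(0), "duration histogram should have recorded samples")
})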
> Makes sense. @blenbot could merge this subtest with VerifyRequestCountMetrics and only check that these metrics exist. Please check that they're not zero as well.
Hey @YaoZengzeng, I looked into extracting the exact metrics (total requests, latency, error rate, and rate-limiting metrics). I first logged the metrics body to see whether that's possible, and it seems only failed/rate-limited requests are recorded via metricsRecorder; when I checked the router package, metricsRecorder isn't called on successful requests to finish the job and record the metric.
I'm unsure whether this is an intentional design or something that was just missed, but if you could shed some light, this could be a possible issue. As a workaround, I called metricsRecorder.Finish on successful requests in the router package, after which I logged the metrics body again, and the data for total requests and latency of successful requests was there. I could then parse it either through regex or the ParseMetricsURL helper in the metrics package. Please let me know what the optimal approach would be.
Signed-off-by: blenbot <[email protected]>
Force-pushed from 81beff3 to dbce0be.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
test/e2e/router/shared.go
Outdated
metricsBody := fetchMetrics()

assert.Contains(t, metricsBody, "kthena_router_request_duration_seconds", "Duration histogram should exist")
Copilot AI · Jan 23, 2026
The test only checks for the existence of the duration histogram metric but doesn't validate that it actually recorded the request. Consider verifying that the histogram has at least one sample by checking for bucket values or the _count suffix of the histogram metric.
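For example, in addition to the family name, one could assert on the histogram's _count series (assuming the standard Prometheus histogram naming):

assert.Contains(t, metricsBody, "kthena_router_request_duration_seconds_count",
	"Duration histogram should have a sample count series")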
test/e2e/router/shared.go
Outdated
metricsBody := fetchMetrics()

assert.Contains(t, metricsBody, `model="non-existent-model-xyz"`, "Error model should be tracked in metrics")
Copilot AI · Jan 23, 2026
The test only verifies that the non-existent model name appears in the metrics but doesn't check for a 404 status code in the metrics or verify that an error-related metric counter was incremented. Consider also checking for status_code="404" in the metrics or for a specific error counter metric to ensure errors are properly tracked.
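For example (assuming error responses carry the same status_code label that successful ones do):

assert.Contains(t, metricsBody, `status_code="404"`, "404 responses should be tracked in metrics")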
t.Run("VerifyRequestCountMetrics", func(t *testing.T) {
	for i := 0; i < 3; i++ {
		resp := utils.CheckChatCompletions(t, modelRoute.Spec.ModelName, messages)
		assert.Equal(t, 200, resp.StatusCode)
	}

	time.Sleep(2 * time.Second)

	metricsBody := fetchMetrics()

	assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
	assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
	assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
})

t.Run("VerifyLatencyMetrics", func(t *testing.T) {
	resp := utils.CheckChatCompletions(t, modelRoute.Spec.ModelName, messages)
	assert.Equal(t, 200, resp.StatusCode)

	time.Sleep(2 * time.Second)

	metricsBody := fetchMetrics()

	assert.Contains(t, metricsBody, "kthena_router_request_duration_seconds", "Duration histogram should exist")
})

t.Run("VerifyErrorMetrics", func(t *testing.T) {
	// SendChatRequest without retry since we expect 404 for non-existent model
	resp := utils.SendChatRequest(t, "non-existent-model-xyz", messages)
	defer resp.Body.Close()
	assert.Equal(t, 404, resp.StatusCode)

	time.Sleep(2 * time.Second)

	metricsBody := fetchMetrics()

	assert.Contains(t, metricsBody, `model="non-existent-model-xyz"`, "Error model should be tracked in metrics")
})
}
Copilot AI · Jan 23, 2026
The subtests in TestMetricsShared have implicit dependencies on each other because they all use the same ModelRoute. However, if subtests are run in parallel (with t.Parallel()), this would cause issues. While Go's default behavior is to run subtests sequentially, it's a best practice to either: 1) ensure each subtest is independent by creating separate ModelRoutes, or 2) explicitly document that these subtests must run sequentially. The current implementation assumes sequential execution but doesn't make this requirement explicit.
func TestMetrics(t *testing.T) {
	TestMetricsShared(t, testCtx, testNamespace, false, kthenaNamespace)
Copilot AI · Jan 23, 2026
Inconsistent parameter passing: TestMetrics passes kthenaNamespace as the last parameter, while all other test functions in this file (TestModelRouteSimple, TestModelRouteMultiModels, etc.) pass an empty string "" for the kthenaNamespace parameter. This inconsistency could lead to different behavior. Verify whether passing kthenaNamespace is intentional and necessary, or if it should be "" like the other tests for consistency.
func TestRateLimitMetrics(t *testing.T) {
	TestRateLimitMetricsShared(t, testCtx, testNamespace, false, kthenaNamespace)
Copilot AI · Jan 23, 2026
Inconsistent parameter passing: TestRateLimitMetrics passes kthenaNamespace as the last parameter, while all other test functions in this file (TestModelRouteSimple, TestModelRouteMultiModels, etc.) pass an empty string "" for the kthenaNamespace parameter. This inconsistency could lead to different behavior. Verify whether passing kthenaNamespace is intentional and necessary, or if it should be "" like the other tests for consistency.
Signed-off-by: blenbot <[email protected]>
@YaoZengzeng I have added the suggestions.
body, err := io.ReadAll(resp.Body)
require.NoError(t, err, "Failed to read metrics response")
return string(body)
}
Good suggestion, @blenbot please fix this one.
What type of PR is this?
Implementing the TestMetrics e2e test in the kthena ecosystem.
What this PR does / why we need it:
- Verify routing metrics collection (request count, latency, error rate)
- Verify rate limiting metrics accuracy
Which issue(s) this PR fixes:
Part of #556
Special notes for your reviewer:
Not for now.