e2e test for metrics #667
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (no approvers yet). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Welcome @blenbot! It looks like this is your first PR to volcano-sh/kthena 🎉
Summary of Changes
Hello @blenbot, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the testing infrastructure by adding robust end-to-end tests for the router's metrics collection and rate limiting functionality. The new tests ensure that critical operational metrics, such as request counts, latencies, error rates, and rate limit exceedances, are accurately captured and reported, providing better observability and reliability for the routing layer.
Code Review
This PR introduces e2e tests for router metrics and rate limit metrics. The implementation uses shared test functions for both standard and Gateway API setups, which is a good pattern. My review focuses on improving the robustness and maintainability of the new test code. I've suggested replacing fixed time.Sleep calls with polling mechanisms to avoid flaky tests, removing a redundant sleep, and refactoring duplicated code into a helper function.
fetchMetrics := func() string {
	resp, err := http.Get("http://127.0.0.1:9090/metrics")
	require.NoError(t, err, "Failed to fetch metrics")
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	require.NoError(t, err, "Failed to read metrics response")
	return string(body)
}
The fetchMetrics function is duplicated in TestMetricsShared and TestRateLimitMetricsShared (lines 964-971). The only difference is the port number. To improve maintainability and reduce code duplication, consider extracting this into a file-level helper function that accepts the port as a parameter. Also, http.Get does not have a timeout, which can make tests hang. Using an http.Client with a timeout is more robust.
Example:
func fetchMetricsFromPort(t *testing.T, port string) string {
client := http.Client{Timeout: 5 * time.Second}
resp, err := client.Get(fmt.Sprintf("http://127.0.0.1:%s/metrics", port))
require.NoError(t, err, "Failed to fetch metrics from port %s", port)
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
require.NoError(t, err, "Failed to read metrics response")
return string(body)
}
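Call sites then shrink to, e.g., metricsBody := fetchMetricsFromPort(t, "9090"), with each test passing its own port.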
test/e2e/router/shared.go
Outdated
time.Sleep(2 * time.Second)

metricsBody := fetchMetrics()

assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
Using time.Sleep to wait for metrics to become available can lead to flaky tests. If the system is slow, the test might fail. If the system is fast, the test waits longer than necessary. A better approach is to use a polling mechanism like require.Eventually to wait for the desired metrics to appear. This makes the test more robust and efficient.
This advice applies to other time.Sleep calls in this file that wait for metrics (e.g., lines 903, 916, and 997).
Suggested change, replacing the fixed sleep and assertions:

time.Sleep(2 * time.Second)
metricsBody := fetchMetrics()
assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")

with:

require.Eventually(t, func() bool {
	metricsBody := fetchMetrics()
	return strings.Contains(metricsBody, "kthena_router_requests_total") &&
		strings.Contains(metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName)) &&
		strings.Contains(metricsBody, `status_code="200"`)
}, 15*time.Second, time.Second, "Expected metrics for successful requests were not found in time")
@blenbot This is a good suggestion. You should fix this.
Also, there are too many sleeps; we should use the pattern the AI suggested above.
Alright, I'll add that.
test/e2e/router/shared.go
Outdated
})

// Wait for rate limiter to be configured
time.Sleep(3 * time.Second)
This time.Sleep appears to be redundant. The subsequent call to utils.CheckChatCompletions at line 979 likely has a retry mechanism to ensure the route is ready (as suggested by the comment on line 978), which makes this fixed sleep unnecessary. Removing it will make the test faster and more reliable by waiting for the actual condition instead of a fixed duration.
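As a sketch of the polling alternative, reusing the non-retrying utils.SendChatRequest helper from this file (treating a 200 response as the readiness signal is an assumption; the rate-limiter configuration itself might need a different probe):

// Poll until the route answers instead of sleeping for a fixed duration.
require.Eventually(t, func() bool {
	resp := utils.SendChatRequest(t, modelRoute.Spec.ModelName, messages)
	defer resp.Body.Close()
	return resp.StatusCode == 200
}, 30*time.Second, time.Second, "route never became ready")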
Pull request overview
This PR adds end-to-end tests for metrics collection in the kthena router ecosystem. The tests verify routing metrics (request count, latency, error rate) and rate limiting metrics accuracy.
Changes:
- Added two shared test functions, TestMetricsShared and TestRateLimitMetricsShared, in test/e2e/router/shared.go
- Added corresponding test wrappers in both the router and gateway-api test suites to execute the shared tests with and without Gateway API configuration
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/e2e/router/shared.go | Implements shared test functions for metrics verification: TestMetricsShared validates request count, latency, and error metrics; TestRateLimitMetricsShared validates rate limiting metrics |
| test/e2e/router/e2e_test.go | Adds TestMetrics and TestRateLimitMetrics wrappers that execute shared tests without Gateway API (no ParentRefs) |
| test/e2e/router/gateway-api/e2e_test.go | Adds TestMetrics and TestRateLimitMetrics wrappers that execute shared tests with Gateway API enabled (with ParentRefs) |
test/e2e/router/shared.go
Outdated
assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
Copilot AI · Jan 21, 2026
The metric verification only checks if the metric name exists in the response string. This is a weak assertion that could pass even if the metric has a value of 0 or is not properly tracking the requests. Consider parsing the metrics response and verifying that the counter value is at least 3 (since 3 requests were made in the loop above).
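As an illustration, a hedged sketch of such a check. metricValue is a hypothetical helper (not part of the existing test utils); it assumes the standard Prometheus text exposition format and needs the strings and strconv imports:

func metricValue(t *testing.T, body, metric, labelSubstr string) float64 {
	t.Helper()
	// Scan for a sample line such as:
	// kthena_router_requests_total{model="m",status_code="200"} 3
	for _, line := range strings.Split(body, "\n") {
		if strings.HasPrefix(line, metric+"{") && strings.Contains(line, labelSubstr) {
			fields := strings.Fields(line)
			v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
			require.NoError(t, err, "failed to parse value from %q", line)
			return v
		}
	}
	t.Fatalf("metric %s with label %s not found", metric, labelSubstr)
	return 0
}

// Usage after the three requests in the loop above:
count := metricValue(t, metricsBody, "kthena_router_requests_total",
	fmt.Sprintf(`model=%q`, modelRoute.Spec.ModelName))
assert.GreaterOrEqual(t, count, float64(3), "counter should reflect the 3 requests made")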
test/e2e/router/shared.go
Outdated
assert.Contains(t, metricsBody, "kthena_router_rate_limit_exceeded_total", "Rate limit exceeded metric should exist")
assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
Copilot AI · Jan 21, 2026
The metric verification only checks if the metric name and model label exist in the response. This doesn't verify that rate limiting actually occurred or that the metric value increased. Consider parsing the metrics response to verify that the kthena_router_rate_limit_exceeded_total counter value is greater than 0, since rate limiting should have been triggered based on the requests made above.
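With the same hypothetical metricValue helper sketched in the earlier comment, this could become:

exceeded := metricValue(t, metricsBody, "kthena_router_rate_limit_exceeded_total",
	fmt.Sprintf(`model=%q`, modelRoute.Spec.ModelName))
assert.Greater(t, exceeded, float64(0), "rate limiting should have been triggered at least once")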
Force-pushed from 073b3bc to 0952862.
Force-pushed from 0952862 to 81beff3.
}

t.Run("VerifyRequestCountMetrics", func(t *testing.T) {
	for i := 0; i < 3; i++ {
I would not suggest using hardcoded values.
	assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
})

t.Run("VerifyLatencyMetrics", func(t *testing.T) {
/cc @blenbot
@YaoZengzeng Sir, I would really appreciate your guidance on how this test is actually checking or verifying the latency metrics. To me, this test and the one above it look mostly the same. Could you please point out what I might be missing?
Makes sense. @blenbot could merge this subtest with VerifyRequestCountMetrics and only check that these metrics exist. Please check that they're not zero as well.
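A minimal sketch of the merged subtest, assuming the require.Eventually pattern suggested earlier, the hypothetical metricValue helper from the Copilot thread above, and the standard _count series that Prometheus histograms export:

t.Run("VerifyRequestCountAndLatencyMetrics", func(t *testing.T) {
	for i := 0; i < 3; i++ {
		resp := utils.CheckChatCompletions(t, modelRoute.Spec.ModelName, messages)
		assert.Equal(t, 200, resp.StatusCode)
	}

	// Poll instead of sleeping until both metric families show up.
	require.Eventually(t, func() bool {
		body := fetchMetrics()
		return strings.Contains(body, "kthena_router_requests_total") &&
			strings.Contains(body, "kthena_router_request_duration_seconds")
	}, 15*time.Second, time.Second, "metrics did not appear in time")

	body := fetchMetrics()
	model := fmt.Sprintf(`model=%q`, modelRoute.Spec.ModelName)
	assert.Greater(t, metricValue(t, body, "kthena_router_requests_total", model),
		float64(0), "request counter should be non-zero")
	assert.Greater(t, metricValue(t, body, "kthena_router_request_duration_seconds_count", model),
		float64(0), "duration histogram should have recorded samples")
})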
> Makes sense. @blenbot could merge this subtest with VerifyRequestCountMetrics and only check that these metrics exist. Please check that they're not zero as well.
Hey @YaoZengzeng, I looked into extracting the exact metrics (total requests, latency, error rate, and rate-limiting metrics). I first logged the metrics body to see whether that's possible, and it seems only failed/rate-limited requests are recorded via metricsRecorder; when I checked the router package, metricsRecorder isn't called on successful requests to finish the job and record the metric.
I'm unsure whether this is an intentional design or something that was just missed, but if you could shed some light, this could be a possible issue. As a workaround, I called metricsRecorder.Finish on successful requests in the router package, after which I logged the metrics body again, and the data for total requests and latency of successful requests was there. I could then parse it either through regex or the ParseMetricsURL helper in the metrics package. Please let me know what the optimal approach would be.
Signed-off-by: blenbot <[email protected]>
Force-pushed from 81beff3 to dbce0be.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
test/e2e/router/shared.go
Outdated
metricsBody := fetchMetrics()

assert.Contains(t, metricsBody, "kthena_router_request_duration_seconds", "Duration histogram should exist")
Copilot AI · Jan 23, 2026
The test only checks for the existence of the duration histogram metric but doesn't validate that it actually recorded the request. Consider verifying that the histogram has at least one sample by checking for bucket values or the _count suffix of the histogram metric.
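For example, in addition to the family name, one could assert on the histogram's _count series (assuming the standard Prometheus histogram naming):

assert.Contains(t, metricsBody, "kthena_router_request_duration_seconds_count",
	"Duration histogram should have a sample count series")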
test/e2e/router/shared.go
Outdated
metricsBody := fetchMetrics()

assert.Contains(t, metricsBody, `model="non-existent-model-xyz"`, "Error model should be tracked in metrics")
Copilot AI · Jan 23, 2026
The test only verifies that the non-existent model name appears in the metrics but doesn't check for a 404 status code in the metrics or verify that an error-related metric counter was incremented. Consider also checking for status_code="404" in the metrics or for a specific error counter metric to ensure errors are properly tracked.
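For example (assuming error responses carry the same status_code label that successful ones do):

assert.Contains(t, metricsBody, `status_code="404"`, "404 responses should be tracked in metrics")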
t.Run("VerifyRequestCountMetrics", func(t *testing.T) {
	for i := 0; i < 3; i++ {
		resp := utils.CheckChatCompletions(t, modelRoute.Spec.ModelName, messages)
		assert.Equal(t, 200, resp.StatusCode)
	}

	time.Sleep(2 * time.Second)

	metricsBody := fetchMetrics()

	assert.Contains(t, metricsBody, "kthena_router_requests_total", "Request count metric should exist")
	assert.Contains(t, metricsBody, fmt.Sprintf(`model="%s"`, modelRoute.Spec.ModelName), "Metric should have model label")
	assert.Contains(t, metricsBody, `status_code="200"`, "Metric should have successful status code")
})

t.Run("VerifyLatencyMetrics", func(t *testing.T) {
	resp := utils.CheckChatCompletions(t, modelRoute.Spec.ModelName, messages)
	assert.Equal(t, 200, resp.StatusCode)

	time.Sleep(2 * time.Second)

	metricsBody := fetchMetrics()

	assert.Contains(t, metricsBody, "kthena_router_request_duration_seconds", "Duration histogram should exist")
})

t.Run("VerifyErrorMetrics", func(t *testing.T) {
	// SendChatRequest without retry since we expect 404 for non-existent model
	resp := utils.SendChatRequest(t, "non-existent-model-xyz", messages)
	defer resp.Body.Close()
	assert.Equal(t, 404, resp.StatusCode)

	time.Sleep(2 * time.Second)

	metricsBody := fetchMetrics()

	assert.Contains(t, metricsBody, `model="non-existent-model-xyz"`, "Error model should be tracked in metrics")
})
}
Copilot AI · Jan 23, 2026
The subtests in TestMetricsShared have implicit dependencies on each other because they all use the same ModelRoute. However, if subtests are run in parallel (with t.Parallel()), this would cause issues. While Go's default behavior is to run subtests sequentially, it's a best practice to either: 1) ensure each subtest is independent by creating separate ModelRoutes, or 2) explicitly document that these subtests must run sequentially. The current implementation assumes sequential execution but doesn't make this requirement explicit.
func TestMetrics(t *testing.T) {
	TestMetricsShared(t, testCtx, testNamespace, false, kthenaNamespace)
Copilot AI · Jan 23, 2026
Inconsistent parameter passing: TestMetrics passes kthenaNamespace as the last parameter, while all other test functions in this file (TestModelRouteSimple, TestModelRouteMultiModels, etc.) pass an empty string "" for the kthenaNamespace parameter. This inconsistency could lead to different behavior. Verify whether passing kthenaNamespace is intentional and necessary, or if it should be "" like the other tests for consistency.
func TestRateLimitMetrics(t *testing.T) {
	TestRateLimitMetricsShared(t, testCtx, testNamespace, false, kthenaNamespace)
Copilot AI · Jan 23, 2026
Inconsistent parameter passing: TestRateLimitMetrics passes kthenaNamespace as the last parameter, while all other test functions in this file (TestModelRouteSimple, TestModelRouteMultiModels, etc.) pass an empty string "" for the kthenaNamespace parameter. This inconsistency could lead to different behavior. Verify whether passing kthenaNamespace is intentional and necessary, or if it should be "" like the other tests for consistency.
Signed-off-by: blenbot <[email protected]>
@YaoZengzeng I have added the suggestions.
body, err := io.ReadAll(resp.Body)
require.NoError(t, err, "Failed to read metrics response")
return string(body)
}
Good suggestion, @blenbot please fix this one.
What type of PR is this?
Implementing the TestMetrics e2e test in the kthena ecosystem.
What this PR does / why we need it:
- Verify routing metrics collection (request count, latency, error rate)
- Verify rate limiting metrics accuracy
Which issue(s) this PR fixes:
Part of #556
Special notes for your reviewer:
Not for now.