fix: Fix failed request count in report #169
Conversation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses an inaccuracy in reporting failed requests within the benchmark results. It introduces a dedicated mechanism to track and report the exact number of failed samples, enhancing the reliability and detail of performance metrics. The changes involve modifying the report structure, updating its generation and display, and adding unit tests to validate the fix.
Code Review
This pull request correctly identifies that the previous method for calculating failed requests (total - successful) was inaccurate and introduces a new n_samples_failed field to the report for a more precise count. The changes to add this field, populate it, and display it in the summary are well-implemented. However, I've identified a significant issue in how the failed count is calculated: it doesn't respect the performance window defined by the STOP_PERFORMANCE_TRACKING event. This inconsistency with other metrics can lead to incorrect reports. I've provided a detailed comment with a suggested fix for this issue.
     test_started_at=test_started_at,
     n_samples_issued=sample_statuses["total_sent"],
     n_samples_completed=sample_statuses["completed"],
+    n_samples_failed=self.get_error_count(),
The get_error_count() method counts all ERROR events in the database, without filtering for the performance window defined by STOP_PERFORMANCE_TRACKING. This is inconsistent with how n_samples_issued and n_samples_completed are calculated (they respect the performance window via get_sample_statuses).
This can lead to incorrect metrics. For example, if a sample is issued and fails after the performance window closes, it will be counted in n_samples_failed but not in n_samples_issued, which could lead to confusing results like n_samples_failed > n_samples_issued.
To ensure consistency, get_error_count() should be modified to only count errors for samples that were issued within the performance window. A similar filtering logic is already used in get_sample_statuses() and derive_TTFT().
Here is a suggested implementation for get_error_count:
def get_error_count(self) -> int:
    stop_ts = self.stop_performance_tracking_timestamp_ns
    where_clause = ""
    if stop_ts != float("inf"):
        # Filter for errors associated with samples issued before the stop timestamp.
        where_clause = f"""
            AND sample_uuid IN (
                SELECT DISTINCT sample_uuid FROM events
                WHERE event_type = '{SessionEvent.LOADGEN_ISSUE_CALLED.value}'
                AND timestamp_ns < {stop_ts}
            )
        """
    query = f"""
        SELECT COUNT(*) AS error_count
        FROM events
        WHERE event_type = '{SessionEvent.ERROR.value}'
        {where_clause}
    """
    result = self.cur_.execute(query).fetchone()
    return result[0] if result else 0

Additionally, a test case that includes a STOP_PERFORMANCE_TRACKING event and errors both inside and outside the performance window should be added to verify the fix.
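As a sketch of that test idea, the snippet below sets up an in-memory SQLite table shaped like the reporter's events table (event_type, sample_uuid, timestamp_ns) and runs the window-filtered query by hand, with one error inside and one outside the performance window. The literal event-type strings and the ad-hoc schema are illustrative stand-ins for SessionEvent and the real recording fixtures, not the project's actual test code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (event_type TEXT, sample_uuid TEXT, timestamp_ns INTEGER)")

stop_ts = 1_000  # STOP_PERFORMANCE_TRACKING: the performance window closes here
cur.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("LOADGEN_ISSUE_CALLED", "s1", 100),    # issued inside the window
        ("ERROR",                "s1", 200),    # fails inside the window -> counted
        ("LOADGEN_ISSUE_CALLED", "s2", 1_500),  # issued after the window
        ("ERROR",                "s2", 1_600),  # fails outside the window -> ignored
    ],
)

# Same shape as the suggested get_error_count(): only count errors for samples
# that were issued before the stop timestamp.
count = cur.execute(
    """
    SELECT COUNT(*) FROM events
    WHERE event_type = 'ERROR'
      AND sample_uuid IN (
          SELECT DISTINCT sample_uuid FROM events
          WHERE event_type = 'LOADGEN_ISSUE_CALLED' AND timestamp_ns < ?
      )
    """,
    (stop_ts,),
).fetchone()[0]
assert count == 1  # only the in-window failure is reported
```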
Pull request overview
This PR addresses #144 by ensuring benchmark/report outputs include a non-zero failed/error count when request errors occur, rather than inferring failures from total - successful.
Changes:
- Add n_samples_failed to the metrics Report model and include it in JSON + display output.
- Populate n_samples_failed from recorded SessionEvent.ERROR events when creating reports.
- Use report.n_samples_failed for the failed field in results.json, and update unit tests accordingly.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/unit/metrics/test_reporter.py | Extends report and JSON serialization assertions to include n_samples_failed. |
| src/inference_endpoint/metrics/reporter.py | Adds n_samples_failed to Report, prints it in display(), and sets it during create_report(). |
| src/inference_endpoint/commands/benchmark.py | Writes failed to results.json using report.n_samples_failed instead of total - successful. |
     test_started_at=test_started_at,
     n_samples_issued=sample_statuses["total_sent"],
     n_samples_completed=sample_statuses["completed"],
+    n_samples_failed=self.get_error_count(),
n_samples_failed is derived from get_error_count(), but get_error_count() counts all ERROR events across the entire DB and does not apply the same STOP_PERFORMANCE_TRACKING / “performance window” filtering used by get_sample_statuses(), TTFT, and latency. This can make n_samples_failed inconsistent with n_samples_issued/n_samples_completed (which are window-filtered). Consider updating the error-count query to apply the same window filter (and ideally count distinct sample UUIDs if this is meant to be “samples failed”).
Suggested change:
-    n_samples_failed=self.get_error_count(),
+    n_samples_failed=sample_statuses["failed"],
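To make the distinct-count point concrete, here is a minimal sketch (same assumed ad-hoc schema as above, not the project's fixtures) showing how COUNT(*) over-counts a sample that logs more than one ERROR event, while COUNT(DISTINCT sample_uuid) reports samples failed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (event_type TEXT, sample_uuid TEXT, timestamp_ns INTEGER)")
cur.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("ERROR", "s1", 100), ("ERROR", "s1", 150), ("ERROR", "s2", 200)],  # s1 errors twice
)

error_events = cur.execute(
    "SELECT COUNT(*) FROM events WHERE event_type = 'ERROR'"
).fetchone()[0]
failed_samples = cur.execute(
    "SELECT COUNT(DISTINCT sample_uuid) FROM events WHERE event_type = 'ERROR'"
).fetchone()[0]
assert (error_events, failed_samples) == (3, 2)  # error-event count != samples failed
```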
| "total": total, | ||
| "successful": success_count, | ||
| "failed": total - success_count, | ||
| "failed": report.n_samples_failed, |
Switching failed to report.n_samples_failed changes the implied invariant of the JSON output (typically total == successful + failed). As implemented, successful is still report.n_samples_completed while failed is an ERROR-event count, so successful + failed can exceed total, and failed may be outside the same STOP_PERFORMANCE_TRACKING window used for total/successful. Consider aligning the semantics (e.g., compute successful as “completed without any ERROR” and failed as distinct sample UUIDs with ERROR within the same window, or rename fields in results.json if you intend to expose an error-event count).
| "failed": report.n_samples_failed, | |
| "error_events": report.n_samples_failed, |
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
     n_samples_issued: int
     n_samples_completed: int
+    n_samples_failed: int
     duration_ns: int
-    COUNT(*) AS error_count
+    COUNT(DISTINCT sample_uuid) AS error_count
     FROM events
     WHERE event_type = '{SessionEvent.ERROR.value}'
| "results": { | ||
| "total": total, | ||
| "successful": success_count, | ||
| "failed": total - success_count, | ||
| "failed": report.n_samples_failed, | ||
| "elapsed_time": elapsed_time, | ||
| "qps": estimated_qps, | ||
| }, |
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
     stop_ts = self.stop_performance_tracking_timestamp_ns

     where_clause = ""
     if stop_ts != float("inf"):
         where_clause = f"""
             AND sample_uuid IN (
                 SELECT DISTINCT sample_uuid FROM events
                 WHERE event_type = '{SessionEvent.LOADGEN_ISSUE_CALLED.value}'
                 AND timestamp_ns < {stop_ts}
             )
         """

     return self.cur_.execute(f"""
         SELECT
-            COUNT(*) AS error_count
+            COUNT(DISTINCT sample_uuid) AS error_count
         FROM events
         WHERE event_type = '{SessionEvent.ERROR.value}'
         {where_clause}
     """).fetchone()[0]
| "results": { | ||
| "total": total, | ||
| "successful": success_count, | ||
| "failed": total - success_count, | ||
| "failed": report.n_samples_failed, | ||
| "elapsed_time": elapsed_time, | ||
| "qps": estimated_qps, |
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
     )
     total = report.n_samples_issued
-    success_count = report.n_samples_completed
+    success_count = total - report.n_samples_failed
     n_samples_completed: int
+    n_samples_failed: int
     duration_ns: int
What does this PR do?
Addresses #144
Type of change
Related issues
Testing
Checklist