
fix: Fix failed request count in report #169

Merged
arekay-nv merged 5 commits into main from arekay/fix_failed_count_report on Mar 16, 2026

Conversation

@arekay-nv
Collaborator

What does this PR do?

Addresses #144

Type of change

  • [x] Bug fix
  • [ ] New feature
  • [ ] Documentation update
  • [ ] Refactor/cleanup

Related issues

Testing

  • [x] Tests added/updated
  • [x] All tests pass locally
  • [x] Manual testing completed

Checklist

  • [x] Code follows project style
  • [x] Pre-commit hooks pass
  • [x] Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv arekay-nv requested a review from a team as a code owner March 12, 2026 19:31
Copilot AI review requested due to automatic review settings March 12, 2026 19:31
@github-actions

github-actions bot commented Mar 12, 2026

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions bot requested a review from nvzhihanj March 12, 2026 19:31
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an inaccuracy in reporting failed requests within the benchmark results. It introduces a dedicated mechanism to track and report the exact number of failed samples, enhancing the reliability and detail of performance metrics. The changes involve modifying the report structure, updating its generation and display, and adding comprehensive unit tests to validate the fix.

Highlights

  • Accurate Failed Request Count: The calculation for failed requests in the benchmark report has been corrected to use a dedicated n_samples_failed field from the report, rather than deriving it from total and successful counts.
  • Report Class Enhancement: A new field, n_samples_failed, was added to the Report dataclass to explicitly track the number of failed samples, improving data integrity and clarity.
  • Report Generation and Display: The report generation logic was updated to populate the new n_samples_failed field using get_error_count(), and the report display function was modified to include this new metric.
  • Unit Test Coverage: New unit tests were added and existing ones updated to ensure the correct calculation, population, and JSON serialization of the n_samples_failed metric.


Changelog
  • src/inference_endpoint/commands/benchmark.py
    • Updated the calculation of failed requests in the benchmark results dictionary to use the new report.n_samples_failed.
  • src/inference_endpoint/metrics/reporter.py
    • Added n_samples_failed: int to the Report dataclass.
    • Included n_samples_failed in the display method output.
    • Populated n_samples_failed using self.get_error_count() in the create_report method.
  • tests/unit/metrics/test_reporter.py
    • Added an assertion for report.n_samples_failed in test_reporter_create_report.
    • Included n_samples_failed in the expected_keys for JSON serialization in test_reporter_json.
    • Added an assertion for json_dict["n_samples_failed"] in test_reporter_json.
Activity
  • The pull request addresses issue Failed responses are not properly tracked in results.json #144.
  • The author marked this as a bug fix.
  • The author indicated that tests were added/updated, all tests pass locally, and manual testing was completed.
  • The author confirmed that the code follows project style, pre-commit hooks pass, and documentation was updated (if needed).


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly identifies that the previous method for calculating failed requests (total - successful) was inaccurate and introduces a new n_samples_failed field to the report for a more precise count. The changes to add this field, populate it, and display it in the summary are well-implemented. However, I've identified a significant issue in how the failed count is calculated: it doesn't respect the performance window defined by the STOP_PERFORMANCE_TRACKING event. This inconsistency with other metrics can lead to incorrect reports. I've provided a detailed comment with a suggested fix for this issue.

test_started_at=test_started_at,
n_samples_issued=sample_statuses["total_sent"],
n_samples_completed=sample_statuses["completed"],
n_samples_failed=self.get_error_count(),


Severity: high

The get_error_count() method counts all ERROR events in the database, without filtering for the performance window defined by STOP_PERFORMANCE_TRACKING. This is inconsistent with how n_samples_issued and n_samples_completed are calculated (they respect the performance window via get_sample_statuses).

This can lead to incorrect metrics. For example, if a sample is issued and fails after the performance window closes, it will be counted in n_samples_failed but not in n_samples_issued, which could lead to confusing results like n_samples_failed > n_samples_issued.

To ensure consistency, get_error_count() should be modified to only count errors for samples that were issued within the performance window. A similar filtering logic is already used in get_sample_statuses() and derive_TTFT().

Here is a suggested implementation for get_error_count:

def get_error_count(self) -> int:
    stop_ts = self.stop_performance_tracking_timestamp_ns

    where_clause = ""
    if stop_ts != float("inf"):
        # Filter for errors associated with samples issued before the stop timestamp.
        where_clause = f"""
        AND sample_uuid IN (
            SELECT DISTINCT sample_uuid FROM events
            WHERE event_type = '{SessionEvent.LOADGEN_ISSUE_CALLED.value}'
            AND timestamp_ns < {stop_ts}
        )
        """

    query = f"""
        SELECT COUNT(*) AS error_count
        FROM events
        WHERE event_type = '{SessionEvent.ERROR.value}'
        {where_clause}
    """
    result = self.cur_.execute(query).fetchone()
    return result[0] if result else 0

Additionally, a test case that includes a STOP_PERFORMANCE_TRACKING event and errors both inside and outside the performance window should be added to verify the fix.
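The suggested window filter can be exercised end to end against a throwaway in-memory SQLite database. The sketch below uses a simplified events schema and placeholder event-type strings ('issue', 'error') rather than the project's real SessionEvent values; it only demonstrates that the subquery restricts the error count to samples issued before the stop timestamp.

```python
import sqlite3

# Simplified stand-in for the project's events table; the event-type
# strings 'issue' and 'error' are placeholders, not real SessionEvent values.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute(
    "CREATE TABLE events (sample_uuid TEXT, event_type TEXT, timestamp_ns INTEGER)"
)
cur.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("s1", "issue", 100), ("s1", "error", 150),  # fails inside the window
        ("s2", "issue", 900), ("s2", "error", 950),  # issued after the window closed
    ],
)

def error_count(stop_ts=None):
    # Mirrors the suggested fix: only count errors whose sample was
    # issued before the STOP_PERFORMANCE_TRACKING timestamp.
    where_clause = ""
    if stop_ts is not None:
        where_clause = f"""
            AND sample_uuid IN (
                SELECT DISTINCT sample_uuid FROM events
                WHERE event_type = 'issue' AND timestamp_ns < {stop_ts}
            )
        """
    query = f"""
        SELECT COUNT(*) FROM events
        WHERE event_type = 'error' {where_clause}
    """
    return cur.execute(query).fetchone()[0]

print(error_count())             # unfiltered: 2 errors
print(error_count(stop_ts=500))  # window-filtered: 1 error
```

With the window applied, s2's late error no longer inflates the failed count, keeping it consistent with n_samples_issued.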


Copilot AI left a comment


Pull request overview

This PR addresses #144 by ensuring benchmark/report outputs include a non-zero failed/error count when request errors occur, rather than inferring failures from total - successful.

Changes:

  • Add n_samples_failed to the metrics Report model and include it in JSON + display output.
  • Populate n_samples_failed from recorded SessionEvent.ERROR events when creating reports.
  • Use report.n_samples_failed for the failed field in results.json, and update unit tests accordingly.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/unit/metrics/test_reporter.py Extends report and JSON serialization assertions to include n_samples_failed.
src/inference_endpoint/metrics/reporter.py Adds n_samples_failed to Report, prints it in display(), and sets it during create_report().
src/inference_endpoint/commands/benchmark.py Writes failed to results.json using report.n_samples_failed instead of total - successful.


test_started_at=test_started_at,
n_samples_issued=sample_statuses["total_sent"],
n_samples_completed=sample_statuses["completed"],
n_samples_failed=self.get_error_count(),

Copilot AI Mar 12, 2026


n_samples_failed is derived from get_error_count(), but get_error_count() counts all ERROR events across the entire DB and does not apply the same STOP_PERFORMANCE_TRACKING / “performance window” filtering used by get_sample_statuses(), TTFT, and latency. This can make n_samples_failed inconsistent with n_samples_issued/n_samples_completed (which are window-filtered). Consider updating the error-count query to apply the same window filter (and ideally count distinct sample UUIDs if this is meant to be “samples failed”).

Suggested change:
- n_samples_failed=self.get_error_count(),
+ n_samples_failed=sample_statuses["failed"],

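The distinction the reviewer draws between counting error events and counting failed samples is easy to see with a minimal in-memory SQLite example (again a placeholder schema, not the project's actual one): a sample that logs two ERROR events is one failed sample but two error events.

```python
import sqlite3

# Placeholder schema: one row per logged event.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE events (sample_uuid TEXT, event_type TEXT)")
cur.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("s1", "error"), ("s1", "error"), ("s2", "error")],  # s1 errors twice
)

error_events = cur.execute(
    "SELECT COUNT(*) FROM events WHERE event_type = 'error'"
).fetchone()[0]
failed_samples = cur.execute(
    "SELECT COUNT(DISTINCT sample_uuid) FROM events WHERE event_type = 'error'"
).fetchone()[0]

print(error_events)    # 3 -- raw ERROR rows
print(failed_samples)  # 2 -- distinct samples that failed
```

If the field is meant to be "samples failed", COUNT(DISTINCT sample_uuid) is the count that matches the name.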
  "total": total,
  "successful": success_count,
- "failed": total - success_count,
+ "failed": report.n_samples_failed,
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching failed to report.n_samples_failed changes the implied invariant of the JSON output (typically total == successful + failed). As implemented, successful is still report.n_samples_completed while failed is an ERROR-event count, so successful + failed can exceed total, and failed may be outside the same STOP_PERFORMANCE_TRACKING window used for total/successful. Consider aligning the semantics (e.g., compute successful as “completed without any ERROR” and failed as distinct sample UUIDs with ERROR within the same window, or rename fields in results.json if you intend to expose an error-event count).

Suggested change:
- "failed": report.n_samples_failed,
+ "error_events": report.n_samples_failed,

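The invariant the reviewer is pointing at can be checked with plain arithmetic. The numbers below are hypothetical, chosen only to show how an unwindowed error-event count breaks total == successful + failed, and how deriving successful from the failed count (as the final revision of this PR does) restores it.

```python
# Hypothetical counts: 10 samples issued, 9 completed, but 3 ERROR events
# were logged (one sample erroring twice plus one late error outside the window).
total = 10
successful = 9          # report.n_samples_completed
error_events = 3        # unwindowed error-event count

# Broken invariant: successful + "failed" exceeds total.
assert successful + error_events > total

# Consistent accounting: count distinct failed samples inside the window
# and derive successful from it.
failed_samples = 1
successful = total - failed_samples
assert successful + failed_samples == total
print(total, successful, failed_samples)  # 10 9 1
```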
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 13, 2026 20:59

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



src/inference_endpoint/metrics/reporter.py (Report dataclass):
  n_samples_issued: int
  n_samples_completed: int
+ n_samples_failed: int
  duration_ns: int

Suggested change:
- COUNT(*) AS error_count
+ COUNT(DISTINCT sample_uuid) AS error_count
  FROM events
  WHERE event_type = '{SessionEvent.ERROR.value}'
Comment on lines 697 to 703
  "results": {
      "total": total,
      "successful": success_count,
-     "failed": total - success_count,
+     "failed": report.n_samples_failed,
      "elapsed_time": elapsed_time,
      "qps": estimated_qps,
  },

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



Comment on lines +946 to 964
stop_ts = self.stop_performance_tracking_timestamp_ns

where_clause = ""
if stop_ts != float("inf"):
    where_clause = f"""
    AND sample_uuid IN (
        SELECT DISTINCT sample_uuid FROM events
        WHERE event_type = '{SessionEvent.LOADGEN_ISSUE_CALLED.value}'
        AND timestamp_ns < {stop_ts}
    )
    """

return self.cur_.execute(f"""
    SELECT
-       COUNT(*) AS error_count
+       COUNT(DISTINCT sample_uuid) AS error_count
    FROM events
    WHERE event_type = '{SessionEvent.ERROR.value}'
    {where_clause}
""").fetchone()[0]
Comment on lines 697 to 702
  "results": {
      "total": total,
      "successful": success_count,
-     "failed": total - success_count,
+     "failed": report.n_samples_failed,
      "elapsed_time": elapsed_time,
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 16, 2026 22:01

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



  )
  total = report.n_samples_issued
- success_count = report.n_samples_completed
+ success_count = total - report.n_samples_failed
  n_samples_completed: int
- duration_ns: int
+ n_samples_failed: int
+ duration_ns: int | None
@arekay-nv arekay-nv merged commit 721ea6d into main Mar 16, 2026
8 checks passed
@arekay-nv arekay-nv deleted the arekay/fix_failed_count_report branch March 16, 2026 22:06
@github-actions github-actions bot locked and limited conversation to collaborators Mar 16, 2026