test: refactor tests for readability and maintainability by nv-alicheng · Pull Request #189 · mlcommons/endpoints

nv-alicheng · 2026-03-18T00:56:34Z

Convert repetitive test methods to parametrized tests
Split test_reporter.py into focused modules: test_reporter_report.py, test_reporter_rollup.py, test_reporter_stop_perf.py
Consolidate duplicate test patterns across test suites
Improve test organization in transforms, extractor, types, record tests

What does this PR do?

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

- Move record.py from async_utils/transport to core/
- Add PromptData, TextModelOutput, ErrorData types with msgspec Struct
- Deprecate str as response_output type in favor of TextModelOutput
- Add msgspec struct performance flags (gc=False, array_like=True)
- Fix threading safety issues in http_client, sample handler, recorder
- Update all imports across the codebase
- Bump default worker init timeout to 60s

github-actions · 2026-03-18T00:56:43Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist · 2026-03-18T00:57:19Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the existing test suite to enhance readability, maintainability, and efficiency. By converting repetitive tests into parametrized formats and reorganizing test files, the changes aim to streamline future test development and make it easier to understand the purpose and behavior of various components. The overall impact is a cleaner, more robust testing infrastructure.

Highlights

Test Refactoring: Converted numerous repetitive test methods across various modules into more concise and maintainable parametrized tests using pytest.mark.parametrize.
Test File Organization: Split the monolithic test_reporter.py into three focused modules: test_reporter_report.py, test_reporter_rollup.py, and test_reporter_stop_perf.py to improve clarity and manageability.
Test Clarity and Brevity: Renamed many test methods to be more concise and descriptive, enhancing readability without losing context.
Mocking External Dependencies: Introduced a FakeTokenizer and an _mock_tokenizer fixture in test_token_metrics.py to prevent network calls during tests, making them faster and more reliable.
Consolidated Test Patterns: Unified common test patterns, such as error handling, serialization round-trips, and column remapping, into single parametrized tests.
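
To illustrate the parametrization pattern described in the highlights, here is a minimal sketch. It is not code from this PR: the `normalize` helper and its cases are hypothetical and only stand in for whatever function the repetitive test methods exercised.

```python
import pytest


def normalize(text: str) -> str:
    """Hypothetical function under test: collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()


# Before: one near-identical test method per input.
# After: a single test body driven by a table of cases.
@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("Hello World", "hello world"),
        ("  padded  ", "padded"),
        ("multi   space", "multi space"),
        ("", ""),
    ],
    ids=["mixed_case", "padded", "inner_spaces", "empty"],
)
def test_normalize(raw: str, expected: str) -> None:
    assert normalize(raw) == expected
```

The `ids` argument keeps failure output readable, which offsets some of the context lost when descriptive method names are folded into one test.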

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request is a significant and valuable refactoring of the test suite. The changes greatly improve the readability and maintainability of the tests by converting many repetitive test methods into concise, parametrized tests. Splitting test_reporter.py into more focused modules is also a good structural improvement.

I've found a couple of minor issues:

A small bug in an f-string in an error message within test_random_data_loader.
A removed assertion in test_preserves_original_columns that I believe should be restored to ensure the test is fully robust.

Overall, this is an excellent cleanup of the test code. My comments are minor suggestions for improvement.

tests/unit/dataset_manager/test_data_loader.py

gemini-code-assist · 2026-03-18T00:59:34Z

tests/unit/dataset_manager/test_transforms.py

@@ -218,7 +238,6 @@ def test_preserves_original_columns(self):
        assert "answer" in result.columns
        assert "prompt" in result.columns
        assert list(result["question"]) == ["Q1", "Q2"]


This test is named test_preserves_original_columns. While it checks for the presence of the answer column, it no longer verifies that its content is preserved. It would be good to add back the assertion that checks the content of the answer column to make the test more robust and aligned with its name.

Suggested change

-        assert list(result["question"]) == ["Q1", "Q2"]
+        assert list(result["question"]) == ["Q1", "Q2"]
+        assert list(result["answer"]) == ["A1", "A2"]

- New EventLoggerService subscribing to ZMQ events
- JSONLWriter for file-based event logging
- SQLWriter via SQLAlchemy for DB compatibility
- Writer base class with async context manager protocol
- ZMQ context: add socket_dir parameter for configurable IPC paths
- Add sqlalchemy dependency

- MetricsAggregator: subscribes to ZMQ events, computes QPS/latency/TTFT/TPOT
- MetricsTable: columnar storage for efficient percentile calculations
- TokenMetrics: ISL/OSL token-level metrics from PromptData
- MetricsEmitter: periodic metric publishing
- Service entry point with CLI interface
- Comprehensive unit and e2e tests

nvzhihanj · 2026-03-19T05:25:39Z

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough

Found 15 issues across 12 files.

Must Fix (high)

Issues that weaken test coverage or will cause incorrect behavior.

#	File	Line	Category	Reviewer	Summary
1	`tests/unit/async_utils/services/metrics_aggregator/test_token_metrics.py`	41	testing	Codex	The autouse patch returns the same `FakeTokenizer` object for every call, so all `TokenizePool` workers share one tokeni...
2	`tests/unit/dataset_manager/test_transforms.py`	210	testing	Codex	This refactor now only checks `result[output_col][0]`, and the multi-variable formatter case was reduced to a single-row...
3	`tests/unit/metrics/test_reporter_report.py`	35	testing	Codex	The split-out report test no longer checks `report.n_samples_failed`, even though this fixture contains one distinct fai...

Should Fix (medium)

Real issues under specific conditions, or gaps that will compound.

#	File	Line	Category	Reviewer	Summary
4	`src/inference_endpoint/openai/accumulator.py`	91	bug	Claude	When `output_chunks` is empty, `TextModelOutput(output=[], reasoning=None)` is created. `__post_init__` converts to `out...
5	`src/inference_endpoint/core/types.py`	250	api-contract	Claude	`QueryResult.response_output` is typed as `OUTPUT_TYPE
6	`tests/unit/async_utils/services/metrics_aggregator/test_token_metrics.py`	63	testing	Claude	The parametrized `test_error_cases` with `action == "close_twice"` has no assertion when `error_type is None`. The test ...
7	`tests/unit/async_utils/services/metrics_aggregator/test_aggregator_e2e.py`	152	testing	Claude	`TestAggregatorE2E` is marked `@pytest.mark.integration` but is located in `tests/unit/`. This class creates real ZMQ co...
8	`tests/unit/core/test_record.py`	34	testing	Claude	Test classes `TestEventType`, `TestEventRecordConstruction`, `TestEncodeEventRecord`, and `TestEventRecordRoundTrip` are...
9	`tests/unit/core/test_types.py`	36	testing	Claude	Test classes `TestErrorData`, `TestQuerySerialization`, `TestQueryResultSerialization`, `TestStreamChunkSerialization`, ...

Consider (low)

Valid improvements — could be follow-ups.

#	File	Line	Category	Reviewer	Summary
10	`src/inference_endpoint/utils/__init__.py`	114	concurrency	Claude	`_instance_lock` is a class variable on `SingletonMixin`, meaning all subclasses share one lock. Not a bug (each checks ...
11	`src/inference_endpoint/async_utils/transport/zmq/context.py`	58	design	Claude	The `socket_dir` docstring says "This directory will be cleaned up when the context is cleaned up." This is only true fo...
12	`src/inference_endpoint/async_utils/transport/zmq/context.py`	87	design	Claude	`self.socket_dir: str
13	`src/inference_endpoint/endpoint_client/worker.py`	521	design	Claude	When `error` is a plain string, `ErrorData` is created with `error_type="error"` — generic and non-descriptive. Consider...
14	`tests/unit/async_utils/services/event_logger/test_event_logger.py`	329	testing	Claude	`test_creates_log_dir_if_missing` and `test_not_a_directory_error` don't test production code — they replicate logic man...
15	`tests/unit/async_utils/services/metrics_aggregator/test_aggregator_e2e.py`	193	testing	Claude	`_publish_and_sleep` uses hardcoded `time.sleep(0.05)` between publishes and `time.sleep(0.5)` for ZMQ slow-joiner — fla...
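
For item 15, one common alternative to fixed `time.sleep` calls is to poll for the expected condition with a deadline. The sketch below is an assumption about how that could look, not code from this repository: `wait_until` is a hypothetical helper, and it only addresses waiting for results. It does not by itself solve the ZMQ slow-joiner problem, which usually needs an explicit readiness handshake.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 2.0,
    interval: float = 0.01,
) -> bool:
    """Poll predicate until it returns True or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check at the deadline
```

A test would then assert `wait_until(lambda: len(received) == expected_count)` instead of sleeping for a fixed interval, so fast machines finish early and slow ones get the full timeout.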

Note: Most issues point to source files in the base branch (feat/alicheng-metrics-aggregator), not files changed in this PR. Inline comments could not be posted on those files.

nvzhihanj

test

nvzhihanj · 2026-03-19T05:26:09Z

tests/unit/async_utils/services/metrics_aggregator/test_token_metrics.py

+        return list(range(len(text.split())))
+
+
+@pytest.fixture(autouse=True)


[Codex] high (testing): The autouse patch returns the same FakeTokenizer object for every call, so all TokenizePool workers share one tokenizer instance. If _get_thread_tokenizer() or its thread-local caching regresses to cross-thread reuse, test_multiple_workers and test_concurrent_thread_safe will still pass even though production code is supposed to keep one tokenizer per worker thread.
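
A minimal sketch of the concern above, using only the standard library. The harness `tokenizers_seen_by_workers`, the helper `distinct_instances`, and both loaders are hypothetical; only the `FakeTokenizer` name and its word-count behavior come from the diff. The point is that a fake loader built with `return_value` cannot reveal cross-thread reuse, while one built with `side_effect` can.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from unittest import mock


class FakeTokenizer:
    """Offline stand-in: one token id per whitespace-separated word."""

    def encode(self, text: str) -> list[int]:
        return list(range(len(text.split())))


def tokenizers_seen_by_workers(loader, n_workers: int) -> list[FakeTokenizer]:
    """Hypothetical harness: each worker thread caches loader("fake") thread-locally."""
    local = threading.local()
    barrier = threading.Barrier(n_workers)

    def work(_: int) -> FakeTokenizer:
        if not hasattr(local, "tokenizer"):
            local.tokenizer = loader("fake")
        barrier.wait(timeout=5)  # keep all workers alive at the same time
        return local.tokenizer

    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        return list(executor.map(work, range(n_workers)))


def distinct_instances(tokenizers: list[FakeTokenizer]) -> int:
    """Count unique objects while the list still holds references to them."""
    return len({id(tok) for tok in tokenizers})


# return_value hands every caller the same object, so a per-thread check is vacuous.
shared_loader = mock.Mock(return_value=FakeTokenizer())
# side_effect builds a new object per call, so cross-thread reuse becomes detectable.
fresh_loader = mock.Mock(side_effect=lambda name: FakeTokenizer())
```

With `fresh_loader`, a test can assert that four workers observe four distinct tokenizer objects. With `shared_loader`, the same assertion could never fail for the right reason.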

nvzhihanj

test

nvzhihanj · 2026-03-19T05:26:10Z

tests/unit/async_utils/services/metrics_aggregator/test_token_metrics.py

    def test_token_count_returns_int(self):
-        with TokenizePool("gpt2", n_workers=1) as pool:
+        with TokenizePool("fake", n_workers=1) as pool:
            count = pool.token_count("Hello world")


[Claude] medium (testing): The parametrized test_error_cases with action == "close_twice" has no assertion when error_type is None. The test simply calls close() twice and returns without verifying behavior. Add an explicit assertion (e.g., assert pool._executor is None) to clarify test intent.
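
The suggestion above could look roughly like the following. `WorkerPool` is a hypothetical stand-in written for this sketch; it only assumes, as the comment does, that the real pool keeps its executor in `_executor` and clears it on `close()`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class WorkerPool:
    """Hypothetical stand-in for a pool whose close() must be idempotent."""

    def __init__(self, n_workers: int) -> None:
        self._executor: Optional[ThreadPoolExecutor] = ThreadPoolExecutor(
            max_workers=n_workers
        )

    def token_count(self, text: str) -> int:
        if self._executor is None:
            raise RuntimeError("pool is closed")
        return self._executor.submit(lambda: len(text.split())).result()

    def close(self) -> None:
        if self._executor is not None:
            self._executor.shutdown(wait=True)
            self._executor = None


def test_close_twice_is_idempotent() -> None:
    pool = WorkerPool(n_workers=1)
    assert pool.token_count("Hello world") == 2
    pool.close()
    pool.close()  # the second call must be a no-op, not an error
    assert pool._executor is None  # explicit post-condition instead of a silent return
```

Splitting this case out of a parametrized error test also avoids a branch where `error_type is None` silently skips every assertion.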

nvzhihanj

test

nvzhihanj · 2026-03-19T05:26:11Z

tests/unit/core/test_record.py

@@ -32,33 +32,37 @@


 class TestEventType:


[Claude] medium (testing): Test classes TestEventType, TestEventRecordConstruction, TestEncodeEventRecord, and TestEventRecordRoundTrip are missing @pytest.mark.unit markers. Per project testing standards, every test needs a marker.

nvzhihanj

test

nvzhihanj · 2026-03-19T05:26:12Z

tests/unit/core/test_types.py

@@ -36,150 +36,104 @@
 class TestErrorData:


[Claude] medium (testing): Test classes TestErrorData, TestQuerySerialization, TestQueryResultSerialization, TestStreamChunkSerialization, TestQueryResultWorkerPatterns, and TestMixedTypeSerialization are missing @pytest.mark.unit markers.
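
For this comment and the matching one on test_record.py, a class-level marker is usually enough, since pytest applies it to every test method in the class. The sketch below assumes a `unit` marker is registered in the project's pytest configuration. The `ErrorData` dataclass here is a hypothetical stand-in, not the msgspec Struct from this PR.

```python
from dataclasses import asdict, dataclass

import pytest


@dataclass
class ErrorData:
    """Hypothetical stand-in for the real type; fields are assumptions."""

    error_type: str
    error_message: str


@pytest.mark.unit  # applies to every test method in the class
class TestErrorData:
    def test_fields_are_preserved(self) -> None:
        error = ErrorData(error_type="timeout", error_message="took too long")
        assert asdict(error) == {
            "error_type": "timeout",
            "error_message": "took too long",
        }
```

When every test in a file shares the marker, a module-level `pytestmark = pytest.mark.unit` assignment achieves the same result with a single line.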

arekay-nv · 2026-03-24T18:56:05Z

Closing for now. Will followup later.

nv-alicheng added 2 commits March 17, 2026 17:28

Revert hallucinated fix

4a0d50b

nv-alicheng requested a review from a team as a code owner March 18, 2026 00:56

github-actions bot requested review from arekay-nv and nvzhihanj March 18, 2026 00:56

gemini-code-assist bot reviewed Mar 18, 2026

nv-alicheng and others added 5 commits March 17, 2026 20:41

fix: remove stale comments about deprecated str output path

781a146

The SGLang adapter now returns TextModelOutput, not raw str.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: correct f-string upper bound in test_data_loader assertion

bc244a8

The error message said `* 0.2` but the assertion checks `* 1.2`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
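
One way to keep an assertion and its message from drifting apart like this is to compute the bound once and reuse it in both places. This is a minimal sketch under that assumption. `check_upper_bound` and its parameters are hypothetical and do not reproduce the actual test in test_data_loader.py.

```python
def check_upper_bound(observed: int, expected: int, tolerance: float = 0.2) -> None:
    """Assert observed <= expected * (1 + tolerance) with a message built from the same value."""
    upper = expected * (1 + tolerance)
    assert observed <= upper, (
        f"observed {observed} exceeds upper bound {upper:.1f} "
        f"(expected {expected} * {1 + tolerance})"
    )
```

Because the message interpolates `upper` rather than repeating the arithmetic as text, changing the tolerance updates the check and the diagnostic together.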

nv-alicheng force-pushed the feat/alicheng-test-cleanup branch from b723356 to bc244a8 Compare March 18, 2026 03:43

nv-alicheng force-pushed the feat/alicheng-metrics-aggregator branch from f2ec2f8 to 14fdfdc Compare March 18, 2026 03:43

nv-alicheng mentioned this pull request Mar 18, 2026

feat: add EventLoggerService with JSONL and SQL writers #187

Merged

10 tasks

nvzhihanj reviewed Mar 19, 2026

nv-alicheng force-pushed the feat/alicheng-metrics-aggregator branch from 14fdfdc to 43a3769 Compare March 20, 2026 00:34

nv-alicheng mentioned this pull request Mar 23, 2026

Move Record to be a core type, add SQLWriter #161

Closed

10 tasks

nv-alicheng force-pushed the feat/alicheng-metrics-aggregator branch from 43a3769 to 1ce35f4 Compare March 24, 2026 01:20

arekay-nv closed this Mar 24, 2026

github-actions bot locked and limited conversation to collaborators Mar 24, 2026
