test: refactor tests for readability and maintainability#189
test: refactor tests for readability and maintainability#189nv-alicheng wants to merge 7 commits intofeat/alicheng-metrics-aggregatorfrom
Conversation
- Move record.py from async_utils/transport to core/ - Add PromptData, TextModelOutput, ErrorData types with msgspec Struct - Deprecate str as response_output type in favor of TextModelOutput - Add msgspec struct performance flags (gc=False, array_like=True) - Fix threading safety issues in http_client, sample handler, recorder - Update all imports across the codebase - Bump default worker init timeout to 60s
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the existing test suite to enhance readability, maintainability, and efficiency. By converting repetitive tests into parametrized formats and reorganizing test files, the changes aim to streamline future test development and make it easier to understand the purpose and behavior of various components. The overall impact is a cleaner, more robust testing infrastructure. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request is a significant and valuable refactoring of the test suite. The changes greatly improve the readability and maintainability of the tests by converting many repetitive test methods into concise, parametrized tests. Splitting test_reporter.py into more focused modules is also a good structural improvement.
I've found a couple of minor issues:
- A small bug in an f-string in an error message within
test_random_data_loader. - A removed assertion in
test_preserves_original_columnsthat I believe should be restored to ensure the test is fully robust.
Overall, this is an excellent cleanup of the test code. My comments are minor suggestions for improvement.
| @@ -218,7 +238,6 @@ def test_preserves_original_columns(self): | |||
| assert "answer" in result.columns | |||
| assert "prompt" in result.columns | |||
| assert list(result["question"]) == ["Q1", "Q2"] | |||
There was a problem hiding this comment.
This test is named test_preserves_original_columns. While it checks for the presence of the answer column, it no longer verifies that its content is preserved. It would be good to add back the assertion that checks the content of the answer column to make the test more robust and aligned with its name.
| assert list(result["question"]) == ["Q1", "Q2"] | |
| assert list(result["question"]) == ["Q1", "Q2"] | |
| assert list(result["answer"]) == ["A1", "A2"] |
The SGLang adapter now returns TextModelOutput, not raw str. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New EventLoggerService subscribing to ZMQ events - JSONLWriter for file-based event logging - SQLWriter via SQLAlchemy for DB compatibility - Writer base class with async context manager protocol - ZMQ context: add socket_dir parameter for configurable IPC paths - Add sqlalchemy dependency
- MetricsAggregator: subscribes to ZMQ events, computes QPS/latency/TTFT/TPOT - MetricsTable: columnar storage for efficient percentile calculations - TokenMetrics: ISL/OSL token-level metrics from PromptData - MetricsEmitter: periodic metric publishing - Service entry point with CLI interface - Comprehensive unit and e2e tests
- Convert repetitive test methods to parametrized tests - Split test_reporter.py into focused modules: test_reporter_report.py, test_reporter_rollup.py, test_reporter_stop_perf.py - Consolidate duplicate test patterns across test suites - Improve test organization in transforms, extractor, types, record tests
The error message said `* 0.2` but the assertion checks `* 1.2`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b723356 to
bc244a8
Compare
f2ec2f8 to
14fdfdc
Compare
Review Council — Multi-AI Code ReviewReviewed by: Codex + Claude | Depth: thorough Found 15 issues across 12 files. Must Fix (high)Issues that weaken test coverage or will cause incorrect behavior.
Should Fix (medium)Real issues under specific conditions, or gaps that will compound.
Consider (low)Valid improvements — could be follow-ups.
Note: Most issues point to source files in the base branch ( |
| return list(range(len(text.split()))) | ||
|
|
||
|
|
||
| @pytest.fixture(autouse=True) |
There was a problem hiding this comment.
[Codex] high (testing): The autouse patch returns the same FakeTokenizer object for every call, so all TokenizePool workers share one tokenizer instance. If _get_thread_tokenizer() or its thread-local caching regresses to cross-thread reuse, test_multiple_workers and test_concurrent_thread_safe will still pass even though production code is supposed to keep one tokenizer per worker thread.
| def test_token_count_returns_int(self): | ||
| with TokenizePool("gpt2", n_workers=1) as pool: | ||
| with TokenizePool("fake", n_workers=1) as pool: | ||
| count = pool.token_count("Hello world") |
There was a problem hiding this comment.
[Claude] medium (testing): The parametrized test_error_cases with action == "close_twice" has no assertion when error_type is None. The test simply calls close() twice and returns without verifying behavior. Add an explicit assertion (e.g., assert pool._executor is None) to clarify test intent.
| @@ -32,33 +32,37 @@ | |||
|
|
|||
|
|
|||
| class TestEventType: | |||
There was a problem hiding this comment.
[Claude] medium (testing): Test classes TestEventType, TestEventRecordConstruction, TestEncodeEventRecord, and TestEventRecordRoundTrip are missing @pytest.mark.unit markers. Per project testing standards, every test needs a marker.
| @@ -36,150 +36,104 @@ | |||
| class TestErrorData: | |||
There was a problem hiding this comment.
[Claude] medium (testing): Test classes TestErrorData, TestQuerySerialization, TestQueryResultSerialization, TestStreamChunkSerialization, TestQueryResultWorkerPatterns, and TestMixedTypeSerialization are missing @pytest.mark.unit markers.
14fdfdc to
43a3769
Compare
43a3769 to
1ce35f4
Compare
|
Closing for now. Will followup later. |
What does this PR do?
Type of change
Related issues
Testing
Checklist