feat: Implement preference for server token counts #716
matthewkotila wants to merge 5 commits into main from
Conversation
Try out this PR

Quick install:

```shell
pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e2c3be076545049348d1daf58a9f7038240f3be6
```

Recommended with virtual environment (using uv):

```shell
uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e2c3be076545049348d1daf58a9f7038240f3be6
```
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough

This PR deprecates

Changes

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/metrics_reference.md (1)
918-918: ⚠️ Potential issue | 🟡 Minor: Update documentation for deprecated flag.

Line 918 still references `--use-server-token-count` as a solution for tokenizer mismatch, but this flag is now deprecated and a no-op. Consider updating this guidance to reference the new dual-source approach or remove the recommendation.

📝 Suggested update

```diff
-- If discrepancy is due to tokenizer mismatch between client and server, use `--use-server-token-count`.
+- If discrepancy is due to tokenizer mismatch between client and server, compare the server-reported and client-side values in the exported file-only metrics (`_server` vs `_local` variants).
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/metrics_reference.md` at line 918, Replace the deprecated guidance that recommends using the `--use-server-token-count` flag: remove or change the sentence referencing `--use-server-token-count` and instead recommend the new dual-source approach (e.g., compare client and server token counts and surface both sources) or point users to the current flag/setting that replaced it; update any nearby wording to clarify that `--use-server-token-count` is a no-op and link to the dual-source documentation or migration note so readers know the correct current procedure.
🧹 Nitpick comments (3)
src/aiperf/metrics/types/output_token_count.py (1)
53-59: Inconsistent zero-handling between metrics.

`OutputTokenCountMetric` uses truthy checks (`if record.token_counts.output:`), which treats `0` as missing, while `OutputTokenCountServerMetric` and `OutputTokenCountLocalMetric` use `is not None` checks, which correctly allow `0`.

This inconsistency could cause confusion: a record with `output=0` would raise `NoMetricValue` in this metric but return `0` in the server-specific variant. Consider aligning the behavior:

♻️ Proposed fix for consistent zero-handling

```diff
-        if record.token_counts.output:
+        if record.token_counts.output is not None:
             return record.token_counts.output
-        if record.token_counts.output_local:
+        if record.token_counts.output_local is not None:
             return record.token_counts.output_local
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/aiperf/metrics/types/output_token_count.py` around lines 53 - 59, The OutputTokenCountMetric currently uses truthy checks that treat 0 as missing; change the conditional checks in the OutputTokenCountMetric to use "is not None" for record.token_counts.output and record.token_counts.output_local (same semantics as OutputTokenCountServerMetric/OutputTokenCountLocalMetric) so that zero values are returned instead of raising NoMetricValue; keep the final raise NoMetricValue(...) if both are None.

tests/unit/records/test_inference_result_parser.py (2)
207-323: Make tokenization mode explicit in these tests to prevent config-coupled flakiness.

These cases assert `input_local`/`encode()` behavior that now depends on parser flags. Setting `tokenize_input`/`tokenize_output` explicitly in each test keeps them deterministic if fixture defaults change.

Proposed adjustment pattern

```diff
     async def test_server_overlay_on_client(
         self, server_token_parser, request_record, spy_tokenizer
     ):
+        server_token_parser.tokenize_input = True
+        server_token_parser.tokenize_output = False
         server_token_parser.get_tokenizer = AsyncMock(return_value=spy_tokenizer)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/records/test_inference_result_parser.py` around lines 207 - 323, These tests rely on client-side tokenization behavior, so make tokenization mode explicit by setting the parser flags before calling process_valid_record; for each test (e.g., test_server_overlay_on_client, test_no_server_usage_falls_back_to_client, test_partial_server_usage, test_streaming_uses_last_server_value, test_server_output_overrides_client) set the parser instance's tokenize_input = True and tokenize_output = True (on server_token_parser or setup_inference_parser as applicable) so input_local/output_local and spy_tokenizer.encode() behavior is deterministic regardless of fixture defaults.
38-38: Fixture docstring is stale after tokenize gating changes.

"both counts are always computed" no longer matches current behavior with `tokenize_input`/`tokenize_output` gating. Please reword this to avoid misleading future test intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/records/test_inference_result_parser.py` at line 38, Update the docstring on the parser fixture in test_inference_result_parser.py (the "parser" fixture) to remove the stale claim that "both counts are always computed" and instead state that the fixture mirrors the default parser behavior with counts computed conditionally according to the tokenize_input and tokenize_output gating flags; ensure the wording clearly indicates that counts may be included or omitted depending on those tokenize_* settings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 279-283: The warning is emitted when only input_token_count and
output_server are missing but ignores a present reasoning_server; change the
predicate so the warning only fires when all token sources are absent. Update
the condition in inference_result_parser.py to check that input_token_count is
None AND output_server is None AND reasoning_server is None before calling
self.warning, referencing the existing variables input_token_count,
output_server, reasoning_server and the self.warning call to locate the code.
- Around line 285-309: The local tokenization steps (calls to await
self.get_tokenizer, self._compute_token_count, and compute_input_token_count)
must be made best-effort and must not raise to the caller; wrap tokenizer load
and token counting in a try/except that catches encoding/load errors, logs a
warning, and leaves input_local/output_local/reasoning_local as None so any
server-provided counts (input_token_count, output_server, reasoning_server) are
preserved; specifically update the block that calls compute_input_token_count,
await self.get_tokenizer(request_record.model_name),
self._parse_output_and_reasoning_texts, and self._compute_token_count to catch
exceptions, avoid re-raising, and proceed with existing server-parsed results
rather than converting the record to an error in parse_request_record.
---
Outside diff comments:
In `@docs/metrics_reference.md`:
- Line 918: Replace the deprecated guidance that recommends using the
`--use-server-token-count` flag: remove or change the sentence referencing
`--use-server-token-count` and instead recommend the new dual-source approach
(e.g., compare client and server token counts and surface both sources) or point
users to the current flag/setting that replaced it; update any nearby wording to
clarify that `--use-server-token-count` is a no-op and link to the dual-source
documentation or migration note so readers know the correct current procedure.
---
Nitpick comments:
In `@src/aiperf/metrics/types/output_token_count.py`:
- Around line 53-59: The OutputTokenCountMetric currently uses truthy checks
that treat 0 as missing; change the conditional checks in the
OutputTokenCountMetric to use "is not None" for record.token_counts.output and
record.token_counts.output_local (same semantics as
OutputTokenCountServerMetric/OutputTokenCountLocalMetric) so that zero values
are returned instead of raising NoMetricValue; keep the final raise
NoMetricValue(...) if both are None.
In `@tests/unit/records/test_inference_result_parser.py`:
- Around line 207-323: These tests rely on client-side tokenization behavior, so
make tokenization mode explicit by setting the parser flags before calling
process_valid_record; for each test (e.g., test_server_overlay_on_client,
test_no_server_usage_falls_back_to_client, test_partial_server_usage,
test_streaming_uses_last_server_value, test_server_output_overrides_client) set
the parser instance's tokenize_input = True and tokenize_output = True (on
server_token_parser or setup_inference_parser as applicable) so
input_local/output_local and spy_tokenizer.encode() behavior is deterministic
regardless of fixture defaults.
- Line 38: Update the docstring on the parser fixture in
test_inference_result_parser.py (the "parser" fixture) to remove the stale claim
that "both counts are always computed" and instead state that the fixture
mirrors the default parser behavior with counts computed conditionally according
to the tokenize_input and tokenize_output gating flags; ensure the wording
clearly indicates that counts may be included or omitted depending on those
tokenize_* settings.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between eab7190 and 823c4c2500f820b3a842e93728a7acf7ba5d5ba1.
📒 Files selected for processing (32)
- docs/cli_options.md
- docs/metrics_reference.md
- src/aiperf/common/config/endpoint_config.py
- src/aiperf/common/config/tokenizer_config.py
- src/aiperf/common/config/user_config.py
- src/aiperf/common/enums/metric_enums.py
- src/aiperf/common/models/model_endpoint_info.py
- src/aiperf/common/models/record_models.py
- src/aiperf/common/tokenizer_validator.py
- src/aiperf/endpoints/openai_chat.py
- src/aiperf/endpoints/openai_completions.py
- src/aiperf/exporters/console_usage_discrepancy_exporter.py
- src/aiperf/metrics/base_record_metric.py
- src/aiperf/metrics/types/input_sequence_length_metric.py
- src/aiperf/metrics/types/output_sequence_length_metric.py
- src/aiperf/metrics/types/output_token_count.py
- src/aiperf/metrics/types/reasoning_token_count.py
- src/aiperf/metrics/types/usage_diff_metrics.py
- src/aiperf/post_processors/base_metrics_processor.py
- src/aiperf/records/inference_result_parser.py
- src/aiperf/records/record_processor_service.py
- tests/integration/test_use_server_token_counts.py
- tests/unit/endpoints/test_openai_chat_completions.py
- tests/unit/endpoints/test_openai_completions.py
- tests/unit/exporters/test_console_usage_discrepancy_exporter.py
- tests/unit/metrics/conftest.py
- tests/unit/metrics/test_input_sequence_length_metric.py
- tests/unit/metrics/test_output_sequence_length_metric.py
- tests/unit/metrics/test_output_token_count.py
- tests/unit/metrics/test_reasoning_token_count.py
- tests/unit/metrics/test_usage_diff_metrics.py
- tests/unit/records/test_inference_result_parser.py
💤 Files with no reviewable changes (4)
- src/aiperf/common/enums/metric_enums.py
- src/aiperf/common/models/model_endpoint_info.py
- src/aiperf/common/tokenizer_validator.py
- src/aiperf/post_processors/base_metrics_processor.py
Codecov Report

❌ Patch coverage is
ajcasagrande
left a comment
Code Review
Three issues found after verifying against the actual code:
1. Truthy vs None check inconsistency in fallback logic
OutputTokenCountMetric uses truthy checks (excludes 0), while ReasoningTokenCountMetric and InputSequenceLengthMetric use explicit is not None checks (includes 0):
output_token_count.py:53-57:
```python
if record.token_counts.output:  # 0 falls through
    return record.token_counts.output
if record.token_counts.output_local:  # 0 falls through
    return record.token_counts.output_local
```

reasoning_token_count.py:55-59:

```python
if record.token_counts.reasoning is not None:  # 0 returns
    return record.token_counts.reasoning
if record.token_counts.reasoning_local is not None:
    return record.token_counts.reasoning_local
```

If a server reports output=0 (valid — model generated no output tokens), OutputTokenCountMetric skips it and falls through to output_local, then raises NoMetricValue if that's also falsy. Meanwhile reasoning=0 correctly returns 0. InputSequenceLengthMetric also uses is not None, making OutputTokenCountMetric the odd one out.
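The divergence is easy to demonstrate in isolation. Below is a minimal sketch (hypothetical helper functions, with `NoMetricValue` stood in by a local exception class rather than aiperf's real one) contrasting the two check styles on a server-reported count of zero:

```python
class NoMetricValue(Exception):
    """Stand-in for aiperf's NoMetricValue exception."""


def output_count_truthy(output, output_local):
    # Mirrors the current OutputTokenCountMetric style: 0 is falsy,
    # so a legitimate zero count falls through both branches.
    if output:
        return output
    if output_local:
        return output_local
    raise NoMetricValue("no output token count")


def output_count_none_check(output, output_local):
    # Mirrors the ReasoningTokenCountMetric style: only None is "missing".
    if output is not None:
        return output
    if output_local is not None:
        return output_local
    raise NoMetricValue("no output token count")


# A server legitimately reporting zero output tokens:
print(output_count_none_check(0, None))  # 0
try:
    output_count_truthy(0, None)
except NoMetricValue:
    print("truthy check dropped a valid 0")
```

The sketch makes the review's point concrete: the same record yields `0` under one metric and an exception under the other.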
2. Breaking metric tag rename
usage_diff_metrics.py:103 — tag changed from usage_completion_tokens_diff_pct to usage_output_tokens_diff_pct. Any tooling parsing profile_export.jsonl for the old tag will silently find nothing. Worth calling out as a breaking change in the PR description or release notes.
3. Misleading test rename
test_input_sequence_length_metric.py:34-47:
```python
def test_input_sequence_length_parse_record_client_fallback(
    self, input_tokens, should_raise
):
    """Test input sequence length falls back to client-side when no server count."""
    record = create_record(input_tokens=input_tokens)
```

create_record(input_tokens=15) sets TokenCounts(input=15, input_local=None) — input is the server field. This test exercises the server-preferred path, not client fallback. The name and docstring don't match what the test does.
ajcasagrande
left a comment
There was a problem hiding this comment.
Deep Design Analysis
Six additional issues found after tracing the reasoning behind the changes through the codebase:
A. OSL can mix server and client token sources
output_sequence_length_metric.py:51-65 resolves output and reasoning independently through separate server→local fallback chains. When a server reports completion_tokens but not reasoning_tokens (standard for non-reasoning models), OSL sums server output + client reasoning. This propagates into OSLMismatchDiffMetric, OutputTokenThroughputMetric, TotalTokenThroughputMetric, and OSLMismatchCountMetric. The practical impact is narrow — it requires a model producing reasoning content (e.g. <think> blocks) where the server doesn't account for reasoning_tokens separately — but when it hits, the resulting OSL doesn't correspond to either system's view of reality.
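The mixed-source sum can be sketched directly. This is a hedged illustration of the per-field resolution described above (hypothetical `resolve`/`osl` helpers, not the actual metric code):

```python
def resolve(server, local):
    """Server-preferred with client fallback, applied per field."""
    return server if server is not None else local


def osl(output_server, output_local, reasoning_server, reasoning_local):
    # Output and reasoning are resolved independently, so when the server
    # reports completion tokens but not reasoning tokens, the sum mixes a
    # server-side output count with a client-side reasoning count.
    return resolve(output_server, output_local) + resolve(
        reasoning_server, reasoning_local
    )


# Server reported 100 completion tokens but no reasoning count; the client
# tokenizer counted 90 output tokens and 15 reasoning tokens (<think> blocks).
print(osl(output_server=100, output_local=90,
          reasoning_server=None, reasoning_local=15))  # 115
```

The result, 115, matches neither the server's view (100 completion tokens) nor the client's (90 + 15 = 105), which is exactly the "doesn't correspond to either system's view of reality" problem.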
B. Error records lose input tokenization for user-provided datasets
inference_result_parser.py:127,172 — The error record paths gate input tokenization on self.tokenize_input, which is auto-set to False for user-provided datasets (user_config.py:994). Error records have no server response, so token_counts.input is always None. Unlike the valid-record path (line 291) which has an elif input_token_count is None fallback, error records have no fallback. Both input and input_local are None → ErrorInputSequenceLengthMetric always raises NoMetricValue. The old code had no tokenize_input gate on error records.
C. stream_options.include_usage injected into all streaming requests
openai_chat.py:64-72, openai_completions.py:55-63 — Previously gated behind --use-server-token-count, now unconditional for all streaming endpoints. This changes the HTTP payload sent to the server, which may affect latency measurements (extra final usage chunk) or cause errors on servers that don't support stream_options. Users can work around it with --extra-inputs '{"stream_options": {"include_usage": false}}', but there is no notification of this change.
D. Prompt diff metrics silently disappear for user-provided datasets
For user-provided datasets: tokenize_input=False (auto) + server reports prompt_tokens → input_local is never computed (inference_result_parser.py:288-289 — the self.tokenize_input branch is skipped, and the fallback on line 291 doesn't fire because input_token_count is not None) → UsagePromptTokensDiffMetric raises NoMetricValue (line 69-72) → UsageDiscrepancyCountMetric cascades → console discrepancy exporter silently returns with a debug log. ISL displays on the console from server values, but the system has no ability to validate those values are correct.
E. Output/reasoning diff metrics are permanently dormant by default
UsageOutputTokensDiffMetric and UsageReasoningTokensDiffMetric require output_local/reasoning_local (lines 126, 184), which are only populated when --tokenize-output is passed or the server doesn't report counts (inference_result_parser.py:306-309). In the common case (server reports completion tokens, --tokenize-output not passed), these metrics always raise NoMetricValue. Only a TRACE-level log is emitted. Nothing tells the user these metrics exist or how to activate them.
F. UsageDiscrepancyCountMetric missing TOKENIZES_INPUT_ONLY flag
UsageDiscrepancyCountMetric has flags = NO_CONSOLE | NO_INDIVIDUAL_RECORDS (line 236). Its dependency UsagePromptTokensDiffMetric has TOKENIZES_INPUT_ONLY (line 48). For non-tokenizing endpoints, the dependency is filtered out at setup time by base_metrics_processor.py:get_filters, but the count metric passes the filter because it lacks the flag. At runtime it fails silently when it can't find the dependency. Functionally equivalent to being filtered, but semantically wrong — the metric shouldn't be in the applicable set at all.
a8d63f1 to a912ba6
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/metrics_reference.md (1)
918-918: ⚠️ Potential issue | 🟡 Minor: Reference to deprecated `--use-server-token-count` flag should be removed.

Per the PR objectives, `--use-server-token-count` is deprecated as a no-op. This line still references it as a solution for tokenizer mismatch. Consider removing this note or updating it to reflect the new dual-source behavior where server values are preferred by default.

```diff
-- If discrepancy is due to tokenizer mismatch between client and server, use `--use-server-token-count`.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/metrics_reference.md` at line 918, Remove the deprecated recommendation to use `--use-server-token-count`: locate the sentence that reads "If discrepancy is due to tokenizer mismatch between client and server, use `--use-server-token-count`" and either delete it or replace it with a short note stating that `--use-server-token-count` is deprecated/no-op and that the system now uses dual-source counts with server values preferred by default. Ensure the updated wording mentions the deprecation and the new default server-preference behavior.
♻️ Duplicate comments (2)
src/aiperf/records/inference_result_parser.py (2)
285-309: ⚠️ Potential issue | 🟠 Major: Local tokenization failures should not invalidate otherwise valid records.

In this block, tokenizer load/encode errors bubble up to `parse_request_record`, which converts the entire record to an error and drops responses. Since local counts are supplementary (fallback when server doesn't report), preserve valid server-parsed results when local tokenization fails.

Suggested hardening

```diff
         # Client-side input tokenization
         input_local: int | None = None
-        if not self.disable_tokenization:
-            if self.tokenize_input:
-                # Always compute (synthetic mode or explicit --tokenize-input)
-                input_local = await self.compute_input_token_count(request_record)
-            elif input_token_count is None:
-                # Fallback: server didn't report prompt tokens
-                input_local = await self.compute_input_token_count(request_record)
+        if not self.disable_tokenization and (
+            self.tokenize_input or input_token_count is None
+        ):
+            with suppress(Exception):
+                input_local = await self.compute_input_token_count(request_record)

         # Client-side output/reasoning tokenization
         output_local: int | None = None
         reasoning_local: int | None = None
         need_local = (
             self.tokenize_output or output_server is None or reasoning_server is None
         )
         if not self.disable_tokenization and need_local:
-            tokenizer = await self.get_tokenizer(request_record.model_name)
-            output_texts, reasoning_texts = self._parse_output_and_reasoning_texts(
-                responses
-            )
-            if self.tokenize_output or output_server is None:
-                output_local = self._compute_token_count(tokenizer, output_texts)
-            if self.tokenize_output or reasoning_server is None:
-                reasoning_local = self._compute_token_count(tokenizer, reasoning_texts)
+            with suppress(Exception):
+                tokenizer = await self.get_tokenizer(request_record.model_name)
+                output_texts, reasoning_texts = self._parse_output_and_reasoning_texts(
+                    responses
+                )
+                if self.tokenize_output or output_server is None:
+                    output_local = self._compute_token_count(tokenizer, output_texts)
+                if self.tokenize_output or reasoning_server is None:
+                    reasoning_local = self._compute_token_count(tokenizer, reasoning_texts)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiperf/records/inference_result_parser.py` around lines 285 - 309, The local tokenization path in inference_result_parser.py (around get_tokenizer, _compute_token_count, and compute_input_token_count calls inside parse_request_record) must not let tokenizer load/encode exceptions convert a valid server-parsed record into an error; wrap the tokenizer fetching and all local token counting (compute_input_token_count, _compute_token_count) in try/except, catch encoding/loading errors, log the failure (including model_name and exception) and leave input_local/output_local/reasoning_local as None so the existing server-provided values (input_token_count, output_server, reasoning_server) remain used; do not re-raise the exception or mutate the record into an error state.
278-283: ⚠️ Potential issue | 🟡 Minor: No-usage warning predicate is incomplete.

The warning triggers when `input_token_count` and `output_server` are missing, even if the server reported `reasoning_tokens`. This could emit a misleading message when the server does provide partial usage info.

Suggested fix

```diff
-        if input_token_count is None and output_server is None:
+        if (
+            input_token_count is None
+            and output_server is None
+            and reasoning_server is None
+        ):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiperf/records/inference_result_parser.py` around lines 278 - 283, The warning in the inference result parsing currently fires when input_token_count and output_server are None even if the server provided reasoning_tokens; update the predicate (in the method that calls self.warning with the message about token usage — locate the block using input_token_count, output_server, and reasoning_tokens in inference_result_parser.py) to also check that reasoning_tokens is None so the warning only triggers when input_token_count, output_server, and reasoning_tokens are all missing.
🧹 Nitpick comments (1)
src/aiperf/metrics/base_record_metric.py (1)
28-33: Consider updating the docstring example to match actual ISL behavior.

The example now shows `record.token_counts.input_local`, but the actual `InputSequenceLengthMetric` prefers server-reported `record.token_counts.input` and falls back to `input_local`. Consider updating the example to reflect the real server-preferred-with-fallback pattern:

```python
# Prefer server, fall back to local
input_count = record.token_counts.input
if input_count is None:
    input_count = record.token_counts.input_local
return input_count
```

Alternatively, if the intent is to show the simplest case, add a comment noting this is for `InputSequenceLengthLocalMetric` specifically.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiperf/metrics/base_record_metric.py` around lines 28 - 33, The docstring example for InputSequenceLengthMetric is misleading because the metric prefers server-reported token_counts.input and falls back to token_counts.input_local; update the example in the InputSequenceLengthMetric docstring to show the server-preferred-with-fallback pattern (use record.token_counts.input, if None then record.token_counts.input_local) or, if the example is intentionally local-only, change the example text to reference InputSequenceLengthLocalMetric and add a comment stating it uses input_local only; update references in the class/method docstring accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/aiperf/metrics/types/output_token_count.py`:
- Around line 50-59: OutputTokenCountMetric currently uses truthy checks (if
record.token_counts.output) which treats 0 as missing; change these to explicit
None checks to match ReasoningTokenCountMetric and allow legitimate zero values.
Update the logic in OutputTokenCountMetric to: first verify record.token_counts
is not None (unchanged), then use "if record.token_counts.output is not None:
return record.token_counts.output" and "if record.token_counts.output_local is
not None: return record.token_counts.output_local", otherwise raise
NoMetricValue. This preserves existing behavior while making it consistent with
ReasoningTokenCountMetric.
---
Outside diff comments:
In `@docs/metrics_reference.md`:
- Line 918: Remove the deprecated recommendation to use
`--use-server-token-count`: locate the sentence that reads "If discrepancy is
due to tokenizer mismatch between client and server, use
`--use-server-token-count`" and either delete it or replace it with a short note
stating that `--use-server-token-count` is deprecated/no-op and that the system
now uses dual-source counts with server values preferred by default. Ensure the
updated wording mentions the deprecation and the new default server-preference
behavior.
---
Duplicate comments:
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 285-309: The local tokenization path in inference_result_parser.py
(around get_tokenizer, _compute_token_count, and compute_input_token_count calls
inside parse_request_record) must not let tokenizer load/encode exceptions
convert a valid server-parsed record into an error; wrap the tokenizer fetching
and all local token counting (compute_input_token_count, _compute_token_count)
in try/except, catch encoding/loading errors, log the failure (including
model_name and exception) and leave input_local/output_local/reasoning_local as
None so the existing server-provided values (input_token_count, output_server,
reasoning_server) remain used; do not re-raise the exception or mutate the
record into an error state.
- Around line 278-283: The warning in the inference result parsing currently
fires when input_token_count and output_server are None even if the server
provided reasoning_tokens; update the predicate (in the method that calls
self.warning with the message about token usage — locate the block using
input_token_count, output_server, and reasoning_tokens in
inference_result_parser.py) to also check that reasoning_tokens is None so the
warning only triggers when input_token_count, output_server, and
reasoning_tokens are all missing.
---
Nitpick comments:
In `@src/aiperf/metrics/base_record_metric.py`:
- Around line 28-33: The docstring example for InputSequenceLengthMetric is
misleading because the metric prefers server-reported token_counts.input and
falls back to token_counts.input_local; update the example in the
InputSequenceLengthMetric docstring to show the server-preferred-with-fallback
pattern (use record.token_counts.input, if None then
record.token_counts.input_local) or, if the example is intentionally local-only,
change the example text to reference InputSequenceLengthLocalMetric and add a
comment stating it uses input_local only; update references in the class/method
docstring accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 745ddad1-b3a7-453e-a890-3ec9d836d7b1
📥 Commits
Reviewing files that changed from the base of the PR and between 823c4c2500f820b3a842e93728a7acf7ba5d5ba1 and a8d63f1450b6927553edfc89acc843d8403f6929.
📒 Files selected for processing (32)
- docs/cli_options.md
- docs/metrics_reference.md
- src/aiperf/common/config/endpoint_config.py
- src/aiperf/common/config/tokenizer_config.py
- src/aiperf/common/config/user_config.py
- src/aiperf/common/enums/metric_enums.py
- src/aiperf/common/models/model_endpoint_info.py
- src/aiperf/common/models/record_models.py
- src/aiperf/common/tokenizer_validator.py
- src/aiperf/endpoints/openai_chat.py
- src/aiperf/endpoints/openai_completions.py
- src/aiperf/exporters/console_usage_discrepancy_exporter.py
- src/aiperf/metrics/base_record_metric.py
- src/aiperf/metrics/types/input_sequence_length_metric.py
- src/aiperf/metrics/types/output_sequence_length_metric.py
- src/aiperf/metrics/types/output_token_count.py
- src/aiperf/metrics/types/reasoning_token_count.py
- src/aiperf/metrics/types/usage_diff_metrics.py
- src/aiperf/post_processors/base_metrics_processor.py
- src/aiperf/records/inference_result_parser.py
- src/aiperf/records/record_processor_service.py
- tests/integration/test_use_server_token_counts.py
- tests/unit/endpoints/test_openai_chat_completions.py
- tests/unit/endpoints/test_openai_completions.py
- tests/unit/exporters/test_console_usage_discrepancy_exporter.py
- tests/unit/metrics/conftest.py
- tests/unit/metrics/test_input_sequence_length_metric.py
- tests/unit/metrics/test_output_sequence_length_metric.py
- tests/unit/metrics/test_output_token_count.py
- tests/unit/metrics/test_reasoning_token_count.py
- tests/unit/metrics/test_usage_diff_metrics.py
- tests/unit/records/test_inference_result_parser.py
💤 Files with no reviewable changes (4)
- src/aiperf/common/enums/metric_enums.py
- src/aiperf/common/tokenizer_validator.py
- src/aiperf/common/models/model_endpoint_info.py
- src/aiperf/post_processors/base_metrics_processor.py
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/unit/endpoints/test_openai_completions.py
- tests/unit/metrics/test_output_token_count.py
- tests/unit/exporters/test_console_usage_discrepancy_exporter.py
- src/aiperf/records/record_processor_service.py
- tests/unit/records/test_inference_result_parser.py
- src/aiperf/endpoints/openai_chat.py
- src/aiperf/common/config/endpoint_config.py
- tests/unit/metrics/test_input_sequence_length_metric.py
- tests/unit/metrics/conftest.py
a912ba6 to 2baa419
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/aiperf/common/models/record_models.py (1)
880-907:⚠️ Potential issue | 🔴 CriticalUsing Pydantic
Field()in a plain@dataclassdoesn't work as intended.
TokenCountsis decorated with@dataclass(slots=True), but the fields use Pydantic'sField()for defaults. In a plain dataclass,Field(default=None, ...)returns aFieldInfoobject which becomes the actual default value, notNone. This meansTokenCounts()would haveFieldInfoobjects as field values instead ofNone.For dataclasses, use plain defaults with docstrings for documentation (consistent with other dataclasses in this file like
SSEField,TextResponse, etc.):🐛 Proposed fix
```diff
 @dataclass(slots=True)
 class TokenCounts:
     """Token counts for a record."""
-    input: int | None = Field(
-        default=None,
-        description="The server-reported prompt token count from the API usage field. If None, the server did not report prompt tokens.",
-    )
-    input_local: int | None = Field(
-        default=None,
-        description="The number of input tokens computed by the client-side tokenizer. If None, the number of tokens could not be calculated.",
-    )
-    output: int | None = Field(
-        default=None,
-        description="The server-reported output token count (completion minus reasoning). If None, the server did not report completion tokens.",
-    )
-    output_local: int | None = Field(
-        default=None,
-        description="The number of output tokens computed by the client-side tokenizer.",
-    )
-    reasoning: int | None = Field(
-        default=None,
-        description="The server-reported reasoning token count. If None, the server did not report reasoning tokens.",
-    )
-    reasoning_local: int | None = Field(
-        default=None,
-        description="The number of reasoning tokens computed by the client-side tokenizer.",
-    )
+    input: int | None = None
+    """The server-reported prompt token count from the API usage field. None if not reported."""
+
+    input_local: int | None = None
+    """The number of input tokens computed by the client-side tokenizer. None if not calculated."""
+
+    output: int | None = None
+    """The server-reported output token count (completion minus reasoning). None if not reported."""
+
+    output_local: int | None = None
+    """The number of output tokens computed by the client-side tokenizer."""
+
+    reasoning: int | None = None
+    """The server-reported reasoning token count. None if not reported."""
+
+    reasoning_local: int | None = None
+    """The number of reasoning tokens computed by the client-side tokenizer."""
```

As per coding guidelines: "Add docstrings on dataclass fields" and use plain defaults for dataclasses.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiperf/common/models/record_models.py` around lines 880 - 907, The TokenCounts dataclass is using Pydantic's Field() as defaults causing FieldInfo objects to become actual default values; replace all Field(...) usages inside the TokenCounts class with plain Python defaults (e.g., input: int | None = None) and move the descriptive text into per-field docstrings or the class docstring to match other dataclasses (reference TokenCounts and its attributes input, input_local, output, output_local, reasoning, reasoning_local) so instances of TokenCounts get real None defaults instead of FieldInfo objects.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/metrics-reference.md`:
- Line 61: Update the TOC entry that links to the "Usage Completion Tokens Diff
%" heading by replacing the incorrect anchor "#usage-output-tokens-diff-" with
the correct anchor "#usage-completion-tokens-diff-"; locate the text containing
the malformed link and change the fragment to match the heading text "Usage
Completion Tokens Diff %".
In `@tests/unit/metrics/test_output_sequence_length_metric.py`:
- Around line 81-131: The OutputSequenceLengthMetric.parse_record currently
chooses all-server vs all-local wholesale; change it to compute per-field
fallbacks: pick output_val = token_counts.output if not None else
token_counts.output_local, and reasoning_val = token_counts.reasoning if not
None else token_counts.reasoning_local; if both output_val and reasoning_val are
None raise NoMetricValue, otherwise treat any remaining None as 0 and return
output_val + reasoning_val. Update the logic in
OutputSequenceLengthMetric.parse_record to follow this per-field (output ??
output_local) + (reasoning ?? reasoning_local) behavior.
---
Outside diff comments:
In `@src/aiperf/common/models/record_models.py`:
- Around line 880-907: The TokenCounts dataclass is using Pydantic's Field() as
defaults causing FieldInfo objects to become actual default values; replace all
Field(...) usages inside the TokenCounts class with plain Python defaults (e.g.,
input: int | None = None) and move the descriptive text into per-field
docstrings or the class docstring to match other dataclasses (reference
TokenCounts and its attributes input, input_local, output, output_local,
reasoning, reasoning_local) so instances of TokenCounts get real None defaults
instead of FieldInfo objects.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c6788c05-9d3e-449d-ba0c-ee28876b4dac
📥 Commits
Reviewing files that changed from the base of the PR and between a8d63f1450b6927553edfc89acc843d8403f6929 and 2baa419.
📒 Files selected for processing (34)
- docs/cli-options.md
- docs/metrics-reference.md
- src/aiperf/common/config/cli_parameter.py
- src/aiperf/common/config/config_defaults.py
- src/aiperf/common/config/endpoint_config.py
- src/aiperf/common/config/tokenizer_config.py
- src/aiperf/common/config/user_config.py
- src/aiperf/common/enums/metric_enums.py
- src/aiperf/common/models/model_endpoint_info.py
- src/aiperf/common/models/record_models.py
- src/aiperf/common/tokenizer_validator.py
- src/aiperf/endpoints/openai_chat.py
- src/aiperf/endpoints/openai_completions.py
- src/aiperf/exporters/console_usage_discrepancy_exporter.py
- src/aiperf/metrics/base_record_metric.py
- src/aiperf/metrics/types/input_sequence_length_metric.py
- src/aiperf/metrics/types/output_sequence_length_metric.py
- src/aiperf/metrics/types/output_token_count.py
- src/aiperf/metrics/types/reasoning_token_count.py
- src/aiperf/metrics/types/usage_diff_metrics.py
- src/aiperf/post_processors/base_metrics_processor.py
- src/aiperf/records/inference_result_parser.py
- src/aiperf/records/record_processor_service.py
- tests/integration/test_use_server_token_counts.py
- tests/unit/endpoints/test_openai_chat_completions.py
- tests/unit/endpoints/test_openai_completions.py
- tests/unit/exporters/test_console_usage_discrepancy_exporter.py
- tests/unit/metrics/conftest.py
- tests/unit/metrics/test_input_sequence_length_metric.py
- tests/unit/metrics/test_output_sequence_length_metric.py
- tests/unit/metrics/test_output_token_count.py
- tests/unit/metrics/test_reasoning_token_count.py
- tests/unit/metrics/test_usage_diff_metrics.py
- tests/unit/records/test_inference_result_parser.py
💤 Files with no reviewable changes (3)
- src/aiperf/common/enums/metric_enums.py
- src/aiperf/common/tokenizer_validator.py
- src/aiperf/post_processors/base_metrics_processor.py
🚧 Files skipped from review as they are similar to previous changes (7)
- src/aiperf/common/models/model_endpoint_info.py
- src/aiperf/common/config/user_config.py
- tests/unit/exporters/test_console_usage_discrepancy_exporter.py
- src/aiperf/metrics/base_record_metric.py
- src/aiperf/exporters/console_usage_discrepancy_exporter.py
- tests/integration/test_use_server_token_counts.py
- src/aiperf/endpoints/openai_completions.py
@ajcasagrande responding here so we can have a thread
1: Fixed. Uses None everywhere.
2: Reverted to the old name to stay consistent with the "usage_" prefix, which prefers the OpenAI naming.
3: Removed test since it was already covered by another one.
A: Fixed. Forces consistent source of reasoning and output token counts. Either both server or both local.
B: Fixed. Error records now fall back to local tokenization unless the user explicitly specifies --no-tokenize-input.
C: Fixed. Added --stream-usage / --no-stream-usage flag + info log.
D: Fixed. Added an info log when input tokenization is disabled. In the rare case where the user provided input and doesn't trust server token counts, they can tokenize locally with --tokenize-input.
E: Fixed. Added an info log when output tokenization is disabled. In the rare case where the user doesn't trust server token counts, they can tokenize locally with --tokenize-output.
F: Added TOKENIZES_INPUT_ONLY flag.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/metrics-reference.md`:
- Around line 808-825: The documentation currently says prompt diff metrics
require only `--tokenize-output`, which is incorrect; update the
`UsagePromptTokensDiffMetric` notes to state that prompt-diff values depend on
`token_counts.input_local` and therefore require `--tokenize-input` (or
server-provided prompt token counts/fallback tokenization) to be present for
user-provided datasets; mention `token_counts.input_local`,
`UsagePromptTokensDiffMetric`, and the `--tokenize-input` flag so readers know
when the metric will actually be available.
- Around line 323-333: Update the OSL documentation to state that
OutputSequenceLengthMetric (OSL) uses an all-or-nothing token source selection:
it chooses either all server-reported counts or all client-local counts rather
than doing per-field fallbacks; explicitly call out that OSL does not mix server
`token_counts.*` (e.g., `token_counts.output`, `token_counts.reasoning`) with
local `*_local` fields to avoid double-counting when some backends fold
reasoning into `output` while leaving `reasoning` null; revise the description
and the “Formula” examples for OutputSequenceLengthMetric/OSL to show this
all-server OR all-local behavior (not `(output ?? output_local) + (reasoning ??
reasoning_local)`), and make the same clarification in the other section noted
(lines 623-633).
- Around line 327-329: The docs currently show output_token_count =
token_counts.output or token_counts.output_local which treats 0 as falsy; update
the text to show explicit None-check fallback semantics used in code: use
token_counts.output if token_counts.output is not None else
token_counts.output_local, and explain that token_counts.output_local is only
used when token_counts.output is None so a server-reported 0 is preserved.
- Around line 385-425: Update the ISL documentation to reflect the new storage
fields and correct flag behavior: change the primary/formula references from
usage.prompt_tokens and tokenizer.encode(...) to token_counts.input
(server-preferred) and token_counts.input_local (client-side), update the “Input
Sequence Length — Server (file-only)” and “Input Sequence Length — Local
(file-only)” subsections to state they read token_counts.input and
token_counts.input_local respectively, and clarify that --no-tokenize-input only
prevents explicit client tokenization when configured but does not stop the
fallback local tokenization when token_counts.input is missing (mention
--tokenize-input to force client tokenization). Ensure the formulas and notes
reference the exact symbols token_counts.input and token_counts.input_local and
update phrasing so fallback behavior is accurate.
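The `or`-versus-`is not None` point raised for the docs above comes down to how Python treats a server-reported count of zero:

```python
server_reported = 0  # the server really counted zero tokens (a valid value)
local_estimate = 7   # client-side tokenizer result

# `or` treats 0 as falsy and silently discards the server's value:
wrong = server_reported or local_estimate
# An explicit None check only falls back when the server reported nothing:
right = server_reported if server_reported is not None else local_estimate

print(wrong, right)  # 7 0
```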
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8049f8ea-4e80-46f5-91b4-d526d40892ba
📒 Files selected for processing (1)
docs/metrics-reference.md
25f573e to 633b7bd
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/aiperf/metrics/base_record_metric.py (1)
27-33: ⚠️ Potential issue | 🟡 Minor

Docstring example now conflicts with actual ISL source precedence.

The example should not return `input_local` directly; it should demonstrate server-first (`input`) with local fallback, matching `InputSequenceLengthMetric` behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiperf/metrics/base_record_metric.py` around lines 27 - 33, The implementation of _parse_record currently returns record.token_counts.input_local but the docstring/example expects server-first precedence; update BaseRecordMetric._parse_record to return record.token_counts.input when present and fall back to record.token_counts.input_local otherwise (preserving behavior similar to InputSequenceLengthMetric), ensuring you reference ParsedResponseRecord.token_counts.input and .input_local and keep the same return type (int) and signature.
src/aiperf/metrics/types/usage_diff_metrics.py (1)
199-262: ⚠️ Potential issue | 🟠 Major

The exported counter name no longer matches what it counts.

`usage_discrepancy_count` now depends only on `UsagePromptTokensDiffMetric`, so completion and reasoning mismatches are silently excluded even though the tag/header still read as a generic usage discrepancy count. That changes the meaning of an exported metric without any signal to downstream dashboards. Either keep the aggregate behavior or introduce a prompt-specific metric/tag and preserve the current one for backward compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiperf/metrics/types/usage_diff_metrics.py` around lines 199 - 262, The exported metric UsageDiscrepancyCountMetric (tag/header) no longer matches its implementation since _parse_record only checks UsagePromptTokensDiffMetric; either restore the original generic behavior by including completion/reasoning diff metrics in required_metrics and checking all diffs (e.g., take max(prompt, completion, reasoning) > threshold or increment if any exceed threshold in UsageDiscrepancyCountMetric._parse_record), or make this a prompt-specific metric by renaming the tag/header (e.g., tag="usage_prompt_discrepancy_count", header="Usage Prompt Discrepancy Count", short_header) and leave the original UsageDiscrepancyCountMetric intact for backward compatibility (create a new class UsagePromptDiscrepancyCountMetric referencing UsagePromptTokensDiffMetric if you choose rename).
🧹 Nitpick comments (4)
tests/unit/metrics/test_reasoning_token_count.py (1)
106-178: Align new test names with the project's test naming convention.

Please rename the newly added server/local tests to the `test_<function>_<scenario>_<expected>` pattern for consistency (e.g., `test_parse_record_server_value_present_returns_server_value`).

As per coding guidelines, `tests/**/*.py`: "Test naming convention: `test_<function>_<scenario>_<expected>`".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/metrics/test_reasoning_token_count.py` around lines 106 - 178, The test function names in TestReasoningTokenCountServerMetric and TestReasoningTokenCountLocalMetric do not follow the project's test naming convention; rename each test method to the test_<function>_<scenario>_<expected> pattern (for example, rename test_server_metric_basic -> test_parse_record_server_value_present_returns_server_value, test_server_metric_zero -> test_parse_record_server_value_zero_returns_zero, test_server_metric_none_raises -> test_parse_record_server_value_none_raises_NoMetricValue, and similarly rename the local ones: test_local_metric_basic -> test_parse_record_local_value_present_returns_local_value, test_local_metric_zero -> test_parse_record_local_value_zero_returns_zero, test_local_metric_none_raises -> test_parse_record_local_value_none_raises_NoMetricValue) so the names align with the guideline; keep the classes ReasoningTokenCountServerMetric and ReasoningTokenCountLocalMetric and their assertions unchanged.
tests/unit/metrics/test_input_sequence_length_metric.py (1)
25-141: Good coverage overall; please align the new pytest methods with repo conventions.

The added cases are useful, but names like `test_server_metric_parse_record` and the missing `-> None` annotations drift from the repo's pytest conventions for test naming and function typing.

As per coding guidelines, "Name tests as `test_<function>_<scenario>_<expected>`" and "Include type hints on ALL functions (params and return)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/metrics/test_input_sequence_length_metric.py` around lines 25 - 141, Rename the test methods to follow the repo convention test_<function>_<scenario>_<expected> (e.g., rename test_server_metric_parse_record -> test_parse_record_server_returns_or_raises, test_server_metric_ignores_local -> test_parse_record_server_ignores_local_raises, test_server_metric_flags -> test_flags_server_contains_expected, and similarly for the local-metric methods) and add full type hints to all test function signatures (annotate parameters and add "-> None" return type) for the methods in TestInputSequenceLengthServerMetric, TestInputSequenceLengthLocalMetric and any other newly added tests such as test_input_sequence_length_prefers_server and test_input_sequence_length_both_none_raises so they conform to the project's naming and typing conventions.
src/aiperf/records/inference_result_parser.py (1)
325-356: Keep tokenizer failures best-effort, but let parser bugs fail fast.

These `except Exception` blocks also swallow failures from `compute_input_token_count()` and `_parse_output_and_reasoning_texts()`, so regressions in our own prompt/response extraction quietly turn into missing local counts. Catch around tokenizer load/encode only, or re-raise non-tokenizer failures.

Based on learnings, prefer fail-fast at the source of logic errors rather than adding downstream defensive guards that mask bugs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiperf/records/inference_result_parser.py` around lines 325 - 356, The current broad excepts around compute_input_token_count and the block that calls get_tokenizer, _parse_output_and_reasoning_texts and _compute_token_count swallow parser/regression errors; narrow the scope so only tokenizer-related failures are best-effort. For compute_input_token_count, remove the blanket except or catch and re-raise non-tokenizer errors so only tokenizer load/encode errors get logged; for the output path, call _parse_output_and_reasoning_texts outside the tokenizer-specific try and let its exceptions propagate, then wrap only get_tokenizer and _compute_token_count calls in a try that catches tokenizer/encoding errors (and logs/warns) while re-raising any other exceptions. Reference functions: compute_input_token_count, get_tokenizer, _parse_output_and_reasoning_texts, _compute_token_count.
tests/unit/metrics/test_usage_diff_metrics.py (1)
478-634: Please align the new pytest methods with repo naming and typing rules.

The added completion/reasoning diff tests omit `-> None`, and names like `test_exact_match` / `test_positive_difference` are too generic for the repo's `test_<function>_<scenario>_<expected>` convention.

As per coding guidelines, "Name tests as `test_<function>_<scenario>_<expected>`" and "Include type hints on ALL functions (params and return)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/metrics/test_usage_diff_metrics.py` around lines 478 - 634, The test methods in TestUsageCompletionTokensDiffMetric and TestUsageReasoningTokensDiffMetric lack type hints and use generic names; rename each method to follow the repo convention test_<function>_<scenario>_<expected> (e.g., replace test_exact_match with test_parse_record_exact_match_returns_zero for UsageCompletionTokensDiffMetric and analogous names for UsageReasoningTokensDiffMetric) and add full type annotations to all test method signatures (include parameter types like self and return type -> None) so every def in those classes has proper typing; keep the existing assertions and metric/class references (UsageCompletionTokensDiffMetric, UsageReasoningTokensDiffMetric, parse_record, MetricRecordDict, NoMetricValue, MetricFlags) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/cli-options.md`:
- Around line 168-170: The deprecation note for the flag
`--use-server-token-count` incorrectly states that AIPerf "always computes both
client-side and server-reported token counts"; instead update the wording to
clarify that `input_local` and `output_local` token counts are not computed
unconditionally but are only materialized when explicitly enabled or required as
a fallback to server counts, and that server-reported counts remain preferred
for output metrics while client-side counts are optional and may impose
performance cost; edit the `--use-server-token-count` description to reflect
this conditional materialization and the performance implication.
- Around line 681-684: Update the `--tokenize-input`/`--no-tokenize-input` docs
to reflect the conditional runtime default instead of claiming a flat True:
state that tokenization is enabled by default for synthetic inputs (so input
token counts are stored in token_counts.input_local), but it is automatically
disabled for user-provided datasets (`--custom-dataset-type` and
`--public-dataset`) unless explicitly overridden; remove or replace the
misleading "_Default: `True`_" line and ensure the text mentions the fallback
client-side tokenization behavior when the server does not report prompt tokens.
In `@docs/metrics-reference.md`:
- Around line 815-817: The note overstates prompt-diff availability: update the
text to say that UsagePromptTokensDiffMetric requires both API-provided prompt
usage (the server `usage` prompt token count) and client-local prompt token
counts; enabling local tokenization via --tokenize-input (or fallback
tokenization) only supplies the client side and does not create a prompt diff if
the API prompt usage is absent. Modify the paragraph to explicitly state that
prompt diff is available only when the API includes prompt usage plus client
tokenization, and that --tokenize-input enables client counting but cannot
substitute for missing API prompt usage.
- Around line 401-406: Update the documentation text around the client-side
tokenization and the --no-tokenize-input flag to make clear that
--no-tokenize-input only prevents client-side tokenization when the server
reports usage.prompt_tokens, but does not prevent fallback local tokenization
when the server omits prompt token counts; explicitly mention that client-side
tokenization (which uses add_special_tokens=False to count only content tokens)
will still be performed as a fallback if prompt token usage is missing, and note
that --tokenize-input forces local tokenization for user-provided input
datasets.
- Around line 396-398: The current ISL formula uses "usage.prompt_tokens or
len(tokenizer.encode(prompt, add_special_tokens=False))" which treats 0 as
falsy; change it to explicitly fall back only when usage.prompt_tokens is None
by using a conditional expression that returns usage.prompt_tokens if it is not
None, otherwise computes len(tokenizer.encode(prompt,
add_special_tokens=False)); reference the symbols input_sequence_length,
usage.prompt_tokens, tokenizer.encode, prompt and add_special_tokens=False when
applying this fix.
In `@src/aiperf/common/models/record_models.py`:
- Around line 884-907: The TokenCounts dataclass fields (input, input_local,
output, output_local, reasoning, reasoning_local) currently use
pydantic.Field(...) which stores FieldInfo objects when omitted; change each
field to a plain dataclass default (e.g., input: int | None = None) instead of
Field(...) and move the per-field descriptions into the class docstring or
comments so omitted fields are actual None values; update the TokenCounts
declaration (and any dataclass decorator usage) accordingly to preserve typing
and slots while ensuring downstream checks like record.token_counts.input is not
None work correctly.
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 314-323: The current per-record warning when input_token_count,
output_server, and reasoning_server are all None will spam logs and overstate
impact; change it to warn only once by tracking a flag on the parser instance
(e.g., set self._warned_no_server_usage = True after the first call) and update
the message to be accurate and softer: indicate that server-side usage info is
missing but local fallbacks (input_local, output_local, reasoning_local) may
still be available. Modify the branch that calls self.warning (the check using
input_token_count, output_server, reasoning_server) to first check and set this
instance flag and to use the revised message mentioning
input_local/output_local/reasoning_local instead of claiming all token metrics
are unavailable.
---
Outside diff comments:
In `@src/aiperf/metrics/base_record_metric.py`:
- Around line 27-33: The implementation of _parse_record currently returns
record.token_counts.input_local but the docstring/example expects server-first
precedence; update BaseRecordMetric._parse_record to return
record.token_counts.input when present and fall back to
record.token_counts.input_local otherwise (preserving behavior similar to
InputSequenceLengthMetric), ensuring you reference
ParsedResponseRecord.token_counts.input and .input_local and keep the same
return type (int) and signature.
In `@src/aiperf/metrics/types/usage_diff_metrics.py`:
- Around line 199-262: The exported metric UsageDiscrepancyCountMetric
(tag/header) no longer matches its implementation since _parse_record only
checks UsagePromptTokensDiffMetric; either restore the original generic behavior
by including completion/reasoning diff metrics in required_metrics and checking
all diffs (e.g., take max(prompt, completion, reasoning) > threshold or
increment if any exceed threshold in UsageDiscrepancyCountMetric._parse_record),
or make this a prompt-specific metric by renaming the tag/header (e.g.,
tag="usage_prompt_discrepancy_count", header="Usage Prompt Discrepancy Count",
short_header) and leave the original UsageDiscrepancyCountMetric intact for
backward compatibility (create a new class UsagePromptDiscrepancyCountMetric
referencing UsagePromptTokensDiffMetric if you choose rename).
---
Nitpick comments:
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 325-356: The current broad excepts around
compute_input_token_count and the block that calls get_tokenizer,
_parse_output_and_reasoning_texts and _compute_token_count swallow
parser/regression errors; narrow the scope so only tokenizer-related failures
are best-effort. For compute_input_token_count, remove the blanket except or
catch and re-raise non-tokenizer errors so only tokenizer load/encode errors get
logged; for the output path, call _parse_output_and_reasoning_texts outside the
tokenizer-specific try and let its exceptions propagate, then wrap only
get_tokenizer and _compute_token_count calls in a try that catches
tokenizer/encoding errors (and logs/warns) while re-raising any other
exceptions. Reference functions: compute_input_token_count, get_tokenizer,
_parse_output_and_reasoning_texts, _compute_token_count.
In `@tests/unit/metrics/test_input_sequence_length_metric.py`:
- Around line 25-141: Rename the test methods to follow the repo convention
test_<function>_<scenario>_<expected> (e.g., rename
test_server_metric_parse_record -> test_parse_record_server_returns_or_raises,
test_server_metric_ignores_local ->
test_parse_record_server_ignores_local_raises, test_server_metric_flags ->
test_flags_server_contains_expected, and similarly for the local-metric methods)
and add full type hints to all test function signatures (annotate parameters and
add "-> None" return type) for the methods in
TestInputSequenceLengthServerMetric, TestInputSequenceLengthLocalMetric and any
other newly added tests such as test_input_sequence_length_prefers_server and
test_input_sequence_length_both_none_raises so they conform to the project's
naming and typing conventions.
In `@tests/unit/metrics/test_reasoning_token_count.py`:
- Around line 106-178: The test function names in
TestReasoningTokenCountServerMetric and TestReasoningTokenCountLocalMetric do
not follow the project's test naming convention; rename each test method to the
test_<function>_<scenario>_<expected> pattern (for example, rename
test_server_metric_basic ->
test_parse_record_server_value_present_returns_server_value,
test_server_metric_zero -> test_parse_record_server_value_zero_returns_zero,
test_server_metric_none_raises ->
test_parse_record_server_value_none_raises_NoMetricValue, and similarly rename
the local ones: test_local_metric_basic ->
test_parse_record_local_value_present_returns_local_value,
test_local_metric_zero -> test_parse_record_local_value_zero_returns_zero,
test_local_metric_none_raises ->
test_parse_record_local_value_none_raises_NoMetricValue) so the names align with
the guideline; keep the classes ReasoningTokenCountServerMetric and
ReasoningTokenCountLocalMetric and their assertions unchanged.
In `@tests/unit/metrics/test_usage_diff_metrics.py`:
- Around line 478-634: The test methods in TestUsageCompletionTokensDiffMetric
and TestUsageReasoningTokensDiffMetric lack type hints and use generic names;
rename each method to follow the repo convention
test_<function>_<scenario>_<expected> (e.g., replace test_exact_match with
test_parse_record_exact_match_returns_zero for UsageCompletionTokensDiffMetric
and analogous names for UsageReasoningTokensDiffMetric) and add full type
annotations to all test method signatures (include parameter types like self and
return type -> None) so every def in those classes has proper typing; keep the
existing assertions and metric/class references
(UsageCompletionTokensDiffMetric, UsageReasoningTokensDiffMetric, parse_record,
MetricRecordDict, NoMetricValue, MetricFlags) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1258b632-f717-4506-8d4d-20d744ea73e8
📒 Files selected for processing (34)
- docs/cli-options.md
- docs/metrics-reference.md
- src/aiperf/common/config/cli_parameter.py
- src/aiperf/common/config/config_defaults.py
- src/aiperf/common/config/endpoint_config.py
- src/aiperf/common/config/tokenizer_config.py
- src/aiperf/common/config/user_config.py
- src/aiperf/common/enums/metric_enums.py
- src/aiperf/common/models/model_endpoint_info.py
- src/aiperf/common/models/record_models.py
- src/aiperf/common/tokenizer_validator.py
- src/aiperf/endpoints/openai_chat.py
- src/aiperf/endpoints/openai_completions.py
- src/aiperf/exporters/console_usage_discrepancy_exporter.py
- src/aiperf/metrics/base_record_metric.py
- src/aiperf/metrics/types/input_sequence_length_metric.py
- src/aiperf/metrics/types/output_sequence_length_metric.py
- src/aiperf/metrics/types/output_token_count.py
- src/aiperf/metrics/types/reasoning_token_count.py
- src/aiperf/metrics/types/usage_diff_metrics.py
- src/aiperf/post_processors/base_metrics_processor.py
- src/aiperf/records/inference_result_parser.py
- src/aiperf/records/record_processor_service.py
- tests/integration/test_use_server_token_counts.py
- tests/unit/endpoints/test_openai_chat_completions.py
- tests/unit/endpoints/test_openai_completions.py
- tests/unit/exporters/test_console_usage_discrepancy_exporter.py
- tests/unit/metrics/conftest.py
- tests/unit/metrics/test_input_sequence_length_metric.py
- tests/unit/metrics/test_output_sequence_length_metric.py
- tests/unit/metrics/test_output_token_count.py
- tests/unit/metrics/test_reasoning_token_count.py
- tests/unit/metrics/test_usage_diff_metrics.py
- tests/unit/records/test_inference_result_parser.py
💤 Files with no reviewable changes (3)
- src/aiperf/post_processors/base_metrics_processor.py
- src/aiperf/common/enums/metric_enums.py
- src/aiperf/common/tokenizer_validator.py
🚧 Files skipped from review as they are similar to previous changes (8)
- src/aiperf/records/record_processor_service.py
- src/aiperf/endpoints/openai_chat.py
- src/aiperf/common/config/config_defaults.py
- tests/unit/metrics/conftest.py
- src/aiperf/common/config/cli_parameter.py
- tests/unit/endpoints/test_openai_completions.py
- src/aiperf/metrics/types/reasoning_token_count.py
- tests/unit/exporters/test_console_usage_discrepancy_exporter.py
633b7bd to e2c3be0
Q&A: Token Count Sources and Defaults
Q: Where do token counts come from?
There are two sources for every token count:
- Server-reported: `token_counts.input`, `.output`, `.reasoning` (from `usage` in the API response)
- Client-local: `token_counts.input_local`, `.output_local`, `.reasoning_local`

Q: Which source do console metrics (ISL, OSL) prefer?
Both ISL and OSL prefer the server-reported value and fall back to the client-local value:
ISL = token_counts.input ?? token_counts.input_local
OSL = (token_counts.output ?? output_local) + (token_counts.reasoning ?? reasoning_local)
This means console numbers match what the server actually saw, whenever that data is available.
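The preference-with-fallback rules above can be sketched in Python; the `TokenCounts` dataclass and the helper functions are illustrative, not aiperf's actual classes:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenCounts:
    # Server-reported values from the response's `usage` block (may be absent)
    input: Optional[int] = None
    output: Optional[int] = None
    reasoning: Optional[int] = None
    # Client-local values from local tokenization (may be skipped)
    input_local: Optional[int] = None
    output_local: Optional[int] = None
    reasoning_local: Optional[int] = None


def coalesce(server: Optional[int], local: Optional[int]) -> Optional[int]:
    """Prefer the server-reported count, fall back to the client-local count."""
    return server if server is not None else local


def isl(tc: TokenCounts) -> Optional[int]:
    return coalesce(tc.input, tc.input_local)


def osl(tc: TokenCounts) -> Optional[int]:
    out = coalesce(tc.output, tc.output_local)
    rea = coalesce(tc.reasoning, tc.reasoning_local)
    if out is None and rea is None:
        return None  # both sources unavailable -> NoMetricValue
    return (out or 0) + (rea or 0)


# Server reported input and output but not reasoning; local reasoning fills the gap.
tc = TokenCounts(input=12, output=40, reasoning_local=8)
print(isl(tc), osl(tc))  # → 12 48
```

Note that only a per-field fallback is needed: each field coalesces independently, so a response with partial `usage` still mixes server and local values.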
Q: What changed with `--tokenize-input` / `--no-tokenize-input`?

Previously, `input_local` was always computed for every request, even when the server already reported `prompt_tokens` and the local value was never used for anything visible.

Now the behaviour depends on the input source:

| Input source | Default flag | `input_local` computed? |
| --- | --- | --- |
| Synthetic prompts | `--tokenize-input` (on) | Always |
| User-provided datasets (`--custom-dataset-type`, `--public-dataset`) | `--no-tokenize-input` (off) | Only as a fallback when the server doesn't report `prompt_tokens` |

Passing either flag explicitly overrides these defaults.
--tokenize-output?They now follow the same pattern — local tokenization is a fallback by default, opt-in for always-on:
--tokenize-output--tokenize-outputcomputesoutput_local+reasoning_localalways--tokenize-input--tokenize-inputcomputesinput_localalwaysprompt_tokens→ compute locallyQ: Will this break ISL on the console?
No. When
input_localis skipped, the server'sprompt_tokensis used for ISL. If the server doesn't reportprompt_tokens, the fallback kicks in and computesinput_localanyway. ISL only becomesNoMetricValueif both sources are unavailable — same as before.
Q: What about error records?
Error records have no server response to check, so there's no fallback path. When `tokenize_input=False`, error records get `input_local=None`. When `tokenize_input=True`, input tokenization is attempted (with exceptions suppressed). This matches how `--tokenize-output` handles error records.

Q: What's the performance impact?
For user-provided input datasets with servers that report `prompt_tokens`, this skips one `tokenizer.encode()` call per request. The tokenizer still loads (it's needed for output fallback), but the per-request input encoding overhead is eliminated.
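Returning to the error-record behaviour described above, "attempted with exceptions suppressed" can be sketched like this; the `encode` callable stands in for a real tokenizer and is an illustrative assumption, not aiperf's actual API:

```python
import contextlib
from typing import Callable, Optional


def tokenize_error_record_input(
    text: str,
    tokenize_input: bool,
    encode: Callable[[str], list],
) -> Optional[int]:
    """Best-effort input_local for an error record.

    Any failure in encode() (e.g. tokenizer not loaded) leaves input_local
    as None rather than turning one error record into a crash.
    """
    if not tokenize_input:
        return None  # tokenize_input=False: error records get input_local=None
    with contextlib.suppress(Exception):
        return len(encode(text))
    return None  # reached only if encode() raised


# A toy whitespace "tokenizer" as the encode callable.
print(tokenize_error_record_input("hello world", True, str.split))   # → 2
print(tokenize_error_record_input("hello world", False, str.split))  # → None
```

Because error records never have a server `usage` block, this suppressed local attempt is the only way they can get an input token count at all.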
Summary by CodeRabbit
New Features
Bug Fixes & Improvements
Documentation