
feat: Implement preference for server token counts #716

Open
matthewkotila wants to merge 5 commits into main from mkotila/aip-758-implement-preference-for-server-token-counts

Conversation

@matthewkotila
Contributor

@matthewkotila matthewkotila commented Feb 26, 2026

Q&A: Token Count Sources and Defaults

Q: Where do token counts come from?

There are two sources for every token count:

| Source | Field suffix | How it works |
| --- | --- | --- |
| Server | `token_counts.input`, `.output`, `.reasoning` | Extracted from `usage` in the API response |
| Client (local) | `token_counts.input_local`, `.output_local`, `.reasoning_local` | Computed by running the tokenizer locally |

Q: Which source do console metrics (ISL, OSL) prefer?

Both ISL and OSL prefer the server-reported value and fall back to the client-local value:

```
ISL = token_counts.input ?? token_counts.input_local
OSL = (token_counts.output ?? output_local) + (token_counts.reasoning ?? reasoning_local)
```

This means console numbers match what the server actually saw, whenever that data is available.

Q: What changed with --tokenize-input / --no-tokenize-input?

Previously, input_local was always computed for every request — even when the server already reported prompt_tokens and the local value was never used for anything visible.

Now the behaviour depends on the input source:

| Input source | Default | `input_local` computed? |
| --- | --- | --- |
| Synthetic (shaped ISL) | `--tokenize-input` (on) | Always — needed for ISL shaping validation |
| User-provided (`--custom-dataset-type`, `--public-dataset`) | `--no-tokenize-input` (off) | Only as fallback when server doesn't report `prompt_tokens` |
| Explicit `--tokenize-input` | on | Always |
| Explicit `--no-tokenize-input` | off | Only as fallback |

Q: How does this compare to --tokenize-output?

They now follow the same pattern — local tokenization is a fallback by default, opt-in for always-on:

| Flag | Default | "Always" mode | Fallback mode |
| --- | --- | --- | --- |
| `--tokenize-output` | off | `--tokenize-output` computes `output_local` + `reasoning_local` always | Server doesn't report → compute locally |
| `--tokenize-input` | on for synthetic, off for user-provided | `--tokenize-input` computes `input_local` always | Server doesn't report `prompt_tokens` → compute locally |

Q: Will this break ISL on the console?

No. When `input_local` is skipped, the server's `prompt_tokens` is used for ISL. If the server doesn't report `prompt_tokens`, the fallback kicks in and computes `input_local` anyway. ISL only becomes `NoMetricValue` if both sources are unavailable — same as before.

Q: What about error records?

Error records have no server response to check, so there's no fallback path. When `tokenize_input=False`, error records get `input_local=None`. When `tokenize_input=True`, input tokenization is attempted (with exceptions suppressed). This matches how `--tokenize-output` handles error records.
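A minimal sketch of the error-record rule just described, assuming a tokenizer object with an `encode()` method; the helper name is hypothetical:

```python
from contextlib import suppress

def input_local_for_error_record(tokenize_input, tokenizer, prompt):
    """Error records have no server response, so there is no fallback path:
    local tokenization runs only when tokenize_input is on, and any
    tokenizer failure is suppressed, leaving the count as None."""
    if not tokenize_input:
        return None
    with suppress(Exception):
        return len(tokenizer.encode(prompt))
    return None
```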

Q: What's the performance impact?

For user-provided input datasets with servers that report `prompt_tokens`, this skips one `tokenizer.encode()` call per request. The tokenizer still loads (it's needed for output fallback), but the per-request input encoding overhead is eliminated.

Summary by CodeRabbit

  • New Features

    • Added --stream-usage / --no-stream-usage, --tokenize-input / --no-tokenize-input, and --tokenize-output options; new server/local variants for input/output/reasoning token metrics.
  • Bug Fixes & Improvements

    • --use-server-token-count deprecated; both server and client token counts are computed by default.
    • Streaming will auto-include usage when enabled (unless --no-stream-usage).
    • Usage-discrepancy logic simplified to focus on prompt diffs with improved missing-value handling.
  • Documentation

    • CLI and metrics docs and exporter warnings updated to reflect new flags and behavior.

@github-actions

github-actions bot commented Feb 26, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e2c3be076545049348d1daf58a9f7038240f3be6

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e2c3be076545049348d1daf58a9f7038240f3be6

Last updated for commit: e2c3be0

@coderabbitai

coderabbitai bot commented Feb 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR deprecates use_server_token_count, makes the system compute and track both server-reported and client-side token counts by default, adds stream_usage and tokenization CLI flags, extends TokenCounts with local fields, and introduces server/local metric variants and corresponding tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Config & CLI**<br>`src/aiperf/common/config/endpoint_config.py`, `src/aiperf/common/config/tokenizer_config.py`, `src/aiperf/common/config/user_config.py`, `src/aiperf/common/config/config_defaults.py`, `src/aiperf/common/config/cli_parameter.py` | Added `stream_usage` and `STREAM_USAGE` default; deprecated `use_server_token_count` description; added `tokenize_input`/`tokenize_output` CLI flags and `_auto_configure_tokenize_input` hook; adjusted `CLIParameter` defaulting. |
| **Models & Enums**<br>`src/aiperf/common/models/model_endpoint_info.py`, `src/aiperf/common/models/record_models.py`, `src/aiperf/common/enums/metric_enums.py` | Removed `use_server_token_count`; added `stream_usage`; extended `TokenCounts` with `input_local`, `output_local`, `reasoning_local`; added `token_counts` to metadata/parsed-response models; removed `MetricFlags.USAGE_DIFF_ONLY`. |
| **Tokenization & Parsing**<br>`src/aiperf/common/tokenizer_validator.py`, `src/aiperf/records/inference_result_parser.py`, `src/aiperf/records/record_processor_service.py` | Removed server-count-based tokenizer skip; unified server/local token computation via `_compute_token_counts`; added per-endpoint tokenize gating and propagated `token_counts` into metadata. |
| **Endpoint payloads**<br>`src/aiperf/endpoints/openai_chat.py`, `src/aiperf/endpoints/openai_completions.py` | Switched auto-inclusion of streaming `include_usage` from `use_server_token_count` to `stream_usage`; auto-adds/augments `stream_options.include_usage` when streaming and enabled. |
| **Metrics — core changes**<br>`src/aiperf/metrics/base_record_metric.py`, `src/aiperf/metrics/types/...` | Metrics now prefer server-reported values with client-side fallbacks; introduced server/local-specific metric classes for input, output, and reasoning token counts; parsing updated to use `token_counts.*` and raise `NoMetricValue` where appropriate. |
| **Usage diff & postprocessing**<br>`src/aiperf/metrics/types/usage_diff_metrics.py`, `src/aiperf/post_processors/base_metrics_processor.py`, `src/aiperf/exporters/console_usage_discrepancy_exporter.py` | Reworked diff metrics to compare server vs client (`*_local`) token fields; prompt-diff drives discrepancy count; removed logic that disabled usage-diff when server counts used; updated exporter wording. |
| **Parser & record flow tests**<br>`tests/unit/records/test_inference_result_parser.py`, `tests/integration/test_use_server_token_counts.py` | Updated tests and fixtures to the dual-count model; removed `--use-server-token-count` from invocations; adjusted assertions to expect `*_local` fields and updated diff metric expectations. |
| **Endpoint tests**<br>`tests/unit/endpoints/test_openai_chat_completions.py`, `tests/unit/endpoints/test_openai_completions.py` | Removed `use_server_token_count` parameterization; added tests verifying `stream_options` omitted when `stream_usage=False`; updated expectations for auto-configured `stream_options.include_usage`. |
| **Metrics tests & fixtures**<br>`tests/unit/metrics/conftest.py`, `tests/unit/metrics/test_input_sequence_length_metric.py`, `tests/unit/metrics/test_output_sequence_length_metric.py`, `tests/unit/metrics/test_output_token_count.py`, `tests/unit/metrics/test_reasoning_token_count.py`, `tests/unit/metrics/test_usage_diff_metrics.py` | Extended fixtures to accept local/server token params; added tests for server/local metric variants, fallbacks, and `NoMetricValue` behavior; updated helpers and expectations for new `TokenCounts` shape. |
| **Docs**<br>`docs/cli-options.md`, `docs/metrics-reference.md` | Documented deprecation of `--use-server-token-count` as no-op; added `--stream-usage`/`--no-stream-usage`, `--tokenize-input`/`--no-tokenize-input`, `--tokenize-output`; updated metrics reference for server/local variants. |
| **Minor**<br>`src/aiperf/metrics/base_record_metric.py` | Docstring/example updated to reference `token_counts.input_local`. |

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐇 I hopped through counts both near and far,
Server whispers and local beats in tow,
stream_usage lights the streaming star,
Token twins march in tidy row,
Hooray—both tallies now softly glow! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'feat: Implement preference for server token counts' accurately captures the main change: preferring server-reported token counts over client-side tokenization as the primary source. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 91.30% which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can generate a title for your PR based on the changes with custom instructions.

Set the reviews.auto_title_instructions setting to generate a title for your PR based on the changes in the PR with custom instructions.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/metrics_reference.md (1)

918-918: ⚠️ Potential issue | 🟡 Minor

Update documentation for deprecated flag.

Line 918 still references --use-server-token-count as a solution for tokenizer mismatch, but this flag is now deprecated and a no-op. Consider updating this guidance to reference the new dual-source approach or remove the recommendation.

📝 Suggested update
-- If discrepancy is due to tokenizer mismatch between client and server, use `--use-server-token-count`.
+- If discrepancy is due to tokenizer mismatch between client and server, compare the server-reported and client-side values in the exported file-only metrics (`_server` vs `_local` variants).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/metrics_reference.md` at line 918, Replace the deprecated guidance that
recommends using the `--use-server-token-count` flag: remove or change the
sentence referencing `--use-server-token-count` and instead recommend the new
dual-source approach (e.g., compare client and server token counts and surface
both sources) or point users to the current flag/setting that replaced it;
update any nearby wording to clarify that `--use-server-token-count` is a no-op
and link to the dual-source documentation or migration note so readers know the
correct current procedure.
🧹 Nitpick comments (3)
src/aiperf/metrics/types/output_token_count.py (1)

53-59: Inconsistent zero-handling between metrics.

OutputTokenCountMetric uses truthy checks (if record.token_counts.output:) which treats 0 as missing, while OutputTokenCountServerMetric and OutputTokenCountLocalMetric use is not None checks which correctly allow 0.

This inconsistency could cause confusion: a record with output=0 would raise NoMetricValue in this metric but return 0 in the server-specific variant.

Consider aligning the behavior:

♻️ Proposed fix for consistent zero-handling
-        if record.token_counts.output:
+        if record.token_counts.output is not None:
             return record.token_counts.output

-        if record.token_counts.output_local:
+        if record.token_counts.output_local is not None:
             return record.token_counts.output_local
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/metrics/types/output_token_count.py` around lines 53 - 59, The
OutputTokenCountMetric currently uses truthy checks that treat 0 as missing;
change the conditional checks in the OutputTokenCountMetric to use "is not None"
for record.token_counts.output and record.token_counts.output_local (same
semantics as OutputTokenCountServerMetric/OutputTokenCountLocalMetric) so that
zero values are returned instead of raising NoMetricValue; keep the final raise
NoMetricValue(...) if both are None.
tests/unit/records/test_inference_result_parser.py (2)

207-323: Make tokenization mode explicit in these tests to prevent config-coupled flakiness.

These cases assert input_local/encode() behavior that now depends on parser flags. Setting tokenize_input/tokenize_output explicitly in each test keeps them deterministic if fixture defaults change.

Proposed adjustment pattern
 async def test_server_overlay_on_client(
     self, server_token_parser, request_record, spy_tokenizer
 ):
+    server_token_parser.tokenize_input = True
+    server_token_parser.tokenize_output = False
     server_token_parser.get_tokenizer = AsyncMock(return_value=spy_tokenizer)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/records/test_inference_result_parser.py` around lines 207 - 323,
These tests rely on client-side tokenization behavior, so make tokenization mode
explicit by setting the parser flags before calling process_valid_record; for
each test (e.g., test_server_overlay_on_client,
test_no_server_usage_falls_back_to_client, test_partial_server_usage,
test_streaming_uses_last_server_value, test_server_output_overrides_client) set
the parser instance's tokenize_input = True and tokenize_output = True (on
server_token_parser or setup_inference_parser as applicable) so
input_local/output_local and spy_tokenizer.encode() behavior is deterministic
regardless of fixture defaults.

38-38: Fixture docstring is stale after tokenize gating changes.

“both counts are always computed” no longer matches current behavior with tokenize_input / tokenize_output gating. Please reword this to avoid misleading future test intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/records/test_inference_result_parser.py` at line 38, Update the
docstring on the parser fixture in test_inference_result_parser.py (the "parser"
fixture) to remove the stale claim that "both counts are always computed" and
instead state that the fixture mirrors the default parser behavior with counts
computed conditionally according to the tokenize_input and tokenize_output
gating flags; ensure the wording clearly indicates that counts may be included
or omitted depending on those tokenize_* settings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 279-283: The warning is emitted when only input_token_count and
output_server are missing but ignores a present reasoning_server; change the
predicate so the warning only fires when all token sources are absent. Update
the condition in inference_result_parser.py to check that input_token_count is
None AND output_server is None AND reasoning_server is None before calling
self.warning, referencing the existing variables input_token_count,
output_server, reasoning_server and the self.warning call to locate the code.
- Around line 285-309: The local tokenization steps (calls to await
self.get_tokenizer, self._compute_token_count, and compute_input_token_count)
must be made best-effort and must not raise to the caller; wrap tokenizer load
and token counting in a try/except that catches encoding/load errors, logs a
warning, and leaves input_local/output_local/reasoning_local as None so any
server-provided counts (input_token_count, output_server, reasoning_server) are
preserved; specifically update the block that calls compute_input_token_count,
await self.get_tokenizer(request_record.model_name),
self._parse_output_and_reasoning_texts, and self._compute_token_count to catch
exceptions, avoid re-raising, and proceed with existing server-parsed results
rather than converting the record to an error in parse_request_record.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eab7190 and 823c4c2500f820b3a842e93728a7acf7ba5d5ba1.

📒 Files selected for processing (32)
  • docs/cli_options.md
  • docs/metrics_reference.md
  • src/aiperf/common/config/endpoint_config.py
  • src/aiperf/common/config/tokenizer_config.py
  • src/aiperf/common/config/user_config.py
  • src/aiperf/common/enums/metric_enums.py
  • src/aiperf/common/models/model_endpoint_info.py
  • src/aiperf/common/models/record_models.py
  • src/aiperf/common/tokenizer_validator.py
  • src/aiperf/endpoints/openai_chat.py
  • src/aiperf/endpoints/openai_completions.py
  • src/aiperf/exporters/console_usage_discrepancy_exporter.py
  • src/aiperf/metrics/base_record_metric.py
  • src/aiperf/metrics/types/input_sequence_length_metric.py
  • src/aiperf/metrics/types/output_sequence_length_metric.py
  • src/aiperf/metrics/types/output_token_count.py
  • src/aiperf/metrics/types/reasoning_token_count.py
  • src/aiperf/metrics/types/usage_diff_metrics.py
  • src/aiperf/post_processors/base_metrics_processor.py
  • src/aiperf/records/inference_result_parser.py
  • src/aiperf/records/record_processor_service.py
  • tests/integration/test_use_server_token_counts.py
  • tests/unit/endpoints/test_openai_chat_completions.py
  • tests/unit/endpoints/test_openai_completions.py
  • tests/unit/exporters/test_console_usage_discrepancy_exporter.py
  • tests/unit/metrics/conftest.py
  • tests/unit/metrics/test_input_sequence_length_metric.py
  • tests/unit/metrics/test_output_sequence_length_metric.py
  • tests/unit/metrics/test_output_token_count.py
  • tests/unit/metrics/test_reasoning_token_count.py
  • tests/unit/metrics/test_usage_diff_metrics.py
  • tests/unit/records/test_inference_result_parser.py
💤 Files with no reviewable changes (4)
  • src/aiperf/common/enums/metric_enums.py
  • src/aiperf/common/models/model_endpoint_info.py
  • src/aiperf/common/tokenizer_validator.py
  • src/aiperf/post_processors/base_metrics_processor.py

Contributor

@ajcasagrande ajcasagrande left a comment


Code Review

Three issues found after verifying against the actual code:

1. Truthy vs None check inconsistency in fallback logic

OutputTokenCountMetric uses truthy checks (excludes 0), while ReasoningTokenCountMetric and InputSequenceLengthMetric use explicit is not None checks (includes 0):

output_token_count.py:53-57:

```python
if record.token_counts.output:        # 0 falls through
    return record.token_counts.output
if record.token_counts.output_local:  # 0 falls through
    return record.token_counts.output_local
```

reasoning_token_count.py:55-59:

```python
if record.token_counts.reasoning is not None:  # 0 returns
    return record.token_counts.reasoning
if record.token_counts.reasoning_local is not None:
    return record.token_counts.reasoning_local
```

If a server reports output=0 (valid — model generated no output tokens), OutputTokenCountMetric skips it and falls through to output_local, then raises NoMetricValue if that's also falsy. Meanwhile reasoning=0 correctly returns 0. InputSequenceLengthMetric also uses is not None, making OutputTokenCountMetric the odd one out.
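A toy reproduction of the inconsistency, with `ValueError` standing in for `NoMetricValue`:

```python
def resolve_truthy(server, local):
    # Current OutputTokenCountMetric pattern: `if value:` treats 0 as missing.
    if server:
        return server
    if local:
        return local
    raise ValueError("NoMetricValue")

def resolve_is_not_none(server, local):
    # Server/local variant pattern: `is not None` lets a valid 0 through.
    if server is not None:
        return server
    if local is not None:
        return local
    raise ValueError("NoMetricValue")

print(resolve_is_not_none(0, 17))  # 0  (server's zero is respected)
try:
    resolve_truthy(0, None)
except ValueError:
    print("truthy check dropped a valid 0")
```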

2. Breaking metric tag rename

usage_diff_metrics.py:103 — tag changed from usage_completion_tokens_diff_pct to usage_output_tokens_diff_pct. Any tooling parsing profile_export.jsonl for the old tag will silently find nothing. Worth calling out as a breaking change in the PR description or release notes.

3. Misleading test rename

test_input_sequence_length_metric.py:34-47:

```python
def test_input_sequence_length_parse_record_client_fallback(
    self, input_tokens, should_raise
):
    """Test input sequence length falls back to client-side when no server count."""
    record = create_record(input_tokens=input_tokens)
```

`create_record(input_tokens=15)` sets `TokenCounts(input=15, input_local=None)`; `input` is the server field. This test exercises the server-preferred path, not client fallback. The name and docstring don't match what the test does.

Contributor

@ajcasagrande ajcasagrande left a comment


Deep Design Analysis

Six additional issues found after tracing the reasoning behind the changes through the codebase:

A. OSL can mix server and client token sources

output_sequence_length_metric.py:51-65 resolves output and reasoning independently through separate server→local fallback chains. When a server reports completion_tokens but not reasoning_tokens (standard for non-reasoning models), OSL sums server output + client reasoning. This propagates into OSLMismatchDiffMetric, OutputTokenThroughputMetric, TotalTokenThroughputMetric, and OSLMismatchCountMetric. The practical impact is narrow — it requires a model producing reasoning content (e.g. <think> blocks) where the server doesn't account for reasoning_tokens separately — but when it hits, the resulting OSL doesn't correspond to either system's view of reality.

B. Error records lose input tokenization for user-provided datasets

inference_result_parser.py:127,172 — The error record paths gate input tokenization on self.tokenize_input, which is auto-set to False for user-provided datasets (user_config.py:994). Error records have no server response, so token_counts.input is always None. Unlike the valid-record path (line 291) which has an elif input_token_count is None fallback, error records have no fallback. Both input and input_local are None → ErrorInputSequenceLengthMetric always raises NoMetricValue. The old code had no tokenize_input gate on error records.

C. stream_options.include_usage injected into all streaming requests

openai_chat.py:64-72, openai_completions.py:55-63 — Previously gated behind --use-server-token-count, now unconditional for all streaming endpoints. This changes the HTTP payload sent to the server, which may affect latency measurements (extra final usage chunk) or cause errors on servers that don't support stream_options. Users can work around it with --extra-inputs '{"stream_options": {"include_usage": false}}', but there is no notification of this change.
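The auto-inclusion and its per-request workaround can be sketched as follows. This is a simplified stand-in for the endpoint code, not the actual implementation; the function name is hypothetical:

```python
def build_stream_options(payload: dict, streaming: bool, stream_usage: bool) -> dict:
    """When streaming with stream_usage enabled, add
    stream_options.include_usage — but setdefault preserves an explicit
    user-supplied value, which is what makes the --extra-inputs
    workaround described above possible."""
    if streaming and stream_usage:
        payload.setdefault("stream_options", {}).setdefault("include_usage", True)
    return payload

# User opts out per request via extra inputs; their False survives:
payload = {"stream": True, "stream_options": {"include_usage": False}}
print(build_stream_options(payload, True, True))
```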

D. Prompt diff metrics silently disappear for user-provided datasets

For user-provided datasets: `tokenize_input=False` (auto) + server reports `prompt_tokens` → `input_local` is never computed (inference_result_parser.py:288-289 — the `self.tokenize_input` branch is skipped, and the fallback on line 291 doesn't fire because `input_token_count` is not None) → UsagePromptTokensDiffMetric raises NoMetricValue (line 69-72) → UsageDiscrepancyCountMetric cascades → console discrepancy exporter silently returns with a debug log. ISL displays on the console from server values, but the system has no ability to validate those values are correct.

E. Output/reasoning diff metrics are permanently dormant by default

UsageOutputTokensDiffMetric and UsageReasoningTokensDiffMetric require output_local/reasoning_local (lines 126, 184), which are only populated when --tokenize-output is passed or the server doesn't report counts (inference_result_parser.py:306-309). In the common case (server reports completion tokens, --tokenize-output not passed), these metrics always raise NoMetricValue. Only a TRACE-level log is emitted. Nothing tells the user these metrics exist or how to activate them.

F. UsageDiscrepancyCountMetric missing TOKENIZES_INPUT_ONLY flag

UsageDiscrepancyCountMetric has flags = NO_CONSOLE | NO_INDIVIDUAL_RECORDS (line 236). Its dependency UsagePromptTokensDiffMetric has TOKENIZES_INPUT_ONLY (line 48). For non-tokenizing endpoints, the dependency is filtered out at setup time by base_metrics_processor.py:get_filters, but the count metric passes the filter because it lacks the flag. At runtime it fails silently when it can't find the dependency. Functionally equivalent to being filtered, but semantically wrong — the metric shouldn't be in the applicable set at all.

@matthewkotila force-pushed the mkotila/aip-758-implement-preference-for-server-token-counts branch 2 times, most recently from a8d63f1 to a912ba6, on March 7, 2026 at 02:40.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/metrics_reference.md (1)

918-918: ⚠️ Potential issue | 🟡 Minor

Reference to deprecated --use-server-token-count flag should be removed.

Per the PR objectives, --use-server-token-count is deprecated as a no-op. This line still references it as a solution for tokenizer mismatch. Consider removing this note or updating it to reflect the new dual-source behavior where server values are preferred by default.

-  - If discrepancy is due to tokenizer mismatch between client and server, use `--use-server-token-count`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/metrics_reference.md` at line 918, Remove the deprecated recommendation
to use `--use-server-token-count`: locate the sentence that reads "If
discrepancy is due to tokenizer mismatch between client and server, use
`--use-server-token-count`" and either delete it or replace it with a short note
stating that `--use-server-token-count` is deprecated/no-op and that the system
now uses dual-source counts with server values preferred by default. Ensure the
updated wording mentions the deprecation and the new default server-preference
behavior.
♻️ Duplicate comments (2)
src/aiperf/records/inference_result_parser.py (2)

285-309: ⚠️ Potential issue | 🟠 Major

Local tokenization failures should not invalidate otherwise valid records.

In this block, tokenizer load/encode errors bubble up to parse_request_record, which converts the entire record to an error and drops responses. Since local counts are supplementary (fallback when server doesn't report), preserve valid server-parsed results when local tokenization fails.

Suggested hardening
         # Client-side input tokenization
         input_local: int | None = None
-        if not self.disable_tokenization:
-            if self.tokenize_input:
-                # Always compute (synthetic mode or explicit --tokenize-input)
-                input_local = await self.compute_input_token_count(request_record)
-            elif input_token_count is None:
-                # Fallback: server didn't report prompt tokens
-                input_local = await self.compute_input_token_count(request_record)
+        if not self.disable_tokenization and (
+            self.tokenize_input or input_token_count is None
+        ):
+            with suppress(Exception):
+                input_local = await self.compute_input_token_count(request_record)
 
         # Client-side output/reasoning tokenization
         output_local: int | None = None
         reasoning_local: int | None = None
         need_local = (
             self.tokenize_output or output_server is None or reasoning_server is None
         )
         if not self.disable_tokenization and need_local:
-            tokenizer = await self.get_tokenizer(request_record.model_name)
-            output_texts, reasoning_texts = self._parse_output_and_reasoning_texts(
-                responses
-            )
-            if self.tokenize_output or output_server is None:
-                output_local = self._compute_token_count(tokenizer, output_texts)
-            if self.tokenize_output or reasoning_server is None:
-                reasoning_local = self._compute_token_count(tokenizer, reasoning_texts)
+            with suppress(Exception):
+                tokenizer = await self.get_tokenizer(request_record.model_name)
+                output_texts, reasoning_texts = self._parse_output_and_reasoning_texts(
+                    responses
+                )
+                if self.tokenize_output or output_server is None:
+                    output_local = self._compute_token_count(tokenizer, output_texts)
+                if self.tokenize_output or reasoning_server is None:
+                    reasoning_local = self._compute_token_count(tokenizer, reasoning_texts)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/records/inference_result_parser.py` around lines 285 - 309, The
local tokenization path in inference_result_parser.py (around get_tokenizer,
_compute_token_count, and compute_input_token_count calls inside
parse_request_record) must not let tokenizer load/encode exceptions convert a
valid server-parsed record into an error; wrap the tokenizer fetching and all
local token counting (compute_input_token_count, _compute_token_count) in
try/except, catch encoding/loading errors, log the failure (including model_name
and exception) and leave input_local/output_local/reasoning_local as None so the
existing server-provided values (input_token_count, output_server,
reasoning_server) remain used; do not re-raise the exception or mutate the
record into an error state.
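The pattern the comment asks for can be sketched as a small wrapper: log the failure with the model name and return `None` so server-provided counts stay in use. The `safe_input_count` helper name is hypothetical, not part of the actual parser:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def safe_input_count(compute, request_record, model_name):
    """Best-effort local token count: never turn a valid record into an error."""
    try:
        return await compute(request_record)
    except Exception as exc:  # tokenizer load/encode failures included
        # Log model name and exception; leave the count as None so the
        # server-reported value remains the one the metrics consume.
        logger.warning(
            "local input tokenization failed for %s: %s", model_name, exc
        )
        return None
```

Unlike `contextlib.suppress(Exception)`, this variant records why the local count is missing, which the review explicitly requests.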

278-283: ⚠️ Potential issue | 🟡 Minor

No-usage warning predicate is incomplete.

The warning triggers when input_token_count and output_server are missing, even if the server reported reasoning_tokens. This could emit a misleading message when the server does provide partial usage info.

Suggested fix
-        if input_token_count is None and output_server is None:
+        if (
+            input_token_count is None
+            and output_server is None
+            and reasoning_server is None
+        ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/records/inference_result_parser.py` around lines 278 - 283, The
warning in the inference result parsing currently fires when input_token_count
and output_server are None even if the server provided reasoning_tokens; update
the predicate (in the method that calls self.warning with the message about
token usage — locate the block using input_token_count, output_server, and
reasoning_tokens in inference_result_parser.py) to also check that
reasoning_tokens is None so the warning only triggers when input_token_count,
output_server, and reasoning_tokens are all missing.
🧹 Nitpick comments (1)
src/aiperf/metrics/base_record_metric.py (1)

28-33: Consider updating the docstring example to match actual ISL behavior.

The example now shows record.token_counts.input_local, but the actual InputSequenceLengthMetric prefers server-reported record.token_counts.input and falls back to input_local. Consider updating the example to reflect the real server-preferred-with-fallback pattern:

# Prefer server, fall back to local
input_count = record.token_counts.input
if input_count is None:
    input_count = record.token_counts.input_local
return input_count

Alternatively, if the intent is to show the simplest case, add a comment noting this is for InputSequenceLengthLocalMetric specifically.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/metrics/base_record_metric.py` around lines 28 - 33, The docstring
example for InputSequenceLengthMetric is misleading because the metric prefers
server-reported token_counts.input and falls back to token_counts.input_local;
update the example in the InputSequenceLengthMetric docstring to show the
server-preferred-with-fallback pattern (use record.token_counts.input, if None
then record.token_counts.input_local) or, if the example is intentionally
local-only, change the example text to reference InputSequenceLengthLocalMetric
and add a comment stating it uses input_local only; update references in the
class/method docstring accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/aiperf/metrics/types/output_token_count.py`:
- Around line 50-59: OutputTokenCountMetric currently uses truthy checks (if
record.token_counts.output) which treats 0 as missing; change these to explicit
None checks to match ReasoningTokenCountMetric and allow legitimate zero values.
Update the logic in OutputTokenCountMetric to: first verify record.token_counts
is not None (unchanged), then use "if record.token_counts.output is not None:
return record.token_counts.output" and "if record.token_counts.output_local is
not None: return record.token_counts.output_local", otherwise raise
NoMetricValue. This preserves existing behavior while making it consistent with
ReasoningTokenCountMetric.
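The distinction matters whenever the server legitimately reports zero tokens. A quick sketch of both behaviors (helper names are illustrative only):

```python
def pick_truthy(server, local):
    # Buggy: `or` treats a server-reported 0 as missing.
    return server or local

def pick_none_check(server, local):
    # Correct: only None falls through to the local count,
    # so a legitimate zero from the server is preserved.
    return server if server is not None else local
```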

---

Outside diff comments:
In `@docs/metrics_reference.md`:
- Line 918: Remove the deprecated recommendation to use
`--use-server-token-count`: locate the sentence that reads "If discrepancy is
due to tokenizer mismatch between client and server, use
`--use-server-token-count`" and either delete it or replace it with a short note
stating that `--use-server-token-count` is deprecated/no-op and that the system
now uses dual-source counts with server values preferred by default. Ensure the
updated wording mentions the deprecation and the new default server-preference
behavior.

---

Duplicate comments:
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 285-309: The local tokenization path in inference_result_parser.py
(around get_tokenizer, _compute_token_count, and compute_input_token_count calls
inside parse_request_record) must not let tokenizer load/encode exceptions
convert a valid server-parsed record into an error; wrap the tokenizer fetching
and all local token counting (compute_input_token_count, _compute_token_count)
in try/except, catch encoding/loading errors, log the failure (including
model_name and exception) and leave input_local/output_local/reasoning_local as
None so the existing server-provided values (input_token_count, output_server,
reasoning_server) remain used; do not re-raise the exception or mutate the
record into an error state.
- Around line 278-283: The warning in the inference result parsing currently
fires when input_token_count and output_server are None even if the server
provided reasoning_tokens; update the predicate (in the method that calls
self.warning with the message about token usage — locate the block using
input_token_count, output_server, and reasoning_tokens in
inference_result_parser.py) to also check that reasoning_tokens is None so the
warning only triggers when input_token_count, output_server, and
reasoning_tokens are all missing.

---

Nitpick comments:
In `@src/aiperf/metrics/base_record_metric.py`:
- Around line 28-33: The docstring example for InputSequenceLengthMetric is
misleading because the metric prefers server-reported token_counts.input and
falls back to token_counts.input_local; update the example in the
InputSequenceLengthMetric docstring to show the server-preferred-with-fallback
pattern (use record.token_counts.input, if None then
record.token_counts.input_local) or, if the example is intentionally local-only,
change the example text to reference InputSequenceLengthLocalMetric and add a
comment stating it uses input_local only; update references in the class/method
docstring accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 745ddad1-b3a7-453e-a890-3ec9d836d7b1

📥 Commits

Reviewing files that changed from the base of the PR and between 823c4c2500f820b3a842e93728a7acf7ba5d5ba1 and a8d63f1450b6927553edfc89acc843d8403f6929.

📒 Files selected for processing (32)
  • docs/cli_options.md
  • docs/metrics_reference.md
  • src/aiperf/common/config/endpoint_config.py
  • src/aiperf/common/config/tokenizer_config.py
  • src/aiperf/common/config/user_config.py
  • src/aiperf/common/enums/metric_enums.py
  • src/aiperf/common/models/model_endpoint_info.py
  • src/aiperf/common/models/record_models.py
  • src/aiperf/common/tokenizer_validator.py
  • src/aiperf/endpoints/openai_chat.py
  • src/aiperf/endpoints/openai_completions.py
  • src/aiperf/exporters/console_usage_discrepancy_exporter.py
  • src/aiperf/metrics/base_record_metric.py
  • src/aiperf/metrics/types/input_sequence_length_metric.py
  • src/aiperf/metrics/types/output_sequence_length_metric.py
  • src/aiperf/metrics/types/output_token_count.py
  • src/aiperf/metrics/types/reasoning_token_count.py
  • src/aiperf/metrics/types/usage_diff_metrics.py
  • src/aiperf/post_processors/base_metrics_processor.py
  • src/aiperf/records/inference_result_parser.py
  • src/aiperf/records/record_processor_service.py
  • tests/integration/test_use_server_token_counts.py
  • tests/unit/endpoints/test_openai_chat_completions.py
  • tests/unit/endpoints/test_openai_completions.py
  • tests/unit/exporters/test_console_usage_discrepancy_exporter.py
  • tests/unit/metrics/conftest.py
  • tests/unit/metrics/test_input_sequence_length_metric.py
  • tests/unit/metrics/test_output_sequence_length_metric.py
  • tests/unit/metrics/test_output_token_count.py
  • tests/unit/metrics/test_reasoning_token_count.py
  • tests/unit/metrics/test_usage_diff_metrics.py
  • tests/unit/records/test_inference_result_parser.py
💤 Files with no reviewable changes (4)
  • src/aiperf/common/enums/metric_enums.py
  • src/aiperf/common/tokenizer_validator.py
  • src/aiperf/common/models/model_endpoint_info.py
  • src/aiperf/post_processors/base_metrics_processor.py
🚧 Files skipped from review as they are similar to previous changes (9)
  • tests/unit/endpoints/test_openai_completions.py
  • tests/unit/metrics/test_output_token_count.py
  • tests/unit/exporters/test_console_usage_discrepancy_exporter.py
  • src/aiperf/records/record_processor_service.py
  • tests/unit/records/test_inference_result_parser.py
  • src/aiperf/endpoints/openai_chat.py
  • src/aiperf/common/config/endpoint_config.py
  • tests/unit/metrics/test_input_sequence_length_metric.py
  • tests/unit/metrics/conftest.py

@matthewkotila matthewkotila force-pushed the mkotila/aip-758-implement-preference-for-server-token-counts branch from a912ba6 to 2baa419 on March 11, 2026 at 06:34

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/aiperf/common/models/record_models.py (1)

880-907: ⚠️ Potential issue | 🔴 Critical

Using Pydantic Field() in a plain @dataclass doesn't work as intended.

TokenCounts is decorated with @dataclass(slots=True), but the fields use Pydantic's Field() for defaults. In a plain dataclass, Field(default=None, ...) returns a FieldInfo object which becomes the actual default value, not None. This means TokenCounts() would have FieldInfo objects as field values instead of None.

For dataclasses, use plain defaults with docstrings for documentation (consistent with other dataclasses in this file like SSEField, TextResponse, etc.):

🐛 Proposed fix
 @dataclass(slots=True)
 class TokenCounts:
     """Token counts for a record."""

-    input: int | None = Field(
-        default=None,
-        description="The server-reported prompt token count from the API usage field. If None, the server did not report prompt tokens.",
-    )
-    input_local: int | None = Field(
-        default=None,
-        description="The number of input tokens computed by the client-side tokenizer. If None, the number of tokens could not be calculated.",
-    )
-    output: int | None = Field(
-        default=None,
-        description="The server-reported output token count (completion minus reasoning). If None, the server did not report completion tokens.",
-    )
-    output_local: int | None = Field(
-        default=None,
-        description="The number of output tokens computed by the client-side tokenizer.",
-    )
-    reasoning: int | None = Field(
-        default=None,
-        description="The server-reported reasoning token count. If None, the server did not report reasoning tokens.",
-    )
-    reasoning_local: int | None = Field(
-        default=None,
-        description="The number of reasoning tokens computed by the client-side tokenizer.",
-    )
+    input: int | None = None
+    """The server-reported prompt token count from the API usage field. None if not reported."""
+
+    input_local: int | None = None
+    """The number of input tokens computed by the client-side tokenizer. None if not calculated."""
+
+    output: int | None = None
+    """The server-reported output token count (completion minus reasoning). None if not reported."""
+
+    output_local: int | None = None
+    """The number of output tokens computed by the client-side tokenizer."""
+
+    reasoning: int | None = None
+    """The server-reported reasoning token count. None if not reported."""
+
+    reasoning_local: int | None = None
+    """The number of reasoning tokens computed by the client-side tokenizer."""

As per coding guidelines: "Add docstrings on dataclass fields" and use plain defaults for dataclasses.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/common/models/record_models.py` around lines 880 - 907, The
TokenCounts dataclass is using Pydantic's Field() as defaults causing FieldInfo
objects to become actual default values; replace all Field(...) usages inside
the TokenCounts class with plain Python defaults (e.g., input: int | None =
None) and move the descriptive text into per-field docstrings or the class
docstring to match other dataclasses (reference TokenCounts and its attributes
input, input_local, output, output_local, reasoning, reasoning_local) so
instances of TokenCounts get real None defaults instead of FieldInfo objects.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/metrics-reference.md`:
- Line 61: Update the TOC entry that links to the "Usage Completion Tokens Diff
%" heading by replacing the incorrect anchor "#usage-output-tokens-diff-" with
the correct anchor "#usage-completion-tokens-diff-"; locate the text containing
the malformed link and change the fragment to match the heading text "Usage
Completion Tokens Diff %".

In `@tests/unit/metrics/test_output_sequence_length_metric.py`:
- Around line 81-131: The OutputSequenceLengthMetric.parse_record currently
chooses all-server vs all-local wholesale; change it to compute per-field
fallbacks: pick output_val = token_counts.output if not None else
token_counts.output_local, and reasoning_val = token_counts.reasoning if not
None else token_counts.reasoning_local; if both output_val and reasoning_val are
None raise NoMetricValue, otherwise treat any remaining None as 0 and return
output_val + reasoning_val. Update the logic in
OutputSequenceLengthMetric.parse_record to follow this per-field (output ??
output_local) + (reasoning ?? reasoning_local) behavior.

---

Outside diff comments:
In `@src/aiperf/common/models/record_models.py`:
- Around line 880-907: The TokenCounts dataclass is using Pydantic's Field() as
defaults causing FieldInfo objects to become actual default values; replace all
Field(...) usages inside the TokenCounts class with plain Python defaults (e.g.,
input: int | None = None) and move the descriptive text into per-field
docstrings or the class docstring to match other dataclasses (reference
TokenCounts and its attributes input, input_local, output, output_local,
reasoning, reasoning_local) so instances of TokenCounts get real None defaults
instead of FieldInfo objects.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c6788c05-9d3e-449d-ba0c-ee28876b4dac

📥 Commits

Reviewing files that changed from the base of the PR and between a8d63f1450b6927553edfc89acc843d8403f6929 and 2baa419.

📒 Files selected for processing (34)
  • docs/cli-options.md
  • docs/metrics-reference.md
  • src/aiperf/common/config/cli_parameter.py
  • src/aiperf/common/config/config_defaults.py
  • src/aiperf/common/config/endpoint_config.py
  • src/aiperf/common/config/tokenizer_config.py
  • src/aiperf/common/config/user_config.py
  • src/aiperf/common/enums/metric_enums.py
  • src/aiperf/common/models/model_endpoint_info.py
  • src/aiperf/common/models/record_models.py
  • src/aiperf/common/tokenizer_validator.py
  • src/aiperf/endpoints/openai_chat.py
  • src/aiperf/endpoints/openai_completions.py
  • src/aiperf/exporters/console_usage_discrepancy_exporter.py
  • src/aiperf/metrics/base_record_metric.py
  • src/aiperf/metrics/types/input_sequence_length_metric.py
  • src/aiperf/metrics/types/output_sequence_length_metric.py
  • src/aiperf/metrics/types/output_token_count.py
  • src/aiperf/metrics/types/reasoning_token_count.py
  • src/aiperf/metrics/types/usage_diff_metrics.py
  • src/aiperf/post_processors/base_metrics_processor.py
  • src/aiperf/records/inference_result_parser.py
  • src/aiperf/records/record_processor_service.py
  • tests/integration/test_use_server_token_counts.py
  • tests/unit/endpoints/test_openai_chat_completions.py
  • tests/unit/endpoints/test_openai_completions.py
  • tests/unit/exporters/test_console_usage_discrepancy_exporter.py
  • tests/unit/metrics/conftest.py
  • tests/unit/metrics/test_input_sequence_length_metric.py
  • tests/unit/metrics/test_output_sequence_length_metric.py
  • tests/unit/metrics/test_output_token_count.py
  • tests/unit/metrics/test_reasoning_token_count.py
  • tests/unit/metrics/test_usage_diff_metrics.py
  • tests/unit/records/test_inference_result_parser.py
💤 Files with no reviewable changes (3)
  • src/aiperf/common/enums/metric_enums.py
  • src/aiperf/common/tokenizer_validator.py
  • src/aiperf/post_processors/base_metrics_processor.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/aiperf/common/models/model_endpoint_info.py
  • src/aiperf/common/config/user_config.py
  • tests/unit/exporters/test_console_usage_discrepancy_exporter.py
  • src/aiperf/metrics/base_record_metric.py
  • src/aiperf/exporters/console_usage_discrepancy_exporter.py
  • tests/integration/test_use_server_token_counts.py
  • src/aiperf/endpoints/openai_completions.py

Contributor Author

@ajcasagrande responding here so we can have a thread

1: Fixed. Uses None everywhere.

2: Reverted to old name to be consistent with the "usage_" prefix preferring openai name.

3: Removed test since it was already covered by another one.

A: Fixed. Forces consistent source of reasoning and output token counts. Either both server or both local.

B: Fixed. Error records fallback to local tokenization now unless user explicitly specifies --no-tokenize-input.

C: Fixed. Added --stream-usage / --no-stream-usage flag + info log.

D: Fixed. Added an info log when input tokenization is disabled. In rare case where user provided input and doesn't trust server token counts, they can locally tokenize with --tokenize-input.

E: Fixed. Added an info log when output tokenization is disabled. In rare case where user doesn't trust server token counts, they can locally tokenize with --tokenize-output.

F: Added TOKENIZES_INPUT_ONLY flag.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/metrics-reference.md`:
- Around line 808-825: The documentation currently says prompt diff metrics
require only `--tokenize-output`, which is incorrect; update the
`UsagePromptTokensDiffMetric` notes to state that prompt-diff values depend on
`token_counts.input_local` and therefore require `--tokenize-input` (or
server-provided prompt token counts/fallback tokenization) to be present for
user-provided datasets; mention `token_counts.input_local`,
`UsagePromptTokensDiffMetric`, and the `--tokenize-input` flag so readers know
when the metric will actually be available.
- Around line 323-333: Update the OSL documentation to state that
OutputSequenceLengthMetric (OSL) uses an all-or-nothing token source selection:
it chooses either all server-reported counts or all client-local counts rather
than doing per-field fallbacks; explicitly call out that OSL does not mix server
`token_counts.*` (e.g., `token_counts.output`, `token_counts.reasoning`) with
local `*_local` fields to avoid double-counting when some backends fold
reasoning into `output` while leaving `reasoning` null; revise the description
and the “Formula” examples for OutputSequenceLengthMetric/OSL to show this
all-server OR all-local behavior (not `(output ?? output_local) + (reasoning ??
reasoning_local)`), and make the same clarification in the other section noted
(lines 623-633).
- Around line 327-329: The docs currently show output_token_count =
token_counts.output or token_counts.output_local which treats 0 as falsy; update
the text to show explicit None-check fallback semantics used in code: use
token_counts.output if token_counts.output is not None else
token_counts.output_local, and explain that token_counts.output_local is only
used when token_counts.output is None so a server-reported 0 is preserved.
- Around line 385-425: Update the ISL documentation to reflect the new storage
fields and correct flag behavior: change the primary/formula references from
usage.prompt_tokens and tokenizer.encode(...) to token_counts.input
(server-preferred) and token_counts.input_local (client-side), update the “Input
Sequence Length — Server (file-only)” and “Input Sequence Length — Local
(file-only)” subsections to state they read token_counts.input and
token_counts.input_local respectively, and clarify that --no-tokenize-input only
prevents explicit client tokenization when configured but does not stop the
fallback local tokenization when token_counts.input is missing (mention
--tokenize-input to force client tokenization). Ensure the formulas and notes
reference the exact symbols token_counts.input and token_counts.input_local and
update phrasing so fallback behavior is accurate.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8049f8ea-4e80-46f5-91b4-d526d40892ba

📥 Commits

Reviewing files that changed from the base of the PR and between 2baa419 and 25f573e.

📒 Files selected for processing (1)
  • docs/metrics-reference.md

@matthewkotila matthewkotila force-pushed the mkotila/aip-758-implement-preference-for-server-token-counts branch from 25f573e to 633b7bd on March 13, 2026 at 02:25
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/aiperf/metrics/base_record_metric.py (1)

27-33: ⚠️ Potential issue | 🟡 Minor

Docstring example now conflicts with actual ISL source precedence.

The example should not return input_local directly; it should demonstrate server-first (input) with local fallback, matching InputSequenceLengthMetric behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/metrics/base_record_metric.py` around lines 27 - 33, The
implementation of _parse_record currently returns
record.token_counts.input_local but the docstring/example expects server-first
precedence; update BaseRecordMetric._parse_record to return
record.token_counts.input when present and fall back to
record.token_counts.input_local otherwise (preserving behavior similar to
InputSequenceLengthMetric), ensuring you reference
ParsedResponseRecord.token_counts.input and .input_local and keep the same
return type (int) and signature.
src/aiperf/metrics/types/usage_diff_metrics.py (1)

199-262: ⚠️ Potential issue | 🟠 Major

The exported counter name no longer matches what it counts.

usage_discrepancy_count now depends only on UsagePromptTokensDiffMetric, so completion and reasoning mismatches are silently excluded even though the tag/header still read as a generic usage discrepancy count. That changes the meaning of an exported metric without any signal to downstream dashboards. Either keep the aggregate behavior or introduce a prompt-specific metric/tag and preserve the current one for backward compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/metrics/types/usage_diff_metrics.py` around lines 199 - 262, The
exported metric UsageDiscrepancyCountMetric (tag/header) no longer matches its
implementation since _parse_record only checks UsagePromptTokensDiffMetric;
either restore the original generic behavior by including completion/reasoning
diff metrics in required_metrics and checking all diffs (e.g., take max(prompt,
completion, reasoning) > threshold or increment if any exceed threshold in
UsageDiscrepancyCountMetric._parse_record), or make this a prompt-specific
metric by renaming the tag/header (e.g., tag="usage_prompt_discrepancy_count",
header="Usage Prompt Discrepancy Count", short_header) and leave the original
UsageDiscrepancyCountMetric intact for backward compatibility (create a new
class UsagePromptDiscrepancyCountMetric referencing UsagePromptTokensDiffMetric
if you choose rename).
🧹 Nitpick comments (4)
tests/unit/metrics/test_reasoning_token_count.py (1)

106-178: Align new test names with the project’s test naming convention.

Please rename the newly added server/local tests to the test_<function>_<scenario>_<expected> pattern for consistency (e.g., test_parse_record_server_value_present_returns_server_value).

As per coding guidelines, tests/**/*.py: "Test naming convention: test_<function>_<scenario>_<expected>".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/metrics/test_reasoning_token_count.py` around lines 106 - 178, The
test function names in TestReasoningTokenCountServerMetric and
TestReasoningTokenCountLocalMetric do not follow the project's test naming
convention; rename each test method to the test_<function>_<scenario>_<expected>
pattern (for example, rename test_server_metric_basic ->
test_parse_record_server_value_present_returns_server_value,
test_server_metric_zero -> test_parse_record_server_value_zero_returns_zero,
test_server_metric_none_raises ->
test_parse_record_server_value_none_raises_NoMetricValue, and similarly rename
the local ones: test_local_metric_basic ->
test_parse_record_local_value_present_returns_local_value,
test_local_metric_zero -> test_parse_record_local_value_zero_returns_zero,
test_local_metric_none_raises ->
test_parse_record_local_value_none_raises_NoMetricValue) so the names align with
the guideline; keep the classes ReasoningTokenCountServerMetric and
ReasoningTokenCountLocalMetric and their assertions unchanged.
tests/unit/metrics/test_input_sequence_length_metric.py (1)

25-141: Good coverage overall; please align the new pytest methods with repo conventions.

The added cases are useful, but names like test_server_metric_parse_record and the missing -> None annotations drift from the repo's pytest conventions for test naming and function typing.

As per coding guidelines, "Name tests as test_<function>_<scenario>_<expected>" and "Include type hints on ALL functions (params and return)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/metrics/test_input_sequence_length_metric.py` around lines 25 -
141, Rename the test methods to follow the repo convention
test_<function>_<scenario>_<expected> (e.g., rename
test_server_metric_parse_record -> test_parse_record_server_returns_or_raises,
test_server_metric_ignores_local ->
test_parse_record_server_ignores_local_raises, test_server_metric_flags ->
test_flags_server_contains_expected, and similarly for the local-metric methods)
and add full type hints to all test function signatures (annotate parameters and
add "-> None" return type) for the methods in
TestInputSequenceLengthServerMetric, TestInputSequenceLengthLocalMetric and any
other newly added tests such as test_input_sequence_length_prefers_server and
test_input_sequence_length_both_none_raises so they conform to the project's
naming and typing conventions.
src/aiperf/records/inference_result_parser.py (1)

325-356: Keep tokenizer failures best-effort, but let parser bugs fail fast.

These except Exception blocks also swallow failures from compute_input_token_count() and _parse_output_and_reasoning_texts(), so regressions in our own prompt/response extraction quietly turn into missing local counts. Catch around tokenizer load/encode only, or re-raise non-tokenizer failures.

Based on learnings, prefer fail-fast at the source of logic errors rather than adding downstream defensive guards that mask bugs.
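The narrowing the comment suggests can be sketched as: run the text extraction outside the `try` so parser regressions fail fast, and catch only tokenizer failures inside it. The `TokenizerError` class and helper names here are hypothetical stand-ins for whatever the real tokenizer raises:

```python
import logging

logger = logging.getLogger(__name__)

class TokenizerError(RuntimeError):
    """Hypothetical marker for tokenizer load/encode failures."""

def local_output_counts(parse_texts, get_encoder, responses):
    # Parser bugs propagate: text extraction runs outside the try block,
    # mirroring _parse_output_and_reasoning_texts in the real parser.
    output_texts, reasoning_texts = parse_texts(responses)
    try:
        encode = get_encoder()  # only tokenizer load/encode is best-effort
        return (
            sum(len(encode(t)) for t in output_texts),
            sum(len(encode(t)) for t in reasoning_texts),
        )
    except TokenizerError as exc:
        logger.warning("local tokenization skipped: %s", exc)
        return (None, None)
```

A regression in `parse_texts` raises normally, while a missing or broken tokenizer degrades to `(None, None)` so the server-reported counts remain in use.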

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/records/inference_result_parser.py` around lines 325 - 356, The
current broad excepts around compute_input_token_count and the block that calls
get_tokenizer, _parse_output_and_reasoning_texts and _compute_token_count
swallow parser/regression errors; narrow the scope so only tokenizer-related
failures are best-effort. For compute_input_token_count, remove the blanket
except or catch and re-raise non-tokenizer errors so only tokenizer load/encode
errors get logged; for the output path, call _parse_output_and_reasoning_texts
outside the tokenizer-specific try and let its exceptions propagate, then wrap
only get_tokenizer and _compute_token_count calls in a try that catches
tokenizer/encoding errors (and logs/warns) while re-raising any other
exceptions. Reference functions: compute_input_token_count, get_tokenizer,
_parse_output_and_reasoning_texts, _compute_token_count.
tests/unit/metrics/test_usage_diff_metrics.py (1)

478-634: Please align the new pytest methods with repo naming and typing rules.

The added completion/reasoning diff tests omit -> None, and names like test_exact_match / test_positive_difference are too generic for the repo's test_<function>_<scenario>_<expected> convention.

As per coding guidelines, "Name tests as test_<function>_<scenario>_<expected>" and "Include type hints on ALL functions (params and return)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/metrics/test_usage_diff_metrics.py` around lines 478 - 634, The
test methods in TestUsageCompletionTokensDiffMetric and
TestUsageReasoningTokensDiffMetric lack type hints and use generic names; rename
each method to follow the repo convention test_<function>_<scenario>_<expected>
(e.g., replace test_exact_match with test_parse_record_exact_match_returns_zero
for UsageCompletionTokensDiffMetric and analogous names for
UsageReasoningTokensDiffMetric) and add full type annotations to all test method
signatures (include parameter types like self and return type -> None) so every
def in those classes has proper typing; keep the existing assertions and
metric/class references (UsageCompletionTokensDiffMetric,
UsageReasoningTokensDiffMetric, parse_record, MetricRecordDict, NoMetricValue,
MetricFlags) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/cli-options.md`:
- Around line 168-170: The deprecation note for the flag
`--use-server-token-count` incorrectly states that AIPerf "always computes both
client-side and server-reported token counts"; instead update the wording to
clarify that `input_local` and `output_local` token counts are not computed
unconditionally but are only materialized when explicitly enabled or required as
a fallback to server counts, and that server-reported counts remain preferred
for output metrics while client-side counts are optional and may impose
performance cost; edit the `--use-server-token-count` description to reflect
this conditional materialization and the performance implication.
- Around line 681-684: Update the `--tokenize-input`/`--no-tokenize-input` docs
to reflect the conditional runtime default instead of claiming a flat True:
state that tokenization is enabled by default for synthetic inputs (so input
token counts are stored in token_counts.input_local), but it is automatically
disabled for user-provided datasets (`--custom-dataset-type` and
`--public-dataset`) unless explicitly overridden; remove or replace the
misleading "_Default: `True`_" line and ensure the text mentions the fallback
client-side tokenization behavior when the server does not report prompt tokens.

In `@docs/metrics-reference.md`:
- Around line 815-817: The note overstates prompt-diff availability: update the
text to say that UsagePromptTokensDiffMetric requires both API-provided prompt
usage (the server `usage` prompt token count) and client-local prompt token
counts; enabling local tokenization via --tokenize-input (or fallback
tokenization) only supplies the client side and does not create a prompt diff if
the API prompt usage is absent. Modify the paragraph to explicitly state that
prompt diff is available only when the API includes prompt usage plus client
tokenization, and that --tokenize-input enables client counting but cannot
substitute for missing API prompt usage.
- Around line 401-406: Update the documentation text around the client-side
tokenization and the --no-tokenize-input flag to make clear that
--no-tokenize-input only prevents client-side tokenization when the server
reports usage.prompt_tokens, but does not prevent fallback local tokenization
when the server omits prompt token counts; explicitly mention that client-side
tokenization (which uses add_special_tokens=False to count only content tokens)
will still be performed as a fallback if prompt token usage is missing, and note
that --tokenize-input forces local tokenization for user-provided input
datasets.
- Around line 396-398: The current ISL formula uses "usage.prompt_tokens or
len(tokenizer.encode(prompt, add_special_tokens=False))" which treats 0 as
falsy; change it to explicitly fall back only when usage.prompt_tokens is None
by using a conditional expression that returns usage.prompt_tokens if it is not
None, otherwise computes len(tokenizer.encode(prompt,
add_special_tokens=False)); reference the symbols input_sequence_length,
usage.prompt_tokens, tokenizer.encode, prompt and add_special_tokens=False when
applying this fix.

In `@src/aiperf/common/models/record_models.py`:
- Around line 884-907: The TokenCounts dataclass fields (input, input_local,
output, output_local, reasoning, reasoning_local) currently use
pydantic.Field(...) which stores FieldInfo objects when omitted; change each
field to a plain dataclass default (e.g., input: int | None = None) instead of
Field(...) and move the per-field descriptions into the class docstring or
comments so omitted fields are actual None values; update the TokenCounts
declaration (and any dataclass decorator usage) accordingly to preserve typing
and slots while ensuring downstream checks like record.token_counts.input is not
None work correctly.

In `@src/aiperf/records/inference_result_parser.py`:
- Around line 314-323: The current per-record warning when input_token_count,
output_server, and reasoning_server are all None will spam logs and overstate
impact; change it to warn only once by tracking a flag on the parser instance
(e.g., set self._warned_no_server_usage = True after the first call) and update
the message to be accurate and softer: indicate that server-side usage info is
missing but local fallbacks (input_local, output_local, reasoning_local) may
still be available. Modify the branch that calls self.warning (the check using
input_token_count, output_server, reasoning_server) to first check and set this
instance flag and to use the revised message mentioning
input_local/output_local/reasoning_local instead of claiming all token metrics
are unavailable.
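The warn-once pattern described above can be sketched as follows; the class and method names are stand-ins for the parser, with only the flag logic shown:

```python
import logging

logger = logging.getLogger(__name__)


class ResultParser:
    """Illustrative stand-in for the inference result parser."""

    def __init__(self) -> None:
        self._warned_no_server_usage = False

    def _check_usage(self, input_count, output_server, reasoning_server) -> None:
        all_missing = (
            input_count is None
            and output_server is None
            and reasoning_server is None
        )
        # Warn once per parser instance rather than once per record,
        # and mention that local fallbacks may still cover the metrics.
        if all_missing and not self._warned_no_server_usage:
            self._warned_no_server_usage = True
            logger.warning(
                "Server responses contain no usage info; local fallbacks "
                "(input_local/output_local/reasoning_local) may still be available."
            )
```

Subsequent records that also lack usage info skip the warning entirely, keeping logs readable at high request counts.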

---

Outside diff comments:
In `@src/aiperf/metrics/base_record_metric.py`:
- Around line 27-33: The implementation of _parse_record currently returns
record.token_counts.input_local but the docstring/example expects server-first
precedence; update BaseRecordMetric._parse_record to return
record.token_counts.input when present and fall back to
record.token_counts.input_local otherwise (preserving behavior similar to
InputSequenceLengthMetric), ensuring you reference
ParsedResponseRecord.token_counts.input and .input_local and keep the same
return type (int) and signature.

In `@src/aiperf/metrics/types/usage_diff_metrics.py`:
- Around line 199-262: The exported metric UsageDiscrepancyCountMetric
(tag/header) no longer matches its implementation since _parse_record only
checks UsagePromptTokensDiffMetric; either restore the original generic behavior
by including completion/reasoning diff metrics in required_metrics and checking
all diffs (e.g., take max(prompt, completion, reasoning) > threshold or
increment if any exceed threshold in UsageDiscrepancyCountMetric._parse_record),
or make this a prompt-specific metric by renaming the tag/header (e.g.,
tag="usage_prompt_discrepancy_count", header="Usage Prompt Discrepancy Count",
short_header) and leave the original UsageDiscrepancyCountMetric intact for
backward compatibility (create a new class UsagePromptDiscrepancyCountMetric
referencing UsagePromptTokensDiffMetric if you choose rename).
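If the generic behavior is restored, the "any diff exceeds the threshold" check could look roughly like this; the function and parameter names are illustrative, since the real metric pulls these values from `required_metrics`:

```python
def usage_discrepancy(prompt_diff, completion_diff, reasoning_diff, threshold=0):
    """Return 1 if any available usage diff exceeds the threshold, else 0.

    None diffs (e.g. when the server omitted a usage field) are skipped
    rather than treated as discrepancies.
    """
    diffs = [d for d in (prompt_diff, completion_diff, reasoning_diff) if d is not None]
    if not diffs:
        return 0
    return 1 if max(diffs) > threshold else 0
```

Checking `max(...)` over all three diffs keeps the exported tag/header honest: a record counts as a discrepancy if prompt, completion, or reasoning usage disagrees, not just the prompt.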

---

Nitpick comments:
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 325-356: The current broad excepts around
compute_input_token_count and the block that calls get_tokenizer,
_parse_output_and_reasoning_texts and _compute_token_count swallow
parser/regression errors; narrow the scope so only tokenizer-related failures
are best-effort. For compute_input_token_count, remove the blanket except or
catch and re-raise non-tokenizer errors so only tokenizer load/encode errors get
logged; for the output path, call _parse_output_and_reasoning_texts outside the
tokenizer-specific try and let its exceptions propagate, then wrap only
get_tokenizer and _compute_token_count calls in a try that catches
tokenizer/encoding errors (and logs/warns) while re-raising any other
exceptions. Reference functions: compute_input_token_count, get_tokenizer,
_parse_output_and_reasoning_texts, _compute_token_count.
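The narrowed scope described above can be sketched as follows. Function names mirror the comment but the bodies are stand-ins; the key structural point is that parsing runs outside the try, so only tokenizer failures are best-effort:

```python
def parse_output_and_reasoning_texts(response):
    # Parsing errors should propagate: they indicate real bugs or
    # malformed records, not a missing/broken tokenizer.
    return response["output"], response.get("reasoning", "")


def compute_local_counts(response, get_tokenizer, compute_token_count):
    output_text, reasoning_text = parse_output_and_reasoning_texts(response)
    try:
        # Only the tokenizer-specific work is wrapped; load/encode
        # failures are logged and skipped, everything else re-raises.
        tokenizer = get_tokenizer()
        return (
            compute_token_count(tokenizer, output_text),
            compute_token_count(tokenizer, reasoning_text),
        )
    except (OSError, ValueError) as exc:  # typical tokenizer load/encode failures
        print(f"warning: local tokenization skipped: {exc}")
        return None, None
```

A blanket `except Exception` around both steps would silently turn a parser regression into "no local counts", which is exactly the failure mode the nitpick flags.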

In `@tests/unit/metrics/test_input_sequence_length_metric.py`:
- Around line 25-141: Rename the test methods to follow the repo convention
test_<function>_<scenario>_<expected> (e.g., rename
test_server_metric_parse_record -> test_parse_record_server_returns_or_raises,
test_server_metric_ignores_local ->
test_parse_record_server_ignores_local_raises, test_server_metric_flags ->
test_flags_server_contains_expected, and similarly for the local-metric methods)
and add full type hints to all test function signatures (annotate parameters and
add "-> None" return type) for the methods in
TestInputSequenceLengthServerMetric, TestInputSequenceLengthLocalMetric and any
other newly added tests such as test_input_sequence_length_prefers_server and
test_input_sequence_length_both_none_raises so they conform to the project's
naming and typing conventions.

In `@tests/unit/metrics/test_reasoning_token_count.py`:
- Around line 106-178: The test function names in
TestReasoningTokenCountServerMetric and TestReasoningTokenCountLocalMetric do
not follow the project's test naming convention; rename each test method to the
test_<function>_<scenario>_<expected> pattern (for example, rename
test_server_metric_basic ->
test_parse_record_server_value_present_returns_server_value,
test_server_metric_zero -> test_parse_record_server_value_zero_returns_zero,
test_server_metric_none_raises ->
test_parse_record_server_value_none_raises_NoMetricValue, and similarly rename
the local ones: test_local_metric_basic ->
test_parse_record_local_value_present_returns_local_value,
test_local_metric_zero -> test_parse_record_local_value_zero_returns_zero,
test_local_metric_none_raises ->
test_parse_record_local_value_none_raises_NoMetricValue) so the names align with
the guideline; keep the classes ReasoningTokenCountServerMetric and
ReasoningTokenCountLocalMetric and their assertions unchanged.

In `@tests/unit/metrics/test_usage_diff_metrics.py`:
- Around line 478-634: The test methods in TestUsageCompletionTokensDiffMetric
and TestUsageReasoningTokensDiffMetric lack type hints and use generic names;
rename each method to follow the repo convention
test_<function>_<scenario>_<expected> (e.g., replace test_exact_match with
test_parse_record_exact_match_returns_zero for UsageCompletionTokensDiffMetric
and analogous names for UsageReasoningTokensDiffMetric) and add full type
annotations to all test method signatures (include parameter types like self and
return type -> None) so every def in those classes has proper typing; keep the
existing assertions and metric/class references
(UsageCompletionTokensDiffMetric, UsageReasoningTokensDiffMetric, parse_record,
MetricRecordDict, NoMetricValue, MetricFlags) unchanged.
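Applied to one method, the renaming and typing convention from these comments looks roughly like this; the metric class here is a minimal stand-in for the real `UsageCompletionTokensDiffMetric`:

```python
class UsageCompletionTokensDiffMetric:
    """Stand-in: absolute difference between server and local counts."""

    def parse_record(self, server: int, local: int) -> int:
        return abs(server - local)


class TestUsageCompletionTokensDiffMetric:
    # test_<function>_<scenario>_<expected>, fully typed (-> None),
    # replacing a generic name like test_exact_match.
    def test_parse_record_exact_match_returns_zero(self) -> None:
        metric = UsageCompletionTokensDiffMetric()
        assert metric.parse_record(server=42, local=42) == 0
```

The name now states the function under test, the scenario, and the expected outcome, so a failing test identifies the broken behavior without reading its body.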

ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1258b632-f717-4506-8d4d-20d744ea73e8

📥 Commits

Reviewing files that changed from the base of the PR and between 25f573e and 633b7bd.

📒 Files selected for processing (34)
  • docs/cli-options.md
  • docs/metrics-reference.md
  • src/aiperf/common/config/cli_parameter.py
  • src/aiperf/common/config/config_defaults.py
  • src/aiperf/common/config/endpoint_config.py
  • src/aiperf/common/config/tokenizer_config.py
  • src/aiperf/common/config/user_config.py
  • src/aiperf/common/enums/metric_enums.py
  • src/aiperf/common/models/model_endpoint_info.py
  • src/aiperf/common/models/record_models.py
  • src/aiperf/common/tokenizer_validator.py
  • src/aiperf/endpoints/openai_chat.py
  • src/aiperf/endpoints/openai_completions.py
  • src/aiperf/exporters/console_usage_discrepancy_exporter.py
  • src/aiperf/metrics/base_record_metric.py
  • src/aiperf/metrics/types/input_sequence_length_metric.py
  • src/aiperf/metrics/types/output_sequence_length_metric.py
  • src/aiperf/metrics/types/output_token_count.py
  • src/aiperf/metrics/types/reasoning_token_count.py
  • src/aiperf/metrics/types/usage_diff_metrics.py
  • src/aiperf/post_processors/base_metrics_processor.py
  • src/aiperf/records/inference_result_parser.py
  • src/aiperf/records/record_processor_service.py
  • tests/integration/test_use_server_token_counts.py
  • tests/unit/endpoints/test_openai_chat_completions.py
  • tests/unit/endpoints/test_openai_completions.py
  • tests/unit/exporters/test_console_usage_discrepancy_exporter.py
  • tests/unit/metrics/conftest.py
  • tests/unit/metrics/test_input_sequence_length_metric.py
  • tests/unit/metrics/test_output_sequence_length_metric.py
  • tests/unit/metrics/test_output_token_count.py
  • tests/unit/metrics/test_reasoning_token_count.py
  • tests/unit/metrics/test_usage_diff_metrics.py
  • tests/unit/records/test_inference_result_parser.py
💤 Files with no reviewable changes (3)
  • src/aiperf/post_processors/base_metrics_processor.py
  • src/aiperf/common/enums/metric_enums.py
  • src/aiperf/common/tokenizer_validator.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • src/aiperf/records/record_processor_service.py
  • src/aiperf/endpoints/openai_chat.py
  • src/aiperf/common/config/config_defaults.py
  • tests/unit/metrics/conftest.py
  • src/aiperf/common/config/cli_parameter.py
  • tests/unit/endpoints/test_openai_completions.py
  • src/aiperf/metrics/types/reasoning_token_count.py
  • tests/unit/exporters/test_console_usage_discrepancy_exporter.py

matthewkotila force-pushed the mkotila/aip-758-implement-preference-for-server-token-counts branch from 633b7bd to e2c3be0 on March 14, 2026 at 00:39.