Commit a912ba6

feat: Implement preference for server token counts

1 parent e294a31

32 files changed (+1349, -681 lines)

docs/cli-options.md

Lines changed: 12 additions & 2 deletions
@@ -225,8 +225,8 @@ Use the legacy 'max_tokens' field instead of 'max_completion_tokens' in request
 
 #### `--use-server-token-count`
 
-Use server-reported token counts from API usage fields instead of client-side tokenization. When enabled, tokenizers are still loaded (needed for dataset generation) but tokenizer.encode() is not called for computing metrics. Token count fields will be None if the server does not provide usage information. For OpenAI-compatible streaming endpoints (chat/completions), stream_options.include_usage is automatically configured when this flag is enabled.
-<br/>_Flag (no value required)_
+[Deprecated] This flag is a no-op and will be removed in a future release. AIPerf now always computes both client-side and server-reported token counts. Server counts are preferred for output metrics; client counts are used for input validation.
+<br>_Flag (no value required)_
 
 #### `--connection-reuse-strategy` `<str>`

@@ -726,6 +726,16 @@ Specific tokenizer version to load from HuggingFace Hub. Can be a branch name (e
 Allow execution of custom Python code from HuggingFace Hub tokenizer repositories. Required for tokenizers with custom implementations not in the standard `transformers` library. **Security Warning**: Only enable for trusted repositories, as this executes arbitrary code. Unnecessary for standard tokenizers.
 <br/>_Flag (no value required)_
 
+#### `--tokenize-output`
+
+Enable client-side tokenization of output and reasoning tokens, even when the server reports token counts. When enabled, locally computed counts are stored alongside server-reported values for validation and comparison. Without this flag, local output/reasoning tokenization occurs only as a fallback when the server does not report counts.
+<br>_Flag (no value required)_
+
+#### `--tokenize-input`, `--no-tokenize-input`
+
+Enable client-side tokenization of input prompts for every request. When enabled, locally computed input token counts are always stored in `token_counts.input_local`. When disabled, client-side input tokenization occurs only as a fallback when the server does not report prompt tokens. Automatically set to `False` for user-provided input datasets (`--custom-dataset-type` or `--public-dataset`) unless explicitly overridden.
+<br>_Default: `True`_
+
 ### Load Generator
 
 #### `--benchmark-duration` `<float>`
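As an editorial aside, the default-resolution rule described for `--tokenize-input` above can be sketched as follows. This is a hypothetical helper for illustration only; `default_tokenize_input` and its parameters are not AIPerf code, but the behavior matches the documented rule: an explicit flag always wins, otherwise the default is `True` except for user-provided datasets.

```python
def default_tokenize_input(explicit: "bool | None", user_provided_dataset: bool) -> bool:
    """Resolve the effective --tokenize-input value (hypothetical helper).

    An explicit --tokenize-input / --no-tokenize-input always wins;
    otherwise input tokenization defaults to on, except when the user
    supplied their own dataset (--custom-dataset-type / --public-dataset).
    """
    if explicit is not None:  # explicit flag always wins
        return explicit
    return not user_provided_dataset  # default True, off for user datasets

# Default case: input tokenization is on.
print(default_tokenize_input(None, False))  # True
# User-provided dataset: off by default.
print(default_tokenize_input(None, True))   # False
# Explicit override wins either way.
print(default_tokenize_input(True, True))   # True
```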

docs/metrics-reference.md

Lines changed: 121 additions & 30 deletions
@@ -25,8 +25,12 @@ This document provides a comprehensive reference of all metrics available in AIP
 - [Prefill Throughput Per User](#prefill-throughput-per-user)
 - [Token Based Metrics](#token-based-metrics)
 - [Output Token Count](#output-token-count)
+- [Output Token Count — Server (file-only)](#output-token-count--server-file-only)
+- [Output Token Count — Local (file-only)](#output-token-count--local-file-only)
 - [Output Sequence Length (OSL)](#output-sequence-length-osl)
 - [Input Sequence Length (ISL)](#input-sequence-length-isl)
+- [Input Sequence Length — Server (file-only)](#input-sequence-length--server-file-only)
+- [Input Sequence Length — Local (file-only)](#input-sequence-length--local-file-only)
 - [Total Output Tokens](#total-output-tokens)
 - [Total Output Sequence Length](#total-output-sequence-length)
 - [Total Input Sequence Length](#total-input-sequence-length)
@@ -41,6 +45,8 @@ This document provides a comprehensive reference of all metrics available in AIP
 - [Video Peak Memory](#video-peak-memory)
 - [Reasoning Metrics](#reasoning-metrics)
 - [Reasoning Token Count](#reasoning-token-count)
+- [Reasoning Token Count — Server (file-only)](#reasoning-token-count--server-file-only)
+- [Reasoning Token Count — Local (file-only)](#reasoning-token-count--local-file-only)
 - [Total Reasoning Tokens](#total-reasoning-tokens)
 - [Usage Field Metrics](#usage-field-metrics)
 - [Usage Prompt Tokens](#usage-prompt-tokens)
@@ -52,7 +58,7 @@ This document provides a comprehensive reference of all metrics available in AIP
 - [Total Usage Total Tokens](#total-usage-total-tokens)
 - [Usage Discrepancy Metrics](#usage-discrepancy-metrics)
 - [Usage Prompt Tokens Diff %](#usage-prompt-tokens-diff-)
-- [Usage Completion Tokens Diff %](#usage-completion-tokens-diff-)
+- [Usage Output Tokens Diff %](#usage-output-tokens-diff-)
 - [Usage Reasoning Tokens Diff %](#usage-reasoning-tokens-diff-)
 - [Usage Discrepancy Count](#usage-discrepancy-count)
 - [OSL Mismatch Metrics](#osl-mismatch-metrics)
@@ -314,18 +320,45 @@ All metrics in this section require token-producing endpoints that return text c
 
 **Type:** [Record Metric](#record-metrics)
 
-The number of output tokens generated for a single request, _excluding reasoning tokens_. This represents the output tokens returned to the user across all responses for the request.
+The number of output tokens generated for a single request, _excluding reasoning tokens_. Prefers the server-reported `token_counts.output` when available, falling back to the client-side `token_counts.output_local`.
 
 **Formula:**
 ```python
-output_token_count = len(tokenizer.encode(content, add_special_tokens=False))
+# Server-preferred (falls back to client-side)
+output_token_count = token_counts.output or token_counts.output_local
 ```
 
 **Notes:**
-- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer.
+- When the server reports completion tokens, that value (minus reasoning) is used.
+- Falls back to client-side tokenization when the server does not report completion tokens, or when `--tokenize-output` provides a local value.
 - For streaming requests with multiple responses, the responses are joined together and then tokens are counted.
 - For models that expose reasoning in a separate `reasoning_content` field, this metric counts only non-reasoning output tokens.
-- If reasoning appears inside the regular `content` (e.g., `<think>` blocks), those tokens will be counted unless explicitly filtered.
+
+---
+
+### Output Token Count — Server (file-only)
+
+**Type:** [Record Metric](#record-metrics)
+
+The server-reported output token count (`token_counts.output`) for a single request. This metric is **file-only** (`NO_CONSOLE`): it is exported to `profile_export.jsonl` but not shown on the console.
+
+**Formula:**
+```python
+output_token_count_server = token_counts.output  # None → NoMetricValue
+```
+
+---
+
+### Output Token Count — Local (file-only)
+
+**Type:** [Record Metric](#record-metrics)
+
+The client-side tokenized output token count for a single request. This metric is **file-only** (`NO_CONSOLE`): it is exported to `profile_export.jsonl` but not shown on the console. Populated when `--tokenize-output` is enabled, or as a fallback when the server does not report completion tokens.
+
+**Formula:**
+```python
+output_token_count_local = len(tokenizer.encode(content, add_special_tokens=False))  # None → NoMetricValue
+```
 
 ---

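As an editorial note on this hunk: the server-preferred fallback used in the formulas above can be written out explicitly, as in the sketch below (a standalone illustration, not AIPerf code). The `or` form shown for `output_token_count` treats a server-reported 0 as missing, while an explicit `is not None` check (as used for `reasoning_token_count`) does not.

```python
def resolve_count(server: "int | None", local: "int | None") -> "int | None":
    """Prefer the server-reported count; fall back to the client-side one."""
    return server if server is not None else local

print(resolve_count(128, 125))   # 128, server value wins
print(resolve_count(None, 125))  # 125, fallback to the local count
print(resolve_count(0, 125))     # 0, whereas `0 or 125` would yield 125
```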
@@ -349,19 +382,50 @@ output_sequence_length = (output_token_count or 0) + (reasoning_token_count or 0
 
 **Type:** [Record Metric](#record-metrics)
 
-The number of input/prompt tokens for a single request. This represents the size of the input sent to the model.
+The number of input/prompt tokens for a single request. This represents the size of the input sent to the model. Prefers the server-reported `usage.prompt_tokens` when available, falling back to client-side tokenization.
 
 **Formula:**
 ```python
-input_sequence_length = len(tokenizer.encode(prompt, add_special_tokens=False))
+# Server-preferred (falls back to client-side)
+input_sequence_length = usage.prompt_tokens or len(tokenizer.encode(prompt, add_special_tokens=False))
 ```
 
 **Notes:**
-- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer.
+- When the server reports `usage.prompt_tokens`, that value is used for ISL (and thus for console display and derived metrics).
+- Falls back to client-side tokenization when the server does not report prompt token counts.
+- Client-side tokenization uses `add_special_tokens=False` to count only content tokens.
+- Automatically disabled for user-provided input datasets; use `--tokenize-input` to force it.
+- Use `--no-tokenize-input` to skip it when relying on server-reported prompt tokens.
 - Useful for understanding the relationship between input size and latency/throughput.
 
 ---
 
+### Input Sequence Length — Server (file-only)
+
+**Type:** [Record Metric](#record-metrics)
+
+The server-reported prompt token count (`usage.prompt_tokens`) for a single request. This metric is **file-only** (`NO_CONSOLE`): it is exported to `profile_export.jsonl` but not shown on the console.
+
+**Formula:**
+```python
+input_sequence_length_server = usage.prompt_tokens  # None → NoMetricValue
+```
+
+---
+
+### Input Sequence Length — Local (file-only)
+
+**Type:** [Record Metric](#record-metrics)
+
+The client-side tokenized prompt token count for a single request. This metric is **file-only** (`NO_CONSOLE`): it is exported to `profile_export.jsonl` but not shown on the console.
+
+**Formula:**
+```python
+input_sequence_length_local = len(tokenizer.encode(prompt, add_special_tokens=False))  # None → NoMetricValue
+```
+
+---
+
 ### Total Output Tokens
 
 **Type:** [Derived Metric](#derived-metrics)
@@ -556,19 +620,47 @@ All metrics in this section require models and backends that expose reasoning co
 
 **Type:** [Record Metric](#record-metrics)
 
-The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output.
+The number of reasoning tokens generated for a single request. Prefers the server-reported `token_counts.reasoning` when available, falling back to the client-side `token_counts.reasoning_local`.
 
 **Formula:**
 ```python
-reasoning_token_count = len(tokenizer.encode(reasoning_content, add_special_tokens=False))
+# Server-preferred (falls back to client-side)
+reasoning_token_count = token_counts.reasoning if token_counts.reasoning is not None else token_counts.reasoning_local
 ```
 
 **Notes:**
-- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer.
+- When the server reports reasoning tokens via `completion_tokens_details.reasoning_tokens`, that value is used.
+- Falls back to client-side tokenization when the server does not report reasoning tokens, or when `--tokenize-output` provides a local value.
 - Does **not** differentiate `<think>` tags or extract reasoning from within the regular `content` field.
 
 ---
 
+### Reasoning Token Count — Server (file-only)
+
+**Type:** [Record Metric](#record-metrics)
+
+The server-reported reasoning token count (`token_counts.reasoning`) for a single request. This metric is **file-only** (`NO_CONSOLE`): it is exported to `profile_export.jsonl` but not shown on the console.
+
+**Formula:**
+```python
+reasoning_token_count_server = token_counts.reasoning  # None → NoMetricValue
+```
+
+---
+
+### Reasoning Token Count — Local (file-only)
+
+**Type:** [Record Metric](#record-metrics)
+
+The client-side tokenized reasoning token count for a single request. This metric is **file-only** (`NO_CONSOLE`): it is exported to `profile_export.jsonl` but not shown on the console. Populated when `--tokenize-output` is enabled, or as a fallback when the server does not report reasoning tokens.
+
+**Formula:**
+```python
+reasoning_token_count_local = len(tokenizer.encode(reasoning_content, add_special_tokens=False))  # None → NoMetricValue
+```
+
+---
+
 ### Total Reasoning Tokens
 
 **Type:** [Derived Metric](#derived-metrics)
@@ -713,75 +805,75 @@ total_usage_total_tokens = sum(r.usage_total_tokens for r in records if r.valid)
 
 ## Usage Discrepancy Metrics
 
-<Note>
-These metrics measure the percentage difference between API-reported token counts (`usage` fields) and client-computed token counts. They are **not displayed in console output** but help identify tokenizer mismatches or counting discrepancies.
-</Note>
+> [!NOTE]
+> These metrics measure the percentage difference between API-reported token counts (`usage` fields) and client-computed token counts. They are **not displayed in console output** but help identify tokenizer mismatches or counting discrepancies. Output and reasoning token diff metrics require the `--tokenize-output` flag to populate both server and client values.
 
 ### Usage Prompt Tokens Diff %
 
 **Type:** [Record Metric](#record-metrics)
 
-The percentage difference between API-reported prompt tokens and client-computed Input Sequence Length.
+The percentage difference between API-reported prompt tokens and the client-computed input token count (`token_counts.input_local`).
 
 **Formula:**
 ```python
-usage_prompt_tokens_diff_pct = abs((usage_prompt_tokens - input_sequence_length) / input_sequence_length) * 100
+usage_prompt_tokens_diff_pct = abs((usage_prompt_tokens - client_input_tokens) / client_input_tokens) * 100
 ```
 
 **Notes:**
-- Values close to 0% indicate good agreement between client and server token counts.
+- Values close to 0% indicate good agreement between client and server prompt token counts.
 - Large differences may indicate tokenizer mismatches or special token handling differences.
+- Uses the client-side `token_counts.input_local` (not `input_sequence_length`, which prefers server values).
 
 ---
 
-### Usage Completion Tokens Diff %
+### Usage Output Tokens Diff %
 
 **Type:** [Record Metric](#record-metrics)
 
-The percentage difference between API-reported completion tokens and client-computed Output Sequence Length.
+The percentage difference between server-reported output tokens (`token_counts.output`) and the client-computed output token count (`token_counts.output_local`). Requires `--tokenize-output` to populate both values.
 
 **Formula:**
 ```python
-usage_completion_tokens_diff_pct = abs((usage_completion_tokens - output_sequence_length) / output_sequence_length) * 100
+usage_output_tokens_diff_pct = abs((server_output_tokens - client_output_tokens) / client_output_tokens) * 100
 ```
 
 **Notes:**
-- Values close to 0% indicate good agreement between client and server token counts.
-- Large differences may indicate tokenizer mismatches or different counting methods.
+- Requires the `--tokenize-output` flag to enable client-side output tokenization alongside server values.
+- Values close to 0% indicate good agreement between client and server output token counts.
 
 ---
 
 ### Usage Reasoning Tokens Diff %
 
 **Type:** [Record Metric](#record-metrics)
 
-The percentage difference between API-reported reasoning tokens and client-computed Reasoning Token Count.
+The percentage difference between server-reported reasoning tokens (`token_counts.reasoning`) and the client-computed reasoning token count (`token_counts.reasoning_local`). Requires `--tokenize-output` to populate both values.
 
 **Formula:**
 ```python
-usage_reasoning_tokens_diff_pct = abs((usage_reasoning_tokens - reasoning_token_count) / reasoning_token_count) * 100
+usage_reasoning_tokens_diff_pct = abs((server_reasoning_tokens - client_reasoning_tokens) / client_reasoning_tokens) * 100
 ```
 
 **Notes:**
-- Only available for reasoning-enabled models.
-- Values close to 0% indicate good agreement between client and server reasoning token counts.
+- Requires the `--tokenize-output` flag to enable client-side reasoning tokenization alongside server values.
+- Only applicable to models that support reasoning tokens.
 
 ---
 
 ### Usage Discrepancy Count
 
 **Type:** [Aggregate Metric](#aggregate-metrics)
 
-The number of requests where token count differences exceed a threshold (default 10%).
+The number of requests where the prompt token count difference exceeds a threshold (default 10%).
 
 **Formula:**
 ```python
-usage_discrepancy_count = sum(1 for r in records if r.any_diff > threshold)
+usage_discrepancy_count = sum(1 for r in records if r.prompt_diff > threshold)
 ```
 
 **Notes:**
 - Default threshold is 10% difference.
-- Counts requests where prompt, completion, or reasoning token differences are significant.
+- Counts requests where the prompt token difference is significant.
 - Useful for monitoring overall token count agreement quality.
 
 ---
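The diff-percentage and threshold-count formulas above can be combined into a short runnable sketch. The record tuples and the `diff_pct` helper below are hypothetical illustrations, not AIPerf code; the 10% threshold matches the documented default.

```python
def diff_pct(server_tokens: int, client_tokens: int) -> float:
    """Percentage difference between server- and client-reported token counts."""
    return abs((server_tokens - client_tokens) / client_tokens) * 100

# Hypothetical (usage_prompt_tokens, client_input_tokens) pairs for three requests.
records = [(100, 100), (105, 100), (130, 100)]
threshold = 10.0  # documented default: 10%

# Count requests whose prompt token diff exceeds the threshold.
usage_discrepancy_count = sum(
    1 for server, client in records if diff_pct(server, client) > threshold
)
print(usage_discrepancy_count)  # 1 (only the 30% difference exceeds 10%)
```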
@@ -1356,7 +1448,6 @@ Metric flags are used to control when and how metrics are computed, displayed, a
 | <a id="flag-tokenizes-input-only"></a>`TOKENIZES_INPUT_ONLY` | Only computed when endpoint tokenizes input | Requires endpoints that process and tokenize input text; skipped for non-text endpoints |
 | <a id="flag-http-trace-only"></a>`HTTP_TRACE_ONLY` | Only computed when HTTP trace data is available | Requires HTTP request tracing to be enabled; provides detailed HTTP lifecycle timing metrics |
 | <a id="flag-supports-video-only"></a>`SUPPORTS_VIDEO_ONLY` | Only computed for video endpoints | Requires video-capable endpoints; skipped for other endpoint types |
-| <a id="flag-usage-diff-only"></a>`USAGE_DIFF_ONLY` | Only computed when usage field data is available | Requires API responses to include usage field with token counts for comparison with client-computed values |
 | <a id="flag-produces-video-only"></a>`PRODUCES_VIDEO_ONLY` | Only computed for video-producing endpoints | Requires endpoints that produce video output (e.g., SGLang video generation) |
 
 ## Composite Flags

src/aiperf/common/config/endpoint_config.py

Lines changed: 15 additions & 7 deletions
@@ -32,6 +32,18 @@ class EndpointConfig(BaseConfig):
 
     _CLI_GROUP = Groups.ENDPOINT
 
+    @model_validator(mode="after")
+    def warn_deprecated_use_server_token_count(self) -> Self:
+        """Log deprecation warning when --use-server-token-count is True."""
+        if self.use_server_token_count:
+            _logger.warning(
+                "--use-server-token-count is deprecated and will be removed in a future release. "
+                "AIPerf now always computes both client-side and server-reported token counts. "
+                "Server counts are preferred for output metrics; client counts are used for input validation. "
+                "This flag is now a no-op."
+            )
+        return self
+
     @model_validator(mode="after")
     def validate_streaming(self) -> Self:
         """Validate that streaming is supported for the endpoint type."""
@@ -217,13 +229,9 @@ def url(self) -> str:
         bool,
         Field(
             description=(
-                "Use server-reported token counts from API usage fields instead of "
-                "client-side tokenization. When enabled, tokenizers are still loaded "
-                "(needed for dataset generation) but tokenizer.encode() is not called "
-                "for computing metrics. Token count fields will be None if the server "
-                "does not provide usage information. For OpenAI-compatible streaming "
-                "endpoints (chat/completions), stream_options.include_usage is automatically "
-                "configured when this flag is enabled."
+                "[Deprecated] This flag is a no-op and will be removed in a future release. "
+                "AIPerf now always computes both client-side and server-reported token counts. "
+                "Server counts are preferred for output metrics; client counts are used for input validation."
             ),
         ),
         CLIParameter(
