docs/cli-options.md (12 additions & 2 deletions)
@@ -225,8 +225,8 @@ Use the legacy 'max_tokens' field instead of 'max_completion_tokens' in request

#### `--use-server-token-count`

-
Use server-reported token counts from API usage fields instead of client-side tokenization. When enabled, tokenizers are still loaded (needed for dataset generation) but tokenizer.encode() is not called for computing metrics. Token count fields will be None if the server does not provide usage information. For OpenAI-compatible streaming endpoints (chat/completions), stream_options.include_usage is automatically configured when this flag is enabled.
-
<br/>_Flag (no value required)_
+
[Deprecated] This flag is a no-op and will be removed in a future release. AIPerf now always computes both client-side and server-reported token counts. Server counts are preferred for output metrics; client counts are used for input validation.
+
<br>_Flag (no value required)_

#### `--connection-reuse-strategy` `<str>`
@@ -726,6 +726,16 @@ Specific tokenizer version to load from HuggingFace Hub. Can be a branch name (e

Allow execution of custom Python code from HuggingFace Hub tokenizer repositories. Required for tokenizers with custom implementations not in the standard `transformers` library. **Security Warning**: Only enable for trusted repositories, as this executes arbitrary code. Unnecessary for standard tokenizers.

<br/>_Flag (no value required)_

+
#### `--tokenize-output`
+
Enable client-side tokenization of output and reasoning tokens, even when the server reports token counts. When enabled, locally computed counts are stored alongside server-reported values for validation and comparison. Without this flag, local output/reasoning tokenization only occurs as a fallback when the server does not report counts.
+
<br>_Flag (no value required)_
+
#### `--tokenize-input`, `--no-tokenize-input`
+
Enable client-side tokenization of input prompts for every request. When enabled, locally computed input token counts are always stored in token_counts.input_local. When disabled, client-side input tokenization only occurs as a fallback when the server does not report prompt tokens. Automatically set to False for user-provided input datasets (--custom-dataset-type or --public-dataset) unless explicitly overridden.
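The resolution order described for `--tokenize-input` / `--no-tokenize-input` can be sketched as below. This is a hypothetical illustration, not AIPerf's actual implementation; the function name is invented, and the default of "enabled for generated datasets" is an assumption inferred from the flag description above.

```python
from typing import Optional

def tokenize_input_locally(cli_value: Optional[bool], user_provided_dataset: bool) -> bool:
    """Hypothetical sketch of the documented decision for input tokenization."""
    # An explicit --tokenize-input / --no-tokenize-input always wins.
    if cli_value is not None:
        return cli_value
    # Assumed default: automatically disabled for user-provided datasets
    # (--custom-dataset-type or --public-dataset), enabled otherwise.
    return not user_provided_dataset

assert tokenize_input_locally(None, user_provided_dataset=True) is False
assert tokenize_input_locally(True, user_provided_dataset=True) is True
```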
@@ -314,18 +320,45 @@ All metrics in this section require token-producing endpoints that return text c

**Type:** [Record Metric](#record-metrics)

-
The number of output tokens generated for a single request, _excluding reasoning tokens_. This represents the output tokens returned to the user across all responses for the request.
+
The number of output tokens generated for a single request, _excluding reasoning tokens_. Prefers server-reported `token_counts.output` when available, falling back to client-side `token_counts.output_local`.

```python
output_token_count = token_counts.output or token_counts.output_local
```

**Notes:**
-
- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer.
+
- When the server reports completion tokens, that value (minus reasoning) is used.
+
- Falls back to client-side tokenization when server does not report completion tokens, or when `--tokenize-output` provides a local value.
- For streaming requests with multiple responses, the responses are joined together and then tokens are counted.
- For models that expose reasoning in a separate `reasoning_content` field, this metric counts only non-reasoning output tokens.
-
- If reasoning appears inside the regular `content` (e.g., `<think>` blocks), those tokens will be counted unless explicitly filtered.

---

### Output Token Count — Server (file-only)

**Type:** [Record Metric](#record-metrics)

The server-reported output token count (`token_counts.output`) for a single request. This metric is **file-only** (`NO_CONSOLE`) and is exported to `profile_export.jsonl` but not shown on the console.

The client-side tokenized output token count for a single request. This metric is **file-only** (`NO_CONSOLE`) and is exported to `profile_export.jsonl` but not shown on the console. Populated when `--tokenize-output` is enabled, or as a fallback when the server does not report completion tokens.
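The preference rule in the formula above can be sketched as follows. `TokenCounts` here is a stand-in for the `token_counts` record fields named in this doc, not AIPerf's actual class.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenCounts:
    """Illustrative stand-in for the token_counts fields referenced above."""
    output: Optional[int] = None        # server-reported output tokens
    output_local: Optional[int] = None  # client-side tokenized count

def output_token_count(tc: TokenCounts) -> Optional[int]:
    # Mirrors the documented formula: prefer the server value,
    # fall back to the client-side count when it is absent.
    return tc.output or tc.output_local

assert output_token_count(TokenCounts(output=128, output_local=130)) == 128
assert output_token_count(TokenCounts(output=None, output_local=130)) == 130
```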
@@ -349,19 +382,50 @@ output_sequence_length = (output_token_count or 0) + (reasoning_token_count or 0

**Type:** [Record Metric](#record-metrics)

-
The number of input/prompt tokens for a single request. This represents the size of the input sent to the model.
+
The number of input/prompt tokens for a single request. This represents the size of the input sent to the model. Prefers server-reported `usage.prompt_tokens` when available, falling back to client-side tokenization.

- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer.
+
- When the server reports `usage.prompt_tokens`, that value is used for ISL (and thus for console display and derived metrics).
+
- Falls back to client-side tokenization when server does not report prompt token counts.
+
- Client-side tokenization uses `add_special_tokens=False` to count only content tokens.
+
- Automatically disabled for user-provided input datasets; use `--tokenize-input` to force.
+
- Use `--no-tokenize-input` to skip when relying on server-reported prompt tokens.
- Useful for understanding the relationship between input size and latency/throughput.

---

### Input Sequence Length — Server (file-only)

**Type:** [Record Metric](#record-metrics)

The server-reported prompt token count (`usage.prompt_tokens`) for a single request. This metric is **file-only** (`NO_CONSOLE`) and is exported to `profile_export.jsonl` but not shown on the console.

The client-side tokenized prompt token count for a single request. This metric is **file-only** (`NO_CONSOLE`) and is exported to `profile_export.jsonl` but not shown on the console.
@@ -556,19 +620,47 @@ All metrics in this section require models and backends that expose reasoning co

**Type:** [Record Metric](#record-metrics)

-
The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output.
+
The number of reasoning tokens generated for a single request. Prefers server-reported `token_counts.reasoning` when available, falling back to client-side `token_counts.reasoning_local`.

```python
reasoning_token_count = token_counts.reasoning if token_counts.reasoning is not None else token_counts.reasoning_local
```

**Notes:**
-
- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer.
+
- When the server reports reasoning tokens via `completion_tokens_details.reasoning_tokens`, that value is used.
+
- Falls back to client-side tokenization when server does not report reasoning tokens, or when `--tokenize-output` provides a local value.
- Does **not** differentiate `<think>` tags or extract reasoning from within the regular `content` field.

---

### Reasoning Token Count — Server (file-only)

**Type:** [Record Metric](#record-metrics)

The server-reported reasoning token count (`token_counts.reasoning`) for a single request. This metric is **file-only** (`NO_CONSOLE`) and is exported to `profile_export.jsonl` but not shown on the console.

The client-side tokenized reasoning token count for a single request. This metric is **file-only** (`NO_CONSOLE`) and is exported to `profile_export.jsonl` but not shown on the console. Populated when `--tokenize-output` is enabled, or as a fallback when the server does not report reasoning tokens.
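Note a subtle difference between the two fallback formulas in this diff: output uses `or` (so a server-reported count of 0 falls through to the local value), while reasoning checks `is not None` (so a server-reported 0 is kept). A minimal illustration, with illustrative function names:

```python
def output_count(server, local):
    # Mirrors: token_counts.output or token_counts.output_local
    return server or local

def reasoning_count(server, local):
    # Mirrors: reasoning if reasoning is not None else reasoning_local
    return server if server is not None else local

assert output_count(0, 7) == 7     # falsy 0 from the server falls back to local
assert reasoning_count(0, 7) == 0  # explicit None check preserves a server 0
```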
@@ -713,75 +805,75 @@ total_usage_total_tokens = sum(r.usage_total_tokens for r in records if r.valid)

## Usage Discrepancy Metrics

-
<Note>
-
These metrics measure the percentage difference between API-reported token counts (`usage` fields) and client-computed token counts. They are **not displayed in console output** but help identify tokenizer mismatches or counting discrepancies.
-
</Note>
+
> [!NOTE]
+
> These metrics measure the percentage difference between API-reported token counts (`usage` fields) and client-computed token counts. They are **not displayed in console output** but help identify tokenizer mismatches or counting discrepancies. Output and reasoning token diff metrics require the `--tokenize-output` flag to populate both server and client values.

### Usage Prompt Tokens Diff %

**Type:** [Record Metric](#record-metrics)

-
The percentage difference between API-reported prompt tokens and client-computed Input Sequence Length.
+
The percentage difference between API-reported prompt tokens and client-computed input token count (`token_counts.input_local`).

- Values close to 0% indicate good agreement between client and server token counts.
+
- Values close to 0% indicate good agreement between client and server prompt token counts.
- Large differences may indicate tokenizer mismatches or special token handling differences.
+
- Uses client-side `token_counts.input_local` (not `input_sequence_length`, which prefers server values).
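The exact percentage formula is not shown in this excerpt. One plausible definition, assuming the diff is the signed relative difference of the client-side count versus the server-reported value, would be:

```python
def usage_prompt_tokens_diff_pct(server_prompt_tokens: int, input_local: int) -> float:
    # Assumed formula (not confirmed by the source): signed relative
    # difference of the client-computed count vs the server-reported count.
    return (input_local - server_prompt_tokens) / server_prompt_tokens * 100.0

assert usage_prompt_tokens_diff_pct(100, 100) == 0.0
assert usage_prompt_tokens_diff_pct(100, 110) == 10.0
```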
---

-
### Usage Completion Tokens Diff %
+
### Usage Output Tokens Diff %

**Type:** [Record Metric](#record-metrics)

-
The percentage difference between API-reported completion tokens and client-computed Output Sequence Length.
+
The percentage difference between server-reported output tokens (`token_counts.output`) and client-computed output token count (`token_counts.output_local`). Requires `--tokenize-output` to populate both values.

-
- Values close to 0% indicate good agreement between client and server token counts.
-
- Large differences may indicate tokenizer mismatches or different counting methods.
+
- Requires `--tokenize-output` flag to enable client-side output tokenization alongside server values.
+
- Values close to 0% indicate good agreement between client and server output token counts.
---

### Usage Reasoning Tokens Diff %

**Type:** [Record Metric](#record-metrics)

-
The percentage difference between API-reported reasoning tokens and client-computed Reasoning Token Count.
+
The percentage difference between server-reported reasoning tokens (`token_counts.reasoning`) and client-computed reasoning token count (`token_counts.reasoning_local`). Requires `--tokenize-output` to populate both values.

-
- Values close to 0% indicate good agreement between client and server reasoning token counts.
+
- Requires `--tokenize-output` flag to enable client-side reasoning tokenization alongside server values.
+
- Only applicable to models that support reasoning tokens.
---

### Usage Discrepancy Count

**Type:** [Aggregate Metric](#aggregate-metrics)

-
The number of requests where token count differences exceed a threshold (default 10%).
+
The number of requests where the prompt token count difference exceeds a threshold (default 10%).

**Formula:**

```python
-
usage_discrepancy_count = sum(1 for r in records if r.any_diff > threshold)
+
usage_discrepancy_count = sum(1 for r in records if r.prompt_diff > threshold)
```

**Notes:**
- Default threshold is 10% difference.
-
- Counts requests where prompt, completion, or reasoning token differences are significant.
+
- Counts requests where the prompt token difference is significant.
- Useful for monitoring overall token count agreement quality.

---
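Applied to a handful of records, the updated aggregate formula behaves as in this sketch; the `Record` class and sample values are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Record:
    prompt_diff: float  # Usage Prompt Tokens Diff %, as a percentage

threshold = 10.0  # the default 10% threshold from the notes above
records = [Record(0.5), Record(12.0), Record(9.9), Record(25.0)]

# Only the two records whose prompt diff exceeds 10% are counted.
usage_discrepancy_count = sum(1 for r in records if r.prompt_diff > threshold)
assert usage_discrepancy_count == 2
```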
@@ -1356,7 +1448,6 @@ Metric flags are used to control when and how metrics are computed, displayed, a

| <a id="flag-tokenizes-input-only"></a>`TOKENIZES_INPUT_ONLY` | Only computed when endpoint tokenizes input | Requires endpoints that process and tokenize input text; skipped for non-text endpoints |
| <a id="flag-http-trace-only"></a>`HTTP_TRACE_ONLY` | Only computed when HTTP trace data is available | Requires HTTP request tracing to be enabled; provides detailed HTTP lifecycle timing metrics |
| <a id="flag-supports-video-only"></a>`SUPPORTS_VIDEO_ONLY` | Only computed for video endpoints | Requires video-capable endpoints; skipped for other endpoint types |
-
| <a id="flag-usage-diff-only"></a>`USAGE_DIFF_ONLY` | Only computed when usage field data is available | Requires API responses to include usage field with token counts for comparison with client-computed values |
| <a id="flag-produces-video-only"></a>`PRODUCES_VIDEO_ONLY` | Only computed for video-producing endpoints | Requires endpoints that produce video output (e.g., SGLang video generation) |