Commit 633b7bd

Address feedback
1 parent 1f3127a commit 633b7bd

1 file changed, +10 -3 lines changed

docs/metrics-reference.md

Lines changed: 10 additions & 3 deletions
@@ -325,7 +325,7 @@ The number of output tokens generated for a single request, _excluding reasoning
 **Formula:**
 ```python
 # Server-preferred (falls back to client-side)
-output_token_count = token_counts.output or token_counts.output_local
+output_token_count = token_counts.output if token_counts.output is not None else token_counts.output_local
 ```

 **Notes:**
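
The switch from `or` to an explicit `is not None` check in this hunk matters when the server reports a legitimate count of `0`: `0` is falsy in Python, so the old expression would silently fall back to the client-side value. A minimal sketch of the difference, assuming a hypothetical `TokenCounts` dataclass that only borrows the field names seen in the diff (the real class is not shown in this commit):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenCounts:
    """Hypothetical stand-in for the real token_counts object (assumption)."""
    output: Optional[int] = None        # server-reported output tokens
    output_local: Optional[int] = None  # client-side tokenized output tokens


def output_token_count_old(tc: TokenCounts) -> Optional[int]:
    # Old behaviour: `or` treats a server-reported 0 the same as "missing".
    return tc.output or tc.output_local


def output_token_count_new(tc: TokenCounts) -> Optional[int]:
    # New behaviour: only fall back when the server value is actually absent.
    return tc.output if tc.output is not None else tc.output_local


# Server says 0 output tokens; the client-side tokenizer counted 7.
tc = TokenCounts(output=0, output_local=7)
old_result = output_token_count_old(tc)  # falls through to the client value
new_result = output_token_count_new(tc)  # keeps the server's 0
```

When the server value is absent (`None`), both versions return the client-side count, so the change only affects the zero case.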
@@ -370,10 +370,17 @@ The total number of completion tokens (output + reasoning) generated for a singl

 **Formula:**
 ```python
-output_sequence_length = (output_token_count or 0) + (reasoning_token_count or 0)
+# All-server when both are available, otherwise all-client
+if token_counts.output is not None and token_counts.reasoning is not None:
+    output_sequence_length = token_counts.output + token_counts.reasoning
+elif token_counts.output_local is not None:
+    output_sequence_length = token_counts.output_local + (token_counts.reasoning_local or 0)
+else:
+    output_sequence_length = (token_counts.output or 0) + (token_counts.reasoning or 0)
 ```

 **Notes:**
+- OSL uses consistent source selection (all-server or all-client) to avoid double-counting. Some servers embed reasoning tokens inside `completion_tokens` but leave `reasoning_tokens` null — mixing server output with client reasoning would count reasoning twice.
 - For models that do not support/separate reasoning tokens, OSL equals the output token count.

 ---
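
The double-counting case described in the added note can be reproduced with a small example. The sketch below wraps the new formula in a function and compares it with a mixed-source sum. The `TokenCounts` dataclass and the sample numbers are assumptions for illustration; only the field names come from the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenCounts:
    """Hypothetical container; only the field names are taken from the diff."""
    output: Optional[int] = None           # server-reported output tokens
    reasoning: Optional[int] = None        # server-reported reasoning tokens
    output_local: Optional[int] = None     # client-side output tokens
    reasoning_local: Optional[int] = None  # client-side reasoning tokens


def output_sequence_length(tc: TokenCounts) -> int:
    # All-server when both server values are present.
    if tc.output is not None and tc.reasoning is not None:
        return tc.output + tc.reasoning
    # Otherwise all-client, if the client tokenized the output.
    if tc.output_local is not None:
        return tc.output_local + (tc.reasoning_local or 0)
    # Last resort: whatever the server did report.
    return (tc.output or 0) + (tc.reasoning or 0)


# Assumed scenario: the server folds 40 reasoning tokens into its output
# count (100) and leaves reasoning null; the client counted 60 + 40.
tc = TokenCounts(output=100, reasoning=None, output_local=60, reasoning_local=40)

mixed = tc.output + tc.reasoning_local   # server output + client reasoning
consistent = output_sequence_length(tc)  # all-client: 60 + 40
```

Here `mixed` counts the 40 reasoning tokens twice, while `consistent` matches the true total of 100.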
@@ -806,7 +813,7 @@ total_usage_total_tokens = sum(r.usage_total_tokens for r in records if r.valid)
 ## Usage Discrepancy Metrics

 > [!NOTE]
-> These metrics measure the percentage difference between API-reported token counts (`usage` fields) and client-computed token counts. They are **not displayed in console output** but help identify tokenizer mismatches or counting discrepancies. Output and reasoning token diff metrics require the `--tokenize-output` flag to populate both server and client values.
+> These metrics measure the percentage difference between API-reported token counts (`usage` fields) and client-computed token counts. They are **not displayed in console output** but help identify tokenizer mismatches or counting discrepancies. Prompt diff requires `--tokenize-input` (or fallback tokenization when server omits prompt tokens) for user-provided datasets. Output and reasoning diff metrics require `--tokenize-output` to populate both server and client values.

 ### Usage Prompt Tokens Diff %
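
The note in this hunk says the diff metrics need both a server value and a client value. The exact formula is outside the lines shown in this commit, so the following is only a sketch under two assumptions: the difference is taken relative to the client-side count, and the metric is skipped when either side is missing. `usage_diff_pct` is a hypothetical helper name, not an identifier from the codebase.

```python
from typing import Optional


def usage_diff_pct(api_count: Optional[int], client_count: Optional[int]) -> Optional[float]:
    """Hypothetical helper: percent difference relative to the client count.

    Returns None when either value is missing, which is why the relevant
    tokenize flag must be set for the metric to be populated at all.
    """
    if api_count is None or client_count is None or client_count == 0:
        return None
    return 100.0 * abs(api_count - client_count) / client_count


matched = usage_diff_pct(100, 100)   # same count on both sides
mismatch = usage_diff_pct(110, 100)  # server reports 10 more tokens
missing = usage_diff_pct(None, 100)  # server omitted the usage field
```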