
Commit 7abe7ff

LJH-LBJ authored and hsliuustc0106 committed
[Feature] Opt metrics structure (vllm-project#891)
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
1 parent 4c77379 commit 7abe7ff

File tree

23 files changed: +1151 −728 lines changed


benchmarks/qwen3-omni/README.md

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@ bash benchmarks/qwen3-omni/vllm_omni/eval_qwen3_moe_omni.sh
 ```
 
 What it does:
-- Runs `examples/offline_inference/qwen3_omni/end2end.py` with `--enable-stats`.
+- Runs `examples/offline_inference/qwen3_omni/end2end.py` with `--log-stats`.
 - Uses `benchmarks/build_dataset/top100.txt` and writes to:
   - Logs: `benchmarks/qwen3-omni/vllm_omni/logs/`
   - `omni_llm_pipeline_text.orchestrator.stats.jsonl` — per-stage latency stats.

benchmarks/qwen3-omni/vllm_omni/eval_qwen3_moe_omni.sh

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ else
 python $end2end_script_path --output-wav $outputs_dir \
   --query-type text \
   --txt-prompts $build_dataset_path \
-  --enable-stats \
+  --log-stats \
   --log-dir $log_dir
 echo "Logs and outputs are saved in ${log_dir} and ${outputs_dir} respectively:"
 echo " - omni_llm_pipeline_text run dir/base name"

docs/api/README.md

Lines changed: 5 additions & 3 deletions
@@ -15,9 +15,6 @@ Main entry points for vLLM-Omni inference and serving.
 - [vllm_omni.entrypoints.cli.benchmark.serve.OmniBenchmarkServingSubcommand][]
 - [vllm_omni.entrypoints.cli.serve.OmniServeCommand][]
 - [vllm_omni.entrypoints.client_request_state.ClientRequestState][]
-- [vllm_omni.entrypoints.log_utils.OrchestratorMetrics][]
-- [vllm_omni.entrypoints.log_utils.StageRequestMetrics][]
-- [vllm_omni.entrypoints.log_utils.StageStats][]
 - [vllm_omni.entrypoints.omni.Omni][]
 - [vllm_omni.entrypoints.omni.OmniBase][]
 - [vllm_omni.entrypoints.omni_diffusion.OmniDiffusion][]
@@ -114,3 +111,8 @@ Worker classes and model runners for distributed inference.
 - [vllm_omni.worker.gpu_generation_worker.GPUGenerationWorker][]
 - [vllm_omni.worker.gpu_model_runner.OmniGPUModelRunner][]
 - [vllm_omni.worker.mixins.OmniWorkerMixin][]
+
+
+## Metrics
+
+- [vllm_omni.metrics.OrchestratorAggregator][]

docs/contributing/metrics.md

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
+# vLLM-Omni Metrics
+
+You can use these metrics in production to monitor the health and performance of the vLLM-Omni system. Typical scenarios include:
+
+- **Performance Monitoring**: Track throughput (e.g., `e2e_avg_tokens_per_s`), latency (e.g., `e2e_total_ms`), and resource utilization to verify that the system meets expected standards.
+- **Debugging and Troubleshooting**: Use detailed per-request metrics to diagnose issues such as high transfer times or unexpected token counts.
+
+## How to Enable and View Metrics
+
+### 1. Start the Service with Metrics Logging
+
+```bash
+vllm serve /workspace/models/Qwen3-Omni-30B-A3B-Instruct --omni --port 8014 --log-stats
+```
+
+### 2. Send a Request
+
+```bash
+python openai_chat_completion_client_for_multimodal_generation.py --query-type use_image
+```
+
+### 3. What You Will See
+
+With `--log-stats` enabled, the server outputs detailed metrics logs after each request. Example output:
+
+#### Overall Summary
+
+| Field                       | Value      |
+|-----------------------------|------------|
+| e2e_requests                | 1          |
+| e2e_wall_time_ms            | 41,299.190 |
+| e2e_total_tokens            | 5,202      |
+| e2e_avg_time_per_request_ms | 41,299.190 |
+| e2e_avg_tokens_per_s        | 125.959    |
+| e2e_stage_0_wall_time_ms    | 10,192.289 |
+| e2e_stage_1_wall_time_ms    | 30,541.409 |
+| e2e_stage_2_wall_time_ms    | 207.496    |
+
+#### RequestE2EStats
+
+| Field                   | Value       |
+|-------------------------|-------------|
+| e2e_total_ms            | 41,299.133  |
+| e2e_total_tokens        | 5,202       |
+| transfers_total_time_ms | 245.895     |
+| transfers_total_kbytes  | 138,089.939 |
+
+#### StageRequestStats
+
+| Field                  | 0         | 1          | 2       |
+|------------------------|-----------|------------|---------|
+| audio_generated_frames | 0         | 0          | 525,525 |
+| batch_id               | 38        | 274        | 0       |
+| batch_size             | 1         | 1          | 1       |
+| num_tokens_in          | 4,860     | 4,826      | 4,384   |
+| num_tokens_out         | 67        | 275        | 0       |
+| postprocess_time_ms    | 256.158   | 0.491      | 0.000   |
+| stage_gen_time_ms      | 9,910.007 | 30,379.198 | 160.745 |
+
+#### TransferEdgeStats
+
+| Field               | 0->1        | 1->2       |
+|---------------------|-------------|------------|
+| size_kbytes         | 109,277.349 | 28,812.591 |
+| tx_time_ms          | 78.701      | 18.790     |
+| rx_decode_time_ms   | 111.865     | 31.706     |
+| in_flight_time_ms   | 2.015       | 2.819      |
+
+These logs include:
+
+- **Overall summary**: total requests, wall time, average tokens/sec, etc.
+- **E2E table**: per-request latency and token counts.
+- **Stage table**: per-stage batch and timing details.
+- **Transfer table**: data transfer and timing for each edge.
+
+You can use these logs to monitor system health, debug performance, and analyze request-level metrics as described above.
+
+## Metrics Scope: Offline vs Online Inference
+
+For **offline inference** (batch mode), the summary includes both system-level metrics (aggregated across all requests) and per-request metrics. In this case, `e2e_requests` can be greater than 1, reflecting multiple completed requests in a batch.
+
+For **online inference** (serving mode), the summary is always per-request: `e2e_requests` is always 1, and only request-level metrics are reported for each completion.
+
+---
+
+## Parameter Details
+
+| Field | Meaning |
+|-------|---------|
+| `e2e_requests` | Number of completed requests. |
+| `e2e_wall_time_ms` | Wall-clock time span from run start to last completion, in ms. |
+| `e2e_total_tokens` | Total tokens counted across all completed requests (stage 0 input + all stage outputs). |
+| `e2e_avg_time_per_request_ms` | Average wall time per request: `e2e_wall_time_ms / e2e_requests`. |
+| `e2e_avg_tokens_per_s` | Average token throughput over wall time: `e2e_total_tokens * 1000 / e2e_wall_time_ms`. |
+| `e2e_stage_{i}_wall_time_ms` | Wall-clock time span for stage i, in ms. Each stage's wall time is reported as a separate field, e.g., `e2e_stage_0_wall_time_ms`, `e2e_stage_1_wall_time_ms`, etc. |
+
+---
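The averaged fields above are simple ratios over the summary values. As a quick sanity check, this sketch recomputes both averages from the example "Overall Summary" table (numbers taken directly from the tables in this doc):

```python
# Values from the example "Overall Summary" table above.
e2e_requests = 1
e2e_wall_time_ms = 41_299.190
e2e_total_tokens = 5_202

# e2e_avg_time_per_request_ms = e2e_wall_time_ms / e2e_requests
avg_time_per_request_ms = e2e_wall_time_ms / e2e_requests

# e2e_avg_tokens_per_s = e2e_total_tokens * 1000 / e2e_wall_time_ms
avg_tokens_per_s = e2e_total_tokens * 1000 / e2e_wall_time_ms

print(round(avg_time_per_request_ms, 3))  # 41299.19
print(round(avg_tokens_per_s, 3))         # 125.959
```

Both results match the reported `e2e_avg_time_per_request_ms` and `e2e_avg_tokens_per_s`.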
+## E2E Table (per request)
+
+| Field | Meaning |
+|-------|---------|
+| `e2e_total_ms` | End-to-end latency in ms. |
+| `e2e_total_tokens` | Total tokens for the request (stage 0 input + all stage outputs). |
+| `transfers_total_time_ms` | Sum of transfer edge `total_time_ms` for this request. |
+| `transfers_total_kbytes` | Sum of transfer kbytes for this request. |
+
+---
+
+## Stage Table (per stage event / request)
+
+| Field | Meaning |
+|-------|---------|
+| `batch_id` | Batch index. |
+| `batch_size` | Batch size. |
+| `num_tokens_in` | Input tokens to the stage. |
+| `num_tokens_out` | Output tokens from the stage. |
+| `stage_gen_time_ms` | Stage compute time in ms, excluding postprocessing time (reported separately as `postprocess_time_ms`). |
+| `image_num` | Number of images generated (for diffusion/image stages). |
+| `resolution` | Image resolution (for diffusion/image stages). |
+| `postprocess_time_ms` | Post-processing time in ms (for diffusion/image stages). |
+
+---
+
+## Transfer Table (per edge / request)
+
+| Field | Meaning |
+|-------|---------|
+| `size_kbytes` | Total kbytes transferred. |
+| `tx_time_ms` | Sender transfer time in ms. |
+| `rx_decode_time_ms` | Receiver decode time in ms. |
+| `in_flight_time_ms` | In-flight time in ms. |
+
+## Expectation of the Numbers (Verification)
+
+**Formulas:**
+
+- `e2e_total_tokens = stage 0's num_tokens_in + sum(all stages' num_tokens_out)`
+- `transfers_total_time_ms = sum(tx_time_ms + rx_decode_time_ms + in_flight_time_ms)` over every edge
+
+**Using the example above:**
+
+### e2e_total_tokens
+
+- Stage 0's `num_tokens_in`: **4,860**
+- Stage 0's `num_tokens_out`: **67**
+- Stage 1's `num_tokens_out`: **275**
+- Stage 2's `num_tokens_out`: **0**
+
+So,
+
+```
+e2e_total_tokens = 4,860 + 67 + 275 + 0 = 5,202
+```
+
+This matches the table value: `e2e_total_tokens = 5,202`.
+
+### transfers_total_time_ms
+
+For each edge:
+
+- 0->1: tx_time_ms (**78.701**) + rx_decode_time_ms (**111.865**) + in_flight_time_ms (**2.015**) = **192.581**
+- 1->2: tx_time_ms (**18.790**) + rx_decode_time_ms (**31.706**) + in_flight_time_ms (**2.819**) = **53.315**
+
+Sum: 192.581 + 53.315 = **245.896**
+
+The table shows `transfers_total_time_ms = 245.895`, which matches the calculation (the difference is due to rounding).
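The verification arithmetic above can also be scripted. A minimal sketch that recomputes both checks from the example tables (values copied from the doc, nothing else assumed):

```python
# Per-stage token counts from the example StageRequestStats table.
num_tokens_in_stage0 = 4_860
num_tokens_out = [67, 275, 0]  # stages 0, 1, 2

# e2e_total_tokens = stage 0 input tokens + all stage output tokens
e2e_total_tokens = num_tokens_in_stage0 + sum(num_tokens_out)
print(e2e_total_tokens)  # 5202

# Per-edge (tx_time_ms, rx_decode_time_ms, in_flight_time_ms)
# from the example TransferEdgeStats table.
edges = {
    "0->1": (78.701, 111.865, 2.015),
    "1->2": (18.790, 31.706, 2.819),
}
transfers_total_time_ms = sum(tx + rx + flight for tx, rx, flight in edges.values())
print(round(transfers_total_time_ms, 3))  # 245.896
```

The recomputed transfer total differs from the reported 245.895 only in the last digit, because the per-edge values in the table are themselves rounded.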

examples/offline_inference/bagel/end2end.py

Lines changed: 2 additions & 2 deletions
@@ -36,7 +36,7 @@ def parse_args():
     )
 
     # OmniLLM init args
-    parser.add_argument("--enable-stats", action="store_true", default=False)
+    parser.add_argument("--log-stats", action="store_true", default=False)
     parser.add_argument("--init-sleep-seconds", type=int, default=20)
     parser.add_argument("--batch-timeout", type=int, default=5)
     parser.add_argument("--init-timeout", type=int, default=300)
@@ -120,7 +120,7 @@ def main():
 
     omni_kwargs.update(
         {
-            "log_stats": args.enable_stats,
+            "log_stats": args.log_stats,
             "init_sleep_seconds": args.init_sleep_seconds,
             "batch_timeout": args.batch_timeout,
             "init_timeout": args.init_timeout,

examples/offline_inference/qwen2_5_omni/end2end.py

Lines changed: 2 additions & 2 deletions
@@ -322,7 +322,7 @@ def main(args):
     query_result = query_func()
     omni_llm = Omni(
         model=model_name,
-        log_stats=args.enable_stats,
+        log_stats=args.log_stats,
         stage_init_timeout=args.stage_init_timeout,
         batch_timeout=args.batch_timeout,
         init_timeout=args.init_timeout,
@@ -439,7 +439,7 @@ def parse_args():
         help="Query type.",
     )
     parser.add_argument(
-        "--enable-stats",
+        "--log-stats",
        action="store_true",
        default=False,
        help="Enable writing detailed statistics (default: disabled)",
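The same rename pattern repeats across the example scripts: the CLI flag becomes `--log-stats` and is forwarded as the `log_stats` keyword. A self-contained sketch of that wiring (standalone argparse snippet, not the actual example script):

```python
import argparse

# Sketch of the renamed flag: the examples now expose --log-stats and
# pass the parsed value straight through as the log_stats keyword.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--log-stats",
    action="store_true",
    default=False,
    help="Enable writing detailed statistics (default: disabled)",
)

args = parser.parse_args(["--log-stats"])
omni_kwargs = {"log_stats": args.log_stats}  # forwarded to Omni(...)
print(omni_kwargs)  # {'log_stats': True}
```

Note that argparse converts the dashed flag to the attribute `args.log_stats`, so the attribute name now matches the keyword argument, which is the point of the rename.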

examples/offline_inference/qwen3_omni/end2end.py

Lines changed: 2 additions & 2 deletions
@@ -328,7 +328,7 @@ def main(args):
     omni_llm = Omni(
         model=model_name,
         stage_configs_path=args.stage_configs_path,
-        log_stats=args.enable_stats,
+        log_stats=args.log_stats,
         stage_init_timeout=args.stage_init_timeout,
     )
 
@@ -455,7 +455,7 @@ def parse_args():
         help="Query type.",
     )
     parser.add_argument(
-        "--enable-stats",
+        "--log-stats",
        action="store_true",
        default=False,
        help="Enable writing detailed statistics (default: disabled)",

examples/offline_inference/qwen3_tts/end2end.py

Lines changed: 2 additions & 2 deletions
@@ -219,7 +219,7 @@ def main(args):
     omni = Omni(
         model=model_name,
         stage_configs_path=args.stage_configs_path,
-        log_stats=args.enable_stats,
+        log_stats=args.log_stats,
         stage_init_timeout=args.stage_init_timeout,
     )
 
@@ -275,7 +275,7 @@ def parse_args():
         help="Query type.",
     )
     parser.add_argument(
-        "--enable-stats",
+        "--log-stats",
        action="store_true",
        default=False,
        help="Enable writing detailed statistics (default: disabled)",

requirements/common.txt

Lines changed: 1 addition & 0 deletions
@@ -12,3 +12,4 @@ torchsde>=0.2.6
 openai-whisper>=20250625
 imageio[ffmpeg]>=2.37.2
 sox>=1.5.0
+prettytable>=3.8.0
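The new `prettytable` dependency is presumably what renders the aligned metrics tables shown in the docs above. As a stdlib-only sketch of the same kind of column-aligned output (illustrative, not the library's actual rendering code):

```python
def render_table(field_names, rows):
    """Render rows as a simple aligned text table (stdlib-only sketch)."""
    # Width of each column = widest cell in that column, header included.
    widths = [
        max(len(str(cell)) for cell in [name] + [row[i] for row in rows])
        for i, name in enumerate(field_names)
    ]

    def fmt(cells):
        return " | ".join(str(c).ljust(w) for c, w in zip(cells, widths))

    lines = [fmt(field_names), "-+-".join("-" * w for w in widths)]
    lines += [fmt(row) for row in rows]
    return "\n".join(lines)

print(render_table(
    ["Field", "Value"],
    [["e2e_requests", 1], ["e2e_total_tokens", "5,202"]],
))
```

With `prettytable` itself, the equivalent would be building a `PrettyTable`, setting `field_names`, and calling `add_row` per metric.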

tests/e2e/online_serving/test_async_omni.py

Lines changed: 1 addition & 52 deletions
@@ -9,7 +9,7 @@
 from vllm.inputs import PromptType
 
 from tests.utils import hardware_test
-from vllm_omni.entrypoints.async_omni import AsyncOmni, ClientRequestState
+from vllm_omni.entrypoints.async_omni import AsyncOmni
 
 os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
 
@@ -115,54 +115,3 @@ async def test_abort():
     num_generated_tokens, request_id = await task
     assert num_generated_tokens == NUM_EXPECTED_TOKENS
     await asyncio.sleep(5)
-
-
-@pytest.mark.core_model
-@pytest.mark.omni
-@hardware_test(res={"cuda": "H100", "rocm": "MI325"}, num_cards=2)
-@pytest.mark.asyncio
-async def test_build_and_log_summary(monkeypatch):
-    from vllm_omni.entrypoints.utils import get_final_stage_id_for_e2e
-
-    RealCRS = ClientRequestState
-    capture_metrics = {}
-
-    class MockCRS(RealCRS):
-        def __init__(self, request_id: str):
-            super().__init__(request_id)
-            capture_metrics[request_id] = self
-
-    monkeypatch.setattr("vllm_omni.entrypoints.async_omni.ClientRequestState", MockCRS)
-    monkeypatch.setattr("vllm_omni.entrypoints.client_request_state.ClientRequestState", MockCRS)
-
-    with ExitStack() as after:
-        # Avoid SHM IPC in tests to prevent /dev/shm exhaustion and SIGBUS.
-        engine = AsyncOmni(
-            model=model,
-            stage_configs_path=stage_config,
-            shm_threshold_bytes=sys.maxsize,
-        )
-        after.callback(engine.shutdown)
-        prompt = "Hello my name is Robert and "
-        NUM_EXPECTED_TOKENS = 64
-        NUM_REQUESTS = 3
-        request_ids = [f"request-{i}" for i in range(NUM_REQUESTS)]
-
-        # Create concurrent requests.
-        tasks: list[asyncio.Task] = []
-        for idx, request_id in enumerate(request_ids):
-            tasks.append(asyncio.create_task(generate(engine, request_id, prompt, NUM_EXPECTED_TOKENS)))
-
-        # Confirm the requests are okay.
-        for idx, task in enumerate(tasks):
-            await task
-            output_modalities = ["text"]
-            final_stage_id_for_e2e = get_final_stage_id_for_e2e(
-                output_modalities, engine.output_modalities, engine.stage_list
-            )
-            summary = capture_metrics[request_ids[idx]].metrics.build_and_log_summary(final_stage_id_for_e2e)
-
-            # Check that total tokens matches sum of stage tokens.
-            assert summary["e2e_total_tokens"] == sum(stage["tokens"] for stage in summary["stages"])
-            # Check that total time matches sum of stage times.
-            assert summary["e2e_total_time_ms"] >= sum(stage["total_time_ms"] for stage in summary["stages"])