
- # Metrics vLLM-Omni:
+ # Metrics

  You can use these metrics in production to monitor the health and performance of the vLLM-Omni system. Typical scenarios include:
+
  - **Performance Monitoring**: Track throughput (e.g., `e2e_avg_tokens_per_s`), latency (e.g., `e2e_total_ms`), and resource utilization to verify that the system meets expected standards.
+
  - **Debugging and Troubleshooting**: Use detailed per-request metrics to diagnose issues, such as high transfer times or unexpected token counts.

  ## How to Enable and View Metrics

- ### 1. Start the Service with Metrics Logging
+ ### Start the Service with Metrics Logging

  ```bash
  vllm serve /workspace/models/Qwen3-Omni-30B-A3B-Instruct --omni --port 8014 --log-stats
  ```

- ### 2. Send a Request
+ ### Send a Request

  ```bash
  python openai_chat_completion_client_for_multimodal_generation.py --query-type use_image
  ```

- ### 3. What You Will See
+ ### What You Will See

  With `--log-stats` enabled, the server outputs detailed metrics logs after each request. Example output:

@@ -69,9 +71,13 @@ With `--log-stats` enabled, the server will output detailed metrics logs after e


  These logs include:
+
  - **Overall summary**: total requests, wall time, average tokens/sec, etc.
+
  - **E2E table**: per-request latency and token counts.
+
  - **Stage table**: per-stage batch and timing details.
+
  - **Transfer table**: data transfer and timing for each edge.

  You can use these logs to monitor system health, debug performance, and analyze request-level metrics as described above.
@@ -87,6 +93,8 @@ For **online inference** (serving mode), the summary is always per-request. `e2e

  ## Parameter Details

+ ### Summary Metrics
+
  | Field | Meaning |
  | ---------------------------| ----------------------------------------------------------------------------------------------|
  | `e2e_requests` | Number of completed requests. |
@@ -98,7 +106,7 @@ For **online inference** (serving mode), the summary is always per-request. `e2e

  ---

- ## E2E Table (per request)
+ ### E2E Table (per request)

  | Field | Meaning |
  | ---------------------------| -----------------------------------------------------------------------|
@@ -110,7 +118,7 @@ For **online inference** (serving mode), the summary is always per-request. `e2e

  ---

- ## Stage Table (per stage event / request)
+ ### Stage Table (per stage event / request)

  | Field | Meaning |
  | ---------------------------| -------------------------------------------------------------------------------------------------|
@@ -125,7 +133,7 @@ For **online inference** (serving mode), the summary is always per-request. `e2e

  ---

- ## Transfer Table (per edge / request)
+ ### Transfer Table (per edge / request)

  | Field | Meaning |
  | ----------------------| ---------------------------------------------------------------------------|
@@ -135,31 +143,31 @@ For **online inference** (serving mode), the summary is always per-request. `e2e
  | `in_flight_time_ms` | In-flight time in ms. |


- ## Expectation of the Numbers (Verification)
+ ### Expectation of the Numbers (Verification)

  **Formulas:**
+
  - `e2e_total_tokens = Stage0's num_tokens_in + sum(all stages' num_tokens_out)`
+
  - `transfers_total_time_ms = sum(tx_time_ms + rx_decode_time_ms + in_flight_time_ms)` for every edge

  **Using the example above:**

- ### e2e_total_tokens
+ **e2e_total_tokens**
+
  - Stage0's `num_tokens_in`: **4,860**
  - Stage0's `num_tokens_out`: **67**
  - Stage1's `num_tokens_out`: **275**
  - Stage2's `num_tokens_out`: **0**

- So,
- ```
- e2e_total_tokens = 4,860 + 67 + 275 + 0 = 5,202
- ```
- This matches the table value: `e2e_total_tokens = 5,202`.
+ So `e2e_total_tokens = 4,860 + 67 + 275 + 0 = 5,202`, which matches the table value.
+
+ **transfers_total_time_ms**

- ### transfers_total_time_ms
  For each edge:
+
  - 0->1: tx_time_ms (**78.701**) + rx_decode_time_ms (**111.865**) + in_flight_time_ms (**2.015**) = **192.581**
- - 1->2: tx_time_ms (**18.790**) + rx_decode_time_ms (**31.706**) + in_flight_time_ms (**2.819**) = **53.315**

- Sum: 192.581 + 53.315 = **245.896**
+ - 1->2: tx_time_ms (**18.790**) + rx_decode_time_ms (**31.706**) + in_flight_time_ms (**2.819**) = **53.315**

- The table shows `transfers_total_time_ms = 245.895`, which matches the calculation (difference is due to rounding).
+ 192.581 + 53.315 = **245.896**, which matches the table value `transfers_total_time_ms = 245.895` (the 0.001 difference is due to rounding).
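
The two verification formulas in this change can also be checked mechanically. The following is a minimal Python sketch of that arithmetic, using only the numbers quoted in the worked example. The dictionary layout and the helper names `expected_total_tokens` and `expected_transfer_time_ms` are assumptions made for illustration; they are not part of vLLM-Omni, and the sketch does not parse real `--log-stats` output.

```python
import math

# Numbers copied from the worked example in the docs change.
# The data layout below is a hypothetical convenience, not a vLLM-Omni structure.
stages = [
    {"num_tokens_in": 4860, "num_tokens_out": 67},  # Stage0
    {"num_tokens_out": 275},                        # Stage1
    {"num_tokens_out": 0},                          # Stage2
]
edges = {
    "0->1": {"tx_time_ms": 78.701, "rx_decode_time_ms": 111.865, "in_flight_time_ms": 2.015},
    "1->2": {"tx_time_ms": 18.790, "rx_decode_time_ms": 31.706, "in_flight_time_ms": 2.819},
}


def expected_total_tokens(stages):
    """Stage0's num_tokens_in plus num_tokens_out summed over all stages."""
    return stages[0]["num_tokens_in"] + sum(s["num_tokens_out"] for s in stages)


def expected_transfer_time_ms(edges):
    """tx + rx_decode + in_flight time, summed over every edge."""
    return sum(
        e["tx_time_ms"] + e["rx_decode_time_ms"] + e["in_flight_time_ms"]
        for e in edges.values()
    )


total_tokens = expected_total_tokens(stages)
transfer_ms = expected_transfer_time_ms(edges)

# Token counts are integers, so the comparison is exact.
assert total_tokens == 5202
# The logged per-edge values are rounded to 3 decimals, so allow a small tolerance
# when comparing against the table value of 245.895.
assert math.isclose(transfer_ms, 245.895, abs_tol=0.005)
```

The tolerance in the last assertion reflects the rounding note in the text: per-edge values are printed to three decimals, so the recomputed sum can differ from the logged total in the last digit.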