🚀 The feature, motivation and pitch
Description
I propose adding detailed, structured logging for the inference process (prefill/decode batches), similar to the logging provided by projects such as SGLang and vLLM. This would greatly improve visibility into system performance and make debugging and monitoring easier.
Motivation & Expected Benefits
Debugging & Monitoring: Easily track request states, token usage, throughput, and CUDA graph status in real-time.
Performance Analysis: Monitor key metrics like #running-req, #queue-req, gen throughput (token/s), and token usage to identify bottlenecks.
Operational Clarity: Provides a clear, consistent log stream that helps developers and operators understand system behavior during inference.
Proposed Log Format Example
The logs should follow a structured, readable format like the sample below (inspired by SGLang/vLLM style):
[YYYY-MM-DD HH:MM:SS] Prefill batch. #new-seq: 1, #new-token: 25, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0
[YYYY-MM-DD HH:MM:SS] Decode batch. #running-req: 1, #token: 283, token usage: 0.00, cuda graph: True, gen throughput (token/s): 108.30, #queue-req: 0
(See full example logs in the section below.)
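Below is a minimal sketch of how such log lines could be emitted. It is illustrative only: every name in it (BatchStats, log_prefill_batch, log_decode_batch, and the counter fields) is hypothetical, not an existing API in this project; the real implementation would pull these counters from the scheduler.

```python
import logging
from dataclasses import dataclass

# Match the proposed "[YYYY-MM-DD HH:MM:SS]" timestamp prefix.
logging.basicConfig(
    format="[%(asctime)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)

@dataclass
class BatchStats:
    """Hypothetical per-batch counters a scheduler might expose."""
    new_seqs: int = 0
    new_tokens: int = 0
    cached_tokens: int = 0
    num_tokens: int = 0
    token_usage: float = 0.0
    running_reqs: int = 0
    queued_reqs: int = 0
    cuda_graph: bool = False
    gen_throughput: float = 0.0

def log_prefill_batch(s: BatchStats) -> None:
    # Emits one line in the proposed prefill format.
    logger.info(
        "Prefill batch. #new-seq: %d, #new-token: %d, #cached-token: %d, "
        "token usage: %.2f, #running-req: %d, #queue-req: %d",
        s.new_seqs, s.new_tokens, s.cached_tokens,
        s.token_usage, s.running_reqs, s.queued_reqs,
    )

def log_decode_batch(s: BatchStats) -> None:
    # Emits one line in the proposed decode format.
    logger.info(
        "Decode batch. #running-req: %d, #token: %d, token usage: %.2f, "
        "cuda graph: %s, gen throughput (token/s): %.2f, #queue-req: %d",
        s.running_reqs, s.num_tokens, s.token_usage,
        s.cuda_graph, s.gen_throughput, s.queued_reqs,
    )

# Example: reproduces the first sample line above.
log_prefill_batch(BatchStats(new_seqs=1, new_tokens=25, cached_tokens=1))
```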
Example Log Output (from a test run)
[2025-12-08 03:57:29] Decode batch. #running-req: 1, #token: 283, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.07, #queue-req: 0
[2025-12-08 03:57:30] INFO: 10.40.32.80:34224 - "POST /v1/chat/completions HTTP/1.0" 200 OK
[2025-12-08 08:36:21] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 249, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-12-08 08:36:21] Decode batch. #running-req: 1, #token: 271, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.00, #queue-req: 0
... (additional log lines)
Alternatives
None.
Additional context
This feature would be especially useful for high-throughput serving environments and performance tuning.
The implementation should ideally allow the logging level/verbosity to be configurable, e.g., via an environment variable or config file (see the sketch after this list).
Reference: SGLang and vLLM both provide similar insightful logging.
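As a sketch of the configurability point above, verbosity could be read from the environment at startup. The variable name INFERENCE_LOG_LEVEL is an assumption for illustration, not an existing flag:

```python
import logging
import os

# Hypothetical environment variable; the project would pick its own name.
level_name = os.environ.get("INFERENCE_LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=getattr(logging, level_name, logging.INFO),
    format="[%(asctime)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
```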