
[Feature]: Add detailed inference logging similar to SGLang and vLLM #9778

@0xd8b


🚀 The feature, motivation and pitch

Description

I propose adding detailed, structured logging for the inference process (prefill/decode batches), similar to the excellent logging found in projects like SGLang and vLLM. This would greatly improve visibility into system performance and make debugging and monitoring easier.

Motivation & Expected Benefits

Debugging & Monitoring: Easily track request states, token usage, throughput, and CUDA graph status in real-time.

Performance Analysis: Monitor key metrics like #running-req, #queue-req, gen throughput (token/s), and token usage to identify bottlenecks (see the throughput sketch after this list).

Operational Clarity: Provides a clear, consistent log stream that helps developers and operators understand system behavior during inference.
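
For concreteness, here is a minimal Python sketch of one way the gen throughput (token/s) figure could be computed, as tokens generated per logging window. The ThroughputMeter name is hypothetical, not an existing TRT-LLM API:

```python
import time

class ThroughputMeter:
    """Tracks decode tokens per second over each logging window."""

    def __init__(self) -> None:
        self._last_time = time.monotonic()
        self._tokens_since_log = 0

    def add_tokens(self, n: int) -> None:
        # Called once per decode step with the number of tokens produced.
        self._tokens_since_log += n

    def snapshot(self) -> float:
        # Tokens/s since the previous snapshot; resets the window.
        now = time.monotonic()
        elapsed = max(now - self._last_time, 1e-9)
        tokens_per_s = self._tokens_since_log / elapsed
        self._last_time = now
        self._tokens_since_log = 0
        return tokens_per_s
```
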
Proposed Log Format Example

The logs should follow a structured, readable format like the sample below (inspired by the SGLang/vLLM style):

[YYYY-MM-DD HH:MM:SS] Prefill batch. #new-seq: 1, #new-token: 25, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0
[YYYY-MM-DD HH:MM:SS] Decode batch. #running-req: 1, #token: 283, token usage: 0.00, cuda graph: True, gen throughput (token/s): 108.30, #queue-req: 0
(See full example logs in the section below.)
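
As an illustration, here is a minimal Python sketch of a helper that emits these exact lines. The BatchStats fields and function names are hypothetical, not an existing TRT-LLM API; the `[YYYY-MM-DD HH:MM:SS]` prefix comes from the logging formatter:

```python
import logging
from dataclasses import dataclass

# The formatter supplies the "[YYYY-MM-DD HH:MM:SS]" timestamp prefix.
logging.basicConfig(format="[%(asctime)s] %(message)s",
                    datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)
logger = logging.getLogger("trtllm.batch")

@dataclass
class BatchStats:
    new_seq: int = 0          # sequences added in this prefill batch
    new_token: int = 0        # uncached prompt tokens to process
    cached_token: int = 0     # prompt tokens served from the KV cache
    token: int = 0            # total tokens held by running requests
    token_usage: float = 0.0  # fraction of KV-cache capacity in use
    running_req: int = 0
    queue_req: int = 0
    cuda_graph: bool = False
    gen_throughput: float = 0.0  # tokens/s, e.g. from a ThroughputMeter

def log_prefill(s: BatchStats) -> None:
    logger.info(
        "Prefill batch. #new-seq: %d, #new-token: %d, #cached-token: %d, "
        "token usage: %.2f, #running-req: %d, #queue-req: %d",
        s.new_seq, s.new_token, s.cached_token,
        s.token_usage, s.running_req, s.queue_req)

def log_decode(s: BatchStats) -> None:
    logger.info(
        "Decode batch. #running-req: %d, #token: %d, token usage: %.2f, "
        "cuda graph: %s, gen throughput (token/s): %.2f, #queue-req: %d",
        s.running_req, s.token, s.token_usage,
        s.cuda_graph, s.gen_throughput, s.queue_req)
```

With the config above, `log_decode(BatchStats(running_req=1, token=283, cuda_graph=True, gen_throughput=108.30))` reproduces the decode line from the sample.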

Example Log Output (from a test run)

[2025-12-08 03:57:29] Decode batch. #running-req: 1, #token: 283, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.07, #queue-req: 0
[2025-12-08 03:57:30] INFO:     10.40.32.80:34224 - "POST /v1/chat/completions HTTP/1.0" 200 OK
[2025-12-08 08:36:21] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 249, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-12-08 08:36:21] Decode batch. #running-req: 1, #token: 271, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.00, #queue-req: 0
... (additional log lines)

Alternatives

None.

Additional context

This feature would be especially useful for high-throughput serving environments and performance tuning.

The implementation should ideally allow the logging level/verbosity to be configurable (e.g., via an environment variable or a config file), as sketched below.
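
A sketch of the environment-variable variant; the variable name TLLM_BATCH_LOG_LEVEL is an assumption for illustration, not an existing TRT-LLM setting:

```python
import logging
import os

# Hypothetical env var; TRT-LLM would choose its own name or config key.
level_name = os.environ.get("TLLM_BATCH_LOG_LEVEL", "INFO").upper()
logging.getLogger("trtllm.batch").setLevel(
    getattr(logging, level_name, logging.INFO))
```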

Reference: SGLang and vLLM both provide similarly detailed logging.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • Inference runtime<NV>: General operational aspects of TRTLLM execution not in other categories.
  • feature request: New feature or request. This includes new model, dtype, functionality support.
  • waiting for feedback
