
Performance Issue when using tools/llm #3803


Description

@ChiikawaSama

❓ Question

What you have already tried

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • PyTorch Version (e.g., 1.0): 2.8.0
  • CPU Architecture: AMD
  • OS (e.g., Linux): Ubuntu 22.04
  • How you installed PyTorch (conda, pip, libtorch, source): pip
  • Build command you used (if compiling from source): N/A (not built from source)
  • Are you using local sources or building from archives: No (prebuilt wheel)
  • Python version: 3.10
  • CUDA version: 12.8
  • GPU models and configuration: NVIDIA
  • Any other relevant information: used the torch-tensorrt 2.8.0 wheel directly, together with the GitHub 2.8.0 tag of the repository, to run tools/llm

Additional context

Hi there, I tried to use tools/llm with static_cache_v2 to run a Qwen2.5 model, using the following command:

python run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark

When I profiled with Nsight Systems, I found that using static_cache_v2 adds launch overhead to the TensorRT engine in each prefill/decode block. Do you see this problem too? I think this overhead is too large; it makes Torch-TensorRT almost the same speed as just enabling torch.compile.
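To quantify this outside of nsys, I also compared wall-clock time per step against GPU time per step; a large difference points at host-side launch/bookkeeping cost rather than kernel time. This is only a rough sketch under my own assumptions: `decode_step` is a placeholder for whatever callable the benchmark invokes per token and is not taken from tools/llm.

```python
# Rough sketch (not from tools/llm): compare wall-clock vs. GPU time per decode step.
# `decode_step` is a placeholder for whatever callable runs one decode iteration.
import time
import torch

def measure_step_overhead(decode_step, iters=50):
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    wall_ms = gpu_ms = 0.0
    torch.cuda.synchronize()
    for _ in range(iters):
        t0 = time.perf_counter()
        start_evt.record()
        decode_step()
        end_evt.record()
        torch.cuda.synchronize()
        wall_ms += (time.perf_counter() - t0) * 1e3
        gpu_ms += start_evt.elapsed_time(end_evt)
    # If wall time per step is much larger than GPU time, the gap is host-side
    # (engine launch, shape handling, cache bookkeeping) rather than kernel time.
    print(f"wall: {wall_ms / iters:.3f} ms/step, gpu: {gpu_ms / iters:.3f} ms/step")
```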

Here is the nsys profiling result: the red line shows approximately 1.7 ms of overhead with no GPU activity at all (when static_cache_v2 is disabled there are no such bubbles; maybe it is caused by shape copies or other operators introduced by static_cache_v2?).

[Nsight Systems timeline screenshot: ~1.7 ms gap with no GPU activity between prefill/decode blocks]
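To pin down what fills that gap, one option would be to bracket the suspected host-side code with NVTX ranges so it shows up as a named span in the nsys timeline next to the bubble. This is only a sketch under my own assumptions: `nvtx_range` is a helper I wrote for illustration, and the "static_cache_update" label does not correspond to any actual name in tools/llm.

```python
# Sketch: wrap suspected host-side work in NVTX ranges so it appears as a named
# span in the Nsight Systems timeline. Re-run profiling with NVTX tracing on, e.g.:
#   nsys profile -t cuda,nvtx -o report python run_llm.py ... --cache static_v2
import contextlib
import torch

@contextlib.contextmanager
def nvtx_range(name):
    torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        torch.cuda.nvtx.range_pop()

# Usage: inside the generation loop, wrap the region suspected of causing the
# bubble (e.g. the cache copy / shape handling between engine launches):
#
# with nvtx_range("static_cache_update"):
#     ...  # suspected host-side work
```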

Looking forward to your reply, thanks a lot!
