Description
❓ Question
What you have already tried
Environment
Build information about Torch-TensorRT can be found by turning on debug messages
- PyTorch Version (e.g., 1.0): 2.8.0
- CPU Architecture: AMD
- OS (e.g., Linux): Ubuntu 22.04
- How you installed PyTorch (conda, pip, libtorch, source): pip
- Build command you used (if compiling from source): NO
- Are you using local sources or building from archives: NO
- Python version: 3.10
- CUDA version: 12.8
- GPU models and configuration: NVIDIA
- Any other relevant information: directly using the torch-tensorrt 2.8.0 wheel with the GitHub 2.8.0 tag to run tools/llm
Additional context
Hi there, I tried to use tools/llm with static_cache_v2 to run a Qwen2.5 model, using the following command:
python run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark
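For reference, the profile was captured with an Nsight Systems invocation along these lines (the output name and trace flags are illustrative, not the exact command):
nsys profile --trace=cuda,nvtx,osrt -o qwen_static_v2 python run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark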
When I profiled with Nsight Systems, I found that enabling static_cache_v2 adds launch overhead to the TensorRT engine in each prefill/decode block. Do you see this problem too? I think this overhead is too large; it makes Torch-TensorRT almost the same speed as just enabling torch.compile.
Here is the nsys profiling result: the red line shows approximately 1.7 ms of overhead with no GPU activity at all. (When static_cache_v2 is disabled there are no such bubbles, so maybe it comes from shape copies or other host-side operators introduced by static_cache_v2?)
[Nsight Systems screenshot: ~1.7 ms gap with no GPU activity between engine launches]
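To help localize the bubble, here is a minimal sketch of how one could wrap each decode step in an NVTX range so the gap shows up labeled on the Nsight Systems timeline. `model` and `input_ids` are hypothetical stand-ins for the compiled module and token IDs produced by tools/llm; this is not the actual run_llm.py code:

```python
import torch

def decode_with_nvtx(model, input_ids, num_tokens):
    # Greedy decode loop; each iteration is one TensorRT engine launch.
    tokens = input_ids
    for _ in range(num_tokens):
        with torch.cuda.nvtx.range("decode_step"):
            logits = model(tokens)                      # engine launch
        next_tok = logits[:, -1:].argmax(dim=-1)        # greedy next token
        tokens = torch.cat([tokens, next_tok], dim=-1)  # append and repeat
    return tokens
```

With this in place, the per-step NVTX ranges make it easy to see whether the ~1.7 ms gap falls inside the engine launch itself or in the host-side work between launches.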
Looking forward to your reply, thanks a lot!