Closed as not planned
Labels
Inference runtime (general operational aspects of TRT-LLM execution not in other categories), Pytorch (Pytorch backend related issues), bug (something isn't working), stale, waiting for feedback
Description
System Info
I run this:
trtllm-serve /tensorstuff/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_qwen3-32b_nvfp4_hf/ --host --port 8000 --backend pytorch
It outputs this:
--backend pytorch
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-06-28 17:32:17,461 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc3
[06/28/2025-17:32:17] [TRT-LLM] [I] Compute capability: (12, 0)
[06/28/2025-17:32:17] [TRT-LLM] [I] SM count: 170
[06/28/2025-17:32:17] [TRT-LLM] [I] SM clock: 3105 MHz
[06/28/2025-17:32:17] [TRT-LLM] [I] int4 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] int8 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] fp8 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] float16 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] float32 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] Total Memory: 31 GiB
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory clock: 14001 MHz
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory bus width: 512
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory bandwidth: 1792 GB/s
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe link width: 8
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe bandwidth: 2 GB/s
[06/28/2025-17:32:17] [TRT-LLM] [I] Set nccl_plugin to None.
[06/28/2025-17:32:17] [TRT-LLM] [I] Found /tensorstuff/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_qwen3-32b_nvfp4_hf/hf_quant_config.json, pre-quantized checkpoint is used.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting quant_algo=NVFP4 form HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting kv_cache_quant_algo=FP8 form HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting group_size=16 from HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting exclude_modules=['lm_head'] from HF quant config.
[06/28/2025-17:32:18] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=False, cuda_graph_batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], cuda_graph_max_batch_size=128, cuda_graph_padding_enabled=False, disable_overlap_scheduler=False, moe_max_num_tokens=None, attn_backend='TRTLLM', moe_backend='CUTLASS', mixed_sampler=False, enable_trtllm_sampler=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>)
rank 0 using MpiPoolSession to spawn MPI processes
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
It hangs at that point. I see 577 MB of VRAM allocated on GPU 1, and that's it; no further activity for hours. The CPU shows one or two cores firing off every now and then. I tried the --verbose flag but could not get any more information than this.
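Since nothing more is logged, a per-thread stack dump of the hung process would show where it is blocked. One generic way to get that (plain Python stdlib, not a TRT-LLM feature) is the `faulthandler` module: register it against a signal early in the entry point, then trigger a dump with `kill -USR1 <pid>` once the server hangs. A minimal sketch of the dump mechanism itself:

```python
# Sketch: dump the stacks of all Python threads -- the same information a
# debugger would show for the hung trtllm-serve process. faulthandler is
# stdlib; in a real debug session you would call
# faulthandler.register(signal.SIGUSR1) at startup and send SIGUSR1 later.
import faulthandler
import tempfile
import threading
import time

def worker():
    time.sleep(5)  # stand-in for a thread stuck in a blocking call

threading.Thread(target=worker, daemon=True).start()
time.sleep(0.1)  # give the worker time to reach its blocking call

# faulthandler writes to a real file descriptor, so use a temp file
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    trace = f.read()

print(trace)  # each thread appears with its current file/line/function
```

Alternatively, a sampling tool like py-spy (`py-spy dump --pid <pid>`) can attach to the already-running process without modifying it.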
NVIDIA-SMI
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169 Driver Version: 570.169 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 On | 00000000:21:00.0 Off | N/A |
| 0% 26C P8 13W / 575W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 On | 00000000:22:00.0 Off | N/A |
| 0% 27C P8 12W / 575W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 5090 On | 00000000:61:00.0 Off | N/A |
| 0% 26C P8 3W / 575W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 5090 On | 00000000:62:00.0 Off | N/A |
| 0% 27C P8 4W / 575W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Run TensorRT-LLM with RTX 5090s on the latest drivers.
I followed the steps in the examples folder to run a Hugging Face model on TensorRT-LLM: I took Qwen3-32B, quantized it to NVFP4, and ran it.
Expected behavior
The model loads and the server starts serving.
Actual behavior
The server hangs after generating the HMAC keys.
Additional notes
No real output from TensorRT-LLM to help with, sorry.
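For anyone else hitting a silent hang like this: TensorRT-LLM's log verbosity can reportedly also be raised via the TLLM_LOG_LEVEL environment variable (an assumption worth checking against your installed version, since --verbose alone produced nothing here). A sketch:

```shell
# Assumption: TRT-LLM honors the TLLM_LOG_LEVEL environment variable
# (values such as DEBUG or VERBOSE); verify against your installed version.
export TLLM_LOG_LEVEL=DEBUG
echo "TLLM_LOG_LEVEL=$TLLM_LOG_LEVEL"

# Then rerun the original command under it, e.g.:
# trtllm-serve <model_dir> --host --port 8000 --backend pytorch
```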