TensorRT just hangs when starting #4501

@Notbici

Description

I run this:
trtllm-serve /tensorstuff/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_qwen3-32b_nvfp4_hf/ --host --port 8000 --backend pytorch

It outputs this:

<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-06-28 17:32:17,461 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc3
[06/28/2025-17:32:17] [TRT-LLM] [I] Compute capability: (12, 0)
[06/28/2025-17:32:17] [TRT-LLM] [I] SM count: 170
[06/28/2025-17:32:17] [TRT-LLM] [I] SM clock: 3105 MHz
[06/28/2025-17:32:17] [TRT-LLM] [I] int4 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] int8 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] fp8 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] float16 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] float32 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] Total Memory: 31 GiB
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory clock: 14001 MHz
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory bus width: 512
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory bandwidth: 1792 GB/s
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe link width: 8
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe bandwidth: 2 GB/s
[06/28/2025-17:32:17] [TRT-LLM] [I] Set nccl_plugin to None.
[06/28/2025-17:32:17] [TRT-LLM] [I] Found /tensorstuff/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_qwen3-32b_nvfp4_hf/hf_quant_config.json, pre-quantized checkpoint is used.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting quant_algo=NVFP4 form HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting kv_cache_quant_algo=FP8 form HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting group_size=16 from HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting exclude_modules=['lm_head'] from HF quant config.
[06/28/2025-17:32:18] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=False, cuda_graph_batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], cuda_graph_max_batch_size=128, cuda_graph_padding_enabled=False, disable_overlap_scheduler=False, moe_max_num_tokens=None, attn_backend='TRTLLM', moe_backend='CUTLASS', mixed_sampler=False, enable_trtllm_sampler=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>)
rank 0 using MpiPoolSession to spawn MPI processes
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue

It hangs at that point.

I see 577 MB of VRAM allocated on GPU 1, and that's it; no further activity for hours.

On the CPU, one or two cores fire up every now and then. I tried a --verbose flag but cannot get any more information than this.
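
In case it helps, here is roughly what I plan to try next to get more detail on where it is stuck (TLLM_LOG_LEVEL and py-spy are my own guesses at extra diagnostics, not something the tool suggested):

# Ask TensorRT-LLM for more verbose logging on the next run (assuming this env var is honoured by this build)
export TLLM_LOG_LEVEL=DEBUG

# Once it hangs again, list the server process and the spawned MPI ranks...
pgrep -af trtllm-serve

# ...and dump the Python stack of each one to see where it is blocked (replace <PID> with each PID found above)
py-spy dump --pid <PID>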

NVIDIA-SMI

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169                Driver Version: 570.169        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:21:00.0 Off |                  N/A |
|  0%   26C    P8             13W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090        On  |   00000000:22:00.0 Off |                  N/A |
|  0%   27C    P8             12W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 5090        On  |   00000000:61:00.0 Off |                  N/A |
|  0%   26C    P8              3W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 5090        On  |   00000000:62:00.0 Off |                  N/A |
|  0%   27C    P8              4W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Metadata

Labels: Investigating (issue is under investigation by TensorRT devs), Module:Runtime (other generic runtime issues that do not fall into other modules)
