Description
I run this:
trtllm-serve /tensorstuff/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_qwen3-32b_nvfp4_hf/ --host --port 8000 --backend pytorch
It outputs this:
--backend pytorch
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-06-28 17:32:17,461 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc3
[06/28/2025-17:32:17] [TRT-LLM] [I] Compute capability: (12, 0)
[06/28/2025-17:32:17] [TRT-LLM] [I] SM count: 170
[06/28/2025-17:32:17] [TRT-LLM] [I] SM clock: 3105 MHz
[06/28/2025-17:32:17] [TRT-LLM] [I] int4 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] int8 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] fp8 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] float16 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] float32 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] Total Memory: 31 GiB
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory clock: 14001 MHz
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory bus width: 512
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory bandwidth: 1792 GB/s
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe link width: 8
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe bandwidth: 2 GB/s
[06/28/2025-17:32:17] [TRT-LLM] [I] Set nccl_plugin to None.
[06/28/2025-17:32:17] [TRT-LLM] [I] Found /tensorstuff/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_qwen3-32b_nvfp4_hf/hf_quant_config.json, pre-quantized checkpoint is used.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting quant_algo=NVFP4 form HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting kv_cache_quant_algo=FP8 form HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting group_size=16 from HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting exclude_modules=['lm_head'] from HF quant config.
[06/28/2025-17:32:18] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=False, cuda_graph_batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], cuda_graph_max_batch_size=128, cuda_graph_padding_enabled=False, disable_overlap_scheduler=False, moe_max_num_tokens=None, attn_backend='TRTLLM', moe_backend='CUTLASS', mixed_sampler=False, enable_trtllm_sampler=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>)
rank 0 using MpiPoolSession to spawn MPI processes
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
It hangs at that point.
I see 577 MB of VRAM allocated on GPU 1 and that's it; no further activity for hours.
The CPU shows one or two cores firing occasionally. I tried a --verbose flag, but I cannot get any more information than this.
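If it helps with reproduction, here is roughly how I plan to rerun it with more logging and a single visible GPU, to rule out the multi-GPU path. TLLM_LOG_LEVEL is my assumption about how to raise TensorRT-LLM's log level (I have not confirmed it for 0.20.0rc3); CUDA_VISIBLE_DEVICES just pins the process to GPU 0.

# Assumed debug rerun: TLLM_LOG_LEVEL=DEBUG is an unverified guess at the log-level knob,
# CUDA_VISIBLE_DEVICES=0 restricts the run to a single RTX 5090
export TLLM_LOG_LEVEL=DEBUG
export CUDA_VISIBLE_DEVICES=0
trtllm-serve /tensorstuff/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_qwen3-32b_nvfp4_hf/ --host --port 8000 --backend pytorch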
nvidia-smi output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169                Driver Version: 570.169        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:21:00.0 Off |                  N/A |
|  0%   26C    P8             13W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090        On  |   00000000:22:00.0 Off |                  N/A |
|  0%   27C    P8             12W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 5090        On  |   00000000:61:00.0 Off |                  N/A |
|  0%   26C    P8              3W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 5090        On  |   00000000:62:00.0 Off |                  N/A |
|  0%   27C    P8              4W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
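Since the last thing logged is rank 0 spawning MPI processes via MpiPoolSession, I also want to rule out a broken MPI setup. A minimal sanity check, assuming mpi4py is installed (it comes in as a TensorRT-LLM dependency) and mpirun is on PATH:

# Hypothetical MPI sanity check: if this also hangs, the problem is likely the
# local MPI installation rather than TensorRT-LLM itself
mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank(), MPI.Get_processor_name())"

If that prints two ranks and exits cleanly, the MPI layer itself is probably fine and the hang is somewhere in the proxy/worker handshake.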