Description
I run this:
trtllm-serve /tensorstuff/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_qwen3-32b_nvfp4_hf/ --host --port 8000 --backend pytorch
It outputs this:
--backend pytorch
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-06-28 17:32:17,461 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc3
[06/28/2025-17:32:17] [TRT-LLM] [I] Compute capability: (12, 0)
[06/28/2025-17:32:17] [TRT-LLM] [I] SM count: 170
[06/28/2025-17:32:17] [TRT-LLM] [I] SM clock: 3105 MHz
[06/28/2025-17:32:17] [TRT-LLM] [I] int4 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] int8 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] fp8 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] float16 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] float32 TFLOPS: 0
[06/28/2025-17:32:17] [TRT-LLM] [I] Total Memory: 31 GiB
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory clock: 14001 MHz
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory bus width: 512
[06/28/2025-17:32:17] [TRT-LLM] [I] Memory bandwidth: 1792 GB/s
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe link width: 8
[06/28/2025-17:32:17] [TRT-LLM] [I] PCIe bandwidth: 2 GB/s
[06/28/2025-17:32:17] [TRT-LLM] [I] Set nccl_plugin to None.
[06/28/2025-17:32:17] [TRT-LLM] [I] Found /tensorstuff/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_qwen3-32b_nvfp4_hf/hf_quant_config.json, pre-quantized checkpoint is used.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting quant_algo=NVFP4 form HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting kv_cache_quant_algo=FP8 form HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting group_size=16 from HF quant config.
[06/28/2025-17:32:17] [TRT-LLM] [I] Setting exclude_modules=['lm_head'] from HF quant config.
[06/28/2025-17:32:18] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=False, cuda_graph_batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], cuda_graph_max_batch_size=128, cuda_graph_padding_enabled=False, disable_overlap_scheduler=False, moe_max_num_tokens=None, attn_backend='TRTLLM', moe_backend='CUTLASS', mixed_sampler=False, enable_trtllm_sampler=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>)
rank 0 using MpiPoolSession to spawn MPI processes
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[06/28/2025-17:32:18] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
It hangs at that point.
I see 577 MB of VRAM allocated on GPU 1 and that's it; no further activity for hours.
The CPU shows one or two cores firing occasionally. I tried a --verbose flag, but I cannot get any more information than this.
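If it helps with reproduction, here is roughly how I plan to rerun it with more logging and a single visible GPU, to rule out the multi-GPU path. TLLM_LOG_LEVEL is my assumption about how to raise TensorRT-LLM's log level (I have not confirmed it for 0.20.0rc3); CUDA_VISIBLE_DEVICES just pins the process to GPU 0.

# Assumed debug rerun: TLLM_LOG_LEVEL=DEBUG is an unverified guess at the log-level knob,
# CUDA_VISIBLE_DEVICES=0 restricts the run to a single RTX 5090
export TLLM_LOG_LEVEL=DEBUG
export CUDA_VISIBLE_DEVICES=0
trtllm-serve /tensorstuff/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_qwen3-32b_nvfp4_hf/ --host --port 8000 --backend pytorch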
nvidia-smi output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169                Driver Version: 570.169        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:21:00.0 Off |                  N/A |
|  0%   26C    P8             13W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090        On  |   00000000:22:00.0 Off |                  N/A |
|  0%   27C    P8             12W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 5090        On  |   00000000:61:00.0 Off |                  N/A |
|  0%   26C    P8              3W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 5090        On  |   00000000:62:00.0 Off |                  N/A |
|  0%   27C    P8              4W /  575W |       0MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
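Since the last thing logged is rank 0 spawning MPI processes via MpiPoolSession, I also want to rule out a broken MPI setup. A minimal sanity check, assuming mpi4py is installed (it comes in as a TensorRT-LLM dependency) and mpirun is on PATH:

# Hypothetical MPI sanity check: if this also hangs, the problem is likely the
# local MPI installation rather than TensorRT-LLM itself
mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank(), MPI.Get_processor_name())"

If that prints two ranks and exits cleanly, the MPI layer itself is probably fine and the hang is somewhere in the proxy/worker handshake.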