[Bug]: trying to run the tensorrt-llm offline example, but the program hangs #9766

@w58296

Description

System Info

  • CPU: i5-14000kf
  • GPU: RTX 4070 Super
  • Python: 3.10
  • nvidia-smi: NVIDIA-SMI 580.105.08, Driver Version: 572.16, CUDA Version: 12.8
  • nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2025 NVIDIA Corporation
    Built on Wed_Jan_15_19:20:09_PST_2025
    Cuda compilation tools, release 12.8, V12.8.61
    Build cuda_12.8.r12.8/compiler.35404655_0
  • tensorrt-llm: 1.0.0
  • torch: 2.7.1
  • torchprofile: 0.0.4
  • torchvision: 0.22.1
  • tensorrt: 10.11.0.33

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import os
os.environ["TLLM_LLM_ENABLE_DEBUG"] = "1"
from tensorrt_llm import LLM, SamplingParams


def main():

    # The model can be an HF model name, a path to a local HF model,
    # or a TensorRT Model Optimizer quantized checkpoint such as
    # nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", backend="pytorch")

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create the sampling parameters.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, sampling_params):
        print(
            f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}"
        )

    # Example output from the official quickstart; note it includes a fourth
    # prompt ('The president of the United States is') not used above.
    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
    # Prompt: 'The capital of France is', Generated text: 'Paris.'
    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'


if __name__ == '__main__':
    main()
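
To locate where the process is stuck, one option (a minimal sketch using only the standard library, not part of the original script) is to have faulthandler dump every thread's stack on a timer, added near the top of the script:

import faulthandler

# Dump every thread's Python stack to stderr after 60 seconds and keep
# repeating, so the frame that never returns shows up in the traces.
faulthandler.dump_traceback_later(60, repeat=True)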

Expected behavior

Successfully get model output for each prompt.

Actual behavior

/home/rookie/Qwen/.venv/bin/python3.10 /home/rookie/Qwen/trt_llm_demo.py
:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-12-07 13:29:44] INFO config.py:54: PyTorch version 2.7.1 available.
LLM debug mode enabled.
[12/07/2025-13:29:46] [TRT-LLM] [I] Starting TensorRT LLM init.
2025-12-07 13:29:46,319 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[12/07/2025-13:29:46] [TRT-LLM] [I] TensorRT LLM inited.
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[12/07/2025-13:29:46] [TRT-LLM] [I] Using LLM with PyTorch backend
[12/07/2025-13:29:46] [TRT-LLM] [W] Using default gpus_per_node: 1
[12/07/2025-13:29:46] [TRT-LLM] [I] Set nccl_plugin to None.
[12/07/2025-13:29:46] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
LLM.args.mpi_session: None
/home/rookie/Qwen/.venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[12/07/2025-13:29:47] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], cuda_graph_max_batch_size=128, cuda_graph_padding_enabled=False, disable_overlap_scheduler=False, moe_max_num_tokens=None, moe_load_balancer=None, attention_dp_enable_balance=False, attention_dp_time_out_iters=50, attention_dp_batching_wait_iters=10, attn_backend='TRTLLM', moe_backend='CUTLASS', enable_mixed_sampler=False, enable_trtllm_sampler=False, kv_cache_dtype='auto', enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_enable_userbuffers=True, torch_compile_max_num_streams=1, enable_autotuner=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>, enable_min_latency=False, allreduce_strategy='AUTO', stream_interval=1, force_dynamic_quantization=False, _limit_torch_cuda_mem_fraction=True)
create pool session ...
rank 0 using MpiPoolSession to spawn MPI processes
Server [proxy_request_queue] bound to tcp://127.0.0.1:45799 in PAIR
[12/07/2025-13:29:47] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
Server [worker_init_status_queue] bound to tcp://127.0.0.1:45971 in PAIR
[12/07/2025-13:29:47] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
Server [proxy_result_queue] bound to tcp://127.0.0.1:46275 in PAIR
[12/07/2025-13:29:47] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
Server [proxy_stats_queue] bound to tcp://127.0.0.1:45893 in PAIR
[12/07/2025-13:29:47] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
Server [proxy_kv_cache_events_queue] bound to tcp://127.0.0.1:45993 in PAIR
[12/07/2025-13:29:47] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue

Additional notes
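
The log stops right after "rank 0 using MpiPoolSession to spawn MPI processes" and the ZMQ proxy queues are bound, which suggests the spawned MPI worker never reports back through worker_init_status_queue. A minimal sketch to check whether MPI dynamic process spawning works at all in this environment, assuming MpiPoolSession sits on top of mpi4py (a hypothetical standalone script, not part of TensorRT-LLM):

# check_mpi_spawn.py -- if this also hangs, the problem is likely in the
# local MPI installation rather than in TensorRT-LLM itself.
from mpi4py.futures import MPIPoolExecutor

def square(x):
    return x * x

if __name__ == "__main__":
    # MPIPoolExecutor spawns its worker processes through MPI dynamic
    # process management, the same mechanism a pool session relies on.
    with MPIPoolExecutor(max_workers=1) as executor:
        print(list(executor.map(square, [1, 2, 3])))

Run it as a plain Python script (python check_mpi_spawn.py); mpi4py's futures module performs the spawn itself.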

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
