Error with runner.generate in TensorRT-LLM 0.14.0 for Qwen Example #2452

@tedqu

Environment

•	Docker Image: nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
•	TensorRT-LLM Version: 0.14.0
•	Run Command:

python3 ../run.py \
    --input_text "你好,请问你叫什么?" \
    --max_output_len=50 \
    --tokenizer_dir /data/models/Qwen1.5-7B-Chat/ \
    --engine_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu/

•	Example Code: examples/qwen/run.py (as described in the README)

Description

While running the run.py script as described in the README of the examples/qwen/ directory, the following error occurs when runner.generate is invoked:

Error Traceback

Traceback (most recent call last):
File "/triton/TensorRT-LLM-release-0.14/examples/qwen/../run.py", line 887, in <module>
main(args)
File "/triton/TensorRT-LLM-release-0.14/examples/qwen/../run.py", line 711, in main
outputs = runner.generate(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 624, in generate
requests = [
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 625, in <listcomp>
trtllm.Request(
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
1. tensorrt_llm.bindings.executor.Request(input_token_ids: list[int], *, max_tokens: Optional[int] = None, max_new_tokens: Optional[int] = None, streaming: bool = False, sampling_config: tensorrt_llm.bindings.executor.SamplingConfig = SamplingConfig(), output_config: tensorrt_llm.bindings.executor.OutputConfig = OutputConfig(), end_id: Optional[int] = None, pad_id: Optional[int] = None, position_ids: Optional[list[int]] = None, bad_words: Optional[list[list[int]]] = None, stop_words: Optional[list[list[int]]] = None, embedding_bias: Optional[torch.Tensor] = None, external_draft_tokens_config: Optional[tensorrt_llm.bindings.executor.ExternalDraftTokensConfig] = None, prompt_tuning_config: Optional[tensorrt_llm.bindings.executor.PromptTuningConfig] = None, lora_config: Optional[tensorrt_llm.bindings.executor.LoraConfig] = None, lookahead_config: Optional[tensorrt_llm.bindings.executor.LookaheadDecodingConfig] = None, logits_post_processor_name: Optional[str] = None, encoder_input_token_ids: Optional[list[int]] = None, client_id: Optional[int] = None, return_all_generated_tokens: bool = False, priority: float = 0.5, type: tensorrt_llm.bindings.executor.RequestType = RequestType.REQUEST_TYPE_CONTEXT_AND_GENERATION, context_phase_params: Optional[tensorrt_llm.bindings.executor.ContextPhaseParams] = None, encoder_input_features: Optional[torch.Tensor] = None, encoder_output_length: Optional[int] = None, num_return_sequences: int = 1)

Invoked with: kwargs:
input_token_ids=[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 108386, 37945, 56007, 56568, 99882, 99245, 11319, 151645, 198, 151644, 77091, 198],
encoder_input_token_ids=None,
encoder_output_length=None,
encoder_input_features=None,
position_ids=None,
max_tokens=50,
num_return_sequences=None,
pad_id=151643,
end_id=151645,
stop_words=None,
bad_words=None,
sampling_config=<tensorrt_llm.bindings.executor.SamplingConfig object at 0x7f000502f830>,
lookahead_config=None,
streaming=False,
output_config=<tensorrt_llm.bindings.executor.OutputConfig object at 0x7f0001cca270>,
prompt_tuning_config=None,
lora_config=None,
return_all_generated_tokens=False,
logits_post_processor_name=None,
external_draft_tokens_config=None

Additional Context

The engine and tokenizer paths are configured as follows:
• --tokenizer_dir: /data/models/Qwen1.5-7B-Chat/
• --engine_dir: ./tmp/qwen/7B/trt_engines/fp16/1-gpu/

The engine appears to load successfully, as indicated by the log output:

[TensorRT-LLM][INFO] Engine version 0.14.0 found in the config file, assuming engine(s) built by new builder API.
...
[11/18/2024-02:33:18] [TRT-LLM] [I] Load engine takes: 12.188158512115479 sec

However, the error indicates an argument-type mismatch in the tensorrt_llm.bindings.executor.Request constructor. Comparing the invoked kwargs against the supported signature, num_return_sequences=None stands out: the binding declares num_return_sequences as a plain int with default 1, not Optional[int], so passing None triggers the TypeError.
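As a hedged sketch of a possible workaround (not an official fix), the caller could drop None-valued kwargs before constructing the Request, so that strict (non-Optional) binding parameters such as num_return_sequences fall back to their pybind11 defaults. The helper name drop_none_kwargs and the trimmed kwargs dict below are hypothetical, for illustration only:

```python
# Hypothetical sketch: filter out None-valued kwargs before calling a strict
# pybind11 constructor such as tensorrt_llm.bindings.executor.Request, whose
# num_return_sequences parameter is a plain int (default 1), not Optional[int].

def drop_none_kwargs(kwargs):
    """Return a copy of kwargs without None values, so non-Optional
    binding parameters use their declared defaults instead."""
    return {k: v for k, v in kwargs.items() if v is not None}

# Abbreviated version of the kwargs shown in the traceback above.
request_kwargs = {
    "input_token_ids": [151644, 8948],
    "max_tokens": 50,
    "end_id": 151645,
    "pad_id": 151643,
    "num_return_sequences": None,  # None is rejected by the int parameter
}

cleaned = drop_none_kwargs(request_kwargs)
# "num_return_sequences" is removed, so the binding default (1) would apply:
# trtllm.Request(**cleaned)
```

This only illustrates the shape of the mismatch; the proper fix would belong in model_runner_cpp.py (or an updated TensorRT-LLM release) rather than in user code.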

If more logs or information are needed, please let me know! Thank you!

Labels: triaged (issue has been triaged by maintainers)