Description
System Info
- CPU Architecture: x86_64
- GPU Properties:
- Name: H100
- Memory: 80GB
- TensorRT-LLM Branch: v1.0.0rc1
- Versions: CUDA 12.8
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- My Script
export TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
base_model=/mnt/modelops/models/Qwen3-30B-A3B/Qwen__Qwen3-30B-A3B
eagle_model=/mnt/modelops/models/qwen3_30b_moe_eagle3_8b1ecfe164a4efb3
max_batch_size=32
max_num_tokens=32768
tp_size=8
cat <<EOF > extra_llm_api_options.yaml
kv_cache_config:
  enable_block_reuse: False
  free_gpu_memory_fraction: 0.85
use_cuda_graph: True
cuda_graph_max_batch_size: 1
cuda_graph_padding_enabled: True
attn_backend: TRTLLM
enable_iter_perf_stats: False
enable_iter_req_stats: False
print_iter_log: False
enable_chunked_prefill: False
disable_overlap_scheduler: False
dtype: auto
# EAGLE-3
speculative_config:
  decoding_type: Eagle
  max_draft_len: 4
  pytorch_weights_path: ${eagle_model}
#
EOF
trtllm-serve \
${base_model} \
--host 127.0.0.1 \
--port 9122 \
--tp_size ${tp_size} \
--max_batch_size ${max_batch_size} \
--max_num_tokens ${max_num_tokens} \
--kv_cache_free_gpu_memory_fraction 0.85 \
--log_level info \
--trust_remote_code \
--backend pytorch \
--extra_llm_api_options extra_llm_api_options.yaml
- curl command
curl http://127.0.0.1:9122/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "test",
"messages":[{"role": "user", "content": "Python bubble sort code."}],
"max_tokens": 128,
"stream":false
}'
- bad case
{"id":"chatcmpl-18967b0c68294089bd6188da8b8eac77","object":"chat.completion","created":1751878126,"model":"test","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n[…]\nOkay,颧, I控制 the user is askingDTV for Python bubble[method for bubble \nOkay, dri, Iyre, I need \n\n\nOkay,licted, the user ///</span> \n\nOkay, vk, the user站 is asking|\n\nOkay,一个职业, the \n\nOkay,ANTA, the站 is|\n\nOkay \nOkay, Tomas, the cott, theiris, the \n\nOkay,[method for bubble奶粉, \n\nOkay ///</span> \n\nOkay站 is|\n\nOkay \nOkay,iris, \n\nOkay,OAD, \n\nOkay站 is|\n\nOkay \n\nOkay,[method for \n\nOkay站 is","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":13,"total_tokens":137,"completion_tokens":124},"prompt_token_ids":null}
Expected behavior
right output:
{"id":"chatcmpl-eb60fe9f6ea44a079a0affc9b63044a8","object":"chat.completion","created":1751879316,"model":"test","choices":[{"index":0,"message":{"role":"assistant","content":"\nOkay, I need to write a Python code for bubble sort. Let me think about how bubble sort works. From what I remember, bubble sort is a simple sorting algorithm that repeatedly steps through the list, compares adjacent elements, and swaps them if they are in the wrong order. The pass through the list is repeated until the list is sorted.\n\nSo the basic steps are: iterate through the list, compare each pair of adjacent elements, and swap them if they are not in the correct order. Each pass moves the largest unsorted element to its correct position at the end of the list. That's why it's called bubble sort","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":13,"total_tokens":141,"completion_tokens":128},"prompt_token_ids":null}
actual behavior
{"id":"chatcmpl-18967b0c68294089bd6188da8b8eac77","object":"chat.completion","created":1751878126,"model":"test","choices":[{"index":0,"message":{"role":"assistant","content":"\n[…]\nOkay,颧, I控制 the user is askingDTV for Python bubble[method for bubble \nOkay, dri, Iyre, I need \n\n\nOkay,licted, the user /// \n\nOkay, vk, the user站 is asking|\n\nOkay,一个职业, the \n\nOkay,ANTA, the站 is|\n\nOkay \nOkay, Tomas, the cott, theiris, the \n\nOkay,[method for bubble奶粉, \n\nOkay /// \n\nOkay站 is|\n\nOkay \nOkay,iris, \n\nOkay,OAD, \n\nOkay站 is|\n\nOkay \n\nOkay,[method for \n\nOkay站 is","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":13,"total_tokens":137,"completion_tokens":124},"prompt_token_ids":null}
additional notes
none