Labels: Inference runtime, Pytorch, bug, waiting for feedback
Description
System Info
- System: NVIDIA DGX Spark
- CPU: ARM
- GPU: Blackwell GB10
- Unified memory: 128 GB
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Follow the quick-start recipe in the official TensorRT-LLM docs:
docker run --rm -it \
--ipc=host \
--gpus all \
-p 8001:8000 \
-v ~/.cache/model:/root/.cache:rw \
--name tensorrt_llm \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0 \
/bin/bash
Inside the container, run:
# using cutlass backend
EXTRA_LLM_API_FILE=/tmp/cutlass-config.yml
cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: true
cuda_graph_config:
enable_padding: true
max_batch_size: 720
moe_config:
backend: CUTLASS
stream_interval: 20
num_postprocess_workers: 4
attention_dp_config:
enable_balance: true
batching_wait_iters: 50
timeout_iters: 1
EOF
# start the server
trtllm-serve openai/gpt-oss-20b \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 720 \
--max_num_tokens 16384 \
--kv_cache_free_gpu_memory_fraction 0.9 \
--tp_size 1 \
--ep_size 1 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
The container errors out with:
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: The attention sinks is only supported on SM90.
Expected behavior
The server should start and the model should be usable.
Actual behavior
The server fails to start.
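For a concrete notion of "usable": once the server is up, trtllm-serve exposes an OpenAI-compatible API (mapped to host port 8001 by the docker run above). The sketch below builds a minimal chat-completions request body; the endpoint path and field names follow the standard OpenAI chat API, and the helper name is hypothetical.

```python
import json

def build_chat_request(model="openai/gpt-oss-20b", prompt="Hello"):
    """Build a minimal request body for /v1/chat/completions.

    Hypothetical smoke-test helper; field names follow the
    OpenAI-compatible API that trtllm-serve exposes.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }

if __name__ == "__main__":
    body = build_chat_request()
    print(json.dumps(body, indent=2))
    # Once the server is healthy, the body could be sent with e.g.:
    #   curl http://localhost:8001/v1/chat/completions \
    #     -H "Content-Type: application/json" -d "$(python this_script.py)"
```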
Additional notes
Also tried the TRTLLM MoE backend instead of CUTLASS:
cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
enable_padding: true
max_batch_size: 720
moe_config:
backend: TRTLLM
stream_interval: 20
num_postprocess_workers: 4
EOF
It also failed, with a different error: NotImplementedError: TRTLLMGenFusedMoE does not support SM120 and above.
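Both failures look like compute-capability gates tripping on the GB10's SM version. The sketch below is only an illustration of those two checks as quoted in the error messages (the function names are made up, not TensorRT-LLM internals); on real hardware the capability tuple would come from torch.cuda.get_device_capability().

```python
def sm_version(major: int, minor: int) -> int:
    """Convert a CUDA compute-capability tuple to the SMxx integer,
    e.g. (9, 0) -> 90 (Hopper), (12, 0) -> 120 (Blackwell-class)."""
    return major * 10 + minor

def attention_sinks_supported(sm: int) -> bool:
    # Per the first error: "attention sinks is only supported on SM90".
    return sm == 90

def trtllm_gen_moe_supported(sm: int) -> bool:
    # Per the second error: "TRTLLMGenFusedMoE does not support SM120 and above".
    return sm < 120

# Hard-coded (12, 0) for illustration; a Blackwell-class GPU reports
# a capability at or above this, so both checks fail.
sm = sm_version(12, 0)
print(attention_sinks_supported(sm), trtllm_gen_moe_supported(sm))  # False False
```

This matches what the issue reports: on this hardware, both the CUTLASS path (attention-sinks assertion) and the TRTLLM MoE path (SM120 check) are rejected.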
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.