[Bug]: Can't run GPT-OSS models on DGX Spark #8474

Description

@kalelegs

System Info

  • System: Nvidia DGX Spark
  • CPU: ARM
  • GPU architecture: Blackwell GB10
  • Unified memory: 128 GB

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Follow the quick start recipe in the official TensorRT-LLM docs:

docker run --rm -it \
    --ipc=host \
    --gpus all \
    -p 8001:8000 \
    -v ~/.cache/model:/root/.cache:rw \
    --name tensorrt_llm \
    nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0 \
    /bin/bash

Inside the container, run:

# Using the CUTLASS backend
EXTRA_LLM_API_FILE=/tmp/cutlass-config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 720
moe_config:
  backend: CUTLASS
stream_interval: 20
num_postprocess_workers: 4
attention_dp_config:
  enable_balance: true
  batching_wait_iters: 50
  timeout_iters: 1
EOF


# start the server
trtllm-serve openai/gpt-oss-20b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 720 \
    --max_num_tokens 16384 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --tp_size 1 \
    --ep_size 1 \
    --trust_remote_code \
    --extra_llm_api_options ${EXTRA_LLM_API_FILE}

The container errors out with:

RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: The attention sinks is only supported on SM90.

Expected behavior

The server should start and the model should be usable.
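Concretely, a successful start would mean an OpenAI-compatible endpoint answering on the mapped host port (8001 per the `docker run` command above). A minimal client sketch of what that would look like; the actual request is left commented out since the server never comes up in this report:

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request against the mapped
# host port (8001 per the docker run command above).
payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}
body = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8001/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is actually up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```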

Actual behavior

The server fails to start.

Additional notes

Also tried with the TRTLLM backend instead of CUTLASS:

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 720
moe_config:
  backend: TRTLLM
stream_interval: 20
num_postprocess_workers: 4
EOF

It also failed, with a different error: NotImplementedError: TRTLLMGenFusedMoE does not support SM120 and above.
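Both failures appear to be gated on the GPU's SM (streaming multiprocessor) version, which is conventionally derived from the CUDA compute capability as major * 10 + minor: the attention-sink kernel is asserted to SM90 (Hopper) only, while the GB10 is a Blackwell part whose capability (assumed to be 12.1 here, i.e. SM121; not confirmed by the logs) also trips the "SM120 and above" check. A small sketch of the mapping, under those assumptions:

```python
def sm_version(major: int, minor: int) -> int:
    """Map a CUDA compute capability (major, minor) to the SM number
    used in messages like 'only supported on SM90'."""
    return major * 10 + minor

# Hopper H100: compute capability (9, 0) -> SM90, the only arch the
# attention-sink assertion accepts.
print(sm_version(9, 0))   # 90
# DGX Spark GB10 (Blackwell): assumed capability (12, 1) -> SM121,
# which is >= SM120 and so also rejected by the TRTLLM MoE backend.
print(sm_version(12, 1))  # 121
# On the actual machine the capability could be confirmed with:
#   python -c "import torch; print(torch.cuda.get_device_capability())"
```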

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels

Inference runtime<NV>, Pytorch<NV>, bug, waiting for feedback
