Labels: Inference runtime, Pytorch, bug, waiting for feedback
Description
System Info
- System: NVIDIA DGX Spark
- CPU: ARM
- GPU: Blackwell GB10
- Unified memory: 128 GB
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Follow the quick-start recipe in the official TensorRT-LLM docs:
docker run --rm -it \
--ipc=host \
--gpus all \
-p 8001:8000 \
-v ~/.cache/model:/root/.cache:rw \
--name tensorrt_llm \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0 \
/bin/bash
Inside the container, run:
# using cutlass backend
EXTRA_LLM_API_FILE=/tmp/cutlass-config.yml
cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: true
cuda_graph_config:
enable_padding: true
max_batch_size: 720
moe_config:
backend: CUTLASS
stream_interval: 20
num_postprocess_workers: 4
attention_dp_config:
enable_balance: true
batching_wait_iters: 50
timeout_iters: 1
EOF
# start the server
trtllm-serve openai/gpt-oss-20b \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 720 \
--max_num_tokens 16384 \
--kv_cache_free_gpu_memory_fraction 0.9 \
--tp_size 1 \
--ep_size 1 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
The container errors out with:
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: The attention sinks is only supported on SM90.
Expected behavior
The server should start and the model should be usable.
Actual behavior
The server fails to start.
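For a concrete notion of "usable": once the server is up, trtllm-serve exposes an OpenAI-compatible API (mapped to host port 8001 by the docker run above). The sketch below builds a minimal chat-completions request body; the endpoint path and field names follow the standard OpenAI chat API, and the helper name is hypothetical.

```python
import json

def build_chat_request(model="openai/gpt-oss-20b", prompt="Hello"):
    """Build a minimal request body for /v1/chat/completions.

    Hypothetical smoke-test helper; field names follow the
    OpenAI-compatible API that trtllm-serve exposes.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }

if __name__ == "__main__":
    body = build_chat_request()
    print(json.dumps(body, indent=2))
    # Once the server is healthy, the body could be sent with e.g.:
    #   curl http://localhost:8001/v1/chat/completions \
    #     -H "Content-Type: application/json" -d "$(python this_script.py)"
```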
Additional notes
Also tried the TRTLLM MoE backend instead of CUTLASS:
cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
enable_padding: true
max_batch_size: 720
moe_config:
backend: TRTLLM
stream_interval: 20
num_postprocess_workers: 4
EOF
It also failed, with a different error: NotImplementedError: TRTLLMGenFusedMoE does not support SM120 and above.
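Both failures look like compute-capability gates tripping on the GB10's SM version. The sketch below is only an illustration of those two checks as quoted in the error messages (the function names are made up, not TensorRT-LLM internals); on real hardware the capability tuple would come from torch.cuda.get_device_capability().

```python
def sm_version(major: int, minor: int) -> int:
    """Convert a CUDA compute-capability tuple to the SMxx integer,
    e.g. (9, 0) -> 90 (Hopper), (12, 0) -> 120 (Blackwell-class)."""
    return major * 10 + minor

def attention_sinks_supported(sm: int) -> bool:
    # Per the first error: "attention sinks is only supported on SM90".
    return sm == 90

def trtllm_gen_moe_supported(sm: int) -> bool:
    # Per the second error: "TRTLLMGenFusedMoE does not support SM120 and above".
    return sm < 120

# Hard-coded (12, 0) for illustration; a Blackwell-class GPU reports
# a capability at or above this, so both checks fail.
sm = sm_version(12, 0)
print(attention_sinks_supported(sm), trtllm_gen_moe_supported(sm))  # False False
```

This matches what the issue reports: on this hardware, both the CUTLASS path (attention-sinks assertion) and the TRTLLM MoE path (SM120 check) are rejected.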
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.