[Blackwell/DGX Spark] Nemotron-3-Nano NVFP4: Illegal Instruction (cudaErrorIllegalInstruction) in vLLM V1 Engine #125
Description
Hardware & Software Environment
System: NVIDIA DGX Spark
GPU Architecture: Blackwell sm_121 (72GB LPDDR5X unified memory)
OS/Driver: NVIDIA DGX OS (latest 2026 stable)
Container: vllm/vllm-openai:v0.17.1-cu130
Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
Architecture: Hybrid Mamba-2 + MoE + Attention
Problem Description
The official NVFP4 checkpoint for Nemotron-3-Nano triggers a fatal cudaErrorIllegalInstruction on Blackwell hardware when served via vLLM's new V1 engine.
The error appears to be tied to CUDA Graph capture for batch sizes > 1. While the model initializes and serves single requests, it consistently crashes when the engine attempts to synchronize/record graphs for concurrent batches. Given that this model utilizes a unique hybrid architecture (Mamba-2 + Attention) and the new NVFP4 quantization format, this suggests a kernel-level incompatibility or an invalid instruction being generated for the sm_121 architecture during graph replay.
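The crash boundary can be probed from the host with a small script that sends one request, then several concurrent ones, against the OpenAI-compatible endpoint. This is an illustrative sketch (stdlib only, not vLLM code); the URL and served model name match the `vllm serve` command below, and `build_request`/`probe` are hypothetical helper names:

```python
# Minimal host-side probe: a single request succeeds, but concurrent
# requests push the engine into CUDA graph paths for batch sizes > 1,
# which is where the illegal instruction is reported.
# Endpoint/model match the serve command in this report; everything
# else is an illustrative sketch, not vLLM code.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://127.0.0.1:8000/v1/completions"

def build_request(prompt: str, model: str = "nemotron-3-nano") -> dict:
    # OpenAI-compatible /v1/completions payload served by vLLM.
    return {"model": model, "prompt": prompt, "max_tokens": 32}

def post_one(payload: dict) -> str:
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return f"HTTP {resp.status}"
    except Exception as exc:
        # A dead engine shows up as connection refused/reset here.
        return type(exc).__name__

def probe(concurrency: int) -> list[str]:
    payloads = [build_request(f"request {i}") for i in range(concurrency)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(post_one, payloads))

# Usage (against a running server):
#   probe(1)  -> completes on this setup
#   probe(8)  -> engine crashes; subsequent calls see connection errors
```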
Steps to Reproduce
- Launch the vLLM Server (Targeting DGX Spark):
docker pull vllm/vllm-openai:v0.17.1-cu130
docker run --gpus all \
--ipc=host \
--entrypoint vllm \
-v ~/models:/workspace/models \
-p 8000:8000 \
-e VLLM_NVFP4_GEMM_BACKEND=marlin \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
-e VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm \
vllm/vllm-openai:v0.17.1-cu130 \
serve /workspace/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--served-model-name nemotron-3-nano \
--host 0.0.0.0 \
--port 8000 \
--dtype auto \
--kv-cache-dtype fp8 \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--max-model-len 1048576 \
--enable-chunked-prefill \
--attention-backend TRITON_ATTN \
--reasoning-parser-plugin /workspace/models/nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3
- Execute Benchmark (Host side):
Run llama-benchy with concurrency 8 to force batching and graph capture:
uvx --with torch llama-benchy \
--base-url http://127.0.0.1:8000/v1 \
--model ~/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--served-model-name nemotron-3-nano \
--concurrency 8 \
--pp 2048 \
--tg 32
Actual Output / Error Logs
The server crashes during the benchmark with:
(EngineCore_DP0 pid=99) ERROR [core.py:1102] Traceback:
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 251, in get_output
self.async_copy_ready_event.synchronize()
torch.AcceleratorError: CUDA error: an illegal instruction was encountered
Note: Full logs indicate cudagraph_capture_sizes: [1, 2, 4, 8, 16] was active.
Workaround
The model is stable on the DGX Spark only if the V1 engine is disabled and eager mode is enforced:
VLLM_USE_V1=0
--enforce-eager
Disabling asynchronous scheduling (--no-async-scheduling) also appears to mitigate some stability issues with the V1 engine on this hardware.