[Blackwell/DGX Spark] Nemotron-3-Nano NVFP4: Illegal Instruction (cudaErrorIllegalInstruction) in vLLM V1 Engine #125

@dennis-lynch

Description

Hardware & Software Environment

System: NVIDIA DGX Spark

GPU Architecture: Blackwell sm_121 (72 GB unified LPDDR5X memory)

OS/Driver: NVIDIA DGX OS (latest 2026 stable) 

Container: vllm/vllm-openai:v0.17.1-cu130

Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

Architecture: Hybrid Mamba-2 + MoE + Attention

Problem Description

The official NVFP4 checkpoint for Nemotron-3-Nano triggers a fatal cudaErrorIllegalInstruction on Blackwell hardware when served via vLLM's new V1 engine.

The error appears to be tied to CUDA Graph capture for batch sizes > 1. The model initializes and serves single requests correctly, but it consistently crashes when the engine attempts to synchronize/record graphs for concurrent batches. Given that this model uses a hybrid architecture (Mamba-2 + Attention + MoE) and the new NVFP4 quantization format, this suggests a kernel-level incompatibility or an invalid instruction being generated for the sm_121 architecture during graph replay.
Steps to Reproduce

  1. Launch the vLLM Server (Targeting DGX Spark):
docker pull vllm/vllm-openai:v0.17.1-cu130

docker run --gpus all \
  --ipc=host \
  --entrypoint vllm \
  -v ~/models:/workspace/models \
  -p 8000:8000 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm \
  vllm/vllm-openai:v0.17.1-cu130 \
  serve /workspace/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --served-model-name nemotron-3-nano \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-model-len 1048576 \
  --enable-chunked-prefill \
  --attention-backend TRITON_ATTN \
  --reasoning-parser-plugin /workspace/models/nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3
  2. Execute Benchmark (Host side):
    Run llama-benchy with concurrency 8 to force batching and graph capture:
uvx --with torch llama-benchy \
  --base-url http://127.0.0.1:8000/v1 \
  --model ~/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --served-model-name nemotron-3-nano \
  --concurrency 8 \
  --pp 2048 \
  --tg 32
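
If llama-benchy is unavailable, the crash can also be reproduced by firing several completions at once against the OpenAI-compatible endpoint, which forces the engine to batch (a sketch; assumes the server from step 1 is listening on 127.0.0.1:8000):

```shell
# Fire 8 concurrent completion requests to force batching / graph replay.
for i in $(seq 1 8); do
  curl -s http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "nemotron-3-nano", "prompt": "Write one sentence about GPUs.", "max_tokens": 32}' &
done
wait
```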

Actual Output / Error Logs

The server crashes during the benchmark with:


(EngineCore_DP0 pid=99) ERROR [core.py:1102] Traceback:
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 251, in get_output
self.async_copy_ready_event.synchronize()
torch.AcceleratorError: CUDA error: an illegal instruction was encountered

Note: Full logs indicate cudagraph_capture_sizes: [1, 2, 4, 8, 16] was active.
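
One way to narrow this down is to restrict graph capture to batch size 1 via vLLM's compilation config (a sketch; the `--compilation-config` flag and the `cudagraph_capture_sizes` field exist in recent vLLM releases, but exact availability may vary by version):

```shell
# Same serve invocation as above, but capture CUDA graphs only for batch size 1.
# If this is stable, the illegal instruction is specific to batch>1 graph replay.
vllm serve /workspace/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --served-model-name nemotron-3-nano \
  --compilation-config '{"cudagraph_capture_sizes": [1]}'
```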

Workaround

The model is only stable on the DGX Spark if the V1 engine is disabled and eager mode is enforced:

VLLM_USE_V1=0

--enforce-eager

Disabling asynchronous scheduling (--no-async-scheduling) also appears to mitigate some stability issues with the V1 engine on this hardware.
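
For reference, a stable launch combining these mitigations looks like the following (same container and flags as the repro command above, minus the graph-dependent options):

```shell
docker run --gpus all \
  --ipc=host \
  --entrypoint vllm \
  -v ~/models:/workspace/models \
  -p 8000:8000 \
  -e VLLM_USE_V1=0 \
  vllm/vllm-openai:v0.17.1-cu130 \
  serve /workspace/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --served-model-name nemotron-3-nano \
  --trust-remote-code \
  --enforce-eager \
  --no-async-scheduling
```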
