[Blackwell/DGX Spark] Nemotron-3-Nano NVFP4: Illegal Instruction (cudaErrorIllegalInstruction) in vLLM V1 Engine #125
Description
Hardware & Software Environment
System: NVIDIA DGX Spark
GPU Architecture: Blackwell sm_121 (72GB LPDDR5X unified memory)
OS/Driver: NVIDIA DGX OS (latest 2026 stable)
Container: vllm/vllm-openai:v0.17.1-cu130
Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
Architecture: Hybrid Mamba-2 + MoE + Attention
Problem Description
The official NVFP4 checkpoint for Nemotron-3-Nano triggers a fatal cudaErrorIllegalInstruction on Blackwell hardware when served via vLLM's new V1 engine.
The error appears to be tied to CUDA Graph capture for batch sizes > 1. While the model initializes and serves single requests, it consistently crashes when the engine attempts to synchronize/record graphs for concurrent batches. Given that this model utilizes a unique hybrid architecture (Mamba-2 + Attention) and the new NVFP4 quantization format, this suggests a kernel-level incompatibility or an invalid instruction being generated for the sm_121 architecture during graph replay.
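The crash boundary can be probed from the host with a small script that sends one request, then several concurrent ones, against the OpenAI-compatible endpoint. This is an illustrative sketch (stdlib only, not vLLM code); the URL and served model name match the `vllm serve` command below, and `build_request`/`probe` are hypothetical helper names:

```python
# Minimal host-side probe: a single request succeeds, but concurrent
# requests push the engine into CUDA graph paths for batch sizes > 1,
# which is where the illegal instruction is reported.
# Endpoint/model match the serve command in this report; everything
# else is an illustrative sketch, not vLLM code.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://127.0.0.1:8000/v1/completions"

def build_request(prompt: str, model: str = "nemotron-3-nano") -> dict:
    # OpenAI-compatible /v1/completions payload served by vLLM.
    return {"model": model, "prompt": prompt, "max_tokens": 32}

def post_one(payload: dict) -> str:
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return f"HTTP {resp.status}"
    except Exception as exc:
        # A dead engine shows up as connection refused/reset here.
        return type(exc).__name__

def probe(concurrency: int) -> list[str]:
    payloads = [build_request(f"request {i}") for i in range(concurrency)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(post_one, payloads))

# Usage (against a running server):
#   probe(1)  -> completes on this setup
#   probe(8)  -> engine crashes; subsequent calls see connection errors
```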
Steps to Reproduce
- Launch the vLLM Server (Targeting DGX Spark):
docker pull vllm/vllm-openai:v0.17.1-cu130
docker run --gpus all \
--ipc=host \
--entrypoint vllm \
-v ~/models:/workspace/models \
-p 8000:8000 \
-e VLLM_NVFP4_GEMM_BACKEND=marlin \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
-e VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm \
vllm/vllm-openai:v0.17.1-cu130 \
serve /workspace/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--served-model-name nemotron-3-nano \
--host 0.0.0.0 \
--port 8000 \
--dtype auto \
--kv-cache-dtype fp8 \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--max-model-len 1048576 \
--enable-chunked-prefill \
--attention-backend TRITON_ATTN \
--reasoning-parser-plugin /workspace/models/nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3
- Execute Benchmark (Host side):
Run llama-benchy with concurrency 8 to force batching and graph capture:
uvx --with torch llama-benchy \
--base-url http://127.0.0.1:8000/v1 \
--model ~/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--served-model-name nemotron-3-nano \
--concurrency 8 \
--pp 2048 \
--tg 32
Actual Output / Error Logs
The server crashes during the benchmark with:
(EngineCore_DP0 pid=99) ERROR [core.py:1102] Traceback:
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 251, in get_output
self.async_copy_ready_event.synchronize()
torch.AcceleratorError: CUDA error: an illegal instruction was encountered
Note: Full logs indicate cudagraph_capture_sizes: [1, 2, 4, 8, 16] was active.
Workaround
The model is stable on the DGX Spark only if the V1 engine is disabled and eager mode is enforced:
VLLM_USE_V1=0
--enforce-eager
Disabling asynchronous scheduling (--no-async-scheduling) also appears to mitigate some stability issues with the V1 engine on this hardware.