Open
Labels: CUDA, Nvidia GPU, bug-unconfirmed
Description
Git commit
$ git log -1 --oneline
5ee4e43 (HEAD -> master, tag: b7524, origin/master, origin/HEAD) server: return_progress to also report 0% processing state (#18305)
$ git rev-parse HEAD
5ee4e43
Operating systems
Linux
GGML backends
CUDA
Problem description & steps to reproduce
Environment
- GPU: NVIDIA GB10 (Blackwell)
- Machine: NVIDIA DGX Spark
- CUDA Architecture: sm_121
- CUDA Toolkit: 13.0
- llama.cpp: master branch (2025-12-24)
- OS: Ubuntu Linux 24.04 (aarch64)
- Quantization: MXFP4
Server Options
--embedding --pooling last
--parallel 30
-b 32768 # batch size
-ub 16384 # micro batch size (critical!)
-c 131072 # context size
--cont-batching
--defrag-thold 0.1
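For reference, the options above can be assembled into a single launch command. This is a minimal sketch: the binary location and model path are hypothetical placeholders (the report does not state them), and `-m` is assumed as the way the model was passed. By default it only prints the command; set `RUN=1` to start the server.

```shell
# Hypothetical paths: adjust to your build tree and MXFP4 model file.
SERVER_BIN="${SERVER_BIN:-./build/bin/llama-server}"
MODEL="${MODEL:-/path/to/model.gguf}"

# Flags copied from the "Server Options" list in this report.
SERVER_FLAGS="--embedding --pooling last --parallel 30 -b 32768 -ub 16384 -c 131072 --cont-batching --defrag-thold 0.1"

# Print the full command so it can be reviewed before running.
echo "$SERVER_BIN -m $MODEL $SERVER_FLAGS"

if [ "${RUN:-0}" = "1" ]; then
    # Word splitting of $SERVER_FLAGS is intentional here.
    "$SERVER_BIN" -m "$MODEL" $SERVER_FLAGS
fi
```

The `-ub 16384` value matters because the crash is reported at the ubatch boundary, so a smaller micro batch may not reproduce it.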
Symptoms
- CUDA error: an illegal memory access was encountered
- Crash in MUL_MAT (MMQ kernel) at ggml_cuda_compute_forward
- Crash occurs at ubatch boundary (batch.n_tokens = 16384)
Diagnosis
| Build Config | Result |
|--------------------------------------|------------------|
| sm_121 + O3 | Immediate crash |
| sm_121 + O2 | Occasional crash |
| sm_121 + O2 + CUDA_LAUNCH_BLOCKING=1 | Still crashes |
| sm_89 + O2 | Stable ✓ |
Key observations:
- CUDA_LAUNCH_BLOCKING=1 still crashes → a race between asynchronously launched kernels is unlikely
- Adding assert() or printf() in the MMQ kernel → crash disappears, so the fault depends on how the compiler optimizes the kernel
- Building for sm_89 (Ada, run on Blackwell via PTX fallback) → stable on Blackwell hardware
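The build configurations in the table can be reproduced side by side with separate build directories. The sketch below only prints the configure commands; the `build-sm…` directory names and the `configure_line` helper are made up for illustration, and passing the optimization level through `CMAKE_CUDA_FLAGS` follows this report's own build commands rather than a verified way to override the Release default.

```shell
# configure_line ARCH OPT: print one cmake configure command for a
# given CUDA architecture and optimization flag (nothing is executed).
configure_line() {
    echo "cmake -S . -B build-sm$1$2 -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=$1 -DCMAKE_CUDA_FLAGS=$2"
}

configure_line 121 -O3   # reported: immediate crash
configure_line 121 -O2   # reported: occasional crash
configure_line 89 -O2    # reported: stable

# The CUDA_LAUNCH_BLOCKING row is a run-time setting, not a build option:
# export CUDA_LAUNCH_BLOCKING=1 before starting the sm_121 + O2 binary.
```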
Root Cause
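nvcc appears to generate incorrect code for the Blackwell architecture (sm_121), particularly in the MMQ kernels with MXFP4 quantization. Because the crash depends on the optimization level and the target architecture rather than on launch ordering, this looks like a compiler optimization bug in CUDA Toolkit 13.0.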
Workaround
Build with Ada architecture (PTX fallback for Blackwell):
-DCMAKE_CUDA_ARCHITECTURES=89
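A build script could apply this workaround automatically. The following is a rough sketch, not part of llama.cpp: `pick_cuda_arch` is a hypothetical helper that maps a compute capability such as `12.1` to the value for `CMAKE_CUDA_ARCHITECTURES`, substituting `89` for the 12.x parts affected here. It assumes the installed `nvidia-smi` supports the `compute_cap` query field, which depends on the driver version.

```shell
# pick_cuda_arch CAP: map "major.minor" to a CMAKE_CUDA_ARCHITECTURES value.
# Compute capability 12.x (sm_120 / sm_121) falls back to 89 per this report.
pick_cuda_arch() {
    pca_major="${1%%.*}"
    pca_minor="${1#*.}"
    if [ "$pca_major" = "12" ]; then
        echo 89
    else
        echo "${pca_major}${pca_minor}"
    fi
}

# Query the first GPU if nvidia-smi is available; otherwise do nothing.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
    if [ -n "$cap" ]; then
        echo "compute capability $cap: -DCMAKE_CUDA_ARCHITECTURES=$(pick_cuda_arch "$cap")"
    fi
fi
```

The fallback should be removed once a toolkit release builds sm_121 correctly, since code compiled for sm_89 cannot use Blackwell-specific instructions.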
Related Issues
- #18310
- #18313
### First Bad Commit
_No response_
### Compile command
```shell
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
# CUDA (CMAKE_CUDA_ARCHITECTURES)
# -----------------------------------------------
# 50 - Maxwell (GTX 900 series)
# 60 - Pascal (GTX 1000 series)
# 70 - Volta (V100)
# 75 - Turing (RTX 2000 series)
# 80 - Ampere (A100)
# 86 - Ampere (RTX 3000 series)
# 89 - Ada Lovelace (RTX 4000 series) - CUDA 11.8+
# 90a - Hopper (H100) - CUDA 11.8+
# 120 - Blackwell (RTX 5000 series) - CUDA 12.8+
# 121 - Blackwell (GB10, DGX Spark) - CUDA 12.9+
# native - Automatic
# Crash-Prone Build (Avoid); applies to architecture 121 or 120
cmake -DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_CUDA_ARCHITECTURES=121 \
-DGGML_VULKAN=OFF \
-DGGML_DML=OFF \
-DGGML_CUDA=ON \
-DGGML_HIP=OFF \
-DGGML_METAL=OFF \
-DGGML_BLAS=ON \
-DGGML_CCACHE=OFF \
-DGGML_F16C=ON \
-DGGML_FMA=ON \
-DCMAKE_C_FLAGS="-O2" \
-DCMAKE_EXE_LINKER_FLAGS="-lpthread -lm" \
-DLLAMA_CURL=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=OFF \
-DLLAMA_BUILD_SERVER=ON \
-DCMAKE_VERBOSE_MAKEFILE=ON \
..
make -j12
# Safe Build (Recommended)
# CUDA flags are set to -O2 because the default -O3 crashes on Blackwell
cmake -DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_CUDA_ARCHITECTURES=89 \
-DGGML_VULKAN=OFF \
-DGGML_DML=OFF \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_FLAGS="-O2" \
-DGGML_HIP=OFF \
-DGGML_METAL=OFF \
-DGGML_BLAS=ON \
-DGGML_CCACHE=OFF \
-DGGML_F16C=ON \
-DGGML_FMA=ON \
-DCMAKE_C_FLAGS="-O2" \
-DCMAKE_EXE_LINKER_FLAGS="-lpthread -lm" \
-DLLAMA_CURL=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=OFF \
-DLLAMA_BUILD_SERVER=ON \
-DCMAKE_VERBOSE_MAKEFILE=ON \
..
make -j12
```
### Relevant log output
_No log output provided; the field only repeats the related issues listed above (#18310, #18313)._