Compile bug: CUDA MUL_MAT crash on Blackwell (sm_121) - nvcc O3 optimization bug

### Git commit

$ git log -1 --oneline
5ee4e43f2 (HEAD -> master, tag: b7524, origin/master, origin/HEAD) server: return_progress to also report 0% processing state (#18305)

$ git rev-parse HEAD
5ee4e43f2629fc88c3c3ff428a47ffb842fa8d84

### Operating systems

Linux

### GGML backends

CUDA

### Problem description & steps to reproduce

  ### Environment
  - **GPU**: NVIDIA GB10 (Blackwell)
  - **Machine**: NVIDIA DGX Spark
  - **CUDA Architecture**: sm_121
  - **CUDA Toolkit**: 13.0
  - **llama.cpp**: master branch (2025-12-24)
  - **OS**: Ubuntu Linux 24.04 (aarch64)
  - **Quantization**: MXFP4

  ### Server Options
  ```bash
  --embedding --pooling last
  --parallel 30
  -b 32768        # batch size
  -ub 16384       # micro batch size (critical!)
  -c 131072       # context size
  --cont-batching
  --defrag-thold 0.1
  ```

  ### Symptoms

  - CUDA error: an illegal memory access was encountered
  - Crash in MUL_MAT (MMQ kernel) at ggml_cuda_compute_forward
  - Crash occurs at ubatch boundary (batch.n_tokens = 16384)

  ### Diagnosis

  | Build Config                         | Result           |
  |--------------------------------------|------------------|
  | sm_121 + O3                          | Immediate crash  |
  | sm_121 + O2                          | Occasional crash |
  | sm_121 + O2 + CUDA_LAUNCH_BLOCKING=1 | Still crashes    |
  | sm_89 + O2                           | Stable ✓         |

  Key observations:
  - With `CUDA_LAUNCH_BLOCKING=1` the crash persists → not a race between asynchronous kernel launches
  - Adding `assert()` or `printf()` to the MMQ kernel makes the crash disappear → the failure depends on how the compiler optimizes the kernel
  - Building for sm_89 (Ada; Blackwell runs it through the PTX fallback) → stable on Blackwell hardware
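To make the comparison in the table easier to repeat, the build variants can be scripted. The following is only a sketch: the `build-sm…` directory names, the `DRY_RUN` switch, and the two function names are hypothetical, and passing `-O3` explicitly through `CMAKE_CUDA_FLAGS` is an assumption (this report got `-O3` from the Release default). The CMake options themselves are the ones used elsewhere in this report.

```shell
# Sketch: print (or run) one configure + build command pair per build variant.
# By default nothing is built; the commands are only echoed (DRY_RUN=1).

# Emit the CMake options for a given CUDA architecture and optimization level.
cuda_cmake_args() {
    local arch="$1" opt="$2"
    printf '%s\n' "-DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=${arch} -DCMAKE_CUDA_FLAGS=-${opt}"
}

# Configure and build one variant in its own build directory.
build_variant() {
    local arch="$1" opt="$2"
    local dir="build-sm${arch}-${opt}"
    local run=""
    [ "${DRY_RUN:-1}" = "1" ] && run="echo"
    # Word splitting of the option list is intended here.
    $run cmake -S . -B "$dir" $(cuda_cmake_args "$arch" "$opt")
    $run cmake --build "$dir" -j 12
}

# One entry per "architecture optimization" pair from the table.
for variant in "121 O3" "121 O2" "89 O2"; do
    build_variant $variant
done
```

Run it from the repository root with `DRY_RUN=0` to actually build. The `CUDA_LAUNCH_BLOCKING=1` row needs no separate build: set that environment variable when starting the server from the sm_121 + O2 build.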

  ### Root Cause

  nvcc appears to generate incorrect code for the Blackwell architecture (sm_121), particularly in the MMQ kernels used with MXFP4 quantization. The observations above point to a compiler optimization bug in CUDA Toolkit 13.0.

  ### Workaround

  Build with Ada architecture (PTX fallback for Blackwell):
  -DCMAKE_CUDA_ARCHITECTURES=89

  ### Related Issues

  - #18310
  - #18313

### First Bad Commit

_No response_

### Compile command

```shell
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

# CUDA (CMAKE_CUDA_ARCHITECTURES)
# -----------------------------------------------
# 50  - Maxwell (GTX 900 series)
# 60  - Pascal (GTX 1000 series)
# 70  - Volta (V100)
# 75  - Turing (RTX 2000 series)
# 80  - Ampere (A100)
# 86  - Ampere (RTX 3000 series)
# 89  - Ada Lovelace (RTX 4000 series) - CUDA 11.8+
# 90a - Hopper (H100) - CUDA 11.8+
# 120 - Blackwell (RTX 5000 series) - CUDA 12.8+
# 121 - Blackwell (GB10, DGX Spark) - CUDA 12.9+
# native - Automatic

# Crash-Prone Build (Avoid); architecture 121 or 120
cmake -DCMAKE_BUILD_TYPE=Release \
      -DBUILD_SHARED_LIBS=ON \
      -DCMAKE_CUDA_ARCHITECTURES=121 \
      -DGGML_VULKAN=OFF \
      -DGGML_DML=OFF \
      -DGGML_CUDA=ON \
      -DGGML_HIP=OFF \
      -DGGML_METAL=OFF \
      -DGGML_BLAS=ON \
      -DGGML_CCACHE=OFF \
      -DGGML_F16C=ON \
      -DGGML_FMA=ON \
      -DCMAKE_C_FLAGS="-O2" \
      -DCMAKE_EXE_LINKER_FLAGS="-lpthread -lm" \
      -DLLAMA_CURL=ON \
      -DLLAMA_BUILD_TESTS=OFF \
      -DLLAMA_BUILD_EXAMPLES=OFF \
      -DLLAMA_BUILD_SERVER=ON \
      -DCMAKE_VERBOSE_MAKEFILE=ON \
      ..

make -j12

# Safe Build (Recommended); -O2 is set because the default -O3 crashes on Blackwell
cmake -DCMAKE_BUILD_TYPE=Release \
      -DBUILD_SHARED_LIBS=ON \
      -DCMAKE_CUDA_ARCHITECTURES=89 \
      -DGGML_VULKAN=OFF \
      -DGGML_DML=OFF \
      -DGGML_CUDA=ON \
      -DCMAKE_CUDA_FLAGS="-O2" \
      -DGGML_HIP=OFF \
      -DGGML_METAL=OFF \
      -DGGML_BLAS=ON \
      -DGGML_CCACHE=OFF \
      -DGGML_F16C=ON \
      -DGGML_FMA=ON \
      -DCMAKE_C_FLAGS="-O2" \
      -DCMAKE_EXE_LINKER_FLAGS="-lpthread -lm" \
      -DLLAMA_CURL=ON \
      -DLLAMA_BUILD_TESTS=OFF \
      -DLLAMA_BUILD_EXAMPLES=OFF \
      -DLLAMA_BUILD_SERVER=ON \
      -DCMAKE_VERBOSE_MAKEFILE=ON \
      ..

make -j12
```
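
The workaround only helps if the resulting library really contains no native sm_121/sm_120 code and does carry PTX that the driver can JIT-compile on Blackwell. A rough way to check this is to list the embedded targets with `cuobjdump` from the CUDA Toolkit. This is a sketch under assumptions: the library path `build/bin/libggml-cuda.so` and the helper name `list_cuda_archs` are hypothetical, so adjust the path to wherever your build places the CUDA backend library.

```shell
# Sketch: list the GPU targets embedded in a built CUDA backend library.
# LIB is an assumed default path; override it via the environment if needed.
LIB="${LIB:-build/bin/libggml-cuda.so}"

# Reduce `cuobjdump --list-elf` / `--list-ptx` output to unique sm_XX names.
list_cuda_archs() {
    grep -oE 'sm_[0-9]+[a-z]?' | sort -u
}

if command -v cuobjdump >/dev/null 2>&1 && [ -f "$LIB" ]; then
    echo "SASS (native code) targets:"
    cuobjdump --list-elf "$LIB" | list_cuda_archs
    echo "PTX (JIT-compilable) targets:"
    cuobjdump --list-ptx "$LIB" | list_cuda_archs
else
    echo "cuobjdump or $LIB not found; skipping" >&2
fi
```

For the safe build above, both lists would be expected to show only `sm_89`. If `sm_121` or `sm_120` appears under the SASS targets, the crash-prone code path is still being compiled in.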

