1 change: 1 addition & 0 deletions .gitignore
@@ -140,6 +140,7 @@ poetry.toml

# Scripts
!/scripts/install-oneapi.bat
!/scripts/build-in-container.sh

# Test models for lora adapters
/lora-tests
11 changes: 11 additions & 0 deletions 03-deterministic-attention/report.md
@@ -0,0 +1,11 @@
Perfect. I’ll now put together a comprehensive roadmap for achieving full deterministic inference in llama.cpp using the CUDA backend.

This final report will:

* Integrate deterministic RMSNorm, MatMul, and Attention (Flash Attention or standard) with fixed tile sizes and stable reduction order.
* Ensure deterministic, position-invariant KV-cache layout regardless of prompt length.
* Maintain output determinism across restarts, including reproducibility for the same prompt.
* Use compile-time flag `LLAMA_DETERMINISTIC` to enable this mode.
* Include exact file paths, key line numbers, modular implementation phases, and test-driven validation strategies.

I’ll get back to you shortly with the full implementation blueprint.
2 changes: 2 additions & 0 deletions README.md
@@ -17,6 +17,8 @@ LLM inference in C/C++

## Hot topics

- Deterministic numerics: see `docs/DETERMINISM.md` for how to enable deterministic mode and the current guarantees (RMSNorm batch‑invariance and bitwise stability).

- **[guide : running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396)**
- **[[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)**
- Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095)
11 changes: 11 additions & 0 deletions common/arg.cpp
@@ -1442,6 +1442,17 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
            exit(0);
        }
    ));
    add_opt(common_arg(
        {"--deterministic"},
        "enable deterministic numerics where supported (sets GGML_DETERMINISTIC=1)",
        [](common_params &) {
#if defined(_WIN32)
            SetEnvironmentVariableA("GGML_DETERMINISTIC", "1");
#else
            setenv("GGML_DETERMINISTIC", "1", 1);
#endif
        }
    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER}));
    add_opt(common_arg(
        {"--completion-bash"},
        "print source-able bash completion script for llama.cpp",
177 changes: 177 additions & 0 deletions docs/DETERMINISM.md
@@ -0,0 +1,177 @@
Deterministic Numerics (RMSNorm, MatMul, Attention)
========================================

This document describes the deterministic mode added to ggml/llama.cpp and the guarantees currently made for RMSNorm, MatMul (CUDA), and attention (CUDA).

Overview
--------

- Run‑to‑run determinism means: same inputs, same software stack → bitwise‑identical outputs.
- Batch invariance means: the result for a given row does not change when other rows are present in the batch (i.e., reduction order per row is fixed and independent of batch size).
- Current scope: RMSNorm (all backends), MatMul (CUDA), and Attention forward (CUDA) under `GGML_DETERMINISTIC`.

What We Guarantee (Current Scope)
---------------------------------

- RMSNorm forward and its common fused variants (RMSNorm+MUL[+ADD]) are batch‑invariant and bitwise deterministic on supported backends (CPU, CUDA, Vulkan, Metal, SYCL/OpenCL) for a fixed model shape.
- Within a given backend on a given machine and build, re‑running the same RMSNorm invocation yields identical bits.

What We Do Not Guarantee (Yet)
------------------------------

- Cross‑device or cross‑driver bitwise parity. Different GPU models/driver versions or CPU instruction sets may produce different bit patterns. For parity across hosts, pin container image, drivers, compiler versions, and disable/align fast‑math or codegen heuristics as needed.
- Determinism for attention on non‑CUDA backends (Metal/Vulkan/OpenCL/HIP) and for quantized K/V in all cases (planned in 03B/03C).

How To Enable Deterministic Mode
--------------------------------

You can enable determinism at runtime or build time.

- Runtime (recommended):
  - CLI: add `--deterministic` to `llama-cli` or `llama-server`. This sets `GGML_DETERMINISTIC=1` in the process.
  - Environment variable: set `GGML_DETERMINISTIC=1` before running any tool that uses ggml.

- Build time (forces it across the library):
  - Pass `-DGGML_DETERMINISTIC=ON` to CMake.

Examples
--------

- Default CPU build with runtime determinism:

```
scripts/build-in-container.sh
build-container/bin/llama-cli --deterministic -m <model.gguf> -p "Hello" -n 32
```

- Enable at build time:

```
CMAKE_ARGS='-DGGML_DETERMINISTIC=ON' scripts/build-in-container.sh
```

- With CUDA (example arch=86):

```
CMAKE_ARGS='-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86' scripts/build-in-container.sh
GGML_DETERMINISTIC=1 build-container/bin/test-rmsnorm-determinism
```

What Changes Under The Hood
---------------------------

- A new helper, `ggml_is_deterministic()`, returns true if either the library was built with `GGML_DETERMINISTIC` or the `GGML_DETERMINISTIC` environment variable is set to a truthy value (a minimal sketch follows this list).
- RMSNorm: the existing implementations are already batch‑invariant; per‑row reductions stay within a single block/workgroup or a serial loop, avoiding atomics or split‑reductions that would change the reduction order with batch size.
- The CLI adds `--deterministic`, which sets the environment flag.
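
As a rough illustration, the helper's behaviour can be sketched as below. This is not the actual ggml source, and the exact set of values treated as truthy may differ.

```
#include <cstdlib>
#include <cstring>

// Sketch only: determinism is on when it was compiled in, or when the
// environment requests it at runtime.
bool ggml_is_deterministic(void) {
#ifdef GGML_DETERMINISTIC
    return true;                            // forced at build time
#else
    const char * v = std::getenv("GGML_DETERMINISTIC");
    if (v == nullptr) {
        return false;                       // not requested at runtime
    }
    // treat "", "0", "off" and "false" as disabled; everything else enables it
    return !(v[0] == '\0' || std::strcmp(v, "0") == 0 ||
             std::strcmp(v, "off") == 0 || std::strcmp(v, "false") == 0);
#endif
}
```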

MatMul (CUDA)
--------------

- Policy: when `ggml_is_deterministic()` is true, CUDA matmul never uses cuBLAS GEMM (including strided/batched). This avoids split‑K and algorithmic variance in accumulation order.
- Dispatcher changes (see the sketch after this list):
  - Prefer `mmf` when eligible (N ≤ 16 and alignment holds); this path is already batch‑invariant.
  - Otherwise, use a deterministic `mmvf` fallback that tiles output columns in fixed 8‑wide groups, left to right, calling a stable reduction kernel per tile.
- Quantized matmul is unchanged for now (stretch goal).
- Supported dtypes: F32, F16, BF16 for `mul_mat`; `src1` is promoted/handled as F32.
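
The routing decision can be summarised with the simplified sketch below; the enum and function names are illustrative and do not correspond to symbols in the CUDA backend.

```
#include <cstdint>

// Simplified sketch of the deterministic matmul routing described above.
enum class det_mm_path {
    MMF,        // fixed-tile kernel, eligible for small N
    MMVF_TILED, // fallback: fixed 8-wide column tiles, processed left to right
};

det_mm_path choose_det_mm_path(int64_t ncols_dst /* N */, bool mmf_alignment_ok) {
    // cuBLAS GEMM is never considered here: split-K and algorithm selection
    // could change the accumulation order between shapes and runs.
    if (ncols_dst <= 16 && mmf_alignment_ok) {
        return det_mm_path::MMF;
    }
    return det_mm_path::MMVF_TILED;
}
```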

Testing
-------

- Unit tests (both compare outputs bitwise; a minimal comparison helper is sketched after this list):
  - `tests/test-rmsnorm-determinism.cpp` (RMSNorm invariance):
    - Batch‑size invariance: compares the first row of outputs for `B=1` and `B∈{3,8,32}` bitwise.
    - Cross‑run determinism: repeats the same call and compares outputs bitwise.
    - Enumerates all available backends; prints `[OK] BACKEND_NAME` on success.
  - `tests/test-matmul-determinism.cpp` (CUDA only; the program skips if CUDA is not present):
    - Batch‑size invariance: compares the first output column for `B=1` vs `B∈{4,16,33}`.
    - Cross‑run determinism: runs the same inputs twice → identical bits.
    - Dtypes: F32, F16, BF16; shapes chosen to exercise both `mmf` and wide `mmvf` tiling.
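
Both tests treat outputs as raw bits rather than comparing within a tolerance. A minimal helper in that spirit (hypothetical, not the tests' actual code) could look like:

```
#include <cstddef>
#include <cstring>

// Hypothetical helper: two output rows/columns are considered equal only if
// every byte matches, i.e. no epsilon tolerance is involved.
static bool bits_equal_f32(const float * a, const float * b, size_t n) {
    return std::memcmp(a, b, n * sizeof(float)) == 0;
}
```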

Run the RMSNorm test in the container after building:

```
scripts/build-in-container.sh
ENGINE=${ENGINE:-podman} IMAGE=${IMAGE:-docker.io/library/fedora:41} \
$ENGINE run --rm -v "$(pwd):/src:Z" -w /src/build-container/bin "$IMAGE" \
bash -lc "./test-rmsnorm-determinism"
```

Notes & Caveats
---------------

- Determinism currently covers RMSNorm, MatMul (CUDA), and Attention forward (CUDA) when enabled. End‑to‑end inference also depends on scheduler choices and fused kernels.
- Performance: deterministic RMSNorm uses the existing per‑row reduction tree, which is already efficient. We do not change performance characteristics in this scope.
- Performance (MatMul/CUDA): avoiding cuBLAS may reduce throughput for some shapes; disable determinism to restore peak speed.
- If you add new RMSNorm variants, keep reductions per row within a single block/workgroup and avoid batch‑size‑dependent split strategies; in deterministic mode, prefer a single reduction policy per row (a minimal sketch follows).
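
For reference, the batch‑invariant shape of such a kernel can be illustrated with a plain CPU loop; this is a sketch of the reduction strategy only, not the ggml implementation.

```
#include <cmath>
#include <cstddef>

// Each row is reduced by one serial left-to-right loop, so the accumulation
// order is fixed and independent of how many rows the batch contains.
void rms_norm_rows(const float * x, float * y,
                   size_t n_rows, size_t n_cols, float eps) {
    for (size_t r = 0; r < n_rows; ++r) {
        const float * xr = x + r * n_cols;
        float       * yr = y + r * n_cols;

        float sum = 0.0f;
        for (size_t c = 0; c < n_cols; ++c) {
            sum += xr[c] * xr[c];
        }

        const float scale = 1.0f / std::sqrt(sum / (float) n_cols + eps);
        for (size_t c = 0; c < n_cols; ++c) {
            yr[c] = xr[c] * scale;
        }
    }
}
```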

Attention (CUDA)
----------------

- Policy in deterministic mode (see the launch‑parameter sketch after this list):
  - Dispatch avoids algorithm switching and uses kernels with one query column per block (vector paths) when available; otherwise a tile variant.
  - `launch_fattn` enforces `parallel_blocks = 1` and disables `stream_k`, so no cross‑block combination occurs; this pins the reduction order and makes the result batch‑invariant.
  - Masks, ALiBi, sinks, and GQA are supported.
- K/V dtypes:
  - F16 K/V: the preferred path is vec‑f16 (or vec‑f32 if precision is forced to F32); the tile fallback remains deterministic but slower.
  - Quantized K/V: supported via vec kernels for selected shapes. Minimal guaranteed coverage: D=128 with the pairs q4_0/q4_0 and q8_0/q8_0. Unsupported quantized shapes error out in deterministic mode (there is no tile fallback for quantized K/V).
  - Note: F16 K/V may automatically fall back to the deterministic tile path; quantized K/V has no tile fallback.
- Special head sizes:
  - D ∈ {80, 96, 112} are supported in deterministic mode via a single‑column F16 tile path (correctness‑first; slower than the vec paths used for 64/128/256). Mask and ALiBi are supported; logit_softcap is not supported for these head sizes.
  - MMA is available as an opt‑in prototype for these sizes via `GGML_DETERMINISTIC_ATTENTION_ALLOW_MMA=1`.
  - D=576 remains experimental and is also gated behind `GGML_DETERMINISTIC_ATTENTION_ALLOW_MMA=1`.
- Supported shapes (03A):
  - Head sizes D ∈ {64, 128, 256}; KV length must be a multiple of 256.
  - Typical LLaMA head counts and GQA ratios (e.g., 8 heads; GQA {1,2,4}).
  - The mask must be padded to `GGML_KQ_MASK_PAD` (64) and be at least `N` (queries) in length.
- 03B additions:
  - Quantized K/V: D=128 with q4_0/q4_0 and q8_0/q8_0, KV ∈ {256, 1024}, B ∈ {1,2,8,33}. Additional pairs may be available when built with `GGML_CUDA_FA_ALL_QUANTS`.
  - Additional head sizes: D ∈ {80, 96, 112} via tile; D=576 experimental (ALLOW_MMA).
- Caveats:
  - Throughput is lower than the default path (no multi‑block combine and no stream‑k).
  - Some shapes may fall back to the deterministic tile path, with additional slowdown.
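
The launch‑parameter pinning mentioned above can be summarised with a small sketch; the struct and defaults here are illustrative and are not the actual `launch_fattn` interface.

```
// Illustrative only: in deterministic mode the two knobs that would split the
// KV reduction across blocks are pinned, so no cross-block combine step runs
// and the reduction order stays fixed.
struct fattn_launch_cfg {
    int  parallel_blocks;
    bool use_stream_k;
};

fattn_launch_cfg make_fattn_launch_cfg(bool deterministic, int heuristic_blocks) {
    if (deterministic) {
        return { /*parallel_blocks=*/ 1, /*use_stream_k=*/ false };
    }
    return { heuristic_blocks, true }; // default: performance heuristics decide
}
```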

KV‑Cache Invariance (03C)
-------------------------

- Goal: logits for the same absolute position P are bitwise‑identical whether computed via single‑shot prefill to P or via incremental decode (including chunked prefill/streaming), when `GGML_DETERMINISTIC=1`.
- Host‑side policy (enforced when determinism is ON; a minimal shape check is sketched after this list):
  - KV padding: use a fixed padding of 256 tokens so that the effective KV length is always a multiple of the FlashAttention stride (`FATTN_KQ_STRIDE`, currently 256). This pins the reduction tree and avoids tail‑block boundary effects between flows. A one‑time INFO log announces the setting.
  - Mask padding: shape mask tensors as `[KV, PAD(N, GGML_KQ_MASK_PAD), 1, 1]` with `GGML_KQ_MASK_PAD=64` to keep the mask layout identical across flows.
  - Validation: if FlashAttention is selected and either condition is not met (KV not a multiple of 256, or mask N not padded to 64), the graph aborts with guidance rather than proceeding with a near‑miss configuration.
- Tests: `tests/test-kvcache-invariance.cpp` compares single‑shot vs incremental outputs across a grid (e.g., D∈{64,128,256}, KV∈{256,1024}, GQA∈{1,2}).
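
A minimal version of that validation, with constants mirroring `FATTN_KQ_STRIDE` (256) and `GGML_KQ_MASK_PAD` (64) and a hypothetical helper name, might look like:

```
#include <cstdint>

constexpr int64_t DET_KV_STRIDE   = 256; // FATTN_KQ_STRIDE at the time of writing
constexpr int64_t DET_KQ_MASK_PAD = 64;  // GGML_KQ_MASK_PAD

// round x up to a multiple of n (same idea as ggml's PAD macro)
constexpr int64_t pad_to(int64_t x, int64_t n) {
    return ((x + n - 1) / n) * n;
}

// Hypothetical check for the two conditions described above.
bool det_fattn_shapes_ok(int64_t n_kv, int64_t n_queries, int64_t mask_rows) {
    const bool kv_ok   = (n_kv % DET_KV_STRIDE) == 0;
    const bool mask_ok = mask_rows >= pad_to(n_queries, DET_KQ_MASK_PAD);
    return kv_ok && mask_ok;               // otherwise the graph aborts with guidance
}
```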

Quick test run (CUDA)
---------------------

Build with CUDA (choose the correct architecture ID, e.g., 86 = Ampere, 89 = Ada):

```
ENGINE=docker IMAGE=nvidia/cuda:12.4.1-devel-ubuntu22.04 \
CMAKE_ARGS='-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86' \
scripts/build-in-container.sh
```

Run the attention determinism test on a specific GPU (index 2 in this example):

```
ENGINE=docker IMAGE=nvidia/cuda:12.4.1-devel-ubuntu22.04 \
$ENGINE run --rm --gpus all -e CUDA_VISIBLE_DEVICES=2 \
-v "$(pwd):/src" -w /src/build-container/bin "$IMAGE" \
bash -lc './test-attention-determinism'
```

Debug controls (optional)
-------------------------

- `GGML_DETERMINISTIC_ATTENTION_FORCE_VEC=1` forces the deterministic dispatcher to take a vec path when possible.
- `GGML_DETERMINISTIC_ATTENTION_FORCE_TILE=1` forces the deterministic dispatcher to take the tile path (F16 K/V only) and logs an info message once.
- `GGML_DETERMINISTIC_ATTENTION_ALLOW_MMA=1` explicitly allows MMA path for special head sizes when available (prototype; opt‑in).
- `GGML_DET_ATTENTION_DISABLE_TILE_80_96_112=1` (optional) disables the deterministic tile path for D∈{80,96,112}. If it is set and MMA is not explicitly allowed/available, attention aborts with guidance. Useful for performance trials to prevent slow fallbacks.


Roadmap
-------

- Broaden deterministic attention coverage (quantized K/V; additional head sizes) and extend to other backends (HIP/Metal/Vulkan/OpenCL).
27 changes: 27 additions & 0 deletions docs/build.md
@@ -49,6 +49,33 @@ cmake --build build --config Release
cmake --build build --config Release
```

### Containerized Build (Fedora toolchain)

If your host toolchain is unusual (e.g., mixed Homebrew GCC on Fedora Silverblue) and you prefer a clean, reproducible build environment, use the helper script:

```
scripts/build-in-container.sh
```

This runs a CPU build inside a Fedora container, installing `gcc-c++`, `cmake`, `make`, and `libcurl-devel`, and outputs binaries under `build-container/bin/`.

Customize via environment variables:

- `ENGINE` (default: auto; prefers `podman`, falls back to `docker`)
- `IMAGE` (default: `docker.io/library/fedora:41`)
- `BUILD_TYPE` (default: `Release`)
- `BUILD_DIR` (default: `build-container`)
- `JOBS` (default: `nproc`)
- `CMAKE_ARGS` (extra CMake flags, e.g. `-DGGML_CUDA=ON`)

Examples:

```
BUILD_TYPE=Debug scripts/build-in-container.sh
CMAKE_ARGS='-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86' scripts/build-in-container.sh
ENGINE=docker scripts/build-in-container.sh
```

- Building for Windows (x86, x64 and arm64) with MSVC or clang as compilers:
- Install Visual Studio 2022, e.g. via the [Community Edition](https://visualstudio.microsoft.com/vs/community/). In the installer, select at least the following options (this also automatically installs the required additional tools like CMake,...):
- Tab Workload: Desktop-development with C++
9 changes: 9 additions & 0 deletions ggml/CMakeLists.txt
@@ -215,6 +215,9 @@ option(GGML_OPENCL_USE_ADRENO_KERNELS "ggml: use optimized kernels for Adr
set (GGML_OPENCL_TARGET_VERSION "300" CACHE STRING
"gmml: OpenCL API version to target")

# Deterministic numerics controls
option(GGML_DETERMINISTIC "ggml: enable deterministic numerics where supported" OFF)

# toolchain for vulkan-shaders-gen
set (GGML_VULKAN_SHADERS_GEN_TOOLCHAIN "" CACHE FILEPATH "ggml: toolchain file for vulkan-shaders-gen")

@@ -371,6 +374,12 @@ target_compile_definitions(ggml-base PRIVATE
GGML_VERSION="${GGML_INSTALL_VERSION}"
GGML_COMMIT="${GGML_BUILD_COMMIT}"
)

# Propagate GGML_DETERMINISTIC to compilation units and dependents
if (GGML_DETERMINISTIC)
    target_compile_definitions(ggml-base PRIVATE GGML_DETERMINISTIC)
    target_compile_definitions(ggml-base PUBLIC GGML_DETERMINISTIC)
endif()
message(STATUS "ggml version: ${GGML_INSTALL_VERSION}")
message(STATUS "ggml commit: ${GGML_BUILD_COMMIT}")

4 changes: 4 additions & 0 deletions ggml/include/ggml.h
@@ -683,6 +683,10 @@ extern "C" {
    GGML_API int64_t ggml_cycles(void);
    GGML_API int64_t ggml_cycles_per_ms(void);

    // Deterministic numerics – returns true if either built with GGML_DETERMINISTIC
    // or the environment variable GGML_DETERMINISTIC is set to a truthy value.
    GGML_API bool ggml_is_deterministic(void);

    // accepts a UTF-8 path, even on Windows
    GGML_API FILE * ggml_fopen(const char * fname, const char * mode);

3 changes: 3 additions & 0 deletions ggml/src/ggml-cuda/CMakeLists.txt
@@ -47,6 +47,9 @@ if (CUDAToolkit_FOUND)
    file(GLOB SRCS "template-instances/mmf*.cu")
    list(APPEND GGML_SOURCES_CUDA ${SRCS})

    # det note: in det mode we only rely on a minimal, always‑built set of
    # vector attention instances. FA_ALL_QUANTS expands the template matrix for
    # experiments; tests and dispatcher probes gate usage accordingly.
    if (GGML_CUDA_FA_ALL_QUANTS)
        file(GLOB SRCS "template-instances/fattn-vec*.cu")
        list(APPEND GGML_SOURCES_CUDA ${SRCS})