1 change: 1 addition & 0 deletions .gitignore
@@ -140,6 +140,7 @@ poetry.toml

# Scripts
!/scripts/install-oneapi.bat
!/scripts/build-in-container.sh

# Test models for lora adapters
/lora-tests
11 changes: 11 additions & 0 deletions 03-deterministic-attention/report.md
@@ -0,0 +1,11 @@
Perfect. I’ll now put together a comprehensive roadmap for achieving full deterministic inference in llama.cpp using the CUDA backend.

This final report will:

* Integrate deterministic RMSNorm, MatMul, and Attention (Flash Attention or standard) with fixed tile sizes and stable reduction order.
* Ensure deterministic, position-invariant KV-cache layout regardless of prompt length.
* Maintain output determinism across restarts, including reproducibility for the same prompt.
* Use compile-time flag `LLAMA_DETERMINISTIC` to enable this mode.
* Include exact file paths, key line numbers, modular implementation phases, and test-driven validation strategies.

I’ll get back to you shortly with the full implementation blueprint.
2 changes: 2 additions & 0 deletions README.md
@@ -17,6 +17,8 @@ LLM inference in C/C++

## Hot topics

- Deterministic numerics: see `docs/DETERMINISM.md` for how to enable deterministic mode and the current guarantees (RMSNorm batch‑invariance and bitwise stability).

- **[guide : running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396)**
- **[[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)**
- Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095)
11 changes: 11 additions & 0 deletions common/arg.cpp
@@ -1442,6 +1442,17 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
            exit(0);
        }
    ));
    add_opt(common_arg(
        {"--deterministic"},
        "enable deterministic numerics where supported (sets GGML_DETERMINISTIC=1)",
        [](common_params &) {
#if defined(_WIN32)
            SetEnvironmentVariableA("GGML_DETERMINISTIC", "1");
#else
            setenv("GGML_DETERMINISTIC", "1", 1);
#endif
        }
    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER}));
    add_opt(common_arg(
        {"--completion-bash"},
        "print source-able bash completion script for llama.cpp",
177 changes: 177 additions & 0 deletions docs/DETERMINISM.md
@@ -0,0 +1,177 @@
Deterministic Numerics (RMSNorm, MatMul, Attention)
========================================

This document describes the deterministic mode added to ggml/llama.cpp and the guarantees currently made for RMSNorm, MatMul (CUDA), and attention (CUDA).

Overview
--------

- Run‑to‑run determinism means: same inputs, same software stack → bitwise‑identical outputs.
- Batch invariance means: the result for a given row does not change when other rows are present in the batch (i.e., reduction order per row is fixed and independent of batch size).
- Current scope: RMSNorm (all backends), MatMul (CUDA), and Attention forward (CUDA) under `GGML_DETERMINISTIC`.

What We Guarantee (Current Scope)
---------------------------------

- RMSNorm forward and its common fused variants (RMSNorm+MUL[+ADD]) are batch‑invariant and bitwise deterministic on supported backends (CPU, CUDA, Vulkan, Metal, SYCL/OpenCL) for a fixed model shape.
- Within a given backend on a given machine and build, re‑running the same RMSNorm invocation yields identical bits.

What We Do Not Guarantee (Yet)
------------------------------

- Cross‑device or cross‑driver bitwise parity. Different GPU models/driver versions or CPU instruction sets may produce different bit patterns. For parity across hosts, pin container image, drivers, compiler versions, and disable/align fast‑math or codegen heuristics as needed.
- Determinism for attention on non‑CUDA backends (Metal/Vulkan/OpenCL/HIP) and for quantized K/V in all cases (planned in 03B/03C).

How To Enable Deterministic Mode
--------------------------------

You can enable determinism at runtime or build time.

- Runtime (recommended):
  - CLI: add `--deterministic` to `llama-cli` or `llama-server`. This sets `GGML_DETERMINISTIC=1` in the process.
  - Environment variable: set `GGML_DETERMINISTIC=1` before running any tool that uses ggml.

- Build time (forces it across the library):
  - Pass `-DGGML_DETERMINISTIC=ON` to CMake.

Examples
--------

- Default CPU build with runtime determinism:

```
scripts/build-in-container.sh
build-container/bin/llama-cli --deterministic -m <model.gguf> -p "Hello" -n 32
```

- Enable at build time:

```
CMAKE_ARGS='-DGGML_DETERMINISTIC=ON' scripts/build-in-container.sh
```

- With CUDA (example arch=86):

```
CMAKE_ARGS='-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86' scripts/build-in-container.sh
GGML_DETERMINISTIC=1 build-container/bin/test-rmsnorm-determinism
```

What Changes Under The Hood
---------------------------

- A new helper, `ggml_is_deterministic()`, returns true if either the library was built with `GGML_DETERMINISTIC` or the `GGML_DETERMINISTIC` environment variable is set to a truthy value (a minimal sketch follows this list).
- RMSNorm: the existing implementations are already batch‑invariant; per‑row reductions stay within a single block/workgroup or a serial loop, avoiding atomics or split‑reductions that would change the reduction order with batch size.
- The CLI adds `--deterministic`, which sets the environment flag.
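
As a rough illustration, the helper's behaviour can be sketched as below. This is not the actual ggml source, and the exact set of values treated as truthy may differ.

```
#include <cstdlib>
#include <cstring>

// Sketch only: determinism is on when it was compiled in, or when the
// environment requests it at runtime.
bool ggml_is_deterministic(void) {
#ifdef GGML_DETERMINISTIC
    return true;                            // forced at build time
#else
    const char * v = std::getenv("GGML_DETERMINISTIC");
    if (v == nullptr) {
        return false;                       // not requested at runtime
    }
    // treat "", "0", "off" and "false" as disabled; everything else enables it
    return !(v[0] == '\0' || std::strcmp(v, "0") == 0 ||
             std::strcmp(v, "off") == 0 || std::strcmp(v, "false") == 0);
#endif
}
```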

MatMul (CUDA)
--------------

- Policy: when `ggml_is_deterministic()` is true, CUDA matmul never uses cuBLAS GEMM (including strided/batched). This avoids split‑K and algorithmic variance in accumulation order.
- Dispatcher changes (see the sketch after this list):
  - Prefer `mmf` when eligible (N ≤ 16 and alignment holds); this path is already batch‑invariant.
  - Otherwise, use a deterministic `mmvf` fallback that tiles output columns in fixed 8‑wide groups, left to right, calling a stable reduction kernel per tile.
- Quantized matmul is unchanged for now (stretch goal).
- Supported dtypes: F32, F16, BF16 for `mul_mat`; `src1` is promoted/handled as F32.
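
The routing decision can be summarised with the simplified sketch below; the enum and function names are illustrative and do not correspond to symbols in the CUDA backend.

```
#include <cstdint>

// Simplified sketch of the deterministic matmul routing described above.
enum class det_mm_path {
    MMF,        // fixed-tile kernel, eligible for small N
    MMVF_TILED, // fallback: fixed 8-wide column tiles, processed left to right
};

det_mm_path choose_det_mm_path(int64_t ncols_dst /* N */, bool mmf_alignment_ok) {
    // cuBLAS GEMM is never considered here: split-K and algorithm selection
    // could change the accumulation order between shapes and runs.
    if (ncols_dst <= 16 && mmf_alignment_ok) {
        return det_mm_path::MMF;
    }
    return det_mm_path::MMVF_TILED;
}
```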

Testing
-------

- Unit tests (both compare outputs bitwise; a minimal comparison helper is sketched after this list):
  - `tests/test-rmsnorm-determinism.cpp` (RMSNorm invariance):
    - Batch‑size invariance: compares the first row of outputs for `B=1` and `B∈{3,8,32}` bitwise.
    - Cross‑run determinism: repeats the same call and compares outputs bitwise.
    - Enumerates all available backends; prints `[OK] BACKEND_NAME` on success.
  - `tests/test-matmul-determinism.cpp` (CUDA only; the program skips if CUDA is not present):
    - Batch‑size invariance: compares the first output column for `B=1` vs `B∈{4,16,33}`.
    - Cross‑run determinism: runs the same inputs twice → identical bits.
    - Dtypes: F32, F16, BF16; shapes chosen to exercise both `mmf` and wide `mmvf` tiling.
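
Both tests treat outputs as raw bits rather than comparing within a tolerance. A minimal helper in that spirit (hypothetical, not the tests' actual code) could look like:

```
#include <cstddef>
#include <cstring>

// Hypothetical helper: two output rows/columns are considered equal only if
// every byte matches, i.e. no epsilon tolerance is involved.
static bool bits_equal_f32(const float * a, const float * b, size_t n) {
    return std::memcmp(a, b, n * sizeof(float)) == 0;
}
```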

Run the RMSNorm test in the container after building:

```
scripts/build-in-container.sh
ENGINE=${ENGINE:-podman} IMAGE=${IMAGE:-docker.io/library/fedora:41} \
$ENGINE run --rm -v "$(pwd):/src:Z" -w /src/build-container/bin "$IMAGE" \
bash -lc "./test-rmsnorm-determinism"
```

Notes & Caveats
---------------

- Determinism currently covers RMSNorm, MatMul (CUDA), and Attention forward (CUDA) when enabled. End‑to‑end inference also depends on scheduler choices and fused kernels.
- Performance: deterministic RMSNorm uses the existing per‑row reduction tree, which is already efficient. We do not change performance characteristics in this scope.
- Performance (MatMul/CUDA): avoiding cuBLAS may reduce throughput for some shapes; disable determinism to restore peak speed.
- If you add new RMSNorm variants, keep reductions per row within a single block/workgroup and avoid batch‑size‑dependent split strategies; in deterministic mode, prefer a single reduction policy per row (a minimal sketch follows).
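
For reference, the batch‑invariant shape of such a kernel can be illustrated with a plain CPU loop; this is a sketch of the reduction strategy only, not the ggml implementation.

```
#include <cmath>
#include <cstddef>

// Each row is reduced by one serial left-to-right loop, so the accumulation
// order is fixed and independent of how many rows the batch contains.
void rms_norm_rows(const float * x, float * y,
                   size_t n_rows, size_t n_cols, float eps) {
    for (size_t r = 0; r < n_rows; ++r) {
        const float * xr = x + r * n_cols;
        float       * yr = y + r * n_cols;

        float sum = 0.0f;
        for (size_t c = 0; c < n_cols; ++c) {
            sum += xr[c] * xr[c];
        }

        const float scale = 1.0f / std::sqrt(sum / (float) n_cols + eps);
        for (size_t c = 0; c < n_cols; ++c) {
            yr[c] = xr[c] * scale;
        }
    }
}
```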

Attention (CUDA)
----------------

- Policy in deterministic mode (see the launch‑parameter sketch after this list):
  - Dispatch avoids algorithm switching and uses kernels with one query column per block (vector paths) when available; otherwise a tile variant.
  - `launch_fattn` enforces `parallel_blocks = 1` and disables `stream_k`, so no cross‑block combination occurs; this pins the reduction order and makes the result batch‑invariant.
  - Masks, ALiBi, sinks, and GQA are supported.
- K/V dtypes:
  - F16 K/V: the preferred path is vec‑f16 (or vec‑f32 if precision is forced to F32); the tile fallback remains deterministic but slower.
  - Quantized K/V: supported via vec kernels for selected shapes. Minimal guaranteed coverage: D=128 with the pairs q4_0/q4_0 and q8_0/q8_0. Unsupported quantized shapes error out in deterministic mode (there is no tile fallback for quantized K/V).
  - Note: F16 K/V may automatically fall back to the deterministic tile path; quantized K/V has no tile fallback.
- Special head sizes:
  - D ∈ {80, 96, 112} are supported in deterministic mode via a single‑column F16 tile path (correctness‑first; slower than the vec paths used for 64/128/256). Mask and ALiBi are supported; logit_softcap is not supported for these head sizes.
  - MMA is available as an opt‑in prototype for these sizes via `GGML_DETERMINISTIC_ATTENTION_ALLOW_MMA=1`.
  - D=576 remains experimental and is also gated behind `GGML_DETERMINISTIC_ATTENTION_ALLOW_MMA=1`.
- Supported shapes (03A):
  - Head sizes D ∈ {64, 128, 256}; KV length must be a multiple of 256.
  - Typical LLaMA head counts and GQA ratios (e.g., 8 heads; GQA {1,2,4}).
  - The mask must be padded to `GGML_KQ_MASK_PAD` (64) and be at least `N` (queries) in length.
- 03B additions:
  - Quantized K/V: D=128 with q4_0/q4_0 and q8_0/q8_0, KV ∈ {256, 1024}, B ∈ {1,2,8,33}. Additional pairs may be available when built with `GGML_CUDA_FA_ALL_QUANTS`.
  - Additional head sizes: D ∈ {80, 96, 112} via tile; D=576 experimental (ALLOW_MMA).
- Caveats:
  - Throughput is lower than the default path (no multi‑block combine and no stream‑k).
  - Some shapes may fall back to the deterministic tile path, with additional slowdown.
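
The launch‑parameter pinning mentioned above can be summarised with a small sketch; the struct and defaults here are illustrative and are not the actual `launch_fattn` interface.

```
// Illustrative only: in deterministic mode the two knobs that would split the
// KV reduction across blocks are pinned, so no cross-block combine step runs
// and the reduction order stays fixed.
struct fattn_launch_cfg {
    int  parallel_blocks;
    bool use_stream_k;
};

fattn_launch_cfg make_fattn_launch_cfg(bool deterministic, int heuristic_blocks) {
    if (deterministic) {
        return { /*parallel_blocks=*/ 1, /*use_stream_k=*/ false };
    }
    return { heuristic_blocks, true }; // default: performance heuristics decide
}
```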

KV‑Cache Invariance (03C)
-------------------------

- Goal: logits for the same absolute position P are bitwise‑identical whether computed via single‑shot prefill to P or via incremental decode (including chunked prefill/streaming), when `GGML_DETERMINISTIC=1`.
- Host‑side policy (enforced when determinism is ON; a minimal shape check is sketched after this list):
  - KV padding: use a fixed padding of 256 tokens so that the effective KV length is always a multiple of the FlashAttention stride (`FATTN_KQ_STRIDE`, currently 256). This pins the reduction tree and avoids tail‑block boundary effects between flows. A one‑time INFO log announces the setting.
  - Mask padding: shape mask tensors as `[KV, PAD(N, GGML_KQ_MASK_PAD), 1, 1]` with `GGML_KQ_MASK_PAD=64` to keep the mask layout identical across flows.
  - Validation: if FlashAttention is selected and either condition is not met (KV not a multiple of 256, or mask N not padded to 64), the graph aborts with guidance rather than proceeding with a near‑miss configuration.
- Tests: `tests/test-kvcache-invariance.cpp` compares single‑shot vs incremental outputs across a grid (e.g., D∈{64,128,256}, KV∈{256,1024}, GQA∈{1,2}).
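
A minimal version of that validation, with constants mirroring `FATTN_KQ_STRIDE` (256) and `GGML_KQ_MASK_PAD` (64) and a hypothetical helper name, might look like:

```
#include <cstdint>

constexpr int64_t DET_KV_STRIDE   = 256; // FATTN_KQ_STRIDE at the time of writing
constexpr int64_t DET_KQ_MASK_PAD = 64;  // GGML_KQ_MASK_PAD

// round x up to a multiple of n (same idea as ggml's PAD macro)
constexpr int64_t pad_to(int64_t x, int64_t n) {
    return ((x + n - 1) / n) * n;
}

// Hypothetical check for the two conditions described above.
bool det_fattn_shapes_ok(int64_t n_kv, int64_t n_queries, int64_t mask_rows) {
    const bool kv_ok   = (n_kv % DET_KV_STRIDE) == 0;
    const bool mask_ok = mask_rows >= pad_to(n_queries, DET_KQ_MASK_PAD);
    return kv_ok && mask_ok;               // otherwise the graph aborts with guidance
}
```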

Quick test run (CUDA)
---------------------

Build with CUDA (choose the correct architecture ID, e.g., 86 = Ampere, 89 = Ada):

```
ENGINE=docker IMAGE=nvidia/cuda:12.4.1-devel-ubuntu22.04 \
CMAKE_ARGS='-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86' \
scripts/build-in-container.sh
```

Run the attention determinism test on a specific GPU (index 2 in this example):

```
ENGINE=docker IMAGE=nvidia/cuda:12.4.1-devel-ubuntu22.04 \
$ENGINE run --rm --gpus all -e CUDA_VISIBLE_DEVICES=2 \
-v "$(pwd):/src" -w /src/build-container/bin "$IMAGE" \
bash -lc './test-attention-determinism'
```

Debug controls (optional)
-------------------------

- `GGML_DETERMINISTIC_ATTENTION_FORCE_VEC=1` forces the deterministic dispatcher to take a vec path when possible.
- `GGML_DETERMINISTIC_ATTENTION_FORCE_TILE=1` forces the deterministic dispatcher to take the tile path (F16 K/V only) and logs an info message once.
- `GGML_DETERMINISTIC_ATTENTION_ALLOW_MMA=1` explicitly allows MMA path for special head sizes when available (prototype; opt‑in).
- `GGML_DET_ATTENTION_DISABLE_TILE_80_96_112=1` (optional) disables the deterministic tile path for D∈{80,96,112}. If it is set and MMA is not explicitly allowed/available, attention aborts with guidance. Useful for performance trials to prevent slow fallbacks.


Roadmap
-------

- Broaden deterministic attention coverage (quantized K/V; additional head sizes) and extend to other backends (HIP/Metal/Vulkan/OpenCL).
27 changes: 27 additions & 0 deletions docs/build.md
@@ -49,6 +49,33 @@ cmake --build build --config Release
cmake --build build --config Release
```

### Containerized Build (Fedora toolchain)

If your host toolchain is unusual (e.g., mixed Homebrew GCC on Fedora Silverblue) and you prefer a clean, reproducible build environment, use the helper script:

```
scripts/build-in-container.sh
```

This runs a CPU build inside a Fedora container, installing `gcc-c++`, `cmake`, `make`, and `libcurl-devel`, and outputs binaries under `build-container/bin/`.

Customize via environment variables:

- `ENGINE` (default: auto; prefers `podman`, falls back to `docker`)
- `IMAGE` (default: `docker.io/library/fedora:41`)
- `BUILD_TYPE` (default: `Release`)
- `BUILD_DIR` (default: `build-container`)
- `JOBS` (default: `nproc`)
- `CMAKE_ARGS` (extra CMake flags, e.g. `-DGGML_CUDA=ON`)

Examples:

```
BUILD_TYPE=Debug scripts/build-in-container.sh
CMAKE_ARGS='-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86' scripts/build-in-container.sh
ENGINE=docker scripts/build-in-container.sh
```

- Building for Windows (x86, x64 and arm64) with MSVC or clang as compilers:
- Install Visual Studio 2022, e.g. via the [Community Edition](https://visualstudio.microsoft.com/vs/community/). In the installer, select at least the following options (this also automatically installs the required additional tools like CMake,...):
- Tab Workload: Desktop-development with C++
9 changes: 9 additions & 0 deletions ggml/CMakeLists.txt
@@ -215,6 +215,9 @@ option(GGML_OPENCL_USE_ADRENO_KERNELS "ggml: use optimized kernels for Adr
set (GGML_OPENCL_TARGET_VERSION "300" CACHE STRING
"gmml: OpenCL API version to target")

# Deterministic numerics controls
option(GGML_DETERMINISTIC "ggml: enable deterministic numerics where supported" OFF)

# toolchain for vulkan-shaders-gen
set (GGML_VULKAN_SHADERS_GEN_TOOLCHAIN "" CACHE FILEPATH "ggml: toolchain file for vulkan-shaders-gen")

@@ -371,6 +374,12 @@ target_compile_definitions(ggml-base PRIVATE
GGML_VERSION="${GGML_INSTALL_VERSION}"
GGML_COMMIT="${GGML_BUILD_COMMIT}"
)

# Propagate GGML_DETERMINISTIC to compilation units and dependents
if (GGML_DETERMINISTIC)
    target_compile_definitions(ggml-base PRIVATE GGML_DETERMINISTIC)
    target_compile_definitions(ggml-base PUBLIC GGML_DETERMINISTIC)
endif()
message(STATUS "ggml version: ${GGML_INSTALL_VERSION}")
message(STATUS "ggml commit: ${GGML_BUILD_COMMIT}")

4 changes: 4 additions & 0 deletions ggml/include/ggml.h
@@ -683,6 +683,10 @@ extern "C" {
    GGML_API int64_t ggml_cycles(void);
    GGML_API int64_t ggml_cycles_per_ms(void);

    // Deterministic numerics – returns true if either built with GGML_DETERMINISTIC
    // or the environment variable GGML_DETERMINISTIC is set to a truthy value.
    GGML_API bool ggml_is_deterministic(void);

    // accepts a UTF-8 path, even on Windows
    GGML_API FILE * ggml_fopen(const char * fname, const char * mode);

3 changes: 3 additions & 0 deletions ggml/src/ggml-cuda/CMakeLists.txt
@@ -47,6 +47,9 @@ if (CUDAToolkit_FOUND)
    file(GLOB SRCS "template-instances/mmf*.cu")
    list(APPEND GGML_SOURCES_CUDA ${SRCS})

    # det note: in det mode we only rely on a minimal, always‑built set of
    # vector attention instances. FA_ALL_QUANTS expands the template matrix for
    # experiments; tests and dispatcher probes gate usage accordingly.
    if (GGML_CUDA_FA_ALL_QUANTS)
        file(GLOB SRCS "template-instances/fattn-vec*.cu")
        list(APPEND GGML_SOURCES_CUDA ${SRCS})