4 changes: 4 additions & 0 deletions docs/android.md
@@ -1,6 +1,10 @@

# Android

- [Android](#android)
- [Build on Android using Termux](#build-on-android-using-termux)
- [Cross-compile using Android NDK](#cross-compile-using-android-ndk)

## Build on Android using Termux

[Termux](https://termux.dev/en/) is an Android terminal emulator and Linux environment app (no root required). As of writing, Termux is available experimentally in the Google Play Store; otherwise, it may be obtained directly from the project repo or on F-Droid.
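Once Termux is installed, the build itself follows the usual CMake flow. The sketch below is only an illustration of that flow and assumes the `git`, `cmake` and `clang` packages from the Termux repositories; it is not a substitute for the detailed steps in the rest of this section.

```bash
# Inside Termux: install a toolchain (package names are assumptions, check `pkg search`)
pkg update && pkg install -y git cmake clang

# Fetch the sources and do a plain CPU build, as on any other Linux-like environment
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```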
55 changes: 45 additions & 10 deletions docs/build.md
@@ -1,5 +1,39 @@
# Build llama.cpp locally

- [Build llama.cpp locally](#build-llamacpp-locally)
- [CPU Build](#cpu-build)
- [BLAS Build](#blas-build)
- [Accelerate Framework](#accelerate-framework)
- [OpenBLAS](#openblas)
- [BLIS](#blis)
- [Intel oneMKL](#intel-onemkl)
- [Other BLAS libraries](#other-blas-libraries)
- [Metal Build](#metal-build)
- [SYCL](#sycl)
- [CUDA](#cuda)
- [Download directly from NVIDIA](#download-directly-from-nvidia)
- [Compile and run inside a Fedora Toolbox Container](#compile-and-run-inside-a-fedora-toolbox-container)
- [Compilation](#compilation)
- [Override Compute Capability Specifications](#override-compute-capability-specifications)
- [Runtime CUDA environmental variables](#runtime-cuda-environmental-variables)
- [Unified Memory](#unified-memory)
- [Performance Tuning](#performance-tuning)
- [MUSA](#musa)
- [Download directly from Moore Threads](#download-directly-from-moore-threads)
- [Compilation](#compilation-1)
- [Override Compute Capability Specifications](#override-compute-capability-specifications-1)
- [Compilation options](#compilation-options)
- [Runtime MUSA environmental variables](#runtime-musa-environmental-variables)
- [Unified Memory](#unified-memory-1)
- [HIP](#hip)
- [Vulkan](#vulkan)
- [w64devkit](#w64devkit)
- [Git Bash MINGW64](#git-bash-mingw64)
- [MSYS2](#msys2)
- [CANN](#cann)
- [Android](#android)
- [Notes about GPU-accelerated backends](#notes-about-gpu-accelerated-backends)

**To get the Code:**

```bash
@@ -156,19 +190,19 @@ nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used

To override the `native` GPU detection:

-#### 1. Take note of the `Compute Capability` of your NVIDIA devices: ["CUDA: Your GPU Compute > Capability"](https://developer.nvidia.com/cuda-gpus).
+1. Take note of the `Compute Capability` of your NVIDIA devices: ["CUDA: Your GPU Compute > Capability"](https://developer.nvidia.com/cuda-gpus).

-```text
-GeForce RTX 4090    8.9
-GeForce RTX 3080 Ti 8.6
-GeForce RTX 3070    8.6
-```
+   ```text
+   GeForce RTX 4090    8.9
+   GeForce RTX 3080 Ti 8.6
+   GeForce RTX 3070    8.6
+   ```

-#### 2. Manually list each varying `Compute Capability` in the `CMAKE_CUDA_ARCHITECTURES` list.
+2. Manually list each varying `Compute Capability` in the `CMAKE_CUDA_ARCHITECTURES` list.

-```bash
-cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
-```
+   ```bash
+   cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
+   ```
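If you are unsure which compute capabilities your installed GPUs report, a reasonably recent NVIDIA driver can usually be queried directly; this is only a convenience sketch, not part of the build steps above.

```bash
# List the compute capability of each visible GPU
# (requires a driver new enough to expose the compute_cap query field)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```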

### Runtime CUDA environmental variables

@@ -216,6 +250,7 @@ By default, all supported compute capabilities are enabled. To customize this be

```bash
cmake -B build -DGGML_MUSA=ON -DMUSA_ARCHITECTURES="21"
cmake --build build --config Release
```

This configuration enables only compute capability `2.1` (MTT S80) during compilation, which can help reduce compilation time.
7 changes: 7 additions & 0 deletions docs/development/HOWTO-add-model.md
@@ -1,5 +1,12 @@
# Add a new model architecture to `llama.cpp`

- [Add a new model architecture to `llama.cpp`](#add-a-new-model-architecture-to-llamacpp)
- [1. Convert the model to GGUF](#1-convert-the-model-to-gguf)
- [2. Define the model architecture in `llama.cpp`](#2-define-the-model-architecture-in-llamacpp)
- [3. Build the GGML graph implementation](#3-build-the-ggml-graph-implementation)
- [GGUF specification](#gguf-specification)
- [Resources](#resources)

Adding a model requires a few steps:

1. Convert the model to GGUF
9 changes: 9 additions & 0 deletions docs/development/debugging-tests.md
@@ -1,5 +1,14 @@
# Debugging Tests Tips

- [Debugging Tests Tips](#debugging-tests-tips)
- [How to run \& execute or debug a specific test without anything else to keep the feedback loop short?](#how-to-run--execute-or-debug-a-specific-test-without-anything-else-to-keep-the-feedback-loop-short)
- [How does the script work?](#how-does-the-script-work)
- [Step 1: Reset and Setup folder context](#step-1-reset-and-setup-folder-context)
- [Step 2: Setup Build Environment and Compile Test Binaries](#step-2-setup-build-environment-and-compile-test-binaries)
- [Step 3: Find all tests available that matches REGEX](#step-3-find-all-tests-available-that-matches-regex)
- [Step 4: Identify Test Command for Debugging](#step-4-identify-test-command-for-debugging)
- [Step 5: Run GDB on test command](#step-5-run-gdb-on-test-command)

## How to run & execute or debug a specific test without anything else to keep the feedback loop short?

There is a script called `debug-test.sh` in the `scripts` folder which takes a REGEX and an optional test number as its parameters.
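A typical invocation might look like the following; the regex shown is just an example, and the exact test names depend on what CTest registers in your build.

```bash
# Build and run every test whose name matches the regex
./scripts/debug-test.sh test-tokenizer

# Same regex, but pick the 3rd matching test and run the debugger on it
./scripts/debug-test.sh test-tokenizer 3
```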
5 changes: 5 additions & 0 deletions docs/development/token_generation_performance_tips.md
@@ -1,5 +1,10 @@
# Token generation performance troubleshooting

- [Token generation performance troubleshooting](#token-generation-performance-troubleshooting)
- [Verifying that the model is running on the GPU with CUDA](#verifying-that-the-model-is-running-on-the-gpu-with-cuda)
- [Verifying that the CPU is not oversaturated](#verifying-that-the-cpu-is-not-oversaturated)
- [Example of runtime flags effect on inference speed benchmark](#example-of-runtime-flags-effect-on-inference-speed-benchmark)

## Verifying that the model is running on the GPU with CUDA
Make sure you compiled llama.cpp with the correct build options according to [this guide](/docs/build.md#cuda), so that it accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running, you may set `N` to a very large value; llama.cpp will offload the maximum possible number of layers to the GPU, even if that is fewer than the number you configured. For example:
```shell
11 changes: 11 additions & 0 deletions docs/docker.md
@@ -1,5 +1,16 @@
# Docker

- [Docker](#docker)
- [Prerequisites](#prerequisites)
- [Images](#images)
- [Usage](#usage)
- [Docker With CUDA](#docker-with-cuda)
- [Building Docker locally](#building-docker-locally)
- [Usage](#usage-1)
- [Docker With MUSA](#docker-with-musa)
- [Building Docker locally](#building-docker-locally-1)
- [Usage](#usage-2)

## Prerequisites
* Docker must be installed and running on your system.
* Create a folder to store big models & intermediate files (e.g. `/llama/models`); the sketch below assumes this path.
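For orientation before the sections below, a CPU-only run with one of the published images might look roughly like this; the image tag and model file name are assumptions for the sake of the example.

```bash
# Pull the CLI-only image (tag assumed; see the Images section for the available variants)
docker pull ghcr.io/ggml-org/llama.cpp:light

# Mount the models folder created above and run a short completion (model file name is hypothetical)
docker run -v /llama/models:/models ghcr.io/ggml-org/llama.cpp:light \
    -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 256
```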
4 changes: 4 additions & 0 deletions docs/function-calling.md
@@ -1,5 +1,9 @@
# Function Calling

- [Function Calling](#function-calling)
- [Universal support w/ Native \& Generic handlers](#universal-support-w-native--generic-handlers)
- [Usage - need tool-aware Jinja template](#usage---need-tool-aware-jinja-template)

[chat.h](../common/chat.h) (https://github.com/ggml-org/llama.cpp/pull/9639) adds support for [OpenAI-style function calling](https://platform.openai.com/docs/guides/function-calling) and is used in:
- `llama-server` when started w/ `--jinja` flag
- `llama-cli` (WIP: https://github.com/ggml-org/llama.cpp/pull/11556)
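For a rough sense of how this is exercised end to end, the sketch below starts the server with a tool-aware template and sends an OpenAI-style request declaring a single tool; the model path, port and tool definition are assumptions for illustration only.

```bash
# Start the server with Jinja chat templating enabled (model path is hypothetical)
llama-server --jinja -m models/model.gguf --port 8080

# Declare a tool in an OpenAI-style chat completion request; a tool-aware model may
# respond with a tool call instead of plain text
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
```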
5 changes: 5 additions & 0 deletions docs/install.md
@@ -1,5 +1,10 @@
# Install pre-built version of llama.cpp

- [Install pre-built version of llama.cpp](#install-pre-built-version-of-llamacpp)
- [Homebrew](#homebrew)
- [Nix](#nix)
- [Flox](#flox)

## Homebrew

On Mac and Linux, the Homebrew package manager can be used via
8 changes: 8 additions & 0 deletions docs/llguidance.md
@@ -1,5 +1,13 @@
# LLGuidance Support in llama.cpp

- [LLGuidance Support in llama.cpp](#llguidance-support-in-llamacpp)
- [Building](#building)
- [Interface](#interface)
- [Performance](#performance)
- [JSON Schema](#json-schema)
- [Why Not Reuse GBNF Format?](#why-not-reuse-gbnf-format)
- [Error Handling](#error-handling)

[LLGuidance](https://github.com/guidance-ai/llguidance) is a library for constrained decoding (also called constrained sampling or structured outputs) for Large Language Models (LLMs). Initially developed as the backend for the [Guidance](https://github.com/guidance-ai/guidance) library, it can also be used independently.

LLGuidance supports JSON Schemas and arbitrary context-free grammars (CFGs) written in a [variant](https://github.com/guidance-ai/llguidance/blob/main/docs/syntax.md) of Lark syntax. It is [very fast](https://github.com/guidance-ai/jsonschemabench/tree/main/maskbench) and has [excellent](https://github.com/guidance-ai/llguidance/blob/main/docs/json_schema.md) JSON Schema coverage but requires the Rust compiler, which complicates the llama.cpp build process.
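For orientation, enabling the integration is expected to be a single CMake option plus a Rust toolchain on the build machine; treat this as a sketch (the option name is assumed to be `LLAMA_LLGUIDANCE`) and defer to the Building section below for the authoritative steps.

```bash
# Requires cargo/rustc to be installed on the build machine
cmake -B build -DLLAMA_LLGUIDANCE=ON
cmake --build build --config Release
```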