diff --git a/docs/android.md b/docs/android.md
index d2a835653fe5d..79546f51d8d2a 100644
--- a/docs/android.md
+++ b/docs/android.md
@@ -1,6 +1,10 @@
 # Android
 
+- [Android](#android)
+  - [Build on Android using Termux](#build-on-android-using-termux)
+  - [Cross-compile using Android NDK](#cross-compile-using-android-ndk)
+
 ## Build on Android using Termux
 
 [Termux](https://termux.dev/en/) is an Android terminal emulator and Linux environment app (no root required). As of writing, Termux is available experimentally in the Google Play Store; otherwise, it may be obtained directly from the project repo or on F-Droid.
 
diff --git a/docs/build.md b/docs/build.md
index 2e3975c145360..9c88a2b6475db 100644
--- a/docs/build.md
+++ b/docs/build.md
@@ -1,5 +1,39 @@
 # Build llama.cpp locally
 
+- [Build llama.cpp locally](#build-llamacpp-locally)
+  - [CPU Build](#cpu-build)
+  - [BLAS Build](#blas-build)
+    - [Accelerate Framework](#accelerate-framework)
+    - [OpenBLAS](#openblas)
+    - [BLIS](#blis)
+    - [Intel oneMKL](#intel-onemkl)
+    - [Other BLAS libraries](#other-blas-libraries)
+  - [Metal Build](#metal-build)
+  - [SYCL](#sycl)
+  - [CUDA](#cuda)
+    - [Download directly from NVIDIA](#download-directly-from-nvidia)
+    - [Compile and run inside a Fedora Toolbox Container](#compile-and-run-inside-a-fedora-toolbox-container)
+    - [Compilation](#compilation)
+    - [Override Compute Capability Specifications](#override-compute-capability-specifications)
+    - [Runtime CUDA environmental variables](#runtime-cuda-environmental-variables)
+      - [Unified Memory](#unified-memory)
+      - [Performance Tuning](#performance-tuning)
+  - [MUSA](#musa)
+    - [Download directly from Moore Threads](#download-directly-from-moore-threads)
+    - [Compilation](#compilation-1)
+    - [Override Compute Capability Specifications](#override-compute-capability-specifications-1)
+    - [Compilation options](#compilation-options)
+    - [Runtime MUSA environmental variables](#runtime-musa-environmental-variables)
+      - [Unified Memory](#unified-memory-1)
+  - [HIP](#hip)
+  - [Vulkan](#vulkan)
+    - [w64devkit](#w64devkit)
+    - [Git Bash MINGW64](#git-bash-mingw64)
+    - [MSYS2](#msys2)
+  - [CANN](#cann)
+  - [Android](#android)
+  - [Notes about GPU-accelerated backends](#notes-about-gpu-accelerated-backends)
+
 **To get the Code:**
 
 ```bash
@@ -156,19 +190,19 @@ nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
 
 To override the `native` GPU detection:
 
-#### 1. Take note of the `Compute Capability` of your NVIDIA devices: ["CUDA: Your GPU Compute > Capability"](https://developer.nvidia.com/cuda-gpus).
+1. Take note of the `Compute Capability` of your NVIDIA devices: ["CUDA: Your GPU Compute > Capability"](https://developer.nvidia.com/cuda-gpus).
 
-```text
-GeForce RTX 4090      8.9
-GeForce RTX 3080 Ti   8.6
-GeForce RTX 3070      8.6
-```
+   ```text
+   GeForce RTX 4090      8.9
+   GeForce RTX 3080 Ti   8.6
+   GeForce RTX 3070      8.6
+   ```
 
-#### 2. Manually list each varying `Compute Capability` in the `CMAKE_CUDA_ARCHITECTURES` list.
+2. Manually list each varying `Compute Capability` in the `CMAKE_CUDA_ARCHITECTURES` list.
 
-```bash
-cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
-```
+   ```bash
+   cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
+   ```
 
 ### Runtime CUDA environmental variables
 
@@ -216,6 +250,7 @@ By default, all supported compute capabilities are enabled. To customize this be
 
 ```bash
 cmake -B build -DGGML_MUSA=ON -DMUSA_ARCHITECTURES="21"
+cmake --build build --config Release
 ```
 
 This configuration enables only compute capability `2.1` (MTT S80) during compilation, which can help reduce compilation time.
diff --git a/docs/development/HOWTO-add-model.md b/docs/development/HOWTO-add-model.md
index 78c6f76077a2b..0518dcc92dc66 100644
--- a/docs/development/HOWTO-add-model.md
+++ b/docs/development/HOWTO-add-model.md
@@ -1,5 +1,12 @@
 # Add a new model architecture to `llama.cpp`
 
+- [Add a new model architecture to `llama.cpp`](#add-a-new-model-architecture-to-llamacpp)
+  - [1. Convert the model to GGUF](#1-convert-the-model-to-gguf)
+  - [2. Define the model architecture in `llama.cpp`](#2-define-the-model-architecture-in-llamacpp)
+  - [3. Build the GGML graph implementation](#3-build-the-ggml-graph-implementation)
+  - [GGUF specification](#gguf-specification)
+  - [Resources](#resources)
+
 Adding a model requires few steps:
 
 1. Convert the model to GGUF
diff --git a/docs/development/debugging-tests.md b/docs/development/debugging-tests.md
index 18407f688f9db..db82d20f7ca4a 100644
--- a/docs/development/debugging-tests.md
+++ b/docs/development/debugging-tests.md
@@ -1,5 +1,14 @@
 # Debugging Tests Tips
 
+- [Debugging Tests Tips](#debugging-tests-tips)
+  - [How to run \& execute or debug a specific test without anything else to keep the feedback loop short?](#how-to-run--execute-or-debug-a-specific-test-without-anything-else-to-keep-the-feedback-loop-short)
+  - [How does the script work?](#how-does-the-script-work)
+    - [Step 1: Reset and Setup folder context](#step-1-reset-and-setup-folder-context)
+    - [Step 2: Setup Build Environment and Compile Test Binaries](#step-2-setup-build-environment-and-compile-test-binaries)
+    - [Step 3: Find all tests available that matches REGEX](#step-3-find-all-tests-available-that-matches-regex)
+    - [Step 4: Identify Test Command for Debugging](#step-4-identify-test-command-for-debugging)
+    - [Step 5: Run GDB on test command](#step-5-run-gdb-on-test-command)
+
 ## How to run & execute or debug a specific test without anything else to keep the feedback loop short?
 
 There is a script called debug-test.sh in the scripts folder whose parameter takes a REGEX and an optional test number.
diff --git a/docs/development/token_generation_performance_tips.md b/docs/development/token_generation_performance_tips.md
index 41b7232c976b3..6c1cacd80ba2a 100644
--- a/docs/development/token_generation_performance_tips.md
+++ b/docs/development/token_generation_performance_tips.md
@@ -1,5 +1,10 @@
 # Token generation performance troubleshooting
 
+- [Token generation performance troubleshooting](#token-generation-performance-troubleshooting)
+  - [Verifying that the model is running on the GPU with CUDA](#verifying-that-the-model-is-running-on-the-gpu-with-cuda)
+  - [Verifying that the CPU is not oversaturated](#verifying-that-the-cpu-is-not-oversaturated)
+- [Example of runtime flags effect on inference speed benchmark](#example-of-runtime-flags-effect-on-inference-speed-benchmark)
+
 ## Verifying that the model is running on the GPU with CUDA
 Make sure you compiled llama with the correct env variables according to [this guide](/docs/build.md#cuda), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may configure `N` to be very large, and llama will offload the maximum possible number of layers to the GPU, even if it's less than the number you configured. For example:
 ```shell
diff --git a/docs/docker.md b/docs/docker.md
index 343146dbd214f..4877d9f79d14b 100644
--- a/docs/docker.md
+++ b/docs/docker.md
@@ -1,5 +1,16 @@
 # Docker
 
+- [Docker](#docker)
+  - [Prerequisites](#prerequisites)
+  - [Images](#images)
+  - [Usage](#usage)
+  - [Docker With CUDA](#docker-with-cuda)
+    - [Building Docker locally](#building-docker-locally)
+    - [Usage](#usage-1)
+  - [Docker With MUSA](#docker-with-musa)
+    - [Building Docker locally](#building-docker-locally-1)
+    - [Usage](#usage-2)
+
 ## Prerequisites
 * Docker must be installed and running on your system.
 * Create a folder to store big models & intermediate files (ex. /llama/models)
diff --git a/docs/function-calling.md b/docs/function-calling.md
index c3873c3fa63d1..b48e346882f17 100644
--- a/docs/function-calling.md
+++ b/docs/function-calling.md
@@ -1,5 +1,9 @@
 # Function Calling
 
+- [Function Calling](#function-calling)
+  - [Universal support w/ Native \& Generic handlers](#universal-support-w-native--generic-handlers)
+- [Usage - need tool-aware Jinja template](#usage---need-tool-aware-jinja-template)
+
 [chat.h](../common/chat.h) (https://github.com/ggml-org/llama.cpp/pull/9639) adds support for [OpenAI-style function calling](https://platform.openai.com/docs/guides/function-calling) and is used in:
 - `llama-server` when started w/ `--jinja` flag
 - `llama-cli` (WIP: https://github.com/ggml-org/llama.cpp/pull/11556)
diff --git a/docs/install.md b/docs/install.md
index 0e23a2c9e7ae1..94b943e34b73a 100644
--- a/docs/install.md
+++ b/docs/install.md
@@ -1,5 +1,10 @@
 # Install pre-built version of llama.cpp
 
+- [Install pre-built version of llama.cpp](#install-pre-built-version-of-llamacpp)
+  - [Homebrew](#homebrew)
+  - [Nix](#nix)
+  - [Flox](#flox)
+
 ## Homebrew
 
 On Mac and Linux, the homebrew package manager can be used via
diff --git a/docs/llguidance.md b/docs/llguidance.md
index cda787b14de04..0d5eacc72aa37 100644
--- a/docs/llguidance.md
+++ b/docs/llguidance.md
@@ -1,5 +1,13 @@
 # LLGuidance Support in llama.cpp
 
+- [LLGuidance Support in llama.cpp](#llguidance-support-in-llamacpp)
+  - [Building](#building)
+  - [Interface](#interface)
+  - [Performance](#performance)
+  - [JSON Schema](#json-schema)
+  - [Why Not Reuse GBNF Format?](#why-not-reuse-gbnf-format)
+  - [Error Handling](#error-handling)
+
 [LLGuidance](https://github.com/guidance-ai/llguidance) is a library for constrained decoding (also called constrained sampling or structured outputs) for Large Language Models (LLMs). Initially developed as the backend for the [Guidance](https://github.com/guidance-ai/guidance) library, it can also be used independently.
 
 LLGuidance supports JSON Schemas and arbitrary context-free grammars (CFGs) written in a [variant](https://github.com/guidance-ai/llguidance/blob/main/docs/syntax.md) of Lark syntax. It is [very fast](https://github.com/guidance-ai/jsonschemabench/tree/main/maskbench) and has [excellent](https://github.com/guidance-ai/llguidance/blob/main/docs/json_schema.md) JSON Schema coverage but requires the Rust compiler, which complicates the llama.cpp build process.
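
The anchors in the tables of contents added above follow the usual GitHub heading-slug convention: lower-case the heading, drop punctuation such as `.`, `&`, `/`, `?` and backticks, and turn spaces into hyphens, with repeated headings receiving numeric suffixes (hence `#compilation-1`, `#usage-2`). The snippet below is a rough illustrative sketch of that rule applied to one heading from `docs/build.md`; it is not part of the patch and does not handle the duplicate-suffix case.

```bash
# Rough sketch (illustrative only, not part of the patch above) of the slug rule
# behind the TOC links: lower-case, strip punctuation, replace spaces with hyphens.
# Duplicate headings would additionally need a "-1", "-2", ... suffix.
heading="Build llama.cpp locally"
anchor=$(echo "$heading" | tr '[:upper:]' '[:lower:]' | sed -e 's/[^a-z0-9 _-]//g' -e 's/ /-/g')
echo "- [$heading](#$anchor)"
# prints: - [Build llama.cpp locally](#build-llamacpp-locally)
```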