4 changes: 4 additions & 0 deletions docs/android.md
@@ -1,6 +1,10 @@

# Android

- [Android](#android)
- [Build on Android using Termux](#build-on-android-using-termux)
- [Cross-compile using Android NDK](#cross-compile-using-android-ndk)

## Build on Android using Termux

[Termux](https://termux.dev/en/) is an Android terminal emulator and Linux environment app (no root required). As of writing, Termux is available experimentally in the Google Play Store; otherwise, it may be obtained directly from the project repo or on F-Droid.
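Once Termux is installed, the build itself follows the usual CMake flow. The sketch below is only an illustration of that flow and assumes the `git`, `cmake` and `clang` packages from the Termux repositories; it is not a substitute for the detailed steps in the rest of this section.

```bash
# Inside Termux: install a toolchain (package names are assumptions, check `pkg search`)
pkg update && pkg install -y git cmake clang

# Fetch the sources and do a plain CPU build, as on any other Linux-like environment
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```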
55 changes: 45 additions & 10 deletions docs/build.md
@@ -1,5 +1,39 @@
# Build llama.cpp locally

- [Build llama.cpp locally](#build-llamacpp-locally)
- [CPU Build](#cpu-build)
- [BLAS Build](#blas-build)
- [Accelerate Framework](#accelerate-framework)
- [OpenBLAS](#openblas)
- [BLIS](#blis)
- [Intel oneMKL](#intel-onemkl)
- [Other BLAS libraries](#other-blas-libraries)
- [Metal Build](#metal-build)
- [SYCL](#sycl)
- [CUDA](#cuda)
- [Download directly from NVIDIA](#download-directly-from-nvidia)
- [Compile and run inside a Fedora Toolbox Container](#compile-and-run-inside-a-fedora-toolbox-container)
- [Compilation](#compilation)
- [Override Compute Capability Specifications](#override-compute-capability-specifications)
- [Runtime CUDA environmental variables](#runtime-cuda-environmental-variables)
- [Unified Memory](#unified-memory)
- [Performance Tuning](#performance-tuning)
- [MUSA](#musa)
- [Download directly from Moore Threads](#download-directly-from-moore-threads)
- [Compilation](#compilation-1)
- [Override Compute Capability Specifications](#override-compute-capability-specifications-1)
- [Compilation options](#compilation-options)
- [Runtime MUSA environmental variables](#runtime-musa-environmental-variables)
- [Unified Memory](#unified-memory-1)
- [HIP](#hip)
- [Vulkan](#vulkan)
- [w64devkit](#w64devkit)
- [Git Bash MINGW64](#git-bash-mingw64)
- [MSYS2](#msys2)
- [CANN](#cann)
- [Android](#android)
- [Notes about GPU-accelerated backends](#notes-about-gpu-accelerated-backends)

**To get the Code:**

```bash
@@ -156,19 +190,19 @@ nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used

To override the `native` GPU detection:

-#### 1. Take note of the `Compute Capability` of your NVIDIA devices: ["CUDA: Your GPU Compute > Capability"](https://developer.nvidia.com/cuda-gpus).
+1. Take note of the `Compute Capability` of your NVIDIA devices: ["CUDA: Your GPU Compute > Capability"](https://developer.nvidia.com/cuda-gpus).

-```text
-GeForce RTX 4090    8.9
-GeForce RTX 3080 Ti 8.6
-GeForce RTX 3070    8.6
-```
+   ```text
+   GeForce RTX 4090    8.9
+   GeForce RTX 3080 Ti 8.6
+   GeForce RTX 3070    8.6
+   ```

-#### 2. Manually list each varying `Compute Capability` in the `CMAKE_CUDA_ARCHITECTURES` list.
+2. Manually list each varying `Compute Capability` in the `CMAKE_CUDA_ARCHITECTURES` list.

-```bash
-cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
-```
+   ```bash
+   cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
+   ```
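If you are unsure which compute capabilities your installed GPUs report, a reasonably recent NVIDIA driver can usually be queried directly; this is only a convenience sketch, not part of the build steps above.

```bash
# List the compute capability of each visible GPU
# (requires a driver new enough to expose the compute_cap query field)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```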

### Runtime CUDA environmental variables

@@ -216,6 +250,7 @@ By default, all supported compute capabilities are enabled. To customize this be

```bash
cmake -B build -DGGML_MUSA=ON -DMUSA_ARCHITECTURES="21"
cmake --build build --config Release
```

This configuration enables only compute capability `2.1` (MTT S80) during compilation, which can help reduce compilation time.
7 changes: 7 additions & 0 deletions docs/development/HOWTO-add-model.md
@@ -1,5 +1,12 @@
# Add a new model architecture to `llama.cpp`

- [Add a new model architecture to `llama.cpp`](#add-a-new-model-architecture-to-llamacpp)
- [1. Convert the model to GGUF](#1-convert-the-model-to-gguf)
- [2. Define the model architecture in `llama.cpp`](#2-define-the-model-architecture-in-llamacpp)
- [3. Build the GGML graph implementation](#3-build-the-ggml-graph-implementation)
- [GGUF specification](#gguf-specification)
- [Resources](#resources)

Adding a model requires a few steps:

1. Convert the model to GGUF
9 changes: 9 additions & 0 deletions docs/development/debugging-tests.md
@@ -1,5 +1,14 @@
# Debugging Tests Tips

- [Debugging Tests Tips](#debugging-tests-tips)
- [How to run \& execute or debug a specific test without anything else to keep the feedback loop short?](#how-to-run--execute-or-debug-a-specific-test-without-anything-else-to-keep-the-feedback-loop-short)
- [How does the script work?](#how-does-the-script-work)
- [Step 1: Reset and Setup folder context](#step-1-reset-and-setup-folder-context)
- [Step 2: Setup Build Environment and Compile Test Binaries](#step-2-setup-build-environment-and-compile-test-binaries)
- [Step 3: Find all tests available that matches REGEX](#step-3-find-all-tests-available-that-matches-regex)
- [Step 4: Identify Test Command for Debugging](#step-4-identify-test-command-for-debugging)
- [Step 5: Run GDB on test command](#step-5-run-gdb-on-test-command)

## How to run & execute or debug a specific test without anything else to keep the feedback loop short?

There is a script called `debug-test.sh` in the `scripts` folder which takes a REGEX and an optional test number as its parameters.
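A typical invocation might look like the following; the regex shown is just an example, and the exact test names depend on what CTest registers in your build.

```bash
# Build and run every test whose name matches the regex
./scripts/debug-test.sh test-tokenizer

# Same regex, but pick the 3rd matching test and run the debugger on it
./scripts/debug-test.sh test-tokenizer 3
```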
5 changes: 5 additions & 0 deletions docs/development/token_generation_performance_tips.md
@@ -1,5 +1,10 @@
# Token generation performance troubleshooting

- [Token generation performance troubleshooting](#token-generation-performance-troubleshooting)
- [Verifying that the model is running on the GPU with CUDA](#verifying-that-the-model-is-running-on-the-gpu-with-cuda)
- [Verifying that the CPU is not oversaturated](#verifying-that-the-cpu-is-not-oversaturated)
- [Example of runtime flags effect on inference speed benchmark](#example-of-runtime-flags-effect-on-inference-speed-benchmark)

## Verifying that the model is running on the GPU with CUDA
Make sure you compiled llama.cpp with the correct build options according to [this guide](/docs/build.md#cuda), so that it accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running, you may set `N` to a very large value; llama.cpp will offload the maximum possible number of layers to the GPU, even if that is fewer than the number you configured. For example:
```shell
11 changes: 11 additions & 0 deletions docs/docker.md
@@ -1,5 +1,16 @@
# Docker

- [Docker](#docker)
- [Prerequisites](#prerequisites)
- [Images](#images)
- [Usage](#usage)
- [Docker With CUDA](#docker-with-cuda)
- [Building Docker locally](#building-docker-locally)
- [Usage](#usage-1)
- [Docker With MUSA](#docker-with-musa)
- [Building Docker locally](#building-docker-locally-1)
- [Usage](#usage-2)

## Prerequisites
* Docker must be installed and running on your system.
* Create a folder to store big models & intermediate files (e.g. `/llama/models`); the sketch below assumes this path.
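For orientation before the sections below, a CPU-only run with one of the published images might look roughly like this; the image tag and model file name are assumptions for the sake of the example.

```bash
# Pull the CLI-only image (tag assumed; see the Images section for the available variants)
docker pull ghcr.io/ggml-org/llama.cpp:light

# Mount the models folder created above and run a short completion (model file name is hypothetical)
docker run -v /llama/models:/models ghcr.io/ggml-org/llama.cpp:light \
    -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 256
```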
4 changes: 4 additions & 0 deletions docs/function-calling.md
@@ -1,5 +1,9 @@
# Function Calling

- [Function Calling](#function-calling)
- [Universal support w/ Native \& Generic handlers](#universal-support-w-native--generic-handlers)
- [Usage - need tool-aware Jinja template](#usage---need-tool-aware-jinja-template)

[chat.h](../common/chat.h) (https://github.com/ggml-org/llama.cpp/pull/9639) adds support for [OpenAI-style function calling](https://platform.openai.com/docs/guides/function-calling) and is used in:
- `llama-server` when started w/ `--jinja` flag
- `llama-cli` (WIP: https://github.com/ggml-org/llama.cpp/pull/11556)
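For a rough sense of how this is exercised end to end, the sketch below starts the server with a tool-aware template and sends an OpenAI-style request declaring a single tool; the model path, port and tool definition are assumptions for illustration only.

```bash
# Start the server with Jinja chat templating enabled (model path is hypothetical)
llama-server --jinja -m models/model.gguf --port 8080

# Declare a tool in an OpenAI-style chat completion request; a tool-aware model may
# respond with a tool call instead of plain text
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
```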
5 changes: 5 additions & 0 deletions docs/install.md
@@ -1,5 +1,10 @@
# Install pre-built version of llama.cpp

- [Install pre-built version of llama.cpp](#install-pre-built-version-of-llamacpp)
- [Homebrew](#homebrew)
- [Nix](#nix)
- [Flox](#flox)

## Homebrew

On Mac and Linux, the Homebrew package manager can be used via
8 changes: 8 additions & 0 deletions docs/llguidance.md
@@ -1,5 +1,13 @@
# LLGuidance Support in llama.cpp

- [LLGuidance Support in llama.cpp](#llguidance-support-in-llamacpp)
- [Building](#building)
- [Interface](#interface)
- [Performance](#performance)
- [JSON Schema](#json-schema)
- [Why Not Reuse GBNF Format?](#why-not-reuse-gbnf-format)
- [Error Handling](#error-handling)

[LLGuidance](https://github.com/guidance-ai/llguidance) is a library for constrained decoding (also called constrained sampling or structured outputs) for Large Language Models (LLMs). Initially developed as the backend for the [Guidance](https://github.com/guidance-ai/guidance) library, it can also be used independently.

LLGuidance supports JSON Schemas and arbitrary context-free grammars (CFGs) written in a [variant](https://github.com/guidance-ai/llguidance/blob/main/docs/syntax.md) of Lark syntax. It is [very fast](https://github.com/guidance-ai/jsonschemabench/tree/main/maskbench) and has [excellent](https://github.com/guidance-ai/llguidance/blob/main/docs/json_schema.md) JSON Schema coverage but requires the Rust compiler, which complicates the llama.cpp build process.
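For orientation, enabling the integration is expected to be a single CMake option plus a Rust toolchain on the build machine; treat this as a sketch (the option name is assumed to be `LLAMA_LLGUIDANCE`) and defer to the Building section below for the authoritative steps.

```bash
# Requires cargo/rustc to be installed on the build machine
cmake -B build -DLLAMA_LLGUIDANCE=ON
cmake --build build --config Release
```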