Commit 24fe9f1

Merge pull request #115 from menloresearch/update-dev-from-master-2025-06-05-00-08
Sync master with upstream release b5590
2 parents 31facce + 0d39844 commit 24fe9f1

24 files changed: +624 -240 lines changed

.github/workflows/build.yml

Lines changed: 4 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -839,12 +839,12 @@ jobs:
839839
-DGGML_CUDA=ON
840840
cmake --build build
841841
842-
windows-2019-cmake-cuda:
843-
runs-on: windows-2019
842+
windows-2022-cmake-cuda:
843+
runs-on: windows-2022
844844

845845
strategy:
846846
matrix:
847-
cuda: ['12.4', '11.7']
847+
cuda: ['12.4']
848848

849849
steps:
850850
- name: Clone
@@ -878,7 +878,7 @@ jobs:
878878
env:
879879
CURL_PATH: ${{ steps.get_libcurl.outputs.curl_path }}
880880
run: |
881-
call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Auxiliary\Build\vcvars64.bat"
881+
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
882882
cmake -S . -B build -G "Ninja Multi-Config" ^
883883
-DLLAMA_BUILD_SERVER=ON ^
884884
-DGGML_NATIVE=OFF ^

.github/workflows/release.yml

Lines changed: 12 additions & 5 deletions
@@ -131,8 +131,9 @@ jobs:
131131
include:
132132
- build: 'x64'
133133
os: ubuntu-22.04
134-
- build: 'arm64'
135-
os: ubuntu-22.04-arm
134+
# GGML_BACKEND_DL and GGML_CPU_ALL_VARIANTS are not currently supported on arm
135+
# - build: 'arm64'
136+
# os: ubuntu-22.04-arm
136137

137138
runs-on: ${{ matrix.os }}
138139

@@ -159,6 +160,9 @@ jobs:
159160
id: cmake_build
160161
run: |
161162
cmake -B build \
163+
-DGGML_BACKEND_DL=ON \
164+
-DGGML_NATIVE=OFF \
165+
-DGGML_CPU_ALL_VARIANTS=ON \
162166
-DLLAMA_FATAL_WARNINGS=ON \
163167
${{ env.CMAKE_ARGS }}
164168
cmake --build build --config Release -j $(nproc)
@@ -207,6 +211,9 @@ jobs:
207211
id: cmake_build
208212
run: |
209213
cmake -B build \
214+
-DGGML_BACKEND_DL=ON \
215+
-DGGML_NATIVE=OFF \
216+
-DGGML_CPU_ALL_VARIANTS=ON \
210217
-DGGML_VULKAN=ON \
211218
${{ env.CMAKE_ARGS }}
212219
cmake --build build --config Release -j $(nproc)
@@ -373,11 +380,11 @@ jobs:
373380
name: llama-bin-win-${{ matrix.backend }}-${{ matrix.arch }}.zip
374381

375382
windows-cuda:
376-
runs-on: windows-2019
383+
runs-on: windows-2022
377384

378385
strategy:
379386
matrix:
380-
cuda: ['12.4', '11.7']
387+
cuda: ['12.4']
381388

382389
steps:
383390
- name: Clone
@@ -405,7 +412,7 @@ jobs:
405412
id: cmake_build
406413
shell: cmd
407414
run: |
408-
call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Auxiliary\Build\vcvars64.bat"
415+
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
409416
cmake -S . -B build -G "Ninja Multi-Config" ^
410417
-DGGML_BACKEND_DL=ON ^
411418
-DGGML_NATIVE=OFF ^

.github/workflows/server.yml

Lines changed: 1 addition & 1 deletion
@@ -180,7 +180,7 @@ jobs:
180180
181181
182182
server-windows:
183-
runs-on: windows-2019
183+
runs-on: windows-2022
184184

185185
steps:
186186
- name: Clone

README.md

Lines changed: 30 additions & 11 deletions
@@ -28,6 +28,30 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
2828

2929
----
3030

31+
## Quick start
32+
33+
Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:
34+
35+
- Install `llama.cpp` using [brew, nix or winget](docs/install.md)
36+
- Run with Docker - see our [Docker documentation](docs/docker.md)
37+
- Download pre-built binaries from the [releases page](https://github.com/ggml-org/llama.cpp/releases)
38+
- Build from source by cloning this repository - check out [our build guide](docs/build.md)
39+
40+
Once installed, you'll need a model to work with. Head to the [Obtaining and quantizing models](#obtaining-and-quantizing-models) section to learn more.
41+
42+
Example command:
43+
44+
```sh
45+
# Use a local model file
46+
llama-cli -m my_model.gguf
47+
48+
# Or download and run a model directly from Hugging Face
49+
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
50+
51+
# Launch OpenAI-compatible API server
52+
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
53+
```
54+
3155
## Description
3256

3357
The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
@@ -230,6 +254,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
230254

231255
</details>
232256

257+
233258
## Supported backends
234259

235260
| Backend | Target devices |
@@ -246,24 +271,18 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
246271
| [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |
247272
| [RPC](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) | All |
248273

249-
## Building the project
250-
251-
The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](include/llama.h).
252-
The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server. Possible methods for obtaining the binaries:
253-
254-
- Clone this repository and build locally, see [how to build](docs/build.md)
255-
- On MacOS or Linux, install `llama.cpp` via [brew, flox or nix](docs/install.md)
256-
- Use a Docker image, see [documentation for Docker](docs/docker.md)
257-
- Download pre-built binaries from [releases](https://github.com/ggml-org/llama.cpp/releases)
258-
259274
## Obtaining and quantizing models
260275

261276
The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](https://huggingface.co/models?library=gguf&sort=trending) compatible with `llama.cpp`:
262277

263278
- [Trending](https://huggingface.co/models?library=gguf&sort=trending)
264279
- [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)
265280

266-
You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from [Hugging Face](https://huggingface.co/) or other model hosting sites, such as [ModelScope](https://modelscope.cn/), by using this CLI argument: `-hf <user>/<model>[:quant]`.
281+
You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from [Hugging Face](https://huggingface.co/) or other model hosting sites, such as [ModelScope](https://modelscope.cn/), by using this CLI argument: `-hf <user>/<model>[:quant]`. For example:
282+
283+
```sh
284+
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
285+
```
267286

268287
By default, the CLI would download from Hugging Face, you can switch to other options with the environment variable `MODEL_ENDPOINT`. For example, you may opt to downloading model checkpoints from ModelScope or other model sharing communities by setting the environment variable, e.g. `MODEL_ENDPOINT=https://www.modelscope.cn/`.
269288

docs/build.md

Lines changed: 4 additions & 0 deletions
@@ -1,5 +1,9 @@
11
# Build llama.cpp locally
22

3+
The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](include/llama.h).
4+
5+
The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server.
6+
37
**To get the Code:**
48

59
```bash

docs/install.md

Lines changed: 20 additions & 16 deletions
@@ -1,28 +1,42 @@
11
# Install pre-built version of llama.cpp
22

3-
## Homebrew
3+
| Install via | Windows | Mac | Linux |
4+
|-------------|---------|-----|-------|
5+
| Winget | ✅ | | |
6+
| Homebrew | | ✅ | ✅ |
7+
| MacPorts | | ✅ | |
8+
| Nix | | ✅ | ✅ |
49

5-
On Mac and Linux, the homebrew package manager can be used via
10+
## Winget (Windows)
11+
12+
```sh
13+
winget install llama.cpp
14+
```
15+
16+
The package is automatically updated with new `llama.cpp` releases. More info: https://github.com/ggml-org/llama.cpp/issues/8188
17+
18+
## Homebrew (Mac and Linux)
619

720
```sh
821
brew install llama.cpp
922
```
23+
1024
The formula is automatically updated with new `llama.cpp` releases. More info: https://github.com/ggml-org/llama.cpp/discussions/7668
1125

12-
## MacPorts
26+
## MacPorts (Mac)
1327

1428
```sh
1529
sudo port install llama.cpp
1630
```
17-
see also: https://ports.macports.org/port/llama.cpp/details/
1831

19-
## Nix
32+
See also: https://ports.macports.org/port/llama.cpp/details/
2033

21-
On Mac and Linux, the Nix package manager can be used via
34+
## Nix (Mac and Linux)
2235

2336
```sh
2437
nix profile install nixpkgs#llama-cpp
2538
```
39+
2640
For flake enabled installs.
2741

2842
Or
@@ -34,13 +48,3 @@ nix-env --file '<nixpkgs>' --install --attr llama-cpp
3448
For non-flake enabled installs.
3549

3650
This expression is automatically updated within the [nixpkgs repo](https://github.com/NixOS/nixpkgs/blob/nixos-24.05/pkgs/by-name/ll/llama-cpp/package.nix#L164).
37-
38-
## Flox
39-
40-
On Mac and Linux, Flox can be used to install llama.cpp within a Flox environment via
41-
42-
```sh
43-
flox install llama-cpp
44-
```
45-
46-
Flox follows the nixpkgs build of llama.cpp.

ggml/src/ggml-cpu/ops.cpp

Lines changed: 2 additions & 2 deletions
@@ -8132,8 +8132,8 @@ static void ggml_compute_forward_rwkv_wkv6_f32(
81328132
#define WKV_VECTOR_SIZE 4
81338133
#endif
81348134

8135-
int wkv_vector_size;
81368135
#ifdef WKV_VECTOR_SIZE
8136+
int wkv_vector_size;
81378137
#if defined(__ARM_FEATURE_SVE)
81388138
wkv_vector_size = svcntw();
81398139
#else
@@ -8348,8 +8348,8 @@ static void ggml_compute_forward_gla_f32(
83488348
#define GLA_VECTOR_SIZE 4
83498349
#endif
83508350

8351-
int gla_vector_size;
83528351
#ifdef GLA_VECTOR_SIZE
8352+
int gla_vector_size;
83538353
#if defined(__ARM_FEATURE_SVE)
83548354
gla_vector_size = svcntw();
83558355
#else
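
Note on the two hunks above: each moves a declaration inside the `#ifdef` that guards its only uses. A plausible motivation (an assumption, not stated in the commit) is that builds without a SIMD path would otherwise see a declared-but-unused variable, which becomes an error when warnings are fatal. A minimal sketch of the pattern, using the hypothetical names `DEMO_VECTOR_SIZE` and `demo_lanes` rather than anything from `ops.cpp`:

```cpp
// Hypothetical stand-in for WKV_VECTOR_SIZE / GLA_VECTOR_SIZE:
// the macro is defined only when some SIMD feature is available.
#if defined(__AVX__) || defined(__ARM_NEON)
#define DEMO_VECTOR_SIZE 4
#endif

// Returns the number of lanes processed per step.
int demo_lanes() {
#ifdef DEMO_VECTOR_SIZE
    // Declared inside the guard, so the variable exists only where it is used.
    int lanes = DEMO_VECTOR_SIZE;
    return lanes;
#else
    // Scalar fallback: no variable is declared, so nothing is left unused.
    return 1;
#endif
}
```

With the declaration outside the guard, the scalar branch would compile a variable that is never read.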

ggml/src/ggml-cuda/fattn-mma-f16.cuh

Lines changed: 4 additions & 1 deletion
@@ -652,9 +652,12 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
652652
float KQ_max_scale[cols_per_thread];
653653
#pragma unroll
654654
for (int col = 0; col < cols_per_thread; ++col) {
655-
KQ_max_scale[col] = expf(KQ_max[col] - KQ_max_new[col]);
655+
const float KQ_max_diff = KQ_max[col] - KQ_max_new[col];
656+
KQ_max_scale[col] = expf(KQ_max_diff);
656657
KQ_max[col] = KQ_max_new[col];
657658

659+
*((uint32_t *) &KQ_max_scale[col]) *= KQ_max_diff >= SOFTMAX_FTZ_THRESHOLD;
660+
658661
// Scale previous KQ_rowsum to account for a potential increase in KQ_max:
659662
KQ_rowsum[col] = KQ_max_scale[col]*KQ_rowsum[col] + KQ_rowsum_add[col];
660663
}
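
Note on the hunk above: the added line multiplies the bit pattern of `KQ_max_scale[col]` by the result of a comparison (0 or 1), so the scale becomes exactly `0.0f` whenever `KQ_max_diff` is below `SOFTMAX_FTZ_THRESHOLD`. A host-side sketch of the same trick follows. The names `kFtzThreshold` and `scale_with_ftz` are hypothetical, and the value `-20.0f` is an assumption because the diff does not show the constant's definition. The sketch uses `std::memcpy` instead of the pointer cast to stay within standard C++ aliasing rules.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Assumed threshold; the real SOFTMAX_FTZ_THRESHOLD is not shown in this diff.
static const float kFtzThreshold = -20.0f;

// Returns exp(diff), flushed to exactly 0.0f when diff is below the threshold.
static float scale_with_ftz(float diff) {
    float scale = std::exp(diff);

    // Copy the float's bits into an integer of the same width.
    static_assert(sizeof(std::uint32_t) == sizeof(float), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &scale, sizeof(bits));

    // Multiply by 1 (keep the value) or 0 (all bits cleared, i.e. +0.0f).
    bits *= static_cast<std::uint32_t>(diff >= kFtzThreshold);

    std::memcpy(&scale, &bits, sizeof(scale));
    return scale;
}
```

The integer multiply avoids a branch: an all-zero bit pattern is `+0.0f`, so no separate conditional assignment is needed.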
