Changes from all commits (56 commits)
- 0cdce38 CUDA: fix FP16 overflow in tile FA kernel (#17875) (JohannesGaessler, Dec 9, 2025)
- ca709e4 CANN: add support for partial RoPE and Vision mode (#17543) (noemotiovon, Dec 9, 2025)
- 4e842d5 console: allow using arrow left/right, home/end keys and history mode… (ngxson, Dec 9, 2025)
- 42b12b5 model : nit, DeepSeek V1 MoE is 16B and GigaChat is 20B (#12652) (CISC, Dec 9, 2025)
- 63908b6 cmake: fix Mach-O current version number (#17877) (Rhys-T, Dec 9, 2025)
- 86a3f0f ggml : allow fill node alloc inplace (#17870) (CISC, Dec 9, 2025)
- 6b82eb7 metal : print node names for debugging (#17882) (ggerganov, Dec 9, 2025)
- 02e409a ggml : Provide macos-specific backtrace printing to avoid terminal de… (gabe-l-hart, Dec 9, 2025)
- 48f4756 docs: clarify that CPU support should be first (#17886) (JohannesGaessler, Dec 9, 2025)
- b635092 Add DIAG for CUDA (#17873) (pwilkin, Dec 9, 2025)
- 086a63e metal: SSM kernel improvements (#17876) (gabe-l-hart, Dec 9, 2025)
- 6339185 docs : update cpu and cuda ops (#17890) (CISC, Dec 9, 2025)
- 2fbe3b7 common : add parser for ministral/mistral large 3/devstral 2 (#17713) (aldehir, Dec 9, 2025)
- 2e9eab8 fix softmax for iGPU (#17838) (NeoZhangJianyu, Dec 10, 2025)
- 9e79b01 convert: allow using quantized Mistral weight (#17889) (ngxson, Dec 10, 2025)
- 17f7f4b CUDA: fix unpadded strides in MMA FA kernel (#17891) (JohannesGaessler, Dec 10, 2025)
- 2d2e103 docs : update opencl ops (#17904) (lhez, Dec 10, 2025)
- b677721 model : Qwen3-Next-80B-A3B has 48 layers (#17898) (EZForever, Dec 10, 2025)
- 6c21317 cli: new CLI experience (#17824) (ngxson, Dec 10, 2025)
- 4df6e85 cuda : add missing support check for xielu (#17895) (CISC, Dec 10, 2025)
- 4dff236 ggml : remove GGML_KQ_MASK_PAD constant (#17910) (ggerganov, Dec 10, 2025)
- e1f4921 Fix race conditions in threadpool when dealing with dynamic/frequent … (max-krasnyansky, Dec 10, 2025)
- f32ca51 server: add presets (config) when using multiple models (#17859) (ServeurpersoCom, Dec 10, 2025)
- 34a6d86 cli: enable jinja by default (#17911) (ngxson, Dec 10, 2025)
- c6b2c93 mtmd: some small clean up (#17909) (ngxson, Dec 10, 2025)
- 45e350e ci: fix riscv64-native build (#17916) (CISC, Dec 10, 2025)
- 34ce48d ggml-hexagon: fix `rope` failure at `test-backend-ops` (#17565) (chraac, Dec 10, 2025)
- ca0931e OWNERS: add file for OpenShift CI control (kpouget, Aug 29, 2025)
- 354d5a0 Add helper scripts (kpouget, Aug 29, 2025)
- 7be7680 ggml: add the ggml-remotingfrontend and ggml-remotingbackend libraries (kpouget, Aug 29, 2025)
- 89f338c ggml: src: ggml-remotingfrontend/ggml-backend: add stub for .graph_op… (kpouget, Nov 3, 2025)
- a146b32 src: llama-*: reduce the verbosity (kpouget, Aug 29, 2025)
- 18b9a5a tools: run: run: add timing instrumentation (kpouget, Aug 29, 2025)
- 55cdc21 ggml-metal: make less verbose (kpouget, Nov 3, 2025)
- 68eb05e HACK: ggml-cpu: reduce the verbosity (kpouget, Nov 13, 2025)
- 5518a24 ggml-remotingbackend: add missing includes (kpouget, Nov 13, 2025)
- 816ab5e ggml/src/ggml-backend-reg.cpp: fix the frontend library name (kpouget, Nov 20, 2025)
- 76d5f4d ggml-remotingbackend/backend: disable the GGML_BACKEND_LIBRARY_METAL_… (kpouget, Nov 20, 2025)
- 66d9487 ggml-remotingfrontend: disable USE_METAL_GUEST_SUPPORTS_OP (kpouget, Nov 20, 2025)
- 342e8a5 ggml-remotingfrontend/ggml-backend-reg: don't initialize the metal co… (kpouget, Nov 20, 2025)
- 8040ff8 ggml-remotingfrontend: disable USE_FROM_PTR (for Linux) (kpouget, Nov 20, 2025)
- 67ace70 ggml: src: ggml-remotingfrontend/ggml-remoting: use the arch to disti… (kpouget, Dec 1, 2025)
- 2ba908e Update to make it work with Linux (kpouget, Nov 20, 2025)
- 8bb6441 ggml: src: ggml-remotingfrontend/ggml-backend-buffer: don't stop on b… (kpouget, Dec 2, 2025)
- 676605c ggml: force disable vulkan loader when compiling the ggml-remotingbac… (kpouget, Dec 9, 2025)
- b42853d ggml-remoting: add support for the buffer_cpy_tensor function (kpouget, Dec 9, 2025)
- 75b65c3 ggml-remotingbackend: cleanups (kpouget, Dec 9, 2025)
- dcee659 ggml-remotingbackend: fix inconsist fatal message ... (kpouget, Dec 9, 2025)
- 2d47b59 ggml-remotingfrontend: properly distinguish if the backend supports b… (kpouget, Dec 9, 2025)
- bff6ad3 Revert "metal : make the FA extra sizes consistent (#17143)" (kpouget, Dec 11, 2025)
- 5592ddf ggml-remoting: add the code the the graph_optimize function (kpouget, Dec 9, 2025)
- d346b0d ggml-remoting: update the graph serial/deserial (kpouget, Dec 9, 2025)
- 4be1999 Start integrating code for returning the optimized cgraph (kpouget, Dec 9, 2025)
- bfecaa3 Add code to track the frontend buffers (kpouget, Dec 9, 2025)
- da4bb3f ggml-remotingfrontend: lookup the guest buffer from the host handle (kpouget, Dec 9, 2025)
- cdf20e8 HACK: ggml-backend-reg: allow disabling the Vulkan backend (kpouget, Dec 11, 2025)
6 changes: 3 additions & 3 deletions .github/workflows/build.yml
@@ -243,7 +243,7 @@ jobs:
echo "Fetch llama2c model"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin
./bin/llama-convert-llama2c-to-ggml --copy-vocab-from-model ./tok512.bin --llama2c-model stories260K.bin --llama2c-output-model stories260K.gguf
- ./bin/llama-cli -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
+ ./bin/llama-completion -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256

- name: Test llama2c (s390x)
id: llama2c_test_s390x
@@ -252,7 +252,7 @@
cd build
echo "Fetch llama2c big-endian model"
wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K-be.gguf
- ./bin/llama-cli -m stories260K-be.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
+ ./bin/llama-completion -m stories260K-be.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256

ubuntu-latest-cmake-sanitizer:
runs-on: ubuntu-latest
@@ -1770,7 +1770,7 @@ jobs:
echo "Fetch llama2c model"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin
./bin/llama-convert-llama2c-to-ggml --copy-vocab-from-model ./tok512.bin --llama2c-model stories260K.bin --llama2c-output-model stories260K.gguf
- ./bin/llama-cli -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
+ ./bin/llama-completion -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256

ubuntu-cmake-sanitizer-riscv64-native:
runs-on: RISCV64
2 changes: 2 additions & 0 deletions CMakePresets.json
@@ -30,6 +30,8 @@
{ "name": "static", "hidden": true, "cacheVariables": { "GGML_STATIC": "ON" } },
{ "name": "sycl_f16", "hidden": true, "cacheVariables": { "GGML_SYCL_F16": "ON" } },
{ "name": "vulkan", "hidden": true, "cacheVariables": { "GGML_VULKAN": "ON" } },
+ { "name": "remoting_frontend", "hidden": true, "cacheVariables": { "GGML_REMOTING_FRONTEND": "ON" } },
+ { "name": "remoting_backend", "hidden": true, "cacheVariables": { "GGML_REMOTING_BACKEND": "ON" } },

{
"name": "x64-windows-llvm", "hidden": true,
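The two new entries are hidden presets that only set cache variables, mirroring the existing `vulkan` entry above. As a hedged sketch (this hunk does not show how the hidden presets get composed into a full configure preset), the same effect can be achieved by setting the new cache variables directly; the build directory name here is only an assumption borrowed from the helper scripts later in this PR:

```bash
# Configure and build the guest-side remoting frontend;
# GGML_REMOTING_BACKEND=ON would select the host-side backend library instead.
cmake -B ../build.remoting-frontend -DGGML_REMOTING_FRONTEND=ON
cmake --build ../build.remoting-frontend --parallel 8
```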
1 change: 1 addition & 0 deletions CONTRIBUTING.md
@@ -15,6 +15,7 @@ The project differentiates between 3 levels of contributors:
- If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
- If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
- Create separate PRs for each feature or fix. Avoid combining unrelated changes in a single PR
+ - When adding support for a new model or feature, focus on **CPU support only** in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs
- Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
- If your PR becomes stale, rebase it on top of latest `master` to get maintainers attention
- Maintainers will rely on your insights and approval when making a final decision to approve and merge a PR
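As a minimal sketch of the consistency check mentioned in the guidance above (assuming a `build` directory and that `test-backend-ops` has been built, as the helper scripts in this PR do with `--target test-backend-ops`):

```bash
# Build the backend-ops test tool and run the full consistency suite;
# at least two ggml backends must be available for a cross-backend comparison.
cmake --build build --target test-backend-ops --parallel 8
./build/bin/test-backend-ops
```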
13 changes: 13 additions & 0 deletions OWNERS
@@ -0,0 +1,13 @@
approvers:
- kpouget
- cfergeau
- praveenkumar
- vyasgun
- gbraad
options: {}
reviewers:
- kpouget
- cfergeau
- praveenkumar
- vyasgun
- gbraad
13 changes: 0 additions & 13 deletions README.md
@@ -347,19 +347,6 @@ To learn more about model quantization, [read this documentation](tools/quantize

</details>

- - <details>
- <summary>Run simple text completion</summary>
-
- To disable conversation mode explicitly, use `-no-cnv`
-
- ```bash
- llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128 -no-cnv
-
- # I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
- ```
-
- </details>
-

- <details>
<summary>Constrain the output with a custom grammar</summary>

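The removed snippet demonstrated plain text completion with `llama-cli`; the CI changes in this PR move the equivalent invocations to `llama-completion`, so a hedged replacement (assuming `llama-completion` accepts the same flags, as the `ci/run.sh` hunk below suggests) would be:

```bash
# Plain completion without conversation mode, mirroring the removed example
llama-completion -m model.gguf -p "I believe the meaning of life is" -n 128 -no-cnv
```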
37 changes: 37 additions & 0 deletions build.backend.sh
@@ -0,0 +1,37 @@
# force isatty-->true, so that $0 |& head -50 has colors ...
rm -f READY_backend FAILED_backend

echo "int isatty(int fd) { return 1; }" | gcc -O2 -fpic -shared -ldl -o /tmp/isatty.so -xc -
export LD_PRELOAD=/tmp/isatty.so

if [[ "${PERF_MODE:-}" ]]; then
FLAVOR="-prod"
else
FLAVOR=""
fi

export SDKROOT=$(xcrun --sdk macosx --show-sdk-path)

if [[ "$FLAVOR" == "-prod" ]]; then
cat <<EOF
###
### Building the prod flavor
###
EOF
fi

TARGETS="llama-run"
if [[ "${BENCH_MODE:-}" == "bench" ]]; then
TARGETS="$TARGETS llama-bench"
elif [[ "${BENCH_MODE:-}" == "perf" ]]; then
TARGETS="$TARGETS test-backend-ops"
fi

cmake --build ../build.remoting-backend$FLAVOR --target $TARGETS "$@" --parallel 8

if [[ $? == 0 ]]; then
touch READY_backend
else
touch FAILED_backend
exit 1
fi
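A hedged usage sketch for the helper above, based only on the environment variables it reads (`PERF_MODE`, `BENCH_MODE`) and on the `../build.remoting-backend` directory it expects to exist:

```bash
# Default: build llama-run in ../build.remoting-backend
bash build.backend.sh

# Production flavor (../build.remoting-backend-prod) plus llama-bench
PERF_MODE=1 BENCH_MODE=bench bash build.backend.sh

# Also build test-backend-ops for backend-op performance runs
BENCH_MODE=perf bash build.backend.sh
```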
10 changes: 10 additions & 0 deletions build.linux.sh
@@ -0,0 +1,10 @@
rm -f READY FAILED

cmake --build ../build.vulkan-linux --parallel 8 --target llama-run llama-server

if [[ $? == 0 ]]; then
touch READY
else
touch FAILED
exit 1
fi
26 changes: 26 additions & 0 deletions build.remoting.sh
@@ -0,0 +1,26 @@
# force isatty-->true, so that $0 |& head -50 has colors ...
rm -f READY FAILED

echo "int isatty(int fd) { return 1; }" | gcc -O2 -fpic -shared -ldl -o /tmp/isatty.so -xc -
export LD_PRELOAD=/tmp/isatty.so

TARGETS="ggml-remotingfrontend"

TARGETS="$BUILD_TARGET llama-run"
set -x
if [[ "${BENCH_MODE:-}" == "bench" ]]; then
TARGETS="$TARGETS llama-bench"
elif [[ "${BENCH_MODE:-}" == "server" ]]; then
TARGETS="$TARGETS llama-server"
elif [[ "${BENCH_MODE:-}" == "perf" ]]; then
TARGETS="$TARGETS test-backend-ops"
fi

cmake --build ../build.remoting-frontend$FLAVOR --parallel 8 --target $TARGETS "$@"

if [[ $? == 0 ]]; then
touch READY
else
touch FAILED
exit 1
fi
1 change: 1 addition & 0 deletions build.sh
@@ -0,0 +1 @@
cmake --build ./build/ --parallel 8
10 changes: 10 additions & 0 deletions build.vulkan.sh
@@ -0,0 +1,10 @@
rm -f READY FAILED

cmake --build ../build.vulkan --parallel 8 --target llama-run

if [[ $? == 0 ]]; then
touch READY
else
touch FAILED
exit 1
fi
24 changes: 12 additions & 12 deletions ci/run.sh
@@ -398,18 +398,18 @@ function gg_run_qwen3_0_6b {
./bin/llama-quantize ${model_bf16} ${model_q5_k} q5_k $(nproc)
./bin/llama-quantize ${model_bf16} ${model_q6_k} q6_k $(nproc)

- (time ./bin/llama-cli -no-cnv --model ${model_f16} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
- (time ./bin/llama-cli -no-cnv --model ${model_bf16} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-bf16.log
- (time ./bin/llama-cli -no-cnv --model ${model_q8_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
- (time ./bin/llama-cli -no-cnv --model ${model_q4_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
- (time ./bin/llama-cli -no-cnv --model ${model_q4_1} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
- (time ./bin/llama-cli -no-cnv --model ${model_q5_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
- (time ./bin/llama-cli -no-cnv --model ${model_q5_1} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
- (time ./bin/llama-cli -no-cnv --model ${model_q2_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
- (time ./bin/llama-cli -no-cnv --model ${model_q3_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
- (time ./bin/llama-cli -no-cnv --model ${model_q4_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
- (time ./bin/llama-cli -no-cnv --model ${model_q5_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
- (time ./bin/llama-cli -no-cnv --model ${model_q6_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
+ (time ./bin/llama-completion -no-cnv --model ${model_f16} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
+ (time ./bin/llama-completion -no-cnv --model ${model_bf16} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-bf16.log
+ (time ./bin/llama-completion -no-cnv --model ${model_q8_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
+ (time ./bin/llama-completion -no-cnv --model ${model_q4_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
+ (time ./bin/llama-completion -no-cnv --model ${model_q4_1} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
+ (time ./bin/llama-completion -no-cnv --model ${model_q5_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
+ (time ./bin/llama-completion -no-cnv --model ${model_q5_1} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
+ (time ./bin/llama-completion -no-cnv --model ${model_q2_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
+ (time ./bin/llama-completion -no-cnv --model ${model_q3_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
+ (time ./bin/llama-completion -no-cnv --model ${model_q4_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
+ (time ./bin/llama-completion -no-cnv --model ${model_q5_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
+ (time ./bin/llama-completion -no-cnv --model ${model_q6_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log

(time ./bin/llama-perplexity --model ${model_f16} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2 ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
if [ -z ${GG_BUILD_NO_BF16} ]; then
2 changes: 2 additions & 0 deletions common/CMakeLists.txt
@@ -73,6 +73,8 @@ add_library(${TARGET} STATIC
ngram-cache.h
peg-parser.cpp
peg-parser.h
+ preset.cpp
+ preset.h
regex-partial.cpp
regex-partial.h
sampling.cpp