Merged
95 commits
1e15bfd
graph : fix stack-use-after-return (#14960)
ggerganov Jul 30, 2025
00131d6
tests : update for LLAMA_SET_ROWS=1 (#14961)
ggerganov Jul 30, 2025
92b8810
CUDA: skip masked KV slices for all FA kernels (#14924)
JohannesGaessler Jul 30, 2025
73a8e5c
vulkan : fix 32-bit builds (ggml/1313)
dg0yt Jul 30, 2025
e228de9
cmake : Fix BLAS link interface (ggml/1316)
dg0yt Jul 30, 2025
e32a4ec
sync : ggml
ggerganov Jul 30, 2025
ad4a700
HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and sh…
IMbackK Jul 30, 2025
41e78c5
server : add support for `embd_normalize` parameter (#14964)
danbev Jul 30, 2025
e9192be
quantize : fix using combined imatrix GGUFs (multiple datasets) (#14973)
EAddario Jul 30, 2025
6e67254
opencl: add `mul_mat_f32_f32_l4_lm` and `mul_mat_f16_f32_l4_lm` (#14809)
lhez Jul 30, 2025
66625a5
graph : reduce splits for recurrent and hybrid models (#14825)
compilade Jul 31, 2025
11490b3
CANN: Improve loading efficiency after converting weights to NZ forma…
hipudding Jul 31, 2025
8a4a856
Add LLaDA 8b Diffusion model (#14771)
am17an Jul 31, 2025
a9f77a8
server : add openai-style logit_bias support (#14946)
lukasstraub2 Jul 31, 2025
c1dacaa
llama : merge build_moe_ffn_from_probs function into build_moe_ffn (#…
wdl339 Jul 31, 2025
94933c8
server : implement universal assisted decoding (#12635)
g2mt Jul 31, 2025
36e5fe7
MODEL_TENSOR.SSM_DT_NORM is defined twice (#14991)
csabakecskemeti Jul 31, 2025
952a47f
mtmd : support MiniCPM-V 4.0 (#14983)
tc-mb Jul 31, 2025
e08a988
Vulkan: Fix minor debug mode issues (#14899)
0cc4m Jul 31, 2025
d6818d0
llama : allow other bufts when overriding to CPU, add --no-repack opt…
slaren Jul 31, 2025
7845240
Fix params bug in diffusion example (#14993)
am17an Jul 31, 2025
a06ed5f
llama : add simple option to enable CPU for MoE weights (--cpu-moe) (…
slaren Jul 31, 2025
daf2dd7
quantize : skip tensor override when in fallback mode (#14995)
EAddario Jul 31, 2025
484b209
compare-commits.sh: support both llama-bench and test-backend-ops (#1…
yeahdongcn Aug 1, 2025
2860d47
docker : add cann build pipeline (#14591)
diannaojiang Aug 1, 2025
ba42794
graph : fix equal_seq() check (#14986)
ggerganov Aug 1, 2025
baad948
ggml : Q2k interleaving implementation - x86/x64 SIMD (#14373)
Srihari-mcw Aug 1, 2025
1c872f7
opencl: add f16 for `add`, `sub`, `mul`, `div` (#14984)
lhez Aug 1, 2025
0f5ccd6
model : add hunyuan dense (#14878)
stevenkuang-tencent Aug 1, 2025
c76b420
vendor : update vendored copy of google/minja (#15011)
l-austenfeld Aug 1, 2025
9c35706
CUDA: fix MMQ nwarps for AMD with warp_size==32 (#15014)
JohannesGaessler Aug 1, 2025
a9f7541
vulkan: optimizations for direct convolution (#14933)
jeffbolznv Aug 2, 2025
f906275
server: enable token array inputs for OAI API (#15001)
JohannesGaessler Aug 2, 2025
339bd02
model : support Qwen3-Embedding (#15023)
iamlemec Aug 2, 2025
ec0b188
vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (#15015)
jeffbolznv Aug 2, 2025
3025b62
llama-bench: rename DB table name from test to llama_bench (#15003)
yeahdongcn Aug 2, 2025
4cb208c
vulkan: coopmat2 mul_mat optimizations (#14934)
jeffbolznv Aug 2, 2025
f738989
chat : fix multiple tool_calls on hermes-2-pro (#14962)
jhen0409 Aug 2, 2025
711d5e6
convert : fix Qwen3-Embedding pre-tokenizer hash (#15030)
iamlemec Aug 2, 2025
2bf3fbf
ci : check that pre-tokenizer hashes are up-to-date (#15032)
CISC Aug 2, 2025
15e92fd
cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (#15038)
ggerganov Aug 2, 2025
a4569c4
llama : enable LLAMA_SET_ROWS=1 by default (#14959)
ggerganov Aug 2, 2025
4fdea54
kv-cache : skip alignment of n_stream in kv-cache log msg [no ci] (#1…
danbev Aug 2, 2025
3303c19
cuda: make im2col a little faster (#15025)
leejet Aug 2, 2025
03d4698
CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (#15035)
JohannesGaessler Aug 2, 2025
5c0eb5e
opencl: fix adreno compiler detection logic (#15029)
lhez Aug 2, 2025
6c7a441
vulkan: Use coopmat2 for conv2d (#14982)
jeffbolznv Aug 3, 2025
83bc2f2
model : add text-only support for Kimi-VL (and find special tokens in…
gabriellarson Aug 3, 2025
97366dc
vocab : JetBrains Mellum pre-tokenizer (#15045)
csabakecskemeti Aug 3, 2025
11a3811
memory : handle kv_unified for hybrid models (#15050)
compilade Aug 3, 2025
0a2f549
imatrix : fix 3d activation handling for hybrid and recurrent models …
compilade Aug 3, 2025
d31192b
imatrix : use GGUF by default (#14842)
compilade Aug 3, 2025
5aa1105
vulkan: fix build when using glslang that does not support coopmat2 (…
jeffbolznv Aug 4, 2025
587d011
ggml: WebGPU backend host improvements and style fixing (#14978)
reeselevine Aug 4, 2025
2721257
quantize : fix confusing error message if ftype is invalid (#15071)
CISC Aug 4, 2025
ef0144c
model: support GLM 4.5 family of models (#14939)
sammcj Aug 4, 2025
e5bebe5
gguf-py : add --chat-template-file to gguf_new_metadata (#15075)
CISC Aug 4, 2025
4161343
cmake: Add GGML_BACKEND_DIR option (#15074)
ckastner Aug 4, 2025
19f68fa
imatrix : warn when GGUF imatrix is saved without .gguf suffix (#15076)
compilade Aug 4, 2025
ec428b0
llama : add --n-cpu-moe option (#15077)
slaren Aug 4, 2025
ee3a9fc
context : fix index overflow on huge outputs (#15080)
compilade Aug 5, 2025
22f060c
webui: fix markdown table (#15081)
dindinw Aug 5, 2025
c81de6e
Fix `glm4moe` bug (#15088)
jukofyork Aug 5, 2025
3306cea
sycl: fix mul_mat selection (#15092)
Rbiessy Aug 5, 2025
be42642
readme : update hot topics (#15097)
ggerganov Aug 5, 2025
f324a3b
chat : only remove double bos/eos if added (#15086)
CISC Aug 5, 2025
fd1234c
llama : add gpt-oss (#15091)
ggerganov Aug 5, 2025
9515c61
ggml: WebGPU disable SET_ROWS for now (#15078)
reeselevine Aug 5, 2025
2241453
CANN: add support for ACL Graph (#15065)
noemotiovon Aug 6, 2025
2572689
chat : fix hunyuan auto-detection (#15114)
stevenkuang-tencent Aug 6, 2025
65c797c
chat : fix yandex chat template (#15116)
CISC Aug 6, 2025
0d88315
ggml : fix fallback to CPU for unsupported ops (#15118)
slaren Aug 6, 2025
476aa3f
Fixed name `-override-tensors` to `-override-tensor` (#15129)
jukofyork Aug 6, 2025
3db4da5
chat : support Granite model reasoning and tool call (#14864)
smdesai Aug 6, 2025
e725a1a
opencl: add `swiglu_oai` and `add_id` (#15121)
lhez Aug 6, 2025
756cfea
fix profiling crash (#15072)
rmatif Aug 6, 2025
5fd160b
ggml: Add basic SET_ROWS support in WebGPU (#15137)
reeselevine Aug 6, 2025
36d3f00
requirements : fix PyTorch uint64 compatibility (#15134)
danbev Aug 7, 2025
20638e4
scripts: fix crash when --tool is not set (#15133)
JohannesGaessler Aug 7, 2025
1d72c84
CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (#15131)
JohannesGaessler Aug 7, 2025
9a96389
ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094)
ckastner Aug 7, 2025
7ad67ba
HIP: add cmake option to enable compiler output of kernel resource us…
IMbackK Aug 7, 2025
99acbc9
llama : Support intern-s1 (#14875)
RunningLeon Aug 7, 2025
a0552c8
vulkan: Add env var to disable host visible vidmem (#15109)
jeffbolznv Aug 7, 2025
c4f5356
vulkan: support fattn sinks (#15126)
jeffbolznv Aug 7, 2025
50aa938
convert : support non-mxfp4 HF model (#15153)
ngxson Aug 7, 2025
aaa3d07
opencl: support sink in `soft_max` (attn sinks) (#15152)
lhez Aug 8, 2025
1425f58
CUDA: attention sinks for mma FlashAttention (#15157)
JohannesGaessler Aug 8, 2025
6c7e9a5
vendor: sync minja (#15161)
ochafik Aug 8, 2025
cd6983d
ggml : fix field name when new ggml_backend (#14944)
aisk Aug 8, 2025
4850b52
server-bench: external OAI servers, sqlite (#15179)
JohannesGaessler Aug 8, 2025
e54d41b
gguf-py : add Numpy MXFP4 de/quantization support (#15111)
compilade Aug 8, 2025
34c9d76
CUDA: add attention sinks for tile and wmma (#15178)
am17an Aug 9, 2025
79c1160
cuda: refactored ssm_scan and use CUB (#13291)
Your-Cheese Aug 9, 2025
6b40f52
Merge branch 'layla-build' into merge
l3utterfly Aug 11, 2025
130 changes: 130 additions & 0 deletions .devops/cann.Dockerfile
@@ -0,0 +1,130 @@
# ==============================================================================
# ARGUMENTS
# ==============================================================================

# Define the CANN base image for easier version updates later
ARG CANN_BASE_IMAGE=quay.io/ascend/cann:8.1.rc1-910b-openeuler22.03-py3.10

# ==============================================================================
# BUILD STAGE
# Compile all binary files and libraries
# ==============================================================================
FROM ${CANN_BASE_IMAGE} AS build

# Define the Ascend chip model for compilation. Default is Ascend910B3
ARG ASCEND_SOC_TYPE=Ascend910B3

# -- Install build dependencies --
RUN yum install -y gcc g++ cmake make git libcurl-devel python3 python3-pip && \
yum clean all && \
rm -rf /var/cache/yum

# -- Set the working directory --
WORKDIR /app

# -- Copy project files --
COPY . .

# -- Set CANN environment variables (required for compilation) --
# Using ENV instead of `source` in a RUN step makes these variables persist in all later layers and at runtime
ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${LD_LIBRARY_PATH}
ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${PATH}
ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/runtime/lib64/stub:$LD_LIBRARY_PATH
# Only the core CANN variables are set here; additional variables from the
# original environment setup can be added as needed.

# -- Build llama.cpp --
# Use the passed ASCEND_SOC_TYPE argument and add general build options
RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh --force \
&& \
cmake -B build \
-DGGML_CANN=ON \
-DCMAKE_BUILD_TYPE=Release \
-DSOC_TYPE=${ASCEND_SOC_TYPE} \
. && \
cmake --build build --config Release -j$(nproc)

# -- Organize build artifacts for copying in later stages --
# Create a lib directory to store all .so files
RUN mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;

# Create a full directory to store all executables and Python scripts
RUN mkdir -p /app/full && \
cp build/bin/* /app/full/ && \
cp *.py /app/full/ && \
cp -r gguf-py /app/full/ && \
cp -r requirements /app/full/ && \
cp requirements.txt /app/full/
# If you have a tools.sh script, make sure it is copied here
# cp .devops/tools.sh /app/full/tools.sh

# ==============================================================================
# BASE STAGE
# Create a minimal base image with CANN runtime and common libraries
# ==============================================================================
FROM ${CANN_BASE_IMAGE} AS base

# -- Install runtime dependencies --
RUN yum install -y libgomp curl && \
yum clean all && \
rm -rf /var/cache/yum

# -- Set CANN environment variables (required for runtime) --
ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ENV LD_LIBRARY_PATH=/app:${ASCEND_TOOLKIT_HOME}/lib64:${LD_LIBRARY_PATH}
ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${PATH}
ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
# Additional runtime environment variables can be added here as needed.

WORKDIR /app

# Copy compiled .so files from the build stage
COPY --from=build /app/lib/ /app

# ==============================================================================
# FINAL STAGES (TARGETS)
# ==============================================================================

### Target: full
# Complete image with all tools, Python bindings, and dependencies
# ==============================================================================
FROM base AS full

COPY --from=build /app/full /app

# Install Python dependencies
RUN yum install -y git python3 python3-pip && \
pip3 install --no-cache-dir --upgrade pip setuptools wheel && \
pip3 install --no-cache-dir -r requirements.txt && \
yum clean all && \
rm -rf /var/cache/yum

# A tools.sh script must be provided (and copied in the build stage) to serve as the entrypoint
ENTRYPOINT ["/app/tools.sh"]
# Without a tools.sh, default to starting the server instead:
# ENTRYPOINT ["/app/llama-server"]

### Target: light
# Lightweight image containing only llama-cli
# ==============================================================================
FROM base AS light

COPY --from=build /app/full/llama-cli /app

ENTRYPOINT [ "/app/llama-cli" ]

### Target: server
# Dedicated server image containing only llama-server
# ==============================================================================
FROM base AS server

ENV LLAMA_ARG_HOST=0.0.0.0

COPY --from=build /app/full/llama-server /app

HEALTHCHECK --interval=5m CMD [ "curl", "-f", "http://localhost:8080/health" ]

ENTRYPOINT [ "/app/llama-server" ]
64 changes: 16 additions & 48 deletions .github/workflows/build.yml
@@ -159,31 +159,15 @@ jobs:
- name: Dawn Dependency
id: dawn-depends
run: |
- ARTIFACTS_JSON=$(curl -s -L \
- -H "Accept: application/vnd.github+json" \
- -H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" \
- -H "X-GitHub-Api-Version: 2022-11-28" \
- "https://api.github.com/repos/google/dawn/actions/artifacts")
- echo "Finding latest macos-latest-Release artifact..."
- DOWNLOAD_URL=$(echo "$ARTIFACTS_JSON" | jq -r '.artifacts
- | sort_by(.created_at)
- | reverse
- | map(select(.name | test("macos-latest-Release$")))
- | .[0].archive_download_url')
- if [ "$DOWNLOAD_URL" = "null" ] || [ -z "$DOWNLOAD_URL" ]; then
- echo "No suitable Dawn artifact found!"
- exit 1
- fi
- echo "Downloading from: $DOWNLOAD_URL"
- curl -L \
- -H "Accept: application/vnd.github+json" \
- -H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" \
- -o artifact.zip "$DOWNLOAD_URL"
- unzip artifact.zip
+ DAWN_VERSION="v1.0.0"
+ DAWN_OWNER="reeselevine"
+ DAWN_REPO="dawn"
+ DAWN_ASSET_NAME="Dawn-a1a6b45cced25a3b7f4fb491e0ae70796cc7f22b-macos-latest-Release.tar.gz"
+ echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}"
+ curl -L -o artifact.tar.gz \
+ "https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}"
mkdir dawn
- tar_file=$(find . -name '*.tar.gz' | head -n 1)
- echo "Extracting: $tar_file"
- tar -xvf "$tar_file" -C dawn --strip-components=1
+ tar -xvf artifact.tar.gz -C dawn --strip-components=1

- name: Build
id: cmake_build
@@ -433,31 +417,15 @@ jobs:
id: dawn-depends
run: |
sudo apt-get install -y libxrandr-dev libxinerama-dev libxcursor-dev mesa-common-dev libx11-xcb-dev libxi-dev
- ARTIFACTS_JSON=$(curl -s -L \
- -H "Accept: application/vnd.github+json" \
- -H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" \
- -H "X-GitHub-Api-Version: 2022-11-28" \
- "https://api.github.com/repos/google/dawn/actions/artifacts")
- echo "Finding latest ubuntu-latest-Release artifact..."
- DOWNLOAD_URL=$(echo "$ARTIFACTS_JSON" | jq -r '.artifacts
- | sort_by(.created_at)
- | reverse
- | map(select(.name | test("ubuntu-latest-Release$")))
- | .[0].archive_download_url')
- if [ "$DOWNLOAD_URL" = "null" ] || [ -z "$DOWNLOAD_URL" ]; then
- echo "No suitable Dawn artifact found!"
- exit 1
- fi
- echo "Downloading from: $DOWNLOAD_URL"
- curl -L \
- -H "Accept: application/vnd.github+json" \
- -H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" \
- -o artifact.zip "$DOWNLOAD_URL"
- unzip artifact.zip
+ DAWN_VERSION="v1.0.0"
+ DAWN_OWNER="reeselevine"
+ DAWN_REPO="dawn"
+ DAWN_ASSET_NAME="Dawn-a1a6b45cced25a3b7f4fb491e0ae70796cc7f22b-ubuntu-latest-Release.tar.gz"
+ echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}"
+ curl -L -o artifact.tar.gz \
+ "https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}"
mkdir dawn
- tar_file=$(find . -name '*.tar.gz' | head -n 1)
- echo "Extracting: $tar_file"
- tar -xvf "$tar_file" -C dawn --strip-components=1
+ tar -xvf artifact.tar.gz -C dawn --strip-components=1

- name: Build
id: cmake_build
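Both hunks above replace the GitHub-API artifact lookup, which required an API token and always picked the most recently created artifact, with a download of a fixed release asset from reeselevine/dawn, so every CI run fetches the same Dawn build. If stronger pinning were wanted, a checksum check could follow the download; a sketch under the assumption that the digest is computed once and hard-coded (the value below is a placeholder, not the asset's real hash):

```bash
# Placeholder digest: compute the real one once with `sha256sum artifact.tar.gz`
EXPECTED_SHA256="0000000000000000000000000000000000000000000000000000000000000000"
echo "${EXPECTED_SHA256}  artifact.tar.gz" | sha256sum -c - || {
  echo "Dawn asset checksum mismatch!"
  exit 1
}
```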
45 changes: 45 additions & 0 deletions .github/workflows/pre-tokenizer-hashes.yml
@@ -0,0 +1,45 @@
name: Check Pre-Tokenizer Hashes

on:
push:
paths:
- 'convert_hf_to_gguf.py'
- 'convert_hf_to_gguf_update.py'
pull_request:
paths:
- 'convert_hf_to_gguf.py'
- 'convert_hf_to_gguf_update.py'

jobs:
pre-tokenizer-hashes:
runs-on: ubuntu-latest

steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'

- name: Install Python dependencies
run: |
python3 -m venv .venv
.venv/bin/pip install -r requirements/requirements-convert_hf_to_gguf_update.txt

- name: Update pre-tokenizer hashes
run: |
cp convert_hf_to_gguf.py /tmp
.venv/bin/python convert_hf_to_gguf_update.py --check-missing

- name: Check that committed pre-tokenizer hashes match the generated version
run: |
if ! diff -q convert_hf_to_gguf.py /tmp/convert_hf_to_gguf.py; then
echo "Model pre-tokenizer hashes (in convert_hf_to_gguf.py) do not match generated hashes (from convert_hf_to_gguf_update.py)."
echo "To fix: run ./convert_hf_to_gguf_update.py and commit the updated convert_hf_to_gguf.py along with your changes"
echo "Differences found:"
diff convert_hf_to_gguf.py /tmp/convert_hf_to_gguf.py || true
exit 1
fi
echo "Model pre-tokenizer hashes are up to date."
1 change: 1 addition & 0 deletions README.md
@@ -17,6 +17,7 @@ LLM inference in C/C++

## Hot topics

+ - Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095)
- Hot PRs: [All](https://github.com/ggml-org/llama.cpp/pulls?q=is%3Apr+label%3Ahot+) | [Open](https://github.com/ggml-org/llama.cpp/pulls?q=is%3Apr+label%3Ahot+is%3Aopen)
- Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md)
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode