45 commits
e086c5e
docs: update s390x document for sentencepiece
taronaeo Jul 21, 2025
8410b08
docs: update huggingface links + reword
taronaeo Jul 21, 2025
a2cdf55
vulkan/cuda: Fix im2col when KW!=KH (#14789)
jeffbolznv Jul 21, 2025
ae77ded
docs : fix backends table in README.md (#14796)
rgerganov Jul 21, 2025
549f9eb
kleidiai: add support for get_rows (#14676)
chaxu01 Jul 21, 2025
f04095b
sycl: Fix im2col (#14797)
Rbiessy Jul 21, 2025
120add9
opencl: add conv2d kernel (#14403)
rmatif Jul 21, 2025
e77f241
opencl: fix `im2col` when `KW!=KH` (#14803)
CISC Jul 21, 2025
9e500e2
cuda: remove linking to cublasLt (#14790)
yeahdongcn Jul 21, 2025
0dd3cd5
server : allow setting `--reverse-prompt` arg (#14799)
MollySophia Jul 22, 2025
1e54562
opencl: remove unreachable `return` (#14806)
lhez Jul 22, 2025
4c94f27
cuda : implement bf16 cpy ops and enable bf16 cont (#14763)
CISC Jul 22, 2025
888b75b
Mtmd: add a way to select device for vision encoder (#14236)
stduhpf Jul 22, 2025
45fc00e
imatrix: add option to display importance score statistics for a give…
EAddario Jul 22, 2025
10a6765
llama : add model type detection for rwkv7 7B&14B (#14816)
MollySophia Jul 22, 2025
44d4801
vulkan: fix rms_norm_mul to handle broadcasting dim0 (#14817)
jeffbolznv Jul 22, 2025
9b51256
ggml : model card yaml tab->2xspace (#14819)
csabakecskemeti Jul 22, 2025
1e55890
CUDA: add fused rms norm (#14800)
am17an Jul 23, 2025
ef6198b
CANN: weight format to NZ for Ascend310P3 (#14407)
tqgy6 Jul 23, 2025
bd3c22a
ggml: fix loongarch quantize_row_q8_1 error (#14827)
lixing-star Jul 23, 2025
e0f2615
memory : handle saving/loading null layers in recurrent memory (#14675)
l3utterfly Jul 23, 2025
90916df
tests : add non-cont K,V FA tests
ggerganov Jul 18, 2025
7473a0d
CUDA: fix quantized KV cache + multiple sequences (#14822)
JohannesGaessler Jul 23, 2025
a3ddddb
ci : correct label refactor->refactoring (#14832)
CISC Jul 23, 2025
9db975e
CUDA: fix compilation with GGML_CUDA_F16 (#14837)
JohannesGaessler Jul 23, 2025
5ad021f
CUDA: fix overflow in FA, tune performance (#14840)
JohannesGaessler Jul 23, 2025
bd060d6
convert : text-only support for GLM-4.1V-9B-Thinking (#14823)
jacekpoplawski Jul 23, 2025
7234b89
sycl: fix undefined variable in work group size check (#14843)
djeong20 Jul 24, 2025
e84b911
metal : fix fusion across different encoders (#14849)
ggerganov Jul 24, 2025
63b420b
docs: add libcurl-dev install hint for Linux distros (#14801)
PouyaGhahramanian Jul 24, 2025
6286ad2
llama : fix MiniCPM inference after Granite Four changes (#14850)
jk3456a Jul 24, 2025
07a4930
sycl: fixed semantics of block offset calculation (#14814)
Alcpz Jul 24, 2025
c1d4ffc
chat : fix kimi-k2 chat template (#14852)
ngxson Jul 24, 2025
7c5ca60
context : perform output reorder lazily upon access after sync (#14853)
ggerganov Jul 24, 2025
4601f39
ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
danbev Jul 21, 2025
7902541
cmake : fix usage issues (ggml/1257)
dg0yt Jul 22, 2025
45c2cc3
sync : ggml
ggerganov Jul 24, 2025
caaebfe
musa: upgrade musa sdk to rc4.2.0 (#14498)
yeahdongcn Jul 24, 2025
a122095
sched : fix multiple evaluations of the same graph with pipeline para…
slaren Jul 25, 2025
328ed53
rpc : check for null buffers in get/set/copy tensor endpoints (#14868)
struct Jul 25, 2025
092c1bd
mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (#14503)
kiwi142857 Jul 25, 2025
a6357ac
context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (#14870)
ggerganov Jul 25, 2025
2177ccd
ggml : remove invalid portPos specifiers from dot files (#14838)
ORippler Jul 25, 2025
412f4c7
ggml-cpu: disable ggml-nnpa compile flag by default
taronaeo Jul 25, 2025
c1eeae1
docs: update s390x build docs to reflect nnpa disable
taronaeo Jul 25, 2025
6 changes: 3 additions & 3 deletions .devops/musa.Dockerfile
@@ -1,10 +1,10 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG MUSA_VERSION=rc4.0.1
ARG MUSA_VERSION=rc4.2.0
# Target the MUSA build image
ARG BASE_MUSA_DEV_CONTAINER=mthreads/musa:${MUSA_VERSION}-mudnn-devel-ubuntu${UBUNTU_VERSION}
ARG BASE_MUSA_DEV_CONTAINER=mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_VERSION}-amd64

ARG BASE_MUSA_RUN_CONTAINER=mthreads/musa:${MUSA_VERSION}-mudnn-runtime-ubuntu${UBUNTU_VERSION}
ARG BASE_MUSA_RUN_CONTAINER=mthreads/musa:${MUSA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}-amd64

FROM ${BASE_MUSA_DEV_CONTAINER} AS build

2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -515,7 +515,7 @@ jobs:

ubuntu-22-cmake-musa:
runs-on: ubuntu-22.04
container: mthreads/musa:rc4.0.1-mudnn-devel-ubuntu22.04
container: mthreads/musa:rc4.2.0-devel-ubuntu22.04-amd64

steps:
- name: Clone
2 changes: 1 addition & 1 deletion .github/workflows/close-issue.yml
@@ -17,7 +17,7 @@ jobs:
steps:
- uses: actions/stale@v5
with:
exempt-issue-labels: "refactor,help wanted,good first issue,research,bug,roadmap"
exempt-issue-labels: "refactoring,help wanted,good first issue,research,bug,roadmap"
days-before-issue-stale: 30
days-before-issue-close: 14
stale-issue-label: "stale"
1 change: 0 additions & 1 deletion README.md
@@ -270,7 +270,6 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
| [CANN](docs/build.md#cann) | Ascend NPU |
| [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |
| [WebGPU [In Progress]](docs/build.md#webgpu) | All |

| [RPC](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) | All |

## Obtaining and quantizing models
2 changes: 1 addition & 1 deletion ci/README.md
@@ -54,7 +54,7 @@ docker run --privileged -it \
-v $HOME/llama.cpp/ci-cache:/ci-cache \
-v $HOME/llama.cpp/ci-results:/ci-results \
-v $PWD:/ws -w /ws \
mthreads/musa:rc4.0.1-mudnn-devel-ubuntu22.04
mthreads/musa:rc4.2.0-devel-ubuntu22.04-amd64
```

Inside the container, execute the following commands:
9 changes: 8 additions & 1 deletion common/arg.cpp
@@ -1612,7 +1612,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, const std::string & value) {
params.antiprompt.emplace_back(value);
}
).set_examples({LLAMA_EXAMPLE_MAIN}));
).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER}));
add_opt(common_arg(
{"-sp", "--special"},
string_format("special tokens output enabled (default: %s)", params.special ? "true" : "false"),
@@ -2655,6 +2655,13 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.i_chunk = value;
}
).set_examples({LLAMA_EXAMPLE_IMATRIX}));
add_opt(common_arg(
{"--show-statistics"},
string_format("show imatrix statistics and then exit (default: %s)", params.show_statistics ? "true" : "false"),
[](common_params & params) {
params.show_statistics = true;
}
).set_examples({LLAMA_EXAMPLE_IMATRIX}));
add_opt(common_arg(
{"--parse-special"},
string_format("prase special tokens (chat, tool, etc) (default: %s)", params.parse_special ? "true" : "false"),
7 changes: 4 additions & 3 deletions common/common.h
@@ -432,9 +432,10 @@ struct common_params {
int32_t n_save_freq = 0; // save the imatrix every n_save_freq iterations
int32_t i_chunk = 0; // start processing from this chunk

bool process_output = false; // collect data for the output tensor
bool compute_ppl = true; // whether to compute perplexity
bool parse_special = false; // whether to parse special tokens during imatrix tokenization
bool process_output = false; // collect data for the output tensor
bool compute_ppl = true; // whether to compute perplexity
bool show_statistics = false; // show imatrix statistics per tensor
bool parse_special = false; // whether to parse special tokens during imatrix tokenization

// cvector-generator params
int n_pca_batch = 100;
12 changes: 10 additions & 2 deletions convert_hf_to_gguf.py
@@ -6486,7 +6486,7 @@ def prepare_tensors(self):
self.gguf_writer.add_max_alibi_bias(self.max_alibi_bias)


@ModelBase.register("Glm4ForCausalLM")
@ModelBase.register("Glm4ForCausalLM", "Glm4vForConditionalGeneration")
class Glm4Model(TextModel):
model_arch = gguf.MODEL_ARCH.GLM4

@@ -6508,14 +6508,22 @@ def set_vocab(self):

def set_gguf_parameters(self):
super().set_gguf_parameters()
rope_dim = self.hparams["head_dim"]
if (rope_dim := self.hparams.get("head_dim")) is None:
rope_dim = self.hparams["hidden_size"] // self.hparams["num_attention_heads"]
self.gguf_writer.add_rope_dimension_count(int(rope_dim * self.hparams.get("partial_rotary_factor", 0.5)))
rope_scaling = self.hparams.get("rope_scaling") or {}
if rope_scaling.get("rope_type", rope_scaling.get("type")) == "yarn" and "factor" in rope_scaling:
self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.YARN)
self.gguf_writer.add_rope_scaling_factor(rope_scaling["factor"])
self.gguf_writer.add_rope_scaling_orig_ctx_len(rope_scaling["original_max_position_embeddings"])

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name.startswith("model.visual."): # ignore visual part of Glm4v
return []
elif name.startswith("model.language_model."):
name = name.replace("language_model.", "") # for Glm4v
return super().modify_tensors(data_torch, name, bid)


@ModelBase.register("GlmForCausalLM", "ChatGLMModel", "ChatGLMForConditionalGeneration")
class ChatGLMModel(TextModel):
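A conversion sketch for the text-only GLM-4.1V support registered above (the output name and local path are placeholders; the visual tensors are skipped by `modify_tensors`):

```bash
# Hypothetical run: convert only the language model of GLM-4.1V-9B-Thinking to GGUF.
python3 convert_hf_to_gguf.py \
    --outfile glm-4.1v-9b-thinking-f16.gguf \
    --outtype f16 \
    ./GLM-4.1V-9B-Thinking    # local snapshot of the Hugging Face repo
```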
46 changes: 38 additions & 8 deletions docs/build-s390x.md
@@ -42,14 +42,14 @@ cmake --build build --config Release -j $(nproc)
cmake --build build --config Release -j $(nproc)
```

- By default, NNPA is enabled when available. To disable it (not recommended):
- NNPA is disabled by default. To enable it:

```bash
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DGGML_NNPA=OFF
-DGGML_NNPA=ON

cmake --build build --config Release -j $(nproc)
```
@@ -84,16 +84,24 @@ All models need to be converted to Big-Endian. You can achieve this in three cas

![File Type - gguf](https://img.shields.io/badge/File_Type-gguf-fff)

You can find popular models pre-converted and verified at [s390x Ready Models](https://huggingface.co/collections/taronaeo/s390x-ready-models-672765393af438d0ccb72a08).
You can find popular models pre-converted and verified at [s390x Verified Models](https://huggingface.co/collections/taronaeo/s390x-verified-models-672765393af438d0ccb72a08) or [s390x Runnable Models](https://huggingface.co/collections/taronaeo/s390x-runnable-models-686e951824198df12416017e).

These models have already been converted from `safetensors` to `GGUF Big-Endian` and their respective tokenizers verified to run correctly on IBM z15 and later system.
These models have already been converted from `safetensors` to `GGUF` Big-Endian, and their respective tokenizers have been verified to run correctly on IBM z15 and later systems.

2. **Convert safetensors model to GGUF Big-Endian directly (recommended)**

![File Type - safetensors](https://img.shields.io/badge/File_Type-safetensors-da1e28)

The model you are trying to convert must be in `safetensors` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case.

Ensure that you have installed the required packages in advance

```bash
pip3 install -r requirements.txt
```

Convert the `safetensors` model to `GGUF`

```bash
python3 convert_hf_to_gguf.py \
--outfile model-name-be.f16.gguf \
@@ -116,7 +124,7 @@ All models need to be converted to Big-Endian. You can achieve this in three cas

![File Type - gguf](https://img.shields.io/badge/File_Type-gguf-fff)

The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.
The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B GGUF](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.

```bash
python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
@@ -141,15 +149,15 @@ Only available in IBM z15 or later system with the `-DGGML_VXE=ON` (turned on by

### 2. NNPA Vector Intrinsics Acceleration

Only available in IBM z16 or later system with the `-DGGML_NNPA=ON` (turned on when available) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs can still run but will use a scalar implementation.
Only available in IBM z16 or later system with the `-DGGML_NNPA=ON` (turned off by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs can still run but will use a scalar implementation.

### 3. zDNN Accelerator

_Only available in IBM z16 or later system. No direction at the moment._
_Only available in IBM z16 / LinuxONE 4 or later system. No support currently available._

### 4. Spyre Accelerator

_No direction at the moment._
_Only available with IBM z17 / LinuxONE 5 or later system. No support currently available._

## Performance Tuning

@@ -189,6 +197,26 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl

Answer: Please ensure that your GCC compiler is of minimum GCC 15.1.0 version, and have `binutils` updated to the latest version. If this does not fix the problem, kindly open an issue.

4. Failing to install the `sentencepiece` package using GCC 15+

Answer: The `sentencepiece` team are aware of this as seen in [this issue](https://github.com/google/sentencepiece/issues/1108).

As a temporary workaround, please run the installation command with the following environment variables.

```bash
export CXXFLAGS="-include cstdint"
```

For example,

```bash
CXXFLAGS="-include cstdint" pip3 install -r requirements.txt
```

5. `-DGGML_NNPA=ON` generates gibberish output

Answer: We are aware of this as detailed in [this issue](https://github.com/ggml-org/llama.cpp/issues/14877). Please either try reducing the number of threads, or disable the compile option using `-DGGML_NNPA=OFF`.

## Getting Help on IBM Z & LinuxONE

1. **Bugs, Feature Requests**
@@ -244,3 +272,5 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
- ✅ - acceleration available
- 🚫 - acceleration unavailable, will still run using scalar implementation
- ❓ - acceleration unknown, please contribute if you can test it yourself

Last Updated by **Aaron Teo ([email protected])** on July 25, 2025.
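Taken together, a minimal end-to-end sketch of the s390x flow documented above (model names are placeholders; `--bigendian` is assumed to be the converter's byte-order switch):

```bash
# Install conversion requirements, using the GCC 15+ workaround from FAQ 4:
CXXFLAGS="-include cstdint" pip3 install -r requirements.txt

# Convert a safetensors checkpoint straight to GGUF Big-Endian:
python3 convert_hf_to_gguf.py \
    --outfile granite-3.3-2b-instruct-be.f16.gguf \
    --outtype f16 \
    --bigendian \
    ./granite-3.3-2b-instruct

# Run with the default VXE build (NNPA stays off by default, see FAQ 5):
./build/bin/llama-cli -m granite-3.3-2b-instruct-be.f16.gguf -p "Hello"
```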
3 changes: 3 additions & 0 deletions docs/build.md
@@ -68,6 +68,9 @@ cmake --build build --config Release
cmake --build build-x64-windows-llvm-release
```
- Curl usage is enabled by default and can be turned off with `-DLLAMA_CURL=OFF`. Otherwise you need to install development libraries for libcurl.
- **Debian / Ubuntu:** `sudo apt-get install libcurl4-openssl-dev` # (or `libcurl4-gnutls-dev` if you prefer GnuTLS)
- **Fedora / RHEL / Rocky / Alma:** `sudo dnf install libcurl-devel`
- **Arch / Manjaro:** `sudo pacman -S curl` # includes libcurl headers

## BLAS Build

2 changes: 1 addition & 1 deletion docs/docker.md
@@ -110,7 +110,7 @@ You may want to pass in some different `ARGS`, depending on the MUSA environment

The defaults are:

- `MUSA_VERSION` set to `rc4.0.1`
- `MUSA_VERSION` set to `rc4.2.0`
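For illustration only (not part of the change), overriding that default when building the MUSA image could look like this; the image tag is arbitrary:

```bash
docker build -t local/llama.cpp:musa \
    --build-arg MUSA_VERSION=rc4.2.0 \
    -f .devops/musa.Dockerfile .
```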

The resulting images, are essentially the same as the non-MUSA images:

4 changes: 3 additions & 1 deletion ggml/CMakeLists.txt
@@ -131,7 +131,7 @@ option(GGML_RVV "ggml: enable rvv" ON)
option(GGML_RV_ZFH "ggml: enable riscv zfh" OFF)
option(GGML_XTHEADVECTOR "ggml: enable xtheadvector" OFF)
option(GGML_VXE "ggml: enable vxe" ON)
option(GGML_NNPA "ggml: enable nnpa" ON)
option(GGML_NNPA "ggml: enable nnpa" OFF) # temp disabled by default, see: https://github.com/ggml-org/llama.cpp/issues/14877

option(GGML_CPU_ALL_VARIANTS "ggml: build all variants of the CPU backend (requires GGML_BACKEND_DL)" OFF)
set(GGML_CPU_ARM_ARCH "" CACHE STRING "ggml: CPU architecture for ARM")
@@ -174,6 +174,8 @@ option(GGML_HIP_GRAPHS "ggml: use HIP graph, experimental,
option(GGML_HIP_NO_VMM "ggml: do not try to use HIP VMM" ON)
option(GGML_HIP_ROCWMMA_FATTN "ggml: enable rocWMMA for FlashAttention" OFF)
option(GGML_HIP_FORCE_ROCWMMA_FATTN_GFX12 "ggml: enable rocWMMA FlashAttention on GFX12" OFF)
option(GGML_MUSA_GRAPHS "ggml: use MUSA graph, experimental, unstable" OFF)
option(GGML_MUSA_MUDNN_COPY "ggml: enable muDNN for accelerated copy" OFF)
option(GGML_VULKAN "ggml: use Vulkan" OFF)
option(GGML_VULKAN_CHECK_RESULTS "ggml: run Vulkan op checks" OFF)
option(GGML_VULKAN_DEBUG "ggml: enable Vulkan debug output" OFF)
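As a configuration sketch (the `GGML_MUSA` toggle is assumed to exist already; the other flags appear in this diff), the options touched here might be exercised like so:

```bash
# Opt back into NNPA, which is now OFF by default, on s390x:
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NNPA=ON

# Try the experimental MUSA graph path and muDNN-accelerated copies:
cmake -B build-musa -DGGML_MUSA=ON -DGGML_MUSA_GRAPHS=ON -DGGML_MUSA_MUDNN_COPY=ON
```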