
Conversation

@taronaeo (Collaborator) commented on Aug 21, 2025

This pull request adds SIMD implementations for the Q5_0 and Q5_1 quantisation formats on the s390x platform. At best, Q5_0 sees performance improvements of 38.42% in prompt processing and 156.17% in token generation; Q5_1 sees 38.40% and 146.84% respectively.

Before SIMD Benchmark

| model | size | params | backend | threads | test | t/s |
| --------------- | -------- | ------ | ------- | ------: | ----- | ------------: |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | pp512 | 55.42 ± 0.13 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | tg128 | 2.52 ± 0.00 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 16 | pp512 | 100.22 ± 0.07 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 16 | tg128 | 4.83 ± 0.00 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | pp512 | 55.40 ± 0.06 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | tg128 | 2.99 ± 0.00 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 16 | pp512 | 100.72 ± 0.16 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 16 | tg128 | 5.66 ± 0.00 |

build: fd8f4a2 (6226)

After SIMD Benchmark

| model | size | params | backend | threads | test | t/s |
| --------------- | -------- | ------ | ------- | ------: | ----- | ------------: |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | pp512 | 81.78 ± 0.08 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | tg128 | 20.48 ± 0.02 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 16 | pp512 | 138.53 ± 0.21 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 16 | tg128 | 30.77 ± 0.13 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | pp512 | 81.73 ± 0.03 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | tg128 | 19.51 ± 0.03 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 16 | pp512 | 139.85 ± 0.16 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 16 | tg128 | 29.03 ± 0.11 |

build: fd8f4a2 (6226)
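
For context, the new SIMD kernels vectorize the Q5_0 bit layout: each block packs 32 quants as a scale, sixteen bytes of low nibbles, and four bytes holding each quant's 5th bit. A scalar reference sketch of the dequantization follows (simplified: ggml stores the scale as fp16, a plain float is used here, and the struct name `block_q5_0_ref` is invented for this sketch):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QK5_0 32

/* Simplified Q5_0 block: ggml stores the scale `d` as fp16 (ggml_half);
 * a plain float is used here to keep the sketch self-contained. */
typedef struct {
    float   d;              /* scale (delta)                           */
    uint8_t qh[4];          /* 5th (high) bit of each of the 32 quants */
    uint8_t qs[QK5_0 / 2];  /* low 4 bits, two quants packed per byte  */
} block_q5_0_ref;

/* Scalar reference dequantization: each 5-bit quant q in [0, 31] maps to
 * d * (q - 16). The SIMD kernels vectorize this bit extraction across lanes. */
static void dequantize_q5_0_ref(const block_q5_0_ref *x, float *y) {
    uint32_t qh;
    memcpy(&qh, x->qh, sizeof(qh));
    for (int j = 0; j < QK5_0 / 2; ++j) {
        const uint8_t xh_0 = ((qh >> j) << 4) & 0x10;  /* 5th bit, low nibble  */
        const uint8_t xh_1 = (qh >> (j + 12)) & 0x10;  /* 5th bit, high nibble */
        const int32_t x0 = ((x->qs[j] & 0x0F) | xh_0) - 16;
        const int32_t x1 = ((x->qs[j] >>   4) | xh_1) - 16;
        y[j]             = x0 * x->d;
        y[j + QK5_0 / 2] = x1 * x->d;
    }
}
```

Q5_1 differs only in that it carries an additional offset `m` and maps the unsigned quant as `d * q + m` instead of centring around 16.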

Verification

To ensure that this implementation does not break anything, the SIMD kernels have been tested on the following models:

- Tested Granite 3.3 2B Instruct Big-Endian (Q5_0, Q5_1)
- Kindly requesting that other models be tested as well

> **Note**
> Tests were conducted on an IBM z17 mainframe with 40 IFLs (cores) and 128 GB of memory on a shared R&D LPAR.

Please review this pull request and consider merging into the main repository. Thank you!

@github-actions bot added the `documentation` (Improvements or additions to documentation) and `ggml` (changes relating to the ggml tensor library for machine learning) labels on Aug 21, 2025
@taronaeo merged commit ad5c975 into ggml-org:master on Aug 22, 2025 (88 of 89 checks passed)
qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 25, 2025
* ggml-cpu: initial q5_0 impl for s390x

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: updated q5_0 code for better performance

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: use optimised hsum for better performance

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: introduce q5_1 simd + refactor q5_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix incorrect return type vec_hsum

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: refactor q5_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: q5_1 update loop unroll to 4

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: update q5_0 unroll to 4

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: update build-s390x docs

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: update unused variables q5_0

Signed-off-by: Aaron Teo <[email protected]>

* docs: update the last update date

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
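
Several of the commits above reference an optimised `vec_hsum` (horizontal sum). As a portable illustration only (the actual change uses z/Architecture vector intrinsics, and the helper name below is hypothetical), a 4-lane int32 horizontal sum has this shape:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable stand-in for a 4-lane int32 horizontal sum.
 * The real kernels do this with z/Architecture vector intrinsics; a
 * pairwise tree reduction is the typical SIMD shape of the operation. */
static int32_t hsum_i32x4(const int32_t v[4]) {
    const int32_t s01 = v[0] + v[1];  /* reduce lanes 0 and 1 */
    const int32_t s23 = v[2] + v[3];  /* reduce lanes 2 and 3 */
    return s01 + s23;                 /* combine partial sums */
}
```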
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025