
Conversation

@taronaeo (Collaborator) commented on Aug 21, 2025

This pull request adds SIMD implementations for the Q5_0 and Q5_1 quantisation formats on the s390x platform. At best, Q5_0 sees performance improvements of 38.42% in prompt processing and 156.17% in token generation; Q5_1 sees 38.40% and 146.84% respectively.

Before SIMD Benchmark

| model | size | params | backend | threads | test | t/s |
| --------------- | -------- | ------ | ------- | ------: | ----- | ------------: |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | pp512 | 55.42 ± 0.13 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | tg128 | 2.52 ± 0.00 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 16 | pp512 | 100.22 ± 0.07 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 16 | tg128 | 4.83 ± 0.00 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | pp512 | 55.40 ± 0.06 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | tg128 | 2.99 ± 0.00 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 16 | pp512 | 100.72 ± 0.16 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 16 | tg128 | 5.66 ± 0.00 |

build: fd8f4a2 (6226)

After SIMD Benchmark

| model | size | params | backend | threads | test | t/s |
| --------------- | -------- | ------ | ------- | ------: | ----- | ------------: |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | pp512 | 81.78 ± 0.08 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | tg128 | 20.48 ± 0.02 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 16 | pp512 | 138.53 ± 0.21 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 16 | tg128 | 30.77 ± 0.13 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | pp512 | 81.73 ± 0.03 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | tg128 | 19.51 ± 0.03 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 16 | pp512 | 139.85 ± 0.16 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 16 | tg128 | 29.03 ± 0.11 |

build: fd8f4a2 (6226)
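
For context, the new SIMD kernels vectorize the Q5_0 bit layout: each block packs 32 quants as a scale, sixteen bytes of low nibbles, and four bytes holding each quant's 5th bit. A scalar reference sketch of the dequantization follows (simplified: ggml stores the scale as fp16, a plain float is used here, and the struct name `block_q5_0_ref` is invented for this sketch):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QK5_0 32

/* Simplified Q5_0 block: ggml stores the scale `d` as fp16 (ggml_half);
 * a plain float is used here to keep the sketch self-contained. */
typedef struct {
    float   d;              /* scale (delta)                           */
    uint8_t qh[4];          /* 5th (high) bit of each of the 32 quants */
    uint8_t qs[QK5_0 / 2];  /* low 4 bits, two quants packed per byte  */
} block_q5_0_ref;

/* Scalar reference dequantization: each 5-bit quant q in [0, 31] maps to
 * d * (q - 16). The SIMD kernels vectorize this bit extraction across lanes. */
static void dequantize_q5_0_ref(const block_q5_0_ref *x, float *y) {
    uint32_t qh;
    memcpy(&qh, x->qh, sizeof(qh));
    for (int j = 0; j < QK5_0 / 2; ++j) {
        const uint8_t xh_0 = ((qh >> j) << 4) & 0x10;  /* 5th bit, low nibble  */
        const uint8_t xh_1 = (qh >> (j + 12)) & 0x10;  /* 5th bit, high nibble */
        const int32_t x0 = ((x->qs[j] & 0x0F) | xh_0) - 16;
        const int32_t x1 = ((x->qs[j] >>   4) | xh_1) - 16;
        y[j]             = x0 * x->d;
        y[j + QK5_0 / 2] = x1 * x->d;
    }
}
```

Q5_1 differs only in that it carries an additional offset `m` and maps the unsigned quant as `d * q + m` instead of centring around 16.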

Verification

To ensure that this implementation does not break anything, the SIMD kernels have been tested on the following models:

- Tested Granite 3.3 2B Instruct Big-Endian (Q5_0, Q5_1)
- Kindly requesting that other models be tested as well

> **Note**
> Tests were conducted on an IBM z17 mainframe with 40 IFLs (cores) and 128 GB of memory on a shared R&D LPAR.

Please review this pull request and consider merging into the main repository. Thank you!

@github-actions bot added the `documentation` (Improvements or additions to documentation) and `ggml` (changes relating to the ggml tensor library for machine learning) labels on Aug 21, 2025
@taronaeo merged commit ad5c975 into ggml-org:master on Aug 22, 2025 (88 of 89 checks passed)
qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 25, 2025
* ggml-cpu: initial q5_0 impl for s390x

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: updated q5_0 code for better performance

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: use optimised hsum for better performance

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: introduce q5_1 simd + refactor q5_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix incorrect return type vec_hsum

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: refactor q5_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: q5_1 update loop unroll to 4

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: update q5_0 unroll to 4

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: update build-s390x docs

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: update unused variables q5_0

Signed-off-by: Aaron Teo <[email protected]>

* docs: update the last update date

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
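
Several of the commits above reference an optimised `vec_hsum` (horizontal sum). As a portable illustration only (the actual change uses z/Architecture vector intrinsics, and the helper name below is hypothetical), a 4-lane int32 horizontal sum has this shape:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable stand-in for a 4-lane int32 horizontal sum.
 * The real kernels do this with z/Architecture vector intrinsics; a
 * pairwise tree reduction is the typical SIMD shape of the operation. */
static int32_t hsum_i32x4(const int32_t v[4]) {
    const int32_t s01 = v[0] + v[1];  /* reduce lanes 0 and 1 */
    const int32_t s23 = v[2] + v[3];  /* reduce lanes 2 and 3 */
    return s01 + s23;                 /* combine partial sums */
}
```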
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025