
Conversation

Collaborator

@taronaeo taronaeo commented Sep 23, 2025

This pull request integrates the SIMD instruction set for MXFP4 on the s390x platform. We observe a 788.37% performance improvement for Prompt Processing (pp512) and 434.01% for Token Generation (tg128), as measured in the benchmarks below.

Before SIMD Benchmark

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | BLAS | 8 | pp512 | 3.44 ± 0.00 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | BLAS | 8 | tg128 | 2.97 ± 0.00 |

After SIMD Benchmark

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | BLAS | 8 | pp512 | 30.56 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | BLAS | 8 | tg128 | 15.86 ± 0.03 |

Verification

To ensure that this implementation did not break anything, the SIMD instruction set has been tested on the following models:

  • Tested GPT OSS 20B Big-Endian @ MXFP4
  • Please request additional models for testing as needed

test-quantize-fns

Note that q8_0 is currently failing; that failure is addressed in #15925. MXFP4 itself is passing.

$ build/bin/test-quantize-fns 

Testing f32
Testing f16
Testing q4_0
Testing q4_1
Testing q5_0
Testing q5_1
Testing q8_0
 q8_0 reference implementation error: FAILED (0.000175)
Testing q8_1
Testing q2_K
Testing q3_K
Testing q4_K
Testing q5_K
Testing q6_K
Testing q8_K
Testing iq2_xxs
Testing iq2_xs
Testing iq3_xxs
Testing iq1_s
Testing iq4_nl
Testing iq3_s
Testing iq2_s
Testing iq4_xs
Testing i8
Testing i16
Testing i32
Testing i64
Testing f64
Testing iq1_m
Testing bf16
Testing tq1_0
Testing tq2_0
Testing mxfp4
1 tests failed

Note

Tests were conducted on an IBM z17 mainframe with 40 IFLs (cores) and 128 GB of memory on a shared R&D LPAR.

Please review this pull request and consider merging into the main repository. Thank you!

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Sep 23, 2025
Comment on lines 325 to 326
#pragma GCC unroll 8
for (; ib < nb; ++ib) {
Member

This unroll seems unnecessary, since this loop should only have zero or one iterations.

Collaborator Author

Good catch! Fixed in the latest commit.

@taronaeo
Collaborator Author

The AMX CI job has been failing with the same error, so it is being ignored. `CI / ggml-ci-arm64-cpu-high-perf-sve` does not seem related to this PR.

Pushing to master in 1 hour if there are no further comments.

@ggerganov ggerganov merged commit 9b26511 into ggml-org:master Sep 26, 2025
64 of 66 checks passed
struct pushed a commit to struct/llama.cpp that referenced this pull request Sep 26, 2025
* ggml-cpu: impl mxfp4 s390x
* ggml-cpu: missing s = sumf
* ggml-cpu: fix incorrect kval_mxfp4 type
* ggml-cpu: rework mxfp4
* ggml-cpu: missing delta calc
* ggml-cpu: fix typo
* ggml-cpu: fix typo for vec_splats
* ggml-cpu: expand to 2 blocks per loop
* ggml-cpu: add unroll to boost perf
* ggml-cpu: back to 1 block per loop to test perf
* Revert "ggml-cpu: back to 1 block per loop to test perf" (reverts commit 1fe5572)
* ggml-cpu: rm unroll from single block

Signed-off-by: Aaron Teo <[email protected]>
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025