
Conversation

Collaborator

@taronaeo taronaeo commented Sep 23, 2025

This pull request integrates the SIMD instruction set for MXFP4 on the s390x platform. We observe a 788.37% performance improvement for Prompt Processing (pp512) and 434.01% for Token Generation (tg128), as measured in the benchmarks below.

Before SIMD Benchmark

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | BLAS | 8 | pp512 | 3.44 ± 0.00 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | BLAS | 8 | tg128 | 2.97 ± 0.00 |

After SIMD Benchmark

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | BLAS | 8 | pp512 | 30.56 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | BLAS | 8 | tg128 | 15.86 ± 0.03 |

Verification

To ensure that this implementation did not break anything, the SIMD instruction set has been tested on the following models:

  • Tested GPT OSS 20B Big-Endian @ MXFP4
  • Please request additional models for testing as needed

test-quantize-fns

Note that q8_0 is currently failing; that failure is addressed in #15925. MXFP4 itself is passing.

$ build/bin/test-quantize-fns 

Testing f32
Testing f16
Testing q4_0
Testing q4_1
Testing q5_0
Testing q5_1
Testing q8_0
 q8_0 reference implementation error: FAILED (0.000175)
Testing q8_1
Testing q2_K
Testing q3_K
Testing q4_K
Testing q5_K
Testing q6_K
Testing q8_K
Testing iq2_xxs
Testing iq2_xs
Testing iq3_xxs
Testing iq1_s
Testing iq4_nl
Testing iq3_s
Testing iq2_s
Testing iq4_xs
Testing i8
Testing i16
Testing i32
Testing i64
Testing f64
Testing iq1_m
Testing bf16
Testing tq1_0
Testing tq2_0
Testing mxfp4
1 tests failed

Note

Tests were conducted on an IBM z17 mainframe with 40 IFLs (cores) and 128 GB of memory on a shared R&D LPAR.

Please review this pull request and consider merging into the main repository. Thank you!

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Sep 23, 2025
Comment on lines 325 to 326
#pragma GCC unroll 8
for (; ib < nb; ++ib) {
Member

This unroll seems unnecessary, since this loop should only have zero or one iterations.

Collaborator Author

Good catch! Fixed in the latest commit.

@taronaeo
Collaborator Author

The AMX CI job has been failing with the same error, so it is being ignored. `CI / ggml-ci-arm64-cpu-high-perf-sve` does not seem related to this PR.

Pushing to master in 1 hour if there are no further comments.

@ggerganov ggerganov merged commit 9b26511 into ggml-org:master Sep 26, 2025
64 of 66 checks passed
struct pushed a commit to struct/llama.cpp that referenced this pull request Sep 26, 2025
* ggml-cpu: impl mxfp4 s390x
* ggml-cpu: missing s = sumf
* ggml-cpu: fix incorrect kval_mxfp4 type
* ggml-cpu: rework mxfp4
* ggml-cpu: missing delta calc
* ggml-cpu: fix typo
* ggml-cpu: fix typo for vec_splats
* ggml-cpu: expand to 2 blocks per loop
* ggml-cpu: add unroll to boost perf
* ggml-cpu: back to 1 block per loop to test perf
* Revert "ggml-cpu: back to 1 block per loop to test perf" (reverts commit 1fe5572)
* ggml-cpu: rm unroll from single block

Signed-off-by: Aaron Teo <[email protected]>
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025