Conversation

@taronaeo
Collaborator

This PR optimises the GGML_OP_RMS_NORM operation and introduces performance test cases based on the dimensions and epsilon values of Qwen3-0.6B, Llama-3.2-1B, Granite-3.3-2B, and GPT-OSS-20B.
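For context, a minimal scalar sketch of the per-row computation GGML_OP_RMS_NORM performs (illustrative only — this is not the optimised kernel from this PR, and the helper name rms_norm_row_f32 is made up):

#include <math.h>

// Reference per-row RMS norm: y = x / sqrt(mean(x^2) + eps).
// Hypothetical helper for illustration; the real ggml op walks whole tensors.
static void rms_norm_row_f32(const int n, float * y, const float * x, const float eps) {
    double sum = 0.0;                                // sum of squares
    for (int i = 0; i < n; ++i) {
        sum += (double)(x[i]*x[i]);
    }
    const float scale = 1.0f/sqrtf((float)(sum/n) + eps);
    for (int i = 0; i < n; ++i) {
        y[i] = x[i]*scale;
    }
}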

Benchmark

This PR was tested on an Apple M1 Pro (10 cores) with 32 GB RAM.

Before Optimisation

$ build/bin/test-backend-ops -b CPU -o RMS_NORM

ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.026 sec
ggml_metal_device_init: GPU name:   Apple M1 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 22906.50 MB
Testing 3 devices

Backend 1/3: Metal
  Skipping
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Device description: Apple M1 Pro
  Device memory: 32768 MB (32768 MB free)

  RMS_NORM(type=f32,ne=[1024,2,1,1],v=0,eps=0.000001,inplace=0):               98292 runs -    10.22 us/run -       16 kB/run -    1.49 GB/s
  RMS_NORM(type=f32,ne=[2048,2,1,1],v=0,eps=0.000010,inplace=0):               65528 runs -    15.65 us/run -       32 kB/run -    1.95 GB/s
  RMS_NORM(type=f32,ne=[2880,2,1,1],v=0,eps=0.000010,inplace=0):               57337 runs -    18.07 us/run -       45 kB/run -    2.38 GB/s
  Backend CPU: OK
3/3 backends passed
OK

After Optimisation

$ build/bin/test-backend-ops -b CPU -o RMS_NORM

ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.026 sec
ggml_metal_device_init: GPU name:   Apple M1 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 22906.50 MB
Testing 3 devices

Backend 1/3: Metal
  Skipping
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Device description: Apple M1 Pro
  Device memory: 32768 MB (32768 MB free)

  RMS_NORM(type=f32,ne=[1024,2,1,1],v=0,eps=0.000001,inplace=0):              155629 runs -     6.88 us/run -       16 kB/run -    2.22 GB/s
  RMS_NORM(type=f32,ne=[2048,2,1,1],v=0,eps=0.000010,inplace=0):              139247 runs -     7.36 us/run -       32 kB/run -    4.14 GB/s
  RMS_NORM(type=f32,ne=[2880,2,1,1],v=0,eps=0.000010,inplace=0):              122865 runs -     8.26 us/run -       45 kB/run -    5.20 GB/s
  Backend CPU: OK
3/3 backends passed
OK

Signed-off-by: Aaron Teo <[email protected]>

tests: add rms_norm tests

Signed-off-by: Aaron Teo <[email protected]>

tests: update eps for rms norm

Signed-off-by: Aaron Teo <[email protected]>

tests: add qwen-3-0.6b test case

Signed-off-by: Aaron Teo <[email protected]>

ggml-cpu: minor code alignment cleanup

Signed-off-by: Aaron Teo <[email protected]>

ggml-cpu: code cleanup

Signed-off-by: Aaron Teo <[email protected]>
github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 18, 2025
Comment on lines +3543 to +3544
float sum = 0.0f;
ggml_vec_dot_f32(ne00, &sum, 0, x, 0, x, 0, 1);
Member

This would change the accumulator from double to float. Are there any possible overflow issues?

Member

I was also having similar doubts and am not sure what the answer is. I think practically this should be OK, but I'm not sure it is worth breaking the convention of accumulating floats into ggml_float.

Collaborator Author

I checked this and it turns out ggml_vec_dot_f32 already accumulates as ggml_float and returns the result as a float.

So I didn't see a point in using ggml_float again, unless I'm missing something here.

Ref:

// scalar
ggml_float sumf = 0.0;
for (int i = 0; i < n; ++i) {
    sumf += (ggml_float)(x[i]*y[i]);
}

Do let me know if I should run any tests to determine if this change causes a degradation :)

Contributor

ggml_vec_dot_f32 accumulates in a ggml_float only on the scalar branch; on the vectorized branches it accumulates into a float. Have a look and you will see :)
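
As a standalone illustration of the precision concern (not ggml code, purely a sketch): summing many fp32 squares into a float accumulator drifts noticeably once the running sum grows large, while a double (ggml_float) accumulator stays close to the true value.

#include <stdio.h>

int main(void) {
    const int n = 1 << 24;       // far longer than the rows benchmarked above
    float  acc_f = 0.0f;         // float accumulator (vectorized branches)
    double acc_d = 0.0;          // double accumulator (scalar branch)
    for (int i = 0; i < n; ++i) {
        const float x = 0.1f;
        acc_f += x*x;
        acc_d += (double)(x*x);
    }
    // acc_f visibly deviates from ~n*0.01 once each 0.01 increment falls
    // below the float accumulator's ulp; acc_d remains accurate.
    printf("float:  %f\n", acc_f);
    printf("double: %f\n", acc_d);
    return 0;
}

That said, for the row sizes exercised here (1024–2880 elements) the drift is small, which matches the "practically this should be OK" assessment above.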
