Conversation

@taronaeo
Collaborator

This PR optimises the GGML_OP_RMS_NORM operation and introduces performance test cases based on the dimensions and epsilon values of Qwen3-0.6B, Llama-3.2-1B, Granite-3.3-2B, and GPT-OSS-20B.
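For context, a minimal scalar sketch of the per-row computation GGML_OP_RMS_NORM performs (illustrative only — this is not the optimised kernel from this PR, and the helper name rms_norm_row_f32 is made up):

#include <math.h>

// Reference per-row RMS norm: y = x / sqrt(mean(x^2) + eps).
// Hypothetical helper for illustration; the real ggml op walks whole tensors.
static void rms_norm_row_f32(const int n, float * y, const float * x, const float eps) {
    double sum = 0.0;                                // sum of squares
    for (int i = 0; i < n; ++i) {
        sum += (double)(x[i]*x[i]);
    }
    const float scale = 1.0f/sqrtf((float)(sum/n) + eps);
    for (int i = 0; i < n; ++i) {
        y[i] = x[i]*scale;
    }
}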

Benchmark

This PR was tested on an Apple M1 Pro (10 cores) with 32 GB RAM.

Before Optimisation

$ build/bin/test-backend-ops -b CPU -o RMS_NORM

ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.026 sec
ggml_metal_device_init: GPU name:   Apple M1 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 22906.50 MB
Testing 3 devices

Backend 1/3: Metal
  Skipping
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Device description: Apple M1 Pro
  Device memory: 32768 MB (32768 MB free)

  RMS_NORM(type=f32,ne=[1024,2,1,1],v=0,eps=0.000001,inplace=0):               98292 runs -    10.22 us/run -       16 kB/run -    1.49 GB/s
  RMS_NORM(type=f32,ne=[2048,2,1,1],v=0,eps=0.000010,inplace=0):               65528 runs -    15.65 us/run -       32 kB/run -    1.95 GB/s
  RMS_NORM(type=f32,ne=[2880,2,1,1],v=0,eps=0.000010,inplace=0):               57337 runs -    18.07 us/run -       45 kB/run -    2.38 GB/s
  Backend CPU: OK
3/3 backends passed
OK

After Optimisation

$ build/bin/test-backend-ops -b CPU -o RMS_NORM

ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.026 sec
ggml_metal_device_init: GPU name:   Apple M1 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 22906.50 MB
Testing 3 devices

Backend 1/3: Metal
  Skipping
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Device description: Apple M1 Pro
  Device memory: 32768 MB (32768 MB free)

  RMS_NORM(type=f32,ne=[1024,2,1,1],v=0,eps=0.000001,inplace=0):              155629 runs -     6.88 us/run -       16 kB/run -    2.22 GB/s
  RMS_NORM(type=f32,ne=[2048,2,1,1],v=0,eps=0.000010,inplace=0):              139247 runs -     7.36 us/run -       32 kB/run -    4.14 GB/s
  RMS_NORM(type=f32,ne=[2880,2,1,1],v=0,eps=0.000010,inplace=0):              122865 runs -     8.26 us/run -       45 kB/run -    5.20 GB/s
  Backend CPU: OK
3/3 backends passed
OK

Signed-off-by: Aaron Teo <[email protected]>

tests: add rms_norm tests

Signed-off-by: Aaron Teo <[email protected]>

tests: update eps for rms norm

Signed-off-by: Aaron Teo <[email protected]>

tests: add qwen-3-0.6b test case

Signed-off-by: Aaron Teo <[email protected]>

ggml-cpu: minor code alignment cleanup

Signed-off-by: Aaron Teo <[email protected]>

ggml-cpu: code cleanup

Signed-off-by: Aaron Teo <[email protected]>
github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 18, 2025
Comment on lines +3543 to +3544
float sum = 0.0f;
ggml_vec_dot_f32(ne00, &sum, 0, x, 0, x, 0, 1);
Member

This would change the accumulator from double to float. Are there any possible overflow issues?

Member

I was also having similar doubts and am not sure what the answer is. I think practically this should be OK, but I'm not sure it is worth breaking the convention of accumulating floats into ggml_float.

Collaborator Author

I checked this and it turns out ggml_vec_dot_f32 already accumulates as ggml_float and returns the result as a float.

So I didn't see a point in using ggml_float again, unless I'm missing something here.

Ref:

// scalar
ggml_float sumf = 0.0;
for (int i = 0; i < n; ++i) {
    sumf += (ggml_float)(x[i]*y[i]);
}

Do let me know if I should run any tests to determine if this change causes a degradation :)

Contributor

ggml_vec_dot_f32 accumulates in a ggml_float only on the scalar branch; on the vectorized branches it accumulates into a float. Have a look and you will see :)
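
As a standalone illustration of the precision concern (not ggml code, purely a sketch): summing many fp32 squares into a float accumulator drifts noticeably once the running sum grows large, while a double (ggml_float) accumulator stays close to the true value.

#include <stdio.h>

int main(void) {
    const int n = 1 << 24;       // far longer than the rows benchmarked above
    float  acc_f = 0.0f;         // float accumulator (vectorized branches)
    double acc_d = 0.0;          // double accumulator (scalar branch)
    for (int i = 0; i < n; ++i) {
        const float x = 0.1f;
        acc_f += x*x;
        acc_d += (double)(x*x);
    }
    // acc_f visibly deviates from ~n*0.01 once each 0.01 increment falls
    // below the float accumulator's ulp; acc_d remains accurate.
    printf("float:  %f\n", acc_f);
    printf("double: %f\n", acc_d);
    return 0;
}

That said, for the row sizes exercised here (1024–2880 elements) the drift is small, which matches the "practically this should be OK" assessment above.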
