taronaeo
Collaborator

This PR optimises the GGML_OP_RMS_NORM operation on the CPU backend and adds performance test cases that use the dimensions and epsilon values of Qwen3-0.6B, Llama-3.2-1B, Granite-3.3-2B and GPT-OSS-20B.
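
For readers unfamiliar with the operation, here is a minimal scalar sketch of what RMS normalisation computes for each row: y[i] = x[i] / sqrt(mean(x^2) + eps). The helper names `rms_norm_row` and `rms_norm` are hypothetical and exist only for illustration. This is a plain reference version, not the optimised kernel from this PR and not ggml's API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference (unoptimised) RMS norm over one row of n floats.
// Hypothetical helper for illustration; not the kernel changed by this PR.
inline void rms_norm_row(const float * x, float * y, std::size_t n, float eps) {
    double sum = 0.0;  // accumulate squares in double to limit rounding error
    for (std::size_t i = 0; i < n; ++i) {
        sum += static_cast<double>(x[i]) * static_cast<double>(x[i]);
    }
    const float mean  = static_cast<float>(sum / static_cast<double>(n));
    const float scale = 1.0f / std::sqrt(mean + eps);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = x[i] * scale;
    }
}

// Normalise every row of a contiguous tensor whose rows have ne0 elements,
// e.g. ne = [1024, 2] from the test cases means two rows of 1024 floats.
inline std::vector<float> rms_norm(const std::vector<float> & x, std::size_t ne0, float eps) {
    std::vector<float> y(x.size());
    for (std::size_t r = 0; r + ne0 <= x.size(); r += ne0) {
        rms_norm_row(x.data() + r, y.data() + r, ne0, eps);
    }
    return y;
}
```

Calling `rms_norm(x, 1024, 1e-6f)` on a 2048-element vector would mirror the shape and epsilon of the first test case. Each row needs one pass to reduce the sum of squares and one pass to scale, so both loops are natural targets for vectorisation.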

Benchmark

This PR was tested on an Apple M1 Pro (10 cores, 32 GB RAM).

Before Optimisation

$ build/bin/test-backend-ops -b CPU -o RMS_NORM

ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.026 sec
ggml_metal_device_init: GPU name:   Apple M1 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 22906.50 MB
Testing 3 devices

Backend 1/3: Metal
  Skipping
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Device description: Apple M1 Pro
  Device memory: 32768 MB (32768 MB free)

  RMS_NORM(type=f32,ne=[1024,2,1,1],v=0,eps=0.000001,inplace=0):               98292 runs -    10.22 us/run -       16 kB/run -    1.49 GB/s
  RMS_NORM(type=f32,ne=[2048,2,1,1],v=0,eps=0.000010,inplace=0):               65528 runs -    15.65 us/run -       32 kB/run -    1.95 GB/s
  RMS_NORM(type=f32,ne=[2880,2,1,1],v=0,eps=0.000010,inplace=0):               57337 runs -    18.07 us/run -       45 kB/run -    2.38 GB/s
  Backend CPU: OK
3/3 backends passed
OK

After Optimisation

$ build/bin/test-backend-ops -b CPU -o RMS_NORM

ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.026 sec
ggml_metal_device_init: GPU name:   Apple M1 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 22906.50 MB
Testing 3 devices

Backend 1/3: Metal
  Skipping
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Device description: Apple M1 Pro
  Device memory: 32768 MB (32768 MB free)

  RMS_NORM(type=f32,ne=[1024,2,1,1],v=0,eps=0.000001,inplace=0):              155629 runs -     6.88 us/run -       16 kB/run -    2.22 GB/s
  RMS_NORM(type=f32,ne=[2048,2,1,1],v=0,eps=0.000010,inplace=0):              139247 runs -     7.36 us/run -       32 kB/run -    4.14 GB/s
  RMS_NORM(type=f32,ne=[2880,2,1,1],v=0,eps=0.000010,inplace=0):              122865 runs -     8.26 us/run -       45 kB/run -    5.20 GB/s
  Backend CPU: OK
3/3 backends passed
OK

Signed-off-by: Aaron Teo <[email protected]>

tests: add rms_norm tests

Signed-off-by: Aaron Teo <[email protected]>

tests: update eps for rms norm

Signed-off-by: Aaron Teo <[email protected]>

tests: add qwen-3-0.6b test case

Signed-off-by: Aaron Teo <[email protected]>

ggml-cpu: minor code alignment cleanup

Signed-off-by: Aaron Teo <[email protected]>

ggml-cpu: code cleanup

Signed-off-by: Aaron Teo <[email protected]>
github-actions bot added the testing and ggml labels on Oct 18, 2025