Layernorm bwd OPT #1880
Conversation
Maybe we can follow the recent CUDA change here: pytorch/pytorch@73b4938
Pull Request Overview
This PR optimizes the backward pass computation for layer normalization's gamma and beta gradients by implementing a two-stage column reduction approach to improve parallelism when the matrix dimension M is much larger than N.
- Introduces a new GammaBetaReduceFunctor kernel that uses tiled computation with local memory for better occupancy
- Adds logic to automatically select between the optimized two-stage reduction and the existing simple kernel based on occupancy thresholds
- Implements separate code paths for different combinations of gamma and beta gradient computations
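For orientation, here is a minimal host-side sketch of the two-stage column reduction idea in plain C++ (an illustration, not the actual SYCL kernel; the function name, tile_m, and the flat row-major layout are assumptions):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Stage 1: each tile of tile_m rows produces one partial row of column sums
    // in a [num_tile_m, N] buffer (in the kernel, one workgroup per tile, so
    // parallelism scales with num_tile_m instead of a single workgroup).
    // Stage 2: the small [num_tile_m, N] buffer is reduced down to [N].
    std::vector<float> two_stage_column_sum(const std::vector<float>& x,
                                            int64_t M, int64_t N, int64_t tile_m) {
      const int64_t num_tile_m = (M + tile_m - 1) / tile_m;

      std::vector<float> partial(num_tile_m * N, 0.f);
      for (int64_t t = 0; t < num_tile_m; ++t) {
        const int64_t row_end = std::min(M, (t + 1) * tile_m);
        for (int64_t m = t * tile_m; m < row_end; ++m)
          for (int64_t n = 0; n < N; ++n)
            partial[t * N + n] += x[m * N + n];
      }

      std::vector<float> out(N, 0.f);
      for (int64_t t = 0; t < num_tile_m; ++t)
        for (int64_t n = 0; n < N; ++n)
          out[n] += partial[t * N + n];
      return out;
    }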
      std::is_same_v<T, at::BFloat16>) &&
      N <= static_cast<int64_t>(1ULL << std::numeric_limits<float>::digits) &&
      N % num_vec_elems == 0 && can_vec_X && can_vec_Y && can_vec_gamma &&
      can_vec_beta) {
[nitpick] The condition formatting is inconsistent. The && operator should be aligned with the opening parenthesis or consistently indented.
  if (dbeta.defined()) {
    auto options = dbeta.options();
    dbeta_blocks = at::empty({num_tile_m, N}, options);
    dbeta_blocks_ptr = dbeta_blocks.data_ptr<weight_t>();
This TODO comment suggests uncertainty about the data type handling. The comment should either be resolved or provide more context about why float32 might be needed and what the current behavior is.
Suggested change:
    // Set dgamma_blocks dtype to float32 for numerical stability in reduction
    dgamma_blocks = at::empty({num_tile_m, N}, options.dtype(at::kFloat));
    dgamma_blocks_ptr = dgamma_blocks.data_ptr<float>();
  }
  if (dbeta.defined()) {
    auto options = dbeta.options();
    dbeta_blocks = at::empty({num_tile_m, N}, options.dtype(at::kFloat));
    dbeta_blocks_ptr = dbeta_blocks.data_ptr<float>();
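To make the motivation for the float32 intermediate concrete, the following standalone C++ snippet (an illustration, not code from this PR; the per-row value 1e-3 is arbitrary) emulates bfloat16 round-off by truncating the low 16 bits of a float and shows how a long running sum in bf16 precision stalls while a float32 accumulator does not:

    #include <cstdint>
    #include <cstring>
    #include <iostream>

    // Round a float to bfloat16 precision by zeroing the low 16 bits
    // (truncation; good enough to show the effect).
    float to_bf16(float x) {
      uint32_t bits;
      std::memcpy(&bits, &x, sizeof(bits));
      bits &= 0xFFFF0000u;
      std::memcpy(&x, &bits, sizeof(bits));
      return x;
    }

    int main() {
      const int M = 401408;   // rows reduced per column in the tnt_s_patch16_224 case
      const float v = 1e-3f;  // small per-row contribution (illustrative)

      // Accumulate at bfloat16 precision: once the running sum grows, adding a
      // small term no longer changes it (bf16 keeps only ~8 mantissa bits).
      float acc_bf16 = 0.f;
      for (int i = 0; i < M; ++i) acc_bf16 = to_bf16(acc_bf16 + v);

      // Accumulate in float32: keeps enough precision for the partial sums.
      float acc_f32 = 0.f;
      for (int i = 0; i < M; ++i) acc_f32 += v;

      std::cout << "bf16 accumulation: " << acc_bf16 << "\n";  // stalls far below 401
      std::cout << "fp32 accumulation: " << acc_f32 << "\n";   // ~401.4
      return 0;
    }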
Co-authored-by: Copilot <[email protected]>
I noticed that layer norm backward for gamma and beta is very slow when the column is much longer than the row, i.e. an [M, N] column reduction with M >> N. For example, in timm tnt_s_patch16_224 training, the layernorm backward shape is [25088, 16, 24] with normalized shape [24], and the existing kernel launches only one workgroup. This PR uses a two-stage column reduction to increase parallelism. GammaBetaBackwardSimpleKernelFunctor takes 9 ms on PVC and 8.5 ms on BMG; after the optimization, GammaBetaReduceFunctor plus two sum kernels perform the column reduction in 0.09 ms + 0.06 ms x 2 on PVC and 0.19 ms + 0.04 ms x 2 on BMG.
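As a quick sanity check of the parallelism gain, here is the tile arithmetic for this shape (tile_m = 512 is a made-up tile height for illustration, not necessarily what the kernel picks):

    #include <cstdint>
    #include <iostream>

    int main() {
      // Input [25088, 16, 24] with normalized shape [24] collapses to an
      // M x N column reduction with M = 25088 * 16 and N = 24.
      const int64_t M = 25088LL * 16;  // 401408 rows
      const int64_t N = 24;

      const int64_t tile_m = 512;  // hypothetical tile height
      const int64_t num_tile_m = (M + tile_m - 1) / tile_m;  // 784 tiles

      std::cout << "stage-1 tiles (independent workgroups): " << num_tile_m
                << " vs. 1 workgroup for the simple kernel\n";
      std::cout << "stage-2 partial buffer: [" << num_tile_m << ", " << N << "]\n";
      return 0;
    }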
For example, in timm tnt_s_patch16_224 training, layernorm bwd shape [25088,16,24], normalized shape [24]. it will only launch one workgroup. I use a two staged column reduction to increase parallelism. GammaBetaBackwardSimpleKernelFunctor takes 9 ms on PVC, 8.5ms on BMG. After opt, we use GammaBetaReduceFunctor and two sum to do column reduction, they will take 0.09ms + 0.06ms x2 on PVC and 0.19ms + 0.04ms x 2 on BMG