Layernorm bwd OPT #1880

jianyizh · 2025-07-25T06:31:04Z

I noticed that the layer norm backward pass for gamma and beta is very slow when the reduced dimension is much longer than the kept one, i.e. a column reduction over [M, N] with M >> N.
For example, when training timm tnt_s_patch16_224, the layernorm backward input has shape [25088, 16, 24] with normalized shape [24], and the existing kernel launches only one workgroup. I use a two-stage column reduction to increase parallelism. Before the change, GammaBetaBackwardSimpleKernelFunctor takes 9 ms on PVC and 8.5 ms on BMG. After the change, we use GammaBetaReduceFunctor followed by two sum reductions to do the column reduction; they take 0.09 ms + 0.06 ms x 2 on PVC and 0.19 ms + 0.04 ms x 2 on BMG.
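To make the idea concrete, here is a minimal CPU sketch of a two-stage column reduction for the gamma and beta gradients. It is not the SYCL kernel from this PR: the function name `two_stage_gamma_beta_backward`, the `GammaBetaGrads` struct, and the `num_tiles` parameter are hypothetical, and the sketch assumes the usual layer norm gradients, dgamma[j] = sum_i dY[i][j] * (X[i][j] - mean[i]) * rstd[i] and dbeta[j] = sum_i dY[i][j]. For the shape above, M = 25088 x 16 = 401408 and N = 24.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch (not from the PR).
struct GammaBetaGrads {
  std::vector<float> dgamma;  // [N]
  std::vector<float> dbeta;   // [N]
};

// CPU reference for a two-stage column reduction over row-major [M, N] data.
// Stage 1 reduces each tile of rows to one partial row; on a GPU every tile
// could be handled by its own workgroup. Stage 2 sums the partial rows.
GammaBetaGrads two_stage_gamma_beta_backward(
    const std::vector<float>& dY,    // [M, N] upstream gradient
    const std::vector<float>& X,     // [M, N] layer norm input
    const std::vector<float>& mean,  // [M] per-row mean
    const std::vector<float>& rstd,  // [M] per-row reciprocal std
    std::size_t M,
    std::size_t N,
    std::size_t num_tiles) {         // assumed tuning knob, not from the PR
  assert(dY.size() == M * N && X.size() == M * N);
  assert(mean.size() == M && rstd.size() == M);

  // Clamp the tile count to [1, M] and split the rows evenly.
  num_tiles = std::max<std::size_t>(1, std::min(num_tiles, M));
  const std::size_t rows_per_tile = (M + num_tiles - 1) / num_tiles;

  // Stage 1: per-tile partial sums, shape [num_tiles, N].
  std::vector<float> partial_dgamma(num_tiles * N, 0.0f);
  std::vector<float> partial_dbeta(num_tiles * N, 0.0f);
  for (std::size_t t = 0; t < num_tiles; ++t) {
    const std::size_t row_begin = std::min(M, t * rows_per_tile);
    const std::size_t row_end = std::min(M, row_begin + rows_per_tile);
    for (std::size_t i = row_begin; i < row_end; ++i) {
      for (std::size_t j = 0; j < N; ++j) {
        const float dy = dY[i * N + j];
        const float x_hat = (X[i * N + j] - mean[i]) * rstd[i];
        partial_dgamma[t * N + j] += dy * x_hat;
        partial_dbeta[t * N + j] += dy;
      }
    }
  }

  // Stage 2: reduce the [num_tiles, N] partials down to [N].
  GammaBetaGrads grads{std::vector<float>(N, 0.0f),
                       std::vector<float>(N, 0.0f)};
  for (std::size_t t = 0; t < num_tiles; ++t) {
    for (std::size_t j = 0; j < N; ++j) {
      grads.dgamma[j] += partial_dgamma[t * N + j];
      grads.dbeta[j] += partial_dbeta[t * N + j];
    }
  }
  return grads;
}
```

With a single-stage reduction and N = 24, there are only 24 independent columns to parallelize over. Splitting the rows into tiles multiplies the available parallelism by the tile count, at the cost of a small second pass over the partials.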

jianyizh · 2025-07-25T06:43:20Z

Maybe we can follow the recent CUDA change here: pytorch/pytorch@73b4938

Co-authored-by: Copilot <[email protected]>

Copilot

Pull Request Overview

This PR optimizes the backward pass for LayerNorm by implementing a two-stage column reduction for gamma and beta gradients when dealing with cases where M (rows) >> N (columns). The optimization addresses performance bottlenecks where the original implementation would only launch a single workgroup for column reduction, resulting in poor GPU utilization.

Implements a new GammaBetaReduceFunctor for optimized two-stage reduction
Adds intelligent heuristics to determine when to use the optimized path vs. the simple kernel
Provides significant performance improvements (from 9 ms to about 0.21 ms on PVC, and from 8.5 ms to about 0.27 ms on BMG, summing the reported kernel times)

src/ATen/native/xpu/sycl/LayerNormKernels.cpp

jianyizh requested review from EikanWang and toyxu July 25, 2025 06:31

jianyizh added the kernel_optimization label Jul 25, 2025

Copilot AI review requested due to automatic review settings July 25, 2025 06:31

jianyizh requested a review from Copilot August 6, 2025 12:37

jianyizh changed the title ~~[WIP] Layernorm bwd OPT~~ Layernorm bwd OPT Aug 6, 2025

toyxu approved these changes Aug 8, 2025

jianyizh requested a review from liangan1 August 13, 2025 02:35

liangan1 approved these changes Aug 19, 2025

jianyizh and others added 8 commits August 19, 2025 13:44

save

c4930b6

save

484ffa6

save

c436b11

save

2001004

save

0121231

Apply suggestions from copilot

02bcd80

Co-authored-by: Copilot <[email protected]>

Apply suggestions from copilot

0e14482

Co-authored-by: Copilot <[email protected]>

lint

0bfee0f

jianyizh force-pushed the jianyi/ln_bwd branch from e8bef72 to 0bfee0f Compare August 19, 2025 05:44

jianyizh requested a review from Copilot August 21, 2025 07:04

Copilot AI reviewed Aug 21, 2025

jianyizh added this pull request to the merge queue Aug 21, 2025

Merged via the queue into main with commit 7651ca2 Aug 21, 2025
98 of 105 checks passed

jianyizh deleted the jianyi/ln_bwd branch August 21, 2025 07:19
