Conversation


@jerrymannil jerrymannil commented Aug 22, 2025

cherry-pick of pytorch#160979
Less-performant fix until pytorch#161180 is finalized

  • The global reduction path in the reduction kernel currently performs two threadfence operations.
  • The first threadfence is executed by all threads in all blocks, whereas the second is run only by threads in a single block.
  • On AMD GPUs, threadfence is a heavyweight operation, especially when run by all threads in the system (due to cross-XCD synchronization).
  • Using fine-grained fences therefore gives a significant performance boost on AMD GPUs.
  • We issue a release fence when threads write to the reduce buffer in global memory, and an acquire fence when threads read from the reduce buffer (see the sketch after this list).
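
A minimal HIP-style sketch of the pattern described above, for illustration only: it is not the actual code in PyTorch's Reduce.cuh, the kernel/buffer/semaphore names are made up, and the real fix may use different fence intrinsics and scopes. It shows each block publishing its partial result to the reduce buffer behind a release fence, and the last block to arrive reading the buffer behind an acquire fence, instead of two device-wide `__threadfence()` calls.

```cpp
// Hypothetical sketch, not the actual Reduce.cuh implementation.
// reduce_buf has one float per block; semaphore is a single int, zero-initialized.
#include <hip/hip_runtime.h>

__global__ void global_sum(const float* in, int n,
                           float* reduce_buf, int* semaphore, float* out) {
    // Per-block partial sum via a shared-memory accumulator.
    __shared__ float block_sum;
    if (threadIdx.x == 0) block_sum = 0.f;
    __syncthreads();

    float v = 0.f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];
    atomicAdd(&block_sum, v);
    __syncthreads();

    if (threadIdx.x == 0) {
        reduce_buf[blockIdx.x] = block_sum;          // publish this block's partial result
        __atomic_thread_fence(__ATOMIC_RELEASE);     // release fence instead of __threadfence()
        int prev = atomicAdd(semaphore, 1);
        if (prev == (int)gridDim.x - 1) {            // last block to arrive does the final sum
            __atomic_thread_fence(__ATOMIC_ACQUIRE); // acquire fence instead of __threadfence()
            float total = 0.f;
            for (unsigned b = 0; b < gridDim.x; ++b)
                total += reduce_buf[b];
            *out = total;
        }
    }
}
```

The release fence orders each block's write to `reduce_buf` before its increment of `semaphore`, and the acquire fence in the last block orders the semaphore read before the reads of `reduce_buf`, so the heavier device-wide `__threadfence()` is not needed on this path.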

Co-authors: @amd-hhashemi, @jeffdaily

Reproducer:

```python
import torch

shapes = [(2, 896, 59, 91)]
dims = [(2, 3)]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.bfloat16)
    x = x.to(memory_format=torch.channels_last)

    # Warm-up iterations
    for _ in range(20):
        _ = torch.sum(x, dims[i], keepdim=True, dtype=torch.bfloat16)
    torch.cuda.synchronize()

    # Timed iterations
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    start_evt.record()
    for _ in range(100):
        _ = torch.sum(x, dims[i], keepdim=True, dtype=torch.bfloat16)
    end_evt.record()
    torch.cuda.synchronize()

    # elapsed_time() returns milliseconds; report microseconds per iteration
    print(f"Avg time for shape {shape}: {start_evt.elapsed_time(end_evt) / 100 * 1e3:.2f} us")
```

Fixes SWDEV-545710

Cherry-picked to release/2.8 branch via #2561

Cherry-picked to rocm7.1_internal_testing branch via #2563

@jerrymannil jerrymannil self-assigned this Aug 22, 2025
@jerrymannil (Collaborator, Author) commented:

Results (MI300X):

Before:
Avg time for shape (2, 896, 59, 91): 82.13 us

After:
Avg time for shape (2, 896, 59, 91): 61 us

@rocm-repo-management-api

Jenkins build for commit baddc98b5389ba858f9677a5a2738914e429192d is in progress

@pruthvistony pruthvistony merged commit c00d48c into release/2.7 Aug 22, 2025
0 of 2 checks passed
@pruthvistony pruthvistony deleted the jerrymannil-patch-1 branch August 22, 2025 16:55
@jerrymannil (Collaborator, Author) commented:

! cherry-pick --onto release/2.8 rocm7.1_internal_testing

dhonnappa-amd pushed a commit that referenced this pull request Aug 22, 2025

dhonnappa-amd pushed a commit that referenced this pull request Aug 22, 2025

dhonnappa-amd pushed a commit that referenced this pull request Aug 22, 2025

dhonnappa-amd pushed a commit that referenced this pull request Aug 22, 2025

@dhonnappa-amd

Created branch autogenerated/release/2.8_cherry-pick_pr-2553 and #2561

Created branch autogenerated/rocm7.1_internal_testing_cherry-pick_pr-2553 and #2563

Comment processed by Build

jerrymannil added a commit that referenced this pull request Aug 22, 2025
jerrymannil added a commit that referenced this pull request Aug 22, 2025
…e in reduction (#2563)

Cherry-pick of #2553

Co-authored-by: Jerry Mannil <[email protected]>
jerrymannil added a commit that referenced this pull request Sep 5, 2025
…e in reduction (#2563)

Cherry-pick of #2553

Co-authored-by: Jerry Mannil <[email protected]>