Commit 9f82875

jayfoad authored and committed
[AMDGPU] Simplify the exclusive scan used for optimized atomics
Summary:
Change the scan algorithm to use only power-of-two shifts (1, 2, 4, 8,
16, 32) instead of starting off shifting by 1, 2 and 3 and then doing
a 3-way ADD, because:

1. It simplifies the compiler a little.
2. It minimizes vgpr pressure because each instruction is now of the
   form vn = vn + vn << c.
3. It is more friendly to the DPP combiner, which currently can't
   combine into an ADD3 instruction.

Because of #2 and #3 the end result is improved from this:

  v_add_u32_dpp v4, v3, v3 row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
  v_mov_b32_dpp v5, v3 row_shr:2 row_mask:0xf bank_mask:0xf
  v_mov_b32_dpp v1, v3 row_shr:3 row_mask:0xf bank_mask:0xf
  v_add3_u32 v1, v4, v5, v1
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_shr:4 row_mask:0xf bank_mask:0xe
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_shr:8 row_mask:0xf bank_mask:0xc
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_bcast:15 row_mask:0xa bank_mask:0xf
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_bcast:31 row_mask:0xc bank_mask:0xf

To this:

  v_add_u32_dpp v1, v1, v1 row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_shr:4 row_mask:0xf bank_mask:0xe
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_shr:8 row_mask:0xf bank_mask:0xc
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_bcast:15 row_mask:0xa bank_mask:0xf
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_bcast:31 row_mask:0xc bank_mask:0xf

I.e. two fewer computational instructions, one extra nop where we could
schedule something else.

Reviewers: arsenm, sheredom, critson, rampitec, vpykhtin

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D64411

Change-Id: I79f792d30210974acbd67ae0c5eaff3094263281
1 parent 5d8fc59 commit 9f82875

2 files changed: 8 additions & 12 deletions

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp

Lines changed: 8 additions & 10 deletions
@@ -385,26 +385,24 @@ void AMDGPUAtomicOptimizer::optimizeAtomic(Instruction &I,
     CallInst *const SetInactive =
         B.CreateIntrinsic(Intrinsic::amdgcn_set_inactive, Ty, {V, Identity});
 
-    CallInst *const FirstDPP =
+    ExclScan =
         B.CreateIntrinsic(Intrinsic::amdgcn_update_dpp, Ty,
                           {Identity, SetInactive, B.getInt32(DPP_WF_SR1),
                            B.getInt32(0xf), B.getInt32(0xf), B.getFalse()});
-    ExclScan = FirstDPP;
 
-    const unsigned Iters = 7;
-    const unsigned DPPCtrl[Iters] = {
-        DPP_ROW_SR1, DPP_ROW_SR2, DPP_ROW_SR3, DPP_ROW_SR4,
-        DPP_ROW_SR8, DPP_ROW_BCAST15, DPP_ROW_BCAST31};
-    const unsigned RowMask[Iters] = {0xf, 0xf, 0xf, 0xf, 0xf, 0xa, 0xc};
-    const unsigned BankMask[Iters] = {0xf, 0xf, 0xf, 0xe, 0xc, 0xf, 0xf};
+    const unsigned Iters = 6;
+    const unsigned DPPCtrl[Iters] = {DPP_ROW_SR1, DPP_ROW_SR2,
+                                     DPP_ROW_SR4, DPP_ROW_SR8,
+                                     DPP_ROW_BCAST15, DPP_ROW_BCAST31};
+    const unsigned RowMask[Iters] = {0xf, 0xf, 0xf, 0xf, 0xa, 0xc};
+    const unsigned BankMask[Iters] = {0xf, 0xf, 0xe, 0xc, 0xf, 0xf};
 
     // This loop performs an exclusive scan across the wavefront, with all lanes
     // active (by using the WWM intrinsic).
     for (unsigned Idx = 0; Idx < Iters; Idx++) {
-      Value *const UpdateValue = Idx < 3 ? FirstDPP : ExclScan;
       CallInst *const DPP = B.CreateIntrinsic(
           Intrinsic::amdgcn_update_dpp, Ty,
-          {Identity, UpdateValue, B.getInt32(DPPCtrl[Idx]),
+          {Identity, ExclScan, B.getInt32(DPPCtrl[Idx]),
            B.getInt32(RowMask[Idx]), B.getInt32(BankMask[Idx]), B.getFalse()});
 
       ExclScan = buildNonAtomicBinOp(B, Op, ExclScan, DPP);
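
For readers unfamiliar with the scheme, the power-of-two scan that the loop above builds can be modelled on the host. This is a simplified sketch, not the pass's implementation: it assumes a 64-lane wavefront, an add operation with identity 0, and plain whole-wavefront shifts in place of the real `row_shr`/`row_bcast` controls and their row/bank masks. The names `Wave`, `shiftRight` and `exclusiveScan` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumption: 64 lanes per wavefront.
constexpr std::size_t kLanes = 64;
using Wave = std::array<std::uint32_t, kLanes>;

// Hypothetical helper: move each lane's value `shift` lanes towards higher
// lane indices, filling the vacated low lanes with `identity`.
Wave shiftRight(const Wave &in, std::size_t shift, std::uint32_t identity) {
  Wave out{};
  for (std::size_t lane = 0; lane < kLanes; ++lane)
    out[lane] = lane >= shift ? in[lane - shift] : identity;
  return out;
}

// Exclusive add-scan using only power-of-two shifts (1, 2, 4, 8, 16, 32).
Wave exclusiveScan(const Wave &values) {
  const std::uint32_t identity = 0; // identity element for addition
  // The initial one-lane shift plays the role of the DPP_WF_SR1 step: it
  // turns the inclusive scan computed below into an exclusive one.
  Wave scan = shiftRight(values, 1, identity);
  // Each step has the form scan = scan + (scan shifted by c), so a single
  // running value is updated in place.
  for (std::size_t shift = 1; shift < kLanes; shift *= 2) {
    const Wave shifted = shiftRight(scan, shift, identity);
    for (std::size_t lane = 0; lane < kLanes; ++lane)
      scan[lane] += shifted[lane];
  }
  return scan;
}
```

In this model the loop runs six times, matching `Iters = 6`, and every step reads and writes the same running value, which is the property the summary credits for the lower register pressure.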

test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll

Lines changed: 0 additions & 2 deletions
@@ -47,7 +47,6 @@ entry:
 ; GFX8MORE: v_mov_b32_dpp v[[wave_shr1:[0-9]+]], v{{[0-9]+}} wave_shr:1 row_mask:0xf bank_mask:0xf
 ; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:1 row_mask:0xf bank_mask:0xf
 ; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:3 row_mask:0xf bank_mask:0xf
 ; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:4 row_mask:0xf bank_mask:0xe
 ; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:8 row_mask:0xf bank_mask:0xc
 ; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_bcast:15 row_mask:0xa bank_mask:0xf
@@ -115,7 +114,6 @@ entry:
 ; GFX8MORE: v_mov_b32_dpp v[[wave_shr1:[0-9]+]], v{{[0-9]+}} wave_shr:1 row_mask:0xf bank_mask:0xf
 ; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:1 row_mask:0xf bank_mask:0xf
 ; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:3 row_mask:0xf bank_mask:0xf
 ; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:4 row_mask:0xf bank_mask:0xe
 ; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:8 row_mask:0xf bank_mask:0xc
 ; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_bcast:15 row_mask:0xa bank_mask:0xf
