Commit fe45cb1

[AMDGPU] Identify vector idiom to unlock SROA
HIP vector types often lower to aggregates and get copied with memcpy. When the source or destination is chosen via a pointer select, SROA cannot split the aggregate. This keeps data in stack slots and increases scratch traffic. By rewriting these memcpy idioms, we enable SROA to promote values, reducing stack usage and improving occupancy and bandwidth on AMD GPUs. For example:

  %p = select i1 %cond, ptr %A, ptr %B
  call void @llvm.memcpy.p0.p0.i32(ptr %dst, ptr %p, i32 16, i1 false)

When the source is a pointer select and conditions allow, the pass replaces the memcpy with two aligned loads, a value-level select of the loaded vector, and one aligned store. If it is not safe to speculate both loads, it splits control flow and emits a memcpy in each arm. When the destination is a select, the pass always splits control flow to avoid speculative stores. Vector element types are chosen based on size and minimum proven alignment to minimize the number of operations.

The pass handles non-volatile, constant-length memcpys up to a small size cap; source and destination must be in the same address space. Volatile and cross-address-space memcpys are skipped. The pass runs early, after inlining and before InferAddressSpaces and SROA. The size cap is controlled by -amdgpu-vector-idiom-max-bytes (default 32), allowing tuning for different workloads.

Fixes: SWDEV-550134
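As a rough illustration of the idiom in source terms (a minimal sketch in plain C++ standing in for HIP; `Vec4`, `copyBefore`, and `copyAfter` are hypothetical names, not part of the patch), the pattern before and after the value-level rewrite looks like:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical 16-byte aggregate standing in for a HIP vector type.
struct Vec4 { float x, y, z, w; };

// Before: the copy source is chosen through a pointer select, so the
// assignment lowers to a memcpy whose source operand is a select --
// the pattern that blocks SROA from splitting the aggregate.
void copyBefore(bool cond, const Vec4 *A, const Vec4 *B, Vec4 *dst) {
  const Vec4 *p = cond ? A : B;        // pointer select
  std::memcpy(dst, p, sizeof(Vec4));   // memcpy from the selected pointer
}

// After (the shape the pass aims for when both loads are safe to
// speculate): load both sides, select the value, store once.
void copyAfter(bool cond, const Vec4 *A, const Vec4 *B, Vec4 *dst) {
  Vec4 a = *A;            // aligned load of one side
  Vec4 b = *B;            // aligned load of the other side
  *dst = cond ? a : b;    // value-level select, then a single store
}
```

Note that `copyAfter` dereferences both pointers unconditionally, which is only legal when both loads can be speculated; this is why the pass falls back to splitting control flow (a memcpy in each arm) when speculation is unsafe, and always splits when the destination is the select.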
1 parent f4087f6 commit fe45cb1

File tree

6 files changed: +965 −0 lines changed


llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def

Lines changed: 2 additions & 0 deletions
@@ -69,6 +69,8 @@ FUNCTION_PASS("amdgpu-simplifylib", AMDGPUSimplifyLibCallsPass())
 FUNCTION_PASS("amdgpu-unify-divergent-exit-nodes",
               AMDGPUUnifyDivergentExitNodesPass())
 FUNCTION_PASS("amdgpu-usenative", AMDGPUUseNativeCallsPass())
+FUNCTION_PASS("amdgpu-vector-idiom",
+              AMDGPUVectorIdiomCombinePass(/*MaxBytes=*/32))
 FUNCTION_PASS("si-annotate-control-flow", SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
 #undef FUNCTION_PASS

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

Lines changed: 9 additions & 0 deletions
@@ -29,6 +29,7 @@
 #include "AMDGPUTargetObjectFile.h"
 #include "AMDGPUTargetTransformInfo.h"
 #include "AMDGPUUnifyDivergentExitNodes.h"
+#include "AMDGPUVectorIdiom.h"
 #include "AMDGPUWaitSGPRHazards.h"
 #include "GCNDPPCombine.h"
 #include "GCNIterativeScheduler.h"
@@ -849,6 +850,12 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
             EnablePromoteKernelArguments)
           FPM.addPass(AMDGPUPromoteKernelArgumentsPass());

+        // Run vector-idiom canonicalization early (after inlining) and before
+        // infer-AS / SROA to maximize scalarization opportunities.
+        // Specify 32 bytes since the largest HIP vector types are double4 or
+        // long4.
+        FPM.addPass(AMDGPUVectorIdiomCombinePass(/*MaxBytes=*/32));
+
         // Add infer address spaces pass to the opt pipeline after inlining
         // but before SROA to increase SROA opportunities.
         FPM.addPass(InferAddressSpacesPass());
@@ -911,6 +918,8 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
     if (EnableLowerModuleLDS)
       PM.addPass(AMDGPULowerModuleLDSPass(*this));
     if (Level != OptimizationLevel::O0) {
+      PM.addPass(createModuleToFunctionPassAdaptor(
+          AMDGPUVectorIdiomCombinePass(/*MaxBytes=*/32)));
       // Do we really need internalization in LTO?
       if (InternalizeSymbols) {
         PM.addPass(InternalizePass(mustPreserveGV));
