Commit f481f5b

[OpenMP][flang] Add initial support for by-ref reductions on the GPU (#165714)
Adds initial support for GPU by-ref reductions. The main problem for reduction by reference is that, prior to this PR, we were shuffling (from remote lanes within the same warp or across different warps within the block) pointers/references to the private reduction values rather than the private reduction values themselves.

In particular, this diff adds support for reductions on scalar allocatables where reductions happen on loops nested in `target` regions. For example:

```fortran
integer :: i
real, allocatable :: scalar_alloc

allocate(scalar_alloc)
scalar_alloc = 0

!$omp target map(tofrom: scalar_alloc)
!$omp parallel do reduction(+: scalar_alloc)
do i = 1, 1000000
  scalar_alloc = scalar_alloc + 1
end do
!$omp end target
```

This PR supports by-ref reductions on the intra- and inter-warp levels. There are still steps to be taken for full support of by-ref reductions, for example:

* Inter-block value combination is still not supported. Therefore, `target teams distribute parallel do` is still not supported.
* Support for dynamically-sized arrays still needs to be added.
* Support for more than one allocatable/array on the same `reduction` clause still needs to be added.
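The point about shuffling values rather than pointers can be illustrated with a small host-side model. This is only a sketch under stated assumptions: `warpReduceByValue` is a hypothetical helper, not code from this commit or from the OpenMP runtime, and the "shuffle" is modeled with plain array reads. It shows the tree-shaped combination in which each lane receives a copy of a remote lane's partial value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of an intra-warp tree reduction. Each element of `lanes`
// stands for the private partial result held by one lane.
// Hypothetical helper for illustration only.
float warpReduceByValue(std::vector<float> lanes) {
  const std::size_t width = lanes.size();
  assert(width != 0 && (width & (width - 1)) == 0 &&
         "width must be a power of two");
  for (std::size_t offset = width / 2; offset > 0; offset /= 2) {
    // Model a "shuffle down": lane i receives a copy of the *value* held by
    // lane i + offset and combines it with its own partial result. Sending
    // only a pointer to the remote lane's private storage would not, in
    // general, give the receiving lane access to that value.
    for (std::size_t lane = 0; lane < offset; ++lane)
      lanes[lane] += lanes[lane + offset];
  }
  return lanes[0]; // lane 0 ends up holding the combined result
}
```

With 32 lanes that each hold `1.0f`, the model returns `32.0f` after five halving steps.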
1 parent 63e4b8c commit f481f5b

35 files changed: +500 -114 lines changed

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp

Lines changed: 5 additions & 4 deletions
@@ -1727,7 +1727,7 @@ void CGOpenMPRuntimeGPU::emitReduction(
  CGF.Builder.GetInsertPoint());
  llvm::OpenMPIRBuilder::LocationDescription OmpLoc(
  CodeGenIP, CGF.SourceLocToDebugLoc(Loc));
- llvm::SmallVector<llvm::OpenMPIRBuilder::ReductionInfo> ReductionInfos;
+ llvm::SmallVector<llvm::OpenMPIRBuilder::ReductionInfo, 2> ReductionInfos;

  CodeGenFunction::OMPPrivateScope Scope(CGF);
  unsigned Idx = 0;
@@ -1780,14 +1780,15 @@ void CGOpenMPRuntimeGPU::emitReduction(
  };
  ReductionInfos.emplace_back(llvm::OpenMPIRBuilder::ReductionInfo(
  ElementType, Variable, PrivateVariable, EvalKind,
- /*ReductionGen=*/nullptr, ReductionGen, AtomicReductionGen));
+ /*ReductionGen=*/nullptr, ReductionGen, AtomicReductionGen,
+ /*DataPtrPtrGen=*/nullptr));
  Idx++;
  }

  llvm::OpenMPIRBuilder::InsertPointTy AfterIP =
  cantFail(OMPBuilder.createReductionsGPU(
- OmpLoc, AllocaIP, CodeGenIP, ReductionInfos, false, TeamsReduction,
- llvm::OpenMPIRBuilder::ReductionGenCBKind::Clang,
+ OmpLoc, AllocaIP, CodeGenIP, ReductionInfos, /*IsByRef=*/{}, false,
+ TeamsReduction, llvm::OpenMPIRBuilder::ReductionGenCBKind::Clang,
  CGF.getTarget().getGridValue(),
  C.getLangOpts().OpenMPCUDAReductionBufNum, RTLoc));
  CGF.Builder.restoreIP(AfterIP);
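Clang passes an empty list for the new `IsByRef` argument. One plausible reading, stated here as an assumption rather than as the actual `OpenMPIRBuilder` logic, is that an empty list means "no reduction is by-ref". The sketch below models that convention with a hypothetical helper `isByRefAt` and a plain `std::vector<bool>` standing in for the real argument type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: decides whether reduction number `idx` is by-ref.
// Assumption: an empty flag list means every reduction is by-value, which
// matches a caller that never produces by-ref reductions.
bool isByRefAt(const std::vector<bool> &isByRef, std::size_t idx) {
  // When flags are supplied, there must be one per reduction.
  assert(isByRef.empty() || idx < isByRef.size());
  return !isByRef.empty() && isByRef[idx];
}
```

Under this convention a caller that only ever reduces by value can pass `{}` instead of building a list of `false` entries.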

flang/include/flang/Optimizer/Dialect/FIROps.td

Lines changed: 14 additions & 4 deletions
@@ -3753,7 +3753,7 @@ def fir_DeclareReductionOp : fir_Op<"declare_reduction", [IsolatedFromAbove,
  duplication at the moment. TODO Combine both ops into one. See:
  https://discourse.llvm.org/t/dialect-for-data-locality-sharing-specifiers-clauses-in-openmp-openacc-and-do-concurrent/86108.

- Declares a `do concurrent` reduction. This requires two mandatory and three
+ Declares a `do concurrent` reduction. This requires two mandatory and four
  optional regions.

  1. The optional alloc region specifies how to allocate the thread-local
@@ -3782,30 +3782,40 @@ def fir_DeclareReductionOp : fir_Op<"declare_reduction", [IsolatedFromAbove,
  allocated by the initializer region. The region has an argument that
  contains the value of the thread-local reduction accumulator. This will
  be executed after the reduction has completed.
+ 6. The DataPtrPtr region specifies how to access the base address of a
+ boxed-value. This is used, in particular, for GPU reductions in order
+ know where partial reduction results are stored in remote lanes.

  Note that the MLIR type system does not allow for type-polymorphic
  reductions. Separate reduction declarations should be created for different
  element and accumulator types.

  For initializer and reduction regions, the operand to `fir.yield` must
  match the parent operation's results.
+
+ * `$byref_element_type`: For by-ref reductions, we want to keep track of the
+ boxed/allocated type. For example, for a `real, allocatable` variable,
+ `real` should be stored in this attribute.
  }];

  let arguments = (ins SymbolNameAttr:$sym_name,
- TypeAttr:$type);
+ TypeAttr:$type,
+ OptionalAttr<TypeAttr>:$byref_element_type);

  let regions = (region MaxSizedRegion<1>:$allocRegion,
  AnyRegion:$initializerRegion,
  AnyRegion:$reductionRegion,
  AnyRegion:$atomicReductionRegion,
- AnyRegion:$cleanupRegion);
+ AnyRegion:$cleanupRegion,
+ AnyRegion:$dataPtrPtrRegion);

  let assemblyFormat = "$sym_name `:` $type attr-dict-with-keyword "
  "( `alloc` $allocRegion^ )? "
  "`init` $initializerRegion "
  "`combiner` $reductionRegion "
  "( `atomic` $atomicReductionRegion^ )? "
- "( `cleanup` $cleanupRegion^ )? ";
+ "( `cleanup` $cleanupRegion^ )? "
+ "( `data_ptr_ptr` $dataPtrPtrRegion^ )? ";

  let extraClassDeclaration = [{
  mlir::BlockArgument getAllocMoldArg() {
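The role of the DataPtrPtr region can be pictured with a simplified host-side model. Everything in this sketch is an assumption made for illustration: `Box` is a two-field stand-in for a Fortran descriptor, and `receiveRemoteValue` is a hypothetical helper showing one way the yielded address might be used once a remote lane's value has been copied into local storage.

```cpp
#include <cassert>

// Simplified stand-in for a Fortran descriptor ("box"). The real descriptor
// has more fields; this layout is illustrative only.
struct Box {
  float *base_addr; // address of the data the box describes
  long elem_len;    // illustrative extra field
};

// Models what a data_ptr_ptr region yields: given a reference to a box,
// the address of its base-address field (a pointer to the data pointer).
float **dataPtrPtr(Box *box) { return &box->base_addr; }

// Hypothetical helper: copy a value received from a remote lane into local
// staging storage, then redirect the local box to that storage.
void receiveRemoteValue(Box *localBox, float remoteValue, float *staging) {
  *staging = remoteValue;
  *dataPtrPtr(localBox) = staging;
}
```

After `receiveRemoteValue`, reading through the local box yields the remote lane's partial result, so a combiner that takes boxes can run unchanged on the received value.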

flang/lib/Lower/Support/ReductionProcessor.cpp

Lines changed: 45 additions & 2 deletions
@@ -572,10 +572,21 @@ DeclareRedType ReductionProcessor::createDeclareReductionHelper(

  mlir::OpBuilder modBuilder(module.getBodyRegion());
  mlir::Type valTy = fir::unwrapRefType(type);
- if (!isByRef)
+
+ // For by-ref reductions, we want to keep track of the
+ // boxed/referenced/allocated type. For example, for a `real, allocatable`
+ // variable, `real` should be stored.
+ mlir::TypeAttr boxedTyAttr{};
+ mlir::Type boxedTy;
+
+ if (isByRef) {
+ boxedTy = fir::unwrapPassByRefType(valTy);
+ boxedTyAttr = mlir::TypeAttr::get(boxedTy);
+ } else
  type = valTy;

- decl = DeclareRedType::create(modBuilder, loc, reductionOpName, type);
+ decl = DeclareRedType::create(modBuilder, loc, reductionOpName, type,
+ boxedTyAttr);
  createReductionAllocAndInitRegions(converter, loc, decl, genInitValueCB, type,
  isByRef);
  builder.createBlock(&decl.getReductionRegion(),
@@ -585,6 +596,38 @@ DeclareRedType ReductionProcessor::createDeclareReductionHelper(
  mlir::Value op1 = decl.getReductionRegion().front().getArgument(0);
  mlir::Value op2 = decl.getReductionRegion().front().getArgument(1);
  genCombinerCB(builder, loc, type, op1, op2, isByRef);
+
+ if (isByRef && fir::isa_box_type(valTy)) {
+ bool isBoxReductionSupported = [&]() {
+ auto offloadMod = llvm::dyn_cast<mlir::omp::OffloadModuleInterface>(
+ *builder.getModule());
+
+ // This check tests the implementation status on the GPU. Box reductions
+ // are fully supported on the CPU.
+ if (!offloadMod.getIsGPU())
+ return true;
+
+ auto seqTy = mlir::dyn_cast<fir::SequenceType>(boxedTy);
+
+ // Dynamically-shaped arrays are not supported yet on the GPU.
+ return !seqTy || !fir::sequenceWithNonConstantShape(seqTy);
+ }();
+
+ if (!isBoxReductionSupported) {
+ TODO(loc, "Reduction of dynamically-shaped arrays are not supported yet "
+ "on the GPU.");
+ }
+
+ mlir::Region &dataPtrPtrRegion = decl.getDataPtrPtrRegion();
+ mlir::Block &dataAddrBlock = *builder.createBlock(
+ &dataPtrPtrRegion, dataPtrPtrRegion.end(), {type}, {loc});
+ builder.setInsertionPointToEnd(&dataAddrBlock);
+ mlir::Value boxRefOperand = dataAddrBlock.getArgument(0);
+ mlir::Value baseAddrOffset = fir::BoxOffsetOp::create(
+ builder, loc, boxRefOperand, fir::BoxFieldAttr::base_addr);
+ genYield<DeclareRedType>(builder, loc, baseAddrOffset);
+ }
+
  return decl;
  }
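The support check in the hunk above reduces to a small decision table: box reductions are accepted on the CPU, and on the GPU they are accepted unless the boxed type is an array with a non-constant shape. The following standalone sketch restates that logic without MLIR types. The shape encoding (`std::nullopt` for scalars, `kDynamicExtent` for extents unknown at compile time) is an assumption made for this example and is not how FIR represents shapes.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical marker for an extent that is only known at run time.
constexpr std::int64_t kDynamicExtent = -1;

// `shape` is std::nullopt for a scalar and holds one extent per dimension
// for an array. The function name mirrors the lambda in the diff, but this
// is a standalone model rather than the Flang code.
bool isBoxReductionSupported(
    bool isGPU, const std::optional<std::vector<std::int64_t>> &shape) {
  if (!isGPU)
    return true; // box reductions are fully supported on the CPU
  if (!shape)
    return true; // not an array, e.g. a scalar allocatable
  for (std::int64_t extent : *shape) {
    assert(extent == kDynamicExtent || extent >= 0);
    if (extent == kDynamicExtent)
      return false; // dynamically-shaped arrays: not yet supported on GPU
  }
  return true;
}
```

In this model a `real, allocatable` scalar and a `3x2` array are accepted on the GPU, while an array with a deferred extent is rejected there and accepted on the CPU.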

flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp

Lines changed: 2 additions & 1 deletion
@@ -848,7 +848,8 @@ class DoConcurrentConversion
  if (!ompReducer) {
  ompReducer = mlir::omp::DeclareReductionOp::create(
  rewriter, firReducer.getLoc(), ompReducerName,
- firReducer.getTypeAttr().getValue());
+ firReducer.getTypeAttr().getValue(),
+ firReducer.getByrefElementTypeAttr());

  cloneFIRRegionToOMP(rewriter, firReducer.getAllocRegion(),
  ompReducer.getAllocRegion());

flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ subroutine red_and_delayed_private
  ! CHECK-SAME: @[[PRIVATIZER_SYM:.*]] : i32

  ! CHECK-LABEL: omp.declare_reduction
- ! CHECK-SAME: @[[REDUCTION_SYM:.*]] : !fir.ref<i32> alloc
+ ! CHECK-SAME: @[[REDUCTION_SYM:.*]] : !fir.ref<i32> attributes {byref_element_type = i32} alloc

  ! CHECK-LABEL: _QPred_and_delayed_private
  ! CHECK: omp.parallel

flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ program reduce

  end program

- ! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> alloc {
+ ! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> attributes {byref_element_type = !fir.array<?xi32>} alloc {
  ! CHECK: %[[VAL_10:.*]] = fir.alloca !fir.box<!fir.heap<!fir.array<?xi32>>>
  ! CHECK: omp.yield(%[[VAL_10]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>)
  ! CHECK-LABEL: } init {

flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ program reduce

  end program

- ! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x2xi32 : !fir.ref<!fir.box<!fir.array<3x2xi32>>> alloc {
+ ! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x2xi32 : !fir.ref<!fir.box<!fir.array<3x2xi32>>> {{.*}} alloc {
  ! CHECK: %[[VAL_15:.*]] = fir.alloca !fir.box<!fir.array<3x2xi32>>
  ! CHECK: omp.yield(%[[VAL_15]] : !fir.ref<!fir.box<!fir.array<3x2xi32>>>)
  ! CHECK-LABEL: } init {
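Some of the updated tests spell out the new `attributes {byref_element_type = ...}` dictionary, while others, like the one above, use the `{{.*}}` regex placeholder to accept any text between the type and `alloc`. The sketch below approximates that relaxed match with `std::regex`. It is an approximation only: FileCheck has its own matching rules (whitespace handling in particular), and `escapeLiteral` and `matchesRelaxedCheck` are hypothetical helpers written for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Escapes regex metacharacters so that `text` is matched literally.
std::string escapeLiteral(const std::string &text) {
  static const std::string meta = R"re(\^$.|?*+()[]{})re";
  std::string out;
  for (char c : text) {
    if (meta.find(c) != std::string::npos)
      out += '\\';
    out += c;
  }
  return out;
}

// Approximates the relaxed CHECK line: fixed text, any characters, then
// fixed text again. The wildcard stands in for the attribute dictionary.
bool matchesRelaxedCheck(const std::string &line) {
  const std::string prefix =
      "omp.declare_reduction @add_reduction_byref_box_3x2xi32 : "
      "!fir.ref<!fir.box<!fir.array<3x2xi32>>> ";
  const std::string suffix = " alloc {";
  const std::regex pattern(escapeLiteral(prefix) + ".*" +
                           escapeLiteral(suffix));
  return std::regex_search(line, pattern);
}
```

The trade-off is the usual one for wildcards in tests: the relaxed line does not need updating if the attribute dictionary changes again, but it also no longer verifies the attribute's value.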

flang/test/Lower/OpenMP/parallel-reduction-array.f90

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ program reduce
  print *,i
  end program

- ! CPU-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> alloc {
+ ! CPU-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> attributes {byref_element_type = !fir.array<3xi32>} alloc {
  ! CPU: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<3xi32>>
  ! CPU: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<3xi32>>>)
  ! CPU-LABEL: } init {

flang/test/Lower/OpenMP/parallel-reduction-array2.f90

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ program reduce
  print *,i
  end program

- ! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> alloc {
+ ! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> {{.*}} alloc {
  ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<3xi32>>
  ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<3xi32>>>)
  ! CHECK-LABEL: } init {

flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ program reduce

  end program

- ! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_Uxi32 : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>> alloc {
+ ! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_Uxi32 : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>> attributes {byref_element_type = !fir.array<?xi32>} alloc {
  ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.ptr<!fir.array<?xi32>>>
  ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>>)
  ! CHECK-LABEL: } init {
