[LoadStoreVectorizer] Fill gaps in load/store chains to enable vectorization #159388
@@ -45,29 +45,31 @@ define half @fh(ptr %p) {
; ENABLED-LABEL: fh(
; ENABLED: {
; ENABLED-NEXT: .reg .b16 %rs<10>;
; ENABLED-NEXT: .reg .b32 %r<13>;
; ENABLED-NEXT: .reg .b32 %r<17>;
; ENABLED-NEXT: .reg .b64 %rd<2>;
; ENABLED-EMPTY:
; ENABLED-NEXT: // %bb.0:
; ENABLED-NEXT: ld.param.b64 %rd1, [fh_param_0];
; ENABLED-NEXT: ld.v4.b16 {%rs1, %rs2, %rs3, %rs4}, [%rd1];
; ENABLED-NEXT: ld.b16 %rs5, [%rd1+8];
; ENABLED-NEXT: cvt.f32.f16 %r1, %rs2;
; ENABLED-NEXT: cvt.f32.f16 %r2, %rs1;
; ENABLED-NEXT: add.rn.f32 %r3, %r2, %r1;
; ENABLED-NEXT: cvt.rn.f16.f32 %rs6, %r3;
; ENABLED-NEXT: cvt.f32.f16 %r4, %rs4;
; ENABLED-NEXT: cvt.f32.f16 %r5, %rs3;
; ENABLED-NEXT: add.rn.f32 %r6, %r5, %r4;
; ENABLED-NEXT: cvt.rn.f16.f32 %rs7, %r6;
; ENABLED-NEXT: cvt.f32.f16 %r7, %rs7;
; ENABLED-NEXT: cvt.f32.f16 %r8, %rs6;
; ENABLED-NEXT: add.rn.f32 %r9, %r8, %r7;
; ENABLED-NEXT: cvt.rn.f16.f32 %rs8, %r9;
; ENABLED-NEXT: cvt.f32.f16 %r10, %rs8;
; ENABLED-NEXT: cvt.f32.f16 %r11, %rs5;
; ENABLED-NEXT: add.rn.f32 %r12, %r10, %r11;
; ENABLED-NEXT: cvt.rn.f16.f32 %rs9, %r12;
; ENABLED-NEXT: ld.v4.b32 {%r1, %r2, %r3, %r4}, [%rd1];
Reviewer: This does not look right. Our input is presumably an array of […]

Author: Note the difference in the number of ld instructions in the PTX. The old output uses two load instructions to load 5 b16s: an ld.v4.b16 and an ld.b16. The new version, in the LSV, "extends" the chain of 5 loads to the next power of two, a chain of 8 loads with 3 unused tail elements, vectorizing it into a single […]. This reduction from two load instructions to one is an optimization.

Reviewer: I had missed the 5th load of f16; the generated code looks correct. My next question is whether this extension is always beneficial. For example, if we do it on shared memory, it may increase bank contention due to the extra loads; in the worst case we would waste ~25% of shared-memory bandwidth for this particular extension from v5f16 to v4b32. I think we should take AS info into account and have some sort of user-controllable knob to enable/disable the gap filling, if needed. It is probably always good for loads from the global AS; it is a maybe for shared memory (fewer instructions may win over bank conflicts if the extra loads happen to be broadcast to other threads' loads, but would waste bandwidth otherwise); and we cannot say much about the generic AS, as it could go either way.

Reply: I don't think that's a concern for CUDA GPUs, but it's a good idea to add the AS as a parameter to the TTI API; other targets may want to control this feature for specific address spaces.
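A user-controllable knob like the one suggested above could be a simple pass option. The sketch below is illustrative only; the flag name, default value, and wiring are hypothetical and not something this PR is known to add:

```cpp
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Hypothetical command-line option for the LoadStoreVectorizer; the real
// patch may instead (or additionally) gate gap filling behind a TTI hook.
static cl::opt<bool> FillLoadGaps(
    "lsv-fill-load-gaps", cl::init(true), cl::Hidden,
    cl::desc("Insert extra loads to fill gaps in a chain so the chain can "
             "be vectorized into a single wide load"));
```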
; ENABLED-NEXT: { .reg .b16 tmp; mov.b32 {%rs1, tmp}, %r3; }
; ENABLED-NEXT: mov.b32 {%rs2, %rs3}, %r2;
; ENABLED-NEXT: mov.b32 {%rs4, %rs5}, %r1;
; ENABLED-NEXT: cvt.f32.f16 %r5, %rs5;
; ENABLED-NEXT: cvt.f32.f16 %r6, %rs4;
; ENABLED-NEXT: add.rn.f32 %r7, %r6, %r5;
; ENABLED-NEXT: cvt.rn.f16.f32 %rs6, %r7;
; ENABLED-NEXT: cvt.f32.f16 %r8, %rs3;
; ENABLED-NEXT: cvt.f32.f16 %r9, %rs2;
; ENABLED-NEXT: add.rn.f32 %r10, %r9, %r8;
; ENABLED-NEXT: cvt.rn.f16.f32 %rs7, %r10;
; ENABLED-NEXT: cvt.f32.f16 %r11, %rs7;
; ENABLED-NEXT: cvt.f32.f16 %r12, %rs6;
; ENABLED-NEXT: add.rn.f32 %r13, %r12, %r11;
; ENABLED-NEXT: cvt.rn.f16.f32 %rs8, %r13;
; ENABLED-NEXT: cvt.f32.f16 %r14, %rs8;
; ENABLED-NEXT: cvt.f32.f16 %r15, %rs1;
; ENABLED-NEXT: add.rn.f32 %r16, %r14, %r15;
; ENABLED-NEXT: cvt.rn.f16.f32 %rs9, %r16;
; ENABLED-NEXT: st.param.b16 [func_retval0], %rs9;
; ENABLED-NEXT: ret;
;
@@ -0,0 +1,81 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -mtriple=nvptx64-nvidia-cuda -passes=load-store-vectorizer -S -o - %s | FileCheck %s

;; Check that the vectorizer extends a Chain to the next power of two,
;; essentially loading more vector elements than the original
;; code. Alignment and other requirements for vectorization should
;; still be met.

define void @load3to4(ptr %p) #0 {
; CHECK-LABEL: define void @load3to4(
; CHECK-SAME: ptr [[P:%.*]]) {
; CHECK-NEXT: [[P_0:%.*]] = getelementptr i32, ptr [[P]], i32 0
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr [[P_0]], align 16
; CHECK-NEXT: [[V01:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[V12:%.*]] = extractelement <4 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[V23:%.*]] = extractelement <4 x i32> [[TMP1]], i32 2
; CHECK-NEXT: [[EXTEND4:%.*]] = extractelement <4 x i32> [[TMP1]], i32 3
; CHECK-NEXT: ret void
;
%p.0 = getelementptr i32, ptr %p, i32 0
%p.1 = getelementptr i32, ptr %p, i32 1
%p.2 = getelementptr i32, ptr %p, i32 2

%v0 = load i32, ptr %p.0, align 16
%v1 = load i32, ptr %p.1, align 4
%v2 = load i32, ptr %p.2, align 8

ret void
}

define void @load5to8(ptr %p) #0 {
; CHECK-LABEL: define void @load5to8(
; CHECK-SAME: ptr [[P:%.*]]) {
; CHECK-NEXT: [[P_0:%.*]] = getelementptr i16, ptr [[P]], i32 0
; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i16>, ptr [[P_0]], align 16
; CHECK-NEXT: [[V05:%.*]] = extractelement <8 x i16> [[TMP1]], i32 0
; CHECK-NEXT: [[V16:%.*]] = extractelement <8 x i16> [[TMP1]], i32 1
; CHECK-NEXT: [[V27:%.*]] = extractelement <8 x i16> [[TMP1]], i32 2
; CHECK-NEXT: [[V38:%.*]] = extractelement <8 x i16> [[TMP1]], i32 3
; CHECK-NEXT: [[V49:%.*]] = extractelement <8 x i16> [[TMP1]], i32 4
; CHECK-NEXT: [[EXTEND10:%.*]] = extractelement <8 x i16> [[TMP1]], i32 5
; CHECK-NEXT: [[EXTEND211:%.*]] = extractelement <8 x i16> [[TMP1]], i32 6
; CHECK-NEXT: [[EXTEND412:%.*]] = extractelement <8 x i16> [[TMP1]], i32 7
; CHECK-NEXT: ret void
;
%p.0 = getelementptr i16, ptr %p, i32 0
%p.1 = getelementptr i16, ptr %p, i32 1
%p.2 = getelementptr i16, ptr %p, i32 2
%p.3 = getelementptr i16, ptr %p, i32 3
%p.4 = getelementptr i16, ptr %p, i32 4

%v0 = load i16, ptr %p.0, align 16
%v1 = load i16, ptr %p.1, align 2
%v2 = load i16, ptr %p.2, align 4
%v3 = load i16, ptr %p.3, align 8
%v4 = load i16, ptr %p.4, align 2

ret void
}

define void @load3to4_unaligned(ptr %p) #0 {
; CHECK-LABEL: define void @load3to4_unaligned(
; CHECK-SAME: ptr [[P:%.*]]) {
; CHECK-NEXT: [[P_0:%.*]] = getelementptr i32, ptr [[P]], i32 0
; CHECK-NEXT: [[P_2:%.*]] = getelementptr i32, ptr [[P]], i32 2
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, ptr [[P_0]], align 8
; CHECK-NEXT: [[V01:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[V12:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[V2:%.*]] = load i32, ptr [[P_2]], align 8
; CHECK-NEXT: ret void
;
%p.0 = getelementptr i32, ptr %p, i32 0
%p.1 = getelementptr i32, ptr %p, i32 1
%p.2 = getelementptr i32, ptr %p, i32 2

%v0 = load i32, ptr %p.0, align 8
%v1 = load i32, ptr %p.1, align 4
%v2 = load i32, ptr %p.2, align 8

ret void
}
@@ -0,0 +1,37 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -mtriple=nvptx64-nvidia-cuda -passes=load-store-vectorizer -S < %s | FileCheck %s

; Test that gap filled instructions get deleted if they are not used
%struct.S10 = type { i32, i32, i32, i32 }

; First, confirm that gap instructions get generated and would be vectorized if the alignment is correct
define void @fillTwoGapsCanVectorize(ptr %in) {
; CHECK-LABEL: define void @fillTwoGapsCanVectorize(
; CHECK-SAME: ptr [[IN:%.*]]) {
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr [[IN]], align 16
; CHECK-NEXT: [[LOAD03:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[GAPFILL4:%.*]] = extractelement <4 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[GAPFILL25:%.*]] = extractelement <4 x i32> [[TMP1]], i32 2
; CHECK-NEXT: [[LOAD36:%.*]] = extractelement <4 x i32> [[TMP1]], i32 3
; CHECK-NEXT: ret void
;
%load0 = load i32, ptr %in, align 16
%getElem = getelementptr i8, ptr %in, i64 12
%load3 = load i32, ptr %getElem, align 4
ret void
}

; Then, confirm that gap instructions get deleted if the alignment prevents the vectorization
define void @fillTwoGapsCantVectorize(ptr %in) {
; CHECK-LABEL: define void @fillTwoGapsCantVectorize(
; CHECK-SAME: ptr [[IN:%.*]]) {
; CHECK-NEXT: [[LOAD0:%.*]] = load i32, ptr [[IN]], align 4
; CHECK-NEXT: [[GETELEM:%.*]] = getelementptr i8, ptr [[IN]], i64 12
; CHECK-NEXT: [[LOAD3:%.*]] = load i32, ptr [[GETELEM]], align 4
; CHECK-NEXT: ret void
;
%load0 = load i32, ptr %in, align 4
%getElem = getelementptr i8, ptr %in, i64 12
%load3 = load i32, ptr %getElem, align 4
ret void
}
@@ -0,0 +1,83 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -mtriple=nvptx64-nvidia-cuda -passes=load-store-vectorizer -S < %s | FileCheck %s

; Test that gap filled instructions don't lose invariant metadata
%struct.S10 = type { i32, i32, i32, i32 }

; With no gaps, if every load is invariant, the vectorized load will be too.
define i32 @noGaps(ptr %in) {
; CHECK-LABEL: define i32 @noGaps(
; CHECK-SAME: ptr [[IN:%.*]]) {
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr [[IN]], align 16, !invariant.load [[META0:![0-9]+]]
; CHECK-NEXT: [[TMP01:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[TMP23:%.*]] = extractelement <4 x i32> [[TMP1]], i32 2
; CHECK-NEXT: [[TMP34:%.*]] = extractelement <4 x i32> [[TMP1]], i32 3
; CHECK-NEXT: [[SUM01:%.*]] = add i32 [[TMP01]], [[TMP12]]
; CHECK-NEXT: [[SUM012:%.*]] = add i32 [[SUM01]], [[TMP23]]
; CHECK-NEXT: [[SUM0123:%.*]] = add i32 [[SUM012]], [[TMP34]]
; CHECK-NEXT: ret i32 [[SUM0123]]
;
%load0 = load i32, ptr %in, align 16, !invariant.load !0
%getElem1 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 1
%load1 = load i32, ptr %getElem1, align 4, !invariant.load !0
%getElem2 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 2
%load2 = load i32, ptr %getElem2, align 4, !invariant.load !0
%getElem3 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 3
%load3 = load i32, ptr %getElem3, align 4, !invariant.load !0
%sum01 = add i32 %load0, %load1
%sum012 = add i32 %sum01, %load2
%sum0123 = add i32 %sum012, %load3
ret i32 %sum0123
}

; If one of the loads is not invariant, the vectorized load will not be invariant.
define i32 @noGapsMissingInvariant(ptr %in) {
; CHECK-LABEL: define i32 @noGapsMissingInvariant(
; CHECK-SAME: ptr [[IN:%.*]]) {
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr [[IN]], align 16
; CHECK-NEXT: [[TMP01:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[TMP23:%.*]] = extractelement <4 x i32> [[TMP1]], i32 2
; CHECK-NEXT: [[TMP34:%.*]] = extractelement <4 x i32> [[TMP1]], i32 3
; CHECK-NEXT: [[SUM01:%.*]] = add i32 [[TMP01]], [[TMP12]]
; CHECK-NEXT: [[SUM012:%.*]] = add i32 [[SUM01]], [[TMP23]]
; CHECK-NEXT: [[SUM0123:%.*]] = add i32 [[SUM012]], [[TMP34]]
; CHECK-NEXT: ret i32 [[SUM0123]]
;
%load0 = load i32, ptr %in, align 16, !invariant.load !0
%getElem1 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 1
%load1 = load i32, ptr %getElem1, align 4, !invariant.load !0
%getElem2 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 2
%load2 = load i32, ptr %getElem2, align 4, !invariant.load !0
%getElem3 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 3
%load3 = load i32, ptr %getElem3, align 4
%sum01 = add i32 %load0, %load1
%sum012 = add i32 %sum01, %load2
%sum0123 = add i32 %sum012, %load3
ret i32 %sum0123
}

; With two gaps, if every real load is invariant, the vectorized load will be too.
define i32 @twoGaps(ptr %in) {
; CHECK-LABEL: define i32 @twoGaps(
; CHECK-SAME: ptr [[IN:%.*]]) {
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr [[IN]], align 16, !invariant.load [[META0]]
; CHECK-NEXT: [[LOAD03:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[GAPFILL4:%.*]] = extractelement <4 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[GAPFILL25:%.*]] = extractelement <4 x i32> [[TMP1]], i32 2
; CHECK-NEXT: [[LOAD36:%.*]] = extractelement <4 x i32> [[TMP1]], i32 3
; CHECK-NEXT: [[SUM:%.*]] = add i32 [[LOAD03]], [[LOAD36]]
; CHECK-NEXT: ret i32 [[SUM]]
;
%load0 = load i32, ptr %in, align 16, !invariant.load !0
%getElem3 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 3
%load3 = load i32, ptr %getElem3, align 4, !invariant.load !0
%sum = add i32 %load0, %load3
ret i32 %sum
}

!0 = !{}
;.
; CHECK: [[META0]] = !{}
;.
Reviewer: This needs a context argument.

Author: I'm assuming you're referring to an instance of LLVMContext, right? If so, done.

Reviewer: No. I mean that a hook without parameters corresponding to a specific load is close to useless. At minimum it would need the address space, alignment, type, etc.

Author: Got it. I was unsure what we may want to check, because for NVPTX the answer is "true" under any conditions, and I'm not familiar enough with other architectures to know what they may want to check. Address space and (element?) type sound reasonable enough, but alignment would make this less ergonomic to use. The use case of this API is to check whether widening loads is generally allowed for a given target, and the result is used to answer the question "can we attempt to fill gaps, which would eventually result in widened loads?" As documented in the comment, there is an assumption that the answer the API gives depends on the resulting widened load being sufficiently aligned. At the first point in the code where we call this API (splitChainByContiguity), we do not know the alignment of the resulting load, because alignment analysis and chain splitting (splitChainByAlignment) happen later. To incorporate an alignment argument into the API, we would have to do something similar to how I'm using the existing isLegalMaskedStore, which is this not-so-clean helper function: https://github.com/llvm/llvm-project/pull/159388/files#diff-e0eab10050c9ef433cb0d7bc38e32e8d7fe44cdb0cf7422ae7a98966bff53672R1865-R1892. The helper essentially converts the answer to "is this specific masked store legal?" into the more general "are masked stores generally legal on this target?", which is what we need to know at that point in the code. If you think that is best, I'm OK with it, but it feels a little hacky to me, so I was trying to come up with something better for this new API.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding AS as a parameter sounds reasonable to me, but I am not sure about the other "context". Currently, this feature is only enabled with the NVPTX target. If other targets want to enable this feature, they should modify the TTI API according to their own needs, rather than trying to guess someone else’s requirements here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LoadInst + type it wants to widen to
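Combining those suggestions, one possible shape for the hook would pass the concrete load plus the type it would be widened to (from which the address space is also derivable). The name and signature below are hypothetical, a sketch of the idea rather than the interface this PR actually adds:

```cpp
#include "llvm/IR/Instructions.h"
using namespace llvm;

/// Hypothetical TTI-style hook: should the vectorizer fill gaps around
/// \p LI so the containing chain can be loaded as \p WidenedTy, even
/// though the extra elements go unused? A target could answer per address
/// space, e.g. "yes" for global memory but "no" for shared memory, where
/// the extra traffic may cost bandwidth or cause bank conflicts.
static bool shouldFillLoadGaps(const LoadInst &LI, Type *WidenedTy) {
  unsigned AddrSpace = LI.getPointerAddressSpace();
  // Illustrative default mirroring NVPTX's "always profitable" answer; a
  // real target implementation would inspect AddrSpace and WidenedTy.
  (void)AddrSpace;
  (void)WidenedTy;
  return true;
}
```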