[LoadStoreVectorizer] Fill gaps in load/store chains to enable vectorization #159388
Open: dakersnar wants to merge 8 commits into llvm:main from dakersnar:github/dkersnar/lsv-gap-fill
Commits (8):
- 0eb9669 [LoadStoreVectorizer] Fill gaps in loads/stores to enable vectorization (dakersnar)
- 68a88d1 Clang format (dakersnar)
- 001b409 Remove cl opts (dakersnar)
- da7391b Add context argument to TTI API (dakersnar)
- 8380174 Update llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (dakersnar)
- f07f630 Update tests to test for masked load generation in the LSV (dakersnar)
- 73441cc Remove isLegalToWidenLoads API (dakersnar)
- 34a5cdf Change LSV to create masked loads (dakersnar)

The diff below reflects the first 4 of the 8 commits.
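In short: when a chain of loads has holes (elements that are never read), the vectorizer can synthesize accesses for the missing elements so the whole region becomes a single vector operation. A minimal before/after sketch, distilled from the tests in this PR (value names are illustrative):

; Before: two i32 loads with an 8-byte hole between them.
%load0 = load i32, ptr %in, align 16
%getElem = getelementptr i8, ptr %in, i64 12
%load3 = load i32, ptr %getElem, align 4

; After: the gap is filled and the chain becomes one vector load; the
; extracts for the unused middle elements are dropped as dead code.
%vec = load <4 x i32>, ptr %in, align 16
%load0.v = extractelement <4 x i32> %vec, i32 0
%load3.v = extractelement <4 x i32> %vec, i32 3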
llvm/test/Transforms/LoadStoreVectorizer/NVPTX/extend-chain.ll (new file, 81 additions)
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -mtriple=nvptx64-nvidia-cuda -passes=load-store-vectorizer -S -o - %s | FileCheck %s

;; Check that the vectorizer extends a Chain to the next power of two,
;; essentially loading more vector elements than the original
;; code. Alignment and other requirements for vectorization should
;; still be met.

define void @load3to4(ptr %p) #0 {
; CHECK-LABEL: define void @load3to4(
; CHECK-SAME: ptr [[P:%.*]]) {
; CHECK-NEXT: [[P_0:%.*]] = getelementptr i32, ptr [[P]], i32 0
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr [[P_0]], align 16
; CHECK-NEXT: [[V01:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[V12:%.*]] = extractelement <4 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[V23:%.*]] = extractelement <4 x i32> [[TMP1]], i32 2
; CHECK-NEXT: [[EXTEND4:%.*]] = extractelement <4 x i32> [[TMP1]], i32 3
; CHECK-NEXT: ret void
;
  %p.0 = getelementptr i32, ptr %p, i32 0
  %p.1 = getelementptr i32, ptr %p, i32 1
  %p.2 = getelementptr i32, ptr %p, i32 2

  %v0 = load i32, ptr %p.0, align 16
  %v1 = load i32, ptr %p.1, align 4
  %v2 = load i32, ptr %p.2, align 8

  ret void
}

define void @load5to8(ptr %p) #0 {
; CHECK-LABEL: define void @load5to8(
; CHECK-SAME: ptr [[P:%.*]]) {
; CHECK-NEXT: [[P_0:%.*]] = getelementptr i16, ptr [[P]], i32 0
; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i16>, ptr [[P_0]], align 16
; CHECK-NEXT: [[V05:%.*]] = extractelement <8 x i16> [[TMP1]], i32 0
; CHECK-NEXT: [[V16:%.*]] = extractelement <8 x i16> [[TMP1]], i32 1
; CHECK-NEXT: [[V27:%.*]] = extractelement <8 x i16> [[TMP1]], i32 2
; CHECK-NEXT: [[V38:%.*]] = extractelement <8 x i16> [[TMP1]], i32 3
; CHECK-NEXT: [[V49:%.*]] = extractelement <8 x i16> [[TMP1]], i32 4
; CHECK-NEXT: [[EXTEND10:%.*]] = extractelement <8 x i16> [[TMP1]], i32 5
; CHECK-NEXT: [[EXTEND211:%.*]] = extractelement <8 x i16> [[TMP1]], i32 6
; CHECK-NEXT: [[EXTEND412:%.*]] = extractelement <8 x i16> [[TMP1]], i32 7
; CHECK-NEXT: ret void
;
  %p.0 = getelementptr i16, ptr %p, i32 0
  %p.1 = getelementptr i16, ptr %p, i32 1
  %p.2 = getelementptr i16, ptr %p, i32 2
  %p.3 = getelementptr i16, ptr %p, i32 3
  %p.4 = getelementptr i16, ptr %p, i32 4

  %v0 = load i16, ptr %p.0, align 16
  %v1 = load i16, ptr %p.1, align 2
  %v2 = load i16, ptr %p.2, align 4
  %v3 = load i16, ptr %p.3, align 8
  %v4 = load i16, ptr %p.4, align 2

  ret void
}

define void @load3to4_unaligned(ptr %p) #0 {
; CHECK-LABEL: define void @load3to4_unaligned(
; CHECK-SAME: ptr [[P:%.*]]) {
; CHECK-NEXT: [[P_0:%.*]] = getelementptr i32, ptr [[P]], i32 0
; CHECK-NEXT: [[P_2:%.*]] = getelementptr i32, ptr [[P]], i32 2
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, ptr [[P_0]], align 8
; CHECK-NEXT: [[V01:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[V12:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[V2:%.*]] = load i32, ptr [[P_2]], align 8
; CHECK-NEXT: ret void
;
  %p.0 = getelementptr i32, ptr %p, i32 0
  %p.1 = getelementptr i32, ptr %p, i32 1
  %p.2 = getelementptr i32, ptr %p, i32 2

  %v0 = load i32, ptr %p.0, align 8
  %v1 = load i32, ptr %p.1, align 4
  %v2 = load i32, ptr %p.2, align 8

  ret void
}
llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill-cleanup.ll (new file, 37 additions)
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -mtriple=nvptx64-nvidia-cuda -passes=load-store-vectorizer -S < %s | FileCheck %s

; Test that gap-filled instructions get deleted if they are not used
%struct.S10 = type { i32, i32, i32, i32 }

; First, confirm that gap instructions get generated and would be vectorized if the alignment is correct
define void @fillTwoGapsCanVectorize(ptr %in) {
; CHECK-LABEL: define void @fillTwoGapsCanVectorize(
; CHECK-SAME: ptr [[IN:%.*]]) {
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr [[IN]], align 16
; CHECK-NEXT: [[LOAD03:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[GAPFILL4:%.*]] = extractelement <4 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[GAPFILL25:%.*]] = extractelement <4 x i32> [[TMP1]], i32 2
; CHECK-NEXT: [[LOAD36:%.*]] = extractelement <4 x i32> [[TMP1]], i32 3
; CHECK-NEXT: ret void
;
  %load0 = load i32, ptr %in, align 16
  %getElem = getelementptr i8, ptr %in, i64 12
  %load3 = load i32, ptr %getElem, align 4
  ret void
}

; Then, confirm that gap instructions get deleted if the alignment prevents the vectorization
define void @fillTwoGapsCantVectorize(ptr %in) {
; CHECK-LABEL: define void @fillTwoGapsCantVectorize(
; CHECK-SAME: ptr [[IN:%.*]]) {
; CHECK-NEXT: [[LOAD0:%.*]] = load i32, ptr [[IN]], align 4
; CHECK-NEXT: [[GETELEM:%.*]] = getelementptr i8, ptr [[IN]], i64 12
; CHECK-NEXT: [[LOAD3:%.*]] = load i32, ptr [[GETELEM]], align 4
; CHECK-NEXT: ret void
;
  %load0 = load i32, ptr %in, align 4
  %getElem = getelementptr i8, ptr %in, i64 12
  %load3 = load i32, ptr %getElem, align 4
  ret void
}
llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill-invariant.ll (new file, 83 additions)
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -mtriple=nvptx64-nvidia-cuda -passes=load-store-vectorizer -S < %s | FileCheck %s

; Test that gap-filled instructions don't lose invariant metadata
%struct.S10 = type { i32, i32, i32, i32 }

; With no gaps, if every load is invariant, the vectorized load will be too.
define i32 @noGaps(ptr %in) {
; CHECK-LABEL: define i32 @noGaps(
; CHECK-SAME: ptr [[IN:%.*]]) {
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr [[IN]], align 16, !invariant.load [[META0:![0-9]+]]
; CHECK-NEXT: [[TMP01:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[TMP23:%.*]] = extractelement <4 x i32> [[TMP1]], i32 2
; CHECK-NEXT: [[TMP34:%.*]] = extractelement <4 x i32> [[TMP1]], i32 3
; CHECK-NEXT: [[SUM01:%.*]] = add i32 [[TMP01]], [[TMP12]]
; CHECK-NEXT: [[SUM012:%.*]] = add i32 [[SUM01]], [[TMP23]]
; CHECK-NEXT: [[SUM0123:%.*]] = add i32 [[SUM012]], [[TMP34]]
; CHECK-NEXT: ret i32 [[SUM0123]]
;
  %load0 = load i32, ptr %in, align 16, !invariant.load !0
  %getElem1 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 1
  %load1 = load i32, ptr %getElem1, align 4, !invariant.load !0
  %getElem2 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 2
  %load2 = load i32, ptr %getElem2, align 4, !invariant.load !0
  %getElem3 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 3
  %load3 = load i32, ptr %getElem3, align 4, !invariant.load !0
  %sum01 = add i32 %load0, %load1
  %sum012 = add i32 %sum01, %load2
  %sum0123 = add i32 %sum012, %load3
  ret i32 %sum0123
}

; If one of the loads is not invariant, the vectorized load will not be invariant.
define i32 @noGapsMissingInvariant(ptr %in) {
; CHECK-LABEL: define i32 @noGapsMissingInvariant(
; CHECK-SAME: ptr [[IN:%.*]]) {
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr [[IN]], align 16
; CHECK-NEXT: [[TMP01:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[TMP23:%.*]] = extractelement <4 x i32> [[TMP1]], i32 2
; CHECK-NEXT: [[TMP34:%.*]] = extractelement <4 x i32> [[TMP1]], i32 3
; CHECK-NEXT: [[SUM01:%.*]] = add i32 [[TMP01]], [[TMP12]]
; CHECK-NEXT: [[SUM012:%.*]] = add i32 [[SUM01]], [[TMP23]]
; CHECK-NEXT: [[SUM0123:%.*]] = add i32 [[SUM012]], [[TMP34]]
; CHECK-NEXT: ret i32 [[SUM0123]]
;
  %load0 = load i32, ptr %in, align 16, !invariant.load !0
  %getElem1 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 1
  %load1 = load i32, ptr %getElem1, align 4, !invariant.load !0
  %getElem2 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 2
  %load2 = load i32, ptr %getElem2, align 4, !invariant.load !0
  %getElem3 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 3
  %load3 = load i32, ptr %getElem3, align 4
  %sum01 = add i32 %load0, %load1
  %sum012 = add i32 %sum01, %load2
  %sum0123 = add i32 %sum012, %load3
  ret i32 %sum0123
}

; With two gaps, if every real load is invariant, the vectorized load will be too.
define i32 @twoGaps(ptr %in) {
; CHECK-LABEL: define i32 @twoGaps(
; CHECK-SAME: ptr [[IN:%.*]]) {
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr [[IN]], align 16, !invariant.load [[META0]]
; CHECK-NEXT: [[LOAD03:%.*]] = extractelement <4 x i32> [[TMP1]], i32 0
; CHECK-NEXT: [[GAPFILL4:%.*]] = extractelement <4 x i32> [[TMP1]], i32 1
; CHECK-NEXT: [[GAPFILL25:%.*]] = extractelement <4 x i32> [[TMP1]], i32 2
; CHECK-NEXT: [[LOAD36:%.*]] = extractelement <4 x i32> [[TMP1]], i32 3
; CHECK-NEXT: [[SUM:%.*]] = add i32 [[LOAD03]], [[LOAD36]]
; CHECK-NEXT: ret i32 [[SUM]]
;
  %load0 = load i32, ptr %in, align 16, !invariant.load !0
  %getElem3 = getelementptr inbounds %struct.S10, ptr %in, i64 0, i32 3
  %load3 = load i32, ptr %getElem3, align 4, !invariant.load !0
  %sum = add i32 %load0, %load3
  ret i32 %sum
}

!0 = !{}
;.
; CHECK: [[META0]] = !{}
;.
Review comment:
This does not look right. Our input is presumably an array of f16 elements, but we end up loading 4 x b32 and then appear to ignore the last two elements. It should have been ld.v2.b32, or perhaps the load should have remained ld.v4.f16.
Reply:
Note the difference in the number of ld instructions in the PTX. The old output has two load instructions to load 5 b16s: an ld.v4.b16 and an ld.b16. The new version, in the LSV, "extends" the chain of 5 loads to the next power of two, a chain of 8 loads with 3 unused tail elements, vectorizing it into a single load <8 x i16>. This gets lowered by the backend to an ld.v4.b32, with 2.5 elements (containing the packed 5 b16s) used and the rest unused. This reduction from two load instructions to one is an optimization.
Reply:
I had missed the 5th load of f16. The generated code looks correct.

My next question is whether this extension is always beneficial. E.g. if we do that on shared memory, it may increase bank contention due to the extra loads. In the worst case we'd waste ~25% of shared-memory bandwidth for this particular extension from v5f16 to v4b32.

I think we should take AS info into account and have some sort of user-controllable knob to enable/disable the gap filling, if needed. E.g. it's probably always good for loads from the global AS; it's a maybe for shared memory (fewer instructions may win over bank conflicts if the extra loads happen to be broadcast to other threads' loads, but would waste bandwidth otherwise); and we can't say much about the generic AS, as it could go either way.

For masked writes it's more likely to be a win, as we don't actually write extra data, so the potential downside is a possible register-pressure bump.
Reply:
I don't think that's a concern for CUDA GPUs. But it's a good idea to add the AS as a parameter to the TTI API; other targets may want to control this feature for specific address spaces.
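To make the discussion concrete, here is an IR illustration of the address-space split being proposed, using the NVPTX convention (addrspace(1) = global, addrspace(3) = shared); the suggested TTI hook would let a target allow gap filling for the first and decline it for the second. This is a sketch of the scenario, not code from the PR:

define i32 @global_gap(ptr addrspace(1) %in) {
  ; Global memory: filling the 8-byte hole is probably always profitable.
  %a = load i32, ptr addrspace(1) %in, align 16
  %p3 = getelementptr i8, ptr addrspace(1) %in, i64 12
  %b = load i32, ptr addrspace(1) %p3, align 4
  %sum = add i32 %a, %b
  ret i32 %sum
}

define i32 @shared_gap(ptr addrspace(3) %in) {
  ; Shared memory: the widened load's unused lanes can cost bank bandwidth,
  ; so a target might decline to fill this gap.
  %a = load i32, ptr addrspace(3) %in, align 16
  %p3 = getelementptr i8, ptr addrspace(3) %in, i64 12
  %b = load i32, ptr addrspace(3) %p3, align 4
  %sum = add i32 %a, %b
  ret i32 %sum
}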