Commit e0f13d7

[AMDGPU] Document "relaxed buffer OOB mode", update HSA default
This commit adds documentation for the relaxed-buffer-oob-mode subtarget feature so that users are aware of the performance implications of the change. It also enables relaxed buffer OOB mode for HSA programs, which don't have this correctness requirement.
1 parent 9fdac84 commit e0f13d7

4 files changed: +44 -3 lines changed

llvm/docs/AMDGPUUsage.rst

Lines changed: 35 additions & 0 deletions
@@ -1136,6 +1136,41 @@ is conservatively correct for OpenCL.
 other operations within the same address space.
 ======================= ===================================================

+Relaxed Buffer OOB (Out Of Bounds) Mode
+---------------------------------------
+
+Instructions that load from or store to buffer resources (and thus, by extension,
+buffer fat pointers and buffer strided pointers) generally implement handling for
+out of bounds (OOB) memory accesses, including those that are partially OOB,
+if the buffer resource has the required flags set.
+
+When operating on more than 32 bits of data, the ``voffset`` used for the access
+will be range-checked for each 32-bit word independently. This check uses saturating
+arithmetic and interprets the offset as an unsigned value.
+
+The behavior described above conflicts with the ABI requirements of certain graphics
+APIs that require out of bounds accesses to be handled strictly so that accesses
+that begin out of bounds but then access in-bounds elements (such as loading a
+``<4 x i32>`` beginning at offset ``-4``) still load the three in-bounds integers.
+
+Similarly, buffer fat pointers permit operating on types such as ``<8 x i8>`` which
+must be accessed (and bounds-checked) 4 bytes at a time. Non-word-aligned
+accesses to such types from near the end of a buffer resource (such as starting
+a load of an ``<8 x i8>`` from an offset of ``6`` on an 8-byte buffer) will treat
+the initial two bytes to be loaded/stored as out of bounds, even though, under
+a strict interpretation of the bounds-checking semantics, they would be in bounds.
+
+These violations of strict bounds-checking semantics for buffer resources require
+usage of less-vectorized code to ensure correctness. If this strict conformance
+is not required, the target feature ``relaxed-buffer-oob-mode`` should be enabled
+(using ``-mcpu``, ``-offload-arch`` or ``-mattr``).
+
+``relaxed-buffer-oob-mode`` permits unaligned memory accesses through a buffer resource
+to propagate to nearby elements, causing them to become out of bounds as well.
+
+``relaxed-buffer-oob-mode`` is **enabled** on HSA targets by default to preserve
+compute performance and existing ABI expectations.
+
 LLVM IR Intrinsics
 ------------------
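
Review note: the two examples in the new documentation section (a ``<4 x i32>`` load at offset ``-4`` and an ``<8 x i8>`` load at offset ``6`` on an 8-byte buffer) can be checked with a small model. The sketch below is not taken from the commit. It assumes a raw, unstrided buffer in which a 32-bit word counts as in bounds only when all four of its bytes lie below the buffer size; the function names `hardware_word_mask` and `strict_byte_mask` are hypothetical.

```python
# Simplified model of the bounds checks described in the documentation text.
# Assumption: raw (unstrided) buffer; a 32-bit word is in bounds only if all
# four of its bytes fall below the buffer size in bytes.

U32_MAX = 0xFFFFFFFF


def hardware_word_mask(voffset, num_words, buffer_bytes):
    """Per-word check: unsigned offset, saturating add, one result per word."""
    base = voffset & U32_MAX  # reinterpret the offset as an unsigned 32-bit value
    mask = []
    for i in range(num_words):
        start = min(base + 4 * i, U32_MAX)  # saturate instead of wrapping around
        mask.append(start + 4 <= buffer_bytes)
    return mask


def strict_byte_mask(offset, num_bytes, buffer_bytes):
    """Strict semantics: each byte is checked on its own, offset is signed."""
    return [0 <= offset + i < buffer_bytes for i in range(num_bytes)]


# <4 x i32> at offset -4 on a 16-byte buffer.
print(hardware_word_mask(-4, 4, 16))
print(strict_byte_mask(-4, 16, 16))

# <8 x i8> at offset 6 on an 8-byte buffer, accessed as two 4-byte words.
print(hardware_word_mask(6, 2, 8))
print(strict_byte_mask(6, 8, 8))
```

In this model every word of both accesses is reported as out of bounds by the per-word check, while the strict per-byte check keeps the last twelve bytes of the first access and the first two bytes of the second. That gap is what forces less-vectorized code when strict conformance is required.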

llvm/docs/ReleaseNotes.md

Lines changed: 5 additions & 0 deletions
@@ -92,6 +92,11 @@ Changes to the AMDGPU Backend

 * Bump the default `.amdhsa_code_object_version` to 6. ROCm 6.3 is required to run any program compiled with COV6.

+* Turn on strict buffer OOB checking on non-AMDHSA OSs. This improves the correctness
+of buffer accesses in some cases at the cost of performance for programs that do not
+contain unaligned out-of-bounds accesses. The old behavior may be restored with the
+`relaxed-buffer-oob-mode` feature.
+
 Changes to the ARM Backend
 --------------------------

llvm/lib/Target/AMDGPU/GCNSubtarget.cpp

Lines changed: 2 additions & 1 deletion
@@ -71,7 +71,8 @@ GCNSubtarget &GCNSubtarget::initializeSubtargetDependencies(const Triple &TT,
   // Turn on features that HSA ABI requires. Also turn on FlatForGlobal by
   // default
   if (isAmdHsaOS())
-    FullFS += "+flat-for-global,+unaligned-access-mode,+trap-handler,";
+    FullFS += "+flat-for-global,+unaligned-access-mode,+trap-handler,"
+              "+relaxed-buffer-oob-mode,";

   FullFS += "+enable-prt-strict-null,"; // This is overridden by a disable in FS
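
Review note: the interaction between this default and `-mattr` can be sketched as follows. This is a toy model, not LLVM's feature parser. It assumes that the user-supplied feature string is appended after the defaults and that later entries win, which is what the "overridden by a disable in FS" comment and the updated test's `-mattr=-relaxed-buffer-oob-mode` RUN line suggest. The function `effective_features` is hypothetical.

```python
# Toy model of default-plus-user feature strings (assumed last-entry-wins).
HSA_DEFAULTS = ("+flat-for-global,+unaligned-access-mode,+trap-handler,"
                "+relaxed-buffer-oob-mode,")


def effective_features(is_amdhsa, user_fs):
    """Return the set of enabled features after applying defaults, then user_fs."""
    full_fs = HSA_DEFAULTS if is_amdhsa else ""
    full_fs += user_fs  # assumption: the user string is appended last
    enabled = set()
    for token in full_fs.split(","):
        token = token.strip()
        if not token:
            continue
        if token.startswith("+"):
            enabled.add(token[1:])
        elif token.startswith("-"):
            enabled.discard(token[1:])
    return enabled


# HSA target with no overrides, then with the override used by the updated test.
print("relaxed-buffer-oob-mode" in effective_features(True, ""))
print("relaxed-buffer-oob-mode" in effective_features(True, "-relaxed-buffer-oob-mode"))
```

Under these assumptions an HSA target gets the relaxed mode unless the user disables it, and a non-HSA target gets it only when it is requested explicitly.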

llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -mattr=+relaxed-buffer-oob-mode -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-RELAXED %s
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-STRICT %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-RELAXED %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -mattr=-relaxed-buffer-oob-mode -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-STRICT %s

 target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7"
