@@ -677,7 +677,7 @@ the device used to execute the code match the features enabled when
  generating the code. A mismatch of features may result in incorrect
  execution, or a reduction in performance.

- The target features supported by each processor is listed in
+ The target features supported by each processor are listed in
  :ref:`amdgpu-processors`.

  Target features are controlled by exactly one of the following Clang
@@ -783,7 +783,7 @@ description. The AMDGPU target specific information is:
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
- target ID only allow the primary processor name.
+ target ID only allows the primary processor name.

  **target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
@@ -793,7 +793,7 @@ description. The AMDGPU target specific information is:
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
- features to be specified in alphabetic order.
+ features to be specified in alphabetical order.

  .. _amdgpu-target-id-v2-v3:
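As a rough illustration of the canonical-form rule quoted in the hunk above, the Python sketch below orders the target features of a target ID alphabetically and rejects a feature that appears more than once. It assumes the colon-separated ``processor:feature+:feature-`` spelling used with ``--offload-arch``. The helper name ``canonicalize_target_id`` is hypothetical. Translating alternative processor names to primary names is left out, because that table is not part of this change.

```python
def canonicalize_target_id(target_id: str) -> str:
    """Return target_id with its target features in alphabetical order.

    ASSUMPTION: the ID looks like "<processor>(:<feature>(+|-))*".
    Alternative processor names are NOT translated to primary names.
    """
    processor, *features = target_id.split(":")
    if not processor:
        raise ValueError("missing processor name")

    seen = set()
    for feature in features:
        # Each feature is a name followed by exactly one "+" or "-" suffix.
        if len(feature) < 2 or feature[-1] not in ("+", "-"):
            raise ValueError(f"malformed target feature: {feature!r}")
        name = feature[:-1]
        if name in seen:
            raise ValueError(f"target feature appears more than once: {name!r}")
        seen.add(name)

    # Sort by feature name only, ignoring the +/- suffix.
    ordered = sorted(features, key=lambda f: f[:-1])
    return ":".join([processor, *ordered])
```

For example, ``canonicalize_target_id("gfx90a:xnack-:sramecc+")`` yields ``gfx90a:sramecc+:xnack-``, while an ID without features is returned unchanged.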
@@ -886,7 +886,7 @@ supported for the ``amdgcn`` target.
  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).

  To convert between a private or group address space address (termed a segment
- address) and a flat address the base address of the corresponding aperture
+ address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
  Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
@@ -1186,7 +1186,7 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
  :ref:`llvm.stackrestore.p5 <int_stackrestore>` Implemented, must use the alloca address space.

  :ref:`llvm.get.fpmode.i32 <int_get_fpmode>` The natural floating-point mode type is i32. This
- implemented by extracting relevant bits out of the MODE
+ is implemented by extracting relevant bits out of the MODE
  register with s_getreg_b32. The first 10 bits are the
  core floating-point mode. Bits 12:18 are the exception
  mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not
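The bit layout described for ``llvm.get.fpmode.i32`` in the hunk above can be sketched as a small decoder. This is only an illustration in Python: the function name ``decode_fpmode`` and the dictionary keys are hypothetical, and it assumes that "Bits 12:18" denotes an inclusive range of seven bits.

```python
def decode_fpmode(mode: int) -> dict:
    """Split an i32 floating-point mode value into the documented fields.

    ASSUMPTION: "bits 12:18" is an inclusive range (7 bits).
    """
    return {
        "core_mode": mode & 0x3FF,              # bits 0..9, the first 10 bits
        "exception_mask": (mode >> 12) & 0x7F,  # bits 12..18
        "fp16_ovfl": (mode >> 23) & 0x1,        # bit 23, meaningful on gfx9+ only
    }
```

Bits outside these fields (for example bits 10 and 11) are ignored by this sketch.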
@@ -1266,14 +1266,14 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
  llvm.amdgcn.permlane16 Provides direct access to v_permlane16_b32. Performs arbitrary gather-style
  operation within a row (16 contiguous lanes) of the second input operand.
- The third and fourth inputs must be scalar values. these are combined into
+ The third and fourth inputs must be scalar values. These are combined into
  a single 64-bit value representing lane selects used to swizzle within each
  row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>,
  <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.

  llvm.amdgcn.permlanex16 Provides direct access to v_permlanex16_b32. Performs arbitrary gather-style
  operation across two rows of the second input operand (each row is 16 contiguous
- lanes). The third and fourth inputs must be scalar values. these are combined
+ lanes). The third and fourth inputs must be scalar values. These are combined
  into a single 64-bit value representing lane selects used to swizzle within each
  row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>,
  <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.
@@ -1285,31 +1285,31 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
  32-bit vectors.

  llvm.amdgcn.udot2 Provides direct access to v_dot2_u32_u16 across targets which
- support such instructions. This performs unsigned dot product
+ support such instructions. This performs an unsigned dot product
  with two v2i16 operands, summed with the third i32 operand. The
  i1 fourth operand is used to clamp the output.

  llvm.amdgcn.udot4 Provides direct access to v_dot4_u32_u8 across targets which
- support such instructions. This performs unsigned dot product
+ support such instructions. This performs an unsigned dot product
  with two i32 operands (holding a vector of 4 8bit values), summed
  with the third i32 operand. The i1 fourth operand is used to clamp
  the output.

  llvm.amdgcn.udot8 Provides direct access to v_dot8_u32_u4 across targets which
- support such instructions. This performs unsigned dot product
+ support such instructions. This performs an unsigned dot product
  with two i32 operands (holding a vector of 8 4bit values), summed
  with the third i32 operand. The i1 fourth operand is used to clamp
  the output.

  llvm.amdgcn.sdot2 Provides direct access to v_dot2_i32_i16 across targets which
- support such instructions. This performs signed dot product
+ support such instructions. This performs a signed dot product
  with two v2i16 operands, summed with the third i32 operand. The
  i1 fourth operand is used to clamp the output.
  When applicable (e.g. no clamping), this is lowered into
  v_dot2c_i32_i16 for targets which support it.

  llvm.amdgcn.sdot4 Provides direct access to v_dot4_i32_i8 across targets which
- support such instructions. This performs signed dot product
+ support such instructions. This performs a signed dot product
  with two i32 operands (holding a vector of 4 8bit values), summed
  with the third i32 operand. The i1 fourth operand is used to clamp
  the output.
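To make the packed dot-product wording in the hunk above concrete, here is a hedged Python reference model of ``llvm.amdgcn.udot4``: four unsigned 8-bit lanes are taken from each of the two i32 operands, multiplied lane by lane, and summed with the third operand. The function name mirrors the intrinsic for readability only. The clamp behaviour shown (saturate to the unsigned 32-bit maximum, otherwise wrap modulo 2**32) is an assumption and is not stated in the quoted text.

```python
U32_MAX = 0xFFFFFFFF


def udot4(a: int, b: int, c: int, clamp: bool) -> int:
    """Reference model: sum of four unsigned 8-bit products plus c."""
    total = c & U32_MAX
    for lane in range(4):
        shift = 8 * lane
        # Extract the unsigned 8-bit lane from each packed 32-bit operand.
        total += ((a >> shift) & 0xFF) * ((b >> shift) & 0xFF)
    # ASSUMPTION: clamp saturates at U32_MAX; without clamp the sum wraps.
    return min(total, U32_MAX) if clamp else total & U32_MAX
```

With ``a = 0x01020304``, ``b = 0x01010101`` and ``c = 10``, the lanes contribute 4 + 3 + 2 + 1, so the model returns 20.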
@@ -1321,7 +1321,7 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
  of this instruction for gfx11 targets.

  llvm.amdgcn.sdot8 Provides direct access to v_dot8_u32_u4 across targets which
- support such instructions. This performs signed dot product
+ support such instructions. This performs a signed dot product
  with two i32 operands (holding a vector of 8 4bit values), summed
  with the third i32 operand. The i1 fourth operand is used to clamp
  the output.
@@ -1401,7 +1401,7 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
  llvm.amdgcn.atomic.cond.sub.u32 Provides direct access to flat_atomic_cond_sub_u32, global_atomic_cond_sub_u32
  and ds_cond_sub_u32 based on address space on gfx12 targets. This
- performs subtraction only if the memory value is greater than or
+ performs a subtraction only if the memory value is greater than or
  equal to the data value.

  llvm.amdgcn.s.barrier.signal.isfirst Provides access to the s_barrier_signal_first instruction;
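The conditional subtraction described for ``llvm.amdgcn.atomic.cond.sub.u32`` in the hunk above can be modelled, for a single operation and ignoring atomicity, roughly as follows. The Python function name and the choice to return the previous memory value alongside the new one are assumptions made for illustration.

```python
def atomic_cond_sub_u32(memory: int, data: int) -> tuple:
    """Model one conditional subtract; returns (new_memory, old_memory).

    ASSUMPTION: the operation hands back the value that was in memory
    before the update; the quoted text does not say so.
    """
    old = memory & 0xFFFFFFFF
    data &= 0xFFFFFFFF
    # Subtract only when the memory value is >= the data value.
    new = old - data if old >= data else old
    return new, old
```

For instance, a memory value of 10 with data 3 becomes 7, whereas a memory value of 2 with data 3 is left unchanged.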
@@ -1646,7 +1646,7 @@ The AMDGPU backend supports the following LLVM IR attributes.
  llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
  attributes, the queue pointer may be required in situations where the
  intrinsic call does not directly appear in the program. Some subtargets
- require the queue pointer for to handle some addrspacecasts, as well
+ require the queue pointer to handle some addrspacecasts, as well
  as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
  llvm.debug intrinsics.
@@ -1947,7 +1947,7 @@ The following describes all emitted function resource usage symbols:
  callees, contains an indirect call
  ===================================== ========= ========================================= ===============================================================================

- Futhermore , three symbols are additionally emitted describing the compilation
+ Furthermore, three symbols are additionally emitted describing the compilation
  unit's worst case (i.e, maxima) ``num_vgpr``, ``num_agpr``, and
  ``numbered_sgpr`` which may be referenced and used by the aforementioned
  symbolic expressions. These three symbols are ``amdgcn.max_num_vgpr``,
@@ -17948,7 +17948,7 @@ set architecture (ISA) version of the assembly program.
  "AMD" and *arch* should always be equal to "AMDGPU".

  By default, the assembler will derive the ISA version, *vendor*, and *arch*
- from the value of the -mcpu option that is passed to the assembler.
+ from the value of the ``-mcpu`` option that is passed to the assembler.

  .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
@@ -17972,7 +17972,7 @@ default value for all keys is 0, with the following exceptions:
  - *amd_kernel_code_version_minor* defaults to 2.
  - *amd_machine_kind* defaults to 1.
  - *amd_machine_version_major*, *machine_version_minor*, and
- *amd_machine_version_stepping* are derived from the value of the -mcpu option
+ *amd_machine_version_stepping* are derived from the value of the ``-mcpu`` option
  that is passed to the assembler.
  - *kernel_code_entry_byte_offset* defaults to 256.
  - *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards