[layout diagram context: a row of the operand layout table, values v spread across subgroup threads t0 … t15]
 Example 2, the column size of the matrix is 8 and the number of threads in the subgroup is 16.
-The DPAS encoding of repeatCount=8, systolicDepth=8, executionSize=16, opsPerChannel=1 and sugGroupSize=16.
+The DPAS encoding of repeatCount=8, systolicDepth=8, executionSize=16, opsPerChannel=1 and threadsPerWarp=16.

 The layout for A operand:
 K = 8 (K = systolic depth * opsPerChan)
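For reference, an encoding with Example 2's parameters would be spelled roughly as the attribute below. This is a hedged sketch: the `#triton_intel_gpu.dpas` mnemonic and field names are assumed from the dialect this diff documents, and the `warpsPerCTA`/`repCluster` values are illustrative, not taken from the commit.

```mlir
// Sketch of the Example 2 encoding (assumed attribute spelling).
// opsPerChannel = 1, so K = systolicDepth * opsPerChannel = 8 * 1 = 8.
#dpas_ex2 = #triton_intel_gpu.dpas<{
  repeatCount = 8, systolicDepth = 8, executionSize = 16,
  opsPerChannel = 1, threadsPerWarp = 16,
  warpsPerCTA = [2, 2],  // illustrative
  repCluster = [1, 1]    // illustrative
}>
```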
@@ -102,7 +103,7 @@ The layouts for B operand is like the one of opsPerChan=2 but the K size is 8.
 The layouts for C and D operands are same as the one of opsPerChan=2.

 Example 3, the column size of the matrix is 32 and the number of threads in the subgroup is 16.
-The DPAS encoding of repeatCount=8, systolicDepth=8, executionSize=16, opsPerChannel=4 and sugGroupSize=16.
+The DPAS encoding of repeatCount=8, systolicDepth=8, executionSize=16, opsPerChannel=4 and threadsPerWarp=16.

 The layout for A operand:
 K = 32 (K = systolic depth * opsPerChan)
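Under the same assumed spelling, Example 3's encoding differs only in `opsPerChannel`, which quadruples K:

```mlir
// Sketch of the Example 3 encoding (assumed attribute spelling).
// opsPerChannel = 4, so K = systolicDepth * opsPerChannel = 8 * 4 = 32.
#dpas_ex3 = #triton_intel_gpu.dpas<{
  repeatCount = 8, systolicDepth = 8, executionSize = 16,
  opsPerChannel = 4, threadsPerWarp = 16,
  warpsPerCTA = [2, 2],  // illustrative
  repCluster = [1, 1]    // illustrative
}>
```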
@@ -121,15 +122,21 @@ The layouts for B operand is like the one of opsPerChan=2 but the K size is 32.
 The layouts for C and D operands are same as the one of opsPerChan=2.

 The patterns (illustrated above) repeats every warpsPerTile[0] (resp. warpsPerTile[1]) blocks
-along the row (resp. col) dimension. And the repetitions are clustered of the size of repCluster to optimize the memory accessing.
+along the row (resp. col) dimension. And the repetitions are clustered of the size of repCluster to optimize the memory accessing.

-Suppose we have a `tt.dot` operation of the block size [64, 128] += [64, 32] * [32, 128] of hf16/bf16.
-The `warpsPerCTA` set to [2, 2]. The number of repetitions of the DPAS tile per warp is: A=8, B=8, C,D=16.
-The DPAS repetitions are distributed as follows:
+Suppose we have a `tt.dot` operation of the block size [64, 128] = [64, 32] * [32, 128] of f16/bf16. And its input tensor layout is defined as follows:
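The diff is truncated here, before the layout definition that the new sentence introduces. As a sketch only (the `#ttg.dot_op` wrapper, the `kWidth` field, and the f32 accumulator type are assumptions, not the commit's text), the setup and the per-warp repetition counts quoted in the removed lines work out as follows:

```mlir
// Hedged sketch of the [64, 128] = [64, 32] * [32, 128] f16/bf16 example.
// For f16/bf16, opsPerChannel = 2, so one DPAS tile is
// M x N x K = repeatCount x executionSize x (systolicDepth * opsPerChannel)
//           = 8 x 16 x 16.
#dpas = #triton_intel_gpu.dpas<{
  repeatCount = 8, systolicDepth = 8, executionSize = 16,
  opsPerChannel = 2, threadsPerWarp = 16,
  warpsPerCTA = [2, 2],
  repCluster = [2, 2]  // illustrative clustering of the repetitions
}>
%d = tt.dot %a, %b, %c
  : tensor<64x32xf16,  #ttg.dot_op<{opIdx = 0, parent = #dpas, kWidth = 2}>>
  * tensor<32x128xf16, #ttg.dot_op<{opIdx = 1, parent = #dpas, kWidth = 2}>>
 -> tensor<64x128xf32, #dpas>
// warpsPerCTA = [2, 2] gives each warp a [32, 64] block of the result:
//   A:    (32 / 8)  * (32 / 16) = 4 * 2 = 8 repetitions
//   B:    (32 / 16) * (64 / 16) = 2 * 4 = 8 repetitions
//   C, D: (32 / 8)  * (64 / 16) = 4 * 4 = 16 repetitions
// matching the A=8, B=8, C,D=16 counts in the removed lines above.
```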