diff --git a/media/docs/cpp/xe_rearchitecture.md b/media/docs/cpp/xe_rearchitecture.md
index 7e5c49bfb4..5cea46d8bf 100644
--- a/media/docs/cpp/xe_rearchitecture.md
+++ b/media/docs/cpp/xe_rearchitecture.md
@@ -2,7 +2,7 @@
 
 ## Limitations of Current Intel CuTe Architecture
 
-* VNNI layout used by DPAS and block 2D VNNI loads is hidden from CuTe/CUTLASS.
+* [VNNI layout](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc#packed-layout-format) used by DPAS and block 2D VNNI loads is hidden from CuTe/CUTLASS.
   - Compiler inserts extra interleave/deinterleave operations if there is any computation between VNNI load and DPAS.
   - Additionally, any such computation using the native B data type (instead of int) can lead to private memory traffic.
 * MMA and copy fragments must be carefully set up to match layouts
@@ -289,7 +289,11 @@ Now that we have the basic thread mapping rule, let's apply it to a simple block
 
 An individual DPAS atom's A matrix follows the same pattern, with height ranging from 1 to 8, and width equal to 8 (tf32), 16 (f16/bf16), or 32 (s8/u8). The DPAS C matrix is also organized this way, except that its width is always 16.
 
-As a more complicated example, let's consider a 16-bit VNNI load, with height = 4, width = 16:
+As a more complicated example, let's consider a 16-bit [VNNI](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc#packed-layout-format) load, with height = 4, width = 16:
+
+Please note that the numbers in the subgroup-view below do not correspond to the logical indices of the original (pre-VNNI-transformation) plain-layout matrix.
+They represent the order of the elements in the Xe general register file.
+
 ```math
   \begin{array}{c}
     \text{Subgroup view}\\
@@ -312,6 +316,44 @@ As a more complicated example, let's consider a 16-bit VNNI load, with height =
   \end{array}
 ```
 
+If we instead assume that the numbers in the subgroup views below refer to the indices of the original plain layout matrix,
+then we can view the 16-bit VNNI load from the perspective of where the plain layout matrix's elements end up after transformation.
+e.g. after VNNI-transform, the number at (row 1, col 0) in the subgroup view is 8, which implies that the value at logical linear index 8 in the original plain-layout subgroup view would move to (row 1, col 0) after VNNI-transform.
+
+```math
+  \begin{array}{c}
+    \text{Subgroup view of data in global memory}\\
+    \begin{array}{cccccc}
+      0 & 1 & 2 & 3 & \cdots & 15\\
+      16 & 17 & 18 & 19 & \cdots & 31\\
+      32 & 33 & 34 & 35 & \cdots & 47\\
+      48 & 49 & 50 & 51 & \cdots & 63
+    \end{array}
+  \end{array}
+```
+
+```math
+  \begin{array}{c}
+    \text{Subgroup view of data in registers after VNNI transformation that happened during the load}\\
+    \begin{array}{cccccc}
+      0 & 16 & 1 & 17 & \cdots & 7 & 23\\
+      8 & 24 & 9 & 25 & \cdots & 15 & 31\\
+      32 & 48 & 33 & 49 & \cdots & 39 & 55\\
+      40 & 56 & 41 & 57 & \cdots & 47 & 63
+    \end{array}
+  \end{array}
+  \rightarrow
+  \begin{array}{c}
+    \text{Thread view}\\
+    \begin{array}{cccc}
+      \text{T0V0} & \text{T1V0} & \text{T2V0} & \cdots & \text{T15V0}\\
+      \text{T0V1} & \text{T1V1} & \text{T2V1} & \cdots & \text{T15V1}\\
+      \text{T0V2} & \text{T1V2} & \text{T2V2} & \cdots & \text{T15V2}\\
+      \text{T0V3} & \text{T1V3} & \text{T2V3} & \cdots & \text{T15V3}
+    \end{array}
+  \end{array}
+```
+
 The DPAS B matrix follows the same pattern.
 
 
@@ -504,7 +546,6 @@ gemm_device(ATensor const& A, // (M,K)
 }
 ```
 
-
 ## New Collective MMAs
 
-... coming later!
\ No newline at end of file
+... coming later!
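
Review note: the index mapping that the added VNNI paragraph describes (plain-layout element to register position) can be sanity-checked with a short sketch. The code below is illustrative only and is not part of CuTe/CUTLASS: `RegisterSlot`, `vnni_slot_16bit`, and `plain_index` are hypothetical names, and the formula assumes 16-bit elements (so two rows are interleaved per 32-bit register element), a subgroup size of 16, and a row-major plain matrix. It has only been checked against the 4 x 16 example in the patch.

```cpp
// Hypothetical helper, for illustration only (not a CuTe/CUTLASS API).
// Describes where one element sits in the thread view: T<lane>V<value_index>.
struct RegisterSlot {
    int value_index;  // row of the register view (V0, V1, ...)
    int lane;         // column of the register view (T0, T1, ...)
};

// Linear index of element (row, col) in the plain row-major matrix.
constexpr int plain_index(int row, int col, int width) {
    return row * width + col;
}

// Assumed mapping for a 16-bit VNNI load:
//   1. Rows are grouped in pairs; within a pair, the two rows are interleaved
//      element by element (row 2k first, then row 2k+1).
//   2. The interleaved stream is laid out across the subgroup,
//      subgroup_size elements per register row.
constexpr RegisterSlot vnni_slot_16bit(int row, int col, int width,
                                       int subgroup_size = 16) {
    constexpr int vnni_factor = 2;  // 32-bit register element / 16-bit data
    const int packed_linear = (row / vnni_factor) * (width * vnni_factor)
                            + col * vnni_factor
                            + (row % vnni_factor);
    return RegisterSlot{packed_linear / subgroup_size,
                        packed_linear % subgroup_size};
}

// Compile-time checks against the 4 x 16 example in the patch:
// plain index 8 is element (row 0, col 8) and lands at (row 1, col 0).
static_assert(plain_index(0, 8, 16) == 8, "plain index of (0, 8)");
static_assert(vnni_slot_16bit(0, 8, 16).value_index == 1, "index 8 -> row 1");
static_assert(vnni_slot_16bit(0, 8, 16).lane == 0, "index 8 -> col 0");
// plain index 16 is element (row 1, col 0) and lands at (row 0, col 1).
static_assert(vnni_slot_16bit(1, 0, 16).value_index == 0, "index 16 -> row 0");
static_assert(vnni_slot_16bit(1, 0, 16).lane == 1, "index 16 -> col 1");
```

Under these assumptions, `vnni_slot_16bit(0, 8, 16)` yields value index 1 and lane 0, which agrees with the statement in the patch that plain index 8 moves to (row 1, col 0), i.e. T0V1 in the thread view.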