From 56c9ab1167b86d0e324390c927c4298c897f24a1 Mon Sep 17 00:00:00 2001
From: sanchitintel
Date: Mon, 20 Oct 2025 15:35:43 -0700
Subject: [PATCH 1/4] Update VNNI load visualization

The current visualization of the subgroup view of the VNNI load visualization does not correspond to the actual values, so it might be confusing.
---
 media/docs/cpp/xe_rearchitecture.md | 34 +++++++++++++++++++----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/media/docs/cpp/xe_rearchitecture.md b/media/docs/cpp/xe_rearchitecture.md
index 7e5c49bfb4..866a4a6a6e 100644
--- a/media/docs/cpp/xe_rearchitecture.md
+++ b/media/docs/cpp/xe_rearchitecture.md
@@ -290,24 +290,37 @@ Now that we have the basic thread mapping rule, let's apply it to a simple block
 An individual DPAS atom's A matrix follows the same pattern, with height ranging from 1 to 8, and width equal to 8 (tf32), 16 (f16/bf16), or 32 (s8/u8). The DPAS C matrix is also organized this way, except that its width is always 16.
 
 As a more complicated example, let's consider a 16-bit VNNI load, with height = 4, width = 16:
+
 ```math
   \begin{array}{c}
-    \text{Subgroup view}\\
+    \text{Subgroup view of data in global memory}\\
     \begin{array}{cccccc}
-      0 & 2 & 4 & 6 & \cdots & 30\\
-      1 & 3 & 5 & 7 & \cdots & 31\\
-      32 & 34 & 36 & 38 & \cdots & 62\\
-      33 & 35 & 37 & 39 & \cdots & 63
+      0 & 1 & 2 & 3 & \cdots & 15\\
+      16 & 17 & 18 & 19 & \cdots & 31\\
+      32 & 33 & 34 & 35 & \cdots & 47\\
+      48 & 49 & 50 & 51 & \cdots & 63
+    \end{array}
+  \end{array}
+```
+
+```math
+  \begin{array}{c}
+    \text{Subgroup view of data in registers after VNNI transformation that happened during the load}\\
+    \begin{array}{cccccc}
+      0 & 16 & 1 & 17 & \cdots & 7 & 23\\
+      8 & 24 & 9 & 25 & \cdots & 15 & 31\\
+      32 & 48 & 33 & 49 & \cdots & 39 & 55\\
+      40 & 56 & 41 & 57 & \cdots & 47 & 63
     \end{array}
   \end{array}
   \rightarrow
   \begin{array}{c}
     \text{Thread view}\\
     \begin{array}{cccc}
-      \text{T0V0} & \text{T2V0} & \text{T4V0} & \cdots & \text{T14V0} & \text{T0V1} & \cdots & \text{T14V1}\\
-      \text{T1V0} & \text{T3V0} & \text{T5V0} & \cdots & \text{T15V0} & \text{T1V1} & \cdots & \text{T15V1}\\
-      \text{T0V2} & \text{T2V2} & \text{T4V2} & \cdots & \text{T14V2} & \text{T0V3} & \cdots & \text{T14V3}\\
-      \text{T1V2} & \text{T3V2} & \text{T5V2} & \cdots & \text{T15V2} & \text{T1V3} & \cdots & \text{T15V3}
+      \text{T0V0} & \text{T1V0} & \text{T2V0} & \cdots & \text{T15V0}\\
+      \text{T0V1} & \text{T1V1} & \text{T2V1} & \cdots & \text{T15V1}\\
+      \text{T0V2} & \text{T1V2} & \text{T2V2} & \cdots & \text{T15V2}\\
+      \text{T0V3} & \text{T1V3} & \text{T2V3} & \cdots & \text{T15V3}
     \end{array}
   \end{array}
 ```
@@ -504,7 +517,6 @@ gemm_device(ATensor const& A, // (M,K)
 }
 ```
 
-
 ## New Collective MMAs
 
-... coming later!
\ No newline at end of file
+... coming later!

From 552df61cd74d1d1131b5571af744bcc89981c03b Mon Sep 17 00:00:00 2001
From: sanchitintel
Date: Mon, 20 Oct 2025 21:47:23 -0700
Subject: [PATCH 2/4] Retain original & explain what the numbers mean

---
 media/docs/cpp/xe_rearchitecture.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/media/docs/cpp/xe_rearchitecture.md b/media/docs/cpp/xe_rearchitecture.md
index 866a4a6a6e..9ffb29d99e 100644
--- a/media/docs/cpp/xe_rearchitecture.md
+++ b/media/docs/cpp/xe_rearchitecture.md
@@ -291,6 +291,31 @@ An individual DPAS atom's A matrix follows the same pattern, with height ranging
 
 As a more complicated example, let's consider a 16-bit VNNI load, with height = 4, width = 16:
 
+```math
+  \begin{array}{c}
+    \text{Subgroup view}\\
+    \begin{array}{cccccc}
+      0 & 2 & 4 & 6 & \cdots & 30\\
+      1 & 3 & 5 & 7 & \cdots & 31\\
+      32 & 34 & 36 & 38 & \cdots & 62\\
+      33 & 35 & 37 & 39 & \cdots & 63
+    \end{array}
+  \end{array}
+  \rightarrow
+  \begin{array}{c}
+    \text{Thread view}\\
+    \begin{array}{cccc}
+      \text{T0V0} & \text{T2V0} & \text{T4V0} & \cdots & \text{T14V0} & \text{T0V1} & \cdots & \text{T14V1}\\
+      \text{T1V0} & \text{T3V0} & \text{T5V0} & \cdots & \text{T15V0} & \text{T1V1} & \cdots & \text{T15V1}\\
+      \text{T0V2} & \text{T2V2} & \text{T4V2} & \cdots & \text{T14V2} & \text{T0V3} & \cdots & \text{T14V3}\\
+      \text{T1V2} & \text{T3V2} & \text{T5V2} & \cdots & \text{T15V2} & \text{T1V3} & \cdots & \text{T15V3}
+    \end{array}
+  \end{array}
+```
+
+If we instead assume that the values in the VNNI-transformed matrix below refer to the corresponding indices of the original plain layout matrix,
+then we can view the 16-bit VNNI load from a different perspective.
+
 ```math
   \begin{array}{c}
     \text{Subgroup view of data in global memory}\\

From cc16d0ebf573c0042a0e4a25d95f6d3ee7fa6184 Mon Sep 17 00:00:00 2001
From: sanchitintel
Date: Tue, 21 Oct 2025 12:03:37 -0700
Subject: [PATCH 3/4] Add more details

Explain what the numbers mean in context of the original subgroup view for VNNI. Add link explaining VNNI layout
---
 media/docs/cpp/xe_rearchitecture.md | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/media/docs/cpp/xe_rearchitecture.md b/media/docs/cpp/xe_rearchitecture.md
index 9ffb29d99e..2f690d47f9 100644
--- a/media/docs/cpp/xe_rearchitecture.md
+++ b/media/docs/cpp/xe_rearchitecture.md
@@ -2,7 +2,7 @@
 
 ## Limitations of Current Intel CuTe Architecture
 
-* VNNI layout used by DPAS and block 2D VNNI loads is hidden from CuTe/CUTLASS.
+* [VNNI layout](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc#packed-layout-format) used by DPAS and block 2D VNNI loads is hidden from CuTe/CUTLASS.
   - Compiler inserts extra interleave/deinterleave operations if there is any computation between VNNI load and DPAS.
   - Additionally, any such computation using the native B data type (instead of int) can lead to private memory traffic.
 * MMA and copy fragments must be carefully set up to match layouts
@@ -289,7 +289,10 @@ Now that we have the basic thread mapping rule, let's apply it to a simple block
 
 An individual DPAS atom's A matrix follows the same pattern, with height ranging from 1 to 8, and width equal to 8 (tf32), 16 (f16/bf16), or 32 (s8/u8). The DPAS C matrix is also organized this way, except that its width is always 16.
 
-As a more complicated example, let's consider a 16-bit VNNI load, with height = 4, width = 16:
+As a more complicated example, let's consider a 16-bit [VNNI](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc#packed-layout-format) load, with height = 4, width = 16:
+
+Please note that the numbers in the subgroup-view below do not correspond to the logical indices of the original (pre-VNNI-transformation) plain-layout matrix.
+They represent the order of the elements in the Xe general register file.
 
 ```math
   \begin{array}{c}
@@ -313,8 +316,8 @@ As a more complicated example, let's consider a 16-bit VNNI load, with height =
   \end{array}
 ```
 
-If we instead assume that the values in the VNNI-transformed matrix below refer to the corresponding indices of the original plain layout matrix,
-then we can view the 16-bit VNNI load from a different perspective.
+If we instead assume that the values in the subgroup views below refer to the indices of the original plain layout matrix,
+then we can view the 16-bit VNNI load from the perspective of where the plain layout matrix's elements end up after transformation.
 
 ```math
   \begin{array}{c}

From 61757f5de8d235575a9980f41147508be60ccc19 Mon Sep 17 00:00:00 2001
From: sanchitintel
Date: Wed, 22 Oct 2025 13:33:07 -0700
Subject: [PATCH 4/4] Add more explanation

---
 media/docs/cpp/xe_rearchitecture.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/media/docs/cpp/xe_rearchitecture.md b/media/docs/cpp/xe_rearchitecture.md
index 2f690d47f9..5cea46d8bf 100644
--- a/media/docs/cpp/xe_rearchitecture.md
+++ b/media/docs/cpp/xe_rearchitecture.md
@@ -316,8 +316,9 @@ They represent the order of the elements in the Xe general register file.
   \end{array}
 ```
 
-If we instead assume that the values in the subgroup views below refer to the indices of the original plain layout matrix,
+If we instead assume that the numbers in the subgroup views below refer to the indices of the original plain layout matrix,
 then we can view the 16-bit VNNI load from the perspective of where the plain layout matrix's elements end up after transformation.
+e.g. after VNNI-transform, the number at (row 1, col 0) in the subgroup view is 8, which implies that the value at logical linear index 8 in the original plain-layout subgroup view would move to (row 1, col 0) after VNNI-transform.
 
 ```math
   \begin{array}{c}