From 56c9ab1167b86d0e324390c927c4298c897f24a1 Mon Sep 17 00:00:00 2001
From: sanchitintel <sanchit.jain@intel.com>
Date: Mon, 20 Oct 2025 15:35:43 -0700
Subject: [PATCH 1/4] Update VNNI load visualization

The subgroup view in the current VNNI load visualization does not correspond to the actual values, which can be confusing.
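
The register view added in this patch can be sanity-checked with a short Python sketch. It is not part of the documentation change. `vnni_transform` is a hypothetical helper written only for this note; it assumes a row-major plain layout and that VNNI packs pairs of rows (two 16-bit elements per 32-bit slot), and it keeps the same row width for the register view, as the diagram does.

```python
def vnni_transform(rows, cols, pack=2):
    """Return the VNNI-packed view of a rows x cols row-major index matrix.

    Assumption: `pack` consecutive rows are interleaved element by element,
    and the packed stream is re-cut into rows of `cols` entries.
    """
    assert rows % pack == 0, "height must be a multiple of the packing factor"
    flat = []
    for r0 in range(0, rows, pack):        # each group of `pack` rows
        for c in range(cols):              # walk the columns of the group
            for p in range(pack):          # interleave the rows of the group
                flat.append((r0 + p) * cols + c)
    # Re-cut the interleaved stream into rows of the original width.
    return [flat[i:i + cols] for i in range(0, len(flat), cols)]


view = vnni_transform(4, 16)
for row in view:
    print(row)
```

For height = 4 and width = 16, the first row starts 0, 16, 1, 17 and the entry at (row 1, col 0) is 8, which agrees with the updated diagram.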
---
 media/docs/cpp/xe_rearchitecture.md | 34 +++++++++++++++++++----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/media/docs/cpp/xe_rearchitecture.md b/media/docs/cpp/xe_rearchitecture.md
index 7e5c49bfb4..866a4a6a6e 100644
--- a/media/docs/cpp/xe_rearchitecture.md
+++ b/media/docs/cpp/xe_rearchitecture.md
@@ -290,24 +290,37 @@ Now that we have the basic thread mapping rule, let's apply it to a simple block
 An individual DPAS atom's A matrix follows the same pattern, with height ranging from 1 to 8, and width equal to 8 (tf32), 16 (f16/bf16), or 32 (s8/u8). The DPAS C matrix is also organized this way, except that its width is always 16.
 
 As a more complicated example, let's consider a 16-bit VNNI load, with height = 4, width = 16:
+
 ```math
     \begin{array}{c}
-    \text{Subgroup view}\\
+    \text{Subgroup view of data in global memory}\\
     \begin{array}{cccccc}
-    0 & 2 & 4 & 6 & \cdots & 30\\
-    1 & 3 & 5 & 7 & \cdots & 31\\
-    32 & 34 & 36 & 38 & \cdots & 62\\
-    33 & 35 & 37 & 39 & \cdots & 63
+    0 & 1 & 2 & 3 & \cdots & 15\\
+    16 & 17 & 18 & 19 & \cdots & 31\\
+    32 & 33 & 34 & 35 & \cdots & 47\\
+    48 & 49 & 50 & 51 & \cdots & 63
+    \end{array}
+    \end{array}
+```
+
+```math
+    \begin{array}{c}
+    \text{Subgroup view of data in registers after the VNNI transformation applied during the load}\\
+    \begin{array}{ccccccc}
+    0 & 16 & 1 & 17 & \cdots & 7 & 23\\
+    8 & 24 & 9 & 25 & \cdots & 15 & 31\\
+    32 & 48 & 33 & 49 & \cdots & 39 & 55\\
+    40 & 56 & 41 & 57 & \cdots & 47 & 63
     \end{array}
     \end{array}
     \rightarrow
     \begin{array}{c}
     \text{Thread view}\\
     \begin{array}{cccc}
-    \text{T0V0} & \text{T2V0} & \text{T4V0} & \cdots & \text{T14V0} & \text{T0V1} & \cdots & \text{T14V1}\\
-    \text{T1V0} & \text{T3V0} & \text{T5V0} & \cdots & \text{T15V0} & \text{T1V1} & \cdots & \text{T15V1}\\
-    \text{T0V2} & \text{T2V2} & \text{T4V2} & \cdots & \text{T14V2} & \text{T0V3} & \cdots & \text{T14V3}\\
-    \text{T1V2} & \text{T3V2} & \text{T5V2} & \cdots & \text{T15V2} & \text{T1V3} & \cdots & \text{T15V3}
+    \text{T0V0} & \text{T1V0} & \text{T2V0} & \cdots & \text{T15V0}\\
+    \text{T0V1} & \text{T1V1} & \text{T2V1} & \cdots & \text{T15V1}\\
+    \text{T0V2} & \text{T1V2} & \text{T2V2} & \cdots & \text{T15V2}\\
+    \text{T0V3} & \text{T1V3} & \text{T2V3} & \cdots & \text{T15V3}
     \end{array}
     \end{array}
 ```
@@ -504,7 +517,6 @@ gemm_device(ATensor   const& A,         // (M,K)
 }
 ```
 
-
 ## New Collective MMAs
 
-... coming later!
\ No newline at end of file
+... coming later!

From 552df61cd74d1d1131b5571af744bcc89981c03b Mon Sep 17 00:00:00 2001
From: sanchitintel <sanchit.jain@intel.com>
Date: Mon, 20 Oct 2025 21:47:23 -0700
Subject: [PATCH 2/4] Retain original & explain what the numbers mean

---
 media/docs/cpp/xe_rearchitecture.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/media/docs/cpp/xe_rearchitecture.md b/media/docs/cpp/xe_rearchitecture.md
index 866a4a6a6e..9ffb29d99e 100644
--- a/media/docs/cpp/xe_rearchitecture.md
+++ b/media/docs/cpp/xe_rearchitecture.md
@@ -291,6 +291,31 @@ An individual DPAS atom's A matrix follows the same pattern, with height ranging
 
 As a more complicated example, let's consider a 16-bit VNNI load, with height = 4, width = 16:
 
+```math
+    \begin{array}{c}
+    \text{Subgroup view}\\
+    \begin{array}{cccccc}
+    0 & 2 & 4 & 6 & \cdots & 30\\
+    1 & 3 & 5 & 7 & \cdots & 31\\
+    32 & 34 & 36 & 38 & \cdots & 62\\
+    33 & 35 & 37 & 39 & \cdots & 63
+    \end{array}
+    \end{array}
+    \rightarrow
+    \begin{array}{c}
+    \text{Thread view}\\
+    \begin{array}{cccc}
+    \text{T0V0} & \text{T2V0} & \text{T4V0} & \cdots & \text{T14V0} & \text{T0V1} & \cdots & \text{T14V1}\\
+    \text{T1V0} & \text{T3V0} & \text{T5V0} & \cdots & \text{T15V0} & \text{T1V1} & \cdots & \text{T15V1}\\
+    \text{T0V2} & \text{T2V2} & \text{T4V2} & \cdots & \text{T14V2} & \text{T0V3} & \cdots & \text{T14V3}\\
+    \text{T1V2} & \text{T3V2} & \text{T5V2} & \cdots & \text{T15V2} & \text{T1V3} & \cdots & \text{T15V3}
+    \end{array}
+    \end{array}
+```
+
+If we instead assume that the values in the VNNI-transformed matrix below refer to the corresponding indices of the original plain layout matrix,
+then we can view the 16-bit VNNI load from a different perspective.
+
 ```math
     \begin{array}{c}
     \text{Subgroup view of data in global memory}\\

From cc16d0ebf573c0042a0e4a25d95f6d3ee7fa6184 Mon Sep 17 00:00:00 2001
From: sanchitintel <sanchit.jain@intel.com>
Date: Tue, 21 Oct 2025 12:03:37 -0700
Subject: [PATCH 3/4] Add more details

Explain what the numbers mean in the context of the original subgroup view for VNNI.

Add link explaining VNNI layout
---
 media/docs/cpp/xe_rearchitecture.md | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/media/docs/cpp/xe_rearchitecture.md b/media/docs/cpp/xe_rearchitecture.md
index 9ffb29d99e..2f690d47f9 100644
--- a/media/docs/cpp/xe_rearchitecture.md
+++ b/media/docs/cpp/xe_rearchitecture.md
@@ -2,7 +2,7 @@
 
 ## Limitations of Current Intel CuTe Architecture
 
-* VNNI layout used by DPAS and block 2D VNNI loads is hidden from CuTe/CUTLASS.
+* [VNNI layout](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc#packed-layout-format) used by DPAS and block 2D VNNI loads is hidden from CuTe/CUTLASS.
   - Compiler inserts extra interleave/deinterleave operations if there is any computation between VNNI load and DPAS.
   - Additionally, any such computation using the native B data type (instead of int) can lead to private memory traffic.
 * MMA and copy fragments must be carefully set up to match layouts
@@ -289,7 +289,10 @@ Now that we have the basic thread mapping rule, let's apply it to a simple block
 
 An individual DPAS atom's A matrix follows the same pattern, with height ranging from 1 to 8, and width equal to 8 (tf32), 16 (f16/bf16), or 32 (s8/u8). The DPAS C matrix is also organized this way, except that its width is always 16.
 
-As a more complicated example, let's consider a 16-bit VNNI load, with height = 4, width = 16:
+As a more complicated example, let's consider a 16-bit [VNNI](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc#packed-layout-format) load, with height = 4, width = 16:
+
+Note that the numbers in the subgroup view below do not correspond to the logical indices of the original (pre-VNNI-transformation) plain-layout matrix.
+They represent the order of the elements in the Xe general register file.
 
 ```math
     \begin{array}{c}
@@ -313,8 +316,8 @@ As a more complicated example, let's consider a 16-bit VNNI load, with height =
     \end{array}
 ```
 
-If we instead assume that the values in the VNNI-transformed matrix below refer to the corresponding indices of the original plain layout matrix,
-then we can view the 16-bit VNNI load from a different perspective.
+If we instead assume that the values in the subgroup views below refer to the indices of the original plain layout matrix,
+then we can view the 16-bit VNNI load from the perspective of where the plain layout matrix's elements end up after transformation.
 
 ```math
     \begin{array}{c}

From 61757f5de8d235575a9980f41147508be60ccc19 Mon Sep 17 00:00:00 2001
From: sanchitintel <sanchit.jain@intel.com>
Date: Wed, 22 Oct 2025 13:33:07 -0700
Subject: [PATCH 4/4] Add more explanation

---
 media/docs/cpp/xe_rearchitecture.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/media/docs/cpp/xe_rearchitecture.md b/media/docs/cpp/xe_rearchitecture.md
index 2f690d47f9..5cea46d8bf 100644
--- a/media/docs/cpp/xe_rearchitecture.md
+++ b/media/docs/cpp/xe_rearchitecture.md
@@ -316,8 +316,9 @@ They represent the order of the elements in the Xe general register file.
     \end{array}
 ```
 
-If we instead assume that the values in the subgroup views below refer to the indices of the original plain layout matrix,
+If we instead assume that the numbers in the subgroup views below refer to the indices of the original plain layout matrix,
 then we can view the 16-bit VNNI load from the perspective of where the plain layout matrix's elements end up after transformation.
+For example, after the VNNI transform, the number at (row 1, col 0) in the subgroup view is 8, which means that the element at linear index 8 of the original plain-layout matrix moves to (row 1, col 0).
 
 ```math
     \begin{array}{c}