intel · sanchitintel · Oct 20, 2025 · Oct 21, 2025 · Oct 21, 2025 · Oct 22, 2025
diff --git a/media/docs/cpp/xe_rearchitecture.md b/media/docs/cpp/xe_rearchitecture.md
@@ -2,7 +2,7 @@
 
 ## Limitations of Current Intel CuTe Architecture
 
-* VNNI layout used by DPAS and block 2D VNNI loads is hidden from CuTe/CUTLASS.
+* [VNNI layout](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc#packed-layout-format) used by DPAS and block 2D VNNI loads is hidden from CuTe/CUTLASS.
   - Compiler inserts extra interleave/deinterleave operations if there is any computation between VNNI load and DPAS.
   - Additionally, any such computation using the native B data type (instead of int) can lead to private memory traffic.
 * MMA and copy fragments must be carefully set up to match layouts
@@ -289,7 +289,11 @@ Now that we have the basic thread mapping rule, let's apply it to a simple block
 
 An individual DPAS atom's A matrix follows the same pattern, with height ranging from 1 to 8, and width equal to 8 (tf32), 16 (f16/bf16), or 32 (s8/u8). The DPAS C matrix is also organized this way, except that its width is always 16.
 
-As a more complicated example, let's consider a 16-bit VNNI load, with height = 4, width = 16:
+As a more complicated example, let's consider a 16-bit [VNNI](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc#packed-layout-format) load, with height = 4, width = 16:
+
+Note that the numbers in the subgroup view below are not the logical indices of the original (pre-VNNI-transformation) plain-layout matrix.
+They give the order of the elements in the Xe general register file.
+
 ```math
     \begin{array}{c}
     \text{Subgroup view}\\
@@ -312,6 +316,44 @@ As a more complicated example, let's consider a 16-bit VNNI load, with height =
     \end{array}
 ```
 
+If we instead let the numbers in the subgroup views below denote the linear indices of the original plain-layout matrix,
+then we can view the 16-bit VNNI load in terms of where each plain-layout element ends up after the transformation.
+For example, after the VNNI transform, the number at (row 1, col 0) in the subgroup view is 8, which means that the element at logical linear index 8 of the original plain-layout matrix moves to (row 1, col 0).
+
+```math
+    \begin{array}{c}
+    \text{Subgroup view of data in global memory}\\
+    \begin{array}{cccccc}
+    0 & 1 & 2 & 3 & \cdots & 15\\
+    16 & 17 & 18 & 19 & \cdots & 31\\
+    32 & 33 & 34 & 35 & \cdots & 47\\
+    48 & 49 & 50 & 51 & \cdots & 63
+    \end{array}
+    \end{array}
+```
+
+```math
+    \begin{array}{c}
+    \text{Subgroup view of data in registers after the VNNI transformation performed during the load}\\
+    \begin{array}{ccccccc}
+    0 & 16 & 1 & 17 & \cdots & 7 & 23\\
+    8 & 24 & 9 & 25 & \cdots & 15 & 31\\
+    32 & 48 & 33 & 49 & \cdots & 39 & 55\\
+    40 & 56 & 41 & 57 & \cdots & 47 & 63
+    \end{array}
+    \end{array}
+    \rightarrow
+    \begin{array}{c}
+    \text{Thread view}\\
+    \begin{array}{ccccc}
+    \text{T0V0} & \text{T1V0} & \text{T2V0} & \cdots & \text{T15V0}\\
+    \text{T0V1} & \text{T1V1} & \text{T2V1} & \cdots & \text{T15V1}\\
+    \text{T0V2} & \text{T1V2} & \text{T2V2} & \cdots & \text{T15V2}\\
+    \text{T0V3} & \text{T1V3} & \text{T2V3} & \cdots & \text{T15V3}
+    \end{array}
+    \end{array}
+```
+
 The DPAS B matrix follows the same pattern.
 
 
@@ -504,7 +546,6 @@ gemm_device(ATensor   const& A,         // (M,K)
 }
 ```
 
-
 ## New Collective MMAs
 
-... coming later!
+... coming later!
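
The 16-bit VNNI reordering described in the third hunk can be checked with a small host-side sketch. It assumes that a 16-bit VNNI load interleaves each pair of consecutive rows element by element, which is what the added "after VNNI transformation" array shows. The names `vnni16_linear_index` and `vnni16_subgroup_view` are hypothetical helpers for illustration only; they are not part of CuTe, CUTLASS, or the SYCL matrix extension.

```cpp
#include <array>

// Hypothetical helper (not a CuTe/CUTLASS API).
// Maps plain-layout coordinate (r, c) of a tile with `width` columns of
// 16-bit elements to its linear position in register order after the
// VNNI transform. Assumption: rows are packed in pairs, so elements
// (2k, c) and (2k + 1, c) become neighbours.
constexpr int vnni16_linear_index(int r, int c, int width) {
    return (r / 2) * (2 * width) + 2 * c + (r % 2);
}

// Hypothetical helper: builds the "subgroup view after VNNI transformation".
// Entry [i][j] holds the plain-layout linear index of the element that
// lands at (row i, col j) of the register view.
template <int Height, int Width>
std::array<std::array<int, Width>, Height> vnni16_subgroup_view() {
    std::array<std::array<int, Width>, Height> view{};
    for (int r = 0; r < Height; ++r) {
        for (int c = 0; c < Width; ++c) {
            const int k = vnni16_linear_index(r, c, Width);
            view[k / Width][k % Width] = r * Width + c;
        }
    }
    return view;
}

// The example from the added text: plain-layout index 8, i.e. (row 0, col 8),
// lands at register position 16, which is (row 1, col 0) for width 16.
static_assert(vnni16_linear_index(0, 8, 16) == 16,
              "plain index 8 should land at (row 1, col 0)");
```

For height = 4 and width = 16, `vnni16_subgroup_view<4, 16>()` reproduces the rows in the added array (`0 16 1 17 ... 7 23`, `8 24 ...`, and so on). Read against the thread view in the same hunk, entry `[v][t]` is the element held by thread `t` as its value `v`.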