49 changes: 45 additions & 4 deletions media/docs/cpp/xe_rearchitecture.md
@@ -2,7 +2,7 @@

## Limitations of Current Intel CuTe Architecture

* VNNI layout used by DPAS and block 2D VNNI loads is hidden from CuTe/CUTLASS.
* [VNNI layout](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc#packed-layout-format) used by DPAS and block 2D VNNI loads is hidden from CuTe/CUTLASS.
- Compiler inserts extra interleave/deinterleave operations if there is any computation between VNNI load and DPAS.
- Additionally, any such computation using the native B data type (instead of int) can lead to private memory traffic.
* MMA and copy fragments must be carefully set up to match layouts
@@ -289,7 +289,11 @@ Now that we have the basic thread mapping rule, let's apply it to a simple block

An individual DPAS atom's A matrix follows the same pattern, with height ranging from 1 to 8, and width equal to 8 (tf32), 16 (f16/bf16), or 32 (s8/u8). The DPAS C matrix is also organized this way, except that its width is always 16.
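
The three A-matrix widths quoted above all come to 256 bits per row (8 x 32, 16 x 16, 32 x 8). A minimal sketch of that arithmetic follows. The helper name `dpas_a_atom_width` and the constant `kDpasCWidth` are hypothetical, introduced only for illustration, and are not CuTe/CUTLASS APIs. The 256-bit row size is inferred from the three listed cases rather than stated in the document.

```cpp
// Hypothetical helper (not a CuTe/CUTLASS API): width, in elements, of one
// row of a DPAS A atom, given the storage size of one element in bits.
// Assumption: every row holds 256 bits, as the three listed widths suggest.
constexpr int dpas_a_atom_width(int element_bits) {
  return 256 / element_bits;
}

// The C matrix width is stated to be 16 regardless of data type.
constexpr int kDpasCWidth = 16;

static_assert(dpas_a_atom_width(32) == 8,  "tf32 (stored in 32 bits)");
static_assert(dpas_a_atom_width(16) == 16, "f16 / bf16");
static_assert(dpas_a_atom_width(8)  == 32, "s8 / u8");
```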

As a more complicated example, let's consider a 16-bit VNNI load, with height = 4, width = 16:
As a more complicated example, let's consider a 16-bit [VNNI](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc#packed-layout-format) load, with height = 4, width = 16:

Note that the numbers in the subgroup view below do not correspond to the logical indices of the original (pre-VNNI-transformation) plain-layout matrix.
They represent the order of the elements in the Xe general register file.

Could you add the explanation of how to understand this representation which helped you? I think it would also be good to include (/link to) a good explanation of the VNNI format. Such as:
image

@sanchitintel (Author), Oct 21, 2025

Thanks for reviewing!

Peter clarified that the numbers in subgroup view in the current documentation represent the order of elements in the Xe register file. I added that information in the latest commit in this PR.

> I think it would also be good to include (/link to) a good explanation of the VNNI format

I added a link to https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc#example-2-16-bit-elements that describes VNNI layout.

```math
\begin{array}{c}
\text{Subgroup view}\\
@@ -312,6 +316,44 @@ As a more complicated example, let's consider a 16-bit VNNI load, with height =
\end{array}
```

Alternatively, suppose the numbers in the subgroup views below are the linear indices of the original plain-layout matrix.
We can then read the 16-bit VNNI load in terms of where each plain-layout element ends up after the transformation.
For example, the entry at (row 1, col 0) of the post-transform subgroup view is 8, which means that the element at linear index 8 of the plain-layout matrix moves to (row 1, col 0) after the VNNI transform.

```math
\begin{array}{c}
\text{Subgroup view of data in global memory}\\
\begin{array}{cccccc}
0 & 1 & 2 & 3 & \cdots & 15\\
16 & 17 & 18 & 19 & \cdots & 31\\
32 & 33 & 34 & 35 & \cdots & 47\\
48 & 49 & 50 & 51 & \cdots & 63
\end{array}
\end{array}
```

```math
\begin{array}{c}
\text{Subgroup view of data in registers after VNNI transformation that happened during the load}\\
\begin{array}{ccccccc}
0 & 16 & 1 & 17 & \cdots & 7 & 23\\
8 & 24 & 9 & 25 & \cdots & 15 & 31\\
32 & 48 & 33 & 49 & \cdots & 39 & 55\\
40 & 56 & 41 & 57 & \cdots & 47 & 63
\end{array}
\end{array}
\rightarrow
\begin{array}{c}
\text{Thread view}\\
\begin{array}{ccccc}
\text{T0V0} & \text{T1V0} & \text{T2V0} & \cdots & \text{T15V0}\\
\text{T0V1} & \text{T1V1} & \text{T2V1} & \cdots & \text{T15V1}\\
\text{T0V2} & \text{T1V2} & \text{T2V2} & \cdots & \text{T15V2}\\
\text{T0V3} & \text{T1V3} & \text{T2V3} & \cdots & \text{T15V3}
\end{array}
\end{array}
```

The DPAS B matrix follows the same pattern.
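
The index permutation in the post-transform subgroup view above can be reproduced with a few lines of integer arithmetic. The sketch below assumes the 4 x 16, 16-bit case from the example, where each pair of consecutive source rows is interleaved column by column (two 16-bit elements per 32-bit slot). All names (`vnni_register_order`, `plain_index_at`, `print_register_view`, and the `k...` constants) are hypothetical and are not part of CuTe or CUTLASS. It models only the register order, not the hardware load itself.

```cpp
#include <array>
#include <cassert>
#include <cstdio>

// Illustrative constants for the example in the text (not library APIs).
constexpr int kHeight = 4;       // rows of the plain-layout tile
constexpr int kWidth = 16;       // columns of the plain-layout tile
constexpr int kVnniFactor = 2;   // 16-bit elements packed per 32-bit slot
constexpr int kCount = kHeight * kWidth;

// For each register position q (row-major over the 4 x 16 subgroup view),
// return the plain-layout linear index of the element stored there.
std::array<int, kCount> vnni_register_order() {
  std::array<int, kCount> order{};
  const int group_size = kVnniFactor * kWidth;  // elements per row pair
  for (int q = 0; q < kCount; ++q) {
    const int group = q / group_size;       // which pair of source rows
    const int within = q % group_size;      // offset inside that pair
    const int col = within / kVnniFactor;   // source column
    const int k = within % kVnniFactor;     // 0: upper row, 1: lower row
    order[q] = (group * kVnniFactor + k) * kWidth + col;
  }
  return order;
}

// Plain-layout index found at (row, col) of the post-transform view.
int plain_index_at(int row, int col) {
  assert(row >= 0 && row < kHeight && col >= 0 && col < kWidth);
  return vnni_register_order()[row * kWidth + col];
}

// Print the post-transform subgroup view, one register row per line.
void print_register_view() {
  const auto order = vnni_register_order();
  for (int r = 0; r < kHeight; ++r) {
    for (int c = 0; c < kWidth; ++c) {
      std::printf("%3d", order[r * kWidth + c]);
    }
    std::printf("\n");
  }
}
```

Calling `print_register_view()` from `main` prints four rows that begin `0 16 1 17`, `8 24 9 25`, `32 48 33 49`, and `40 56 41 57`, and `plain_index_at(1, 0)` returns 8, in line with the worked example in the text.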


@@ -504,7 +546,6 @@ gemm_device(ATensor const& A, // (M,K)
}
```


## New Collective MMAs

... coming later!
... coming later!