@@ -63,12 +63,10 @@ in Intel GPUs) using registers.
 Registers are a kind of small and fast memory bank (called a Register File) located just beside the compute engine, as
 can be seen in the following diagrams showing selected parts of an Intel GPU architecture.
 
-![Xe2 GPU Vector engine Illustration]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/ComputeUnit.jpg' |
-relative_url }})<br>
+![Xe2 GPU Vector engine Illustration]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/ComputeUnit.jpg' | relative_url }})<br>
 *Illustration of an Intel Xe2 GPU Vector engine architecture (simplified)*
 
-![XeCore GPU Illustration]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/XeCore.jpg' |
-relative_url }})<br>
+![XeCore GPU Illustration]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/XeCore.jpg' | relative_url }})<br>
 *Illustration of an Intel XeCore architecture (simplified)*
 
 Basically, the tensor core reads operands A and B from the *Register File* and then writes the accumulated output C
@@ -162,8 +160,7 @@ from Global Memory to the L1 Cache, then the second step is carried out by the `
 Registers, hopefully from the L1 cache if the data is still available in cache (cache hit).
 The diagram below illustrates this process:
 
-![Intel Backend Memory Semantic]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/IntelMemory.jpg' |
-relative_url }})<br>
+![Intel Backend Memory Semantic]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/IntelMemory.jpg' | relative_url }})<br>
 *Intel Backend Memory Semantic (synchronous)*
 
 Nvidia has chosen to leverage the Shared Local Memory (SMEM) instead of the cache. SMEM is indeed a scratch pad memory
@@ -173,8 +170,7 @@ a memory buffer in SMEM, but also `TritonGPU::LocalLoadOp` and `TritonGPU::Local
 between SMEM and Registers.
 Consequently, the Triton process for loading and storing data (synchronously) in the Nvidia architecture is as follows:
 
-![Nvidia Backend Memory Semantic]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/NvidiaMemory.jpg' |
-relative_url }})<br>
+![Nvidia Backend Memory Semantic]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/NvidiaMemory.jpg' | relative_url }})<br>
 *Nvidia Backend Memory Semantic (synchronous)*
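
As a purely illustrative reference point, the sketch below shows a minimal Triton tile kernel; the kernel name, shapes and pointer arguments are assumptions made for illustration, not code from the post. The point is that the same source-level `tl.load`/`tl.dot` pair is lowered differently by each backend: on the Intel side a prefetch moves the tile from Global Memory to the L1 cache and the load then fills the registers, whereas on the Nvidia side the tile is staged in SMEM via `TritonGPU::LocalStoreOp` and read back into registers with `TritonGPU::LocalLoadOp`.

```python
import triton
import triton.language as tl

# Minimal sketch (illustrative names/shapes, not from the post). The same
# tl.load is lowered differently per backend: Intel prefetches Global Memory
# -> L1 and then loads into registers; Nvidia stages the tiles in Shared
# Local Memory before loading them into registers.
@triton.jit
def tile_dot_kernel(a_ptr, b_ptr, c_ptr,
                    M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)

    # Loads: the backend decides how the data travels to the Register File.
    a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])  # (M, K) tile
    b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])  # (K, N) tile

    # The tensor core reads A and B from registers and accumulates C.
    c = tl.dot(a, b)

    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)
```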
 
 ---
@@ -199,8 +195,7 @@ So, in our example, if A needs $NumReg_A$ registers to be stored, this means tha
 for A across the loop, and thus the compiler needs to fit the variables used between lines 1 and 7 in $N - NumReg_A$
 registers, with $N$ being the total number of registers available.
 
-![variable liveness simple example]({{ '
-/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness_example_annotated.jpg' | relative_url }})<br>
+![variable liveness simple example]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness_example_annotated.jpg' | relative_url }})<br>
 *Variable liveness simple example*
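
For readers who prefer code to diagrams, the following hypothetical skeleton (names, shapes and the surrounding loop are assumptions, not the post's exact example) reproduces the liveness pattern described above: `a` becomes live at the very first load but is only consumed after the loop, so the $NumReg_A$ registers holding it are unavailable to the loop body.

```python
import triton
import triton.language as tl

# Hypothetical skeleton, not the post's exact example: `a` is loaded up front
# but only consumed after the loop, so the registers holding it stay live
# across the whole loop body and everything inside the loop has to fit in the
# remaining registers.
@triton.jit
def long_liveness_kernel(a_ptr, b_ptr, out_ptr, K, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs)                  # `a` becomes live here
    acc = tl.zeros((BLOCK,), dtype=tl.float32)
    for k in range(0, K):                      # loop body: `a` is unused here,
        b = tl.load(b_ptr + k * BLOCK + offs)  # yet still occupies registers
        acc += b
    acc += a                                   # first (and only) use of `a`
    tl.store(out_ptr + offs, acc)
```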
 
 It is therefore easy to understand that in such a kernel, if the variable A is large and the kernel processing between
@@ -392,8 +387,7 @@ an [optimization pass](https://github.com/intel/intel-xpu-backend-for-triton/blo
 which aims to reduce variable liveness where possible.
 To this end, the pass attempts to bring load operations closer to the actual uses of the loaded data.
 
-![Reduce Variable Liveness pass diagram]({{ '
-/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness-pass-diagram.jpg' | relative_url }})<br>
+![Reduce Variable Liveness pass diagram]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness-pass-diagram.jpg' | relative_url }})<br>
 *Reduce Variable Liveness pass diagram*
 
 The diagram above shows how the compiler pass works to reduce the liveness of `DotOp` operands.
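
The pass operates on the compiler IR rather than on the Python source, but its effect can be pictured at the source level with the purely illustrative before/after sketch below (function and variable names are assumptions): sinking the loads of the `tl.dot` operands next to their only use shortens the span over which the A and B tiles occupy registers.

```python
import triton
import triton.language as tl

# Conceptual before/after sketch (illustrative only, not the pass's actual IR
# rewrite). BLOCK is assumed to be >= 16 so that tl.dot is legal.

@triton.jit
def before(a_ptr, b_ptr, x_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])  # A live from here...
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])  # ...and B too
    x = tl.exp(tl.load(x_ptr + offs))                           # unrelated work
    acc = tl.dot(a, b)                                          # ...until here
    tl.store(out_ptr + offs[:, None] * BLOCK + offs[None, :], acc)
    tl.store(x_ptr + offs, x)

@triton.jit
def after(a_ptr, b_ptr, x_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.exp(tl.load(x_ptr + offs))                           # unrelated work first
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])  # loads sunk next to
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])  # their only use
    acc = tl.dot(a, b)
    tl.store(out_ptr + offs[:, None] * BLOCK + offs[None, :], acc)
    tl.store(x_ptr + offs, x)
```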
@@ -442,10 +436,10 @@ We have evaluated the performance of Triton FlashAttention v2 on Intel GPU Max P
 The following plots show the normalised performance of the FlashAttention kernel with the *reduce-liveness-pass* enabled
 for different input configurations.
 
-![Normalized performance PVC1100]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1100_new.jpg' | relative_url }})
+![Normalized performance PVC1100]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1100_new.png' | relative_url }})
 *FlashAttention v2 Normalized performance PVC1100*
 
-![Normalized performance PVC1550]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1550_new.jpg' | relative_url }})
+![Normalized performance PVC1550]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1550_new.png' | relative_url }})
 *FlashAttention v2 Normalized performance PVC1550*
 
 The testbed used for these evaluations and a disclaimer can be found [at the bottom](#disclaimer) of this blog post.