* Function pointers revisited
* oneDPL C++ standard library support

2023-06-05
==========


* Ruyman Reyes
* Rod Burns
* Cohn, Robert S
* Tom Deakin
* Victor Lomuller
* Khaldi, Dounia
* Spruit, Neil R
* Andrew Richards
* Gene Amdahl Meeting Room
* Reinders, James R
* Yates, Brandon
* Slavova, Gergana S
* Voss, Michael J
* Brodman, James
* Xiong, Jianxin
* Mehdi Goli
* Keryell, Ronan (XILINX LABS)
* Tu, Peng
* Benie
* Andrew Lumsdaine
* Lueck, Gregory M
* Richards, Alison L
* Arteaga Molina, Jaime A
* Lowney, Geoff
* Mcguire, Russell W
* Alastair Murray
* Kukanov, Alexey
* Videau, Brice
* Wells, Alex M
* Melonakos, John

Joint Matrix: A Unified SYCL Extension for Matrix Hardware Programming
------------------------------------------------------------------------

Dounia Khaldi,
`Slides <presentations/2023-06-07-DK-matrix-oneapi-language.pdf.pdf>`_

* Great community reception, with contributions to MLIR dialects upstream
* Different levels of abstraction are exposed to different users.
  Joint matrix aims for the middle: it breaks GEMM down into primitives,
  so it is low level, but it is portable across targets.
* This presentation covers both the SPIR-V and the SYCL extensions;
  both are needed for different targets
* Joint matrix relies on various abstractions for code generation
  (PTX ISA, SPIR-V, GPU intrinsics, ...)
* Joint matrix is not a replacement for frameworks and libraries;
  it is useful when implementing new operations
  or optimizing unexpected combinations of operations
* It is also useful for library developers,
  who need to write code that is portable
* Intel PVC has two stacks with slices; each slice has 16 Xe cores,
  each with 8 vector engines and 8 XMX engines (GEMM accelerators)
* Intel Xeon processors codenamed Sapphire Rapids have AMX extensions,
  which are GEMM accelerators
* NVIDIA and other hardware vendors have their own GEMM accelerators
* You would need a lot of different intrinsics to target all of them
* The SYCL matrix extension is an experimental SYCL API at this point,
  so it may change from one release to the next
* The joint matrix type takes a group type; only sub-group is supported.
  Use marks the matrix as A, B or accumulator for GEMM;
  then you specify the shape (rows, columns) and the layout
  (see the sketch after this list).
* There are various operations supported: fill, load, store
* (Slide shows an example of using the extension)
* The example shows a matrix multiply in SYCL; the load and mad happen in the K loop.
* You can do an element-wise operation with data that is in the joint_matrix
* Q (Ronan): can you do negative strides, or is it just unsigned?
* A: Stride is a positive number.
* The same example and source can run across Intel CPU, GPU and NVIDIA GPU.
* Additional functions pass the row/col indices. This is Intel specific;
  NVIDIA cannot support this on Tensor Cores
* Q (Ruyman): Any restrictions on the element-wise operations supported?
* A (Dounia): No restriction, any SYCL kernel code is valid
* Size combinations are different between AMX and XMX,
  and even between generations of XMX. NVIDIA has different numbers.
* How do we write portable code? There is a query interface, static and dynamic
* Static queries require hardware architecture checks.
  Basic code is similar between SYCL joint matrix and CUDA fragments
* CUDA code migration to SYCL is simple, as it is very close to the
  wmma operations
* The joint matrix extension in MLIR generates SPIR-V code for multiple backends
* Currently: full support of the SYCL joint matrix extension
  on AMX, XMX and NVIDIA Tensor Cores
* Next steps: standardization of joint matrix in SYCL and SPIR-V

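To make the shape of the API concrete, below is a minimal sketch of one
sub-group computing a tile of ``C = A * B`` with the experimental
``joint_matrix`` interface described above. It is illustrative only: the
16x16x16 bfloat16/float shape is an assumed supported combination, the
``joint_matrix_mad`` signature has varied between DPC++ releases, and some
targets additionally require a packed layout for the B matrix.

.. code-block:: cpp

    // Illustrative sketch only; exact namespaces/signatures may differ
    // between DPC++ releases, and the tile shape is an assumed combination.
    #include <sycl/sycl.hpp>

    namespace jm = sycl::ext::oneapi::experimental::matrix;
    using bf16 = sycl::ext::oneapi::bfloat16;

    constexpr size_t TM = 16, TN = 16, TK = 16; // assumed tile shape
    constexpr size_t SG_SIZE = 16;              // assumed sub-group size

    void gemm(sycl::queue &q, bf16 *A, bf16 *B, float *C,
              size_t M, size_t N, size_t K) {
      q.parallel_for(
          sycl::nd_range<2>({M / TM, (N / TN) * SG_SIZE}, {1, SG_SIZE}),
          [=](sycl::nd_item<2> it) [[sycl::reqd_sub_group_size(SG_SIZE)]] {
            auto sg = it.get_sub_group();
            size_t row = it.get_group(0) * TM; // C tile origin (rows)
            size_t col = it.get_group(1) * TN; // C tile origin (cols)

            // One joint_matrix per "use": A, B and the accumulator.
            jm::joint_matrix<sycl::sub_group, bf16, jm::use::a,
                             TM, TK, jm::layout::row_major> a;
            jm::joint_matrix<sycl::sub_group, bf16, jm::use::b,
                             TK, TN, jm::layout::row_major> b;
            jm::joint_matrix<sycl::sub_group, float, jm::use::accumulator,
                             TM, TN> c;

            jm::joint_matrix_fill(sg, c, 0.0f); // zero the accumulator
            for (size_t k = 0; k < K; k += TK) {
              // Cooperative loads of the A and B tiles for this K step.
              jm::joint_matrix_load(
                  sg, a,
                  sycl::address_space_cast<
                      sycl::access::address_space::global_space,
                      sycl::access::decorated::no>(A + row * K + k),
                  K);
              jm::joint_matrix_load(
                  sg, b,
                  sycl::address_space_cast<
                      sycl::access::address_space::global_space,
                      sycl::access::decorated::no>(B + k * N + col),
                  N);
              jm::joint_matrix_mad(sg, c, a, b, c); // c = a * b + c
            }
            jm::joint_matrix_store(
                sg, c,
                sycl::address_space_cast<
                    sycl::access::address_space::global_space,
                    sycl::access::decorated::no>(C + row * N + col),
                N, jm::layout::row_major);
          });
    }

The hard-coded TM/TN/TK values above are what the static and dynamic query
interface mentioned in the talk is meant to replace with combinations
reported for the actual device.
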
Joint matrix in NVIDIA Tensor Cores
------------------------------------

Mehdi Goli, `Slides <presentations/2023-06-07_JointMatrix_NVIDIA.pdf.pdf>`_

* GEMMs are used everywhere, and it is very important that we optimize them
* Presentation about joint matrix performance analysis,
  showing support for SM72 and SM80 (Jetson and Ampere)
* We use the joint matrix extension on both;
  we can achieve 97% of cuDNN performance on Jetson
* On SM80 / A100 we use different sizes and see mixed results in the
  performance comparison with CUTLASS and cuDNN
  (very good on small sizes, really bad on large sizes)
* SYCL-BLAS half and TF32 performance is slightly better for small sizes but
  gets much worse for bigger sizes in the comparison with CUTLASS and cuDNN
* NVIDIA uses ldmatrix and cp.async (shared store from global load) to get
  higher performance.
  These instructions make it possible to bypass the cache and apply prefetching
* Tensor Core support has evolved across different NVIDIA architectures,
  and they have added new instructions that support some advanced
  features using a different part of the PTX ISA (wmma vs mma).
* WMMA is a higher-level instruction that maps to multiple HMMA instructions
  in the SASS.
* MMA instructions map to a single HMMA wherever possible,
  or, for backwards compatibility, break down into multiple HMMA instructions
  on previous generations
* WMMA is exposed in CUDA and is what we use today for the joint_matrix
  extension (see the sketch after this list),
  whereas MMA is what CUTLASS and others use via hard-coded assembly.
* Results from NVIDIA suggest WMMA is slower than MMA.
* The performance gap in our joint matrix numbers is due to the
  lack of cp.async, which needs to be added to DPC++.
* We need to somehow expose the mma instruction to DPC++ so that we can close
  the performance gap.
* Q (Ruyman): you mean supporting it within the joint_matrix extension?
* A (Mehdi): Yes, it should be possible
* Q (Jianxin): This would be an implementation detail?
* A (Mehdi): Yes
* Q (Geoff): Why don't we load this into local memory?
* A (Mehdi): It is not supported in our backend
* Q (Geoff): If we preload the data into SLM, wouldn't we get more performance?
* A (Mehdi): Our backend does not support it; this is one of the key factors
  in the performance problems we see.
* Q (Dounia): Are there technical challenges in the implementation?
* A (Mehdi): It is a lot of different configurations and maintenance in the
  backend. Individual mapping of builtins is difficult.
* A (Dounia): ATS and PVC sizes are different, that is why we have the query.
  The implementation is bigger but it is transparent;
  the user has to specify which hardware they have.
* Q (Geoff): Any matrix multiplication should tile itself onto SLM, but it
  seems that is not the case? Why should joint matrix be 3 times slower?
  They have a nice feature to do it in the ISA,
  but you can do that yourself, right?
* A (Mehdi): They use a different instruction to implement the loading
  that gives better performance;
  we cannot use that instruction in the DPC++ backend yet

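For context on the WMMA path discussed above, the CUDA fragment API that the
joint_matrix extension currently lowers to on NVIDIA GPUs looks roughly like
the sketch below (illustrative only: one warp per 16x16 half/float tile, with
outer tiling, bounds checks and the ldmatrix/cp.async staging omitted).

.. code-block:: cpp

    // Illustrative CUDA WMMA sketch: one warp computes one 16x16 tile of
    // C = A * B. MMA/cp.async-based paths (as used by CUTLASS) are not shown.
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void wmma_tile(const half *A, const half *B, float *C,
                              int M, int N, int K) {
      // Launch with one 32-thread warp per block; each block owns one tile.
      int tile_row = blockIdx.y * 16;
      int tile_col = blockIdx.x * 16;

      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
      wmma::fill_fragment(c, 0.0f);

      for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + tile_row * K + k, K); // A tile
        wmma::load_matrix_sync(b, B + k * N + tile_col, N); // B tile
        wmma::mma_sync(c, a, b, c);                         // c = a * b + c
      }
      wmma::store_matrix_sync(C + tile_row * N + tile_col, c, N,
                              wmma::mem_row_major);
    }

The fill/load/mad/store structure mirrors the joint_matrix sketch in the
previous section, which is why migration between the two is straightforward;
the performance gap discussed here comes from the staging instructions
(ldmatrix, cp.async) around these operations rather than from the fragment
API itself.
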
(Meeting adjourned, out of time)

2023-03-14
==========
