To support all combinations of M, N and K, we use one kernel as a base and dynamically generate the rest of the handling for numbers that are not multiples of M, N or K.
As a base we took the ``matmul_16m_4n_k`` kernel, which reached around ``130 GFLOPS`` as 64_48_64 kernel (i.e. the same as the kernel from the previous section, with a batch dimension of one).
Every K is trivially a multiple of 1 (the kernels step through K one element at a time), therefore we don't need a special case for this dimension.
To get full coverage of the remaining dimensions, we implemented the following variations:
- `matmul_16m_lt4nRest_k`:

  - M dimension must be a multiple of 16
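How the generator chooses among these variations can be pictured as a dispatch on the remainders of M and N. The following sketch is purely illustrative; the function name ``select_kernel`` and the branch structure are assumptions, not the report's actual generator code:

```cpp
#include <cstdint>
#include <string>

// Hypothetical dispatch sketch: pick a kernel variant from the remainders
// of M and N. K needs no case split because the kernels step through K
// one element at a time.
std::string select_kernel(int64_t m, int64_t n) {
    bool n_mult4 = (n % 4 == 0);
    if (m >= 16 && m % 16 == 0)
        return n_mult4 ? "matmul_16m_4n_k" : "matmul_16m_lt4nRest_k";
    if (m >= 16)  // multiple-of-16 main part plus a remainder in M
        return n_mult4 ? "matmul_16mRest_4n_k" : "matmul_16mRest_lt4nRest_k";
    return n_mult4 ? "matmul_lt16_4n_k" : "matmul_lt16_lt4nRest_k";
}
```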
Together with the `matmul_16m_4n_k`, we have 6 kernels to cover the complete dimension range.
2. Verify all matrices for ``1≤M≤64``, ``1≤N≤64``, ``K∈[1,16,32,64,128]``, ``lda=M``, ``ldb=K``, and ``ldc=M``
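A verification of this kind compares the generated kernels against a simple reference implementation. A minimal sketch of such a reference, assuming column-major storage (consistent with ``lda=M``, ``ldb=K``, ``ldc=M``); the name ``gemm_ref`` is hypothetical:

```cpp
#include <cstdint>

// Naive column-major reference GEMM: C += A * B.
// lda, ldb, ldc are the leading dimensions of A, B and C.
void gemm_ref(const float* a, const float* b, float* c,
              int64_t m, int64_t n, int64_t k,
              int64_t lda, int64_t ldb, int64_t ldc) {
    for (int64_t in = 0; in < n; ++in)
        for (int64_t ik = 0; ik < k; ++ik)
            for (int64_t im = 0; im < m; ++im)
                c[in * ldc + im] += a[ik * lda + im] * b[in * ldb + ik];
}
```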
The benchmark took approximately eight hours in total to run. The following results were produced: :download:`GEMM_benchmarks.csv <../_static/resources/report_25_05_15/GEMM_benchmarks.csv>`
Batch-Reduce GEMM
-----------------
1. Extend generate to support batch dimension 1≤batch_size≤1024
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: cpp

       m, n, k, br_size, trans_a, trans_b, trans_c, static_cast<int32_t>(dtype)));
   }
   ...
This ``else if`` branch dispatches to our extended ``br_matmul_*`` kernels with a larger batch dimension.
- `br_matmul_16m_lt4nRest_k`
- `br_matmul_16mRest_4n_k`
- `br_matmul_16mRest_lt4nRest_k`
- `br_matmul_lt16_4n_k`
- `br_matmul_lt16_lt4nRest_k`
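The batch-reduce contract these kernels share is that C accumulates the sum of all per-batch products, i.e. C += Σ_b A_b · B_b. A minimal reference sketch, assuming column-major layout and contiguously stored batches; the name ``brgemm_ref`` and the batch stride choice are assumptions for illustration:

```cpp
#include <cstdint>

// Batch-reduce GEMM reference: C += sum over the batch of A_b * B_b.
// Assumes the batch matrices are stored back to back (stride lda*k for A,
// ldb*n for B).
void brgemm_ref(const float* a, const float* b, float* c,
                int64_t m, int64_t n, int64_t k, int64_t br_size,
                int64_t lda, int64_t ldb, int64_t ldc) {
    for (int64_t br = 0; br < br_size; ++br) {
        const float* ab = a + br * lda * k;  // next A matrix in the batch
        const float* bb = b + br * ldb * n;  // next B matrix in the batch
        for (int64_t in = 0; in < n; ++in)
            for (int64_t ik = 0; ik < k; ++ik)
                for (int64_t im = 0; im < m; ++im)
                    c[in * ldc + im] += ab[ik * lda + im] * bb[in * ldb + ik];
    }
}
```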
2. Verify against reference implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
All kernels were tested. The tests are located in the files matching ``src/test/kernels/br_matmul_*.test.cpp``.
The batched MatMul generation was tested for ``1≤M≤64``, ``1≤N≤64``, ``K∈[1,16,32,64,128]``, ``1≤BatchSize≤16``, ``lda=M``, ``ldb=K``, and ``ldc=M``. The test is located in the file ``src/test/Brgemm.test.cpp``.
3. Benchmark for 1≤M≤64, 1≤N≤64, K∈[1,16,32,64,128], lda=M, ldb=K, ldc=M, batch_size=16
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^