Optimization Passes
-------------------
1. IR that supports transformations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We created a struct ``TensorConfig`` in ``TensorConfig.h`` to support transformations and optimization passes on our tensor operation.
This configuration contains all the input data for our tensor operation. Before handing the configuration over to the tensor operation
setup, we run our optimization passes over it. We also added ``equals(const TensorConfig &config1, const TensorConfig &config2)`` and
``to_string()`` methods for testing purposes.
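
A minimal sketch of this configuration struct is shown below. The members and their order mirror the aggregate initialization used in our
test cases (see section 5); the enum values and comments are illustrative assumptions, not the exact declaration.

.. code-block:: cpp

    #include <cstdint>
    #include <string>
    #include <vector>

    namespace mini_jit
    {
        struct TensorConfig
        {
            enum class prim_t { none, gemm, brgemm };
            enum class dim_t { m, n, k, c };
            enum class exec_t { seq, shared, prim };
            enum class dtype_t { fp32, fp64 };

            prim_t first_touch;                // optional first-touch primitive
            prim_t main;                       // main primitive, e.g. gemm
            prim_t last_touch;                 // optional last-touch primitive
            std::vector<dim_t> dim_types;      // m, n, k or c per loop dimension
            std::vector<exec_t> exec_types;    // seq, shared or prim per loop dimension
            std::vector<int64_t> dim_sizes;    // extent of each dimension
            std::vector<int64_t> strides_in0;  // strides of the first input
            std::vector<int64_t> strides_in1;  // strides of the second input
            std::vector<int64_t> strides_out;  // strides of the output
            dtype_t dtype;                     // element data type

            static bool equals(const TensorConfig &config1, const TensorConfig &config2);
            std::string to_string() const;
        };
    } // namespace mini_jit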
2. Implement optimization passes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Dimension Reordering Fusing**

We added dimension reordering to our optimization passes to improve dimension fusion.
The idea is to move a dimension X next to a dimension Y whenever both have the same type and the condition ``Stride(X) = |Y| * Stride(Y)`` is met.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_dimension_reordering_fusing(TensorConfig &config)

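As a small, hypothetical before/after illustration of the reordering condition (sizes and strides invented for this example):

.. code-block:: cpp

    // before: dim_types = (m, k, m), dim_sizes = (32, 8, 4), strides = (4, 128, 1)
    // the outer m-dimension X and the inner m-dimension Y have the same type and
    // Stride(X) = 4 = 4 * 1 = |Y| * Stride(Y), so X is moved next to Y:
    // after:  dim_types = (k, m, m), dim_sizes = (8, 32, 4), strides = (128, 4, 1)
    // the fusion pass can now merge the two m-dimensions into one of size 128.
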
**Dimension Splitting**

We added dimension splitting to our optimization passes. The idea is to check whether any dimension size is greater than or equal to 256.
If so, we search for a divisor of the dimension size, starting at the floor of its square root and decrementing down to 2. If a divisor is
found, the dimension is split in two; otherwise it is left unchanged.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_dimension_splitting(TensorConfig &config)

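A minimal sketch of the divisor search; ``find_split`` is a hypothetical helper name, not the actual implementation:

.. code-block:: cpp

    #include <cmath>
    #include <cstdint>

    // Search for a divisor of `size`, starting at floor(sqrt(size)) and
    // decrementing towards 2; returns 1 if `size` is prime.
    int64_t find_split(int64_t size)
    {
        for (int64_t candidate = static_cast<int64_t>(std::sqrt(size)); candidate >= 2; candidate--)
        {
            if (size % candidate == 0)
            {
                return candidate;
            }
        }
        return 1; // no divisor found: the dimension stays unsplit
    }

For example, a dimension of size 300 is split into 15 and 20: the search starts at floor(sqrt(300)) = 17, rejects 17 and 16, and accepts 15.
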
**Dimension Fusing**

We added dimension fusion to our optimization passes. The idea is to check whether two neighboring dimensions have the same dimension type,
whether the product of their sizes is less than or equal to 256, and whether the condition ``Stride(X) = |Y| * Stride(Y)`` holds.
If so, we fuse the two dimensions.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_dimension_fusing(TensorConfig &config)

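A concrete instance occurs in the test case of section 5, in the strides of the second input:

.. code-block:: cpp

    // X = (n, |X| = 5, Stride(X) = 8192 * 32 = 262144), Y = (n, |Y| = 32, Stride(Y) = 8192)
    // |X| * |Y| = 160 <= 256 and Stride(X) = 262144 = |Y| * Stride(Y),
    // so X and Y are fused into a single n-dimension of size 160 with stride 8192.
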
**Dimension Reordering Shared**

We added a second dimension reordering to our optimization passes for better shared identification. We reorder sequential loops with other
sequential loops and shared loops with other shared loops. We sort by a stride-based key, but discourage k-dimensions and repeated dimension
types: the key is the sum of the squared strides, divided by eight for a k-dimension and divided by two when a dimension repeats the type of
its predecessor, excluding the c-dimension.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_dimension_reordering_shared(TensorConfig &config)
    {
        ...
        uint64_t value = (*jStrideIn0 * *jStrideIn0) + (*jStrideIn1 * *jStrideIn1) + (*jStrideOut * *jStrideOut);

        // value / 8 if we have a k-dimension
        value >>= (*jDim == TensorConfig::dim_t::k) * 3;

        // value / 2 if the dimension type repeats the previous one, but not for the c-dimension
        value >>= (*jDim == previous_dim && *jDim != TensorConfig::dim_t::c) * 1;
        ...
    }

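Using the expected configuration from the test in section 5, the two outer loops receive the following sort keys (we infer that larger keys
are ordered first):

.. code-block:: cpp

    // n-dimension, strides (0, 8192, 1024):  0 + 8192 * 8192 + 1024 * 1024 = 68157440
    // k-dimension, strides (1024, 1024, 0): (1024 * 1024 + 1024 * 1024 + 0) >> 3 = 262144
    // The k-dimension's key is divided by eight, so the n-loop stays in front of it
    // and can later be tagged shared.
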
**Primitive Identification**

We added primitive identification support to our optimization pass.
The following rules are applied based on the dimension type (a worked example follows the code block below):

- m-dimension: search for an m-dimension with unit stride in the first input
- n-dimension: search the second input and the output for the smallest stride
- k-dimension: only applies to GEMM and BRGEMM; search for unit stride in the second input
- second k-dimension: only applies to BRGEMM; search for the smallest stride in the first or second input, excluding the already chosen k-dimension

Additionally, we do not modify any primitives already chosen by the user.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_primitive_identification(TensorConfig &config)

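Applied to the test case in section 5, these rules select the last three dimensions of the ``expected`` configuration as primitives:

.. code-block:: cpp

    // m-primitive: the m-dimension with unit stride in the first input  (strides_in0 = 1)
    // k-primitive: the k-dimension with unit stride in the second input (strides_in1 = 1)
    // n-primitive: the n-dimension with the smallest strides in the second input and the output (32 each)
    // exec_types therefore become {shared, seq, prim, prim, prim}
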
**Shared Identification**

We added shared identification support to our optimization pass. At most, we can convert dimensions to shared up to the first primitive or
the first sequential k-loop. We only tag as many dimensions as needed, i.e., if the first dimension is already perfectly divisible by the
number of OpenMP threads in use, we do not convert any further dimensions to shared. Additionally, we only convert to shared if the
imbalance ratio of the dimensions is below 1 %:
:math:`(\text{shared\_dimensions\_size} \bmod \text{thread\_count}) / \text{shared\_dimensions\_size} < 1\%`.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_shared_identification(TensorConfig &config)

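As a hypothetical example, assume 4 OpenMP threads and the fused outer n-loop of size 160 from the test in section 5:

.. code-block:: cpp

    // (160 % 4) / 160 = 0, which is below the 1 % imbalance threshold, so the loop is tagged shared;
    // 160 is already a perfect multiple of the thread count, so no further dimensions are converted.
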
3. Lower the optimized IR code to your tensor operation backend
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since our IR is the struct ``TensorConfig``, we only need to pass the configuration to our optimization first and then to our tensor
operation setup. This order ensures that the optimizer produces a valid configuration for the tensor operation.

.. code-block:: cpp

    mini_jit::TensorOperation::error_t mini_jit::TensorOperation::setup(const TensorConfig &config)
    {
        mini_jit::TensorOptimization optimization;
        TensorOperation::config = optimization.optimize(config);

        return setup_no_optimization(TensorOperation::config.dtype, TensorOperation::config.first_touch, TensorOperation::config.main,
                                     TensorOperation::config.last_touch, TensorOperation::config.dim_types, TensorOperation::config.exec_types,
                                     TensorOperation::config.dim_sizes, TensorOperation::config.strides_in0, TensorOperation::config.strides_in1,
                                     TensorOperation::config.strides_out);
    }

The ``optimize`` method of our ``TensorOptimization`` runs the individual optimization passes on the config struct.

.. code-block:: cpp

    mini_jit::TensorConfig mini_jit::TensorOptimization::optimize(TensorConfig config)
    {
        _dimension_reordering_fusing(config);

        _dimension_splitting(config);

        _dimension_fusing(config);

        _primitive_identification(config);

        _dimension_reordering_shared(config);

        // Only call shared identification after reordering: it only parallelizes
        // the leading loops, at most up to the first sequential k-loop.
        _shared_identification(config);
        return config;
    }

4. Benchmark the performance of your implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File: ``TensorOptimization.bench.cpp``

**Matrix multiplication example**

.. code-block:: bash

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark                                                                                                          Time             CPU   Iterations      FLOPS
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
    BM_optimized_tensor_GEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:0/min_warmup_time:0.300_mean      1316172 ns      1303763 ns           10 411.786G/s
    BM_optimized_tensor_GEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:0/min_warmup_time:0.300_median    1313935 ns      1303515 ns           10 411.864G/s
    BM_optimized_tensor_GEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:0/min_warmup_time:0.300_stddev       7770 ns         1120 ns           10   353.7M/s
    BM_optimized_tensor_GEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:0/min_warmup_time:0.300_cv           0.59 %          0.09 %            10      0.09%

**Tensor contraction example**

.. code-block:: bash

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark                                                                                                            Time             CPU   Iterations      FLOPS
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
    BM_optimized_tensor_BRGEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:1/min_warmup_time:0.300_mean      1310327 ns      1295379 ns           10 414.451G/s
    BM_optimized_tensor_BRGEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:1/min_warmup_time:0.300_median    1307359 ns      1295362 ns           10 414.456G/s
    BM_optimized_tensor_BRGEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:1/min_warmup_time:0.300_stddev       8579 ns         1229 ns           10 393.184M/s
    BM_optimized_tensor_BRGEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:1/min_warmup_time:0.300_cv           0.65 %          0.09 %            10      0.09%

5. Demonstrate the capabilities of your optimization passes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We tested our optimization passes in ``TensorOptimization.test.cpp``. One exhaustive test case is shown below. The optimization involves
``reordering``, ``fusing``, ``primitive identification``, and ``shared identification``. In addition to testing the correctness of the
tensor configuration after the optimization passes, we also test the correctness of the tensor operation.

.. code-block:: cpp
    :emphasize-lines: 5-18, 20-33, 35-36

    TEST_CASE("Test tensor operation with optimization dimension test reordering and fusing", "[tensor_optimization][gemm][correctness]")
    {
        using namespace mini_jit;

        mini_jit::TensorConfig config{
            mini_jit::TensorConfig::prim_t::none, // first_touch
            mini_jit::TensorConfig::prim_t::gemm, // main
            mini_jit::TensorConfig::prim_t::none, // last_touch
            {mini_jit::TensorConfig::dim_t::n, mini_jit::TensorConfig::dim_t::k, mini_jit::TensorConfig::dim_t::m, mini_jit::TensorConfig::dim_t::n,
             mini_jit::TensorConfig::dim_t::n, mini_jit::TensorConfig::dim_t::k}, // dim_types
            {mini_jit::TensorConfig::exec_t::seq, mini_jit::TensorConfig::exec_t::seq, mini_jit::TensorConfig::exec_t::seq,
             mini_jit::TensorConfig::exec_t::seq, mini_jit::TensorConfig::exec_t::seq, mini_jit::TensorConfig::exec_t::seq}, // exec_types
            {32, 8, 32, 5, 32, 32},                // dim_sizes
            {0, 1024, 1, 0, 0, 32},                // strides_in0
            {8192, 1024, 0, 8192 * 32, 32, 1},     // strides_in1
            {1024, 0, 1, 32768, 32, 0},            // strides_out
            mini_jit::TensorConfig::dtype_t::fp32, // dtype
        };

        mini_jit::TensorConfig expected{
            mini_jit::TensorConfig::prim_t::none, // first_touch
            mini_jit::TensorConfig::prim_t::gemm, // main
            mini_jit::TensorConfig::prim_t::none, // last_touch
            {mini_jit::TensorConfig::dim_t::n, mini_jit::TensorConfig::dim_t::k, mini_jit::TensorConfig::dim_t::m, mini_jit::TensorConfig::dim_t::n,
             mini_jit::TensorConfig::dim_t::k}, // dim_types
            {mini_jit::TensorConfig::exec_t::shared, mini_jit::TensorConfig::exec_t::seq, mini_jit::TensorConfig::exec_t::prim,
             mini_jit::TensorConfig::exec_t::prim, mini_jit::TensorConfig::exec_t::prim}, // exec_types
            {5 * 32, 8, 32, 32, 32},               // dim_sizes
            {0, 1024, 1, 0, 32},                   // strides_in0
            {8192, 1024, 0, 32, 1},                // strides_in1
            {1024, 0, 1, 32, 0},                   // strides_out
            mini_jit::TensorConfig::dtype_t::fp32, // dtype
        };

        mini_jit::TensorOperation tensor_op;
        TensorOperation::error_t err = tensor_op.setup(config);

        INFO(tensor_op.get_config().to_string());

        REQUIRE(err == TensorOperation::error_t::success);
        REQUIRE_FALSE(mini_jit::TensorConfig::equals(config, tensor_op.get_config()));
        REQUIRE(mini_jit::TensorConfig::equals(expected, tensor_op.get_config()));

        GenerationTest test(32, 32, 32, 32 * 1 * 32 * 8 * 1 * 1, 32 * 32 * 1 * 8 * 32 * 5, 1 * 32 * 32 * 1 * 32 * 5);
        test.SetUp(TestInfill::Random);

        tensor_op.execute(test.matrix_a.data(), test.matrix_b.data(), test.matrix_c.data());

        for (int64_t i0 = 0; i0 < expected.dim_sizes[0]; i0++)
        {
            for (int64_t i1 = 0; i1 < expected.dim_sizes[1]; i1++)
            {
                uint64_t offset_a = i0 * expected.strides_in0[0] + i1 * expected.strides_in0[1];
                uint64_t offset_b = i0 * expected.strides_in1[0] + i1 * expected.strides_in1[1];
                uint64_t offset_c = i0 * expected.strides_out[0] + i1 * expected.strides_out[1];
                test.naive_matmul_M_N_K_Batch(test.matrix_a.data() + offset_a, test.matrix_b.data() + offset_b,
                                              test.matrix_c_verify.data() + offset_c, 32, 32, 32, 32 * 32, 32 * 32);
            }
        }

        test.verify_matmul(test.matrix_c_verify.data(), test.matrix_c.data(), test.matrix_c.size());
    }