Commit caf8c0b

committed
feat: optimization passes; doc; tests
1 parent 91fa35f commit caf8c0b

File tree

12 files changed: +923 -107 lines changed

docs_sphinx/submissions/report_25_06_06.rst

Lines changed: 218 additions & 0 deletions
@@ -120,14 +120,232 @@ Optimization Passes
1. IR that supports transformations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We created a struct ``TensorConfig`` in ``TensorConfig.h`` to support transformations and optimization passes on our tensor operation.
This configuration contains all the input data for our tensor operation. Before handing the configuration over to the tensor operation
setup, we run our optimization passes over it. We also added an ``equals(const TensorConfig &config1, const TensorConfig config2)`` and a
``to_string()`` method for testing purposes.
2. Implement optimization passes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Dimension Reordering Fusing**

We added dimension reordering to our optimization passes to improve dimension fusion.
The idea is to move a dimension X next to a dimension Y if both have the same type and the condition ``Stride(X) = |Y| * Stride(Y)`` is met.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_dimension_reordering_fusing(TensorConfig &config)
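As a minimal sketch of this move condition (the helper name and the flat vectors are our own illustration, not the actual implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the actual pass: dimension x may be moved next
// to dimension y when both have the same type and
// stride(x) == |y| * stride(y), i.e. x iterates over whole blocks of y.
bool can_move_next_to(const std::vector<int> &types, const std::vector<int64_t> &sizes,
                      const std::vector<int64_t> &strides, std::size_t x, std::size_t y)
{
    return types[x] == types[y] && strides[x] == sizes[y] * strides[y];
}
```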
**Dimension Splitting**

We added dimension splitting to our optimization passes. The idea is to check whether any dimension size is greater than or equal to 256.
If so, we try to split the dimension in two: starting at the floor of the square root of the dimension size, we test whether the candidate
is a divisor, and otherwise decrement it until reaching 2. If a divisor is found, the dimension is split.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_dimension_splitting(TensorConfig &config)
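The divisor search described above can be sketched as follows (a standalone illustration; ``split_dimension`` is a hypothetical name, not the real pass):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical sketch of the splitting rule: starting at floor(sqrt(n)),
// search downward for a divisor of n and return {outer, inner} sizes;
// {1, n} means no divisor >= 2 exists (n is prime), so no split happens.
std::pair<int64_t, int64_t> split_dimension(int64_t n)
{
    for (int64_t d = static_cast<int64_t>(std::sqrt(static_cast<double>(n))); d >= 2; --d)
        if (n % d == 0)
            return {d, n / d};
    return {1, n};
}
```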
**Dimension Fusing**

We added dimension fusion to our optimization passes. The idea is to check whether two neighboring dimensions X and Y have the same
dimension type and whether the product of their sizes is less than or equal to 256. We also check that the condition
``Stride(X) = |Y| * Stride(Y)`` holds. If so, we fuse the two dimensions.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_dimension_fusing(TensorConfig &config)
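The fusion step itself then only has to merge the sizes and drop the outer entry; as a sketch over flat vectors (hypothetical helper, not the real pass):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: fuse the outer dimension i into the inner dimension
// i+1, assuming the type/size/stride conditions above already hold.
// The fused size is the product; the fused stride is the inner stride.
void fuse_dims(std::vector<int64_t> &sizes, std::vector<int64_t> &strides, std::size_t i)
{
    sizes[i + 1] *= sizes[i];
    sizes.erase(sizes.begin() + i);
    strides.erase(strides.begin() + i); // keep the inner (smaller) stride
}
```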
**Dimension Reordering Shared**

We added dimension reordering to our optimization passes for better shared identification. We reorder sequential loops with other
sequential loops and shared loops with other shared loops. We sort by the strides, but penalize k-dimensions and repeated dimension
types: we sum the squared strides, divide the sum by eight for a k-dimension, and divide it by two for a dimension whose type repeats
the previous one, excluding the c-dimension.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_dimension_reordering_shared(TensorConfig &config)
    {
        ...
        uint64_t value = (*jStrideIn0 * *jStrideIn0) + (*jStrideIn1 * *jStrideIn1) + (*jStrideOut * *jStrideOut);

        // value/8 if we have a k-dimension
        value >>= (*jDim == TensorConfig::dim_t::k) * 3;

        // value/2 if we have the same dimension type as the last dimension, but not for the c-dimension
        value >>= (*jDim == previous_dim && *jDim != TensorConfig::dim_t::c) * 1;
        ...
    }
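For illustration, the sort key from the snippet above can be reproduced in isolation (hypothetical free function mirroring the shown logic):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical re-statement of the sort key above: the sum of squared
// strides, divided by 8 (>> 3) for a k-dimension and by 2 (>> 1) when the
// dimension type repeats the previous one (except for the c-dimension).
uint64_t reorder_key(int64_t s_in0, int64_t s_in1, int64_t s_out, bool is_k, bool repeats_prev_non_c)
{
    uint64_t value = static_cast<uint64_t>(s_in0 * s_in0) + static_cast<uint64_t>(s_in1 * s_in1) +
                     static_cast<uint64_t>(s_out * s_out);
    value >>= is_k ? 3 : 0;               // value/8 for a k-dimension
    value >>= repeats_prev_non_c ? 1 : 0; // value/2 for a repeated dimension type
    return value;
}
```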
**Primitive Identification**

We added primitive identification support to our optimization passes.
The following rules are applied based on the dimension type:

- m-dimension: search for an m-dimension with unit stride in the first input
- n-dimension: search the second input and the output for the smallest stride
- k-dimension: only applies to GEMM or BRGEMM; search for unit stride in the second input
- second k-dimension: only applies to BRGEMM; search for the smallest stride in the first or second input, excluding the already chosen k-dimension

Additionally, we do not modify any primitives already chosen by the user.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_primitive_identification(TensorConfig &config)
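The m-rule, for example, reduces to a unit-stride scan (sketch with hypothetical names; dimension types abbreviated as chars for brevity):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the m-dimension rule: pick an m-typed dimension
// whose stride in the first input is 1 (unit stride); returns -1 if none.
int find_unit_stride_m(const std::vector<char> &dim_types, const std::vector<int64_t> &strides_in0)
{
    for (std::size_t i = 0; i < dim_types.size(); ++i)
        if (dim_types[i] == 'm' && strides_in0[i] == 1)
            return static_cast<int>(i);
    return -1;
}
```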
**Shared Identification**

We added shared identification support to our optimization passes. At most, we convert dimensions to shared until the first primitive or
the first sequential k-dimension arises. We only tag as many dimensions as needed for sharing, i.e., if the first dimension is already
perfectly divisible by the number of OpenMP threads in use, we do not convert any further dimensions to shared. Additionally, we only
convert to shared if the imbalance ratio of the shared dimensions stays below 1%:
:math:`(\text{shared\_dimensions\_size} \bmod \text{thread\_count}) / \text{shared\_dimensions\_size} < 1\%`.

.. code-block:: cpp

    void mini_jit::TensorOptimization::_shared_identification(TensorConfig &config)
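The 1% balance criterion can be stated as a tiny predicate (hypothetical helper mirroring the formula above):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the balance criterion: mark loops as shared only
// if the remainder left over when distributing the shared iteration space
// across the threads is below 1% of that iteration space.
bool balanced_enough(int64_t shared_size, int64_t thread_count)
{
    return static_cast<double>(shared_size % thread_count) / static_cast<double>(shared_size) < 0.01;
}
```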
3. Lower the optimized IR code to your tensor operation backend
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since our IR is the struct ``TensorConfig``, we only need to pass the configuration to our optimization and then on to our tensor
operation setup. This order ensures that the optimizer creates a valid configuration for the tensor operation.

.. code-block:: cpp

    mini_jit::TensorOperation::error_t mini_jit::TensorOperation::setup(const TensorConfig &config)
    {
        mini_jit::TensorOptimization optimization;
        TensorOperation::config = optimization.optimize(config);

        return setup_no_optimization(TensorOperation::config.dtype, TensorOperation::config.first_touch, TensorOperation::config.main,
                                     TensorOperation::config.last_touch, TensorOperation::config.dim_types, TensorOperation::config.exec_types,
                                     TensorOperation::config.dim_sizes, TensorOperation::config.strides_in0, TensorOperation::config.strides_in1,
                                     TensorOperation::config.strides_out);
    }

The ``optimize`` method of ``TensorOptimization`` executes the individual optimization passes on the config struct.

.. code-block:: cpp

    mini_jit::TensorConfig mini_jit::TensorOptimization::optimize(TensorConfig config)
    {
        _dimension_reordering_fusing(config);

        _dimension_splitting(config);

        _dimension_fusing(config);

        _primitive_identification(config);

        _dimension_reordering_shared(config);

        // Run shared identification only after reordering; it parallelizes at most the first loops up to the first sequential k-loop
        _shared_identification(config);
        return config;
    }
4. Benchmark the performance of your implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File: ``TensorOptimization.bench.cpp``

**Matrix multiplication example**

.. code-block:: text

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark                                                                                                          Time             CPU   Iterations        FLOPS
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
    BM_optimized_tensor_GEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:0/min_warmup_time:0.300_mean     1316172 ns      1303763 ns           10   411.786G/s
    BM_optimized_tensor_GEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:0/min_warmup_time:0.300_median   1313935 ns      1303515 ns           10   411.864G/s
    BM_optimized_tensor_GEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:0/min_warmup_time:0.300_stddev      7770 ns         1120 ns           10     353.7M/s
    BM_optimized_tensor_GEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:0/min_warmup_time:0.300_cv          0.59 %          0.09 %           10        0.09%

**Tensor contraction example**

.. code-block:: text

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark                                                                                                          Time             CPU   Iterations        FLOPS
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
    BM_optimized_tensor_BRGEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:1/min_warmup_time:0.300_mean     1310327 ns      1295379 ns           10   414.451G/s
    BM_optimized_tensor_BRGEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:1/min_warmup_time:0.300_median   1307359 ns      1295362 ns           10   414.456G/s
    BM_optimized_tensor_BRGEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:1/min_warmup_time:0.300_stddev      8579 ns         1229 ns           10 393.184M/s
    BM_optimized_tensor_BRGEMM/size_a:2560000/size_b:2560000/size_c:2560000/config:1/min_warmup_time:0.300_cv          0.65 %          0.09 %           10        0.09%
5. Demonstrate the capabilities of your optimization passes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We tested our optimization passes in ``TensorOptimization.test.cpp``. One exhaustive test case is shown below. This optimization involves
``reordering``, ``fusing``, ``primitive identification``, and ``shared identification``. In addition to testing the correctness of the
tensor configuration after the optimization passes, we also test the correctness of the tensor operation.

.. code-block:: cpp
    :emphasize-lines: 5-18, 20-33, 35-36

    TEST_CASE("Test tensor operation with optimization dimension test reordering and fusing", "[tensor_optimization][gemm][correctness]")
    {
        using namespace mini_jit;

        mini_jit::TensorConfig config{
            mini_jit::TensorConfig::prim_t::none, // first_touch
            mini_jit::TensorConfig::prim_t::gemm, // main
            mini_jit::TensorConfig::prim_t::none, // last touch
            {mini_jit::TensorConfig::dim_t::n, mini_jit::TensorConfig::dim_t::k, mini_jit::TensorConfig::dim_t::m, mini_jit::TensorConfig::dim_t::n,
             mini_jit::TensorConfig::dim_t::n, mini_jit::TensorConfig::dim_t::k}, // dim_types
            {mini_jit::TensorConfig::exec_t::seq, mini_jit::TensorConfig::exec_t::seq, mini_jit::TensorConfig::exec_t::seq,
             mini_jit::TensorConfig::exec_t::seq, mini_jit::TensorConfig::exec_t::seq, mini_jit::TensorConfig::exec_t::seq}, // exec_types
            {32, 8, 32, 5, 32, 32}, // dim_sizes
            {0, 1024, 1, 0, 0, 32}, // strides_in0
            {8192, 1024, 0, 8192 * 32, 32, 1}, // strides_in1
            {1024, 0, 1, 32768, 32, 0}, // strides_out
            mini_jit::TensorConfig::dtype_t::fp32, // dtype_t
        };

        mini_jit::TensorConfig expected{
            mini_jit::TensorConfig::prim_t::none, // first_touch
            mini_jit::TensorConfig::prim_t::gemm, // main
            mini_jit::TensorConfig::prim_t::none, // last touch
            {mini_jit::TensorConfig::dim_t::n, mini_jit::TensorConfig::dim_t::k, mini_jit::TensorConfig::dim_t::m, mini_jit::TensorConfig::dim_t::n,
             mini_jit::TensorConfig::dim_t::k}, // dim_types
            {mini_jit::TensorConfig::exec_t::shared, mini_jit::TensorConfig::exec_t::seq, mini_jit::TensorConfig::exec_t::prim,
             mini_jit::TensorConfig::exec_t::prim, mini_jit::TensorConfig::exec_t::prim}, // exec_types
            {5 * 32, 8, 32, 32, 32}, // dim_sizes
            {0, 1024, 1, 0, 32}, // strides_in0
            {8192, 1024, 0, 32, 1}, // strides_in1
            {1024, 0, 1, 32, 0}, // strides_out
            mini_jit::TensorConfig::dtype_t::fp32, // dtype_t
        };

        mini_jit::TensorOperation tensor_op;
        TensorOperation::error_t err = tensor_op.setup(config);

        INFO(tensor_op.get_config().to_string());

        REQUIRE(err == TensorOperation::error_t::success);
        REQUIRE_FALSE(mini_jit::TensorConfig::equals(config, tensor_op.get_config()));
        REQUIRE(mini_jit::TensorConfig::equals(expected, tensor_op.get_config()));

        GenerationTest test(32, 32, 32, 32 * 1 * 32 * 8 * 1 * 1, 32 * 32 * 1 * 8 * 32 * 5, 1 * 32 * 32 * 1 * 32 * 5);
        test.SetUp(TestInfill::Random);

        tensor_op.execute(test.matrix_a.data(), test.matrix_b.data(), test.matrix_c.data());

        for (int64_t i0 = 0; i0 < expected.dim_sizes[0]; i0++)
        {
            for (int64_t i1 = 0; i1 < expected.dim_sizes[1]; i1++)
            {
                uint64_t offset_a = i0 * expected.strides_in0[0] + i1 * expected.strides_in0[1];
                uint64_t offset_b = i0 * expected.strides_in1[0] + i1 * expected.strides_in1[1];
                uint64_t offset_c = i0 * expected.strides_out[0] + i1 * expected.strides_out[1];
                test.naive_matmul_M_N_K_Batch(test.matrix_a.data() + offset_a, test.matrix_b.data() + offset_b,
                                              test.matrix_c_verify.data() + offset_c, 32, 32, 32, 32 * 32, 32 * 32);
            }
        }

        test.verify_matmul(test.matrix_c_verify.data(), test.matrix_c.data(), test.matrix_c.size());
    }

src/main/TensorConfig.cpp

Lines changed: 43 additions & 0 deletions
@@ -1,5 +1,7 @@
 #include "TensorConfig.h"
 #include <algorithm>
+#include <cstdint>
+#include <string>

 bool mini_jit::TensorConfig::equals(const TensorConfig &config1, const TensorConfig config2)
 {
@@ -14,4 +16,45 @@ bool mini_jit::TensorConfig::equals(const TensorConfig &config1, const TensorCon
            std::equal(config1.strides_in0.begin(), config1.strides_in0.end(), config2.strides_in0.begin()) &&
            std::equal(config1.strides_in1.begin(), config1.strides_in1.end(), config2.strides_in1.begin()) &&
            std::equal(config1.strides_out.begin(), config1.strides_out.end(), config2.strides_out.begin());
+}
+
+std::string mini_jit::TensorConfig::to_string() const
+{
+    std::string result = "TensorConfig: {\n";
+    result += " first_touch: " + std::to_string(static_cast<uint32_t>(first_touch)) + ",\n";
+    result += " main: " + std::to_string(static_cast<uint32_t>(main)) + ",\n";
+    result += " last_touch: " + std::to_string(static_cast<uint32_t>(last_touch)) + ",\n";
+    result += " dtype: " + std::to_string(static_cast<uint32_t>(dtype)) + ",\n";
+
+    result += " dim_types: [ ";
+    for (const auto &dim : dim_types)
+        result += std::to_string(static_cast<uint32_t>(dim)) + " ";
+    result += "],\n";
+
+    result += " exec_types: [ ";
+    for (const auto &exec : exec_types)
+        result += std::to_string(static_cast<uint32_t>(exec)) + " ";
+    result += "],\n";
+
+    result += " dim_sizes: [ ";
+    for (const auto &size : dim_sizes)
+        result += std::to_string(size) + " ";
+    result += "],\n";
+
+    result += " strides_in0: [ ";
+    for (const auto &stride : strides_in0)
+        result += std::to_string(stride) + " ";
+    result += "],\n";
+
+    result += " strides_in1: [ ";
+    for (const auto &stride : strides_in1)
+        result += std::to_string(stride) + " ";
+    result += "],\n";
+
+    result += " strides_out: [ ";
+    for (const auto &stride : strides_out)
+        result += std::to_string(stride) + " ";
+    result += "]\n}";
+
+    return result;
 }

src/main/TensorConfig.h

Lines changed: 8 additions & 0 deletions
@@ -2,6 +2,7 @@
 #define MINI_JIT_TENSORCONFIG_H

 #include <cstdint>
+#include <string>
 #include <vector>

 namespace mini_jit
@@ -73,6 +74,13 @@ namespace mini_jit
         /// @brief The data type to be used in the tensor operation.
         dtype_t dtype;

+        /**
+         * @brief Converts the config to a string.
+         *
+         * @return std::string The string representation
+         */
+        std::string to_string() const;
+
         /**
          * @brief Compares the two configuration and check if all values are equal.
          *

src/main/TensorOperation.cpp

Lines changed: 14 additions & 3 deletions
@@ -1,4 +1,5 @@
 #include "TensorOperation.h"
+#include "TensorOptimization.h"
 #include "release_assert.h"
 #include <iostream>
 #include <omp.h>
@@ -307,11 +308,16 @@ mini_jit::Unary::error_t mini_jit::TensorOperation::generateUnary(Unary &unary,

 mini_jit::TensorOperation::error_t mini_jit::TensorOperation::setup(const TensorConfig &config)
 {
-    return setup(config.dtype, config.first_touch, config.main, config.last_touch, config.dim_types, config.exec_types, config.dim_sizes,
-                 config.strides_in0, config.strides_in1, config.strides_out);
+    mini_jit::TensorOptimization optimization;
+    TensorOperation::config = optimization.optimize(config);
+
+    return setup_no_optimization(TensorOperation::config.dtype, TensorOperation::config.first_touch, TensorOperation::config.main,
+                                 TensorOperation::config.last_touch, TensorOperation::config.dim_types, TensorOperation::config.exec_types,
+                                 TensorOperation::config.dim_sizes, TensorOperation::config.strides_in0, TensorOperation::config.strides_in1,
+                                 TensorOperation::config.strides_out);
 }

-mini_jit::TensorOperation::error_t mini_jit::TensorOperation::setup(
+mini_jit::TensorOperation::error_t mini_jit::TensorOperation::setup_no_optimization(
     TensorConfig::dtype_t dtype, TensorConfig::prim_t prim_first_touch, TensorConfig::prim_t prim_main, TensorConfig::prim_t prim_last_touch,
     std::span<const TensorConfig::dim_t> dim_types, std::span<const TensorConfig::exec_t> exec_types, std::span<const int64_t> dim_sizes,
     std::span<const int64_t> strides_in0, std::span<const int64_t> strides_in1, std::span<const int64_t> strides_out)
@@ -800,4 +806,9 @@ void mini_jit::TensorOperation::execute_dimension_parallel(int64_t index_dim, ch
         }
     }
 }
+}
+
+mini_jit::TensorConfig mini_jit::TensorOperation::get_config()
+{
+    return config;
 }

src/main/TensorOperation.h

Lines changed: 13 additions & 4 deletions
@@ -48,6 +48,7 @@ namespace mini_jit

     private:
         // Keep track over configuration parameters
+        TensorConfig config;
         TensorConfig::dtype_t dtype;
         TensorConfig::prim_t prim_first = TensorConfig::prim_t::none;
         TensorConfig::prim_t prim_main = TensorConfig::prim_t::none;
@@ -191,10 +192,11 @@ namespace mini_jit
          * @param strides_out Strides of the output tensor.
          * @return error_t::success on success, another error_t value otherwise.
          **/
-        error_t setup(TensorConfig::dtype_t dtype, TensorConfig::prim_t prim_first_touch, TensorConfig::prim_t prim_main,
-                      TensorConfig::prim_t prim_last_touch, std::span<const TensorConfig::dim_t> dim_types,
-                      std::span<const TensorConfig::exec_t> exec_types, std::span<const int64_t> dim_sizes,
-                      std::span<const int64_t> strides_in0, std::span<const int64_t> strides_in1, std::span<const int64_t> strides_out);
+        error_t setup_no_optimization(TensorConfig::dtype_t dtype, TensorConfig::prim_t prim_first_touch, TensorConfig::prim_t prim_main,
+                                      TensorConfig::prim_t prim_last_touch, std::span<const TensorConfig::dim_t> dim_types,
+                                      std::span<const TensorConfig::exec_t> exec_types, std::span<const int64_t> dim_sizes,
+                                      std::span<const int64_t> strides_in0, std::span<const int64_t> strides_in1,
+                                      std::span<const int64_t> strides_out);

         /**
          * Execute the tensor operation.
@@ -232,6 +234,13 @@ namespace mini_jit
          **/
         void execute_dimension_parallel(int64_t index_dimension, char const *ptr_in0, char const *ptr_in1, char *ptr_out, bool first_access,
                                         bool last_access);
+
+        /**
+         * @brief Get the current configuration object.
+         *
+         * @return TensorConfig used by the Tensor operation.
+         */
+        TensorConfig get_config();
     };
 }; // namespace mini_jit
