@@ -183,10 +183,10 @@ These 3 different ``fmla`` blocks get repeated with ``.rept 2`` to achieve the
 
 **Benchmarks**
 
-We run the benchmark with the following command:
+We run the benchmark with the following command:
 
-.. code-block::
-
+.. code-block::
+
     ./benchmarks --benchmark_counters_tabular=true --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
 
 Therefore we do 10 repetitions of the benchmark which do about ``120 000 000`` iterations each on our matmul kernels.
@@ -197,17 +197,17 @@ Therefore we do 10 repetitions of the benchmark which do about ``120 000 000`` i
 ----------------------------------------------------------------------------------------------------------------------------------
 Benchmark Time CPU Iterations FLOPS
 ----------------------------------------------------------------------------------------------------------------------------------
-Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_mean 5.89 ns 5.87 ns 10 32.7048G/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_median 5.89 ns 5.87 ns 10 32.7228G/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_stddev 0.046 ns 0.044 ns 10 244.331M/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_cv 0.77 % 0.75 % 10 0.75 %
-Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_mean 5.74 ns 5.72 ns 10 33.5453G/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_median 5.73 ns 5.71 ns 10 33.6103G/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_stddev 0.051 ns 0.050 ns 10 291.918M/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_cv 0.90 % 0.88 % 10 0.87 %
+Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_mean 5.84 ns 5.82 ns 10 33.0036G/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_median 5.83 ns 5.81 ns 10 33.0317G/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_stddev 0.025 ns 0.025 ns 10 143.339M/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_cv 0.43 % 0.44 % 10 0.43 %
+Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_mean 5.71 ns 5.69 ns 10 33.7234G/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_median 5.70 ns 5.68 ns 10 33.7732G/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_stddev 0.038 ns 0.038 ns 10 224.892M/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_cv 0.67 % 0.67 % 10 0.67 %
 
-We see that the simple first implementation of our matmul kernel gets about **32.7 GFLOPS**.
-The optimized unrolled version gets about 0.8 GFLOPS more resulting in **33.5 GFLOPS**.
+We see that the simple first implementation of our matmul kernel gets about **33.0 GFLOPS**.
+The optimized unrolled version gets about 0.7 GFLOPS more resulting in **33.7 GFLOPS**.
 
 
 Loops
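
The first table above is produced by Google Benchmark fixtures: the ``min_warmup_time:1.000`` suffix and the ``_mean``/``_median``/``_stddev``/``_cv`` rows are what ``--benchmark_repetitions=10`` together with ``--benchmark_report_aggregates_only=true`` emit. A minimal sketch of how such a fixture and its FLOPS rate counter could be wired up is shown below; the kernel signature, the operand layout, and the per-call FLOP count of 2 · 16 · 6 · 1 = 192 (which matches roughly 33.7 GFLOPS at about 5.7 ns per call) are assumptions for illustration, not taken from the repository.

.. code-block:: cpp

    #include <cstdint>
    #include <vector>

    #include <benchmark/benchmark.h>

    // Assumed signature of the assembly microkernel: C(16x6) += A(16x1) * B(1x6),
    // all operands column-major with leading dimensions lda/ldb/ldc.
    extern "C" void matmul_16_6_1(float const* a, float const* b, float* c,
                                  int64_t lda, int64_t ldb, int64_t ldc);

    class Gemm16x6x1Fixture : public benchmark::Fixture {
     public:
      void SetUp(const benchmark::State&) override {
        a.assign(16, 1.0f);      // A is 16x1
        b.assign(6, 1.0f);       // B is 1x6
        c.assign(16 * 6, 0.0f);  // C is 16x6
      }
      std::vector<float> a, b, c;
    };

    BENCHMARK_DEFINE_F(Gemm16x6x1Fixture, BM_matmul_16_6_1_unrolled)(benchmark::State& state) {
      for (auto _ : state) {
        matmul_16_6_1(a.data(), b.data(), c.data(), /*lda=*/16, /*ldb=*/1, /*ldc=*/16);
        benchmark::DoNotOptimize(c.data());
        benchmark::ClobberMemory();
      }
      // 16 * 6 * 1 = 96 fused multiply-adds, i.e. 2 * 96 = 192 FLOPs per call;
      // kIsRate turns the total into the per-second figure in the FLOPS column.
      state.counters["FLOPS"] = benchmark::Counter(
          192.0 * static_cast<double>(state.iterations()), benchmark::Counter::kIsRate);
    }
    BENCHMARK_REGISTER_F(Gemm16x6x1Fixture, BM_matmul_16_6_1_unrolled)->MinWarmUpTime(1.0);

    BENCHMARK_MAIN();
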
@@ -395,7 +395,7 @@ Loops
 
 **Optimization**
 
-Usage of already optmiized `matmul_16_6_1` from task 2.
+Usage of already optimized `matmul_16_6_1` from task 2.
 
 **Benchmarks**
 
@@ -412,20 +412,20 @@ We run the benchmark with the following command:
 ----------------------------------------------------------------------------------------------------------------------------------
 Benchmark Time CPU Iterations FLOPS
 ----------------------------------------------------------------------------------------------------------------------------------
-GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_mean 396 ns 396 ns 10 31.0266G/s
-GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_median 396 ns 396 ns 10 31.0281G/s
-GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_stddev 0.069 ns 0.057 ns 10 4.50274M/s
-GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_cv 0.02 % 0.01 % 10 0.01 %
-GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_mean 1728 ns 1728 ns 10 28.4438G/s
-GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_median 1728 ns 1728 ns 10 28.4445G/s
-GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_stddev 0.115 ns 0.106 ns 10 1.7484M/s
-GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_cv 0.01 % 0.01 % 10 0.01 %
-GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_mean 13078 ns 13077 ns 10 22.5524G/s
-GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_median 13078 ns 13077 ns 10 22.552G/s
-GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_stddev 1.83 ns 1.60 ns 10 2.76464M/s
-GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_cv 0.01 % 0.01 % 10 0.01 %
-
-
-- Mean FLOPS for loop over K: **31.0 GFLOPS**.
-- Mean FLOPS for loop over M: **28.4 GFLOPS**.
-- Mean FLOPS for loop over N: **22.6 GFLOPS**.
+GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_mean 368 ns 367 ns 10 33.4632G/s
+GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_median 368 ns 367 ns 10 33.5034G/s
+GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_stddev 1.78 ns 1.75 ns 10 158.697M/s
+GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_cv 0.48 % 0.48 % 10 0.47 %
+GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_mean 1526 ns 1520 ns 10 32.3285G/s
+GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_median 1526 ns 1520 ns 10 32.3321G/s
+GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_stddev 10.2 ns 9.97 ns 10 211.542M/s
+GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_cv 0.67 % 0.66 % 10 0.65 %
+GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_mean 12177 ns 12135 ns 10 24.3028G/s
+GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_median 12167 ns 12126 ns 10 24.3211G/s
+GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_stddev 54.9 ns 54.1 ns 10 107.995M/s
+GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_cv 0.45 % 0.45 % 10 0.44 %
+
+
+- Mean FLOPS for loop over K: **33.5 GFLOPS**.
+- Mean FLOPS for loop over M: **32.3 GFLOPS**.
+- Mean FLOPS for loop over N: **24.3 GFLOPS**.
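
The three results above correspond to progressively larger loop nests around the ``matmul_16_6_1`` microkernel from task 2: a K loop gives ``matmul_16_6_64``, an M loop around that gives ``matmul_64_6_64``, and an N loop around that gives ``matmul_64_48_64``. A C-level sketch of that nesting is shown below; the real kernels are presumably written directly in assembly, and the column-major layout and leading-dimension arguments are assumptions for illustration.

.. code-block:: cpp

    #include <cstdint>

    // Assumed C-level view of the task-2 microkernel:
    // C(16x6) += A(16x1) * B(1x6), column-major, leading dimensions lda/ldb/ldc.
    extern "C" void matmul_16_6_1(float const* a, float const* b, float* c,
                                  int64_t lda, int64_t ldb, int64_t ldc);

    // Loop over K: C(16x6) += A(16x64) * B(64x6).
    // Step k adds the rank-1 update of A's k-th column with B's k-th row.
    void matmul_16_6_64(float const* a, float const* b, float* c,
                        int64_t lda, int64_t ldb, int64_t ldc) {
      for (int64_t k = 0; k < 64; ++k) {
        matmul_16_6_1(a + k * lda, b + k, c, lda, ldb, ldc);
      }
    }

    // Loop over M around the K loop: C(64x6) += A(64x64) * B(64x6),
    // handling 16 rows of A and C per iteration.
    void matmul_64_6_64(float const* a, float const* b, float* c,
                        int64_t lda, int64_t ldb, int64_t ldc) {
      for (int64_t m = 0; m < 64; m += 16) {
        matmul_16_6_64(a + m, b, c + m, lda, ldb, ldc);
      }
    }

    // Loop over N around the M loop: C(64x48) += A(64x64) * B(64x48),
    // handling 6 columns of B and C per iteration.
    void matmul_64_48_64(float const* a, float const* b, float* c,
                         int64_t lda, int64_t ldb, int64_t ldc) {
      for (int64_t n = 0; n < 48; n += 6) {
        matmul_64_6_64(a, b + n * ldb, c + n * ldc, lda, ldb, ldc);
      }
    }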