@@ -183,10 +183,10 @@ These 3 different ``fmla`` blocks gets repeated with ``.rept 2`` to achieve the
183183
184184 **Benchmarks**
185185
186- We run the benchmark with the following command:
186+ We run the benchmark with the following command:
187187
188- .. code-block::
189-
188+ .. code-block::
189+
190190 ./benchmarks --benchmark_counters_tabular=true --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
191191
192192 Therefore we do 10 repetitions of the benchmark which do about ``120 000 000`` iterations each on our matmul kernels.
@@ -197,17 +197,17 @@ Therefore we do 10 repetitions of the benchmark which do about ``120 000 000`` i
197197 ----------------------------------------------------------------------------------------------------------------------------------
198198 Benchmark Time CPU Iterations FLOPS
199199 ----------------------------------------------------------------------------------------------------------------------------------
200- Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_mean 5.89 ns 5.87 ns 10 32.7048G/s
201- Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_median 5.89 ns 5.87 ns 10 32.7228G/s
202- Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_stddev 0.046 ns 0.044 ns 10 244.331M/s
203- Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_cv 0.77 % 0.75 % 10 0.75 %
204- Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_mean 5.74 ns 5.72 ns 10 33.5453G/s
205- Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_median 5.73 ns 5.71 ns 10 33.6103G/s
206- Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_stddev 0.051 ns 0.050 ns 10 291.918M/s
207- Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_cv 0.90 % 0.88 % 10 0.87%
200+ Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_mean 5.84 ns 5.82 ns 10 33.0036G/s
201+ Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_median 5.83 ns 5.81 ns 10 33.0317G/s
202+ Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_stddev 0.025 ns 0.025 ns 10 143.339M/s
203+ Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_cv 0.43 % 0.44 % 10 0.43 %
204+ Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_mean 5.71 ns 5.69 ns 10 33.7234G/s
205+ Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_median 5.70 ns 5.68 ns 10 33.7732G/s
206+ Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_stddev 0.038 ns 0.038 ns 10 224.892M/s
207+ Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_cv 0.67 % 0.67 % 10 0.67
208208
209- We see that the simple first implementation of our matmul kernel gets about **32.7 GFLOPS**.
210- The optimized unrolled version gets about 0.8 GFLOPS more resulting in **33.5 GFLOPS**.
209+ We see that the simple first implementation of our matmul kernel gets about **33.0 GFLOPS**.
210+ The optimized unrolled version gets about 0.7 GFLOPS more resulting in **33.7 GFLOPS**.
211211
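As a rough cross-check, the FLOPS column can be approximately reproduced from the reported mean CPU time. The sketch below assumes that the counter counts ``2 * M * N * K`` floating-point operations per kernel call (one multiply and one add per element of the 16x6 output for ``K = 1``) and that it is normalized by CPU time; neither assumption is confirmed by the benchmark source shown here. The helper names ``matmul_flops`` and ``gflops`` are hypothetical and only serve this illustration.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C += A * B, assuming one multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k


def gflops(flops_per_call: int, seconds_per_call: float) -> float:
    """Convert FLOPs per call and time per call into GFLOPS."""
    return flops_per_call / seconds_per_call / 1e9


# 16x6x1 kernel: 2 * 16 * 6 * 1 = 192 FLOPs per call (assumed counting scheme).
flops = matmul_flops(16, 6, 1)

# Mean CPU times taken from the updated benchmark table (rounded to 0.01 ns there).
simple = gflops(flops, 5.82e-9)
unrolled = gflops(flops, 5.69e-9)

print(f"simple:   {simple:.1f} GFLOPS")
print(f"unrolled: {unrolled:.1f} GFLOPS")
print(f"gain:     {unrolled - simple:.1f} GFLOPS")
```

Because the times in the table are rounded, the recomputed values only agree with the FLOPS column to about one decimal place.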
212212
213213Loops
@@ -438,7 +438,7 @@ Loops
438438
439439**Optimization**
440440
441- Usage of already optmiized `matmul_16_6_1` from task 2.
441+ Usage of already optimized `matmul_16_6_1` from task 2.
442442
443443**Benchmarks**
444444