@@ -17,7 +17,47 @@ Algorithms implemented:
 * `nonsimd`: vertical sum over lanes with a reduce at the end using Rust arrays
 * `naive`: sum using Rust iterators
 
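The difference between the `nonsimd` and `naive` variants listed above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names, the `LANES` constant, and the choice of 16 lanes are assumptions made for the example.

```rust
/// Number of independent accumulators. 16 is an assumption for this sketch.
const LANES: usize = 16;

/// Vertical sum: keep `LANES` running sums in a plain Rust array and
/// reduce them to a single value only once, at the end.
pub fn nonsimd_sum(values: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];
    let chunks = values.chunks_exact(LANES);
    // Elements that do not fill a whole chunk are handled separately.
    let remainder = chunks.remainder();
    for chunk in chunks {
        // Lane-wise ("vertical") addition: lane i only ever touches acc[i].
        for (a, v) in acc.iter_mut().zip(chunk) {
            *a += *v;
        }
    }
    // Horizontal reduce over the lanes, then add the leftover elements.
    let lanes_total: f32 = acc.iter().sum();
    let tail_total: f32 = remainder.iter().sum();
    lanes_total + tail_total
}

/// Naive sum: a single accumulator driven by the iterator.
pub fn naive_sum(values: &[f32]) -> f32 {
    values.iter().sum()
}
```

Because the lanes are independent, the compiler is free to vectorize the inner loop, whereas the single-accumulator version must preserve the strict left-to-right order of floating-point additions. The two functions can therefore return slightly different results on inputs where rounding matters.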
-## Bench results on my computer
+## Bench results on native
+
+Command:
+
+```
+RUSTFLAGS="-C target-cpu=native" cargo bench -- "2\^20"
+```
+
+### Sum of values
+
+```
+core_simd_sum 2^20 f32 [156.96 us 158.06 us 159.40 us]
+packed_simd_sum 2^20 f32 [184.17 us 184.47 us 184.85 us]
+nonsimd_sum 2^20 f32 [175.05 us 176.26 us 177.95 us]
+naive_sum 2^20 f32 [1.6636 ms 1.6700 ms 1.6778 ms]
+```
+
+### Sum of nullable values (`Vec<bool>`)
+
+```
+core_simd_sum null 2^20 f32 [2.3610 ms 2.3713 ms 2.3831 ms]
+packed_simd_sum null 2^20 f32 [1.5737 ms 1.5869 ms 1.6022 ms]
+nonsimd_sum null 2^20 f32 [1.8009 ms 1.8133 ms 1.8276 ms]
+naive_sum null 2^20 f32 [1.6418 ms 1.6520 ms 1.6660 ms]
+```
+
+### Sum of nullable values (`Bitmap`)
+
+```
+core_simd_sum bitmap 2^20 f32 [174.24 us 175.10 us 176.21 us]
+nonsimd_sum bitmap 2^20 f32 [541.78 us 545.16 us 549.09 us]
+naive_sum bitmap 2^20 f32 [1.6740 ms 1.6922 ms 1.7149 ms]
+```
+
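The two nullable layouts benchmarked above differ in how validity is stored: one `bool` per value versus one bit per value. The sketch below illustrates both under stated assumptions. The function names are hypothetical, and the bitmap is assumed to be packed LSB-first (bit `i` lives in byte `i / 8` at position `i % 8`), which may not match the `Bitmap` type used in the benchmarks.

```rust
/// Sum the values whose entry in a `bool` mask is `true`.
pub fn bool_mask_sum(values: &[f32], validity: &[bool]) -> f32 {
    values
        .iter()
        .zip(validity)
        // Select 0.0 for null slots instead of skipping them.
        .map(|(&v, &is_valid)| if is_valid { v } else { 0.0 })
        .sum()
}

/// Sum the values whose bit is set in a packed validity bitmap.
/// Assumes LSB-first bit order and at least one byte per 8 values.
pub fn bitmap_sum(values: &[f32], validity: &[u8]) -> f32 {
    assert!(validity.len() * 8 >= values.len(), "bitmap too short");
    let mut sum = 0.0f32;
    // Each byte of the bitmap covers up to eight consecutive values.
    for (chunk, &byte) in values.chunks(8).zip(validity) {
        for (i, &v) in chunk.iter().enumerate() {
            let is_valid = ((byte >> i) & 1) == 1;
            sum += if is_valid { v } else { 0.0 };
        }
    }
    sum
}
```

With the values `1.0` through `10.0`, the bitmap `[0b0101_0101, 0b0000_0010]` marks indices 0, 2, 4, 6 and 9 as valid, so both functions return `26.0` when the `bool` mask encodes the same positions. The bitmap form reads one byte of validity per eight values rather than eight bytes.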
+## Bench results on default
+
+Command:
+
+```
+cargo bench -- "2\^20"
+```
 
 ### Sum of values
 
@@ -45,10 +85,36 @@ nonsimd_sum bitmap 2^20 f32 [454.78 us 462.08 us 471.82 us]
 naive_sum bitmap 2^20 f32 [1.7633 ms 1.7736 ms 1.7855 ms]
 ```
 
-### Conclusions so far:
+### Conditions
 
-* for non-null sums, it is advantageous (by 10%) to use `packed` or `core`
-* for sums with nulls, it is advantageous (by 50%) to use arrays
+```
+$ lscpu
+Architecture: x86_64
+CPU op-mode(s): 32-bit, 64-bit
+Byte Order: Little Endian
+CPU(s): 4
+On-line CPU(s) list: 0-3
+Thread(s) per core: 2
+Core(s) per socket: 2
+Socket(s): 1
+NUMA node(s): 1
+Vendor ID: GenuineIntel
+CPU family: 6
+Model: 85
+Model name: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+Stepping: 4
+CPU MHz: 2095.077
+BogoMIPS: 4190.15
+Virtualization: VT-x
+Hypervisor vendor: Microsoft
+Virtualization type: full
+L1d cache: 32K
+L1i cache: 32K
+L2 cache: 1024K
+L3 cache: 36608K
+NUMA node0 CPU(s): 0-3
+Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
+```
 
 ## License
 