-
-
Notifications
You must be signed in to change notification settings - Fork 69
Description
Hello, I just want to make sure I'm not doing something wrong. I stumbled upon this project when I found that IPP's radix sort had buffer size calculations that were overflowing well before the limits of an INT32_MAX and all of the "Index" variants also only used 32 bit types. To be clear, a 32 bit index variant still would be beneficial and I see there are plans to eventually implement that. It's just I need a conditional code path for when there are too many values (I have a dataset generated that has length in the billions). I noticed that when the code makes use of the AVX512 instructions for argsort, it severely under performs against the radix sort (which as far as I can tell from the assembly, is completely scalar except for an iota like scan to generate the sequential index).
The benchmarks distributed with this project do show it winning by some margins against the scalar variants in here but not by the typical order of magnitudes that were touted when this project was first introduced. Is this maybe a result of the gather data sampling mitigations in microcode? I know this heavily uses compressed store and gather, and those functions were severely limited by this.
Here's my benchmark on my hardware:
[astylinski@fedoravm builddir]$ ./benchexe --benchmark_filter=/int32_t
2023-12-03T11:30:05-05:00
Running ./benchexe
Run on (6 X 3312 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 4096 KiB (x6)
L3 Unified 16384 KiB (x1)
Load Average: 1.02, 0.68, 0.32
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
simdargsort/random_128/int32_t 959 ns 956 ns 731536
simdargsort/random_256/int32_t 2045 ns 2040 ns 343034
simdargsort/random_512/int32_t 3866 ns 3857 ns 181511
simdargsort/random_1k/int32_t 9289 ns 9267 ns 75493
simdargsort/random_5k/int32_t 50900 ns 50768 ns 13789
simdargsort/random_100k/int32_t 1764351 ns 1757047 ns 400
simdargsort/random_1m/int32_t 29256670 ns 29111150 ns 24
simdargsort/random_10m/int32_t 959237861 ns 954095660 ns 1
simdargsort/smallrange_128/int32_t 722 ns 720 ns 975799
simdargsort/smallrange_256/int32_t 2049 ns 2044 ns 342384
simdargsort/smallrange_512/int32_t 3655 ns 3646 ns 191985
simdargsort/smallrange_1k/int32_t 9169 ns 9147 ns 76497
simdargsort/smallrange_5k/int32_t 21375 ns 21320 ns 32747
simdargsort/smallrange_100k/int32_t 467703 ns 465853 ns 1502
simdargsort/smallrange_1m/int32_t 9097255 ns 9047326 ns 75
simdargsort/smallrange_10m/int32_t 183596592 ns 182645622 ns 4
simdargsort/sorted_10k/int32_t 104574 ns 104314 ns 6668
simdargsort/constant_10k/int32_t 8648 ns 8628 ns 81134
simdargsort/reverse_10k/int32_t 105977 ns 105722 ns 6558
scalarargsort/random_128/int32_t 797 ns 795 ns 882026
scalarargsort/random_256/int32_t 1736 ns 1732 ns 403901
scalarargsort/random_512/int32_t 4000 ns 3990 ns 176210
scalarargsort/random_1k/int32_t 21879 ns 21822 ns 31937
scalarargsort/random_5k/int32_t 262249 ns 261537 ns 2676
scalarargsort/random_100k/int32_t 7551632 ns 7525002 ns 93
scalarargsort/random_1m/int32_t 103352847 ns 102895746 ns 7
scalarargsort/random_10m/int32_t 1974523438 ns 1964054017 ns 1
scalarargsort/smallrange_128/int32_t 667 ns 665 ns 1051997
scalarargsort/smallrange_256/int32_t 1498 ns 1494 ns 469065
scalarargsort/smallrange_512/int32_t 3152 ns 3144 ns 223444
scalarargsort/smallrange_1k/int32_t 13730 ns 13697 ns 50943
scalarargsort/smallrange_5k/int32_t 127606 ns 127285 ns 5499
scalarargsort/smallrange_100k/int32_t 3031465 ns 3019854 ns 232
scalarargsort/smallrange_1m/int32_t 38177852 ns 37995219 ns 18
scalarargsort/smallrange_10m/int32_t 695700305 ns 692244521 ns 1
scalarargsort/sorted_10k/int32_t 92423 ns 92202 ns 7578
scalarargsort/constant_10k/int32_t 83139 ns 82931 ns 8439
scalarargsort/reverse_10k/int32_t 74214 ns 74031 ns 9359