|
| 1 | +# x86-simd-sort |
| 2 | + |
| 3 | +C++ header-only library for SIMD-based sorting of 16-bit, 32-bit and 64-bit
| 4 | +data types on x86 processors. Source header files are available in the `src`
| 5 | +directory. We currently only have an AVX-512 based implementation of
| 6 | +quicksort. This repository also includes a test suite, which can be built and
| 7 | +run to check the sorting algorithms for correctness, and benchmarking code to
| 8 | +compare their performance with `std::sort`.
| 9 | + |
| 10 | +## Algorithm details |
| 11 | + |
| 12 | +The ideas and code are based on two research papers, [1] and [2]. At a
| 13 | +high level, the idea is to vectorize quicksort partitioning using AVX-512
| 14 | +compressstore instructions. Arrays with fewer than 128 elements are sorted
| 15 | +with a bitonic sorting network implemented on 512-bit registers. The precise
| 16 | +network definitions depend on the size of the dtype and are defined in
| 17 | +separate files: `avx512-16bit-qsort.hpp`, `avx512-32bit-qsort.hpp` and
| 18 | +`avx512-64bit-qsort.hpp`. Article [4] is a good resource on bitonic sorting
| 19 | +networks. The core implementations of the vectorized qsort functions
| 20 | +`avx512_qsort<T>(T*, int64_t)` are modified versions of the AVX2 quicksort
| 21 | +presented in paper [2] and the source code associated with that paper [3].
| 22 | + |
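The control flow described above can be pictured with a plain scalar sketch. This is not the library's code: the names `kSmall`, `bitonic_sort_small`, `partition_compress` and `hybrid_qsort` are hypothetical, the partition uses a temporary copy where the real routines use masked compress-store instructions, and the bitonic network runs one compare-exchange at a time where the real routines work on whole 512-bit registers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical cutoff mirroring the "fewer than 128 elements" rule above.
constexpr int64_t kSmall = 128;

// Scalar bitonic sorting network over exactly kSmall slots.
// Unused slots are padded with the largest representable value.
template <typename T>
void bitonic_sort_small(T *arr, int64_t n) {
    assert(n >= 0 && n <= kSmall);
    const T pad = std::numeric_limits<T>::has_infinity
                      ? std::numeric_limits<T>::infinity()
                      : std::numeric_limits<T>::max();
    T buf[kSmall];
    std::fill(buf, buf + kSmall, pad);
    std::copy(arr, arr + n, buf);
    for (int64_t k = 2; k <= kSmall; k *= 2) {        // size of bitonic runs
        for (int64_t j = k / 2; j > 0; j /= 2) {      // compare distance
            for (int64_t i = 0; i < kSmall; ++i) {
                const int64_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((buf[i] > buf[partner]) == ascending) {
                        std::swap(buf[i], buf[partner]);
                    }
                }
            }
        }
    }
    std::copy(buf, buf + n, arr);
}

// Scalar stand-in for a compress-store partition: values selected by the
// comparison are packed to the left, all others are packed to the right.
template <typename T>
int64_t partition_compress(T *arr, int64_t n, T pivot, bool include_equal) {
    const std::vector<T> tmp(arr, arr + n);
    int64_t lo = 0, hi = n;
    for (T v : tmp) {
        const bool left = include_equal ? (v <= pivot) : (v < pivot);
        if (left) arr[lo++] = v;
        else arr[--hi] = v;
    }
    return lo;  // number of elements packed to the left
}

template <typename T>
void hybrid_qsort(T *arr, int64_t n) {
    if (n <= 1) return;
    if (n < kSmall) { bitonic_sort_small(arr, n); return; }
    // Median of three as a simple pivot choice (an assumption of this sketch).
    const T a = arr[0], b = arr[n / 2], c = arr[n - 1];
    const T pivot = std::max(std::min(a, b), std::min(std::max(a, b), c));
    int64_t lo = partition_compress(arr, n, pivot, false);
    if (lo == 0) {
        // The pivot is the minimum: peel off every copy of it, sort the rest.
        lo = partition_compress(arr, n, pivot, true);
        hybrid_qsort(arr + lo, n - lo);
        return;
    }
    hybrid_qsort(arr, lo);
    hybrid_qsort(arr + lo, n - lo);
}
```

Calling `hybrid_qsort(v.data(), v.size())` on a `std::vector<int32_t>` sorts it in ascending order. The sketch does not handle NaN values.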
| 23 | +## Handling NaN in float and double arrays
| 24 | +
| 25 | +If you expect your array to contain NaNs, please be aware that these
| 26 | +routines **do not preserve your NaNs as you pass them**. The
| 27 | +`avx512_qsort<T>()` routine will put all your NaNs at the end of the sorted
| 28 | +array and replace them with `std::nan("1")`. Please take a look at the
| 29 | +`avx512_qsort<float>()` and `avx512_qsort<double>()` functions for details.
| 30 | + |
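To make the observable effect concrete, here is a scalar model of the behaviour described above. It is only an illustration built on the standard library: `sort_nans_last` is a hypothetical name, and the actual routines reach the same result in a different way.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Scalar model of the documented behaviour, not the library's implementation.
template <typename T>
void sort_nans_last(std::vector<T> &arr) {
    // Move every non-NaN value to the front; NaNs end up in [first_nan, end).
    auto first_nan = std::partition(arr.begin(), arr.end(),
                                    [](T v) { return !std::isnan(v); });
    // Sort only the non-NaN prefix.
    std::sort(arr.begin(), first_nan);
    // The original NaN payloads and sign bits are discarded.
    std::fill(first_nan, arr.end(), static_cast<T>(std::nan("1")));
}
```

If your code relies on NaN payloads or on the position of NaNs in the input, record that information before sorting.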
| 31 | +## Example: including and building this in C++ code
| 32 | + |
| 33 | +### Sample code `main.cpp` |
| 34 | + |
| 35 | +```cpp
| 36 | +#include "src/avx512-32bit-qsort.hpp"
| 37 | +#include <vector>
| 38 | +
| 39 | +int main() {
| 40 | +    const int ARRSIZE = 10;
| 41 | +    std::vector<float> arr;
| 42 | +
| 43 | +    /* Initialize elements in reverse order */
| 44 | +    for (int ii = 0; ii < ARRSIZE; ++ii) {
| 45 | +        arr.push_back(ARRSIZE - ii);
| 46 | +    }
| 47 | +
| 48 | +    /* Call the AVX-512 quicksort */
| 49 | +    avx512_qsort<float>(arr.data(), ARRSIZE);
| 50 | +    return 0;
| 51 | +}
| 52 | +```
| 53 | + |
| 54 | +### Build using gcc |
| 55 | + |
| 56 | +``` |
| 57 | +g++ main.cpp -mavx512f -mavx512dq -O3
| 58 | +``` |
| 59 | + |
| 60 | +This is a header-only library, and we do not provide any compile-time or
| 61 | +run-time checks; adding them is recommended when including this in your source
| 62 | +code. A slightly modified version of this source code has been contributed to
| 63 | +[NumPy](https://github.com/numpy/numpy) (see this [pull
| 64 | +request](https://github.com/numpy/numpy/pull/22315) for details). This NumPy
| 65 | +pull request is a good reference for how to include and build this library with
| 66 | +your source code.
| 67 | + |
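One possible shape for such checks is sketched below, assuming GCC or Clang on x86. `__AVX512F__` and `__AVX512DQ__` are macros those compilers predefine when the matching `-m` flags are in effect, and `__builtin_cpu_supports` is their built-in for querying the running CPU. The names `cpu_has_avx512_f_dq`, `SortPath` and `choose_sort_path` are hypothetical and not part of this library.

```cpp
// Run-time check: does the CPU we are running on report AVX-512F and
// AVX-512DQ? Returns false on compilers or targets without the builtin.
inline bool cpu_has_avx512_f_dq() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") &&
           __builtin_cpu_supports("avx512dq");
#else
    return false;
#endif
}

// Hypothetical dispatch result: which sorting path the caller should take.
enum class SortPath { Avx512, Scalar };

inline SortPath choose_sort_path() {
#if defined(__AVX512F__) && defined(__AVX512DQ__)
    // Compile-time check passed: the AVX-512 kernels were built into this
    // translation unit, so the run-time check decides whether to use them.
    if (cpu_has_avx512_f_dq()) return SortPath::Avx512;
#endif
    return SortPath::Scalar;
}
```

A caller would invoke `avx512_qsort<float>()` only when `choose_sort_path()` returns `SortPath::Avx512`, and fall back to `std::sort` otherwise.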
| 68 | +## Build requirements |
| 69 | + |
| 70 | +None, it is header files only. However, you will need `make` or `meson` to
| 71 | +build the unit tests and benchmarking suite. You will also need a relatively
| 72 | +modern compiler to build:
| 73 | + |
| 74 | +``` |
| 75 | +gcc >= 8.x |
| 76 | +``` |
| 77 | + |
| 78 | +### Build using Make |
| 79 | + |
| 80 | +The `make` command builds two executables:
| 81 | +- `testexe`: runs the tests in the `./tests` directory.
| 82 | +- `benchexe`: measures performance of these algorithms for various data types
| 83 | +  and compares them to `std::sort`.
| 84 | + |
| 85 | +You can use `make test` and `make bench` to build just `testexe` or
| 86 | +`benchexe`, respectively.
| 87 | + |
| 88 | +### Build using Meson |
| 89 | + |
| 90 | +You can also build `testexe` and `benchexe` using Meson/Ninja with the following |
| 91 | +command: |
| 92 | + |
| 93 | +``` |
| 94 | +meson setup builddir && cd builddir && ninja |
| 95 | +``` |
| 96 | + |
| 97 | +## Requirements and dependencies |
| 98 | + |
| 99 | +The sorting routines rely only on the C++ Standard Library and require a
| 100 | +relatively modern compiler to build (gcc 8.x and above). Since they use the
| 101 | +AVX-512 instruction set, they can only run on processors that have AVX-512.
| 102 | +Specifically, the 32-bit and 64-bit sorting requires the AVX-512F and AVX-512DQ
| 103 | +instruction sets. The 16-bit sorting requires the AVX-512F, AVX-512BW and
| 104 | +AVX-512 VBMI2 instruction sets. The test suite is written using GoogleTest.
| 105 | + |
| 106 | +## References |
| 107 | + |
| 108 | +* [1] Fast and Robust Vectorized In-Place Sorting of Primitive Types |
| 109 | + https://drops.dagstuhl.de/opus/volltexte/2021/13775/ |
| 110 | + |
| 111 | +* [2] A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel |
| 112 | +Skylake https://arxiv.org/pdf/1704.08579.pdf |
| 113 | + |
| 114 | +* [3] https://github.com/simd-sorting/fast-and-robust: SPDX-License-Identifier: MIT |
| 115 | + |
| 116 | +* [4] http://mitp-content-server.mit.edu:18180/books/content/sectbyfn?collid=books_pres_0&fn=Chapter%2027.pdf&id=8030 |
| 117 | + |