Commit 4ff5b36

Author: Raghuveer Devulapalli
Add AVX-512 based quicksort source, tests and benchmarks
Signed-off-by: Raghuveer Devulapalli <[email protected]>
1 parent d3f2b2a commit 4ff5b36

16 files changed: 2899 additions and 0 deletions

LICENSE.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
BSD 3-Clause License

Copyright (c) 2022, Intel. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
   contributors may be used to endorse or promote products derived from
   this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Makefile

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
CXX ?= g++
SRCDIR = ./src
TESTDIR = ./tests
BENCHDIR = ./benchmarks
UTILS = ./utils
SRCS = $(wildcard $(SRCDIR)/*.hpp)
TESTS = $(wildcard $(TESTDIR)/*.cpp)
TESTOBJS = $(patsubst $(TESTDIR)/%.cpp,$(TESTDIR)/%.o,$(TESTS))
TESTOBJS := $(filter-out $(TESTDIR)/main.o ,$(TESTOBJS))
GTEST_LIB = gtest
GTEST_INCLUDE = /usr/local/include
CXXFLAGS += -I$(SRCDIR) -I$(GTEST_INCLUDE) -I$(UTILS)
LD_FLAGS = -L /usr/local/lib -l $(GTEST_LIB) -l pthread

all : test bench

$(TESTDIR)/%.o : $(TESTDIR)/%.cpp $(SRCS)
	$(CXX) -march=icelake-client -O3 $(CXXFLAGS) -c $< -o $@

test: $(TESTDIR)/main.cpp $(TESTOBJS) $(SRCS)
	$(CXX) tests/main.cpp $(TESTOBJS) $(CXXFLAGS) $(LD_FLAGS) -o testexe

bench: $(BENCHDIR)/main.cpp $(SRCS)
	$(CXX) $(BENCHDIR)/main.cpp $(CXXFLAGS) -march=icelake-client -O3 -o benchexe

clean:
	rm -f $(TESTDIR)/*.o testexe benchexe

README.md

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
# x86-simd-sort

C++ header-only library for SIMD-based sorting of 16-bit, 32-bit and 64-bit
data types on x86 processors. The source header files are in the `src`
directory. Currently the only implementation is an AVX-512 based quicksort.
This repository also includes a test suite that can be built and run to check
the sorting algorithms for correctness, and benchmarking code that compares
their performance to `std::sort`.

## Algorithm details

The ideas and code are based on two research papers, [1] and [2]. At a high
level, the idea is to vectorize quicksort partitioning using AVX-512
compressstore instructions. Arrays with fewer than 128 elements are sorted
with a bitonic sorting network implemented on 512-bit registers. The precise
network definitions depend on the size of the dtype and are defined in separate
files: `avx512-16bit-qsort.hpp`, `avx512-32bit-qsort.hpp` and
`avx512-64bit-qsort.hpp`. Article [4] is a good resource on bitonic sorting
networks. The core implementations of the vectorized qsort functions
`avx512_qsort<T>(T*, int64_t)` are modified versions of the AVX2 quicksort
presented in paper [2] and the source code associated with that paper [3].
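
To make the partitioning idea concrete, here is a minimal scalar sketch of compress-based partitioning. It is not the code in this repository: `compress_store` and `partition_by_compress` are hypothetical names, the inner loops stand in for a vector compare and a masked compress-store (for example `_mm512_mask_compressstoreu_ps`), and a scratch buffer is used for clarity where a real implementation would work in place.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 16 lanes of 32 bits fit in one 512-bit register.
constexpr std::size_t kLanes = 16;

// Scalar stand-in for a masked compress-store: copies the lanes of `block`
// whose mask bit is set to `dst`, packed contiguously, and returns how many
// lanes were written.
std::size_t compress_store(int32_t *dst, const int32_t *block,
                           std::size_t lanes, uint32_t mask)
{
    std::size_t written = 0;
    for (std::size_t i = 0; i < lanes; ++i) {
        if (mask & (1u << i)) { dst[written++] = block[i]; }
    }
    return written;
}

// Partitions `data` around `pivot` and returns the index of the first element
// that is >= pivot. Assumption: a scratch buffer is acceptable for this sketch.
std::size_t partition_by_compress(std::vector<int32_t> &data, int32_t pivot)
{
    const std::size_t n = data.size();
    std::vector<int32_t> out(n);
    std::size_t left = 0;  // next free slot at the front of `out`
    std::size_t right = n; // start of the filled region at the back of `out`

    for (std::size_t base = 0; base < n; base += kLanes) {
        const std::size_t lanes = (n - base < kLanes) ? (n - base) : kLanes;

        // Comparison mask: bit i is set when lane i is smaller than the pivot.
        uint32_t lt_mask = 0;
        for (std::size_t i = 0; i < lanes; ++i) {
            if (data[base + i] < pivot) { lt_mask |= (1u << i); }
        }
        const uint32_t ge_mask = ~lt_mask & ((1u << lanes) - 1u);

        // Smaller elements are packed at the front, the rest at the back.
        const std::size_t n_lt = compress_store(
                out.data() + left, data.data() + base, lanes, lt_mask);
        left += n_lt;
        right -= lanes - n_lt;
        compress_store(out.data() + right, data.data() + base, lanes, ge_mask);
    }
    data.swap(out);
    return left;
}
```

Quicksort would then recurse on the two sides of the returned index, switching to the sorting network once a side is small enough.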

## Handling NAN in float and double arrays

If you expect your array to contain NANs, please be aware that these routines
**do not preserve your NANs as you pass them**. The `avx512_qsort<T>()` routine
moves all NANs to the end of the sorted array and replaces them with
`std::nan("1")`. See the `avx512_qsort<float>()` and `avx512_qsort<double>()`
functions for details.
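
The behaviour described above can be illustrated with a small sketch. This is an assumption about one way to obtain it, not a copy of the library code: `sort_with_nans_last` is a hypothetical name and `std::sort` stands in for `avx512_qsort<float>()`. NANs are counted and temporarily replaced with infinity, the array is sorted, and the last entries are overwritten with a fresh NAN, which is why the original NAN payloads are lost.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical wrapper, shown only to illustrate the NAN semantics.
void sort_with_nans_last(float *arr, int64_t n)
{
    // Replace every NAN with +infinity so that comparisons are well defined.
    int64_t nan_count = 0;
    for (int64_t i = 0; i < n; ++i) {
        if (std::isnan(arr[i])) {
            arr[i] = std::numeric_limits<float>::infinity();
            ++nan_count;
        }
    }

    // Stand-in for the vectorized sort.
    std::sort(arr, arr + n);

    // The replaced values sorted to the end; turn that many back into NAN.
    for (int64_t i = n - nan_count; i < n; ++i) {
        arr[i] = std::nanf("1");
    }
}
```

Genuine infinities in the input stay in place because only the last `nan_count` entries are rewritten.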

## Example to include and build this in a C++ code

### Sample code `main.cpp`

```cpp
#include "src/avx512-32bit-qsort.hpp"
#include <vector>

int main() {
    const int ARRSIZE = 10;
    std::vector<float> arr;

    /* Initialize elements in reverse order */
    for (int ii = 0; ii < ARRSIZE; ++ii) {
        arr.push_back(ARRSIZE - ii);
    }

    /* call avx512 quicksort */
    avx512_qsort<float>(arr.data(), ARRSIZE);
    return 0;
}
```

### Build using gcc

```
g++ main.cpp -mavx512f -mavx512dq -O3
```

This is a header-only library and it does not provide any compile-time or
run-time checks for AVX-512 support; adding such checks is recommended when you
include it in your source code. A slightly modified version of this source code
has been contributed to [NumPy](https://github.com/numpy/numpy) (see this [pull
request](https://github.com/numpy/numpy/pull/22315) for details). The NumPy
pull request is a good reference for how to include and build this library with
your source code.
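
As a rough illustration of a run-time check, the sketch below uses the GCC/Clang builtin `__builtin_cpu_supports` to decide whether a vectorized routine may be called. The names `cpu_has_avx512_f_dq`, `scalar_sort`, `sort_fn` and `sort_floats` are hypothetical and not part of this library; on other compilers or architectures the check conservatively reports no support.

```cpp
#include <algorithm>
#include <cstdint>

// Returns true when the running CPU reports both AVX-512F and AVX-512DQ.
// Assumption: GCC or Clang on x86; anything else falls back to `false`.
bool cpu_has_avx512_f_dq()
{
#if (defined(__GNUC__) || defined(__clang__)) \
        && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f")
            && __builtin_cpu_supports("avx512dq");
#else
    return false;
#endif
}

// Signature shared by the fallback and the (hypothetical) vectorized wrapper.
using sort_fn = void (*)(float *, int64_t);

// Portable fallback used when AVX-512 is not available.
void scalar_sort(float *arr, int64_t n)
{
    std::sort(arr, arr + n);
}

// Calls `vectorized` only when the CPU supports it, else the fallback.
void sort_floats(float *arr, int64_t n, sort_fn vectorized)
{
    if (vectorized != nullptr && cpu_has_avx512_f_dq()) {
        vectorized(arr, n);
    }
    else {
        scalar_sort(arr, n);
    }
}
```

In a real project, `vectorized` would point to a thin wrapper around `avx512_qsort<float>()` that lives in its own translation unit compiled with `-mavx512f -mavx512dq`, so that the rest of the program can still run on CPUs without AVX-512.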

## Build requirements

None; the library is header-only. However, you will need `make` or `meson` to
build the unit tests and benchmarking suite. You will also need a relatively
modern compiler:

```
gcc >= 8.x
```

### Build using Make

The `make` command builds two executables:
- `testexe`: runs the tests in the `./tests` directory.
- `benchexe`: measures the performance of these algorithms for various data
  types and compares them to `std::sort`.

You can use `make test` and `make bench` to build just `testexe` or
`benchexe`, respectively.

### Build using Meson

You can also build `testexe` and `benchexe` using Meson/Ninja with the following
command:

```
meson setup builddir && cd builddir && ninja
```

## Requirements and dependencies

The sorting routines rely only on the C++ Standard Library and require a
relatively modern compiler to build (gcc 8.x and above). Since they use the
AVX-512 instruction set, they can only run on processors that support AVX-512.
Specifically, the 32-bit and 64-bit routines require the AVX-512F and AVX-512DQ
instruction sets. The 16-bit routines require the AVX-512F, AVX-512BW and
AVX-512 VBMI2 instruction sets. The test suite is written using the Google Test
framework.
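
The instruction-set requirements above can also be checked at compile time. The sketch below relies on the predefined macros `__AVX512F__`, `__AVX512DQ__`, `__AVX512BW__` and `__AVX512VBMI2__`, which GCC and Clang define when the matching `-m` flags (or a suitable `-march`) are in effect. The `HAVE_AVX512_SORT_*` macros and `avx512_sort_build_summary` are hypothetical names used only for illustration.

```cpp
#include <string>

// 32-bit and 64-bit routines: AVX-512F and AVX-512DQ.
#if defined(__AVX512F__) && defined(__AVX512DQ__)
#define HAVE_AVX512_SORT_32_64 1
#else
#define HAVE_AVX512_SORT_32_64 0
#endif

// 16-bit routines: AVX-512F, AVX-512BW and AVX-512 VBMI2.
#if defined(__AVX512F__) && defined(__AVX512BW__) && defined(__AVX512VBMI2__)
#define HAVE_AVX512_SORT_16 1
#else
#define HAVE_AVX512_SORT_16 0
#endif

// Reports which routines the current compiler flags would allow.
std::string avx512_sort_build_summary()
{
    std::string summary = "32/64-bit: ";
    summary += HAVE_AVX512_SORT_32_64 ? "enabled" : "disabled";
    summary += ", 16-bit: ";
    summary += HAVE_AVX512_SORT_16 ? "enabled" : "disabled";
    return summary;
}
```

A project could wrap its `#include` of the sorting headers in these guards and fall back to `std::sort` when they evaluate to 0. Note that this only reflects the compiler flags; it says nothing about the CPU the binary eventually runs on.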

## References

* [1] Fast and Robust Vectorized In-Place Sorting of Primitive Types
    https://drops.dagstuhl.de/opus/volltexte/2021/13775/

* [2] A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel
    Skylake https://arxiv.org/pdf/1704.08579.pdf

* [3] https://github.com/simd-sorting/fast-and-robust: SPDX-License-Identifier: MIT

* [4] http://mitp-content-server.mit.edu:18180/books/content/sectbyfn?collid=books_pres_0&fn=Chapter%2027.pdf&id=8030

_clang-format

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
---
Language: Cpp
AccessModifierOffset: -4
AlignAfterOpenBracket: Align
AlignConsecutiveAssignments: false
AlignConsecutiveDeclarations: false
AlignEscapedNewlines: DontAlign
AlignOperands: false
AlignTrailingComments: false
AllowAllParametersOfDeclarationOnNextLine: true
AllowShortBlocksOnASingleLine: true
AllowShortCaseLabelsOnASingleLine: true
AllowShortFunctionsOnASingleLine: Empty
AllowShortIfStatementsOnASingleLine: true
AllowShortLoopsOnASingleLine: false
AlwaysBreakAfterDefinitionReturnType: None
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: true
AlwaysBreakTemplateDeclarations: Yes
BinPackArguments: false
BinPackParameters: false
BraceWrapping:
  AfterClass: false
  AfterControlStatement: false
  AfterEnum: false
  AfterFunction: true
  AfterNamespace: false
  AfterObjCDeclaration: false
  AfterStruct: false
  AfterUnion: false
  AfterExternBlock: false
  BeforeCatch: false
  BeforeElse: true
  IndentBraces: false
  SplitEmptyFunction: true
  SplitEmptyRecord: true
  SplitEmptyNamespace: true
BreakBeforeBinaryOperators: All
BreakBeforeBraces: Custom
BreakBeforeInheritanceComma: false
BreakInheritanceList: BeforeColon
BreakBeforeTernaryOperators: true
BreakConstructorInitializers: BeforeComma
BreakAfterJavaFieldAnnotations: false
BreakStringLiterals: true
ColumnLimit: 80
CommentPragmas: '^ IWYU pragma:'
CompactNamespaces: false
ConstructorInitializerAllOnOneLineOrOnePerLine: true
ConstructorInitializerIndentWidth: 4
ContinuationIndentWidth: 8
Cpp11BracedListStyle: true
DerivePointerAlignment: false
FixNamespaceComments: true
ForEachMacros:
IncludeBlocks: Preserve
IndentCaseLabels: true
# IndentPPDirectives: AfterHash
IndentPPDirectives: None
IndentWidth: 4
IndentWrappedFunctionNames: false
KeepEmptyLinesAtTheStartOfBlocks: true
MacroBlockBegin: ''
MacroBlockEnd: ''
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
PenaltyBreakAssignment: 2
PenaltyBreakBeforeFirstCallParameter: 19
PenaltyBreakComment: 300
PenaltyBreakFirstLessLess: 120
PenaltyBreakString: 1000
PenaltyBreakTemplateDeclaration: 10
PenaltyExcessCharacter: 1000000
PenaltyReturnTypeOnItsOwnLine: 60
PointerAlignment: Right
ReflowComments: false
SortIncludes: true
SortUsingDeclarations: true
SpaceAfterCStyleCast: false
SpaceAfterTemplateKeyword: true
SpaceBeforeAssignmentOperators: true
SpaceBeforeCpp11BracedList: true
SpaceBeforeCtorInitializerColon: true
SpaceBeforeInheritanceColon: true
SpaceBeforeParens: ControlStatements
SpaceBeforeRangeBasedForLoopColon: true
SpaceInEmptyParentheses: false
SpacesBeforeTrailingComments: 1
SpacesInAngles: false
SpacesInContainerLiterals: false
SpacesInCStyleCastParentheses: false
SpacesInParentheses: false
SpacesInSquareBrackets: false
Standard: Cpp11
TabWidth: 4
UseTab: Never
...
# vim:ft=conf et ts=2 sw=2

benchmarks/bench.hpp

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
/*******************************************
 * * Copyright (C) 2022 Intel Corporation
 * * SPDX-License-Identifier: BSD-3-Clause
 * *******************************************/

#include "avx512-16bit-qsort.hpp"
#include "avx512-32bit-qsort.hpp"
#include "avx512-64bit-qsort.hpp"
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <tuple>
#include <vector>

/* Read the time-stamp counter at the start of a measurement; cpuid is used to
 * serialize the instruction stream before rdtsc. */
static inline uint64_t cycles_start(void)
{
    unsigned a, d;
    __asm__ __volatile__(
            "cpuid\n\t"
            "rdtsc\n\t"
            : "=a"(a), // comma separated output operands
              "=d"(d)
            : // comma separated input operands
            : "rbx", "rcx" // list of clobbered registers
    );
    return (((uint64_t)d << 32) | a);
}

/* Read the time-stamp counter at the end of a measurement; rdtscp waits for
 * earlier instructions to finish and the trailing cpuid serializes again. */
static inline uint64_t cycles_end(void)
{
    unsigned high, low;
    __asm__ __volatile__(
            "rdtscp\n\t"
            "movl %%eax, %[low]\n\t"
            "movl %%edx, %[high]\n\t"
            "cpuid\n\t"
            : [high] "=r"(high), [low] "=r"(low)
            :
            : "rax", "rbx", "rcx", "rdx");
    return (((uint64_t)high << 32) | low);
}

/* Sorts a fresh copy of `arr` `iters` times with avx512_qsort and with
 * std::sort, and returns the average cycle count of the last `lastfew` runs of
 * each as (avx512_qsort, std::sort). Assumes 0 < lastfew <= iters. */
template <typename T>
std::tuple<uint64_t, uint64_t> bench_sort(const std::vector<T> arr,
                                          const uint64_t iters,
                                          const uint64_t lastfew)
{
    std::vector<T> arr_bckup = arr;
    std::vector<uint64_t> runtimes1, runtimes2;
    uint64_t start(0), end(0);
    for (uint64_t ii = 0; ii < iters; ++ii) {
        start = cycles_start();
        avx512_qsort<T>(arr_bckup.data(), arr_bckup.size());
        end = cycles_end();
        runtimes1.emplace_back(end - start);
        arr_bckup = arr; // restore the unsorted input for the next run
    }
    uint64_t avx_sort = std::accumulate(runtimes1.end() - lastfew,
                                        runtimes1.end(),
                                        (uint64_t)0)
            / lastfew;

    for (uint64_t ii = 0; ii < iters; ++ii) {
        start = cycles_start();
        std::sort(arr_bckup.begin(), arr_bckup.end());
        end = cycles_end();
        runtimes2.emplace_back(end - start);
        arr_bckup = arr; // restore the unsorted input for the next run
    }
    uint64_t std_sort = std::accumulate(runtimes2.end() - lastfew,
                                        runtimes2.end(),
                                        (uint64_t)0)
            / lastfew;
    return std::make_tuple(avx_sort, std_sort);
}
