numpy
diff --git a/‎README.md‎
Lines changed: 82 additions & 14 deletions b/‎README.md‎
Lines changed: 82 additions & 14 deletions
diff --git a/‎benchmarks/object_qsort-perf.jpg‎
152 KB b/‎benchmarks/object_qsort-perf.jpg‎
152 KB
@@ -1,32 +1,57 @@
 # x86-simd-sort
 
 C++ template library for high performance SIMD based sorting routines for
-16-bit, 32-bit and 64-bit data types. The sorting routines are accelerated
-using AVX-512/AVX2 when available. The library auto picks the best version
-depending on the processor it is run on. If you are looking for the AVX-512 or
-AVX2 specific implementations, please see
-[README](https://github.com/intel/x86-simd-sort/blob/main/src/README.md) file under
-`src/` directory. The following routines are currently supported:
+built-in integers and floats (16-bit, 32-bit and 64-bit data types) and custom
+defined C++ objects. The sorting routines are accelerated using AVX-512/AVX2
+when available. The library auto picks the best version depending on the
+processor it is run on. If you are looking for the AVX-512 or AVX2 specific
+implementations, please see
+[README](https://github.com/intel/x86-simd-sort/blob/main/src/README.md) file
+under `src/` directory. The following routines are currently supported:
+
+## Sort an array of custom defined class objects (uses `O(N)` space)
+``` cpp
+template <typename T, typename Func>
+void x86simdsort::object_qsort(T *arr, uint32_t arrsize, Func key_func)
+```
+`T` is any user defined struct or class and `arr` is a pointer to the first
+element in the array of objects of type `T`. `Func` is a lambda function that
+computes the `key` value for each object which is the metric used to sort the
+objects. `Func` needs to have the following signature:
 
+```cpp
+[] (T obj) -> type_t { type_t key; /* compute key for obj */ return key; }
+```
 
-### Sort routines on arrays
+Note that the return type of the key `type_t` needs to be one of the following
+: `[float, uint32_t, int32_t, double, uint64_t, int64_t]`. `object_qsort` has a
+space complexity of `O(N)`. Specifically, it requires `arrsize*(sizeof(type_t)`
+\+ `sizeof(uint32_t))` additional space. It allocates two `std::vectors`: one
+for storing all the keys and another storing the indexes of the object array.
+    For performance reasons, we support `object_qsort` only when the array size
+    is less than or equal to `UINT32_MAX`. An example usage of `object_qsort`
+    is provided in the [examples](#Sort-an-array-of-Points-using-object_qsort)
+    section.  Refer to [section](#Performance-of-object_qsort) to get a sense
+    of how fast this is relative to `std::sort`.
+
+## Sort an array of built-in integers and floats
 ```cpp
-x86simdsort::qsort(T* arr, size_t size, bool hasnan);
-x86simdsort::qselect(T* arr, size_t k, size_t size, bool hasnan);
-x86simdsort::partial_qsort(T* arr, size_t k, size_t size, bool hasnan);
+void x86simdsort::qsort(T* arr, size_t size, bool hasnan);
+void x86simdsort::qselect(T* arr, size_t k, size_t size, bool hasnan);
+void x86simdsort::partial_qsort(T* arr, size_t k, size_t size, bool hasnan);
 ```
 Supported datatypes: `T` $\in$ `[_Float16, uint16_t, int16_t, float, uint32_t,
 int32_t, double, uint64_t, int64_t]`
 
-### Key-value sort routines on pairs of arrays
+## Key-value sort routines on pairs of arrays
 ```cpp
-x86simdsort::keyvalue_qsort(T1* key, T2* val, size_t size, bool hasnan);
+void x86simdsort::keyvalue_qsort(T1* key, T2* val, size_t size, bool hasnan);
 ```
 Supported datatypes: `T1`, `T2` $\in$ `[float, uint32_t, int32_t, double,
 uint64_t, int64_t]` Note that keyvalue sort is not yet supported for 16-bit
 data types.
 
-### Arg sort routines on arrays
+## Arg sort routines on arrays
 ```cpp
 std::vector<size_t> arg = x86simdsort::argsort(T* arr, size_t size, bool hasnan);
 std::vector<size_t> arg = x86simdsort::argselect(T* arr, size_t k, size_t size, bool hasnan);
@@ -55,16 +80,38 @@ can configure meson to build them both by using `-Dbuild_tests=true` and
 
 ## Example usage
 
+#### Sort an array of floats
+
 ```cpp
 #include "x86simdsort.h"
 
 int main() {
     std::vector<float> arr{1000};
-    x86simdsort::qsort(arr, 1000, true);
+    x86simdsort::qsort(arr.data(), 1000, true);
     return 0;
 }
 ```
 
+#### Sort an array of Points using object_qsort
+```cpp
+#include "x86simdsort.h"
+#include <cmath>
+
+struct Point {
+    double x, y, z;
+};
+
+int main() {
+    std::vector<Point> arr{1000};
+    // Sort an array of Points by its x value:
+    x86simdsort::object_qsort(arr.data(), 1000, [](Point p) { return p.x; });
+    // Sort an array of Points by its distance from origin:
+    x86simdsort::object_qsort(arr.data(), 1000, [](Point p) {
+        return sqrt(p.x*p.x+p.y*p.y+p.z*p.z);
+        });
+    return 0;
+}
+```
 
 ## Details
 
@@ -95,6 +142,27 @@ argselect) will not use the SIMD based algorithms if they detect NAN's in the
 array. You can read details of all the implementations
 [here](https://github.com/intel/x86-simd-sort/blob/main/src/README.md).
 
+## Performance comparison on AVX-512: `object_qsort` v/s `std::sort`
+`object_qsort` relies on key-value sort which is currently accelerated only on
+AVX-512 (we plan to add AVX2 version soon). Benchmarks added in
+[bench-objsort.hpp](./benchmarks/bench-objsort.hpp) measures performance of
+`object_qsort` relative to `std::sort` when sorting an array of `struct Point
+{double x, y, z;}` and `struct Point {float x, y, x;}` for various metrics:
+
++ sort by coordinate `x`
++ sort by manhanttan distance (relative to origin): `abs(x) + abx(y) + abs(z)`
++ sort by Euclidean distance (relative to origin): `sqrt(x*x + y*y + z*z)`
++ sort by Chebyshev distance (relative to origin): `max(x, y, z)`
+
+The data was collected on a processor with AVX-512 and is shown in the plot
+below. For the simplest of cases where we want to sort an array of struct by
+one of its members, `object_qsort` can be up-to 5x faster for 32-bit data type
+and about 4x for 64-bit data type. It tends to do better when the metric to
+sort by gets more complicated. Sorting by Euclidean distance can be up-to 10x
+faster.
+
+![alt text](./benchmarks/object_qsort-perf.jpg?raw=true)
+
 ## Downstream projects using x86-simd-sort
 
 - NumPy uses this as a [submodule](https://github.com/numpy/numpy/pull/22315) to accelerate `np.sort, np.argsort, np.partition and np.argpartition`.