## Features ✨
* **Fast**: written in Rust with PyO3 bindings
  - leverages the optimized [argminmax](https://github.com/jvdd/argminmax) crate, which is SIMD-accelerated with runtime feature detection
  - scales linearly with the number of data points
  - multithreaded with Rayon (in Rust)
<details>
<summary><i>Why we do not use Python multiprocessing</i></summary>
Citing the <a href="https://pyo3.rs/v0.17.3/parallelism.html">PyO3 docs on parallelism</a>:<br>
<blockquote>
CPython has the infamous Global Interpreter Lock, which prevents several threads from executing Python bytecode in parallel. This makes threading in Python a bad fit for CPU-bound tasks and often forces developers to accept the overhead of multiprocessing.
</blockquote>
In Rust, which is a compiled language, there is no GIL, so CPU-bound tasks can be parallelized (with <a href="https://github.com/rayon-rs/rayon">Rayon</a>) with little to no overhead.
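
To illustrate why this kind of CPU-bound scan parallelizes well once there is no GIL, here is a minimal sketch of a chunked argmin/argmax: each thread scans one contiguous chunk, and only the per-chunk winners are merged at the end. Rayon is a third-party crate, so to keep the sketch dependency-free it swaps in the standard library's `std::thread::scope` (Rust 1.63+). The function names `argminmax_seq` and `argminmax_par` are hypothetical, and this is not the library's actual implementation.

```rust
use std::thread;

/// Sequential scan: index of the first minimum and of the first maximum.
/// Returns `None` for an empty slice.
fn argminmax_seq(data: &[i32]) -> Option<(usize, usize)> {
    let first = *data.first()?;
    let (mut imin, mut imax, mut vmin, mut vmax) = (0, 0, first, first);
    for (i, &v) in data.iter().enumerate().skip(1) {
        // Strict comparisons keep the earliest index on ties.
        if v < vmin {
            vmin = v;
            imin = i;
        }
        if v > vmax {
            vmax = v;
            imax = i;
        }
    }
    Some((imin, imax))
}

/// Chunked parallel scan: one scoped OS thread per contiguous chunk,
/// followed by a cheap sequential merge of the per-chunk results.
fn argminmax_par(data: &[i32], n_threads: usize) -> Option<(usize, usize)> {
    if data.is_empty() {
        return None;
    }
    let n = n_threads.max(1);
    let chunk_len = (data.len() + n - 1) / n; // ceiling division, >= 1 here

    // Scoped threads may borrow `data`: they are all joined before `scope` returns.
    let partials: Vec<(usize, usize)> = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .enumerate()
            .map(|(c, chunk)| {
                s.spawn(move || {
                    // Chunks are never empty, so the local scan always succeeds.
                    let (lmin, lmax) = argminmax_seq(chunk).unwrap();
                    // Convert chunk-local indices to global indices.
                    (c * chunk_len + lmin, c * chunk_len + lmax)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Partials are in chunk order, so strict comparisons keep the first occurrence.
    let (mut imin, mut imax) = partials[0];
    for &(cmin, cmax) in &partials[1..] {
        if data[cmin] < data[imin] {
            imin = cmin;
        }
        if data[cmax] > data[imax] {
            imax = cmax;
        }
    }
    Some((imin, imax))
}
```

For example, `argminmax_par(&[5, -3, 7, 7, -3, 0, 9, 1], 3)` returns `Some((1, 6))`, the same result as the sequential scan, because strict comparisons plus in-order merging keep the first occurrence of each extreme. With Rayon, the manual chunking and joining would typically be expressed through its parallel iterators instead.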
<summary><i>!! 🚀 <code>f16</code> <a href="https://github.com/jvdd/argminmax">argminmax</a> is 200-300x faster than NumPy</i></summary>
In contrast with all other data types above, <code>f16</code> is *not* hardware-supported by most modern CPUs (i.e., they lack native instructions for <code>f16</code>)!! <br>
🐌 Programming languages support this data type either by (i) upcasting to <code>f32</code> or (ii) using a software implementation. <br>
💡 As argminmax only needs comparisons (and thus no arithmetic operations), creating a <ins>symmetrical ordinal mapping from <code>f16</code> to <code>i16</code></ins> is sufficient. This mapping allows argminmax to use the hardware-supported scalar and SIMD <code>i16</code> instructions without any memory overhead 🎉 <br>
<i>More details are described in <a href="https://github.com/jvdd/argminmax/pull/1">argminmax PR #1</a>.</i>
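
As a minimal sketch of such a mapping, assuming the `f16` values arrive as raw `u16` bit patterns (stable Rust has no built-in `f16` type): non-negative values keep their bits, while negative values get their 15 magnitude bits flipped, so plain `i16` comparisons reproduce the float order. This is the standard sign-magnitude to two's-complement trick; the helper names are hypothetical, and the exact code in argminmax PR #1 may differ.

```rust
/// Map the raw bits of an IEEE 754 half-precision float to an `i16` whose
/// signed-integer order matches the float order (hypothetical helper).
fn f16_bits_to_ord_i16(bits: u16) -> i16 {
    let x = bits as i16; // reinterpret the bits, no numeric conversion
    // `x >> 15` is an arithmetic shift: 0 for non-negative x, -1 (all ones)
    // for negative x. Masking with 0x7FFF therefore flips only the 15
    // magnitude bits of negative values, so larger magnitudes map lower.
    ((x >> 15) & 0x7FFF) ^ x
}

/// First-occurrence argmin/argmax over raw f16 bit patterns, using only
/// `i16` comparisons (no float arithmetic). NaNs are not special-cased:
/// positive NaNs sort above +inf and negative NaNs below -inf.
fn argminmax_f16_bits(data: &[u16]) -> Option<(usize, usize)> {
    let first = f16_bits_to_ord_i16(*data.first()?);
    let (mut imin, mut imax, mut vmin, mut vmax) = (0, 0, first, first);
    for (i, &bits) in data.iter().enumerate().skip(1) {
        let v = f16_bits_to_ord_i16(bits);
        if v < vmin {
            vmin = v;
            imin = i;
        }
        if v > vmax {
            vmax = v;
            imax = i;
        }
    }
    Some((imin, imax))
}
```

For example, for the bit patterns of `[1.0, -2.0, 0.5, 2.0, -1.0]` (`0x3C00, 0xC000, 0x3800, 0x4000, 0xBC00`), `argminmax_f16_bits` returns `Some((1, 3))`. One side effect of this mapping: `-0.0` maps to `-1` and `+0.0` to `0`, so it orders the two zeros instead of treating them as equal.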