[libc++] Optimize std::min_element #100616

mrdaybird · 2024-07-25T17:55:30Z

This PR provides an alternative implementation of std::min_element that is faster in most cases and can take advantage of Clang's auto-vectorization.

The algorithm is inspired by Argmin in SIMD, which explains the rationale for this approach.

The algorithm performs a min-reduction over fixed-size blocks to identify the block that contains the global minimum, then runs std::find within that block to locate the element. Since Clang can already auto-vectorize the min-reduction and may soon also auto-vectorize std::find (#88385), the new algorithm is partially auto-vectorized today and may become fully auto-vectorized in the future.
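As a rough illustration of the block-wise idea described above, here is a minimal standalone sketch. It is not the code from this PR: `blocked_min_element` is a hypothetical helper restricted to contiguous `int` data and `operator<`, with no projection or custom comparator, and it reuses the block size of 256 from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the libc++ implementation: a blocked argmin over
// contiguous int data using operator< only.
const int* blocked_min_element(const int* first, const int* last) {
  if (first == last)
    return first;

  constexpr std::size_t block_size = 256; // same block size as the patch
  const int* best_begin = first;          // block known to contain the minimum
  const int* best_end   = first + 1;
  int min_val           = *first;

  const int* curr = first;
  while (curr != last) {
    const std::size_t remaining = static_cast<std::size_t>(last - curr);
    const int* block_end        = curr + std::min(block_size, remaining);

    // Pure min-reduction over one block: no position is tracked here, which
    // is what makes this loop a candidate for auto-vectorization.
    int block_min = min_val;
    for (const int* p = curr; p != block_end; ++p)
      block_min = std::min(block_min, *p);

    // The strict comparison keeps the earliest block on ties, so the result
    // is still the first minimum.
    if (block_min < min_val) {
      min_val    = block_min;
      best_begin = curr;
      best_end   = block_end;
    }
    curr = block_end;
  }

  // Second pass over a single block turns the value back into a position.
  return std::find(best_begin, best_end, min_val);
}
```

The trade-off in this sketch is that the winning block is scanned twice, once for the reduction and once for the search, in exchange for an inner loop that carries only a value and no iterator.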

You can see the benchmark results in this repo.

Summary of the benchmark results:

With vectorization enabled: up to 22x speedup.
With vectorization disabled: up to 3x speedup, but a regression for (unsigned) long long.

I would encourage the reviewers to perform the benchmark on other systems as well.

llvmbot · 2024-07-25T17:56:02Z

@llvm/pr-subscribers-libcxx

Author: vaibhav (mrdaybird)

Changes

This PR provides an alternative implementation of std::min_element that is faster in most cases and can take advantage of Clang's auto-vectorization.

The algorithm is inspired by Argmin in SIMD, which explains the rationale for this approach.

The algorithm performs a min-reduction over fixed-size blocks to identify the block that contains the global minimum, then runs std::find within that block to locate the element. Since Clang can already auto-vectorize the min-reduction and may soon also auto-vectorize std::find (#88385), the new algorithm is partially auto-vectorized today and may become fully auto-vectorized in the future.

You can see the benchmark results in this repo.

Summary of the benchmark results:

With vectorization enabled: up to 22x speedup.
With vectorization disabled: up to 3x speedup, but a regression for (unsigned) long long.

I would encourage the reviewers to perform the benchmark on other systems as well.

Full diff: https://github.com/llvm/llvm-project/pull/100616.diff

3 Files Affected:

(modified) libcxx/benchmarks/CMakeLists.txt (+1)
(added) libcxx/benchmarks/min_element.bench.cpp (+48)
(modified) libcxx/include/__algorithm/min_element.h (+50-5)

diff --git a/libcxx/benchmarks/CMakeLists.txt b/libcxx/benchmarks/CMakeLists.txt
index d96ccc1e49f66..952777d5346f7 100644
--- a/libcxx/benchmarks/CMakeLists.txt
+++ b/libcxx/benchmarks/CMakeLists.txt
@@ -121,6 +121,7 @@ set(BENCHMARK_TESTS
     algorithms/make_heap_then_sort_heap.bench.cpp
     algorithms/min.bench.cpp
     algorithms/minmax.bench.cpp
+    algorithms/min_element.bench.cpp
     algorithms/min_max_element.bench.cpp
     algorithms/mismatch.bench.cpp
     algorithms/pop_heap.bench.cpp
diff --git a/libcxx/benchmarks/min_element.bench.cpp b/libcxx/benchmarks/min_element.bench.cpp
new file mode 100644
index 0000000000000..ed769aeeea7d2
--- /dev/null
+++ b/libcxx/benchmarks/min_element.bench.cpp
@@ -0,0 +1,48 @@
+#include <vector>
+#include <algorithm>
+#include <limits>
+
+#include <benchmark/benchmark.h>
+#include <random>
+
+template<typename T>
+static void BM_stdmin_element_decreasing(benchmark::State &state){
+    std::vector<T> v(state.range(0));
+    T start = std::numeric_limits<T>::max();
+    T end = std::numeric_limits<T>::min();
+
+    for(size_t i = 0; i < v.size(); i++)
+        v[i] = ((start != end) ? start-- : end);
+
+    for(auto _ : state){
+        benchmark::DoNotOptimize(v);
+        benchmark::DoNotOptimize(std::min_element(v.begin(), v.end()));
+    }
+}
+
+BENCHMARK(BM_stdmin_element_decreasing<char>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<short>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<int>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<long long>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<unsigned char>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<unsigned short>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<unsigned int>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<unsigned long long>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+
+BENCHMARK_MAIN();
diff --git a/libcxx/include/__algorithm/min_element.h b/libcxx/include/__algorithm/min_element.h
index 65f3594d630ce..891a4afae3ea1 100644
--- a/libcxx/include/__algorithm/min_element.h
+++ b/libcxx/include/__algorithm/min_element.h
@@ -11,6 +11,7 @@
 
 #include <__algorithm/comp.h>
 #include <__algorithm/comp_ref_type.h>
+#include <__algorithm/iterator_operations.h>
 #include <__config>
 #include <__functional/identity.h>
 #include <__functional/invoke.h>
@@ -33,12 +34,56 @@ __min_element(_Iter __first, _Sent __last, _Comp __comp, _Proj& __proj) {
   if (__first == __last)
     return __first;
 
-  _Iter __i = __first;
-  while (++__i != __last)
-    if (std::__invoke(__comp, std::__invoke(__proj, *__i), std::__invoke(__proj, *__first)))
-      __first = __i;
+  const size_t __n = static_cast<size_t>(std::distance(__first, __last));
 
-  return __first;
+  if (__n <= 64) {
+    _Iter __i = __first;
+    while (++__i != __last)
+      if (std::__invoke(__comp, std::__invoke(__proj, *__i), std::__invoke(__proj, *__first)))
+        __first = __i;
+    return __first;
+  }
+
+  size_t __block_size = 256;
+
+  size_t __n_blocked  = __n - (__n % __block_size);
+  _Iter __block_start = __first, __block_end = __first;
+
+  typedef typename std::iterator_traits<_Iter>::value_type value_type;
+  value_type __min_val = std::invoke(__proj, *__first);
+
+  _Iter __curr = __first;
+  for (size_t __i = 0; __i < __n_blocked; __i += __block_size) {
+    _Iter __start          = __curr;
+    value_type __block_min = __min_val;
+    for (size_t j = 0; j < __block_size; j++) {
+      if (std::__invoke(__comp, std::__invoke(__proj, *__curr), __block_min)) {
+        __block_min = *__curr;
+      }
+      __curr++;
+    }
+    if (std::invoke(__comp, __block_min, __min_val)) {
+      __min_val     = __block_min;
+      __block_start = __start;
+      __block_end   = __curr;
+    }
+  }
+
+  value_type __epilogue_min = __min_val;
+  _Iter __epilogue_start    = __curr;
+  while (__curr != __last) {
+    if (std::__invoke(__comp, std::__invoke(__proj, *__curr), __epilogue_min)) {
+      __epilogue_min = *__curr;
+    }
+    __curr++;
+  }
+  if (std::__invoke(__comp, __epilogue_min, __min_val)) {
+    __min_val     = __epilogue_min;
+    __block_start = __epilogue_start;
+    __block_end   = __last;
+  }
+
+  return find(__block_start, __block_end, __min_val);
 }
 
 template <class _Comp, class _Iter, class _Sent>

mrdaybird · 2024-07-25T17:57:56Z

Pinging @philnik777, if you are interested in reviewing this, because #84663 was the inspiration for this.

github-actions · 2024-07-25T18:04:39Z

✅ With the latest revision this PR passed the C/C++ code formatter.

ldionne

I have some comments but this seems quite reasonable and the benefits are huge.

ldionne · 2025-03-27T15:19:27Z

libcxx/include/__algorithm/min_element.h

 template <class _Comp, class _Iter, class _Sent, class _Proj>
-inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 _Iter
-__min_element(_Iter __first, _Sent __last, _Comp& __comp, _Proj& __proj) {
+inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 _Iter __min_element(


Let's use __enable_if_t instead of using true_type and false_type overloads.

ldionne · 2025-03-27T15:21:51Z

libcxx/include/__algorithm/min_element.h

 _LIBCPP_BEGIN_NAMESPACE_STD

+template <class _Iter, class _Sent, class = void>
+struct __ConstTimeDistance : false_type {};


Let's rename this to __has_constant_time_distance and move it to libcxx/include/__iterator/iterator_traits.h. It can be useful elsewhere.
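To make the two review suggestions above concrete, here is a small sketch of what such a trait and an `enable_if`-based selection could look like. All names (`has_constant_time_distance`, `uses_blocked_path`) are hypothetical user-level stand-ins, and the specialization shown (same-type random-access iterator pairs) is an assumption, since the PR's actual specialization is not shown in this thread.

```cpp
#include <cassert>
#include <forward_list>
#include <iterator>
#include <type_traits>
#include <vector>

// Primary template: assume distance(first, last) is not O(1).
template <class Iter, class Sent, class = void>
struct has_constant_time_distance : std::false_type {};

// Assumed specialization: iterator and sentinel have the same type and the
// iterator category is (derived from) random access.
template <class Iter>
struct has_constant_time_distance<
    Iter,
    Iter,
    std::enable_if_t<std::is_base_of<std::random_access_iterator_tag,
                                     typename std::iterator_traits<Iter>::iterator_category>::value>>
    : std::true_type {};

// Overload selection through enable_if instead of an extra
// true_type/false_type parameter. These functions only report which path
// would be chosen; they do not search anything.
template <class Iter, class Sent, std::enable_if_t<has_constant_time_distance<Iter, Sent>::value, int> = 0>
constexpr bool uses_blocked_path(Iter, Sent) {
  return true;
}

template <class Iter, class Sent, std::enable_if_t<!has_constant_time_distance<Iter, Sent>::value, int> = 0>
constexpr bool uses_blocked_path(Iter, Sent) {
  return false;
}
```

With this shape the caller no longer constructs and passes a tag object; the constraint lives entirely in the template parameter list.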

ldionne · 2025-03-27T15:22:36Z

libcxx/include/__algorithm/min_element.h

+inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 _Iter __min_element(
+    _Iter __first,
+    _Sent __last,
+    _Comp __comp,


Suggested change

-    _Comp __comp,
+    _Comp& __comp,

You can take a reference here.

ldionne · 2025-03-27T15:22:46Z

libcxx/include/__algorithm/min_element.h

+inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 _Iter __min_element(
+    _Iter __first,
+    _Sent __last,
+    _Comp __comp,


Suggested change

-    _Comp __comp,
+    _Comp& __comp,

Here too.

ldionne · 2025-03-27T15:23:15Z

libcxx/include/__algorithm/min_element.h

+  return std::__min_element<_Comp>(
+      std::move(__first), std::move(__last), __comp, __proj, __ConstTimeDistance<_Iter, _Sent>());


And here just pass __comp directly as a reference.

Suggested change

-  return std::__min_element<_Comp>(
-      std::move(__first), std::move(__last), __comp, __proj, __ConstTimeDistance<_Iter, _Sent>());
+  return std::__min_element(
+      std::move(__first), std::move(__last), __comp, __proj, __ConstTimeDistance<_Iter, _Sent>());

ldionne · 2025-03-27T15:24:51Z

libcxx/include/__algorithm/min_element.h

Can we do the same for max_element? Can we implement max_element in terms of min_element by inverting the predicate?

@mrdaybird Hi, I would like to try applying a similar approach to std::max_element, if that's OK!
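For reference, the predicate-inversion idea can be sketched with the public algorithms as follows. `max_element_via_min` is a hypothetical name used for illustration only; it is not a libc++ function and says nothing about how the internal `__min_element` would be reused.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical wrapper: finds the first largest element by running
// std::min_element with the comparator's arguments swapped.
template <class Iter, class Comp = std::less<>>
Iter max_element_via_min(Iter first, Iter last, Comp comp = Comp()) {
  // Swap the arguments instead of negating the result: !comp(a, b) would
  // treat equal elements as ordered and is not a strict weak ordering.
  return std::min_element(first, last, [&comp](const auto& a, const auto& b) { return comp(b, a); });
}
```

Because std::min_element returns the first element that no other element compares less than, the swapped comparator yields the first element that no other element compares greater than, which matches the tie-breaking of std::max_element.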

ldionne · 2025-03-27T15:26:44Z

@mrdaybird Please let us know if you're still interested in finishing this up despite the very long turnaround time. If not, I or someone else will pick it up.

philnik777

Overall, I'd really like to compare this to an implementation which goes over N elements in parallel, saving the offset to a vector register and reducing only at the end. I think that would avoid a bunch of conditionals and we wouldn't have to search the range twice. Actually, is the current implementation standards-conforming? We're invoking projections and comparators a lot more than previously I think.
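One possible reading of the "N elements in parallel" alternative is sketched below as a scalar emulation. This is an assumption about the intended design, not code from the PR or from libc++: `lanewise_argmin` is a hypothetical name, the lane count of 8 is arbitrary, and the function is limited to `int` with `operator<` and `operator==`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar emulation of a lane-wise argmin: every lane keeps its
// own running minimum and the index where it was seen; lanes are combined
// only once at the end. Returns v.size() (that is, 0) for an empty vector.
std::size_t lanewise_argmin(const std::vector<int>& v) {
  constexpr std::size_t lanes = 8; // assumed lane count
  const std::size_t n         = v.size();
  if (n == 0)
    return 0;

  std::array<int, lanes> lane_min{};
  std::array<std::size_t, lanes> lane_idx{};

  // Seed each active lane with the first element it owns.
  const std::size_t active = n < lanes ? n : lanes;
  for (std::size_t l = 0; l < active; ++l) {
    lane_min[l] = v[l];
    lane_idx[l] = l;
  }

  // Element i belongs to lane i % lanes. The strict comparison keeps the
  // earliest index within each lane.
  for (std::size_t i = lanes; i < n; ++i) {
    const std::size_t l = i % lanes;
    if (v[i] < lane_min[l]) {
      lane_min[l] = v[i];
      lane_idx[l] = i;
    }
  }

  // Final reduction: smallest value wins, ties go to the smallest index so
  // that the overall result is the first minimum.
  std::size_t best = 0;
  for (std::size_t l = 1; l < active; ++l) {
    if (lane_min[l] < lane_min[best] || (lane_min[l] == lane_min[best] && lane_idx[l] < lane_idx[best]))
      best = l;
  }
  return lane_idx[best];
}
```

In this form the range is traversed once and no second search is needed, but the tie-breaking in the final reduction requires extra comparisons per lane, which touches the same question about comparator call counts raised in the comment above.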

philnik777 · 2025-03-27T15:59:49Z

libcxx/include/__algorithm/min_element.h

+  for (diff_type __i = 0; __i < __n_blocked; __i += __block_size) {
+    _Iter __start          = __curr;
+    value_type __block_min = __min_val;
+    for (diff_type __j = 0; __j < __block_size; __j++) {
+      if (std::__invoke(__comp, std::__invoke(__proj, *__curr), __block_min)) {
+        __block_min = *__curr;
+      }
+      __curr++;
+    }
+    if (std::__invoke(__comp, __block_min, __min_val)) {
+      __min_val     = __block_min;
+      __block_start = __start;
+      __block_end   = __curr;
+    }
+  }


This looks like something that results in incredible amounts of code bloat when the comparator isn't known or can't be inlined. e.g. how does the performance and code size compare with string?

Hi @mrdaybird, here's my thought: we check whether the value type is integral. If it is, we can simply use the built-in vector operations to compute the minimum of each block. Would this work?

philnik777 · 2025-03-27T16:06:07Z

libcxx/include/__algorithm/min_element.h

+  diff_type __n_blocked = __n - (__n % __block_size);
+  _Iter __block_start = __first, __block_end = __first;


Suggested change

-  diff_type __n_blocked = __n - (__n % __block_size);
-  _Iter __block_start = __first, __block_end = __first;
+  diff_type __n_blocked = __n - (__n % __block_size);
+  _Iter __block_start = __first;
+  _Iter __block_end = __first;

mrdaybird · 2025-03-28T09:38:27Z

@ldionne @philnik777 Definitely! I am still interested. Great to see that it's getting some traction now. I will look into this over the weekend. Thanks!

mrdaybird requested a review from a team as a code owner July 25, 2024 17:55

llvmbot added the libc++ label Jul 25, 2024

mordante assigned philnik777 Jul 25, 2024

mrdaybird force-pushed the optimize_min_element branch 2 times, most recently from a997ee5 to 01c71b4 Compare July 25, 2024 19:09

mrdaybird marked this pull request as draft July 25, 2024 23:06

mrdaybird force-pushed the optimize_min_element branch from 74bad7e to 6aae7b1 Compare July 26, 2024 01:35

mrdaybird added 2 commits October 15, 2024 11:26

[libc++] Optimize std::min_element

5978fae

[libc++] Fix formatting and move min_element.bench.cpp [libc++] Use __invoke instead of invoke [libc++] Fix build issues with find

[libc++] Fix build issues and replace std::find with loop

bbfccb3

[libcxx] revert block_size logic [libc++] Replace std::find with impl

hiraditya force-pushed the optimize_min_element branch from f8e6131 to bbfccb3 Compare October 15, 2024 18:31

ldionne added the performance label Oct 31, 2024

hiraditya marked this pull request as ready for review October 31, 2024 16:59

Merge branch 'main' into optimize_min_element

0191d3a

ldionne requested changes Mar 27, 2025

philnik777 reviewed Mar 27, 2025

wsehjk mentioned this pull request Mar 30, 2025

Vectorize minmax_element. #112397

Open
