Skip to content

Conversation

@mrdaybird
Copy link
Contributor

@mrdaybird mrdaybird commented Jul 25, 2024

This PR provides an alternate implementation of std::min_element which is faster in most cases and can leverage clang's auto-vectorization capabilities.

This algorithm is inspired by Argmin in SIMD. You can refer to it for the reason for choosing the algorithm.

The algorithm performs min-reduction and std::find in blocks, to get the global minimum element. Since clang can auto-vectorize min-reduction and soon may also auto-vectorize std::find(#88385), the new algorithm can partially be auto-vectorized, and in the future may get completely auto-vectorized.

You can see the benchmark results in this repo.

Gist of benchmark-

  • If vectorization is enabled, up to 22x speedup
  • If vectorization is not enabled, up to 3x speedup, though regression in case of (unsigned) long long.

I would encourage the reviewers to perform the benchmark on other systems as well.

@mrdaybird mrdaybird requested a review from a team as a code owner July 25, 2024 17:55
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Jul 25, 2024
@llvmbot
Copy link
Member

llvmbot commented Jul 25, 2024

@llvm/pr-subscribers-libcxx

Author: vaibhav (mrdaybird)

Changes

This PR provides an alternate implementation of std::min_element which is faster in most cases and can leverage clang's auto-vectorization capabilities.

The algorithm is inspired by Argmin in SIMD. You can refer to it for reason of choosing the algorithm.

The algorithm performs min-reduction and std::find in blocks, to get the global minimum element. Since, clang can auto-vectorize min-reduction and soon may also auto-vectorize std::find(#88385), the new algorithm can partially be auto-vectorized, and in the future may get completely auto-vectorized.

You can see the benchmark results in this repo
Gist of benchmark-

  • If vectorization is enabled, up-to 22x speedup
  • If vectorization is not enabled, up-to 3x speedup, though regression in case of (unsigned) long long.

I would encourage the reviewers to perform the benchmark on other systems as well.


Full diff: https://github.com/llvm/llvm-project/pull/100616.diff

3 Files Affected:

  • (modified) libcxx/benchmarks/CMakeLists.txt (+1)
  • (added) libcxx/benchmarks/min_element.bench.cpp (+48)
  • (modified) libcxx/include/__algorithm/min_element.h (+50-5)
diff --git a/libcxx/benchmarks/CMakeLists.txt b/libcxx/benchmarks/CMakeLists.txt
index d96ccc1e49f66..952777d5346f7 100644
--- a/libcxx/benchmarks/CMakeLists.txt
+++ b/libcxx/benchmarks/CMakeLists.txt
@@ -121,6 +121,7 @@ set(BENCHMARK_TESTS
     algorithms/make_heap_then_sort_heap.bench.cpp
     algorithms/min.bench.cpp
     algorithms/minmax.bench.cpp
+    algorithms/min_element.bench.cpp
     algorithms/min_max_element.bench.cpp
     algorithms/mismatch.bench.cpp
     algorithms/pop_heap.bench.cpp
diff --git a/libcxx/benchmarks/min_element.bench.cpp b/libcxx/benchmarks/min_element.bench.cpp
new file mode 100644
index 0000000000000..ed769aeeea7d2
--- /dev/null
+++ b/libcxx/benchmarks/min_element.bench.cpp
@@ -0,0 +1,48 @@
+#include <vector>
+#include <algorithm>
+#include <limits>
+
+#include <benchmark/benchmark.h>
+#include <random>
+
+template<typename T>
+static void BM_stdmin_element_decreasing(benchmark::State &state){
+    std::vector<T> v(state.range(0));
+    T start = std::numeric_limits<T>::max();
+    T end = std::numeric_limits<T>::min();
+
+    for(size_t i = 0; i < v.size(); i++)
+        v[i] = ((start != end) ? start-- : end);
+
+    for(auto _ : state){
+        benchmark::DoNotOptimize(v);
+        benchmark::DoNotOptimize(std::min_element(v.begin(), v.end()));
+    }
+}
+
+BENCHMARK(BM_stdmin_element_decreasing<char>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<short>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<int>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<long long>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<unsigned char>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<unsigned short>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<unsigned int>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+BENCHMARK(BM_stdmin_element_decreasing<unsigned long long>)
+    ->DenseRange(1, 8)->Range(32, 128)->Range(256, 4096)->DenseRange(5000, 10000, 1000)
+    ->Range(1<<14, 1<<16)->Arg(70000);
+
+BENCHMARK_MAIN();
diff --git a/libcxx/include/__algorithm/min_element.h b/libcxx/include/__algorithm/min_element.h
index 65f3594d630ce..891a4afae3ea1 100644
--- a/libcxx/include/__algorithm/min_element.h
+++ b/libcxx/include/__algorithm/min_element.h
@@ -11,6 +11,7 @@
 
 #include <__algorithm/comp.h>
 #include <__algorithm/comp_ref_type.h>
+#include <__algorithm/iterator_operations.h>
 #include <__config>
 #include <__functional/identity.h>
 #include <__functional/invoke.h>
@@ -33,12 +34,56 @@ __min_element(_Iter __first, _Sent __last, _Comp __comp, _Proj& __proj) {
   if (__first == __last)
     return __first;
 
-  _Iter __i = __first;
-  while (++__i != __last)
-    if (std::__invoke(__comp, std::__invoke(__proj, *__i), std::__invoke(__proj, *__first)))
-      __first = __i;
+  const size_t __n = static_cast<size_t>(std::distance(__first, __last));
 
-  return __first;
+  if (__n <= 64) {
+    _Iter __i = __first;
+    while (++__i != __last)
+      if (std::__invoke(__comp, std::__invoke(__proj, *__i), std::__invoke(__proj, *__first)))
+        __first = __i;
+    return __first;
+  }
+
+  size_t __block_size = 256;
+
+  size_t __n_blocked  = __n - (__n % __block_size);
+  _Iter __block_start = __first, __block_end = __first;
+
+  typedef typename std::iterator_traits<_Iter>::value_type value_type;
+  value_type __min_val = std::invoke(__proj, *__first);
+
+  _Iter __curr = __first;
+  for (size_t __i = 0; __i < __n_blocked; __i += __block_size) {
+    _Iter __start          = __curr;
+    value_type __block_min = __min_val;
+    for (size_t j = 0; j < __block_size; j++) {
+      if (std::__invoke(__comp, std::__invoke(__proj, *__curr), __block_min)) {
+        __block_min = *__curr;
+      }
+      __curr++;
+    }
+    if (std::invoke(__comp, __block_min, __min_val)) {
+      __min_val     = __block_min;
+      __block_start = __start;
+      __block_end   = __curr;
+    }
+  }
+
+  value_type __epilogue_min = __min_val;
+  _Iter __epilogue_start    = __curr;
+  while (__curr != __last) {
+    if (std::__invoke(__comp, std::__invoke(__proj, *__curr), __epilogue_min)) {
+      __epilogue_min = *__curr;
+    }
+    __curr++;
+  }
+  if (std::__invoke(__comp, __epilogue_min, __min_val)) {
+    __min_val     = __epilogue_min;
+    __block_start = __epilogue_start;
+    __block_end   = __last;
+  }
+
+  return find(__block_start, __block_end, __min_val);
 }
 
 template <class _Comp, class _Iter, class _Sent>

@mrdaybird
Copy link
Contributor Author

mrdaybird commented Jul 25, 2024

Pinging @philnik777, if you are interested in reviewing this, because #84663 was the inspiration for this.

@github-actions
Copy link

github-actions bot commented Jul 25, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@mrdaybird mrdaybird force-pushed the optimize_min_element branch 2 times, most recently from a997ee5 to 01c71b4 Compare July 25, 2024 19:09
@mrdaybird mrdaybird marked this pull request as draft July 25, 2024 23:06
@mrdaybird mrdaybird force-pushed the optimize_min_element branch from 74bad7e to 6aae7b1 Compare July 26, 2024 01:35
[libc++] Fix formatting and move min_element.bench.cpp

[libc++] Use __invoke instead of invoke

[libc++] Fix build issues with find
[libcxx] revert block_size logic

[libc++] Replace std::find with impl
@hiraditya hiraditya force-pushed the optimize_min_element branch from f8e6131 to bbfccb3 Compare October 15, 2024 18:31
@hiraditya hiraditya marked this pull request as ready for review October 31, 2024 16:59
Copy link
Member

@ldionne ldionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some comments but this seems quite reasonable and the benefits are huge.

Comment on lines 50 to +51
template <class _Comp, class _Iter, class _Sent, class _Proj>
inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 _Iter
__min_element(_Iter __first, _Sent __last, _Comp& __comp, _Proj& __proj) {
inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 _Iter __min_element(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use __enable_if_t instead of using true_type and false_type overloads.

_LIBCPP_BEGIN_NAMESPACE_STD

template <class _Iter, class _Sent, class = void>
struct __ConstTimeDistance : false_type {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this to __has_constant_time_distance and move it to libcxx/include/__iterator/iterator_traits.h. It can be useful elsewhere.

inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 _Iter __min_element(
_Iter __first,
_Sent __last,
_Comp __comp,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_Comp __comp,
_Comp& __comp,

You can take a reference here.

inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 _Iter __min_element(
_Iter __first,
_Sent __last,
_Comp __comp,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_Comp __comp,
_Comp& __comp,

Here too.

Comment on lines +136 to +137
return std::__min_element<_Comp>(
std::move(__first), std::move(__last), __comp, __proj, __ConstTimeDistance<_Iter, _Sent>());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here just pass __comp directly as a reference.

Suggested change
return std::__min_element<_Comp>(
std::move(__first), std::move(__last), __comp, __proj, __ConstTimeDistance<_Iter, _Sent>());
return std::__min_element(
std::move(__first), std::move(__last), __comp, __proj, __ConstTimeDistance<_Iter, _Sent>());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do the same for max_element? Can we implement max_element in terms of min_element by inverting the predicate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrdaybird Hi, I would like to give a try for std::max_element using similar approach if thats ok!

@ldionne
Copy link
Member

ldionne commented Mar 27, 2025

@mrdaybird Please let us know if you're still interested in finishing this up despite the very long turnaround time. If not, I or someone else will pick it up.

Copy link
Contributor

@philnik777 philnik777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I'd really like to compare this to an implementation which goes over N elements in parallel, saving the offset to a vector register and reducing only at the end. I think that would avoid a bunch of conditionals and we wouldn't have to search the range twice. Actually, is the current implementation standards-conforming? We're invoking projections and comparators a lot more than previously I think.

Comment on lines +97 to +111
for (diff_type __i = 0; __i < __n_blocked; __i += __block_size) {
_Iter __start = __curr;
value_type __block_min = __min_val;
for (diff_type __j = 0; __j < __block_size; __j++) {
if (std::__invoke(__comp, std::__invoke(__proj, *__curr), __block_min)) {
__block_min = *__curr;
}
__curr++;
}
if (std::__invoke(__comp, __block_min, __min_val)) {
__min_val = __block_min;
__block_start = __start;
__block_end = __curr;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like something that results in incredible amounts of code bloat when the comparator isn't known or can't be inlined. e.g. how does the performance and code size compare with string?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mrdaybird, here’s my thought: we check whether the value is integral. If it is, we can simply use the built-in vector operations to get the min in this block. Does this work?

Comment on lines +90 to +91
diff_type __n_blocked = __n - (__n % __block_size);
_Iter __block_start = __first, __block_end = __first;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
diff_type __n_blocked = __n - (__n % __block_size);
_Iter __block_start = __first, __block_end = __first;
diff_type __n_blocked = __n - (__n % __block_size);
_Iter __block_start = __first;
_Iter __block_end = __first;

@mrdaybird
Copy link
Contributor Author

@ldionne @philnik777 definitely! I am still interested. great to see that it's getting some traction now. I will look into this over the weekend! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants