@higher-performance
Contributor

@higher-performance higher-performance commented Mar 12, 2025

This improves performance for the copy-assignment operators of associative containers such as std::map.

Note that this optimization already exists in other places, e.g.:

for (const_iterator __e = cend(); __f != __l; ++__f)
  insert(__e.__i_, *__f);

As an example, this makes the following code 15% faster:

#include <stddef.h>
#include <chrono>
#include <iostream>
#include <set>

using Clock = std::chrono::high_resolution_clock;

int main() {
  std::set<size_t> a, b;
  for (size_t i = 0; i < 5000000; ++i) {
    b.insert(i);
  }
  Clock::time_point start = Clock::now();
  a = b;
  std::chrono::duration<double> diff = Clock::now() - start;
  std::cout << "Time taken: " << diff.count() << std::endl;
}

…the end of the tree every time

This improves performance for the copy-assignment operators of associative containers such as std::map
@higher-performance higher-performance added the "libc++" label (libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.) Mar 12, 2025
@higher-performance higher-performance requested a review from a team as a code owner March 12, 2025 20:59
@llvmbot
Member

llvmbot commented Mar 12, 2025

@llvm/pr-subscribers-libcxx

Author: None (higher-performance)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/131030.diff

1 Files Affected:

  • (modified) libcxx/include/__tree (+2-1)
diff --git a/libcxx/include/__tree b/libcxx/include/__tree
index c627641d5d86f..c07de5c95b0dc 100644
--- a/libcxx/include/__tree
+++ b/libcxx/include/__tree
@@ -1432,8 +1432,9 @@ void __tree<_Tp, _Compare, _Allocator>::__assign_multi(_InputIterator __first, _
       __cache.__advance();
     }
   }
+  const_iterator __e = end();
   for (; __first != __last; ++__first)
-    __insert_multi(_NodeTypes::__get_value(*__first));
+    __insert_multi(__e, _NodeTypes::__get_value(*__first));
 }
 
 template <class _Tp, class _Compare, class _Allocator>

@higher-performance
Contributor Author

@philnik777 would you mind reviewing this? It's a fairly trivial PR.

Contributor

@philnik777 philnik777 left a comment

Could you provide some benchmark numbers for this change?

@higher-performance
Contributor Author

@philnik777 sure, added a benchmark in the PR description.

@philnik777
Contributor

@higher-performance We have benchmarks for <set> under test/benchmarks/containers/associative/set.bench.cpp. You should be able to run them just like any other test (make sure to enable optimizations though). It would be great if you could post some before/after numbers from the relevant benchmarks there.

@philnik777
Contributor

You can also find documentation on it at https://libcxx.llvm.org/TestingLibcxx.html#benchmarks.

@higher-performance
Contributor Author

@philnik777 thanks, I wasn't aware. But I'm not using CMake, I have a different build system. Setting this up is quite a headache. The change is incredibly trivial and it's a clear benchmark win -- any chance you could either just accept it or run the benchmarks on your end?

@philnik777
Contributor

@higher-performance I can't see much of a difference. These are the numbers from our benchmarks:

--------------------------------------------------------------------------------------------------
Benchmark                                                                         old          new
--------------------------------------------------------------------------------------------------
std::set<int>::ctor(const&)/32                                                 487 ns       492 ns
std::set<int>::ctor(const&)/1024                                             23391 ns     23758 ns
std::set<int>::ctor(const&)/8192                                            239519 ns    246201 ns
std::set<int>::ctor(iterator, iterator) (unsorted sequence)/32                 520 ns       515 ns
std::set<int>::ctor(iterator, iterator) (unsorted sequence)/1024             55183 ns     55256 ns
std::set<int>::ctor(iterator, iterator) (unsorted sequence)/8192            749516 ns    761760 ns
std::set<int>::ctor(iterator, iterator) (sorted sequence)/32                   517 ns       524 ns
std::set<int>::ctor(iterator, iterator) (sorted sequence)/1024               24800 ns     24731 ns
std::set<int>::ctor(iterator, iterator) (sorted sequence)/8192              229736 ns    241529 ns
std::set<int>::operator=(const&)/32                                            540 ns       563 ns
std::set<int>::operator=(const&)/1024                                        28827 ns     28390 ns
std::set<int>::operator=(const&)/8192                                       290951 ns    287785 ns
std::set<int>::insert(iterator, iterator) (all new keys)/32                   1148 ns      1129 ns
std::set<int>::insert(iterator, iterator) (all new keys)/1024                30299 ns     29754 ns
std::set<int>::insert(iterator, iterator) (all new keys)/8192               334470 ns    331822 ns
std::set<int>::insert(iterator, iterator) (half new keys)/32                   865 ns       864 ns
std::set<int>::insert(iterator, iterator) (half new keys)/1024               32034 ns     32017 ns
std::set<int>::insert(iterator, iterator) (half new keys)/8192              396411 ns    389757 ns

@higher-performance
Contributor Author

@philnik777 thank you, that's strange. The assignment operator should be faster. Are you able to reproduce the timing on the code I posted, or do you not see any difference there either?

@philnik777
Contributor

With your code I see a bigger difference (about 8%). I'm not sure what the difference between the libc++ and your benchmark is. They look very similar.

@higher-performance
Contributor Author

If the /8192 is referring to the number of elements, the set might just be too small to exhibit the effect. When I lower it to that many elements I don't see a significant difference either. Try bumping it up to 1 << 23 elements, with a batch size of 1?

@higher-performance
Contributor Author

@philnik777 Just a friendly bump - did you have a chance to try with a larger size?

@higher-performance
Contributor Author

@philnik777 just checking in. Do you see any reason not to merge this? It's a strict efficiency gain for us -- would love to merge it if that is all right with you.

@philnik777
Contributor

@higher-performance My main concern is that this adds a hint for every range we're assigning from, so the hint could be completely bogus when you assign from something other than a container of the same type. Given the relatively small performance improvement and the extremely large numbers required to achieve that, I'm not sure this is a net win in the end.

@higher-performance
Contributor Author

higher-performance commented Apr 10, 2025

@philnik777 oh. That's because this is an algorithmic improvement, which you can't assess by relying on microbenchmarks. They miss the bigger picture and give you highly misleading results.

To elaborate: The number of comparisons is being cut down significantly, and comparisons can be expensive. Not merely because of the container size (side note: 2^24 is really not "extremely large" for a tree of size_t in 2025, it's just 16 million) but because elements are not always CPU-friendly size_ts, and the comparisons themselves can be arbitrarily expensive.

To illustrate, just try running this:

#include <stddef.h>

#include <chrono>
#include <iostream>
#include <set>
#include <string>
#include <utility>

using Clock = std::chrono::high_resolution_clock;

int main() {
  std::set<std::string> a, b;
  for (size_t i = 0; i < 1000000; ++i) {
    std::string s = std::to_string(i);
    s.insert(0, 3000, 'h');
    b.insert(std::move(s));
  }
  Clock::time_point start = Clock::now();
  a = b;
  std::chrono::duration<double> diff = Clock::now() - start;
  std::cout << "Time taken: " << diff.count() << std::endl;
}

On my machine it's 1.65 ms vs. 5.81 ms, which is a rather catastrophic > 3x performance hit.

@higher-performance
Contributor Author

@philnik777 where do you stand on this? Do you still see a downside to merging it?

@higher-performance
Contributor Author

@philnik777 just one last bump -- given we haven't had any further concerns the past couple of weeks, and that this optimization already exists everywhere else I've seen in the codebase, I plan to merge this soon (probably in the next couple of days). If you still see any concrete reasons not to, please let me know. So far as I can tell, it seems to be a clear win for some usages, and I believe we haven't seen any cases where it's detrimental.

Contributor

@philnik777 philnik777 left a comment

@philnik777 just one last bump -- given we haven't had any further concerns the past couple of weeks, and that this optimization already exists everywhere else I've seen in the codebase, I plan to merge this soon (probably in the next couple of days).

Please do not merge libc++ patches without approval from @llvm/reviewers-libcxx, as that is a general requirement for libc++ patches to be landed. I realize that this patch has been open for a while. Feel free to ping in the libc++ channel on Discord if patches get stale. There it's much more likely to be seen by someone.

If you still see any concrete reasons not to, please let me know. So far as I can tell, it seems to be a clear win for some usages, and I believe we haven't seen any cases where it's detrimental.

Given that you haven't provided any benchmarks for non-optimal cases that's not exactly surprising. How does the performance look if you e.g. assign a range that contains random elements or is sorted inversely?

Edit: I've raised this in my previous comment already, so there has been review feedback that hasn't been addressed.

@higher-performance
Contributor Author

Please do not merge libc++ patches without approval from @llvm/reviewers-libcxx, as that is a general requirement for libc++ patches to be landed.

Huh, I didn't realize. Is this documented anywhere? I was just going by the official LLVM Developer Policy, which appeared to specifically give permission to do this:

2. You are allowed to commit patches without approval which you think are obvious.

Given that you haven't provided any benchmarks for non-optimal cases that's not exactly surprising. How does the performance look if you e.g. assign a range that contains random elements or is sorted inversely?

That sounds impossible -- am I perhaps not understanding what you mean? This is in __assign_multi. So far as I'm aware, it is only used in __tree::operator=, like this:

template <class _Tp, class _Compare, class _Allocator>
__tree<_Tp, _Compare, _Allocator>& __tree<_Tp, _Compare, _Allocator>::operator=(const __tree& __t) {
  if (this != std::addressof(__t)) {
    value_comp() = __t.value_comp();
    __copy_assign_alloc(__t);
    __assign_multi(__t.begin(), __t.end());
  }
  return *this;
}

The comparators are identical and the target of the assignment is empty at the time of the assignment. How could the inputs have any ordering other than the correctly-sorted one?

Edit: I've raised this in my previous comment already, so there has been review feedback that hasn't been addressed.

I'm not seeing the feedback you're referring to unfortunately. The only previous feedback I see on my side is the objection that the performance improvement was small (not negative), and requires large numbers (both of which I responded to), as well as the one asking me to run your existing benchmarks -- which you yourself did, and said you didn't see much difference. I'm not seeing anything about benchmarking with random elements, inversion, or other feedback that I didn't address. Did you perhaps type more comments and forget to publish them?

Contributor

@philnik777 philnik777 left a comment

Please do not merge libc++ patches without approval from @llvm/reviewers-libcxx, as that is a general requirement for libc++ patches to be landed.

Huh, I didn't realize. Is this documented anywhere? I was just going by the official LLVM Developer Policy, which appeared to specifically give permission to do this:

2. You are allowed to commit patches without approval which you think are obvious.

I'm not sure it's documented anywhere. We should certainly do that if it's not.

Given that you haven't provided any benchmarks for non-optimal cases that's not exactly surprising. How does the performance look if you e.g. assign a range that contains random elements or is sorted inversely?

That sounds impossible -- am I perhaps not understanding what you mean? This is in __assign_multi. So far as I'm aware, it is only used in __tree::operator=, like this:

template <class _Tp, class _Compare, class _Allocator>
__tree<_Tp, _Compare, _Allocator>& __tree<_Tp, _Compare, _Allocator>::operator=(const __tree& __t) {
  if (this != std::addressof(__t)) {
    value_comp() = __t.value_comp();
    __copy_assign_alloc(__t);
    __assign_multi(__t.begin(), __t.end());
  }
  return *this;
}

The comparators are identical and the target of the assignment is empty at the time of the assignment. How could the inputs have any ordering other than the correctly-sorted one?

Never mind. I was absolutely sure that it's used in map(InputIterator, InputIterator), but it's not. That does indeed make it impossible.

Edit: I've raised this in my previous comment already, so there has been review feedback that hasn't been addressed.

I'm not seeing the feedback you're referring to unfortunately. The only previous feedback I see on my side is the objection that the performance improvement was small (not negative), and requires large numbers (both of which I responded to), as well as the one asking me to run your existing benchmarks -- which you yourself did, and said you didn't see much difference. I'm not seeing anything about benchmarking with random elements, inversion, or other feedback that I didn't address. Did you perhaps type more comments and forget to publish them?

I'm talking about

so the hint could be completely bogus when you assign from something other than a container of the same type

that might not have been obvious though.

@philnik777
Contributor

So, we do have https://libcxx.llvm.org/Contributing.html#the-review-process, but that's... a bit outdated, let's say.

@higher-performance
Contributor Author

Ah gotcha! And yeah, it would definitely be great to update any relevant policies. Thanks, I appreciate it!

@higher-performance higher-performance merged commit 5f91649 into llvm:main Apr 24, 2025
83 checks passed
@higher-performance higher-performance deleted the set_insert branch April 24, 2025 18:30
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…the end of the tree every time (llvm#131030)

This improves performance for the copy-assignment operators of associative containers such as `std::map`.

This optimization already exists in other places in the codebase, and seems to have been missed here.