Optimize std::__tree::__assign_multi to insert the provided range at the end of the tree every time #131030

higher-performance · 2025-03-12T20:59:17Z

This improves performance for the copy-assignment operators of associative containers such as std::map.

Note that this optimization already exists in other places, e.g.:

Lines 1217 to 1218 in 15e6bb6

    
    for (const_iterator __e = cend(); __f != __l; ++__f)
      insert(__e.__i_, *__f);

As an example, this makes the following code 15% faster:

#include <stddef.h>
#include <chrono>
#include <iostream>
#include <set>

using Clock = std::chrono::high_resolution_clock;

int main() {
  std::set<size_t> a, b;
  for (size_t i = 0; i < 5000000; ++i) {
    b.insert(i);
  }
  Clock::time_point start = Clock::now();
  a = b;
  std::chrono::duration<double> diff = Clock::now() - start;
  std::cout << "Time taken: " << diff.count() << std::endl;
}
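
For readers unfamiliar with the pattern, here is a minimal sketch of end-hinted insertion written against the public std::multiset API rather than libc++'s internal __tree. The helper name append_sorted is hypothetical and not part of the patch; it only illustrates why passing end() as the hint helps when the input is already in order.

```cpp
#include <set>

// Hypothetical helper (not from the patch): copies an already-sorted range
// into dst, passing end() as the insertion hint on every iteration.
template <class InputIt>
void append_sorted(std::multiset<int>& dst, InputIt first, InputIt last) {
  for (; first != last; ++first) {
    // When *first is not less than any element already in dst, the hint is
    // exact and the container can skip the usual root-to-leaf search.
    dst.insert(dst.end(), *first);
  }
}
```

Calling append_sorted(dst, src.begin(), src.end()) on an empty dst produces a copy of src, which is the situation the copy-assignment operator is in.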


llvmbot · 2025-03-12T20:59:40Z

@llvm/pr-subscribers-libcxx

Author: None (higher-performance)

Changes

This improves performance for the copy-assignment operators of associative containers such as std::map.

Note that this optimization already exists in other places, e.g.:

llvm-project/libcxx/include/map

Lines 1217 to 1218 in 15e6bb6

    
    for (const_iterator __e = cend(); __f != __l; ++__f)
      insert(__e.__i_, *__f);

Full diff: https://github.com/llvm/llvm-project/pull/131030.diff

1 Files Affected:

(modified) libcxx/include/__tree (+2-1)

diff --git a/libcxx/include/__tree b/libcxx/include/__tree
index c627641d5d86f..c07de5c95b0dc 100644
--- a/libcxx/include/__tree
+++ b/libcxx/include/__tree
@@ -1432,8 +1432,9 @@ void __tree<_Tp, _Compare, _Allocator>::__assign_multi(_InputIterator __first, _
       __cache.__advance();
     }
   }
+  const_iterator __e = end();
   for (; __first != __last; ++__first)
-    __insert_multi(_NodeTypes::__get_value(*__first));
+    __insert_multi(__e, _NodeTypes::__get_value(*__first));
 }
 
 template <class _Tp, class _Compare, class _Allocator>

higher-performance · 2025-03-21T16:50:35Z

@philnik777 would you mind reviewing this? It's a fairly trivial PR.

philnik777

Could you provide some benchmark numbers for this change?

higher-performance · 2025-03-25T16:12:21Z

@philnik777 sure, added a benchmark in the PR description.

philnik777 · 2025-03-25T16:38:53Z

@higher-performance We have benchmarks for <set> under test/benchmarks/containers/associative/set.bench.cpp. You should be able to run them just like any other test (make sure to enable optimizations though). It would be great if you could post some before/after numbers from the relevant benchmarks there.

philnik777 · 2025-03-25T16:39:51Z

You can also find documentation on it at https://libcxx.llvm.org/TestingLibcxx.html#benchmarks.

higher-performance · 2025-03-25T16:49:22Z

@philnik777 thanks, I wasn't aware. But I'm not using CMake, I have a different build system. Setting this up is quite a headache. The change is incredibly trivial and it's a clear benchmark win -- any chance you could either just accept it or run the benchmarks on your end?

philnik777 · 2025-03-27T11:51:36Z

@higher-performance I can't see much of a difference. These are the numbers from our benchmarks:

--------------------------------------------------------------------------------------------------
Benchmark                                                                         old          new
--------------------------------------------------------------------------------------------------
std::set<int>::ctor(const&)/32                                                 487 ns       492 ns
std::set<int>::ctor(const&)/1024                                             23391 ns     23758 ns
std::set<int>::ctor(const&)/8192                                            239519 ns    246201 ns
std::set<int>::ctor(iterator, iterator) (unsorted sequence)/32                 520 ns       515 ns
std::set<int>::ctor(iterator, iterator) (unsorted sequence)/1024             55183 ns     55256 ns
std::set<int>::ctor(iterator, iterator) (unsorted sequence)/8192            749516 ns    761760 ns
std::set<int>::ctor(iterator, iterator) (sorted sequence)/32                   517 ns       524 ns
std::set<int>::ctor(iterator, iterator) (sorted sequence)/1024               24800 ns     24731 ns
std::set<int>::ctor(iterator, iterator) (sorted sequence)/8192              229736 ns    241529 ns
std::set<int>::operator=(const&)/32                                            540 ns       563 ns
std::set<int>::operator=(const&)/1024                                        28827 ns     28390 ns
std::set<int>::operator=(const&)/8192                                       290951 ns    287785 ns
std::set<int>::insert(iterator, iterator) (all new keys)/32                   1148 ns      1129 ns
std::set<int>::insert(iterator, iterator) (all new keys)/1024                30299 ns     29754 ns
std::set<int>::insert(iterator, iterator) (all new keys)/8192               334470 ns    331822 ns
std::set<int>::insert(iterator, iterator) (half new keys)/32                   865 ns       864 ns
std::set<int>::insert(iterator, iterator) (half new keys)/1024               32034 ns     32017 ns
std::set<int>::insert(iterator, iterator) (half new keys)/8192              396411 ns    389757 ns

higher-performance · 2025-03-27T14:00:42Z

@philnik777 thank you, that's strange. The assignment operator should be faster. Are you able to reproduce the timing on the code I posted, or do you not see any difference there either?

philnik777 · 2025-03-27T14:09:56Z

With your code I see a bigger difference (about 8%). I'm not sure what the difference between the libc++ and your benchmark is. They look very similar.

higher-performance · 2025-03-27T16:04:03Z

If the /8192 is referring to the number of elements, the set might just be too small to exhibit the effect. When I lower it to that many elements I don't see a significant difference either. Try bumping it up to 1 << 23 elements, with a batch size of 1?

higher-performance · 2025-04-01T20:03:12Z

@philnik777 Just a friendly bump - did you have a chance to try with a larger size?

higher-performance · 2025-04-07T15:11:03Z

@philnik777 just checking in. Do you see any reason not to merge this? It's a strict efficiency gain for us -- would love to merge it if that is all right with you.

philnik777 · 2025-04-10T11:25:22Z

@higher-performance My main concern is that this adds a hint for every range we're assigning from, so the hint could be completely bogus when you assign from something other than a container of the same type. Given the relatively small performance improvement and the extremely large numbers required to achieve that, I'm not sure this is a net win in the end.
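
One point worth separating out here: a hint affects cost, not correctness. The sketch below uses the public std::multiset API with a hypothetical function name (it is not libc++ internals). It inserts values with end() as the hint whether or not the hint is accurate; given a reverse-sorted input, the container still ends up ordered, it just cannot take the fast path.

```cpp
#include <set>
#include <vector>

// Illustrative only: inserts every value with end() as the hint, regardless
// of whether the hint is accurate for that value.
std::multiset<int> insert_all_with_end_hint(const std::vector<int>& values) {
  std::multiset<int> out;
  for (int v : values) {
    // A wrong hint may cost extra search time but never breaks the ordering.
    out.insert(out.end(), v);
  }
  return out;
}
```

So the open question for a bogus hint is only how much time is lost, which is what a benchmark on unsorted input would measure.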

higher-performance · 2025-04-10T16:42:52Z

@philnik777 oh. That's because this is an algorithmic improvement, which you can't assess by relying on microbenchmarks. They miss the bigger picture and give you highly misleading results.

To elaborate: The number of comparisons is being cut down significantly, and comparisons can be expensive. Not merely because of the container size (side note: 2^24 is really not "extremely large" for a tree of size_t in 2025, it's just 16 million) but because elements are not always CPU-friendly size_ts, and the comparisons themselves can be arbitrarily expensive.

To illustrate, just try running this:

#include <stddef.h>

#include <chrono>
#include <iostream>
#include <set>
#include <string>
#include <utility>

using Clock = std::chrono::high_resolution_clock;

int main() {
  std::set<std::string> a, b;
  for (size_t i = 0; i < 1000000; ++i) {
    std::string s = std::to_string(i);
    s.insert(0, 3000, 'h');
    b.insert(std::move(s));
  }
  Clock::time_point start = Clock::now();
  a = b;
  std::chrono::duration<double> diff = Clock::now() - start;
  std::cout << "Time taken: " << diff.count() << std::endl;
}

On my machine it's 1.65 ms vs. 5.81 ms, which is a rather catastrophic > 3x performance hit.
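
The comparison-count argument can also be checked without timing anything. The following is a rough sketch, assuming a hypothetical counting comparator (CountingLess) and a hypothetical helper (count_comparisons), neither of which appears in the patch. It inserts an ascending sequence into a std::multiset with and without the end() hint and reports how many times the comparator ran. Exact counts depend on the standard library implementation, so none are quoted here.

```cpp
#include <cstddef>
#include <set>

// Comparator that counts how often it is invoked (illustrative, not from the patch).
struct CountingLess {
  std::size_t* calls;
  bool operator()(int a, int b) const {
    ++*calls;
    return a < b;
  }
};

// Inserts 0..n-1 in ascending order and returns the number of comparisons.
// With use_hint set, end() is passed as the hint on every insertion.
std::size_t count_comparisons(int n, bool use_hint) {
  std::size_t calls = 0;
  std::multiset<int, CountingLess> s(CountingLess{&calls});
  for (int i = 0; i < n; ++i) {
    if (use_hint)
      s.insert(s.end(), i);
    else
      s.insert(i);
  }
  return calls;
}
```

Comparing count_comparisons(n, true) against count_comparisons(n, false) for growing n should show the hinted count growing about linearly and the unhinted count growing faster, which matters most when each comparison is expensive, as with the long strings above.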

higher-performance · 2025-04-21T16:19:07Z

@philnik777 where do you stand on this? Do you still see a downside to merging it?

higher-performance · 2025-04-23T15:42:05Z

@philnik777 just one last bump -- given we haven't had any further concerns the past couple of weeks, and that this optimization already exists everywhere else I've seen in the codebase, I plan to merge this soon (probably in the next couple of days). If you still see any concrete reasons not to, please let me know. So far as I can tell, it seems to be a clear win for some usages, and I believe we haven't seen any cases where it's detrimental.

philnik777

> @philnik777 just one last bump -- given we haven't had any further concerns the past couple of weeks, and that this optimization already exists everywhere else I've seen in the codebase, I plan to merge this soon (probably in the next couple of days).

Please do not merge libc++ patches without approval from @llvm/reviewers-libcxx, as that is a general requirement for libc++ patches to be landed. I realize that this patch has been open for a while. Feel free to ping in the libc++ channel on Discord if patches get stale. There it's much more likely to be seen by someone.

> If you still see any concrete reasons not to, please let me know. So far as I can tell, it seems to be a clear win for some usages, and I believe we haven't seen any cases where it's detrimental.

Given that you haven't provided any benchmarks for non-optimal cases that's not exactly surprising. How does the performance look if you e.g. assign a range that contains random elements or is sorted inversely?

Edit: I've raised this in my previous comment already, so there has been review feedback that hasn't been addressed.

higher-performance · 2025-04-24T16:22:18Z

> Please do not merge libc++ patches without approval from @llvm/reviewers-libcxx, as that is a general requirement for libc++ patches to be landed.

Huh, I didn't realize. Is this documented anywhere? I was just going by the official LLVM Developer Policy, which appeared to specifically give permission to do this:

> 2. You are allowed to commit patches without approval which you think are obvious.

> Given that you haven't provided any benchmarks for non-optimal cases that's not exactly surprising. How does the performance look if you e.g. assign a range that contains random elements or is sorted inversely?

That sounds impossible -- am I perhaps not understanding what you mean? This is in __assign_multi. So far as I'm aware, it is only used in __tree::operator=, like this:

template <class _Tp, class _Compare, class _Allocator>
__tree<_Tp, _Compare, _Allocator>& __tree<_Tp, _Compare, _Allocator>::operator=(const __tree& __t) {
  if (this != std::addressof(__t)) {
    value_comp() = __t.value_comp();
    __copy_assign_alloc(__t);
    __assign_multi(__t.begin(), __t.end());
  }
  return *this;
}

The comparators are identical and the target of the assignment is empty at the time of the assignment. How could the inputs have any ordering other than the correctly-sorted one?

> Edit: I've raised this in my previous comment already, so there has been review feedback that hasn't been addressed.

I'm not seeing the feedback you're referring to unfortunately. The only previous feedback I see on my side is the objection that the performance improvement was small (not negative), and requires large numbers (both of which I responded to), as well as the one asking me to run your existing benchmarks -- which you yourself did, and said you didn't see much difference. I'm not seeing anything about benchmarking with random elements, inversion, or other feedback that I didn't address. Did you perhaps type more comments and forget to publish them?

philnik777

> > Please do not merge libc++ patches without approval from @llvm/reviewers-libcxx, as that is a general requirement for libc++ patches to be landed.

> Huh, I didn't realize. Is this documented anywhere? I was just going by the official LLVM Developer Policy, which appeared to specifically give permission to do this:

> > 2. You are allowed to commit patches without approval which you think are obvious.

I'm not sure it's documented anywhere. We should certainly do that if it's not.

> > Given that you haven't provided any benchmarks for non-optimal cases that's not exactly surprising. How does the performance look if you e.g. assign a range that contains random elements or is sorted inversely?

> That sounds impossible -- am I perhaps not understanding what you mean? This is in __assign_multi. So far as I'm aware, it is only used in __tree::operator=, like this:

> template <class _Tp, class _Compare, class _Allocator>
> __tree<_Tp, _Compare, _Allocator>& __tree<_Tp, _Compare, _Allocator>::operator=(const __tree& __t) {
>   if (this != std::addressof(__t)) {
>     value_comp() = __t.value_comp();
>     __copy_assign_alloc(__t);
>     __assign_multi(__t.begin(), __t.end());
>   }
>   return *this;
> }

> The comparators are identical and the target of the assignment is empty at the time of the assignment. How could the inputs have any ordering other than the correctly-sorted one?

Never mind. I was absolutely sure that it's used in map(InputIterator, InputIterator), but it's not. That does indeed make it impossible.

> > Edit: I've raised this in my previous comment already, so there has been review feedback that hasn't been addressed.

> I'm not seeing the feedback you're referring to unfortunately. The only previous feedback I see on my side is the objection that the performance improvement was small (not negative), and requires large numbers (both of which I responded to), as well as the one asking me to run your existing benchmarks -- which you yourself did, and said you didn't see much difference. I'm not seeing anything about benchmarking with random elements, inversion, or other feedback that I didn't address. Did you perhaps type more comments and forget to publish them?

I'm talking about

> so the hint could be completely bogus when you assign from something other than a container of the same type

that might not have been obvious though.

philnik777 · 2025-04-24T18:19:59Z

So, we do have https://libcxx.llvm.org/Contributing.html#the-review-process, but that's... a bit outdated, let's say.

higher-performance · 2025-04-24T18:28:14Z

Ah gotcha! And yeah, it would definitely be great to update any relevant policies. Thanks, I appreciate it!

…the end of the tree every time (llvm#131030) This improves performance for the copy-assignment operators of associative containers such as `std::map`. This optimization already exists in other places in the codebase, and seems to have been missed here.

Optimize std::__tree::__assign_multi to insert the provided range at …

4f12629

…the end of the tree every time This improves performance for the copy-assignment operators of associative containers such as std::map

higher-performance added the libc++ label Mar 12, 2025

higher-performance requested a review from a team as a code owner March 12, 2025 20:59

philnik777 reviewed Mar 25, 2025


philnik777 requested changes Apr 24, 2025


philnik777 approved these changes Apr 24, 2025


higher-performance merged commit 5f91649 into llvm:main Apr 24, 2025
83 checks passed

higher-performance deleted the set_insert branch April 24, 2025 18:30
