@higher-performance
Contributor

std::visit on my machine costs roughly 10 milliseconds per unique invocation to compile, measurable as follows:

#include <variant>

int main(int argc, char* argv[]) {
  std::variant<char, unsigned char, int> v;
  int n = 0;
#define X(V) \
  ++n;       \
  std::visit([](int) {}, V)
#ifdef NEW_VERSION
  // clang-format off
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
// clang-format on
#else
  (void)v;
#endif
#undef X

  return n;
}

This PR hard-codes common cases to speed up compilation by roughly 8x for them.
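
To illustrate what "hard-coding common cases" could mean, here is a minimal sketch written against the public `std::variant` API rather than libc++ internals. The names `switch_visit` and `SKETCH_VISIT_CASE` are hypothetical, and the limit of four alternatives is an arbitrary choice for the sketch, not the PR's limit. The idea is a plain `switch` over `index()`, where each case is compiled only if the variant actually has that alternative.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <variant>

// Hypothetical helper, not libc++ code: one case label per hard-coded index.
// The body is discarded at compile time when the variant has fewer
// alternatives than I, and control then falls through toward `default`.
#define SKETCH_VISIT_CASE(I)                              \
  case I:                                                 \
    if constexpr ((I) < kCount) {                         \
      return std::forward<Visitor>(visitor)(              \
          std::get<(I)>(std::forward<Variant>(var)));     \
    }                                                     \
    [[fallthrough]]

// Visits ONE variant through a hand-written switch instead of a generated
// dispatch table. std::get re-checks the index, which is redundant here but
// keeps the sketch on the public API.
template <class Visitor, class Variant>
constexpr decltype(auto) switch_visit(Visitor&& visitor, Variant&& var) {
  using Plain = std::remove_cv_t<std::remove_reference_t<Variant>>;
  constexpr std::size_t kCount = std::variant_size_v<Plain>;
  static_assert(kCount <= 4, "this sketch hard-codes at most four alternatives");
  switch (var.index()) {
    SKETCH_VISIT_CASE(0);
    SKETCH_VISIT_CASE(1);
    SKETCH_VISIT_CASE(2);
    SKETCH_VISIT_CASE(3);
    default:
      // Reached for a valueless variant (index() == variant_npos).
      throw std::bad_variant_access();
  }
}
#undef SKETCH_VISIT_CASE
```

With `std::variant<char, unsigned char, int> v = 42;`, a call such as `switch_visit([](int x) { return x + 1; }, v)` goes through the `case 2` branch. A real implementation would also need a fallback for larger variants and for multiple variants, which this sketch simply rejects.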

@higher-performance higher-performance requested a review from a team as a code owner October 20, 2025 02:00
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Oct 20, 2025
@llvmbot
Member

llvmbot commented Oct 20, 2025

@llvm/pr-subscribers-libcxx

Author: None (higher-performance)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/164196.diff

1 Files Affected:

  • (modified) libcxx/include/variant (+42-5)
diff --git a/libcxx/include/variant b/libcxx/include/variant
index 9beef146f203c..ef5bca4c2fda0 100644
--- a/libcxx/include/variant
+++ b/libcxx/include/variant
@@ -1578,11 +1578,48 @@ _LIBCPP_HIDE_FROM_ABI constexpr void __throw_if_valueless(_Vs&&... __vs) {
   }
 }
 
-template < class _Visitor, class... _Vs, typename>
-_LIBCPP_HIDE_FROM_ABI constexpr decltype(auto) visit(_Visitor&& __visitor, _Vs&&... __vs) {
-  using __variant_detail::__visitation::__variant;
-  std::__throw_if_valueless(std::forward<_Vs>(__vs)...);
-  return __variant::__visit_value(std::forward<_Visitor>(__visitor), std::forward<_Vs>(__vs)...);
+template <class _Visitor, class... _Vs, typename>
+_LIBCPP_HIDE_FROM_ABI constexpr decltype(auto) visit(_Visitor&& __visitor,
+                                                     _Vs&&... __vs) {
+#define _XDispatchIndex(_I)                                              \
+  case _I:                                                               \
+    if constexpr (__variant_size::value > _I) {                          \
+      return __visitor(                                                  \
+          __variant::__get_alt<_I>(std::forward<_Vs>(__vs)...).__value); \
+    }                                                                    \
+    [[__fallthrough__]]
+#define _XDispatchMax 7 // Speed up compilation for the common cases
+  if constexpr (sizeof...(_Vs) == 1) {
+    if constexpr (variant_size<__remove_cvref_t<_Vs>...>::value <=
+                  _XDispatchMax) {
+      using __variant_detail::__access::__variant;
+      using __variant_size = variant_size<__remove_cvref_t<_Vs>...>;
+      const size_t __indexes[] = {__vs.index()...};
+      switch (__indexes[0]) {
+        _XDispatchIndex(_XDispatchMax - 7);
+        _XDispatchIndex(_XDispatchMax - 6);
+        _XDispatchIndex(_XDispatchMax - 5);
+        _XDispatchIndex(_XDispatchMax - 4);
+        _XDispatchIndex(_XDispatchMax - 3);
+        _XDispatchIndex(_XDispatchMax - 2);
+        _XDispatchIndex(_XDispatchMax - 1);
+        _XDispatchIndex(_XDispatchMax - 0);
+        default:
+          __throw_bad_variant_access();
+      }
+    } else {
+      static_assert(
+          variant_size<__remove_cvref_t<_Vs>...>::value > _XDispatchMax,
+          "forgot to add dispatch case");
+    }
+  } else {
+    using __variant_detail::__visitation::__variant;
+    std::__throw_if_valueless(std::forward<_Vs>(__vs)...);
+    return __variant::__visit_value(std::forward<_Visitor>(__visitor),
+                                    std::forward<_Vs>(__vs)...);
+  }
+#undef _XDispatchMax
+#undef _XDispatchIndex
 }
 
 #    if _LIBCPP_STD_VER >= 20
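
Both the old and new code paths in the diff have to reject a valueless variant (`__throw_if_valueless` before, the `default:` label after). As a hedged illustration of that requirement using only the public API: the type `ThrowsOnCopy` and the function `visit_rejects_valueless` are made up for this sketch, and whether `emplace` really leaves the variant valueless after a throw is up to the implementation, so the function accounts for both outcomes.

```cpp
#include <variant>

// Hypothetical type whose copy constructor always throws, used only to
// interrupt emplace() part-way through.
struct ThrowsOnCopy {
  ThrowsOnCopy() = default;
  ThrowsOnCopy(const ThrowsOnCopy&) { throw 1; }
};

// Returns true if std::visit reports a valueless variant by throwing
// std::bad_variant_access, or if the variant never became valueless.
inline bool visit_rejects_valueless() {
  std::variant<int, ThrowsOnCopy> v = 7;
  try {
    v.emplace<1>(ThrowsOnCopy{});  // the copy constructor throws here
  } catch (int) {
    // Swallow the exception; the variant may now hold no value.
  }
  if (!v.valueless_by_exception()) {
    return true;  // the implementation kept a value, nothing to check
  }
  try {
    std::visit([](const auto&) {}, v);
  } catch (const std::bad_variant_access&) {
    return true;  // the behaviour the standard requires
  }
  return false;
}
```

Any replacement for the dispatch logic needs to keep this behaviour, which is why the switch in the diff ends in a throwing `default:` case.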

@github-actions

github-actions bot commented Oct 20, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@higher-performance higher-performance marked this pull request as draft October 20, 2025 03:11
@higher-performance higher-performance force-pushed the variant-compile-speedup branch 8 times, most recently from 350f45c to 7c3c8e6 Compare October 20, 2025 06:55
@higher-performance higher-performance marked this pull request as ready for review October 20, 2025 14:49
@higher-performance
Contributor Author

Could someone please take a look at this? @philnik777 or anybody else involved?

Contributor

@philnik777 philnik777 left a comment

IIUC this basically fixes #62648. I'm not sure how to proceed here. Previously there were problems with this approach generating a lot of code, which is why it was reverted. This patch is a lot more conservative at least. I guess I'd really like if we had reflection here, since that would allow us to generate the perfect switch/case for any variant. I think I'd be fine with this as a temporary solution. I'd really like you to check compile time and code generation overhead of this though, since, as mentioned, this was a big problem previously.
Please run the variant benchmarks as well and share the results.

@higher-performance
Contributor Author

higher-performance commented Nov 28, 2025

> IIUC this basically fixes #62648. I'm not sure how to proceed here. Previously there were problems with this approach generating a lot of code, which is why it was reverted. This patch is a lot more conservative at least. I guess I'd really like if we had reflection here, since that would allow us to generate the perfect switch/case for any variant. I think I'd be fine with this as a temporary solution. I'd really like you to check compile time and code generation overhead of this though, since, as mentioned, this was a big problem previously. Please run the variant benchmarks as well and share the results.

Sounds good, thanks.
Re: #62648, note that this is only special-casing the (common) case of std::visiting 1 variant with up to 8 types. Invocations for 2+ variants or more than 8 types will still retain whatever problems or other characteristics they may have had before.
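
The split described above (one variant with a bounded number of alternatives versus everything else) can be expressed as a compile-time predicate. This is only a sketch: `takes_fast_path`, `plain_t`, and `kMaxFastPathAlternatives` are hypothetical names, and the value 8 is taken from the "up to 8 types" wording in this comment rather than from the patch itself.

```cpp
#include <cstddef>
#include <type_traits>
#include <variant>

// Hypothetical limit, taken from the "up to 8 types" wording above.
inline constexpr std::size_t kMaxFastPathAlternatives = 8;

// Strips references and cv-qualifiers so variant_size_v can be applied.
template <class T>
using plain_t = std::remove_cv_t<std::remove_reference_t<T>>;

// True only for a single variant with at most kMaxFastPathAlternatives
// alternatives; every other call shape would keep the generic code path.
template <class... Variants>
constexpr bool takes_fast_path() {
  if constexpr (sizeof...(Variants) == 1) {
    return ((std::variant_size_v<plain_t<Variants>> <=
             kMaxFastPathAlternatives) && ...);
  } else {
    return false;
  }
}
```

Under these assumptions, `takes_fast_path<std::variant<char, unsigned char, int>&>()` is `true`, while a call shape with two variants or with nine alternatives yields `false` and would fall back to the existing dispatch.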

@higher-performance higher-performance force-pushed the variant-compile-speedup branch 2 times, most recently from ed6da80 to 186c480 Compare November 28, 2025 15:52
@higher-performance higher-performance changed the title Speed up compilation of common uses of std::visit() by ~8x Speed up compilation of common uses of std::visit() Nov 28, 2025
@higher-performance
Contributor Author

Done.

First, re: the runtime benchmarks, I had to run them a bit ad hoc via Google Benchmark since I don't have the official setup handy, but regardless, they actually indicate a speedup for 2 to 7 alternatives (1 and 8 alternatives are roughly unchanged):

Before:

Benchmark                       Time(ns)      CPU(ns)  Iterations
BM_Visit<1, 1>_mean              2.13           2.13    25000000  
BM_Visit<1, 2>_mean              3.22           3.22    25000000  
BM_Visit<1, 3>_mean              3.20           3.20    25000000  
BM_Visit<1, 4>_mean              3.21           3.21    25000000  
BM_Visit<1, 5>_mean              3.21           3.20    25000000  
BM_Visit<1, 6>_mean              3.22           3.22    25000000  
BM_Visit<1, 7>_mean              3.20           3.20    25000000  
BM_Visit<1, 8>_mean              3.21           3.21    25000000

After:

Benchmark                       Time(ns)      CPU(ns)  Iterations
BM_Visit<1, 1>_mean              2.19           2.19    25000000  
BM_Visit<1, 2>_mean              2.20           2.20    25000000  
BM_Visit<1, 3>_mean              2.18           2.18    25000000  
BM_Visit<1, 4>_mean              2.18           2.18    25000000  
BM_Visit<1, 5>_mean              2.22           2.22    25000000  
BM_Visit<1, 6>_mean              2.19           2.19    25000000  
BM_Visit<1, 7>_mean              2.19           2.19    25000000  
BM_Visit<1, 8>_mean              3.27           3.27    25000000  
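
For readers without Google Benchmark at hand, the kind of measurement behind a `BM_Visit<1, 3>`-style row can be approximated with `std::chrono`. This is a rough sketch, not the benchmark used above: it assumes the first template argument is the number of variants and the second the number of alternatives, and `average_visit_ns` is a hypothetical name. Its numbers are not comparable to the tables.

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <utility>
#include <variant>

// Times `iterations` calls of std::visit on a single three-alternative
// variant and returns the average cost per call in nanoseconds.
inline double average_visit_ns(std::size_t iterations) {
  using V = std::variant<char, unsigned char, int>;
  // One input per alternative so every index is exercised in turn.
  const std::array<V, 3> inputs = {V{std::in_place_index<0>, 'a'},
                                   V{std::in_place_index<1>, 'b'},
                                   V{std::in_place_index<2>, 3}};
  volatile long sink = 0;  // keeps the optimizer from deleting the loop
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    sink = sink + std::visit([](auto x) -> long { return x; },
                             inputs[i % inputs.size()]);
  }
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return iterations == 0 ? 0.0
                         : elapsed.count() / static_cast<double>(iterations);
}
```

A proper benchmark harness adds warm-up, repetition, and statistics on top of this, which is why the tables report `_mean` rows.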

As for compile-time benchmarking, I also tested it like this:

#include <variant>

int main(int argc, char* argv[]) {
  std::variant<char, unsigned char, int> v;
  v.emplace<0>(3);
  int n = 0;
  unsigned int r = 1;
#define X(V) \
  ++n;       \
  std::visit([&](int x) { r *= x; }, V)
  (void)--n, X(v);
#ifdef NEW_VERSION
  // clang-format off
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
// clang-format on
#else
  (void)v;
#endif
#undef X

  return r % 1000 == 1 ? -1 : n;
}

Under -O3 I got:

  • Baseline: only 1 variant call: 5216 bytes
  • 64 extra calls (new implementation): 5216 bytes, +0.1 ms
  • 64 extra calls (old implementation): 54104 bytes, +0.43 ms

My setup/system is a bit different from last time, so it's not quite 8x here, but still, it's a huge win.
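
The 64 extra calls each count as a separate instantiation because every lambda expression has its own closure type, even when the lambdas are textually identical. A small sketch of that point, with a hypothetical function name `two_lambdas_two_instantiations`:

```cpp
#include <type_traits>
#include <variant>

// Two textually identical lambdas still have distinct closure types, so each
// std::visit call below instantiates the visitation machinery separately.
inline int two_lambdas_two_instantiations() {
  std::variant<char, unsigned char, int> v = 5;
  auto first = [](int x) { return x; };
  auto second = [](int x) { return x; };
  static_assert(!std::is_same_v<decltype(first), decltype(second)>,
                "each lambda expression has a unique type");
  return std::visit(first, v) + std::visit(second, v);
}
```

This is why the `X(v)` macro in the test program, which expands to a fresh lambda on every use, is a reasonable stand-in for many independent call sites.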

tl;dr: it's a strict win on every axis I measured. @philnik777

@higher-performance higher-performance force-pushed the variant-compile-speedup branch 5 times, most recently from ab3fa02 to d4b0d1e Compare November 29, 2025 16:21