Skip to content

Conversation

@statham-arm
Copy link
Collaborator

Fixes #51158: instantiating a basic_string with a character type of 3, 5, or greater than 8 wouldn't work properly, because the two members of its internal union disagreed on the location of the bit flag indicating which member was live.

This commit completely rewrites the calculations of how much to pad the two sub-structures so that their lengths match. Two new static_asserts check that, after all the calculation is done, the compiler agrees that the structures really did end up the same size. A new test instantiates basic_string with every size of character type up to 16, and four different alignment requirements.

The new code now potentially has to pad both sub-structures, depending on what is and is not divisible by the size of the character type. This also means that the __padding type must have a method to zero it out (otherwise the __grow_by method stops being constexpr, which an existing test depends on).

I've also added some extra checks to the existing test push_back.pass.cpp, which could have found this issue by checking that pushing huge characters on to a string updated the length in the appropriate way.

Fixes llvm#51158: instantiating a `basic_string` with a character type of
3, 5, or greater than 8 wouldn't work properly, because the two
members of its internal union disagreed on the location of the bit
flag indicating which member was live.

This commit completely rewrites the calculations of how much to pad
the two sub-structures so that their lengths match. Two new
static_asserts check that, after all the calculation is done, the
compiler agrees that the structures really did end up the same size. A
new test instantiates `basic_string` with every size of character type
up to 16, and four different alignment requirements.

The new code now potentially has to pad _both_ sub-structures,
depending on what is and is not divisible by the size of the character
type. This also means that the `__padding` type must have a method to
zero it out (otherwise the `__grow_by` method stops being constexpr,
which an existing test depends on).

I've also added some extra checks to the existing test
`push_back.pass.cpp`, which _could_ have found this issue by checking
that pushing huge characters on to a string updated the length in the
appropriate way.
@statham-arm statham-arm requested review from a team, ldionne and philnik777 November 19, 2025 17:28
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Nov 19, 2025
@llvmbot
Copy link
Member

llvmbot commented Nov 19, 2025

@llvm/pr-subscribers-libcxx

Author: Simon Tatham (statham-arm)

Changes

Fixes #51158: instantiating a basic_string with a character type of 3, 5, or greater than 8 wouldn't work properly, because the two members of its internal union disagreed on the location of the bit flag indicating which member was live.

This commit completely rewrites the calculations of how much to pad the two sub-structures so that their lengths match. Two new static_asserts check that, after all the calculation is done, the compiler agrees that the structures really did end up the same size. A new test instantiates basic_string with every size of character type up to 16, and four different alignment requirements.

The new code now potentially has to pad both sub-structures, depending on what is and is not divisible by the size of the character type. This also means that the __padding type must have a method to zero it out (otherwise the __grow_by method stops being constexpr, which an existing test depends on).

I've also added some extra checks to the existing test push_back.pass.cpp, which could have found this issue by checking that pushing huge characters on to a string updated the length in the appropriate way.


Full diff: https://github.com/llvm/llvm-project/pull/168742.diff

3 Files Affected:

  • (modified) libcxx/include/string (+51-6)
  • (added) libcxx/test/std/strings/basic.string/awkward-char-types.pass.cpp (+114)
  • (modified) libcxx/test/std/strings/basic.string/string.modifiers/string_append/push_back.pass.cpp (+4)
diff --git a/libcxx/include/string b/libcxx/include/string
index 09fc6228c4fdb..50fcf5dc6fa44 100644
--- a/libcxx/include/string
+++ b/libcxx/include/string
@@ -719,10 +719,16 @@ struct __init_with_sentinel_tag {};
 template <size_t _PaddingSize>
 struct __padding {
   char __padding_[_PaddingSize];
+  constexpr void clear() {
+    __padding __initialized = {0};
+    *this = __initialized;
+  }
 };
 
 template <>
-struct __padding<0> {};
+struct __padding<0> {
+  constexpr void clear() {}
+};
 
 template <class _CharT, class _Traits, class _Allocator>
 class basic_string {
@@ -819,6 +825,33 @@ private:
 
 #  ifdef _LIBCPP_ABI_ALTERNATE_STRING_LAYOUT
 
+  // The __long structure contains at least a `pointer` to an allocated buffer,
+  // and two `size_type` (the second one being a bitfield container). This
+  // struct, only used by `sizeof`, is the minimum that will fit those fields.
+  struct __long_min {
+    pointer __data_;
+    size_type __size_;
+    size_type __bitfield_container;
+  };
+
+  // The __short structure has a byte-sized bitfield container at the end, and
+  // the rest of it can be taken up by value_type. We want at least two
+  // value_type, or more if they will fit.
+  static constexpr size_t __fit_cap = (sizeof(__long_min) - 1) / sizeof(value_type);
+  static constexpr size_t __min_cap = __fit_cap > 2 ? __fit_cap : 2;
+
+  // Now we know how many value_type will fit, calculate how much space they
+  // take up in total; add one byte for the final bitfield container; and round
+  // up to the nearest multiple of the alignment. That's the total size of the
+  // structure.
+  static constexpr size_t __short_packed_size = sizeof(value_type) * __min_cap + 1;
+  union __union_alignment_check { __long_min __long_; value_type v; };
+  static constexpr size_t __union_alignment = _LIBCPP_ALIGNOF(__union_alignment_check);
+  static constexpr size_t __full_size = (__short_packed_size + __union_alignment - 1) & -__union_alignment;
+
+  // Now define both structures for real, with padding to ensure they are both
+  // exactly the calculated size.
+
   struct __long {
     __long() = default;
 
@@ -829,19 +862,23 @@ private:
 
     pointer __data_;
     size_type __size_;
+    _LIBCPP_NO_UNIQUE_ADDRESS __padding<__full_size - sizeof(__long_min)> __padding_;
     size_type __cap_ : sizeof(size_type) * CHAR_BIT - 1;
     size_type __is_long_ : 1;
   };
 
-  enum { __min_cap = (sizeof(__long) - 1) / sizeof(value_type) > 2 ? (sizeof(__long) - 1) / sizeof(value_type) : 2 };
-
   struct __short {
     value_type __data_[__min_cap];
-    _LIBCPP_NO_UNIQUE_ADDRESS __padding<sizeof(value_type) - 1> __padding_;
+    _LIBCPP_NO_UNIQUE_ADDRESS __padding<__full_size - __short_packed_size> __padding_;
     unsigned char __size_    : 7;
     unsigned char __is_long_ : 1;
   };
 
+  // Finally, check that we got it all right, and that both structures have
+  // exactly the expected size.
+  static_assert(sizeof(__long) == __full_size, "Miscalculated size for __long");
+  static_assert(sizeof(__short) == __full_size, "Miscalculated size for __short");
+
   // The __endian_factor is required because the field we use to store the size
   // has one fewer bit than it would if it were not a bitfield.
   //
@@ -887,6 +924,13 @@ private:
     };
     size_type __size_;
     pointer __data_;
+
+    // No padding is needed in this version of the structure, but we add a
+    // zero-sized __padding_ member anyway to match the
+    // `_LIBCPP_ABI_ALTERNATE_STRING_LAYOUT` version, so that
+    // `__padding_.clear()` can be performed unconditionally in code common to
+    // both layouts.
+    _LIBCPP_NO_UNIQUE_ADDRESS __padding<0> __padding_;
   };
 
   enum { __min_cap = (sizeof(__long) - 1) / sizeof(value_type) > 2 ? (sizeof(__long) - 1) / sizeof(value_type) : 2 };
@@ -900,10 +944,10 @@ private:
     value_type __data_[__min_cap];
   };
 
-#  endif // _LIBCPP_ABI_ALTERNATE_STRING_LAYOUT
-
   static_assert(sizeof(__short) == (sizeof(value_type) * (__min_cap + 1)), "__short has an unexpected size.");
 
+#  endif // _LIBCPP_ABI_ALTERNATE_STRING_LAYOUT
+
   union __rep {
     __short __s;
     __long __l;
@@ -2764,6 +2808,7 @@ _LIBCPP_DEPRECATED_("use __grow_by_without_replace") basic_string<_CharT, _Trait
   // This is -1 to make sure the caller sets the size properly, since old versions of this function didn't set the size
   // at all.
   __buffer.__size_ = -1;
+  __buffer.__padding_.clear();
   __reset_internal_buffer(__buffer);
 }
 
diff --git a/libcxx/test/std/strings/basic.string/awkward-char-types.pass.cpp b/libcxx/test/std/strings/basic.string/awkward-char-types.pass.cpp
new file mode 100644
index 0000000000000..a4e258680cd2e
--- /dev/null
+++ b/libcxx/test/std/strings/basic.string/awkward-char-types.pass.cpp
@@ -0,0 +1,114 @@
+#include <stdint.h>
+#include <string>
+#include <algorithm>
+#include <cassert>
+
+#include "test_macros.h"
+
+template<typename Char>
+void test_string() {
+  // Make a test string.
+  std::basic_string<Char> s;
+  LIBCPP_ASSERT(s.size() == 0);
+
+  // Append enough chars to it that we must have switched over from a short
+  // string stored internally to a long one pointing to a dynamic buffer,
+  // causing a reallocation.
+  unsigned n = sizeof(s) / sizeof(Char) + 1;
+  for (unsigned i = 0; i < n; i++) {
+    s.push_back(Char::from_integer(i));
+    LIBCPP_ASSERT(s.size() == i + 1);
+  }
+
+  // Check that all the chars were correctly copied during the realloc.
+  for (unsigned i = 0; i < n; i++) {
+    LIBCPP_ASSERT(s[i] == Char::from_integer(i));
+  }
+}
+
+template<typename Integer, size_t N>
+struct TestChar {
+  Integer values[N];
+
+  static TestChar from_integer(unsigned index) {
+    TestChar ch;
+    for (size_t i = 0; i < N; i++)
+      ch.values[i] = index + i;
+    return ch;
+  }
+
+  bool operator==(const TestChar &other) const {
+    return 0 == memcmp(values, other.values, sizeof(values));
+  }
+  bool operator<(const TestChar &other) const {
+    return 0 < memcmp(values, other.values, sizeof(values));
+  }
+};
+
+template<typename Integer, size_t N>
+struct std::char_traits<TestChar<Integer, N>> {
+  using char_type  = TestChar<Integer, N>;
+  using int_type   = int;
+  using off_type   = streamoff;
+  using pos_type   = streampos;
+  using state_type = mbstate_t;
+
+  static TEST_CONSTEXPR_CXX20 void assign(char_type& c1, const char_type& c2) { c1 = c2; }
+  static bool eq(char_type c1, char_type c2);
+  static bool lt(char_type c1, char_type c2);
+
+  static int compare(const char_type* s1, const char_type* s2, std::size_t n);
+  static std::size_t length(const char_type* s);
+  static const char_type* find(const char_type* s, std::size_t n, const char_type& a);
+  static char_type* move(char_type* s1, const char_type* s2, std::size_t n);
+  static TEST_CONSTEXPR_CXX20 char_type* copy(char_type* s1, const char_type* s2, std::size_t n) {
+    std::copy_n(s2, n, s1);
+    return s1;
+  }
+  static TEST_CONSTEXPR_CXX20 char_type* assign(char_type* s, std::size_t n, char_type a) {
+    std::fill_n(s, n, a);
+    return s;
+  }
+
+  static int_type not_eof(int_type c);
+  static char_type to_char_type(int_type c);
+  static int_type to_int_type(char_type c);
+  static bool eq_int_type(int_type c1, int_type c2);
+  static int_type eof();
+};
+
+int main(int, char**) {
+  test_string<TestChar<uint8_t, 1>>();
+  test_string<TestChar<uint8_t, 2>>();
+  test_string<TestChar<uint8_t, 3>>();
+  test_string<TestChar<uint8_t, 4>>();
+  test_string<TestChar<uint8_t, 5>>();
+  test_string<TestChar<uint8_t, 6>>();
+  test_string<TestChar<uint8_t, 7>>();
+  test_string<TestChar<uint8_t, 8>>();
+  test_string<TestChar<uint8_t, 9>>();
+  test_string<TestChar<uint8_t, 10>>();
+  test_string<TestChar<uint8_t, 11>>();
+  test_string<TestChar<uint8_t, 12>>();
+  test_string<TestChar<uint8_t, 13>>();
+  test_string<TestChar<uint8_t, 14>>();
+  test_string<TestChar<uint8_t, 15>>();
+  test_string<TestChar<uint8_t, 16>>();
+
+  test_string<TestChar<uint16_t, 1>>();
+  test_string<TestChar<uint16_t, 2>>();
+  test_string<TestChar<uint16_t, 3>>();
+  test_string<TestChar<uint16_t, 4>>();
+  test_string<TestChar<uint16_t, 5>>();
+  test_string<TestChar<uint16_t, 6>>();
+  test_string<TestChar<uint16_t, 7>>();
+  test_string<TestChar<uint16_t, 8>>();
+
+  test_string<TestChar<uint32_t, 1>>();
+  test_string<TestChar<uint32_t, 2>>();
+  test_string<TestChar<uint32_t, 3>>();
+  test_string<TestChar<uint32_t, 4>>();
+
+  test_string<TestChar<uint64_t, 1>>();
+  test_string<TestChar<uint64_t, 2>>();
+}
diff --git a/libcxx/test/std/strings/basic.string/string.modifiers/string_append/push_back.pass.cpp b/libcxx/test/std/strings/basic.string/string.modifiers/string_append/push_back.pass.cpp
index 8c2324c9d1759..34669c28d624c 100644
--- a/libcxx/test/std/strings/basic.string/string.modifiers/string_append/push_back.pass.cpp
+++ b/libcxx/test/std/strings/basic.string/string.modifiers/string_append/push_back.pass.cpp
@@ -84,9 +84,13 @@ TEST_CONSTEXPR_CXX20 bool test() {
     // https://llvm.org/PR31454
     std::basic_string<VeryLarge> s;
     VeryLarge vl = {};
+    LIBCPP_ASSERT(s.size() == 0);
     s.push_back(vl);
+    LIBCPP_ASSERT(s.size() == 1);
     s.push_back(vl);
+    LIBCPP_ASSERT(s.size() == 2);
     s.push_back(vl);
+    LIBCPP_ASSERT(s.size() == 3);
   }
 
   return true;

@github-actions
Copy link

github-actions bot commented Nov 19, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

void test_string() {
// Make a test string.
std::basic_string<Char> s;
LIBCPP_ASSERT(s.size() == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not assert? I believe this is standard-guaranteed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other test I was modifying used LIBCPP_ASSERT, and I assumed that was local convention in libc++ tests. But now I look more closely, the purpose of LIBCPP_ASSERT seems to be to switch between runtime and static assertions, and this test isn't trying to do anything static at compile time.

One of the buildbot failures was objecting to a constexpr function
returning void, I guess because that's from a later C++ version.
Another CI failure occurred because of compiling in C++03 mode, where
'constexpr' isn't even a keyword. Replaced with
_LIBCPP_CONSTEXPR_SINCE_20 where feasible, and enum otherwise.
This one is particularly annoying because my local run of
git-clang-format didn't pick it up and I had to round-trip through
official CI to even find out there was a problem!
Spaced out all the >> in template suffixes so that C++03 won't parse
them as a right shift operator.
This wasn't spotted until stage 3 of buildbots: having used `enum` to
define all those constants, you now can't subtract one of them from
another without provoking a warning about distinct integer types being
used together. static_casting them all back to size_t makes the
generic-abi-unstable build happy again.
Ah! Now I understand why my local run keeps missing these formatting
changes. It's because by default it only does C++ formatting on files
that end in `.cpp`, so it doesn't search `string` by default, which is
where I keep making changes. It's not that CI is running a more
up-to-date clang-format which disagrees on style. It's that it's using
a longer command line which reminds it to check extensionless C++
standard headers.
The ASan failure in CI complained that a buffer had been allocated as
7 chars (in the case where sizeof(char_type)==5) and freed as only 6.
That's because the test was running in a configuration with
__endian_factor==2, and __align_allocation_size is supposed to always
return an even number in that situation, and it wasn't.

Fixed by introducing a new more careful code path for non-power-of-2
char sizes, and also adjusting the caller so that it actually
remembers the value it got back from __align_allocation_size, to pass
the same value to free later.

(I'm also suspicious of the __allocate_at_least call, which isn't
using the returned count field. That too might be odd, which would
stop us from storing it in the __long size field. At the moment this
isn't failing because __allocate_at_least never allocates more than
it's asked to.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

std::basic_string wrong capacity calculation when sizeof(CharT)=3, 5, or >8

3 participants