[Clang] Do not warn on UTF-16 -> UTF-32 conversions. #163927

cor3ntin · 2025-10-17T08:51:27Z

UTF-16 to UTF-32 conversions seem widespread,
and lone surrogates have a distinct representation in UTF-32.

Let's not warn on this case, to make the warning easier to adopt. This follows the SG-16 guideline:

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3695r2.html#changes-since-r1

Fixes #163719
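
To illustrate the rationale above, here is a minimal sketch (not part of the PR) showing that widening a UTF-16 code unit to `char32_t` keeps its numeric value, even for a lone surrogate, while narrowing a non-BMP code point does not. The helper names `widen` and `is_surrogate` are hypothetical, and the last assertion assumes `char16_t` is exactly 16 bits wide, as it is on mainstream platforms.

```cpp
// Hypothetical helper: implicit char16_t -> char32_t conversion.
// The value is zero-extended, so the numeric value is unchanged.
constexpr char32_t widen(char16_t unit) { return unit; }

// Hypothetical helper: true for values in the surrogate range U+D800..U+DFFF.
constexpr bool is_surrogate(char32_t value) {
    return value >= 0xD800 && value <= 0xDFFF;
}

// A BMP character keeps its meaning when widened.
static_assert(widen(u'A') == U'A', "ASCII survives widening");

// A lone surrogate keeps its numeric value too; it stays recognizable
// as a surrogate on the char32_t side.
static_assert(widen(char16_t(0xD800)) == char32_t(0xD800), "value preserved");
static_assert(is_surrogate(widen(char16_t(0xDFFF))), "still a surrogate");

// The opposite direction is lossy for code points above U+FFFF:
// U+1F409 is truncated to 0xF409 (assuming a 16-bit char16_t).
static_assert(char16_t(char32_t(0x1F409)) == char16_t(0xF409), "truncated");
```

Since the widened value can always be inspected with a check like `is_surrogate`, the conversion itself loses no information, which is consistent with dropping the warning for this direction only.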

llvmbot · 2025-10-17T08:52:05Z

@llvm/pr-subscribers-clang

Author: Corentin Jabot (cor3ntin)

Changes

UTF-16 to UTF-32 conversions seem widespread,
and lone surrogates have a distinct representation in UTF-32.

Let's not warn on this case, to make the warning easier to adopt. This follows the SG-16 guideline:

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3695r2.html#changes-since-r1

Fixes #163719

Full diff: https://github.com/llvm/llvm-project/pull/163927.diff

2 Files Affected:

(modified) clang/lib/Sema/SemaChecking.cpp (+8-1)
(modified) clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp (+4-4)

diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index 4f409ca0f414d..0300d09be0420 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -12309,13 +12309,20 @@ static void DiagnoseMixedUnicodeImplicitConversion(Sema &S, const Type *Source,
                                                    SourceLocation CC) {
   assert(Source->isUnicodeCharacterType() && Target->isUnicodeCharacterType() &&
          Source != Target);
+
+  // Lone surrogates have a distinct representation in UTF-32.
+  // Converting betweem UTF-16 and UTF-32 codepoint seems very widespread,
+  // so don't warn on such conversion.
+  if(Source->isChar16Type() && Target->isChar32Type())
+      return;
+
   Expr::EvalResult Result;
   if (E->EvaluateAsInt(Result, S.getASTContext(), Expr::SE_AllowSideEffects,
                        S.isConstantEvaluatedContext())) {
     llvm::APSInt Value(32);
     Value = Result.Val.getInt();
     bool IsASCII = Value <= 0x7F;
-    bool IsBMP = Value <= 0xD7FF || (Value >= 0xE000 && Value <= 0xFFFF);
+    bool IsBMP = Value <= 0xDFFF || (Value >= 0xE000 && Value <= 0xFFFF);
     bool ConversionPreservesSemantics =
         IsASCII || (!Source->isChar8Type() && !Target->isChar8Type() && IsBMP);
 
diff --git a/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp b/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
index fcff006d0e028..f17f20ca25295 100644
--- a/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
+++ b/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
@@ -14,7 +14,7 @@ void test(char8_t u8, char16_t u16, char32_t u32) {
     c16(u32); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' may lose precision and change the meaning of the represented code unit}}
 
     c32(u8);  // expected-warning {{implicit conversion from 'char8_t' to 'char32_t' may change the meaning of the represented code unit}}
-    c32(u16); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' may change the meaning of the represented code unit}}
+    c32(u16);
     c32(u32);
 
 
@@ -30,7 +30,7 @@ void test(char8_t u8, char16_t u16, char32_t u32) {
     c16(char32_t(0x7f));
     c16(char32_t(0x80));
     c16(char32_t(0xD7FF));
-    c16(char32_t(0xD800)); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the code unit '<0xD800>'}}
+    c16(char32_t(0xD800));
     c16(char32_t(0xE000));
     c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the code point '🐉'}}
 
@@ -44,8 +44,8 @@ void test(char8_t u8, char16_t u16, char32_t u32) {
     c32(char16_t(0x80));
 
     c32(char16_t(0xD7FF));
-    c32(char16_t(0xD800)); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' changes the meaning of the code unit '<0xD800>'}}
-    c32(char16_t(0xDFFF)); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' changes the meaning of the code unit '<0xDFFF>'}}
+    c32(char16_t(0xD800));
+    c32(char16_t(0xDFFF));
     c32(char16_t(0xE000));
     c32(char16_t(u'☕'));
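
The constant-value path touched by this diff can be modeled as a small standalone function. This is a hedged sketch only: the `Utf` enum and `preservesMeaning` are hypothetical names, not Clang API, and the model covers just the early return and the `IsASCII`/`IsBMP` logic visible in the hunk above, not the rest of `DiagnoseMixedUnicodeImplicitConversion`.

```cpp
// Hypothetical stand-in for the three Unicode character types.
enum class Utf { Char8, Char16, Char32 };

// Hypothetical model of the post-change logic for a known constant value.
// Returns true when, per the diff, no diagnostic would be needed.
constexpr bool preservesMeaning(Utf source, Utf target, char32_t value) {
    // New early exit: char16_t -> char32_t is never diagnosed.
    if (source == Utf::Char16 && target == Utf::Char32)
        return true;

    const bool isAscii = value <= 0x7F;
    // After the change the two ranges (<= 0xDFFF, 0xE000..0xFFFF) are
    // contiguous, so together they cover every value up to 0xFFFF.
    const bool isBmp = value <= 0xFFFF;

    return isAscii ||
           (source != Utf::Char8 && target != Utf::Char8 && isBmp);
}

// Mirrors the updated test expectations shown in the diff.
static_assert(preservesMeaning(Utf::Char32, Utf::Char16, 0xD7FF), "");
static_assert(preservesMeaning(Utf::Char32, Utf::Char16, 0xD800), "");
static_assert(preservesMeaning(Utf::Char16, Utf::Char32, 0xDFFF), "");
static_assert(!preservesMeaning(Utf::Char32, Utf::Char16, 0x1F409), "");
```

One consequence visible in the model: because the surrogate range is now counted as BMP, the `char32_t` to `char16_t` conversion of `0xD800` also stops warning, which matches the changed test line for `c16(char32_t(0xD800))`.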

github-actions · 2025-10-17T08:53:17Z

✅ With the latest revision this PR passed the C/C++ code formatter.

AaronBallman

Thank you for working on this!

AaronBallman · 2025-10-17T11:49:11Z

clang/lib/Sema/SemaChecking.cpp

         Source != Target);
+
+  // Lone surrogates have a distinct representation in UTF-32.
+  // Converting betweem UTF-16 and UTF-32 codepoint seems very widespread,


Suggested change

-  // Converting betweem UTF-16 and UTF-32 codepoint seems very widespread,
+  // Converting between UTF-16 and UTF-32 codepoints seems very widespread,

Would it make sense to allow this as an opt-in warning so folks who want to catch those conversions still can?

I did consider that, but given the feedback we probably want to backport this change, so I kept it simple.

cor3ntin · 2025-10-17T18:13:39Z

@AaronBallman

AaronBallman

LGTM, though this should have a release note

cor3ntin · 2025-10-17T19:01:53Z

@AaronBallman nope, i want to backport it :)

AaronBallman · 2025-10-17T19:05:12Z

@AaronBallman nope, i want to backport it :)

Sure! Make sure the backport has the release note then. :-D

llvmbot added the clang and clang:frontend labels Oct 17, 2025

cor3ntin force-pushed the corentin/gh163719 branch from e232d71 to 2943d39 Compare October 17, 2025 09:11

AaronBallman reviewed Oct 17, 2025

fix typos

d0f94e9

AaronBallman approved these changes Oct 17, 2025
