Enable fexec-charset option #138895

abhina-sree · 2025-05-07T15:30:44Z

This patch enables the fexec-charset option to control the execution charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms.

This patch depends on adding the TextEncodingConverter class #138893

Relevant RFCs:
https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795
https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512

The table below summarizes what encoding needs to be used for string literals depending on the context. This change needs to be added in a followup PR

╔══════════════════════════════════════╤═════════════════════════════════╗
║Context                               │Output Codepage                  ║
╠══════════════════════════════════════╪═════════════════════════════════╣
║asm("...")                            │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║typeinfo name                         │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║#include "fn" or #include <fn>        │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║__FILE__ , __func__                   │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║literal, string & char                │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║user-defined literal                  │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║extern "C" ...                        │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║_Pragma("message(...)”)               │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║attribute args (GNU, GCC)             │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║module map string literals (-fmodules)│n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║line directive, digit directive       │n/a (internal charset, UTF-8)    ║
╚══════════════════════════════════════╧═════════════════════════════════╝

llvmbot · 2025-05-07T15:31:20Z

@llvm/pr-subscribers-llvm-support
@llvm/pr-subscribers-clang-driver

@llvm/pr-subscribers-clang

Author: Abhina Sree (abhina-sree)

Changes

This patch enables the fexec-charset option to control the execution charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms.

This patch depends on adding the CharSetConverter class #138893

Patch is 34.26 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138895.diff

20 Files Affected:

(modified) clang/docs/LanguageExtensions.rst (+1-2)
(modified) clang/include/clang/Basic/LangOptions.h (+3)
(modified) clang/include/clang/Basic/TokenKinds.h (+7)
(modified) clang/include/clang/Driver/Options.td (+5)
(added) clang/include/clang/Lex/LiteralConverter.h (+36)
(modified) clang/include/clang/Lex/LiteralSupport.h (+8-4)
(modified) clang/include/clang/Lex/Preprocessor.h (+3)
(modified) clang/lib/Driver/ToolChains/Clang.cpp (+13-4)
(modified) clang/lib/Frontend/CompilerInstance.cpp (+4)
(modified) clang/lib/Frontend/InitPreprocessor.cpp (+8-4)
(modified) clang/lib/Lex/CMakeLists.txt (+1)
(added) clang/lib/Lex/LiteralConverter.cpp (+68)
(modified) clang/lib/Lex/LiteralSupport.cpp (+106-22)
(added) clang/test/CodeGen/systemz-charset.c (+35)
(added) clang/test/CodeGen/systemz-charset.cpp (+46)
(modified) clang/test/Driver/cl-options.c (+4-3)
(modified) clang/test/Driver/clang_f_opts.c (+9-3)
(modified) clang/test/Preprocessor/init-s390x.c (+1)
(modified) llvm/include/llvm/TargetParser/Triple.h (+3)
(modified) llvm/lib/TargetParser/Triple.cpp (+7)

diff --git a/clang/docs/LanguageExtensions.rst b/clang/docs/LanguageExtensions.rst
index ebcad44197ce4..44e20623d4d0b 100644
--- a/clang/docs/LanguageExtensions.rst
+++ b/clang/docs/LanguageExtensions.rst
@@ -416,8 +416,7 @@ Builtin Macros
 ``__clang_literal_encoding__``
   Defined to a narrow string literal that represents the current encoding of
   narrow string literals, e.g., ``"hello"``. This macro typically expands to
-  "UTF-8" (but may change in the future if the
-  ``-fexec-charset="Encoding-Name"`` option is implemented.)
+  the charset specified by -fexec-charset if specified, or the system charset.
 
 ``__clang_wide_literal_encoding__``
   Defined to a narrow string literal that represents the current encoding of
diff --git a/clang/include/clang/Basic/LangOptions.h b/clang/include/clang/Basic/LangOptions.h
index 491e8bee9fd5c..559a4be70b74c 100644
--- a/clang/include/clang/Basic/LangOptions.h
+++ b/clang/include/clang/Basic/LangOptions.h
@@ -633,6 +633,9 @@ class LangOptions : public LangOptionsBase {
   bool AtomicFineGrainedMemory = false;
   bool AtomicIgnoreDenormalMode = false;
 
+  /// Name of the exec charset to convert the internal charset to.
+  std::string ExecCharset;
+
   LangOptions();
 
   /// Set language defaults for the given input language and
diff --git a/clang/include/clang/Basic/TokenKinds.h b/clang/include/clang/Basic/TokenKinds.h
index 1b133dde89587..34f6133973e71 100644
--- a/clang/include/clang/Basic/TokenKinds.h
+++ b/clang/include/clang/Basic/TokenKinds.h
@@ -101,6 +101,13 @@ inline bool isLiteral(TokenKind K) {
          isStringLiteral(K) || K == tok::header_name || K == tok::binary_data;
 }
 
+/// Return true if this is a utf literal kind.
+inline bool isUTFLiteral(TokenKind K) {
+  return K == tok::utf8_char_constant || K == tok::utf8_string_literal ||
+         K == tok::utf16_char_constant || K == tok::utf16_string_literal ||
+         K == tok::utf32_char_constant || K == tok::utf32_string_literal;
+}
+
 /// Return true if this is any of tok::annot_* kinds.
 bool isAnnotation(TokenKind K);
 
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 30ea75bb108d5..9d352eb1270fe 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -7197,6 +7197,11 @@ let Visibility = [CC1Option, CC1AsOption, FC1Option] in {
 def tune_cpu : Separate<["-"], "tune-cpu">,
   HelpText<"Tune for a specific cpu type">,
   MarshallingInfoString<TargetOpts<"TuneCPU">>;
+def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
+  HelpText<"Set the execution <charset> for string and character literals. "
+           "Supported character encodings include ISO8859-1, UTF-8, IBM-1047 "
+           "and those supported by the host icu or iconv library.">,
+  MarshallingInfoString<LangOpts<"ExecCharset">>;
 def target_cpu : Separate<["-"], "target-cpu">,
   HelpText<"Target a specific cpu type">,
   MarshallingInfoString<TargetOpts<"CPU">>;
diff --git a/clang/include/clang/Lex/LiteralConverter.h b/clang/include/clang/Lex/LiteralConverter.h
new file mode 100644
index 0000000000000..203111255b791
--- /dev/null
+++ b/clang/include/clang/Lex/LiteralConverter.h
@@ -0,0 +1,36 @@
+//===--- clang/Lex/LiteralConverter.h - Translator for Literals -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CLANG_LEX_LITERALCONVERTER_H
+#define LLVM_CLANG_LEX_LITERALCONVERTER_H
+
+#include "clang/Basic/Diagnostic.h"
+#include "clang/Basic/LangOptions.h"
+#include "clang/Basic/TargetInfo.h"
+#include "llvm/ADT/StringMap.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/CharSet.h"
+
+enum ConversionAction { NoConversion, ToSystemCharset, ToExecCharset };
+
+class LiteralConverter {
+  llvm::StringRef InternalCharset;
+  llvm::StringRef SystemCharset;
+  llvm::StringRef ExecCharset;
+  llvm::StringMap<llvm::CharSetConverter> CharsetConverters;
+
+public:
+  llvm::CharSetConverter *getConverter(const char *Codepage);
+  llvm::CharSetConverter *getConverter(ConversionAction Action);
+  llvm::CharSetConverter *createAndInsertCharConverter(const char *To);
+  void setConvertersFromOptions(const clang::LangOptions &Opts,
+                                const clang::TargetInfo &TInfo,
+                                clang::DiagnosticsEngine &Diags);
+};
+
+#endif
diff --git a/clang/include/clang/Lex/LiteralSupport.h b/clang/include/clang/Lex/LiteralSupport.h
index ea5f63bc20399..114d50c226a29 100644
--- a/clang/include/clang/Lex/LiteralSupport.h
+++ b/clang/include/clang/Lex/LiteralSupport.h
@@ -17,10 +17,12 @@
 #include "clang/Basic/CharInfo.h"
 #include "clang/Basic/LLVM.h"
 #include "clang/Basic/TokenKinds.h"
+#include "clang/Lex/LiteralConverter.h"
 #include "llvm/ADT/APFloat.h"
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/SmallString.h"
 #include "llvm/ADT/StringRef.h"
+#include "llvm/Support/CharSet.h"
 #include "llvm/Support/DataTypes.h"
 
 namespace clang {
@@ -233,6 +235,7 @@ class StringLiteralParser {
   const LangOptions &Features;
   const TargetInfo &Target;
   DiagnosticsEngine *Diags;
+  LiteralConverter *LiteralConv;
 
   unsigned MaxTokenLength;
   unsigned SizeBound;
@@ -248,16 +251,17 @@ class StringLiteralParser {
 public:
   StringLiteralParser(ArrayRef<Token> StringToks, Preprocessor &PP,
                       StringLiteralEvalMethod StringMethod =
-                          StringLiteralEvalMethod::Evaluated);
+                          StringLiteralEvalMethod::Evaluated,
+                      ConversionAction Action = ToExecCharset);
   StringLiteralParser(ArrayRef<Token> StringToks, const SourceManager &sm,
                       const LangOptions &features, const TargetInfo &target,
                       DiagnosticsEngine *diags = nullptr)
       : SM(sm), Features(features), Target(target), Diags(diags),
-        MaxTokenLength(0), SizeBound(0), CharByteWidth(0), Kind(tok::unknown),
+        LiteralConv(nullptr), MaxTokenLength(0), SizeBound(0), CharByteWidth(0), Kind(tok::unknown),
         ResultPtr(ResultBuf.data()),
         EvalMethod(StringLiteralEvalMethod::Evaluated), hadError(false),
         Pascal(false) {
-    init(StringToks);
+    init(StringToks, NoConversion);
   }
 
   bool hadError;
@@ -305,7 +309,7 @@ class StringLiteralParser {
   static bool isValidUDSuffix(const LangOptions &LangOpts, StringRef Suffix);
 
 private:
-  void init(ArrayRef<Token> StringToks);
+  void init(ArrayRef<Token> StringToks, ConversionAction Action);
   bool CopyStringFragment(const Token &Tok, const char *TokBegin,
                           StringRef Fragment);
   void DiagnoseLexingError(SourceLocation Loc);
diff --git a/clang/include/clang/Lex/Preprocessor.h b/clang/include/clang/Lex/Preprocessor.h
index f2dfd3a349b8b..350cd2d436eb3 100644
--- a/clang/include/clang/Lex/Preprocessor.h
+++ b/clang/include/clang/Lex/Preprocessor.h
@@ -25,6 +25,7 @@
 #include "clang/Basic/TokenKinds.h"
 #include "clang/Lex/HeaderSearch.h"
 #include "clang/Lex/Lexer.h"
+#include "clang/Lex/LiteralConverter.h"
 #include "clang/Lex/MacroInfo.h"
 #include "clang/Lex/ModuleLoader.h"
 #include "clang/Lex/ModuleMap.h"
@@ -156,6 +157,7 @@ class Preprocessor {
   std::unique_ptr<ScratchBuffer> ScratchBuf;
   HeaderSearch      &HeaderInfo;
   ModuleLoader      &TheModuleLoader;
+  LiteralConverter LiteralConv;
 
   /// External source of macros.
   ExternalPreprocessorSource *ExternalSource;
@@ -1218,6 +1220,7 @@ class Preprocessor {
   SelectorTable &getSelectorTable() { return Selectors; }
   Builtin::Context &getBuiltinInfo() { return *BuiltinInfo; }
   llvm::BumpPtrAllocator &getPreprocessorAllocator() { return BP; }
+  LiteralConverter &getLiteralConverter() { return LiteralConv; }
 
   void setExternalSource(ExternalPreprocessorSource *Source) {
     ExternalSource = Source;
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index f87549baff5e1..7b364634f8ff7 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -50,6 +50,7 @@
 #include "llvm/Frontend/Debug/Options.h"
 #include "llvm/Object/ObjectFile.h"
 #include "llvm/Option/ArgList.h"
+#include "llvm/Support/CharSet.h"
 #include "llvm/Support/CodeGen.h"
 #include "llvm/Support/Compiler.h"
 #include "llvm/Support/Compression.h"
@@ -7597,12 +7598,20 @@ void Clang::ConstructJob(Compilation &C, const JobAction &JA,
                                           << value;
   }
 
-  // -fexec_charset=UTF-8 is default. Reject others
+  // Set the default fexec-charset as the system charset.
+  CmdArgs.push_back("-fexec-charset");
+  CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset()));
   if (Arg *execCharset = Args.getLastArg(options::OPT_fexec_charset_EQ)) {
     StringRef value = execCharset->getValue();
-    if (!value.equals_insensitive("utf-8"))
-      D.Diag(diag::err_drv_invalid_value) << execCharset->getAsString(Args)
-                                          << value;
+    llvm::ErrorOr<llvm::CharSetConverter> ErrorOrConverter =
+        llvm::CharSetConverter::create("UTF-8", value.data());
+    if (ErrorOrConverter) {
+      CmdArgs.push_back("-fexec-charset");
+      CmdArgs.push_back(Args.MakeArgString(value));
+    } else {
+      D.Diag(diag::err_drv_invalid_value)
+          << execCharset->getAsString(Args) << value;
+    }
   }
 
   RenderDiagnosticsOptions(D, Args, CmdArgs);
diff --git a/clang/lib/Frontend/CompilerInstance.cpp b/clang/lib/Frontend/CompilerInstance.cpp
index b59496babb62c..879b81bbadabe 100644
--- a/clang/lib/Frontend/CompilerInstance.cpp
+++ b/clang/lib/Frontend/CompilerInstance.cpp
@@ -32,6 +32,7 @@
 #include "clang/Frontend/Utils.h"
 #include "clang/Frontend/VerifyDiagnosticConsumer.h"
 #include "clang/Lex/HeaderSearch.h"
+#include "clang/Lex/LiteralConverter.h"
 #include "clang/Lex/Preprocessor.h"
 #include "clang/Lex/PreprocessorOptions.h"
 #include "clang/Sema/CodeCompleteConsumer.h"
@@ -537,6 +538,9 @@ void CompilerInstance::createPreprocessor(TranslationUnitKind TUKind) {
 
   if (GetDependencyDirectives)
     PP->setDependencyDirectivesGetter(*GetDependencyDirectives);
+
+  PP->getLiteralConverter().setConvertersFromOptions(getLangOpts(), getTarget(),
+                                                     getDiagnostics());
 }
 
 std::string CompilerInstance::getSpecificModuleCachePath(StringRef ModuleHash) {
diff --git a/clang/lib/Frontend/InitPreprocessor.cpp b/clang/lib/Frontend/InitPreprocessor.cpp
index 96d6fb64a6319..47b71d0fb98c5 100644
--- a/clang/lib/Frontend/InitPreprocessor.cpp
+++ b/clang/lib/Frontend/InitPreprocessor.cpp
@@ -1058,10 +1058,14 @@ static void InitializePredefinedMacros(const TargetInfo &TI,
     }
   }
 
-  // Macros to help identify the narrow and wide character sets
-  // FIXME: clang currently ignores -fexec-charset=. If this changes,
-  // then this may need to be updated.
-  Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");
+  // Macros to help identify the narrow and wide character sets. This is set
+  // to fexec-charset. If fexec-charset is not specified, the default is the
+  // system charset.
+  if (!LangOpts.ExecCharset.empty())
+    Builder.defineMacro("__clang_literal_encoding__", LangOpts.ExecCharset);
+  else
+    Builder.defineMacro("__clang_literal_encoding__",
+                        TI.getTriple().getSystemCharset());
   if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
     // FIXME: 32-bit wchar_t signals UTF-32. This may change
     // if -fwide-exec-charset= is ever supported.
diff --git a/clang/lib/Lex/CMakeLists.txt b/clang/lib/Lex/CMakeLists.txt
index f61737cd68021..9e38a1b8fbb44 100644
--- a/clang/lib/Lex/CMakeLists.txt
+++ b/clang/lib/Lex/CMakeLists.txt
@@ -12,6 +12,7 @@ add_clang_library(clangLex
   InitHeaderSearch.cpp
   Lexer.cpp
   LexHLSLRootSignature.cpp
+  LiteralConverter.cpp
   LiteralSupport.cpp
   MacroArgs.cpp
   MacroInfo.cpp
diff --git a/clang/lib/Lex/LiteralConverter.cpp b/clang/lib/Lex/LiteralConverter.cpp
new file mode 100644
index 0000000000000..a89781e182e83
--- /dev/null
+++ b/clang/lib/Lex/LiteralConverter.cpp
@@ -0,0 +1,68 @@
+//===--- LiteralConverter.cpp - Translator for String Literals -----------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "clang/Lex/LiteralConverter.h"
+#include "clang/Basic/DiagnosticDriver.h"
+
+using namespace llvm;
+
+llvm::CharSetConverter *LiteralConverter::getConverter(const char *Codepage) {
+  auto Iter = CharsetConverters.find(Codepage);
+  if (Iter != CharsetConverters.end())
+    return &Iter->second;
+  return nullptr;
+}
+
+llvm::CharSetConverter *
+LiteralConverter::getConverter(ConversionAction Action) {
+  StringRef CodePage;
+  if (Action == ToSystemCharset)
+    CodePage = SystemCharset;
+  else if (Action == ToExecCharset)
+    CodePage = ExecCharset;
+  else
+    CodePage = InternalCharset;
+  return getConverter(CodePage.data());
+}
+
+llvm::CharSetConverter *
+LiteralConverter::createAndInsertCharConverter(const char *To) {
+  const char *From = InternalCharset.data();
+  llvm::CharSetConverter *Converter = getConverter(To);
+  if (Converter)
+    return Converter;
+
+  ErrorOr<CharSetConverter> ErrorOrConverter =
+      llvm::CharSetConverter::create(From, To);
+  if (!ErrorOrConverter)
+    return nullptr;
+  CharsetConverters.insert_or_assign(StringRef(To),
+                                     std::move(*ErrorOrConverter));
+  return getConverter(To);
+}
+
+void LiteralConverter::setConvertersFromOptions(
+    const clang::LangOptions &Opts, const clang::TargetInfo &TInfo,
+    clang::DiagnosticsEngine &Diags) {
+  using namespace llvm;
+  SystemCharset = TInfo.getTriple().getSystemCharset();
+  InternalCharset = "UTF-8";
+  ExecCharset = Opts.ExecCharset.empty() ? InternalCharset : Opts.ExecCharset;
+  // Create converter between internal and system charset
+  if (!InternalCharset.equals(SystemCharset))
+    createAndInsertCharConverter(SystemCharset.data());
+
+  // Create converter between internal and exec charset specified
+  // in fexec-charset option.
+  if (InternalCharset.equals(ExecCharset))
+    return;
+  if (!createAndInsertCharConverter(ExecCharset.data())) {
+    Diags.Report(clang::diag::err_drv_invalid_value)
+        << "-fexec-charset" << ExecCharset;
+  }
+}
diff --git a/clang/lib/Lex/LiteralSupport.cpp b/clang/lib/Lex/LiteralSupport.cpp
index 75ad977d64b24..927c4e1823a19 100644
--- a/clang/lib/Lex/LiteralSupport.cpp
+++ b/clang/lib/Lex/LiteralSupport.cpp
@@ -134,7 +134,8 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
                                   FullSourceLoc Loc, unsigned CharWidth,
                                   DiagnosticsEngine *Diags,
                                   const LangOptions &Features,
-                                  StringLiteralEvalMethod EvalMethod) {
+                                  StringLiteralEvalMethod EvalMethod,
+                                  llvm::CharSetConverter *Converter) {
   const char *EscapeBegin = ThisTokBuf;
   bool Delimited = false;
   bool EndDelimiterFound = false;
@@ -146,6 +147,8 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
   // that would have been \", which would not have been the end of string.
   unsigned ResultChar = *ThisTokBuf++;
   char Escape = ResultChar;
+  bool Translate = true;
+  bool Invalid = false;
   switch (ResultChar) {
   // These map to themselves.
   case '\\': case '\'': case '"': case '?': break;
@@ -186,6 +189,7 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
     ResultChar = 11;
     break;
   case 'x': { // Hex escape.
+    Translate = false;
     ResultChar = 0;
     if (ThisTokBuf != ThisTokEnd && *ThisTokBuf == '{') {
       Delimited = true;
@@ -249,6 +253,7 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
   case '4': case '5': case '6': case '7': {
     // Octal escapes.
     --ThisTokBuf;
+    Translate = false;
     ResultChar = 0;
 
     // Octal escapes are a series of octal digits with maximum length 3.
@@ -334,6 +339,7 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
         << std::string(1, ResultChar);
     break;
   default:
+    Invalid = true;
     if (!Diags)
       break;
 
@@ -367,6 +373,15 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
     HadError = true;
   }
 
+  if (Translate && Converter) {
+    // Invalid escapes are written as '?' and then translated.
+    char ByteChar = Invalid ? '?' : ResultChar;
+    SmallString<8> ResultCharConv;
+    Converter->convert(StringRef(&ByteChar, 1), ResultCharConv);
+    assert(ResultCharConv.size() == 1 &&
+           "Char size increased after translation");
+    ResultChar = ResultCharConv[0];
+  }
   return ResultChar;
 }
 
@@ -1739,6 +1754,7 @@ CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
   HadError = false;
 
   Kind = kind;
+  LiteralConverter *LiteralConv = &PP.getLiteralConverter();
 
   const char *TokBegin = begin;
 
@@ -1805,6 +1821,10 @@ CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
     largest_character_for_kind = 0x7Fu;
   }
 
+  llvm::CharSetConverter *Converter = nullptr;
+  if (!isUTFLiteral(Kind) && LiteralConv)
+    Converter = LiteralConv->getConverter(ToExecCharset);
+
   while (begin != end) {
     // Is this a span of non-escape characters?
     if (begin[0] != '\\') {
@@ -1842,6 +1862,16 @@ CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
             HadError = true;
             PP.Diag(Loc, diag::err_character_too_large);
           }
+          if (!HadError && Converter) {
+            assert(Kind != tok::wide_char_constant &&
+                   "Wide character translation not supported");
+            char ByteChar = *tmp_out_start;
+            SmallString<1> ConvertedChar;
+            Converter->convert(StringRef(&ByteChar, 1), ConvertedChar);
+            assert(ConvertedChar.size() == 1 &&
+                   "Char size increased after translation");
+            *tmp_out_start = ConvertedChar[0];
+          }
         }
       }
 
@@ -1849,16 +1879,35 @@ CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
     }
     // Is this a Universal Character Name escape?
     if (begin[1] == 'u' || begin[1] == 'U' || begin[1] == 'N') {
-      unsigned short UcnLen = 0;
-      if (!ProcessUCNEscape(TokBegin, begin, end, *buffer_begin, UcnLen,
-                            FullSourceLoc(Loc, PP.getSourceManager()),
-                            &PP.getDiagnostics(), PP.getLangOpts(), true)) {
-        HadError = true;
-      } else if (*buffer_begin > largest_character_for_kind) {
-        HadError = true;
-        PP.Diag(Loc, diag::err_character_too_large);
+      if (Converter == nullptr) {
+        unsigned short UcnLen = 0;
+        if (!ProcessUCNEscape(TokBegin, begin, end, *buffer_begin, UcnLen,
+                              FullSourceLoc(Loc, PP.getSourceManager()),
+                              &PP.getDiagnostics(), PP.getLangOpts(), true)) {
+          HadError = true;
+        } else if (*buffer_begin > largest_character_for_kind) {
+          HadError = true;
+          PP.Diag(Loc, diag::err_character_too_large);
+        }
+      } else {
+        char Cp[8];
+        char *ResultPtr = Cp;
+        unsigned CharByteWidth = 1;
+        EncodeUCNEscape(TokBegin, begin, end, ResultPtr, HadError,
+                        FullSourceLoc(Loc, PP.getSourceManager()),
+                        CharByteWidth, &PP.getDiagnostics(), PP.getLangOpts());
+        if (!HadError) {
+          SmallString<8> CpConv;
+          Converter->convert(StringRef(Cp), CpConv);
+          if (CpConv.size() > 1) {
+            HadError = true;
+            PP.Diag(Loc, diag::err_character_too_large);
+          } else {
+            memcpy(Cp, CpConv.data(), CpConv.size());
+            *buffer_begin = *Cp;
+          }
+        }
       }
-
   ...
[truncated]

github-actions · 2025-05-07T15:33:12Z

✅ With the latest revision this PR passed the C/C++ code formatter.

cor3ntin · 2025-05-09T13:37:04Z

Thanks for working on this
I don't have time to do a full review yet, but this is going to need a lot more tests

Constant evaluation tests
Preprocessor string concatenation
Tests that we emit diagnostics for non-encodable strings
Tests for wide strings ( It's not okay to assert on that, maybe a diagnostics would be reasonable)
Ensure that unevaluated strings are not converted (as in static_assert(false, "string"), [[deprecated("string")]]]
In C++, tests for static_assert(false, string_view("string"))
Tests for diagnostics in printf strings

(This points to the need to sometimes convert strings from ebcdic to utf8 for diagnostic purposes)

llvm/include/llvm/TargetParser/Triple.h

clang/docs/LanguageExtensions.rst

clang/include/clang/Basic/LangOptions.h

cor3ntin · 2025-05-12T07:14:08Z

clang/include/clang/Lex/LiteralConverter.h

+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/CharSet.h"
+
+enum ConversionAction { NoConversion, ToSystemCharset, ToExecCharset };


We should have FromOrdinaryLiteralEncoding and ToOrdinaryLiteralEncoding instead of System/Exec

The names here were chosen to represent the different encodings that need to be used based on the context of the string literal which I mentioned in my RFC and have also copied the table to the description of this PR. I'm not sure the proposed names make that context clear

clang/include/clang/Lex/LiteralConverter.h

cor3ntin · 2025-05-12T07:27:19Z

clang/lib/Lex/LiteralSupport.cpp

  // that would have been \", which would not have been the end of string.
  unsigned ResultChar = *ThisTokBuf++;
  char Escape = ResultChar;
+  bool Translate = true;


Suggested change

bool Translate = true;

bool Transcode = true;

I would prefer this defaults to false

cor3ntin · 2025-05-12T07:30:58Z

clang/lib/Lex/LiteralSupport.cpp

+    Converter->convert(StringRef(&ByteChar, 1), ResultCharConv);
+    assert(ResultCharConv.size() == 1 &&
+           "Char size increased after translation");


Can we have a GetReplacementChar function... somewhere, and cache the result?

clang/lib/Lex/LiteralSupport.cpp

cor3ntin · 2025-05-12T07:34:16Z

clang/lib/Lex/LiteralSupport.cpp

+          if (!HadError && Converter) {
+            assert(Kind != tok::wide_char_constant &&
+                   "Wide character translation not supported");
+            char ByteChar = *tmp_out_start;
+            SmallString<1> ConvertedChar;
+            Converter->convert(StringRef(&ByteChar, 1), ConvertedChar);
+            assert(ConvertedChar.size() == 1 &&
+                   "Char size increased after translation");
+            *tmp_out_start = ConvertedChar[0];


This should be handled with diagnostics. The conversion can also fail, and that should be handled.

cor3ntin · 2025-05-12T07:40:22Z

clang/lib/Lex/LiteralSupport.cpp

+                   "Wide character translation not supported");
+            char ByteChar = *tmp_out_start;
+            SmallString<1> ConvertedChar;
+            Converter->convert(StringRef(&ByteChar, 1), ConvertedChar);


Here the order of operation should be:
-> convert from UTF-8 to UTF-32, check it's a valid character
-> convert the same buffer from UTF-8 to the literal encoding
-> Check that that succeed and has a size of one

(ie some codepoints might hgave a size of 2 when encoded as utf-8 but 1 when encoded as latin1)

efriedma-quic · 2025-05-19T23:51:38Z

clang/lib/Lex/LiteralSupport.cpp

  case '4': case '5': case '6': case '7': {
    // Octal escapes.
    --ThisTokBuf;
+    Translate = false;


Also handle \o escapes?

clang/lib/Lex/LiteralSupport.cpp

…38893) This patch adds a wrapper class called EncodingConverter for ConverterEBCDIC. This class is then extended to support the ICU library or iconv library. The ICU library currently takes priority over the iconv library. Relevant RFCs: https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795 https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512 Stacked PR to enable fexec-charset that depends on this: #138895 See old PR for review and commit history: #74516

…upport. (#138893) This patch adds a wrapper class called EncodingConverter for ConverterEBCDIC. This class is then extended to support the ICU library or iconv library. The ICU library currently takes priority over the iconv library. Relevant RFCs: https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795 https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512 Stacked PR to enable fexec-charset that depends on this: llvm/llvm-project#138895 See old PR for review and commit history: llvm/llvm-project#74516

abhina-sree · 2025-06-02T19:49:47Z

I'm currently working on addressing all the comments, I have just rebased this PR to use the TextEncodingConverter, thanks for your patience!

clang/docs/LanguageExtensions.rst

zygoloid · 2025-06-02T20:10:22Z

clang/include/clang/Driver/Options.td

  MarshallingInfoString<TargetOpts<"TuneCPU">>;
+def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
+  HelpText<"Set the execution <charset> for string and character literals. "
+           "Supported character encodings include ISO8859-1, UTF-8, IBM-1047 "


I think this is too long for HelpText, which is displayed by clang --help. Also, aren't ISO8859-1, UTF-8, and IBM-1047 all covered by "those supported by the host icu or iconv library"? Do we really need to call them out separately?

Maybe something like "Use for string and character literals" could work here as help text? You can put longer information with examples of character set names and references to icu / iconv into a separate DocBrief argument.

This was because ConvertEBCDIC.cpp has a table for converting between EBCDIC and UTF-8 so those conversions will always be supported even without icu/iconv. I can definitely update this to something more succinct

zygoloid · 2025-06-02T20:13:30Z

clang/test/Driver/cl-options.c

-// /execution-charset: should warn on everything except UTF-8.
-// RUN: not %clang_cl /execution-charset:utf-16 -### -- %s 2>&1 | FileCheck -check-prefix=execution-charset-utf-16 %s
-// execution-charset-utf-16: invalid value 'utf-16' in '/execution-charset:utf-16'


What happens for the old example of /execution-charset:utf-16 / -fexec-charset=utf-16? Presumably that should also be an error, because we're setting the narrow execution character set here and it doesn't make sense to set that to a non-narrow encoding such as UTF-16 or UCS-4.

llvm/include/llvm/TargetParser/Triple.h

zygoloid · 2025-06-02T20:16:15Z

clang/test/CodeGen/systemz-charset.c

@@ -0,0 +1,35 @@
+// RUN: %clang_cc1 %s -emit-llvm -triple s390x-none-zos -fexec-charset IBM-1047 -o - | FileCheck %s


Should there be tests here for wide (L-prefixed) character and string literals too?

hubert-reinterpretcast · 2025-09-06T05:46:10Z

clang/test/CodeGen/systemz-charset.c

+const char *OctalCharacters = "\141\142\143";
+//CHECK: c"abc\00"
+
+const char singleChar = 'a';


Shouldn't the single-char check use a value that uses more than one byte in the internal encoding (but is a single byte in the literal encoding)?

hubert-reinterpretcast · 2025-09-07T03:33:39Z

but this is going to need a lot more tests

For preprocessor string concatenation and escape sequences within strings, the shift-sequence behaviour for stateful encodings would be of interest.

Additionally of interest is the treatment of bytes matching the encoding of %, etc. in a non-initial shift state within a printf, etc. format string.

hubert-reinterpretcast · 2025-09-17T02:18:22Z

clang/lib/Lex/LiteralConverter.cpp

+
+  // Create converter between internal and exec encoding specified
+  // in fexec-charset option.
+  if (InternalEncoding == ExecEncoding)


Please add a test case with -fexec-charset=utf8. This check does not catch that condition.

hubert-reinterpretcast · 2025-09-17T02:24:52Z

clang/lib/Lex/LiteralConverter.cpp

+  ToExecEncodingConverter =
+      new TextEncodingConverter(std::move(*ErrorOrConverter));


Last I checked, the Diags.Report call does return to this function.
Also, this PR is missing tests for the -cc1 diagnostics here.

Suggested change

ToExecEncodingConverter =

new TextEncodingConverter(std::move(*ErrorOrConverter));

else

ToExecEncodingConverter =

new TextEncodingConverter(std::move(*ErrorOrConverter));

clang/include/clang/Basic/TokenKinds.h

clang/lib/Lex/LiteralSupport.cpp

hubert-reinterpretcast · 2025-09-27T05:15:01Z

clang/test/CodeGen/systemz-charset.c

@@ -0,0 +1,35 @@
+// RUN: %clang_cc1 %s -emit-llvm -triple s390x-none-zos -fexec-charset IBM-1047 -o - | FileCheck %s
+// RUN: %clang %s -emit-llvm -S -target s390x-ibm-zos -o - | FileCheck %s


There should be tests for other converters.
__STDC_MB_MIGHT_NEQ_WC__ should be defined in some cases.

hubert-reinterpretcast · 2025-09-27T15:14:03Z

clang/test/CodeGen/systemz-charset.c

+const char *OctalCharacters = "\141\142\143";
+//CHECK: c"abc\00"
+
+const char singleChar = 'a';


There should be tests for multicharacter literals that become valid or invalid because of the chosen -fexec-charset.

hubert-reinterpretcast · 2025-09-27T15:57:28Z

clang/test/CodeGen/systemz-charset.cpp

@@ -0,0 +1,46 @@
+// RUN: %clang %s -std=c++17 -emit-llvm -S -target s390x-ibm-zos -o - | FileCheck %s
+


There should be tests for character literals in #if.

abhina-sree · 2025-10-23T14:19:40Z

Hi,
I'd like to discuss and clarify how we should handle the interaction of wide characters and fexec-charset. The standard states

A character literal that begins with the letter L, such as L'x', is a wide-character literal. A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set. (2.13.2 Character literals)

which seems to imply 'L'a' == 'a' . If the fexec-charset and fwide-exec-charset are not compatible, this condition will fail. Should we allow fexec-charset to translate single wide characters to meet this condition if the fwide-exec-charset is not explicity specified?

hubert-reinterpretcast · 2025-10-23T17:38:06Z

which seems to imply 'L'a' == 'a'

It looks like you are working from a pre-C++11 copy of the standard. In any case, what you quoted said "execution wide-character set" and you are implying value equivalence, not supported by the quoted text, between the execution character set and the execution wide-character set.

For C, please see the definition of the macro __STDC_MB_MIGHT_NEQ_WC__.

…charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms. (cherry picked from commit 0295d0d) (cherry picked from commit e379f6cb9d063cb78c6b48b0e0a8d9f241958f89)

abhina-sree self-assigned this May 7, 2025

llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang:frontend Language frontend issues, e.g. anything involving "Sema" labels May 7, 2025

abhina-sree mentioned this pull request May 7, 2025

Create a EncodingConverter class with both iconv and icu support. #138893

Merged

abhina-sree force-pushed the users/abhina/fexec_charset branch from 39d8a0e to 0295d0d Compare May 7, 2025 15:34

abhina-sree mentioned this pull request May 7, 2025

Create a CharSetConverter class with both iconv and icu support #74516

Closed

abhina-sree requested review from AaronBallman, cor3ntin, daltenty, efriedma-quic, hubert-reinterpretcast, perry-ca, redstar and tahonermann May 7, 2025 15:38

cor3ntin reviewed May 12, 2025

View reviewed changes

efriedma-quic reviewed May 19, 2025

View reviewed changes

Base automatically changed from users/abhina/charset_converter to main May 20, 2025 18:02

JVApen mentioned this pull request Jun 2, 2025

Editor hover does not use reference file encoding clangd/vscode-clangd#802

Open

abhina-sree force-pushed the users/abhina/fexec_charset branch from 0295d0d to c7822ec Compare June 2, 2025 19:48

abhina-sree force-pushed the users/abhina/fexec_charset branch from c7822ec to bd3dd4d Compare June 2, 2025 19:52

zygoloid reviewed Jun 2, 2025

View reviewed changes

hubert-reinterpretcast reviewed Sep 6, 2025

View reviewed changes

hubert-reinterpretcast reviewed Sep 17, 2025

View reviewed changes

hubert-reinterpretcast reviewed Sep 26, 2025

View reviewed changes

clang/include/clang/Basic/TokenKinds.h Outdated Show resolved Hide resolved

hubert-reinterpretcast reviewed Sep 27, 2025

View reviewed changes

abhina-sree force-pushed the users/abhina/fexec_charset branch 5 times, most recently from b582b72 to 7149608 Compare October 21, 2025 13:17

abhina-sree added 2 commits October 28, 2025 11:53

replace StringMap with pointer

304a0ed

abhina-sree force-pushed the users/abhina/fexec_charset branch 2 times, most recently from a9a412b to ef0870f Compare October 30, 2025 14:24

llvmbot added the llvm:support label Oct 30, 2025

abhina-sree force-pushed the users/abhina/fexec_charset branch 3 times, most recently from 72fb09f to 58205a5 Compare October 30, 2025 19:31

abhina-sree force-pushed the users/abhina/fexec_charset branch 3 times, most recently from 6b7dcdd to f44396e Compare November 10, 2025 14:22

abhina-sree added 2 commits November 10, 2025 09:26

do not translate line/digit directives, do not translate filename

6988ba2

address comments

60a2e2b

abhina-sree force-pushed the users/abhina/fexec_charset branch from f44396e to 60a2e2b Compare November 10, 2025 14:26

abhina-sree added 2 commits November 10, 2025 15:12

fix CI

a991261

add more error handling, address comments

4634fb9

		@@ -0,0 +1,35 @@
		// RUN: %clang_cc1 %s -emit-llvm -triple s390x-none-zos -fexec-charset IBM-1047 -o - \| FileCheck %s

		ToExecEncodingConverter =
		new TextEncodingConverter(std::move(*ErrorOrConverter));

		@@ -0,0 +1,35 @@
		// RUN: %clang_cc1 %s -emit-llvm -triple s390x-none-zos -fexec-charset IBM-1047 -o - \| FileCheck %s
		// RUN: %clang %s -emit-llvm -S -target s390x-ibm-zos -o - \| FileCheck %s

		@@ -0,0 +1,46 @@
		// RUN: %clang %s -std=c++17 -emit-llvm -S -target s390x-ibm-zos -o - \| FileCheck %s

Enable fexec-charset option #138895

Are you sure you want to change the base?

Enable fexec-charset option #138895

Conversation

abhina-sree commented May 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented May 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented May 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cor3ntin commented May 9, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

abhina-sree commented Jun 2, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hubert-reinterpretcast commented Sep 7, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

abhina-sree commented Oct 23, 2025

Uh oh!

hubert-reinterpretcast commented Oct 23, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

abhina-sree commented May 7, 2025 •

edited

Loading

llvmbot commented May 7, 2025 •

edited

Loading

github-actions bot commented May 7, 2025 •

edited

Loading