-
Notifications
You must be signed in to change notification settings - Fork 15.2k
Enable fexec-charset option #138895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Enable fexec-charset option #138895
Conversation
|
@llvm/pr-subscribers-llvm-support @llvm/pr-subscribers-clang Author: Abhina Sree (abhina-sree) ChangesThis patch enables the fexec-charset option to control the execution charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms. This patch depends on adding the CharSetConverter class #138893 Patch is 34.26 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138895.diff 20 Files Affected:
diff --git a/clang/docs/LanguageExtensions.rst b/clang/docs/LanguageExtensions.rst
index ebcad44197ce4..44e20623d4d0b 100644
--- a/clang/docs/LanguageExtensions.rst
+++ b/clang/docs/LanguageExtensions.rst
@@ -416,8 +416,7 @@ Builtin Macros
``__clang_literal_encoding__``
Defined to a narrow string literal that represents the current encoding of
narrow string literals, e.g., ``"hello"``. This macro typically expands to
- "UTF-8" (but may change in the future if the
- ``-fexec-charset="Encoding-Name"`` option is implemented.)
+ the charset specified by -fexec-charset if specified, or the system charset.
``__clang_wide_literal_encoding__``
Defined to a narrow string literal that represents the current encoding of
diff --git a/clang/include/clang/Basic/LangOptions.h b/clang/include/clang/Basic/LangOptions.h
index 491e8bee9fd5c..559a4be70b74c 100644
--- a/clang/include/clang/Basic/LangOptions.h
+++ b/clang/include/clang/Basic/LangOptions.h
@@ -633,6 +633,9 @@ class LangOptions : public LangOptionsBase {
bool AtomicFineGrainedMemory = false;
bool AtomicIgnoreDenormalMode = false;
+ /// Name of the exec charset to convert the internal charset to.
+ std::string ExecCharset;
+
LangOptions();
/// Set language defaults for the given input language and
diff --git a/clang/include/clang/Basic/TokenKinds.h b/clang/include/clang/Basic/TokenKinds.h
index 1b133dde89587..34f6133973e71 100644
--- a/clang/include/clang/Basic/TokenKinds.h
+++ b/clang/include/clang/Basic/TokenKinds.h
@@ -101,6 +101,13 @@ inline bool isLiteral(TokenKind K) {
isStringLiteral(K) || K == tok::header_name || K == tok::binary_data;
}
+/// Return true if this is a utf literal kind.
+inline bool isUTFLiteral(TokenKind K) {
+ return K == tok::utf8_char_constant || K == tok::utf8_string_literal ||
+ K == tok::utf16_char_constant || K == tok::utf16_string_literal ||
+ K == tok::utf32_char_constant || K == tok::utf32_string_literal;
+}
+
/// Return true if this is any of tok::annot_* kinds.
bool isAnnotation(TokenKind K);
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 30ea75bb108d5..9d352eb1270fe 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -7197,6 +7197,11 @@ let Visibility = [CC1Option, CC1AsOption, FC1Option] in {
def tune_cpu : Separate<["-"], "tune-cpu">,
HelpText<"Tune for a specific cpu type">,
MarshallingInfoString<TargetOpts<"TuneCPU">>;
+def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
+ HelpText<"Set the execution <charset> for string and character literals. "
+ "Supported character encodings include ISO8859-1, UTF-8, IBM-1047 "
+ "and those supported by the host icu or iconv library.">,
+ MarshallingInfoString<LangOpts<"ExecCharset">>;
def target_cpu : Separate<["-"], "target-cpu">,
HelpText<"Target a specific cpu type">,
MarshallingInfoString<TargetOpts<"CPU">>;
diff --git a/clang/include/clang/Lex/LiteralConverter.h b/clang/include/clang/Lex/LiteralConverter.h
new file mode 100644
index 0000000000000..203111255b791
--- /dev/null
+++ b/clang/include/clang/Lex/LiteralConverter.h
@@ -0,0 +1,36 @@
+//===--- clang/Lex/LiteralConverter.h - Translator for Literals -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CLANG_LEX_LITERALCONVERTER_H
+#define LLVM_CLANG_LEX_LITERALCONVERTER_H
+
+#include "clang/Basic/Diagnostic.h"
+#include "clang/Basic/LangOptions.h"
+#include "clang/Basic/TargetInfo.h"
+#include "llvm/ADT/StringMap.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/CharSet.h"
+
+enum ConversionAction { NoConversion, ToSystemCharset, ToExecCharset };
+
+class LiteralConverter {
+ llvm::StringRef InternalCharset;
+ llvm::StringRef SystemCharset;
+ llvm::StringRef ExecCharset;
+ llvm::StringMap<llvm::CharSetConverter> CharsetConverters;
+
+public:
+ llvm::CharSetConverter *getConverter(const char *Codepage);
+ llvm::CharSetConverter *getConverter(ConversionAction Action);
+ llvm::CharSetConverter *createAndInsertCharConverter(const char *To);
+ void setConvertersFromOptions(const clang::LangOptions &Opts,
+ const clang::TargetInfo &TInfo,
+ clang::DiagnosticsEngine &Diags);
+};
+
+#endif
diff --git a/clang/include/clang/Lex/LiteralSupport.h b/clang/include/clang/Lex/LiteralSupport.h
index ea5f63bc20399..114d50c226a29 100644
--- a/clang/include/clang/Lex/LiteralSupport.h
+++ b/clang/include/clang/Lex/LiteralSupport.h
@@ -17,10 +17,12 @@
#include "clang/Basic/CharInfo.h"
#include "clang/Basic/LLVM.h"
#include "clang/Basic/TokenKinds.h"
+#include "clang/Lex/LiteralConverter.h"
#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/CharSet.h"
#include "llvm/Support/DataTypes.h"
namespace clang {
@@ -233,6 +235,7 @@ class StringLiteralParser {
const LangOptions &Features;
const TargetInfo &Target;
DiagnosticsEngine *Diags;
+ LiteralConverter *LiteralConv;
unsigned MaxTokenLength;
unsigned SizeBound;
@@ -248,16 +251,17 @@ class StringLiteralParser {
public:
StringLiteralParser(ArrayRef<Token> StringToks, Preprocessor &PP,
StringLiteralEvalMethod StringMethod =
- StringLiteralEvalMethod::Evaluated);
+ StringLiteralEvalMethod::Evaluated,
+ ConversionAction Action = ToExecCharset);
StringLiteralParser(ArrayRef<Token> StringToks, const SourceManager &sm,
const LangOptions &features, const TargetInfo &target,
DiagnosticsEngine *diags = nullptr)
: SM(sm), Features(features), Target(target), Diags(diags),
- MaxTokenLength(0), SizeBound(0), CharByteWidth(0), Kind(tok::unknown),
+ LiteralConv(nullptr), MaxTokenLength(0), SizeBound(0), CharByteWidth(0), Kind(tok::unknown),
ResultPtr(ResultBuf.data()),
EvalMethod(StringLiteralEvalMethod::Evaluated), hadError(false),
Pascal(false) {
- init(StringToks);
+ init(StringToks, NoConversion);
}
bool hadError;
@@ -305,7 +309,7 @@ class StringLiteralParser {
static bool isValidUDSuffix(const LangOptions &LangOpts, StringRef Suffix);
private:
- void init(ArrayRef<Token> StringToks);
+ void init(ArrayRef<Token> StringToks, ConversionAction Action);
bool CopyStringFragment(const Token &Tok, const char *TokBegin,
StringRef Fragment);
void DiagnoseLexingError(SourceLocation Loc);
diff --git a/clang/include/clang/Lex/Preprocessor.h b/clang/include/clang/Lex/Preprocessor.h
index f2dfd3a349b8b..350cd2d436eb3 100644
--- a/clang/include/clang/Lex/Preprocessor.h
+++ b/clang/include/clang/Lex/Preprocessor.h
@@ -25,6 +25,7 @@
#include "clang/Basic/TokenKinds.h"
#include "clang/Lex/HeaderSearch.h"
#include "clang/Lex/Lexer.h"
+#include "clang/Lex/LiteralConverter.h"
#include "clang/Lex/MacroInfo.h"
#include "clang/Lex/ModuleLoader.h"
#include "clang/Lex/ModuleMap.h"
@@ -156,6 +157,7 @@ class Preprocessor {
std::unique_ptr<ScratchBuffer> ScratchBuf;
HeaderSearch &HeaderInfo;
ModuleLoader &TheModuleLoader;
+ LiteralConverter LiteralConv;
/// External source of macros.
ExternalPreprocessorSource *ExternalSource;
@@ -1218,6 +1220,7 @@ class Preprocessor {
SelectorTable &getSelectorTable() { return Selectors; }
Builtin::Context &getBuiltinInfo() { return *BuiltinInfo; }
llvm::BumpPtrAllocator &getPreprocessorAllocator() { return BP; }
+ LiteralConverter &getLiteralConverter() { return LiteralConv; }
void setExternalSource(ExternalPreprocessorSource *Source) {
ExternalSource = Source;
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index f87549baff5e1..7b364634f8ff7 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -50,6 +50,7 @@
#include "llvm/Frontend/Debug/Options.h"
#include "llvm/Object/ObjectFile.h"
#include "llvm/Option/ArgList.h"
+#include "llvm/Support/CharSet.h"
#include "llvm/Support/CodeGen.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/Compression.h"
@@ -7597,12 +7598,20 @@ void Clang::ConstructJob(Compilation &C, const JobAction &JA,
<< value;
}
- // -fexec_charset=UTF-8 is default. Reject others
+ // Set the default fexec-charset as the system charset.
+ CmdArgs.push_back("-fexec-charset");
+ CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset()));
if (Arg *execCharset = Args.getLastArg(options::OPT_fexec_charset_EQ)) {
StringRef value = execCharset->getValue();
- if (!value.equals_insensitive("utf-8"))
- D.Diag(diag::err_drv_invalid_value) << execCharset->getAsString(Args)
- << value;
+ llvm::ErrorOr<llvm::CharSetConverter> ErrorOrConverter =
+ llvm::CharSetConverter::create("UTF-8", value.data());
+ if (ErrorOrConverter) {
+ CmdArgs.push_back("-fexec-charset");
+ CmdArgs.push_back(Args.MakeArgString(value));
+ } else {
+ D.Diag(diag::err_drv_invalid_value)
+ << execCharset->getAsString(Args) << value;
+ }
}
RenderDiagnosticsOptions(D, Args, CmdArgs);
diff --git a/clang/lib/Frontend/CompilerInstance.cpp b/clang/lib/Frontend/CompilerInstance.cpp
index b59496babb62c..879b81bbadabe 100644
--- a/clang/lib/Frontend/CompilerInstance.cpp
+++ b/clang/lib/Frontend/CompilerInstance.cpp
@@ -32,6 +32,7 @@
#include "clang/Frontend/Utils.h"
#include "clang/Frontend/VerifyDiagnosticConsumer.h"
#include "clang/Lex/HeaderSearch.h"
+#include "clang/Lex/LiteralConverter.h"
#include "clang/Lex/Preprocessor.h"
#include "clang/Lex/PreprocessorOptions.h"
#include "clang/Sema/CodeCompleteConsumer.h"
@@ -537,6 +538,9 @@ void CompilerInstance::createPreprocessor(TranslationUnitKind TUKind) {
if (GetDependencyDirectives)
PP->setDependencyDirectivesGetter(*GetDependencyDirectives);
+
+ PP->getLiteralConverter().setConvertersFromOptions(getLangOpts(), getTarget(),
+ getDiagnostics());
}
std::string CompilerInstance::getSpecificModuleCachePath(StringRef ModuleHash) {
diff --git a/clang/lib/Frontend/InitPreprocessor.cpp b/clang/lib/Frontend/InitPreprocessor.cpp
index 96d6fb64a6319..47b71d0fb98c5 100644
--- a/clang/lib/Frontend/InitPreprocessor.cpp
+++ b/clang/lib/Frontend/InitPreprocessor.cpp
@@ -1058,10 +1058,14 @@ static void InitializePredefinedMacros(const TargetInfo &TI,
}
}
- // Macros to help identify the narrow and wide character sets
- // FIXME: clang currently ignores -fexec-charset=. If this changes,
- // then this may need to be updated.
- Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");
+ // Macros to help identify the narrow and wide character sets. This is set
+ // to fexec-charset. If fexec-charset is not specified, the default is the
+ // system charset.
+ if (!LangOpts.ExecCharset.empty())
+ Builder.defineMacro("__clang_literal_encoding__", LangOpts.ExecCharset);
+ else
+ Builder.defineMacro("__clang_literal_encoding__",
+ TI.getTriple().getSystemCharset());
if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
// FIXME: 32-bit wchar_t signals UTF-32. This may change
// if -fwide-exec-charset= is ever supported.
diff --git a/clang/lib/Lex/CMakeLists.txt b/clang/lib/Lex/CMakeLists.txt
index f61737cd68021..9e38a1b8fbb44 100644
--- a/clang/lib/Lex/CMakeLists.txt
+++ b/clang/lib/Lex/CMakeLists.txt
@@ -12,6 +12,7 @@ add_clang_library(clangLex
InitHeaderSearch.cpp
Lexer.cpp
LexHLSLRootSignature.cpp
+ LiteralConverter.cpp
LiteralSupport.cpp
MacroArgs.cpp
MacroInfo.cpp
diff --git a/clang/lib/Lex/LiteralConverter.cpp b/clang/lib/Lex/LiteralConverter.cpp
new file mode 100644
index 0000000000000..a89781e182e83
--- /dev/null
+++ b/clang/lib/Lex/LiteralConverter.cpp
@@ -0,0 +1,68 @@
+//===--- LiteralConverter.cpp - Translator for String Literals -----------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "clang/Lex/LiteralConverter.h"
+#include "clang/Basic/DiagnosticDriver.h"
+
+using namespace llvm;
+
+llvm::CharSetConverter *LiteralConverter::getConverter(const char *Codepage) {
+ auto Iter = CharsetConverters.find(Codepage);
+ if (Iter != CharsetConverters.end())
+ return &Iter->second;
+ return nullptr;
+}
+
+llvm::CharSetConverter *
+LiteralConverter::getConverter(ConversionAction Action) {
+ StringRef CodePage;
+ if (Action == ToSystemCharset)
+ CodePage = SystemCharset;
+ else if (Action == ToExecCharset)
+ CodePage = ExecCharset;
+ else
+ CodePage = InternalCharset;
+ return getConverter(CodePage.data());
+}
+
+llvm::CharSetConverter *
+LiteralConverter::createAndInsertCharConverter(const char *To) {
+ const char *From = InternalCharset.data();
+ llvm::CharSetConverter *Converter = getConverter(To);
+ if (Converter)
+ return Converter;
+
+ ErrorOr<CharSetConverter> ErrorOrConverter =
+ llvm::CharSetConverter::create(From, To);
+ if (!ErrorOrConverter)
+ return nullptr;
+ CharsetConverters.insert_or_assign(StringRef(To),
+ std::move(*ErrorOrConverter));
+ return getConverter(To);
+}
+
+void LiteralConverter::setConvertersFromOptions(
+ const clang::LangOptions &Opts, const clang::TargetInfo &TInfo,
+ clang::DiagnosticsEngine &Diags) {
+ using namespace llvm;
+ SystemCharset = TInfo.getTriple().getSystemCharset();
+ InternalCharset = "UTF-8";
+ ExecCharset = Opts.ExecCharset.empty() ? InternalCharset : Opts.ExecCharset;
+ // Create converter between internal and system charset
+ if (!InternalCharset.equals(SystemCharset))
+ createAndInsertCharConverter(SystemCharset.data());
+
+ // Create converter between internal and exec charset specified
+ // in fexec-charset option.
+ if (InternalCharset.equals(ExecCharset))
+ return;
+ if (!createAndInsertCharConverter(ExecCharset.data())) {
+ Diags.Report(clang::diag::err_drv_invalid_value)
+ << "-fexec-charset" << ExecCharset;
+ }
+}
diff --git a/clang/lib/Lex/LiteralSupport.cpp b/clang/lib/Lex/LiteralSupport.cpp
index 75ad977d64b24..927c4e1823a19 100644
--- a/clang/lib/Lex/LiteralSupport.cpp
+++ b/clang/lib/Lex/LiteralSupport.cpp
@@ -134,7 +134,8 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
FullSourceLoc Loc, unsigned CharWidth,
DiagnosticsEngine *Diags,
const LangOptions &Features,
- StringLiteralEvalMethod EvalMethod) {
+ StringLiteralEvalMethod EvalMethod,
+ llvm::CharSetConverter *Converter) {
const char *EscapeBegin = ThisTokBuf;
bool Delimited = false;
bool EndDelimiterFound = false;
@@ -146,6 +147,8 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
// that would have been \", which would not have been the end of string.
unsigned ResultChar = *ThisTokBuf++;
char Escape = ResultChar;
+ bool Translate = true;
+ bool Invalid = false;
switch (ResultChar) {
// These map to themselves.
case '\\': case '\'': case '"': case '?': break;
@@ -186,6 +189,7 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
ResultChar = 11;
break;
case 'x': { // Hex escape.
+ Translate = false;
ResultChar = 0;
if (ThisTokBuf != ThisTokEnd && *ThisTokBuf == '{') {
Delimited = true;
@@ -249,6 +253,7 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
case '4': case '5': case '6': case '7': {
// Octal escapes.
--ThisTokBuf;
+ Translate = false;
ResultChar = 0;
// Octal escapes are a series of octal digits with maximum length 3.
@@ -334,6 +339,7 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
<< std::string(1, ResultChar);
break;
default:
+ Invalid = true;
if (!Diags)
break;
@@ -367,6 +373,15 @@ static unsigned ProcessCharEscape(const char *ThisTokBegin,
HadError = true;
}
+ if (Translate && Converter) {
+ // Invalid escapes are written as '?' and then translated.
+ char ByteChar = Invalid ? '?' : ResultChar;
+ SmallString<8> ResultCharConv;
+ Converter->convert(StringRef(&ByteChar, 1), ResultCharConv);
+ assert(ResultCharConv.size() == 1 &&
+ "Char size increased after translation");
+ ResultChar = ResultCharConv[0];
+ }
return ResultChar;
}
@@ -1739,6 +1754,7 @@ CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
HadError = false;
Kind = kind;
+ LiteralConverter *LiteralConv = &PP.getLiteralConverter();
const char *TokBegin = begin;
@@ -1805,6 +1821,10 @@ CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
largest_character_for_kind = 0x7Fu;
}
+ llvm::CharSetConverter *Converter = nullptr;
+ if (!isUTFLiteral(Kind) && LiteralConv)
+ Converter = LiteralConv->getConverter(ToExecCharset);
+
while (begin != end) {
// Is this a span of non-escape characters?
if (begin[0] != '\\') {
@@ -1842,6 +1862,16 @@ CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
HadError = true;
PP.Diag(Loc, diag::err_character_too_large);
}
+ if (!HadError && Converter) {
+ assert(Kind != tok::wide_char_constant &&
+ "Wide character translation not supported");
+ char ByteChar = *tmp_out_start;
+ SmallString<1> ConvertedChar;
+ Converter->convert(StringRef(&ByteChar, 1), ConvertedChar);
+ assert(ConvertedChar.size() == 1 &&
+ "Char size increased after translation");
+ *tmp_out_start = ConvertedChar[0];
+ }
}
}
@@ -1849,16 +1879,35 @@ CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
}
// Is this a Universal Character Name escape?
if (begin[1] == 'u' || begin[1] == 'U' || begin[1] == 'N') {
- unsigned short UcnLen = 0;
- if (!ProcessUCNEscape(TokBegin, begin, end, *buffer_begin, UcnLen,
- FullSourceLoc(Loc, PP.getSourceManager()),
- &PP.getDiagnostics(), PP.getLangOpts(), true)) {
- HadError = true;
- } else if (*buffer_begin > largest_character_for_kind) {
- HadError = true;
- PP.Diag(Loc, diag::err_character_too_large);
+ if (Converter == nullptr) {
+ unsigned short UcnLen = 0;
+ if (!ProcessUCNEscape(TokBegin, begin, end, *buffer_begin, UcnLen,
+ FullSourceLoc(Loc, PP.getSourceManager()),
+ &PP.getDiagnostics(), PP.getLangOpts(), true)) {
+ HadError = true;
+ } else if (*buffer_begin > largest_character_for_kind) {
+ HadError = true;
+ PP.Diag(Loc, diag::err_character_too_large);
+ }
+ } else {
+ char Cp[8];
+ char *ResultPtr = Cp;
+ unsigned CharByteWidth = 1;
+ EncodeUCNEscape(TokBegin, begin, end, ResultPtr, HadError,
+ FullSourceLoc(Loc, PP.getSourceManager()),
+ CharByteWidth, &PP.getDiagnostics(), PP.getLangOpts());
+ if (!HadError) {
+ SmallString<8> CpConv;
+ Converter->convert(StringRef(Cp), CpConv);
+ if (CpConv.size() > 1) {
+ HadError = true;
+ PP.Diag(Loc, diag::err_character_too_large);
+ } else {
+ memcpy(Cp, CpConv.data(), CpConv.size());
+ *buffer_begin = *Cp;
+ }
+ }
}
-
...
[truncated]
|
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
39d8a0e to
0295d0d
Compare
|
Thanks for working on this
(This points to the need to sometimes convert strings from ebcdic to utf8 for diagnostic purposes) |
| #include "llvm/ADT/StringRef.h" | ||
| #include "llvm/Support/CharSet.h" | ||
|
|
||
| enum ConversionAction { NoConversion, ToSystemCharset, ToExecCharset }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have FromOrdinaryLiteralEncoding and ToOrdinaryLiteralEncoding instead of System/Exec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The names here were chosen to represent the different encodings that need to be used based on the context of the string literal which I mentioned in my RFC and have also copied the table to the description of this PR. I'm not sure the proposed names make that context clear
clang/lib/Lex/LiteralSupport.cpp
Outdated
| // that would have been \", which would not have been the end of string. | ||
| unsigned ResultChar = *ThisTokBuf++; | ||
| char Escape = ResultChar; | ||
| bool Translate = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| bool Translate = true; | |
| bool Transcode = true; |
I would prefer this defaults to false
| Converter->convert(StringRef(&ByteChar, 1), ResultCharConv); | ||
| assert(ResultCharConv.size() == 1 && | ||
| "Char size increased after translation"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a GetReplacementChar function... somewhere, and cache the result?
| if (!HadError && Converter) { | ||
| assert(Kind != tok::wide_char_constant && | ||
| "Wide character translation not supported"); | ||
| char ByteChar = *tmp_out_start; | ||
| SmallString<1> ConvertedChar; | ||
| Converter->convert(StringRef(&ByteChar, 1), ConvertedChar); | ||
| assert(ConvertedChar.size() == 1 && | ||
| "Char size increased after translation"); | ||
| *tmp_out_start = ConvertedChar[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be handled with diagnostics. The conversion can also fail, and that should be handled.
| "Wide character translation not supported"); | ||
| char ByteChar = *tmp_out_start; | ||
| SmallString<1> ConvertedChar; | ||
| Converter->convert(StringRef(&ByteChar, 1), ConvertedChar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the order of operation should be:
-> convert from UTF-8 to UTF-32, check it's a valid character
-> convert the same buffer from UTF-8 to the literal encoding
-> Check that that succeed and has a size of one
(ie some codepoints might hgave a size of 2 when encoded as utf-8 but 1 when encoded as latin1)
clang/lib/Lex/LiteralSupport.cpp
Outdated
| case '4': case '5': case '6': case '7': { | ||
| // Octal escapes. | ||
| --ThisTokBuf; | ||
| Translate = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also handle \o escapes?
…38893) This patch adds a wrapper class called EncodingConverter for ConverterEBCDIC. This class is then extended to support the ICU library or iconv library. The ICU library currently takes priority over the iconv library. Relevant RFCs: https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795 https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512 Stacked PR to enable fexec-charset that depends on this: #138895 See old PR for review and commit history: #74516
…upport. (#138893) This patch adds a wrapper class called EncodingConverter for ConverterEBCDIC. This class is then extended to support the ICU library or iconv library. The ICU library currently takes priority over the iconv library. Relevant RFCs: https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795 https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512 Stacked PR to enable fexec-charset that depends on this: llvm/llvm-project#138895 See old PR for review and commit history: llvm/llvm-project#74516
0295d0d to
c7822ec
Compare
|
I'm currently working on addressing all the comments, I have just rebased this PR to use the TextEncodingConverter, thanks for your patience! |
c7822ec to
bd3dd4d
Compare
| MarshallingInfoString<TargetOpts<"TuneCPU">>; | ||
| def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">, | ||
| HelpText<"Set the execution <charset> for string and character literals. " | ||
| "Supported character encodings include ISO8859-1, UTF-8, IBM-1047 " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is too long for HelpText, which is displayed by clang --help. Also, aren't ISO8859-1, UTF-8, and IBM-1047 all covered by "those supported by the host icu or iconv library"? Do we really need to call them out separately?
Maybe something like "Use for string and character literals" could work here as help text? You can put longer information with examples of character set names and references to icu / iconv into a separate DocBrief argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was because ConvertEBCDIC.cpp has a table for converting between EBCDIC and UTF-8 so those conversions will always be supported even without icu/iconv. I can definitely update this to something more succinct
| // /execution-charset: should warn on everything except UTF-8. | ||
| // RUN: not %clang_cl /execution-charset:utf-16 -### -- %s 2>&1 | FileCheck -check-prefix=execution-charset-utf-16 %s | ||
| // execution-charset-utf-16: invalid value 'utf-16' in '/execution-charset:utf-16' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens for the old example of /execution-charset:utf-16 / -fexec-charset=utf-16? Presumably that should also be an error, because we're setting the narrow execution character set here and it doesn't make sense to set that to a non-narrow encoding such as UTF-16 or UCS-4.
| @@ -0,0 +1,35 @@ | |||
| // RUN: %clang_cc1 %s -emit-llvm -triple s390x-none-zos -fexec-charset IBM-1047 -o - | FileCheck %s | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be tests here for wide (L-prefixed) character and string literals too?
| const char *OctalCharacters = "\141\142\143"; | ||
| //CHECK: c"abc\00" | ||
|
|
||
| const char singleChar = 'a'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the single-char check use a value that uses more than one byte in the internal encoding (but is a single byte in the literal encoding)?
For preprocessor string concatenation and escape sequences within strings, the shift-sequence behaviour for stateful encodings would be of interest. Additionally of interest is the treatment of bytes matching the encoding of |
clang/lib/Lex/LiteralConverter.cpp
Outdated
|
|
||
| // Create converter between internal and exec encoding specified | ||
| // in fexec-charset option. | ||
| if (InternalEncoding == ExecEncoding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a test case with -fexec-charset=utf8. This check does not catch that condition.
clang/lib/Lex/LiteralConverter.cpp
Outdated
| ToExecEncodingConverter = | ||
| new TextEncodingConverter(std::move(*ErrorOrConverter)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last I checked, the Diags.Report call does return to this function.
Also, this PR is missing tests for the -cc1 diagnostics here.
| ToExecEncodingConverter = | |
| new TextEncodingConverter(std::move(*ErrorOrConverter)); | |
| else | |
| ToExecEncodingConverter = | |
| new TextEncodingConverter(std::move(*ErrorOrConverter)); |
| @@ -0,0 +1,35 @@ | |||
| // RUN: %clang_cc1 %s -emit-llvm -triple s390x-none-zos -fexec-charset IBM-1047 -o - | FileCheck %s | |||
| // RUN: %clang %s -emit-llvm -S -target s390x-ibm-zos -o - | FileCheck %s | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be tests for other converters.
__STDC_MB_MIGHT_NEQ_WC__ should be defined in some cases.
| const char *OctalCharacters = "\141\142\143"; | ||
| //CHECK: c"abc\00" | ||
|
|
||
| const char singleChar = 'a'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be tests for multicharacter literals that become valid or invalid because of the chosen -fexec-charset.
| @@ -0,0 +1,46 @@ | |||
| // RUN: %clang %s -std=c++17 -emit-llvm -S -target s390x-ibm-zos -o - | FileCheck %s | |||
|
|
|||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be tests for character literals in #if.
b582b72 to
7149608
Compare
|
Hi,
which seems to imply 'L'a' == 'a' . If the fexec-charset and fwide-exec-charset are not compatible, this condition will fail. Should we allow fexec-charset to translate single wide characters to meet this condition if the fwide-exec-charset is not explicity specified? |
It looks like you are working from a pre-C++11 copy of the standard. In any case, what you quoted said "execution wide-character set" and you are implying value equivalence, not supported by the quoted text, between the execution character set and the execution wide-character set. For C, please see the definition of the macro |
…charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms. (cherry picked from commit 0295d0d) (cherry picked from commit e379f6cb9d063cb78c6b48b0e0a8d9f241958f89)
a9a412b to
ef0870f
Compare
72fb09f to
58205a5
Compare
6b7dcdd to
f44396e
Compare
f44396e to
60a2e2b
Compare
This patch enables the fexec-charset option to control the execution charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms.
This patch depends on adding the TextEncodingConverter class #138893
Relevant RFCs:
https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795
https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512
The table below summarizes what encoding needs to be used for string literals depending on the context. This change needs to be added in a followup PR