Skip to content

ICU-23350 Support name aliases of type correction in UnicodeSet \N#3918

Open
eggrobin wants to merge 1 commit intounicode-org:mainfrom
eggrobin:correction
Open

ICU-23350 Support name aliases of type correction in UnicodeSet \N#3918
eggrobin wants to merge 1 commit intounicode-org:mainfrom
eggrobin:correction

Conversation

@eggrobin
Copy link
Copy Markdown
Member

@eggrobin eggrobin commented Mar 30, 2026

Currently, ICU4C rejects the UnicodeSet [\N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET}], and accepts only the misspelling [\N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET}]. With this change, both expressions are valid and equivalent to [︘].

Name aliases of other types do not work (ICU-8963). UAX44-LM2 loose matching does not work as specified (ICU-3736).

Checklist

  • Required: Issue filed: ICU-23350
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable
  • Approver: Feel free to merge on my behalf

@eggrobin
Copy link
Copy Markdown
Member Author

@markusicu I thought this was covered by ICU-8963, but while that ticket mentions it in passing at the end, it is about something else.

Shall I file a ticket? I think I could self-approve such a ticket, since the TC has approved the proposal https://groups.google.com/a/unicode.org/g/icu-design/c/Tx2123ejkTI/m/ZDLGzY-XDQAJ which included this (under Name aliases and UAX44-LM2 in \N and \p{Name=…}).

@eggrobin eggrobin requested a review from markusicu March 30, 2026 15:31
@markusicu
Copy link
Copy Markdown
Member

markusicu commented Mar 30, 2026

@markusicu I thought this was covered by ICU-8963, but while that ticket mentions it in passing at the end, it is about something else.

Right. That one is about changing the data structure so that we can store multiple aliases, and adding API (constants) for selecting among them. Maybe new API altogether. And a little bit of UnicodeSet parser work to wire in API changes as needed.

Shall I file a ticket?

Yes -- I can't find one for this.

I think I could self-approve such a ticket, since the TC has approved the proposal [...] which included this

+1

Comment on lines +1459 to +1462
for (const UCharNameChoice nameChoice :
std::array{U_EXTENDED_CHAR_NAME, U_CHAR_NAME_ALIAS}) {
ec = U_ZERO_ERROR;
UChar32 ch = u_charFromName(nameChoice, buf, &ec);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: The feature is good, but it will make this even slower :-(
We should really add an API that we ask to consider all aliases, or some set of several aliases. (Different ticket & PR)

@eggrobin eggrobin changed the title Support name aliases of type correction in UnicodeSet \N ICU-23350 Support name aliases of type correction in UnicodeSet \N Mar 30, 2026
const UChar32 result = getCharacterByName(
std::u16string_view(pattern_).substr(start, parsePosition_.getIndex() - 1 - start));
if (result == U_SENTINEL || (hex.has_value() && result != hex) ||
(literal.has_value() && result != literal)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please indent more than the if-body

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. We should probably figure out how to tell clang-format to do that, it indents where the ( is, and if ( is the same length as our indent…

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't know why it does that; in google3 (where the indent is 2 spaces), the effect is that the continuation is indented more than the body; it should make that work with a larger indent as well

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it does what it does because that works for google3 and other major users.

It seems that AlignAfterOpenBracket: DontAlign can work, but it is a very blunt hammer: https://discourse.llvm.org/t/misleading-indentation-on-if-statements-with-clang-format-alignafteropenbracket/89069, see https://clang.llvm.org/docs/ClangFormatStyleOptions.html#alignafteropenbracket.
Maybe https://clang.llvm.org/docs/ClangFormatStyleOptions.html#breakafteropenbracketif would work?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the clang-format I have on my machine,

ContinuationIndentWidth: 8
AlignAfterOpenBracket: DontAlign

works; I don’t have BreakAfterOpenBracketIf.
But of course it results in unnecessary wide continuation indents elsewhere.
Anyway, another example of why I don’t think clang-formatting everything in ICU is happening soon…

Copy link
Copy Markdown
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm pse squash

@markusicu markusicu self-assigned this Mar 31, 2026
@jira-pull-request-webhook
Copy link
Copy Markdown

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook
Copy link
Copy Markdown

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants