Member

@eggrobin eggrobin commented Dec 22, 2025

Require UnicodeSet variables to correspond to a set, a single element, or a stand-in for a single pre-parsed set (via lookupMatcher, to be removed later as part of ICU-23297). Variables in string literals are no longer expanded: {$this} now means {\$this} rather than {Bobby Tables}]&[]&[{}.
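To illustrate where a variable reference is still recognized after this change, here is a minimal sketch. It is not ICU's lexer. The function name findVariableReferences and the rule that a name consists of ASCII letters, digits, and underscores are assumptions made for this example. It only reports which `$name` references occur outside `{…}` string literals.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy scanner, NOT ICU's lexer: collects the names of $variable references
// that appear outside {string literals}. The name syntax is an assumption.
std::vector<std::string> findVariableReferences(const std::string& pattern) {
    std::vector<std::string> names;
    bool inStringLiteral = false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\') {
            ++i;  // Skip the escaped character, whatever it is.
        } else if (c == '{') {
            inStringLiteral = true;
        } else if (c == '}') {
            inStringLiteral = false;
        } else if (c == '$' && !inStringLiteral) {
            // Read the longest run of name characters after the dollar sign.
            std::size_t end = i + 1;
            while (end < pattern.size() &&
                   (std::isalnum(static_cast<unsigned char>(pattern[end])) ||
                    pattern[end] == '_')) {
                ++end;
            }
            if (end > i + 1) {
                names.push_back(pattern.substr(i + 1, end - (i + 1)));
                i = end - 1;  // Resume scanning after the name.
            }
        }
    }
    return names;
}
```

For the input [$us{$us}], this sketch reports a single reference to us: the `$us` inside the braces is plain string-literal text.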

Remove applyPropertyPattern in favour of the new lexer as discussed in #3604 (comment).

No other changes in behaviour: named-element is still a set, spaces are still ignored in string-literal, spaces are still allowed in [: ^ (the change for that first one will come in a subsequent pull request, the others will be proposed shortly).

Checklist

  • Required: Issue filed: ICU-23301
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable
  • Approver: Feel free to merge on my behalf

@eggrobin eggrobin changed the title from "Grammatical UnicodeSet variables" to "ICU-23301 Grammatical UnicodeSet variables" Dec 23, 2025
@markusicu
Member

Still "draft". Not yet ready to look at?

@eggrobin eggrobin marked this pull request as ready for review December 23, 2025 17:51
@eggrobin
Member Author

Oops, undrafted.

"[[a-z]-[c-z]-]", nullptr,
"string", "{", "end", "}", "[ $string Zeichenkette $end ]", "[{Zeichenkette}]", nullptr,
// Variables do not expand inside string literals.
"us", "[a-z]", "[$us{$us}]", R"([a-z{\$us}])", nullptr,
Member

Why does the $ in the string literal need to be escaped?

Member Author

The input is [$us{$us}], no escaping here.

The [a-z{\$us}] is the output of UnicodeSet pattern rewriting; that always escapes dollar signs (I think it applies the same logic inside and outside strings).

Maybe it shouldn’t do that inside strings anymore, but this seems like a separate issue; [a-z{\$us}] is correct.

Member Author

Note that - also gets escaped in toPattern strings, even though it has never meant anything inside a string literal:
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%7BBaden-W%C3%BCrttemberg%7D%5D&g=&i=
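The context-free escaping described in the two comments above can be sketched as follows. This is not ICU's toPattern code. The function name escapeSyntaxCharacters and the exact list of escaped characters are assumptions made for this example. The point is only that an escaper which does not track whether it is inside a string literal will escape `$` and `-` everywhere.

```cpp
#include <string>

// Toy escaper, NOT ICU's toPattern: backslash-escapes every character from an
// assumed list of syntax characters, without tracking whether the character
// sits inside a {string literal}.
std::string escapeSyntaxCharacters(const std::string& text) {
    static const std::string kSyntax = "[]-^&\\{}:$";  // Assumed list.
    std::string result;
    for (const char c : text) {
        if (kSyntax.find(c) != std::string::npos) {
            result += '\\';  // Escape regardless of context.
        }
        result += c;
    }
    return result;
}
```

Applied to the contents of the string literals discussed here, this sketch turns $us into \$us and puts a backslash before the hyphen in Baden-Württemberg, which matches the shape of the output quoted above.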

Member

Weird. Please add it to your pile of stuff to revisit later.

@markusicu markusicu self-assigned this Dec 23, 2025
Member

@markusicu markusicu left a comment

looks... plausible

Comment on lines +480 to +482
if (UnicodeString name = symbols_->parseReference(pattern_, nameEnd, pattern_.length());
!name.isEmpty()) {
chars_.jumpahead(nameEnd.getIndex() - (start + 1));
Member

Hard-to-read indentation style... but if I don't find anything else, let's go with it.

@eggrobin eggrobin force-pushed the grammatical-variables branch from 9109c05 to 1655f87 Compare December 29, 2025 11:03
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin eggrobin force-pushed the grammatical-variables branch from 1655f87 to 908e5a2 Compare December 29, 2025 11:21
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin eggrobin merged commit 83ad6bf into unicode-org:main Dec 29, 2025
98 checks passed