ICU-23301 Grammatical UnicodeSet variables #3825

eggrobin · 2025-12-22T18:34:59Z

Require UnicodeSet variables to correspond to a set, a single element, or a stand-in for a single pre-parsed set (via lookupMatcher, to be removed later as part of ICU-23297). Variables in string literals are no longer expanded: {$this} now means {\$this} rather than {Bobby Tables}]&[]&[{}.
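To make the string-literal rule concrete, here is a minimal sketch in plain C++. It is a toy model and not the ICU parser: `expandOutsideStrings` and its textual substitution are hypothetical, made up for illustration, and the real implementation works on parsed sets rather than on text. It only shows that a `$name` reference is looked up outside `{...}` and left alone inside it.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Toy model only: NOT the ICU parser. Expands $name references outside
// {...} string literals and copies the contents of string literals verbatim.
std::string expandOutsideStrings(const std::string& pattern,
                                 const std::map<std::string, std::string>& vars) {
    std::string out;
    bool inString = false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\' && i + 1 < pattern.size()) {
            // Copy an escaped character verbatim, together with its backslash.
            out += c;
            out += pattern[++i];
        } else if (c == '{') {
            inString = true;
            out += c;
        } else if (c == '}') {
            inString = false;
            out += c;
        } else if (c == '$' && !inString) {
            // Read an ASCII identifier after the dollar sign (an assumption of
            // this sketch; it does not reproduce ICU's rules for names).
            std::size_t j = i + 1;
            while (j < pattern.size() &&
                   (std::isalnum(static_cast<unsigned char>(pattern[j])) ||
                    pattern[j] == '_')) {
                ++j;
            }
            const auto it = vars.find(pattern.substr(i + 1, j - i - 1));
            if (it != vars.end()) {
                out += it->second;  // Known variable: substitute its value.
                i = j - 1;          // Skip past the name.
            } else {
                out += c;           // Unknown variable: keep the '$' in this toy.
            }
        } else {
            out += c;
        }
    }
    return out;
}
```

With `us` mapped to `[a-z]`, calling `expandOutsideStrings("[$us{$us}]", vars)` yields `[[a-z]{$us}]`: the first reference is replaced, and the one inside the string literal stays as the literal characters `$us`.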

Remove applyPropertyPattern in favour of the new lexer as discussed in #3604 (comment).

No other changes in behaviour: named-element is still a set, spaces are still ignored in string-literal, spaces are still allowed in [: ^ (the change for that first one will come in a subsequent pull request, the others will be proposed shortly).

Checklist

Required: Issue filed: ICU-23301
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable
Approver: Feel free to merge on my behalf

markusicu · 2025-12-23T17:48:41Z

Still "draft". Not yet ready to look at?

eggrobin · 2025-12-23T17:51:44Z

Oops, undrafted.

markusicu · 2025-12-23T17:53:18Z

icu4c/source/test/intltest/usettest.cpp

-            "[[a-z]-[c-z]-]", nullptr,
-        "string", "{", "end", "}", "[ $string Zeichenkette $end ]", "[{Zeichenkette}]", nullptr,
+        // Variables do not expand inside string literals.
+        "us", "[a-z]", "[$us{$us}]", R"([a-z{\$us}])", nullptr,


Why does the $ in the string literal need to be escaped?

The input is [$us{$us}], no escaping here.

The [a-z{\$us}] is the output of UnicodeSet pattern rewriting; that always escapes dollar signs (I think it applies the same logic inside and outside strings).

Maybe it shouldn’t do that inside strings anymore, but this looks like a separate issue; [a-z{\$us}] is correct.

Note that - also gets escaped in toPattern strings, even though it has never meant anything inside a string literal:
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%7BBaden-W%C3%BCrttemberg%7D%5D&g=&i=
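The escaping described above can be sketched as follows. This is a toy model and not ICU's toPattern: `escapeForPattern` is a hypothetical helper, and the set of escaped characters (`$`, `-`, `\`) is an assumption chosen to match the two cases mentioned in this thread, not the actual list ICU uses.

```cpp
#include <string>

// Toy model only: NOT ICU's toPattern. The set of escaped characters
// is an assumption chosen for illustration.
std::string escapeForPattern(const std::string& text) {
    std::string out;
    for (const char c : text) {
        // The same rule is applied whether or not the text came from
        // inside a string literal, which is why '-' gets escaped there too.
        if (c == '$' || c == '-' || c == '\\') {
            out += '\\';
        }
        out += c;
    }
    return out;
}
```

Under these assumptions, `escapeForPattern("$us")` gives `\$us` and `escapeForPattern("Baden-W")` gives `Baden\-W`, even though neither character has a special meaning inside a string literal.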

Weird. Please add it to your pile of stuff to revisit later.

markusicu

looks... plausible

markusicu · 2025-12-23T20:10:05Z

icu4c/source/common/uniset_props.cpp

+            if (UnicodeString name = symbols_->parseReference(pattern_, nameEnd, pattern_.length());
+                !name.isEmpty()) {
+                chars_.jumpahead(nameEnd.getIndex() - (start + 1));


Hard-to-read indentation style... but if I don't find anything else, let's go with it.

jira-pull-request-webhook · 2025-12-29T11:03:23Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2025-12-29T11:21:38Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

eggrobin changed the title ~~Grammatical UnicodeSet variables~~ ICU-23301 Grammatical UnicodeSet variables Dec 23, 2025

eggrobin requested review from markusicu and richgillam December 23, 2025 16:08

eggrobin marked this pull request as ready for review December 23, 2025 17:51

markusicu reviewed Dec 23, 2025

markusicu self-assigned this Dec 23, 2025

markusicu approved these changes Dec 23, 2025

eggrobin force-pushed the grammatical-variables branch from 9109c05 to 1655f87 Compare December 29, 2025 11:03

ICU-23301 Grammatical UnicodeSet variables

908e5a2

eggrobin force-pushed the grammatical-variables branch from 1655f87 to 908e5a2 Compare December 29, 2025 11:21

eggrobin merged commit 83ad6bf into unicode-org:main Dec 29, 2025
98 checks passed
