ICU-23179 A somewhat more structured UnicodeSet parser. #3604

eggrobin · 2025-08-21T15:30:35Z

While this is a hand-rolled recursive-descent predictive parser, it has a separate lexer, so we know that the lexing is not context-dependent.
The lexer prefigures some of the distinctions introduced in UTS61 (escapes in property-query, bracketed-element distinct from string-literal, etc.), but the parser then treats them the old way, so behaviour is unchanged.

Both the lexer and the parser mirror the UTS61 grammar (in its version before the yellow/cyan changes).

TEMPORARY NOTE: This actually matches the grammar in https://www.unicode.org/reports/tr61/tr61-1d2.html; tr61-1.html should get updated shortly.

Technically it is possible to detect that this behaves differently, as there will be fewer calls to SymbolTable::lookup; see the changes to icu4c/source/test/intltest/usettest.cpp.

But we retain the essential property (relied on by rbbi) that with $meow=CP, the call to lookupMatcher(CP) always immediately follows a call to lookup(meow).
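
A minimal sketch of how that property could be stated as a check over a recorded sequence of calls. This is not ICU's SymbolTable interface: `Call`, `CallKind`, `matcherCallsFollowLookups` and the private-use code points are hypothetical names and values chosen for illustration, and the log is assumed to be recorded by some test double.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record of one call made by the parser to a symbol table.
enum class CallKind { kLookup, kLookupMatcher };

struct Call {
    CallKind kind;
    std::string name;    // Argument of lookup; empty for lookupMatcher.
    char32_t codePoint;  // Argument of lookupMatcher; 0 for lookup.
};

// Returns true if every lookupMatcher(CP) in the log is immediately preceded
// by a lookup(name) where the bindings map name to CP.
bool matcherCallsFollowLookups(const std::vector<Call>& log,
                               const std::map<std::string, char32_t>& bindings) {
    for (std::size_t i = 0; i < log.size(); ++i) {
        if (log[i].kind != CallKind::kLookupMatcher) continue;
        if (i == 0 || log[i - 1].kind != CallKind::kLookup) return false;
        auto it = bindings.find(log[i - 1].name);
        if (it == bindings.end() || it->second != log[i].codePoint) return false;
    }
    return true;
}

// Exercises the check on an invented log with $meow bound to U+E000.
void checkLogs() {
    std::map<std::string, char32_t> bindings;
    bindings["meow"] = 0xE000;
    std::vector<Call> log;
    log.push_back(Call{CallKind::kLookup, "meow", 0});
    log.push_back(Call{CallKind::kLookupMatcher, "", 0xE000});
    assert(matcherCallsFollowLookups(log, bindings));
}
```

A check of this shape is insensitive to the total number of lookup calls, so it would keep passing if redundant lookups disappear, as long as each lookupMatcher still directly follows its lookup.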

Checklist

Required: Issue filed: ICU-23179
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

eggrobin · 2025-09-11T12:34:30Z

@markusicu Do we need a separate ticket for this restructuring, or should I do it as part of ICU-23179?

macchiati

LGTM

BTW we should point out in the spec that ordinary set operations satisfy
A∖B = A∩Bᶜ, but UnicodeSet does not: for some A and B (those containing sequences),

A-B ≠ A&[^B]

Example

[ab{cd}{ef}]-[ax{cd}{ex}]
≠
[ab{cd}{ef}]&[^ax{cd}{ex}]
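
The example can be worked through in a toy model. This is a sketch, not the ICU class: `ToySet` and the function names are hypothetical, and it assumes that `[^B]` denotes the code point complement of B, which contains no strings.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model of a UnicodeSet: code points and multi-character strings are
// stored separately. All names here are hypothetical.
struct ToySet {
    std::set<char32_t> codePoints;
    std::set<std::u32string> strings;
    bool operator==(const ToySet& other) const {
        return codePoints == other.codePoints && strings == other.strings;
    }
};

// A-B: remove every element of b (code point or string) from a.
ToySet difference(const ToySet& a, const ToySet& b) {
    ToySet result;
    for (char32_t c : a.codePoints)
        if (b.codePoints.count(c) == 0) result.codePoints.insert(c);
    for (const auto& s : a.strings)
        if (b.strings.count(s) == 0) result.strings.insert(s);
    return result;
}

// A&[^B], ASSUMING [^B] is the code point complement of B:
// every code point not in B, and no strings at all.
ToySet intersectWithComplement(const ToySet& a, const ToySet& b) {
    ToySet result;
    for (char32_t c : a.codePoints)
        if (b.codePoints.count(c) == 0) result.codePoints.insert(c);
    // No string of a survives, because [^B] contains no strings.
    return result;
}

// A = [ab{cd}{ef}]
ToySet exampleA() {
    ToySet a;
    a.codePoints = {U'a', U'b'};
    a.strings = {U"cd", U"ef"};
    return a;
}

// B = [ax{cd}{ex}]
ToySet exampleB() {
    ToySet b;
    b.codePoints = {U'a', U'x'};
    b.strings = {U"cd", U"ex"};
    return b;
}

// The two expressions disagree on the example.
void checkExample() {
    assert(!(difference(exampleA(), exampleB()) ==
             intersectWithComplement(exampleA(), exampleB())));
}
```

In this model A-B keeps the string {ef}, giving [b{ef}], while A&[^B] gives [b]; the two sides differ only in the strings, which is why the identity holds for sets without sequences.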

richgillam · 2025-09-13T00:46:20Z

Wow, there's a lot here. I ran out of time to read through the whole thing and was starting to lose my mental acuity well before that, so I don't think I have anything useful to say. I can try again next week if you don't get useful review feedback from other people.

I read all the way through the spec, though, and I think I managed to keep my brain going through that. I think it looks really good-- is the plan to bring all the various implementations into complete conformance with that spec, and is this one of the first steps toward doing so? That seems like a good idea. I'm assuming you've thought through the backward-compatibility issues in doing so and everybody's comfortable with them?

I hope you'll permit me one stupid question about the spec: I'm assuming that the spec allows for sets that contain both individual code points and strings, right? I did see a lot of verbiage in there about bracketed elements and how they can represent either single code points or strings and how you want to do special things with them when they're single code points. I don't remember seeing much verbiage that talks about the behavior of unicode sets that contain both strings and single code points. At least to me, the semantics of this kind of thing are far from obvious, especially if a set contains both types. I remember a lot of this being discussed when various people were proposing changes to UnicodeSet earlier, but it seems like all of that needs to be covered in the spec, and I didn't really see it there. Is that planned? Am I wrong to be concerned about this?

macchiati · 2025-09-13T04:39:19Z

I wouldn't think that performance would be affected significantly, but we probably should check at least with a simple spot test.


eggrobin added 30 commits August 11, 2025 16:24

ICU-22851 Test the error paths in UnicodeSet parsing

185a38f

Call it a day

6a650e7

Some progress, toPattern is wrong, but what is right?

85b8b50

ICU-22851 Test the exact behaviour of UnicodeSet::toPattern

07ab1c1

Merge branch 'unicodeset-parser' into recursive-descent

fd6940c

call it a day

b489fa0

Pattern-rebuilding logic

e147b15

More tests of toPattern

f47cac4

ICU-22851 Test the exact behaviour of UnicodeSet::toPattern

e3efa59

Merge commit 'f47cac4' into recursive-descent

1e34418

Merge branch 'unicodeset-parser' into recursive-descent

6276769

Print strings

a7a4035

Appease the warnings even though these are string_views

cef298e

Merge branch 'unicodeset-parser' into recursive-descent

8f8dcca

ICU-22851 Test the exact behaviour of UnicodeSet::toPattern

b4e365b

Merge branch 'unicodeset-parser' into recursive-descent

51e2702

ICU-22851 Test various edge cases with $ in the absence of variables

cef2093

Merge branch 'unicodeset-parser' into recursive-descent

33b1075

$ handling

9e126dd

comment

c8d2b9e

ICU-22851 Even more $ edge cases

bbcc231

Merge branch 'unicodeset-parser' into recursive-descent

ec299be

ICU-22851 Test various edge cases with $ in the absence of variables

876d338

Merge branch 'unicodeset-parser' into recursive-descent

5d19376

ICU-22851 Test UnicodeSet with lookupMatcher

a6d9182

Merge branch 'unicodeset-lookup-matcher' into recursive-descent-into-madness

27728c7

Something that works in the same silly way as it used to.

e81735c

indentation on the parse error tests

4beef14

Merge branch 'unicodeset-parser' into recursive-descent-into-madness

d0bc4fa

ICU-22851 Test the error paths in UnicodeSet parsing

18f2b7b

eggrobin added 17 commits August 21, 2025 17:51

Unused variables

ff092dc

Some work towards a proper lexer

f0bd37b

A proper lexer

b78c0ce

Don’t report end of text as a literal-element

d61b090

Turn off traces

40460d9

ICU-23179 Test more edge cases when mapping syntax characters to sets

e39c4d1

Merge branch 'doctor-it-hurts-when-i-do-this' into dura-lex-sed-lex

9014b75

Deal with the ambiguous - and ^

93d9296

Update sequence expectations

7940892

warnings

d3cc9ea

Clarify some comments

3cfc4ae

more discursive comments

629bc89

make it compile

cace9d7

libstdc++ dependencies

0323593

quote?

bcb7ac0

No infinite loops in the lexer

66cceeb

That is well-formed

f79b35c

eggrobin mentioned this pull request Sep 9, 2025

ICU-23179 Test more edge cases when mapping syntax characters to sets #3612

Merged

6 tasks

eggrobin added 2 commits September 11, 2025 14:28

Merge remote-tracking branch 'la-vache/main' into recursive-descent-into-madness

5f40223

dedent

94cc56c

eggrobin marked this pull request as ready for review September 11, 2025 12:34

eggrobin changed the title ~~ICU-22851 A somewhat more structured UnicodeSet parser.~~ ICU-23179 A somewhat more structured UnicodeSet parser. Sep 11, 2025

markusicu self-assigned this Sep 11, 2025

markusicu requested review from macchiati, markusicu and richgillam September 11, 2025 16:25

macchiati approved these changes Sep 11, 2025
