Member

@eggrobin eggrobin commented Aug 21, 2025

While this is a hand-rolled recursive-descent predictive parser, it has a separate lexer, so we know that the lexing is not context-dependent.
The lexer prefigures some of the distinctions introduced in UTS61 (escapes in property-query, bracketed-element distinct from string-literal, etc.), but they then get treated the old way in the parser so nothing changes.

Both the lexer and the parser mirror the UTS61 grammar (in its version before the yellow/cyan changes).

TEMPORARY NOTE: This actually matches the grammar in https://www.unicode.org/reports/tr61/tr61-1d2.html; tr61-1.html should get updated shortly.

Technically it is possible to detect that this behaves differently, as there will be fewer calls to SymbolTable::lookup; see the changes to icu4c/source/test/intltest/usettest.cpp.

But we retain the essential property (relied on by rbbi) that with $meow=CP, the call to lookupMatcher(CP) always immediately follows a call to lookup(meow).
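
To make the call-order property concrete, here is a minimal sketch of how a test could check it from a recorded call log. It is not ICU code: `RecordingSymbolTable`, `lookup_matcher`, and `matcher_calls_follow_lookups` are hypothetical names invented for illustration, and the table only mimics the shape of `SymbolTable::lookup` and `SymbolTable::lookupMatcher`.

```python
class RecordingSymbolTable:
    """Hypothetical stand-in for a symbol table that logs every call."""

    def __init__(self, variables):
        # Maps a variable name to its value, e.g. {"meow": "\uE000"}.
        self.variables = variables
        self.calls = []

    def lookup(self, name):
        self.calls.append(("lookup", name))
        return self.variables.get(name)

    def lookup_matcher(self, code_point):
        self.calls.append(("lookupMatcher", code_point))
        return None  # A real table would return a matcher object.


def matcher_calls_follow_lookups(calls, variables):
    """True if every lookupMatcher(CP) directly follows a lookup(name)
    whose value is exactly the single code point CP."""
    for i, (kind, arg) in enumerate(calls):
        if kind != "lookupMatcher":
            continue
        if i == 0:
            return False
        prev_kind, prev_arg = calls[i - 1]
        if prev_kind != "lookup" or variables.get(prev_arg) != chr(arg):
            return False
    return True


# Simulate what a parser would do on encountering $meow.
table = RecordingSymbolTable({"meow": "\uE000"})
value = table.lookup("meow")
table.lookup_matcher(ord(value))
ok = matcher_calls_follow_lookups(table.calls, table.variables)
```

A check of this shape does not depend on the total number of `lookup` calls, so it would keep passing even when the parser performs fewer lookups than before.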

Checklist

  • Required: Issue filed: ICU-23179
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@eggrobin eggrobin marked this pull request as ready for review September 11, 2025 12:34
@eggrobin
Member Author

@markusicu Do we need a separate ticket for this restructuring, or should I do it as part of ICU-23179?

@eggrobin eggrobin changed the title from "ICU-22851 A somewhat more structured UnicodeSet parser." to "ICU-23179 A somewhat more structured UnicodeSet parser." Sep 11, 2025
@markusicu markusicu self-assigned this Sep 11, 2025
Member

@macchiati macchiati left a comment

LGTM

BTW we should point out in the spec that, unlike ordinary set operations, where
A∖B = A∩Bᶜ, UnicodeSet does not satisfy this identity: for some A and B (those containing sequences):

A-B ≠ A&[^B]

Example

[ab{cd}{ef}]-[ax{cd}{ex}]

[ab{cd}{ef}]&[^ax{cd}{ex}]
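
A toy model can show why the two expressions differ. This is a sketch under a stated assumption, not ICU behavior verified here: it assumes that complementing a set flips only its code points and leaves the result with no strings. `ToySet` and `UNIVERSE` are hypothetical names, and the universe is shrunk to a handful of letters for readability.

```python
from dataclasses import dataclass

# Toy stand-in for "all code points"; the real universe is all of Unicode.
UNIVERSE = frozenset("abcdefx")


@dataclass(frozen=True)
class ToySet:
    code_points: frozenset
    strings: frozenset

    def difference(self, other):
        # A-B: remove both the code points and the strings of B.
        return ToySet(self.code_points - other.code_points,
                      self.strings - other.strings)

    def intersection(self, other):
        # A&B: keep only code points and strings present in both.
        return ToySet(self.code_points & other.code_points,
                      self.strings & other.strings)

    def complement(self):
        # ASSUMPTION: complement applies to code points only, and the
        # result contains no strings.
        return ToySet(UNIVERSE - self.code_points, frozenset())


a = ToySet(frozenset("ab"), frozenset({"cd", "ef"}))  # [ab{cd}{ef}]
b = ToySet(frozenset("ax"), frozenset({"cd", "ex"}))  # [ax{cd}{ex}]

minus = a.difference(b)                       # code point b, string ef
via_complement = a.intersection(b.complement())  # code point b, no strings
```

Under this model the string `ef` survives the difference but not the intersection with the complement, because the complement carries no strings for it to match.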

@richgillam
Contributor

Wow, there's a lot here. I ran out of time to read through the whole thing and was starting to lose my mental acuity well before that, so I don't think I have anything useful to say. I can try again next week if you don't get useful review feedback from other people.

I read all the way through the spec, though, and I think I managed to keep my brain going through that. I think it looks really good. Is the plan to bring all the various implementations into complete conformance with that spec, and is this one of the first steps toward doing so? That seems like a good idea. I'm assuming you've thought through the backward-compatibility issues in doing so and everybody's comfortable with them?

I hope you'll permit me one stupid question about the spec: I'm assuming that the spec allows for sets that contain both individual code points and strings, right? I did see a lot of verbiage in there about bracketed elements and how they can represent either single code points or strings and how you want to do special things with them when they're single code points. I don't remember seeing much verbiage that talks about the behavior of unicode sets that contain both strings and single code points. At least to me, the semantics of this kind of thing are far from obvious, especially if a set contains both types. I remember a lot of this being discussed when various people were proposing changes to UnicodeSet earlier, but it seems like all of that needs to be covered in the spec, and I didn't really see it there. Is that planned? Am I wrong to be concerned about this?

@macchiati
Member

macchiati commented Sep 13, 2025 via email

4 participants