ICU-23179 A somewhat more structured UnicodeSet parser. #3604
Conversation
@markusicu Do we need a separate ticket for this restructuring, or should I do it as part of ICU-23179?
LGTM
BTW we should point out in the spec that, unlike ordinary set operations, where
A∖B = A∩Bᶜ, UnicodeSet does not satisfy this identity: for some A and B (those containing sequences):
A-B ≠ A&[^B]
Example
[ab{cd}{ef}]-[ax{cd}{ex}]
≠
[ab{cd}{ef}]&[^ax{cd}{ex}]
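To see why the two sides can differ, here is a minimal sketch using a toy model, not ICU's UnicodeSet API. It assumes that [^...] is a code point complement that drops all strings, and it restricts the universe of code points to 'a'..'z'. The names ToySet, without, common, difference, intersection, codePointComplement, setA and setB are all hypothetical. Under these assumptions the first expression evaluates to [b{ef}] and the second to [b].

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy stand-in for a UnicodeSet: code points and strings are kept separately.
// ASSUMPTION: the universe of code points is just 'a'..'z'.
struct ToySet {
    std::set<char> codePoints;
    std::set<std::string> strings;
};

bool operator==(const ToySet& x, const ToySet& y) {
    return x.codePoints == y.codePoints && x.strings == y.strings;
}

// Elements of x that are not in y.
template <typename T>
std::set<T> without(const std::set<T>& x, const std::set<T>& y) {
    std::set<T> r;
    for (const T& e : x) {
        if (y.count(e) == 0) r.insert(e);
    }
    return r;
}

// Elements present in both x and y.
template <typename T>
std::set<T> common(const std::set<T>& x, const std::set<T>& y) {
    std::set<T> r;
    for (const T& e : x) {
        if (y.count(e) != 0) r.insert(e);
    }
    return r;
}

// A-B: remove B's code points and B's strings from A.
ToySet difference(const ToySet& a, const ToySet& b) {
    return ToySet{without(a.codePoints, b.codePoints), without(a.strings, b.strings)};
}

// A&B: keep only what both sets contain.
ToySet intersection(const ToySet& a, const ToySet& b) {
    return ToySet{common(a.codePoints, b.codePoints), common(a.strings, b.strings)};
}

// ASSUMPTION: [^B] complements the code points only; the result has no strings.
ToySet codePointComplement(const ToySet& b) {
    ToySet r;
    for (char c = 'a'; c <= 'z'; ++c) {
        if (b.codePoints.count(c) == 0) r.codePoints.insert(c);
    }
    return r;
}

ToySet setA() {  // [ab{cd}{ef}]
    return ToySet{std::set<char>{'a', 'b'}, std::set<std::string>{"cd", "ef"}};
}

ToySet setB() {  // [ax{cd}{ex}]
    return ToySet{std::set<char>{'a', 'x'}, std::set<std::string>{"cd", "ex"}};
}
```

Comparing difference(setA(), setB()) with intersection(setA(), codePointComplement(setB())) shows the discrepancy: the string {ef} survives the difference, but it cannot survive an intersection with a complement that contains no strings.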
Wow, there's a lot here. I ran out of time to read through the whole thing and was starting to lose my mental acuity well before that, so I don't think I have anything useful to say. I can try again next week if you don't get useful review feedback from other people.
I read all the way through the spec, though, and I think I managed to keep my brain going through that. I think it looks really good. Is the plan to bring all the various implementations into complete conformance with that spec, and is this one of the first steps toward doing so? That seems like a good idea. I'm assuming you've thought through the backward-compatibility issues in doing so and everybody's comfortable with them?
I hope you'll permit me one stupid question about the spec: I'm assuming that the spec allows for sets that contain both individual code points and strings, right? I did see a lot of verbiage in there about bracketed elements and how they can represent either single code points or strings and how you want to do special things with them when they're single code points. I don't remember seeing much verbiage that talks about the behavior of Unicode sets that contain both strings and single code points. At least to me, the semantics of this kind of thing are far from obvious, especially if a set contains both types. I remember a lot of this being discussed when various people were proposing changes to UnicodeSet earlier, but it seems like all of that needs to be covered in the spec, and I didn't really see it there. Is that planned? Am I wrong to be concerned about this?
I wouldn't think that performance would be affected significantly, but we
probably should check at least with a simple spot test.
…On Fri, Sep 12, 2025, 17:46 Rich Gillam ***@***.***> wrote:
*richgillam* left a comment (unicode-org/icu#3604)
<#3604 (comment)>
Wow, there's a lot here. I ran out of time to read through the whole thing
and was starting to lose my mental acuity well before that, so I don't think
I have anything useful to say. I can try again next week if you don't get
useful review feedback from other people.
I read all the way through the spec, though, and I think I managed to keep
my brain going through that. I think it looks really good-- is the plan to
bring all the various implementations into complete conformance with that
spec, and is this one of the first steps toward doing so? That seems like a
good idea. I'm assuming you've thought through the backward-compatibility
issues in doing so and everybody's comfortable with them?
I hope you'll permit me one stupid question about the spec: I'm assuming
that the spec allows for sets that contain both individual code points and
strings, right? I did see a lot of verbiage in there about bracketed
elements and how they can represent either single code points or strings
and how you want to do special things with them when they're single code
points. I don't remember seeing much verbiage that talks about the behavior
of unicode sets that contain both strings and single code points. At least
to me, the semantics of this kind of thing are far from obvious, especially
if a set contains both types. I remember a lot of this being discussed when
various people were proposing changes to UnicodeSet earlier, but it seems
like all of that needs to be covered in the spec, and I didn't really see
it there. Is that planned? Am I wrong to be concerned about this?
While this is a hand-rolled recursive-descent predictive parser, it has a separate lexer, so we know that the lexing is not context-dependent.
The lexer prefigures some of the distinctions introduced in UTS61 (escapes in property-query, bracketed-element distinct from string-literal, etc.), but they then get treated the old way in the parser so nothing changes.
Both the lexer and the parser mirror the UTS61 grammar (in its version before the yellow/cyan changes).
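The split between a context-independent lexer and a predictive recursive-descent parser can be sketched as follows. This is a toy for a tiny, made-up subset of the syntax (brackets, [^, single characters, {strings}); it is not the code in this PR and does not follow the UTS61 grammar. All names (Kind, Token, lex, Value, Parser) are hypothetical, the universe is 'a'..'z', and negation is assumed to be a code point complement that drops strings.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class Kind { Open, OpenNegated, Close, Char, String, End };

struct Token {
    Kind kind;
    std::string text;  // the character for Char, the contents for String
};

// Lexer: a pure function of the input text; it never consults parser state.
std::vector<Token> lex(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const char c = src[i];
        if (c == '[' && i + 1 < src.size() && src[i + 1] == '^') {
            out.push_back(Token{Kind::OpenNegated, "[^"});
            i += 2;
        } else if (c == '[') {
            out.push_back(Token{Kind::Open, "["});
            ++i;
        } else if (c == ']') {
            out.push_back(Token{Kind::Close, "]"});
            ++i;
        } else if (c == '{') {
            const std::size_t end = src.find('}', i);
            if (end == std::string::npos) throw std::invalid_argument("unterminated {");
            out.push_back(Token{Kind::String, src.substr(i + 1, end - i - 1)});
            i = end + 1;
        } else {
            out.push_back(Token{Kind::Char, std::string(1, c)});
            ++i;
        }
    }
    out.push_back(Token{Kind::End, ""});
    return out;
}

struct Value {
    std::set<char> codePoints;
    std::set<std::string> strings;
};

// Parser: one function per grammar rule, choosing by one token of lookahead.
class Parser {
public:
    explicit Parser(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}

    Value parseAll() {
        Value v = parseSet();
        expect(Kind::End);
        return v;
    }

private:
    const Token& peek() const { return tokens_[pos_]; }

    Token expect(Kind k) {
        if (peek().kind != k) throw std::invalid_argument("unexpected token");
        return tokens_[pos_++];
    }

    // set := ('[' | '[^') (Char | String | set)* ']'
    Value parseSet() {
        const bool negated = peek().kind == Kind::OpenNegated;
        expect(negated ? Kind::OpenNegated : Kind::Open);
        Value v;
        while (peek().kind != Kind::Close) {
            if (peek().kind == Kind::Char) {
                v.codePoints.insert(expect(Kind::Char).text[0]);
            } else if (peek().kind == Kind::String) {
                v.strings.insert(expect(Kind::String).text);
            } else if (peek().kind == Kind::Open || peek().kind == Kind::OpenNegated) {
                const Value inner = parseSet();  // nested sets are unioned in
                v.codePoints.insert(inner.codePoints.begin(), inner.codePoints.end());
                v.strings.insert(inner.strings.begin(), inner.strings.end());
            } else {
                throw std::invalid_argument("unexpected end of input");
            }
        }
        expect(Kind::Close);
        if (negated) {  // ASSUMPTION: code point complement over 'a'..'z', strings dropped
            Value complement;
            for (char c = 'a'; c <= 'z'; ++c) {
                if (v.codePoints.count(c) == 0) complement.codePoints.insert(c);
            }
            return complement;
        }
        return v;
    }

    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};
```

For instance, Parser(lex("[ab{cd}[^a]]")).parseAll() tokenizes the whole input first and only then parses it, so the token boundaries cannot depend on where the parser is in the grammar.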
Technically it is possible to detect that this behaves differently, as there will be fewer calls to SymbolTable::lookup; see the changes to icu4c/source/test/intltest/usettest.cpp. But we retain the essential property (relied on by rbbi) that with $meow=CP, the call to lookupMatcher(CP) always immediately follows a call to lookup(meow).
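The ordering property can be stated as a check over a log of symbol table calls. The sketch below is a toy, not the test in usettest.cpp and not ICU's SymbolTable interface: Call and matcherLookupsFollowLookups are hypothetical names, and the log is assumed to record, for each lookup, the code point that the variable resolved to.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one call made by a parser to its symbol table.
struct Call {
    enum Kind { Lookup, LookupMatcher } kind;
    std::string name;    // variable name (used by Lookup only)
    char32_t codePoint;  // Lookup: value the variable resolved to;
                         // LookupMatcher: the argument passed in
};

// True if every LookupMatcher(CP) entry is immediately preceded by a
// Lookup entry whose variable resolved to that same CP.
bool matcherLookupsFollowLookups(const std::vector<Call>& log) {
    for (std::size_t i = 0; i < log.size(); ++i) {
        if (log[i].kind != Call::LookupMatcher) continue;
        if (i == 0) return false;  // nothing precedes this call
        const Call& prev = log[i - 1];
        if (prev.kind != Call::Lookup || prev.codePoint != log[i].codePoint) {
            return false;
        }
    }
    return true;
}
```

A log may contain fewer lookup entries than before and still pass this check, which is the sense in which the change is detectable but the property is preserved.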