-
Notifications
You must be signed in to change notification settings - Fork 601
toke.c: Generalize S_parse_ident() #23835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
toke.c
Outdated
|
|
||
| parse_ident(&s, &d, e, allow_package, is_utf8, TRUE); | ||
| parse_ident(&s, &d, e, is_utf8, | ||
| (CHECK_DOLLAR | ((allow_package) ? ALLOW_PACKAGE : 0))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usage of U32 instead of bool allows to preserve ALLOW_PACKAGE bit allows just CHECK_DOLLAR | allow_package.
Is there any reason to use bool type ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand. This code is in Perl_scan_word(). It was my original intent to make allow_package a U32 parameter to scan_word, but then I realized that is a function that is accessible outside this file, so changing parameters is problematic. And even if we could change it, that means the bit definitions for the flags parameter would have to be exported, which I did not want to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to use ALLOW_PACKAGE (or complement DISALLOW_PACKAGE) as parameters of scan_word as well but on second though, let's focus on parse_ident only. Marking conversation as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But my point is we can't change the scan_word parameters. I forgot that and had to revert with #23828
| #define ALLOW_PACKAGE (1 << 1) | ||
| #define CHECK_DOLLAR (1 << 2) | ||
| #define IDFIRST_ONLY (1 << 3) | ||
| #define STOP_AT_FIRST_NON_DIGIT (1 << 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just note:
I'd suggest NUMERIC_ONLY - shorter and it is in fact what user of this function wants - numeric identiifier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further consideration, NUMERIC_ONLY isn't an accurate name. What happens with this flag is if the first character is a digit, then it stops parsing when it finds a non-digit. The function still accepts the first character not being a digit, and parses that as usual. I'm open to a better name, but I think your suggestion is less accurate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, function behaves like that ... but is it expected behaviour?
Extracts from function which bug me:
assert((stop_at_first_non_digit & idfirst_only) == 0);
while (s < send) {
if ( (advance = isIDFIRST_lazy_if_safe(s, send, is_utf8)) && (is_utf8 || idfirst_only)) { ... }
else if (stop_at_first_non_digit && isDIGIT_A(*s)) { ... }
else if (! idfirst_only && isWORDCHAR_A(*s) ) { ... }
else if ( allow_package ... ) { handle package separator }
}
In assert you do not allow idfirst_only and stop_at_first_non_digit.
I think this function doesn't do what it should do, let me show some use cases. Use cases will contain:
ASCII- whenis_utf8isfalseUTF8- whenis_utf8istrueSTOP- whenstop_...is set totrueIDFIRST- whenidfirst_onlyis set totrue¥- represents character which is idfirst, is wordchar; but is not idcont
ASCII; STOP; string A0000B000+
- first outer loop iteration
advancecondition isfalse(idfirstis incompatible withSTOP, assured byassert)stop_condition isfalse('¥' is not a ascii digit)! idfirst_onlycondition istrue(Ais word character)- it's inner condition consumes
A0000B000
- it's inner condition consumes
- second outer loop iteration
- all conditions fails; loop breaks
Returned value: A0000B000
UTF8; STOP; string A0000B000+
- first outer loop iteration
advancecondition istrue- it's inner condition consumes
A0000B000on first run
- it's inner condition consumes
- second outer loop iteration
- all conditions fails; loop breaks
Returned value: A0000B000
ASCII; STOP; string ¥abcd¥0000+
- first outer loop iteration
advancecondition isfalse(idfirstis incompatible withSTOP, assured byassert)stop_condition isfalse('¥' is not ascii digit)! idfirst_onlycondition istrue(Ais word character)- it's inner condition consumes
¥abcd¥0000
- it's inner condition consumes
- second outer loop iteration
- all conditions fails; loop breaks
Returned value: ¥abcd¥0000
UTF8; STOP; string ¥abcd¥0000+
- first outer loop iteration
advancecondition istrue(is utf8 && is IDFIRST`)- it's inner condition consumes
¥abcd
- it's inner condition consumes
- second outer loop iteration
advancecondition istrue(is utf8 && is IDFIRST`)- it's inner condition consumes
¥0000
- it's inner condition consumes
- third outer loop iteration
- all conditions fails; loop breaks
Returned value: ¥abcd¥0000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your final two examples are wrong. ¥ is neither a word character nor IDFirst. In example three you mention A, but there is no such letter in that example. I think you meant ¥. So in both cases, it consumes nothing. I think it gives the right answer in all 4 cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know ¥ is neither IDFirst nor IDCont. I used is placeholder for character which is IDFirst but not IDcont. And sorry for typo in example, you are right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and clarification: this discussion should not be treated as one preventing merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I understand now what you meant.
There are no characters that are IDFirst and not IDCont. This is guaranteed by Unicode.
ID_Continue characters include ID_Start characters, plus other characters
(paraphrased from https://www.unicode.org/reports/tr31 . You can check this out by
perl -le 'use re qw(Debug COMPILE); qr/(?[ \p{IDS} - \p{IDC} ])/'
It yields
Final program:
1: OPFAIL (3)
3: END (0)
meaning there is nothing that matches things that are IDS that aren't IDC,
Note the slight differences in terminology. Our macro uses 'first', not 'start', for historical reasons.
We also use XIDS and XIDC, which are more modern versions of IDS and IDC, but the same guarantee applies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hence your final two examples would show a flaw in the algorithm if Unicode didn't forbid these situations from occurring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it looks my mistake in terminology. Having experience with writing parsers for different languages (eg one with support of currency symbols as identifier characters) I understood macros as terms in domain Perl, not in domain Unicode.
5f37df6 to
25a31f4
Compare
This does better vertical alignment, and fixes a typo in a comment that led to it being misleading
So don't mark it as such in its definition. embed.fnc already does not say it is inline. It is too complicated to be inline.
I don't know why these required ternaries; perhaps a bug in a C89 compiler.
This makes it clear at each call point what is happening, instead of having to jump to the S_force_word definition to know what 'false, true' vs 'true, false' actually means. And this prepares for future commits.
This makes it clearer at each call point what is happening, and prepares for future commits where more flags will be passed to this function.
Prior to this commit, the string passed to this function had to be pointing to somewhere in PL_bufptr. But this is only because it assumed that the initial position is less than PL_bufend. By passing the upper bound in, that assumption is automatically removed.
This function doesn't change anything in the string delimitted by these parameters, and future commits will call it with const strings that otherwise would have to cast away const
And make it clearer
All calls to it did this termination individually. Better to do it in one place
This removes the test for the identifier being too long at the beginning of the loop. Almost all the branches already have their own checks for this, and by restructuring those slightly, this one becomes extraneous, except for the one branch that didn't have a check. And it turns out this was wrong for that branch, which is the one that turns an apostrophe package separator into ::. The removed code assumed that the output doesn't expand, whereas each single apostrophe is in fact replaced by two characters. This commit corrects that. And it consolidates the handling to the target of a goto at the end of the function. Right now, there is no real advantage in that, but a future commit will make the handling more complicated, so a single point for it will be useful.
We know the contents of *s here; it is a colon. Just use that instead of derefencing.
These branches differ only
1) in part of the conditions that indicate to take them, so combine
those conditions together,
2) the number of bytes to advance, which is easily determinable
Otherwise they are identical, so it is easier to understand if they are
made common
The loop is refactored to eliminate an assignment at the end, and I think it is slightly clearer. But more importantly, it prepares for future commits. There is some extra indentation that will make sense when those commits are done
An identifier parsed by this function can include the ones most people would expect, but also ones that begin with a digit followed by ASCII \w characters. This commit adds a flag so that the function doesn't recognize the latter type as an identifier
S_scan_ident accepts another type of identifier: all digits. This moves the code that looks for those into S_parse_ident, adding a flag to accept them in that function. This adds a bit of complexity to S_parse_ident, removing equivalent complexity, and a bit more from S_scan_ident. Future commits will remove more complexity. This commit does just a bit beyond the bare minimum to move the code. The next commit will simplify the moved code a bit.
The previous commit just moved some code and added comments. This commit takes advantage of the new place, and simplifies the code, using the paradigm from the isWORDCHAR_A clause just below.
The previous two commits moved some code into a function. This commit takes very similar code and replaces it with a call to that function.
When this flag is set, when an illegal identifier is found, instead of croaking the function returns NULL. This will allow future commits to use this function when all that is desired is to determine if the identifier is legal or not.
This gives a more meaningful mnemonic
This new function is for callers that are merely checking if the string being parsed is a legal identifier or not, and arent interested in the normalized version of the identifier that parse_indent() generates. This new function allows callers to not have to think about this buffer; it just wraps plain parse_ident() using a throw-away buffer to hold the returned normalized text. This avoids introducing a bunch of conditionals inside parse_ident.
Instead of rolling its own version, incompletely.
25a31f4 to
3c5916e
Compare
|
Merged via ec8fb4c |
This p.r. generalizes this function so that it can be called in more circumstances; the final commit calls it from one such place, as a demonstration.
The goal is to centralize here the sort of handling of names, like identifiers, so changes can be made in just one place, and we can get rid of the other places which have slightly different rules from each other.
Sly identifier names have been used in projects to slip in code (past reviewers) which did not do what on the surface it appeared to. Unicode has in recent releases started to address that issue, and this p.r. will allow us to use their methods.
For example, I could see some sort of 'use strict' form that a perlcritic policy would enforce being used, and which would allow parse_ident() to reject problematic names.