Skip to content

Conversation

@khwilliamson
Copy link
Contributor

This p.r. generalizes this function so that it can be called in more circumstances; the final commit calls it from one such place, as a demonstration.

The goal is to centralize here the sort of handling of names, like identifiers, so changes can be made in just one place, and we can get rid of the other places which have slightly different rules from each other.

Sly identifier names have been used in projects to slip in code (past reviewers) which did not do what on the surface it appeared to. Unicode has in recent releases started to address that issue, and this p.r. will allow us to use their methods.

For example, I could see some sort of 'use strict' form that a perlcritic policy would enforce being used, and which would allow parse_ident() to reject problematic names.

  • This set of changes does not require a perldelta entry.

@khwilliamson khwilliamson requested a review from leonerd October 12, 2025 00:55
@khwilliamson khwilliamson added the Use merge commit Don't merge this p.r. from github It contains multiple related commits. Instructions in perlgit label Oct 12, 2025
toke.c Outdated

parse_ident(&s, &d, e, allow_package, is_utf8, TRUE);
parse_ident(&s, &d, e, is_utf8,
(CHECK_DOLLAR | ((allow_package) ? ALLOW_PACKAGE : 0)));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usage of U32 instead of bool allows to preserve ALLOW_PACKAGE bit allows just CHECK_DOLLAR | allow_package.

Is there any reason to use bool type ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. This code is in Perl_scan_word(). It was my original intent to make allow_package a U32 parameter to scan_word, but then I realized that is a function that is accessible outside this file, so changing parameters is problematic. And even if we could change it, that means the bit definitions for the flags parameter would have to be exported, which I did not want to do.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to use ALLOW_PACKAGE (or complement DISALLOW_PACKAGE) as parameters of scan_word as well but on second though, let's focus on parse_ident only. Marking conversation as resolved.

Copy link
Contributor Author

@khwilliamson khwilliamson Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But my point is we can't change the scan_word parameters. I forgot that and had to revert with #23828

#define ALLOW_PACKAGE (1 << 1)
#define CHECK_DOLLAR (1 << 2)
#define IDFIRST_ONLY (1 << 3)
#define STOP_AT_FIRST_NON_DIGIT (1 << 4)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just note:
I'd suggest NUMERIC_ONLY - shorter and it is in fact what user of this function wants - numeric identiifier

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further consideration, NUMERIC_ONLY isn't an accurate name. What happens with this flag is if the first character is a digit, then it stops parsing when it finds a non-digit. The function still accepts the first character not being a digit, and parses that as usual. I'm open to a better name, but I think your suggestion is less accurate

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, function behaves like that ... but is it expected behaviour?
Extracts from function which bug me:

    assert((stop_at_first_non_digit & idfirst_only) == 0);
while (s < send) {
	if (   (advance = isIDFIRST_lazy_if_safe(s, send, is_utf8)) && (is_utf8 || idfirst_only)) { ... }
	else if (stop_at_first_non_digit && isDIGIT_A(*s)) { ... }
	else if (! idfirst_only && isWORDCHAR_A(*s) ) { ... }
	else if (   allow_package ... ) { handle package separator }
}

In assert you do not allow idfirst_only and stop_at_first_non_digit.

I think this function doesn't do what it should do, let me show some use cases. Use cases will contain:

  • ASCII - when is_utf8 is false
  • UTF8 - when is_utf8 is true
  • STOP - when stop_... is set to true
  • IDFIRST - when idfirst_only is set to true
  • ¥ - represents character which is idfirst, is wordchar; but is not idcont

ASCII; STOP; string A0000B000+

  1. first outer loop iteration
  • advance condition is false (idfirst is incompatible with STOP, assured by assert)
  • stop_ condition is false ('¥' is not a ascii digit)
  • ! idfirst_only condition is true (A is word character)
    • it's inner condition consumes A0000B000
  1. second outer loop iteration
  • all conditions fails; loop breaks

Returned value: A0000B000

UTF8; STOP; string A0000B000+

  1. first outer loop iteration
  • advance condition is true
    • it's inner condition consumes A0000B000 on first run
  1. second outer loop iteration
  • all conditions fails; loop breaks

Returned value: A0000B000

ASCII; STOP; string ¥abcd¥0000+

  1. first outer loop iteration
  • advance condition is false (idfirst is incompatible with STOP, assured by assert)
  • stop_ condition is false ('¥' is not ascii digit)
  • ! idfirst_only condition is true (A is word character)
    • it's inner condition consumes ¥abcd¥0000
  1. second outer loop iteration
  • all conditions fails; loop breaks

Returned value: ¥abcd¥0000

UTF8; STOP; string ¥abcd¥0000+

  1. first outer loop iteration
  • advance condition is true (is utf8 && is IDFIRST`)
    • it's inner condition consumes ¥abcd
  1. second outer loop iteration
  • advance condition is true (is utf8 && is IDFIRST`)
    • it's inner condition consumes ¥0000
  1. third outer loop iteration
  • all conditions fails; loop breaks

Returned value: ¥abcd¥0000

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your final two examples are wrong. ¥ is neither a word character nor IDFirst. In example three you mention A, but there is no such letter in that example. I think you meant ¥. So in both cases, it consumes nothing. I think it gives the right answer in all 4 cases.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know ¥ is neither IDFirst nor IDCont. I used is placeholder for character which is IDFirst but not IDcont. And sorry for typo in example, you are right.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and clarification: this discussion should not be treated as one preventing merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I understand now what you meant.

There are no characters that are IDFirst and not IDCont. This is guaranteed by Unicode.

ID_Continue characters include ID_Start characters, plus other characters

(paraphrased from https://www.unicode.org/reports/tr31 . You can check this out by

perl -le 'use re qw(Debug COMPILE); qr/(?[ \p{IDS} - \p{IDC} ])/'

It yields

Final program:
1: OPFAIL (3)
3: END (0)

meaning there is nothing that matches things that are IDS that aren't IDC,

Note the slight differences in terminology. Our macro uses 'first', not 'start', for historical reasons.

We also use XIDS and XIDC, which are more modern versions of IDS and IDC, but the same guarantee applies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hence your final two examples would show a flaw in the algorithm if Unicode didn't forbid these situations from occurring.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it looks my mistake in terminology. Having experience with writing parsers for different languages (eg one with support of currency symbols as identifier characters) I understood macros as terms in domain Perl, not in domain Unicode.

@khwilliamson khwilliamson force-pushed the parse_ident branch 2 times, most recently from 5f37df6 to 25a31f4 Compare October 12, 2025 21:28
This does better vertical alignment, and fixes a typo in a comment that
led to it being misleading
So don't mark it as such in its definition.  embed.fnc already does not say it is inline.  It is too complicated to be inline.
I don't know why these required ternaries; perhaps a bug in a C89
compiler.
This makes it clear at each call point what is happening, instead of
having to jump to the S_force_word definition to know what 'false, true'
vs 'true, false' actually means.

And this prepares for future commits.
This makes it clearer at each call point what is happening, and prepares
for future commits where more flags will be passed to this function.
Prior to this commit, the string passed to this function had to be
pointing to somewhere in PL_bufptr.  But this is only because it assumed
that the initial position is less than PL_bufend.  By passing the upper
bound in, that assumption is automatically removed.
This function doesn't change anything in the string delimitted by these
parameters, and future commits will call it with const strings that
otherwise would have to cast away const
All calls to it did this termination individually.  Better to do it in
one place
This removes the test for the identifier being too long at the beginning
of the loop.

Almost all the branches already have their own checks for this, and by
restructuring those slightly, this one becomes extraneous, except for
the one branch that didn't have a check.  And it turns out this was
wrong for that branch, which is the one that turns an apostrophe package
separator into ::.  The removed code assumed that the output doesn't
expand, whereas each single apostrophe is in fact replaced by two
characters.  This commit corrects that.

And it consolidates the handling to the target of a goto at the end of
the function.  Right now, there is no real advantage in that, but a
future commit will make the handling more complicated, so a single point
for it will be useful.
We know the contents of *s here; it is a colon.  Just use that instead
of derefencing.
These branches differ only
    1)  in part of the conditions that indicate to take them, so combine
        those conditions together,
    2)  the number of bytes to advance, which is easily determinable

Otherwise they are identical, so it is easier to understand if they are
made common
The loop is refactored to eliminate an assignment at the end, and I
think it is slightly clearer.  But more importantly, it prepares for
future commits.  There is some extra indentation that will make sense
when those commits are done
An identifier parsed by this function can include the ones most people
would expect, but also ones that begin with a digit followed by ASCII \w
characters.  This commit adds a flag so that the function doesn't
recognize the latter type as an identifier
S_scan_ident accepts another type of identifier: all digits.  This moves
the code that looks for those into S_parse_ident, adding a flag to
accept them in that function.  This adds a bit of complexity to
S_parse_ident, removing equivalent complexity, and a bit more from
S_scan_ident.  Future commits will remove more complexity.

This commit does just a bit beyond the bare minimum to move the code.
The next commit will simplify the moved code a bit.
The previous commit just moved some code and added comments.  This
commit takes advantage of the new place, and simplifies the code, using
the paradigm from the isWORDCHAR_A clause just below.
The previous two commits moved some code into a function.  This commit
takes very similar code and replaces it with a call to that function.
When this flag is set, when an illegal identifier is found, instead of
croaking the function returns NULL.  This will allow future commits to
use this function when all that is desired is to determine if the
identifier is legal or not.
This gives a more meaningful mnemonic
This new function is for callers that are merely checking if the string
being parsed is a legal identifier or not, and arent interested in the
normalized version of the identifier that parse_indent() generates.

This new function allows callers to not have to think about this buffer;
it just wraps plain parse_ident() using a throw-away buffer to hold the
returned normalized text.  This avoids introducing a bunch of
conditionals inside parse_ident.
Instead of rolling its own version, incompletely.
@khwilliamson
Copy link
Contributor Author

Merged via ec8fb4c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hasConflicts Use merge commit Don't merge this p.r. from github It contains multiple related commits. Instructions in perlgit

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants