toke.c: Generalize S_parse_ident() #23835

khwilliamson · 2025-10-12T00:55:15Z

This p.r. generalizes this function so that it can be called in more circumstances; the final commit calls it from one such place, as a demonstration.

The goal is to centralize the handling of names, such as identifiers, here, so that changes can be made in just one place and we can get rid of the other places, which have slightly different rules from each other.

Sly identifier names have been used in projects to slip in code (past reviewers) which did not do what on the surface it appeared to. Unicode has in recent releases started to address that issue, and this p.r. will allow us to use their methods.

For example, I could see some sort of 'use strict' form, whose use a perlcritic policy would enforce, that would allow parse_ident() to reject problematic names.

This set of changes does not require a perldelta entry.

happy-barney · 2025-10-12T04:19:17Z

toke.c


-    parse_ident(&s, &d, e, allow_package, is_utf8, TRUE);
+    parse_ident(&s, &d, e, is_utf8,
+                (CHECK_DOLLAR | ((allow_package) ? ALLOW_PACKAGE : 0)));


Using U32 instead of bool for allow_package would preserve the ALLOW_PACKAGE bit, so this could be just CHECK_DOLLAR | allow_package.

Is there any reason to use the bool type?

I'm not sure I understand. This code is in Perl_scan_word(). It was my original intent to make allow_package a U32 parameter to scan_word, but then I realized that is a function that is accessible outside this file, so changing parameters is problematic. And even if we could change it, that means the bit definitions for the flags parameter would have to be exported, which I did not want to do.

I meant to use ALLOW_PACKAGE (or its complement, DISALLOW_PACKAGE) as a parameter of scan_word as well, but on second thought, let's focus on parse_ident only. Marking conversation as resolved.

But my point is we can't change the scan_word parameters. I forgot that and had to revert with #23828
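The constraint discussed above can be sketched in isolation: an externally visible function keeps its bool parameter, and the bool is translated into a flag bit only at the internal call. This is a minimal sketch, not the code in toke.c; `toy_scan_word` and `toy_parse_ident` are hypothetical stand-ins, and only the two flag names and their bit positions are taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* File-private flag bits. The names and bit positions follow the diff in
 * this p.r.; everything else in this block is an illustrative assumption. */
#define ALLOW_PACKAGE (1u << 1)
#define CHECK_DOLLAR  (1u << 2)

/* Hypothetical stand-in for the internal parser: it only reports which
 * flags it was given. */
static uint32_t toy_parse_ident(uint32_t flags)
{
    /* Reject bits this toy does not know about. */
    assert((flags & ~(ALLOW_PACKAGE | CHECK_DOLLAR)) == 0);
    return flags;
}

/* Hypothetical stand-in for an externally visible function. Its signature
 * must not change, so it still takes a bool. */
uint32_t toy_scan_word(bool allow_package)
{
    /* The bool becomes a bit here, so the flag definitions never have to
     * appear in a public header. */
    return toy_parse_ident(CHECK_DOLLAR | (allow_package ? ALLOW_PACKAGE : 0));
}
```

Note that OR-ing the bool in directly would not work with these definitions: a true bool has the value 1, which is bit 0, whereas ALLOW_PACKAGE is bit 1.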

happy-barney · 2025-10-12T04:41:33Z

toke.c

 #define ALLOW_PACKAGE               (1 << 1)
 #define CHECK_DOLLAR                (1 << 2)
 #define IDFIRST_ONLY                (1 << 3)
+#define STOP_AT_FIRST_NON_DIGIT     (1 << 4)


Just note:
I'd suggest NUMERIC_ONLY - it is shorter, and it is in fact what the user of this function wants: a numeric identifier

On further consideration, NUMERIC_ONLY isn't an accurate name. What happens with this flag is if the first character is a digit, then it stops parsing when it finds a non-digit. The function still accepts the first character not being a digit, and parses that as usual. I'm open to a better name, but I think your suggestion is less accurate.
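The behavior described in that reply can be modeled with a small ASCII-only toy. This is a sketch under stated assumptions, not S_parse_ident: `toy_ident_len` is a hypothetical helper, it ignores UTF-8 and package separators entirely, and it relies on the default "C" locale for `isdigit` and `isalnum`. Only the flag name and its bit position come from the diff.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define STOP_AT_FIRST_NON_DIGIT (1u << 4)   /* name and bit from the diff */

/* Hypothetical toy: returns how many of the first len bytes of s would be
 * taken as the identifier. */
static size_t toy_ident_len(const char *s, size_t len, uint32_t flags)
{
    size_t i = 0;

    assert(s != NULL);

    /* With the flag set, a leading digit means "digits only from here". */
    if (   i < len
        && (flags & STOP_AT_FIRST_NON_DIGIT)
        && isdigit((unsigned char) s[0]))
    {
        while (i < len && isdigit((unsigned char) s[i]))
            i++;
        return i;
    }

    /* Otherwise parse as usual: a run of ASCII word characters. */
    while (i < len && (isalnum((unsigned char) s[i]) || s[i] == '_'))
        i++;
    return i;
}
```

In this model, `123abc+` yields a length of 3 with the flag and 6 without it, while `A0000B000+` yields 9 either way, because the flag only matters when the first character is a digit.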

yes, function behaves like that ... but is it expected behaviour?
Extracts from function which bug me:

    assert((stop_at_first_non_digit & idfirst_only) == 0);
    while (s < send) {
        if (   (advance = isIDFIRST_lazy_if_safe(s, send, is_utf8))
            && (is_utf8 || idfirst_only))
        { ... }
        else if (stop_at_first_non_digit && isDIGIT_A(*s)) { ... }
        else if (! idfirst_only && isWORDCHAR_A(*s) ) { ... }
        else if ( allow_package ... ) { handle package separator }
    }

The assert forbids setting both idfirst_only and stop_at_first_non_digit.

I think this function doesn't do what it should do, let me show some use cases. Use cases will contain:

ASCII - when is_utf8 is false

UTF8 - when is_utf8 is true

STOP - when stop_... is set to true

IDFIRST - when idfirst_only is set to true

¥ - represents character which is idfirst, is wordchar; but is not idcont

ASCII; STOP; string A0000B000+

first outer loop iteration

advance condition is false (idfirst is incompatible with STOP, assured by assert)

stop_ condition is false ('A' is not an ASCII digit)

! idfirst_only condition is true (A is a word character)

its inner condition consumes A0000B000

second outer loop iteration

all conditions fail; loop breaks

Returned value: A0000B000

UTF8; STOP; string A0000B000+

first outer loop iteration

advance condition is true

its inner condition consumes A0000B000 on first run

second outer loop iteration

all conditions fail; loop breaks

Returned value: A0000B000

ASCII; STOP; string ¥abcd¥0000+

first outer loop iteration

advance condition is false (idfirst is incompatible with STOP, assured by assert)

stop_ condition is false ('¥' is not an ASCII digit)

! idfirst_only condition is true (A is word character)

its inner condition consumes ¥abcd¥0000

second outer loop iteration

all conditions fail; loop breaks

Returned value: ¥abcd¥0000

UTF8; STOP; string ¥abcd¥0000+

first outer loop iteration

advance condition is true (is utf8 && is IDFIRST)

its inner condition consumes ¥abcd

second outer loop iteration

advance condition is true (is utf8 && is IDFIRST)

its inner condition consumes ¥0000

third outer loop iteration

all conditions fail; loop breaks

Returned value: ¥abcd¥0000

Your final two examples are wrong. ¥ is neither a word character nor IDFirst. In example three you mention A, but there is no such letter in that example. I think you meant ¥. So in both cases, it consumes nothing. I think it gives the right answer in all 4 cases.

I know ¥ is neither IDFirst nor IDCont. I used it as a placeholder for a character which is IDFirst but not IDCont. And sorry for the typo in the example, you are right.

and clarification: this discussion should not be treated as one preventing merge.

Ah, I understand now what you meant.

There are no characters that are IDFirst and not IDCont. This is guaranteed by Unicode.

ID_Continue characters include ID_Start characters, plus other characters

(paraphrased from https://www.unicode.org/reports/tr31). You can check this with

perl -le 'use re qw(Debug COMPILE); qr/(?[ \p{IDS} - \p{IDC} ])/'

It yields

Final program:
1: OPFAIL (3)
3: END (0)

meaning that nothing matches the set of characters that are IDS but not IDC.

Note the slight differences in terminology. Our macro uses 'first', not 'start', for historical reasons.

We also use XIDS and XIDC, which are more modern versions of IDS and IDC, but the same guarantee applies.

Hence your final two examples would show a flaw in the algorithm if Unicode didn't forbid these situations from occurring.

So it looks like my mistake was in terminology. Having experience writing parsers for different languages (e.g. one that supports currency symbols as identifier characters), I understood the macros as terms in the Perl domain, not in the Unicode domain.

toke.c

embed.fnc

Prior to this commit, the string passed to this function had to be pointing to somewhere in PL_bufptr. But this is only because it assumed that the initial position is less than PL_bufend. By passing the upper bound in, that assumption is automatically removed.

This removes the test for the identifier being too long at the beginning of the loop. Almost all the branches already have their own checks for this, and by restructuring those slightly, this one becomes extraneous, except for the one branch that didn't have a check. And it turns out this was wrong for that branch, which is the one that turns an apostrophe package separator into ::. The removed code assumed that the output doesn't expand, whereas each single apostrophe is in fact replaced by two characters. This commit corrects that. And it consolidates the handling to the target of a goto at the end of the function. Right now, there is no real advantage in that, but a future commit will make the handling more complicated, so a single point for it will be useful.
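The expansion problem described in that commit message can be illustrated with a toy copy routine. This is a minimal sketch and not the code in toke.c: `toy_normalize` is a hypothetical helper that only handles the apostrophe case, checks the space needed before each write (two bytes for an apostrophe, one otherwise), and routes every overflow to a single label at the end of the function.

```c
#include <stddef.h>

/* Hypothetical toy: copy len bytes of src into dst (capacity cap, counting
 * the terminating NUL), replacing each apostrophe with "::". Returns the
 * output length, or -1 if the normalized text would not fit. */
static long toy_normalize(const char *src, size_t len, char *dst, size_t cap)
{
    size_t out = 0;

    if (cap == 0)
        return -1;

    for (size_t i = 0; i < len; i++) {
        /* An apostrophe grows to two output bytes; anything else to one. */
        size_t need = (src[i] == '\'') ? 2 : 1;

        /* Leave room for the trailing NUL as well. */
        if (out + need + 1 > cap)
            goto too_long;

        if (src[i] == '\'') {
            dst[out++] = ':';
            dst[out++] = ':';
        }
        else {
            dst[out++] = src[i];
        }
    }

    dst[out] = '\0';
    return (long) out;

  too_long:
    /* Single place to handle overflow, whichever branch detected it. */
    dst[0] = '\0';
    return -1;
}
```

A check made against the input length alone would accept `a'b` for a 4-byte buffer, since the input plus a NUL is 4 bytes, even though the normalized form `a::b` needs 5.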

These branches differ only in 1) part of the conditions that select them, so combine those conditions together, and 2) the number of bytes to advance, which is easily determined. Otherwise they are identical, so the code is easier to understand if they are made common.

An identifier parsed by this function can include the ones most people would expect, but also ones that begin with a digit followed by ASCII \w characters. This commit adds a flag so that the function doesn't recognize the latter type as an identifier.

S_scan_ident accepts another type of identifier: all digits. This moves the code that looks for those into S_parse_ident, adding a flag to accept them in that function. This adds a bit of complexity to S_parse_ident, removing equivalent complexity, and a bit more from S_scan_ident. Future commits will remove more complexity. This commit does just a bit beyond the bare minimum to move the code. The next commit will simplify the moved code a bit.

When this flag is set and an illegal identifier is found, the function returns NULL instead of croaking. This will allow future commits to use this function when all that is desired is to determine if the identifier is legal or not.
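The pattern in that commit message, one parser with a flag that chooses between failing hard and returning NULL, might look roughly like this. It is a sketch under stated assumptions: the flag name `TOY_RETURN_NULL_ON_ERROR` and the function `toy_parse` are hypothetical, the legality rule is a deliberately simple ASCII one, and `exit()` merely stands in for croaking.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TOY_RETURN_NULL_ON_ERROR (1u << 0)   /* hypothetical flag name */

/* Hypothetical toy: an identifier is legal here if it starts with an ASCII
 * letter or underscore. Returns a pointer just past the identifier. On an
 * illegal identifier it returns NULL if the flag is set, and otherwise
 * terminates the program. */
static const char *toy_parse(const char *s, const char *send, uint32_t flags)
{
    assert(s != NULL && send != NULL && s <= send);

    if (s == send || !(isalpha((unsigned char) *s) || *s == '_')) {
        if (flags & TOY_RETURN_NULL_ON_ERROR)
            return NULL;              /* caller only wants a yes/no answer */
        fprintf(stderr, "Illegal identifier\n");
        exit(EXIT_FAILURE);           /* stand-in for croaking */
    }

    while (s < send && (isalnum((unsigned char) *s) || *s == '_'))
        s++;
    return s;
}
```

Callers that need the hard failure pass no flag and never have to test the result; callers that are only probing pass the flag and compare against NULL.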

This new function is for callers that are merely checking if the string being parsed is a legal identifier or not, and aren't interested in the normalized version of the identifier that parse_ident() generates. This new function allows callers to not have to think about this buffer; it just wraps plain parse_ident() using a throw-away buffer to hold the returned normalized text. This avoids introducing a bunch of conditionals inside parse_ident.
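The wrapper idea can be sketched as follows. This is an illustration only: `toy_parse_into`, `toy_is_ident`, and `TOY_MAX_IDENT` are hypothetical names, the size 256 is an arbitrary choice for the example, and the parsing rule is a simple ASCII one rather than Perl's.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_IDENT 256   /* hypothetical size, chosen for this example */

/* Hypothetical toy parser: copies the identifier at the start of s into
 * dst, NUL-terminates it, and returns the number of bytes consumed.
 * Returns 0 if there is no identifier or it does not fit. */
static size_t toy_parse_into(const char *s, size_t len, char *dst, size_t cap)
{
    size_t n = 0;

    assert(cap > 0);

    if (len == 0 || isdigit((unsigned char) s[0])) {
        dst[0] = '\0';
        return 0;
    }

    while (n < len && (isalnum((unsigned char) s[n]) || s[n] == '_')) {
        if (n + 1 >= cap) {           /* no room for this byte plus the NUL */
            dst[0] = '\0';
            return 0;
        }
        dst[n] = s[n];
        n++;
    }

    dst[n] = '\0';
    return n;
}

/* Validity-only wrapper: the caller never sees the output buffer. */
static bool toy_is_ident(const char *s, size_t len)
{
    char scratch[TOY_MAX_IDENT];      /* throw-away normalized text */

    return len > 0 && toy_parse_into(s, len, scratch, sizeof scratch) == len;
}
```

Keeping the buffer inside the wrapper means the parser itself needs no "is there an output buffer?" conditionals, which is the trade-off the commit message describes.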

khwilliamson · 2025-10-17T18:46:00Z

Merged via ec8fb4c

khwilliamson requested a review from leonerd October 12, 2025 00:55

khwilliamson added the "Use merge commit" label (Don't merge this p.r. from github. It contains multiple related commits. Instructions in perlgit) Oct 12, 2025

happy-barney reviewed Oct 12, 2025

github-actions bot added the hasConflicts label Oct 12, 2025

khwilliamson force-pushed the parse_ident branch 2 times, most recently from 5f37df6 to 25a31f4 on October 12, 2025 21:28

khwilliamson removed the hasConflicts label Oct 12, 2025

khwilliamson added 22 commits October 16, 2025 07:52

toke.c: White space/comments only

3ad7102

This does better vertical alignment, and fixes a typo in a comment that led to it being misleading

S_parse_ident is not inline

248bf8c

So don't mark it as such in its definition. embed.fnc already does not say it is inline. It is too complicated to be inline.

toke.c: Simplify two ternaries

c13ae23

I don't know why these required ternaries; perhaps a bug in a C89 compiler.

toke.c: Use flags parameter for S_force_word

a3ce8ca

This makes it clear at each call point what is happening, instead of having to jump to the S_force_word definition to know what 'false, true' vs 'true, false' actually means. And this prepares for future commits.

toke.c: Document S_parse_ident

a446c7c

toke.c: Use flags parameter to S_parse_ident

d057bb1

This makes it clearer at each call point what is happening, and prepares for future commits where more flags will be passed to this function.

parse_ident: Add const to two parameters

e1e09d4

This function doesn't change anything in the string delimited by these parameters, and future commits will call it with const strings that otherwise would have to cast away const.

parse_ident: Use variable to avoid expression duplication

0b7f4ec

And make it clearer

S_parse_ident: Terminate output with NUL

1e51d75

All calls to it did this termination individually. Better to do it in one place

parse_ident: Use constant when known.

ede1f1d

We know the contents of *s here; it is a colon. Just use that instead of dereferencing.

S_parse_ident: Restructure loop

7342473

The loop is refactored to eliminate an assignment at the end, and I think it is slightly clearer. But more importantly, it prepares for future commits. There is some extra indentation that will make sense when those commits are done

toke.c: S_parse_ident: simplify

f1e7c09

The previous commit just moved some code and added comments. This commit takes advantage of the new place, and simplifies the code, using the paradigm from the isWORDCHAR_A clause just below.

S_scan_ident: Replace redundant code with function call

9082b27

The previous two commits moved some code into a function. This commit takes very similar code and replaces it with a call to that function.

Add #define for the maximum Perl identifier length

bd66bfa

This gives a more meaningful mnemonic

Perl_valid_identifier_pve: Use parse_ident

3c5916e

Instead of rolling its own version, incompletely.

khwilliamson force-pushed the parse_ident branch from 25a31f4 to 3c5916e on October 16, 2025 14:06

github-actions bot added the hasConflicts label Oct 17, 2025

khwilliamson closed this Oct 17, 2025

Leont mentioned this pull request Oct 18, 2025

BBC: Blead Breaks HTTP::Exception #23860

Closed

jkeenan mentioned this pull request Oct 22, 2025

ec8fb4ca44 breaks -DDEBUGGING builds #23854

Open
