Commit d8c30f3

toke.c: Fix inconsistency under 'use utf8'
This code can't work properly:

    if (UTF ? isIDFIRST_utf8((U8*)s+1) : isWORDCHAR_A(s[1]))

Suppose you have a string composed entirely of ASCII characters beginning with a digit. If the string isn't encoded in UTF-8, the condition is true, but it is false if the string happens to have the UTF-8 flag set for whatever reason. One such reason is simply that the Perl program is being compiled under 'use utf8'.

The UTF-8 flag should not change the behavior of ASCII strings.

The code was introduced in 9d58dbc in 2015, to fix [perl #123963] "@<fullwidth digit>". The line it replaced was

    if (isWORDCHAR_lazy_if(s+1,UTF))

(The code was modified in 2016 by fac0f7a as part of a global substitution to use isIDFIRST_utf8_safe(), so as to have no possibility of going off the end of the buffer, but that did not affect the logic.)

The problem the original commit was trying to solve was that fullwidth digits (U+FF10 etc.) were accepted when they shouldn't be, whereas [0-9] should remain accepted. The defect is that [0-9] stopped being accepted when the UTF-8 flag is on.

The solution is to change the condition to

    if (isDIGIT_A(s[1]) || isIDFIRST_lazy_if_safe(s+1, send, UTF))

This causes [0-9] to remain accepted regardless of the UTF-8 flag. So when the flag is on, the only difference between before this commit and after is that [0-9] are accepted. In the ASCII range, the only difference between \w and IDFirst is that the former includes the digits 0-9, so when the UTF-8 flag is off this evaluates to isWORDCHAR_A, as before.

(Changing to isIDFIRST from isWORDCHAR in the original commit did solve a bunch of other cases where a \w is not supposed to be the first character in a name. There are about 4K such characters currently in Unicode.)
1 parent 1d3e75e commit d8c30f3

3 files changed: +15 -5 lines changed

pod/perldelta.pod

Lines changed: 5 additions & 0 deletions
@@ -392,6 +392,11 @@ consisted of only ASCII characters. The real upper limit was as few as
 Chinese or Osage. Now an identifier in any language may contain at
 least 255 characters.
 
+=item *
+
+Fixed parsing of array names starting with a digit in double-quotish
+context under C<use utf8;>.
+
 =back
 
 =head1 Known Problems

t/op/sub_lval.t

Lines changed: 9 additions & 1 deletion
@@ -5,7 +5,7 @@ BEGIN {
     require './test.pl';
     set_up_inc('../lib');
 }
-plan tests=>213;
+plan tests=>215;
 
 sub a : lvalue { my $a = 34; ${\(bless \$a)} } # Return a temporary
 sub b : lvalue { ${\shift} }
@@ -1041,6 +1041,14 @@ sub else119797 : lvalue {
 eval { (else119797(0)) = 1..3 };
 is $@, "", '$@ after writing to array returned by else';
 is "@119797", "1 2 3", 'writing to array returned by else';
+
+{   # Being in UTF-8 used to break this
+    use utf8;
+    eval { (else119797(0)) = 1..3 };
+    is $@, "", '$@ after writing to array returned by else';
+    is "@119797", "1 2 3", 'writing to array returned by else';
+}
+
 eval { (else119797(1)) = 4..6 };
 is $@, "", '$@ after writing to array returned by if (with else)';
 is "@119797", "4 5 6", 'writing to array returned by if (with else)';

toke.c

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -3713,10 +3713,7 @@ S_scan_const(pTHX_ char *start)
37133713
* (@foo, @::foo, @'foo, @{foo}, @$foo, @+, @-)
37143714
*/
37153715
else if (*s == '@' && s[1]) {
3716-
if (UTF
3717-
? isIDFIRST_utf8_safe(s+1, send)
3718-
: isWORDCHAR_A(s[1]))
3719-
{
3716+
if (isDIGIT_A(s[1]) || isIDFIRST_lazy_if_safe(s+1, send, UTF)) {
37203717
break;
37213718
}
37223719
if (memCHRs(":'{$", s[1]))
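The ASCII-range argument in the commit message can be checked with a small stand-alone model. This is only a sketch: the helpers `is_digit_a`, `is_idfirst_a`, `is_wordchar_a`, `old_cond`, `new_cond`, and `count_flag_dependent` are hypothetical stand-ins written for illustration, not Perl's real macros, and the model covers only bytes 0 to 127 (it says nothing about multi-byte UTF-8 characters such as the fullwidth digits).

```c
#include <stdbool.h>

/* Hypothetical ASCII-only stand-ins for the character classes named in
 * the commit message. They are NOT Perl's macros. */
bool is_digit_a(unsigned char c) { return c >= '0' && c <= '9'; }

bool is_idfirst_a(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

/* In the ASCII range, \w is IDFirst plus the digits 0-9. */
bool is_wordchar_a(unsigned char c)
{
    return is_idfirst_a(c) || is_digit_a(c);
}

/* Model of the old condition: the class tested depends on the flag. */
bool old_cond(unsigned char c, bool utf8)
{
    return utf8 ? is_idfirst_a(c) : is_wordchar_a(c);
}

/* Model of the new condition: digits are accepted up front, so for
 * ASCII bytes the flag no longer matters. */
bool new_cond(unsigned char c, bool utf8)
{
    (void)utf8; /* both branches reduce to IDFirst for ASCII bytes */
    return is_digit_a(c) || is_idfirst_a(c);
}

/* Count the ASCII bytes whose result changes with the UTF-8 flag. */
int count_flag_dependent(bool (*cond)(unsigned char, bool))
{
    int n = 0;
    for (int c = 0; c < 128; c++) {
        if (cond((unsigned char)c, false) != cond((unsigned char)c, true))
            n++;
    }
    return n;
}
```

Under these assumptions, `count_flag_dependent(old_cond)` returns 10 (exactly the digits 0-9) and `count_flag_dependent(new_cond)` returns 0, which matches the claim that the UTF-8 flag no longer changes the result for ASCII input.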