
Proper handling of UTF-8 character in bitwise xor when using $1 #23552


Open
wants to merge 1 commit into
base: blead
53 changes: 52 additions & 1 deletion t/op/bop.t
@@ -19,7 +19,7 @@ use warnings;
# If you find tests are failing, please try adding names to tests to track
# down where the failure is, and supply your new names as a patch.
# (Just-in-time test naming)
-plan tests => 510 + 6 * 2;
+plan tests => 512 + 6 * 2;

The commit message is misleading.

The title is "Proper handling of UTF-8 character in bitwise xor when using $1", but this commit doesn't fix anything.
It adds tests for an issue that was already fixed.

Ideally it would also link to the commit that fixed the issue, but looking at #9972 that might not be easy to find, because it was fixed in separate steps (#9972 (comment)), so it might not be worth the effort.

At the very least, adding something to the summary based on #9972 (comment) would be good, so that there is a reference to when it was fixed (from a quick glance at the ticket: partially between 5.8 and 5.12, and fully between 5.12 and 5.14).


# numerics
ok ((0xdead & 0xbeef) == 0x9ead);
@@ -725,3 +725,54 @@ EOS
'',
{}, "[perl #17844] access beyond end of block");
}

{
# GH #9972 (previously [perl #70652])

my $warn = 0;
use strict;
use warnings;
local $SIG{__WARN__} = sub { $warn++ };

my $unicodestring = "\x{5454}\x{6655}";
my $normalstring = "0\36\4\13\200\0\31V\3\0\320\225\342\26\365\4\0\240\r\2\3\0\242_\2\1\0\2\1\0000\0\b\b\b\b\b\b\b\b";
my $iv = "\246\205\236\367]\257\304\276";

# First we need $1 to be unicode, otherwise the bug won't occur
$unicodestring =~ m/(.)/;

my @t;

# $1 is assigned but not yet unicode: UTF8-Flag ($1)
push @t, utf8::is_utf8 ($1);

# After we copy $1 the Flag is on: UTF8-Flag ($1)
my $copy = $1;
push @t, utf8::is_utf8 ($1);
Comment on lines +746 to +751

This is confusing and contradictory.

These are the first two items in @t and should match the first two items in the $exp arrayref.
The $exp arrayref starts with: [1, 1, ...]
So both of the utf8::is_utf8 calls return 1?
Based on the comments in the block I would have expected these to be [0, 1, ...].
The first comment says "but not yet unicode" while the second says "the Flag is on", which to me implies a difference.
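
One way to settle what the flag does at each step is to assert on it directly instead of relying on the comments. This is only a minimal sketch: it assumes Test::More rather than the core t/test.pl harness, and it assumes a perl on which utf8::is_utf8 applies get-magic to $1 (the [1, 1, ...] in $exp suggests that is the case on blead). The variable names $before_copy, $after_copy and $copy_flag are made up for illustration.

```perl
use strict;
use warnings;
use Test::More tests => 2;

my $unicodestring = "\x{5454}\x{6655}";
$unicodestring =~ m/(.)/;

# Flag on $1 itself, before anything has copied it.
my $before_copy = utf8::is_utf8($1) ? 1 : 0;

my $copy = $1;

# Flag on $1 after the copy, and on the copy.
my $after_copy = utf8::is_utf8($1)    ? 1 : 0;
my $copy_flag  = utf8::is_utf8($copy) ? 1 : 0;

# If these two differ, the "not yet unicode" comment describes real behaviour;
# if they agree, the comment is stale and should be reworded.
is($before_copy, $after_copy, 'copying $1 does not change its UTF-8 flag');
is($copy_flag, 1, 'the copy of $1 carries the UTF-8 flag');
```

If the first assertion passes, the two comments in the block describe the same state and one of them should be updated to say so.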


# Now we take 8 Bytes of a normal string with m/(.{8})/
push @t, utf8::is_utf8 ($normalstring);

$normalstring =~ m/(.{8})/;

# The UTF-8 Flag of $1 is still on: UTF8-Flag ($1)
push @t, utf8::is_utf8 ($1);
# We have a second value called ($iv) without an UTF-8 Flag : UTF8-Flag ($iv)
push @t, utf8::is_utf8 ($iv);

# Now the UTF-8 Flag of $1 is off: UTF8-Flag ($1)
push @t, utf8::is_utf8 ($1);

my $x = $1 ^ $iv;
# $1 is now not UTF-8 anymore UTF8-Flag ($1)
push @t, utf8::is_utf8 ($1);
# $x is now UTF-8: UTF8-Flag ($x)
push @t, utf8::is_utf8 ($x);
# $iv suddenly is also UTF-8: UTF8-Flag ($iv)
push @t, utf8::is_utf8 ($iv);

ok(! $warn, "No warnings in this block");
my $got = [@t];
my $exp = [1, 1, "", "", "", "", "", "", ""];
ok( eq_array($got, $exp), "GH 9972: no malformed UTF-8 character in bitwise xor");
Comment on lines +775 to +777

Personally I'm not a fan of this construct.
I think it would be much nicer to replace (for example):

   # Now we take 8 Bytes of a normal string with m/(.{8})/
    push @t, utf8::is_utf8 ($normalstring);

with:

   # Now we take 8 Bytes of a normal string with m/(.{8})/
    is(utf8::is_utf8 ($normalstring) ,1, "\$normalstring has the UTF-8 flag set");

As it stands, everything ends up in one big eq_array at the end, which makes it difficult to trace a failure back to the step that caused it.
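
Applied to the later steps of the block, the per-step style could look roughly like the sketch below. It is an illustration, not the patch itself: it assumes Test::More instead of t/test.pl, uses a shortened stand-in for $normalstring, and takes its expectations from the $exp array in the patch (false for every check after the first two). The test names are invented.

```perl
use strict;
use warnings;
use Test::More tests => 5;

my $warn = 0;
local $SIG{__WARN__} = sub { $warn++ };

# Shortened stand-in for the byte string used in the patch (assumption).
my $normalstring = "0\36\4\13\200\0\31V\3\0";
my $iv           = "\246\205\236\367]\257\304\276";

# Leave $1 holding a wide character first, as the patch does.
my $unicodestring = "\x{5454}\x{6655}";
$unicodestring =~ m/(.)/;

ok(!utf8::is_utf8($normalstring), '$normalstring does not have the UTF-8 flag set');

# Take 8 bytes of the byte string, which replaces the wide character in $1.
$normalstring =~ m/(.{8})/;
ok(!utf8::is_utf8($1), '$1 has no UTF-8 flag after matching a byte string');

my $x = $1 ^ $iv;
ok(!utf8::is_utf8($x),  'result of the xor is not flagged as UTF-8');
ok(!utf8::is_utf8($iv), '$iv is still not flagged as UTF-8 after the xor');
ok(!$warn, 'no warnings were raised');
```

With one named assertion per step, a failure report points at the exact check that regressed rather than at a nine-element array comparison.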

}