Commit 7606224

utf8_hop forwards: Change continuation start behavior
Prior to this commit, when hopping forwards from an initial position that is a continuation byte, the function treated that byte and each successive continuation byte as a single character until it reached a start byte, at which point it switched into normal mode. In contrast, when hopping backwards, all the consecutive continuation bytes are considered to be part of a single character (as they indeed are). Thus there is a discrepancy between forward and backward hopping, and the forward version seems wrong to me.

This commit removes the discrepancy.

There is no change in behavior if the starting position is at the beginning of a character. All calls in the core except for the API test are of this form.

But if the initial position is in the middle of a character, the hop now moves to the beginning of the next character, subtracting just 1 from the count of characters to hop (instead of subtracting however many continuation bytes there are). This is how I would have expected it to work all along.

Succinctly, getting to the next character now consumes one hop count, regardless of the direction and of which byte in the character is the starting position.
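
As a rough illustration of the difference described above, the following C sketch contrasts the old and new forward-hop logic. It is not Perl's implementation: `is_continuation` and `skip_len` are hypothetical stand-ins for `UTF8_IS_CONTINUATION` and `UTF8SKIP`, written for standard UTF-8 on an ASCII platform and assuming that a continuation byte reports a length of 1. Like `utf8_hop`, neither function checks bounds, so the caller must guarantee that the hop stays inside the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8_IS_CONTINUATION (standard UTF-8 only). */
static int is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical stand-in for UTF8SKIP: length implied by the first byte.
 * Assumption: a continuation (or otherwise invalid) byte counts as 1. */
static size_t skip_len(unsigned char c)
{
    if (c < 0x80)           return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Old behavior: each leading continuation byte uses up one hop. */
static const unsigned char *hop_forward_old(const unsigned char *s, long off)
{
    assert(s != NULL && off >= 0);
    while (off-- > 0)
        s += skip_len(*s);
    return s;
}

/* New behavior: the whole run of leading continuation bytes uses up one hop. */
static const unsigned char *hop_forward_new(const unsigned char *s, long off)
{
    assert(s != NULL && off >= 0);
    if (off > 0 && is_continuation(*s)) {
        do {
            s++;
        } while (is_continuation(*s));
        off--;
    }
    while (off-- > 0)
        s += skip_len(*s);
    return s;
}
```

For the bytes `E2 82 AC 'a' 'b'` (U+20AC followed by "ab"), starting at offset 1 and hopping 1 character, `hop_forward_old` stops at offset 2, still inside the character, while `hop_forward_new` stops at offset 3, the start of `'a'`. Starting at offset 0, both functions give the same result.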
1 parent f0cb6a0 commit 7606224

2 files changed: 37 additions, 12 deletions


ext/XS-APItest/t/utf8.t

Lines changed: 4 additions & 3 deletions
@@ -1207,9 +1207,10 @@ SKIP:
     [ $utf, $utf_ch_len * 5, -4, $utf_ch_len, "utf in range b, backward" ],
     [ $utf, $utf_ch_len * 5, 6, length($utf), "utf out of range, forward" ],
     [ $utf, $utf_ch_len * 5, -6, 0, "utf out of range, backward" ],
-    [ $bad_start, 0, 1, 1, "bad start, forward 1 from 0" ],
-    [ $bad_start, 0, $utf_ch_len-1, $utf_ch_len-1, "bad start, forward ch_len-1 from 0" ],
-    [ $bad_start, 0, $utf_ch_len, $utf_ch_len*2-1, "bad start, forward ch_len from 0" ],
+    [ $bad_start, 0, 1, $utf_ch_len-1, "bad start, forward 1 from 0" ],
+    [ $bad_start, 0, 5, 5 * $utf_ch_len-1, "bad start, forward 5 chars from 0" ],
+    [ $bad_start, 0, 9, length($bad_start)-$utf_ch_len, "bad start, forward 9 chars from 0" ],
+    [ $bad_start, 0, 10, length $bad_start, "bad start, forward 10 chars from 0" ],
     [ $bad_start, $utf_ch_len-1, -1, 0, "bad start, back 1 from first start byte" ],
     [ $bad_start, $utf_ch_len-2, -1, 0, "bad start, back 1 from before first start byte" ],
     [ $bad_start, 0, -1, 0, "bad start, back 1 from 0" ],
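
The new expected offsets in the test rows above can be checked by hand with a small bounded-hop sketch in C. Several things here are assumptions rather than facts from the commit: the definition of `$bad_start` is not shown in this hunk, so the sketch guesses that it is ten copies of a 3-byte character with the first start byte removed (which matches the expected values); `make_bad_start`, `hop_forward_bounded`, `is_cont` and `char_len` are hypothetical names; and the clamping to the end of the buffer is this sketch's own choice, not something taken from the diff.

```c
#include <assert.h>
#include <stddef.h>

enum { CH_LEN = 3, N_CHARS = 10, BAD_LEN = CH_LEN * N_CHARS - 1 };

/* Hypothetical stand-in for UTF8_IS_CONTINUATION (standard UTF-8 only). */
static int is_cont(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical stand-in for UTF8SKIP; continuation bytes count as 1. */
static size_t char_len(unsigned char c)
{
    if (c < 0x80)           return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Assumed shape of $bad_start: ten copies of U+20AC (E2 82 AC)
 * with the very first start byte dropped. */
static void make_bad_start(unsigned char out[BAD_LEN])
{
    static const unsigned char euro[CH_LEN] = { 0xE2, 0x82, 0xAC };
    for (size_t i = 0; i < (size_t)BAD_LEN; i++)
        out[i] = euro[(i + 1) % CH_LEN];
}

/* Hop forward `off` characters from byte offset `pos`, never going past
 * `len`. A leading run of continuation bytes costs exactly one hop. */
static size_t hop_forward_bounded(const unsigned char *buf, size_t len,
                                  size_t pos, size_t off)
{
    assert(buf != NULL && pos <= len);
    if (off > 0 && pos < len && is_cont(buf[pos])) {
        do {
            pos++;
        } while (pos < len && is_cont(buf[pos]));
        off--;
    }
    while (off-- > 0) {
        if (pos >= len)
            return len;
        size_t skip = char_len(buf[pos]);
        if (len - pos <= skip)      /* this hop would reach or pass the end */
            return len;
        pos += skip;
    }
    return pos;
}
```

With `CH_LEN` equal to 3, hopping 1, 5, 9 and 10 characters from offset 0 gives byte offsets 2, 14, 26 and 29, which correspond to `$utf_ch_len-1`, `5 * $utf_ch_len-1`, `length($bad_start)-$utf_ch_len` and `length $bad_start` in the test table.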

inline.h

Lines changed: 33 additions & 9 deletions
@@ -1985,18 +1985,17 @@ Perl_utf8_distance(pTHX_ const U8 *a, const U8 *b)
 =for apidoc utf8_hop

 Return the UTF-8 pointer C<s> displaced by C<off> characters, either
-forward or backward.
+forward (if C<off> is positive) or backward (if negative). C<s> does not need
+to be pointing to the starting byte of a character. If it isn't, one count of
+C<off> will be used up to get to the start of the next character for forward
+hops, and to the start of the current character for negative ones.

 WARNING: Prefer L</utf8_hop_safe> to this one.

 Do NOT use this function unless you B<know> C<off> is within
 the UTF-8 data pointed to by C<s> B<and> that on entry C<s> is aligned
 on the first byte of a character or just after the last byte of a character.

-If <off> is negative, C<s> does not need to be pointing to the starting byte of
-a character. If it isn't, one count of C<off> will be used up to get to that
-start.
-
 =cut
 */

@@ -2006,10 +2005,20 @@ Perl_utf8_hop(const U8 *s, SSize_t off)
     PERL_ARGS_ASSERT_UTF8_HOP;

     /* Note: cannot use UTF8_IS_...() too eagerly here since e.g
-     * the bitops (especially ~) can create illegal UTF-8.
+     * the XXX bitops (especially ~) can create illegal UTF-8.
      * In other words: in Perl UTF-8 is not just for Unicode. */

-    if (off >= 0) {
+    if (off > 0) {
+
+        /* Get to next non-continuation byte */
+        if (UNLIKELY(UTF8_IS_CONTINUATION(*s))) {
+            do {
+                s++;
+            }
+            while (UTF8_IS_CONTINUATION(*s));
+            off--;
+        }
+
         while (off--)
             s += UTF8SKIP(s);
     }
@@ -2020,6 +2029,7 @@ Perl_utf8_hop(const U8 *s, SSize_t off)
                 s--;
         }
     }
+
     GCC_DIAG_IGNORE(-Wcast-qual)
     return (U8 *)s;
     GCC_DIAG_RESTORE
@@ -2029,7 +2039,9 @@ Perl_utf8_hop(const U8 *s, SSize_t off)
 =for apidoc utf8_hop_forward

 Return the UTF-8 pointer C<s> displaced by up to C<off> characters,
-forward.
+forward. C<s> does not need to be pointing to the starting byte of a
+character. If it isn't, one count of C<off> will be used up to get to the
+start of the next character.

 C<off> must be non-negative.
@@ -2054,6 +2066,15 @@ Perl_utf8_hop_forward(const U8 *s, SSize_t off, const U8 *end)
     assert(s <= end);
     assert(off >= 0);

+    if (off && UNLIKELY(UTF8_IS_CONTINUATION(*s))) {
+        /* Get to next non-continuation byte */
+        do {
+            s++;
+        }
+        while (UTF8_IS_CONTINUATION(*s));
+        off--;
+    }
+
     while (off--) {
         STRLEN skip = UTF8SKIP(s);
         if ((STRLEN)(end - s) <= skip) {
@@ -2122,7 +2143,10 @@ Perl_utf8_hop_back(const U8 *s, SSize_t off, const U8 *start)
 =for apidoc utf8_hop_safe

 Return the UTF-8 pointer C<s> displaced by up to C<off> characters,
-either forward or backward.
+either forward or backward. C<s> does not need to be pointing to the starting
+byte of a character. If it isn't, one count of C<off> will be used up to get
+to the start of the next character for forward hops, and to the start of the
+current character for negative ones.

 When moving backward it will not move before C<start>.
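
The symmetry that the updated documentation describes (one count of `off` is spent reaching a character boundary in either direction) can be sketched in C as follows. This is an illustration only, not the code in `inline.h`: `hop_chars`, `cont_byte` and `seq_len` are hypothetical names, the helpers assume standard UTF-8 on an ASCII platform, and the backward loop is an assumed shape based on the commit message's description. As with `utf8_hop`, there is no bounds checking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8_IS_CONTINUATION (standard UTF-8 only). */
static int cont_byte(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical stand-in for UTF8SKIP; continuation bytes count as 1. */
static size_t seq_len(unsigned char c)
{
    if (c < 0x80)           return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Unchecked hop by `off` characters; the caller guarantees that the
 * result stays inside the buffer. */
static const unsigned char *hop_chars(const unsigned char *s, long off)
{
    assert(s != NULL);
    if (off > 0) {
        /* Mid-character start: one hop moves to the next start byte. */
        if (cont_byte(*s)) {
            do {
                s++;
            } while (cont_byte(*s));
            off--;
        }
        while (off-- > 0)
            s += seq_len(*s);
    }
    else {
        /* Each backward hop steps back over one whole character, so a
         * mid-character start lands on the current character's start. */
        while (off++ < 0) {
            s--;
            while (cont_byte(*s))
                s--;
        }
    }
    return s;
}
```

For the bytes `'a' E2 82 AC 'b'`, starting at offset 2 (inside the 3-byte character), a hop of `+1` lands on offset 4 (the start of `'b'`) and a hop of `-1` lands on offset 1 (the start of the current character). Each costs exactly one count.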
