Commit 2cb0034

Unroll valid_utf8_to_uv loop
This gives a bit of performance boost in this function that can be
called during pattern matching.

Here are some cachegrind comparisons with blead:

Key:
    Ir   Instruction read
    Dr   Data read
    Dw   Data write
    COND conditional branches
    IND  indirect branches

The numbers represent relative counts per loop iteration, compared to
blead at 100.0%.
Higher is better: for example, using half as many instructions gives
200%, while using twice as many gives 50%.

              GCC                        CLANG

valid_utf8_to_uv(0x007f), length is 1

        blead   hacked            blead   hacked
       ------   ------           ------   ------
    Ir 100.00   100.69        Ir 100.00    99.11
    Dr 100.00   101.47        Dr 100.00    99.74
    Dw 100.00   100.00        Dw 100.00    99.57
  COND 100.00   101.20      COND 100.00   100.00
   IND 100.00   100.00       IND 100.00    94.12

valid_utf8_to_uv(0x07ff), length is 2

        blead   hacked            blead   hacked
       ------   ------           ------   ------
    Ir 100.00   100.68        Ir 100.00    99.04
    Dr 100.00   101.47        Dr 100.00    99.74
    Dw 100.00   100.00        Dw 100.00    99.57
  COND 100.00   102.40      COND 100.00   101.23
   IND 100.00   100.00       IND 100.00    94.12

valid_utf8_to_uv(0xfffd), length is 3

        blead   hacked            blead   hacked
       ------   ------           ------   ------
    Ir 100.00   100.83        Ir 100.00    99.04
    Dr 100.00   101.47        Dr 100.00    99.75
    Dw 100.00   100.00        Dw 100.00    99.57
  COND 100.00   102.99      COND 100.00   101.84
   IND 100.00   100.00       IND 100.00    94.12

valid_utf8_to_uv(0xffffd), length is 4

        blead   hacked            blead   hacked
       ------   ------           ------   ------
    Ir 100.00   100.91        Ir 100.00    99.13
    Dr 100.00   101.46        Dr 100.00    99.75
    Dw 100.00   100.00        Dw 100.00    99.57
  COND 100.00   103.59      COND 100.00   102.45
   IND 100.00   100.00       IND 100.00    94.12

valid_utf8_to_uv(0x3ffffff), length is 5

        blead   hacked            blead   hacked
       ------   ------           ------   ------
    Ir 100.00   101.28        Ir 100.00    99.29
    Dr 100.00   101.46        Dr 100.00    99.75
    Dw 100.00   100.00        Dw 100.00    99.57
  COND 100.00   104.19      COND 100.00   103.07
   IND 100.00   100.00       IND 100.00    94.12

valid_utf8_to_uv(0x7fffffff), length is 6

        blead   hacked            blead   hacked
       ------   ------           ------   ------
    Ir 100.00    89.83        Ir 100.00    88.83
    Dr 100.00    95.22        Dr 100.00    92.94
    Dw 100.00    92.44        Dw 100.00    91.63
  COND 100.00    86.21      COND 100.00    87.11
   IND 100.00   100.00       IND 100.00    88.89

Clang gives slightly worse results than gcc.  But there is an
improvement in both cases for conditionals for two-byte and longer
characters.

This shows that the performance is significantly worse for code points
that take 6 bytes (or more, which I didn't include) to represent.  These
are all well outside the Unicode range; hence are very rarely
encountered.  Performance is improved a bit for the typical cases.

The algorithm used could handle 6 and 7 byte characters, but that
increases memory usage, and can lead to the compiler choosing to not
inline this function.  In blead, experiments with clang gave these
results:

    Max bytes inlined    Instances in the code where not inlined
            3                            14
            4                            19
            5                            19
            6                            19
            7                            57

We really need to accommodate any Unicode code point, which is 4 bytes (5
on EBCDIC).  But the others we don't care about.  Even though inlining
up to 6 bytes doesn't show as being worse than 4 in that table, I chose
to not include it, because we don't care about performance for these
rare non-Unicode code points, and it just might cause non-inlining for
different compilers or clang versions.
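The relative figures in the tables above can be reproduced from raw cachegrind counts. The following is a minimal sketch of that arithmetic, assuming the percentage is the blead count divided by the hacked count, which is what the "half as many gives 200%" rule in the key implies. The function name relative_percent is hypothetical and is not part of any Perl benchmarking tool.

```c
#include <assert.h>

/* Relative score as described in the key: baseline (blead) count over the
 * new (hacked) count, expressed as a percentage.  Values above 100 mean the
 * hacked build used fewer events than blead.  Hypothetical helper, for
 * illustration only. */
double relative_percent(double blead_count, double hacked_count)
{
    assert(hacked_count > 0.0);
    return 100.0 * blead_count / hacked_count;
}
```

With this definition, a hacked build that executes half as many instructions as blead scores 200, and one that executes twice as many scores 50.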
1 parent 367e634 commit 2cb0034

5 files changed

+112
-20
lines changed

embed.fnc

Lines changed: 3 additions & 0 deletions
@@ -1858,6 +1858,9 @@ CTopr |void |locale_panic |NN const char *msg \
 : Used in perly.y
 p |OP * |localize |NN OP *o \
 |I32 lex
+CTp |UV |long_valid_utf8_to_uv \
+ |NN const U8 * const s \
+ |NN const U8 * const e
 ARdp |I32 |looks_like_number \
 |NN SV * const sv
 CRTip |unsigned|lsbit_pos32 |U32 word

embed.h

Lines changed: 1 addition & 0 deletions
@@ -358,6 +358,7 @@
 # define lex_stuff_pvn(a,b,c) Perl_lex_stuff_pvn(aTHX_ a,b,c)
 # define lex_stuff_sv(a,b) Perl_lex_stuff_sv(aTHX_ a,b)
 # define lex_unstuff(a) Perl_lex_unstuff(aTHX_ a)
+# define long_valid_utf8_to_uv Perl_long_valid_utf8_to_uv
 # define looks_like_number(a) Perl_looks_like_number(aTHX_ a)
 # define lsbit_pos32 Perl_lsbit_pos32
 # define magic_dump(a) Perl_magic_dump(aTHX_ a)

inline.h

Lines changed: 88 additions & 20 deletions
@@ -1334,31 +1334,99 @@ Perl_valid_utf8_to_uv(const U8 *s, STRLEN *retlen)
 
     const UV expectlen = UTF8SKIP(s);
     ASSUME(inRANGE(expectlen, 1, UTF8_MAXBYTES));
-    const U8* send = s + expectlen;
-    UV uv = *s;
+    UV uv = 0;
 
-    if (retlen) {
-        *retlen = expectlen;
-    }
-
-    /* An invariant is trivially returned */
-    if (expectlen == 1) {
-        return uv;
+    /* Note that this is branchless except for the switch() jump table, and
+     * checking that the caller wants a *retlen returned.
+     *
+     * There is wasted effort for length 1 inputs of initializing 'uv' to 0
+     * and calculating 'full_shift' (unless the compiler optimizes that out).
+     * Benchmarks indicate this is acceptable.
+     * See GH #23690 */
+
+    /* Consider a 4-byte UTF-8-encoded character.  On ASCII platforms it looks
+     * like:
+     *      1st Byte  2nd Byte  3rd Byte  4th Byte
+     *      1111 0ddd 10cc cccc 10bb bbbb 10aa aaaa
+     *
+     * And the code point it represents is dddccccccbbbbbbaaaaaa
+     * Each continuation byte contributes its lower 6 bits to the total.  For
+     * generality call that number 'L'.
+     *
+     * You get that code point by masking off the top bits of each byte, then
+     * or'ing together:
+     *      the start byte shifted left by 3*L bits,
+     *      with byte [1]  shifted left by 2*L bits
+     *      with byte [2]  shifted left by 1*L bits
+     *      with byte [3]  shifted left by 0*L bits
+     *
+     * The order is immaterial, so we can rewrite that as
+     *      'or' together byte [3] shifted left by 0*L bits
+     *               with byte [2] shifted left by 1*L bits
+     *               with byte [1] shifted left by 2*L bits
+     *               with byte [0] shifted left by 3*L bits,
+     *
+     * All share the paradigm that for byte n you mask off the top bits and
+     * shift the remainder left by (4 - 1 - n) * L bits.  So we get
+     *      (s[n] & mask) << (4 - 1 - n) * L
+     * For a three-byte character it would be
+     *      (s[n] & mask) << (3 - 1 - n) * L
+     * and generally
+     *      (s[n] & mask) << (expectlen - 1 - n) * L
+     * which can be rewritten
+     *      (s[n] & mask) << (expectlen - 1) * L - nL
+     * Calculate the term once that isn't compile-time constant and is the same
+     * for all n */
+    U8 full_shift = (expectlen - 1) * UTF_ACCUMULATION_SHIFT;
+
+    /* Then create a macro that does the full calculation given n.  For EBCDIC,
+     * we need to transform s[n] to I8 */
+#define PERL_VALID_UTF8_NEXT_ACCUMULATION(n)                                \
+    (( (UV) ( NATIVE_UTF8_TO_I8( s[n] ) & UTF_CONTINUATION_MASK))           \
+                        << (full_shift - (n) * UTF_ACCUMULATION_SHIFT))
+
+    switch (expectlen) {
+      default:
+        uv = long_valid_utf8_to_uv(s, s + expectlen);
+        break;
+
+#if 0   /* See GH #23690 */
+    /* These cases give the correct results, but the extra memory used lowers
+     * the chances of the compiler actually inlining this, and we only care
+     * about performance for Unicode code points, all of which can be
+     * expressed with 4 bytes (5 on EBCDIC).  Experiments with clang showed
+     * no difference between 4,5,6, but a huge drop off with 7. */
+      case 7: uv |= PERL_VALID_UTF8_NEXT_ACCUMULATION(6);
+        /* FALLTHROUGH */
+      case 6: uv |= PERL_VALID_UTF8_NEXT_ACCUMULATION(5);
+        /* FALLTHROUGH */
+#endif
+      case 5: uv |= PERL_VALID_UTF8_NEXT_ACCUMULATION(4);
+        /* FALLTHROUGH */
+      case 4:
+        uv |= PERL_VALID_UTF8_NEXT_ACCUMULATION(3);
+        /* FALLTHROUGH */
+      case 3:
+        uv |= PERL_VALID_UTF8_NEXT_ACCUMULATION(2);
+        /* FALLTHROUGH */
+      case 2:
+        uv |= PERL_VALID_UTF8_NEXT_ACCUMULATION(1);
+
+        uv = UNI_TO_NATIVE(uv | ( ((UV)(  NATIVE_UTF8_TO_I8(s[0])
+                                        & UTF_START_MASK(expectlen))
+                                  << full_shift)));
+        break;
+
+      case 1:
+        uv = s[0];
+        break;
     }
 
-    /* Remove the leading bits that indicate the number of bytes, leaving just
-     * the bits that are part of the value */
-    uv = NATIVE_UTF8_TO_I8(uv) & UTF_START_MASK(expectlen);
-
-    /* Now, loop through the remaining bytes, accumulating each into the
-     * working total as we go.  (I khw tried unrolling the loop for up to 4
-     * bytes, but there was no performance improvement) */
-    for (++s; s < send; s++) {
-        uv = UTF8_ACCUMULATE(uv, *s);
+    if (retlen) {
+        *retlen = expectlen;
     }
 
-    return UNI_TO_NATIVE(uv);
-
+    return uv;
 }
 
 /* This looks like 0x010101... */
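
The fall-through accumulation in the inline.h hunk above can be tried outside the Perl source tree. The following is a minimal standalone sketch, assuming an ASCII platform, already-validated input, and lengths 1 through 4 only. The names utf8_len, decode_loop, decode_unrolled, ACC_SHIFT, CONT_MASK and NEXT_ACC are hypothetical stand-ins for UTF8SKIP, the old loop, the new switch, UTF_ACCUMULATION_SHIFT, UTF_CONTINUATION_MASK and PERL_VALID_UTF8_NEXT_ACCUMULATION. It is not the Perl implementation and omits the EBCDIC transforms and the long_valid_utf8_to_uv fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACC_SHIFT 6      /* payload bits per continuation byte ('L') */
#define CONT_MASK 0x3Fu  /* low 6 bits of a continuation byte */

/* Length implied by the start byte; the input is assumed to be valid. */
size_t utf8_len(uint8_t b0)
{
    if (b0 < 0x80) return 1;
    if (b0 < 0xE0) return 2;
    if (b0 < 0xF0) return 3;
    return 4;
}

/* Loop form, in the spirit of the removed code: shift the running total
 * left and 'or' in each continuation byte in turn. */
uint32_t decode_loop(const uint8_t *s)
{
    size_t len = utf8_len(s[0]);
    if (len == 1) return s[0];
    uint32_t cp = s[0] & (0x7Fu >> len);   /* keep start-byte payload bits */
    for (size_t i = 1; i < len; i++)
        cp = (cp << ACC_SHIFT) | (s[i] & CONT_MASK);
    return cp;
}

/* Byte n lands at bit position (len - 1 - n) * L, computed as
 * full_shift - n * L so that only full_shift varies at run time. */
#define NEXT_ACC(n) \
    ((uint32_t)(s[n] & CONT_MASK) << (full_shift - (n) * ACC_SHIFT))

/* Unrolled form: enter the switch at the case for the length and fall
 * through, so each continuation byte is 'or'ed in without a loop test. */
uint32_t decode_unrolled(const uint8_t *s)
{
    size_t len = utf8_len(s[0]);
    assert(len >= 1 && len <= 4);
    unsigned full_shift = (unsigned)(len - 1) * ACC_SHIFT;
    uint32_t cp = 0;

    switch (len) {
    case 4: cp |= NEXT_ACC(3);
        /* FALLTHROUGH */
    case 3: cp |= NEXT_ACC(2);
        /* FALLTHROUGH */
    case 2: cp |= NEXT_ACC(1);
        cp |= (uint32_t)(s[0] & (0x7Fu >> len)) << full_shift;
        break;
    default:
        cp = s[0];
        break;
    }
    return cp;
}
```

For the three-byte sequence EF BF BD, full_shift is 12, so byte [2] is shifted by 0, byte [1] by 6 and the start byte by 12, which yields 0xFFFD from both functions.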

proto.h

Lines changed: 5 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

utf8.c

Lines changed: 15 additions & 0 deletions
@@ -37,6 +37,21 @@ static const char malformed_text[] = "Malformed UTF-8 character";
 static const char unees[] =
     "Malformed UTF-8 character (unexpected end of string)";
 
+UV
+Perl_long_valid_utf8_to_uv(const U8 * const s, const U8 * const e)
+{
+    PERL_ARGS_ASSERT_LONG_VALID_UTF8_TO_UV;
+
+    /* This exists entirely to make the inlined 'valid_utf8_to_uv' smaller, to
+     * increase its chances of actually getting inlined.  For the code points
+     * it doesn't handle, it calls utf8_to_uv_or_die(), which is also inlined.
+     * So the compiler would try to inline both, getting a too-large-to-inline
+     * result.  So this non-inlined routine acts as an intermediary, to avoid
+     * that */
+
+    return utf8_to_uv_or_die(s, e, NULL);
+}
+
 /*
 These are various utility functions for manipulating UTF8-encoded
 strings. For the uninitiated, this is a method of representing arbitrary
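
The intermediary pattern described in the comment above (a small inlinable fast path that hands rare inputs to an ordinary out-of-line function) can be sketched in isolation. This is an illustration under stated assumptions, not Perl code: decode_common and decode_rare are hypothetical names, the input is assumed to be valid ASCII-platform UTF-8 in Perl's extended form of up to 6 bytes, and in a real code base decode_rare would be defined in a separate .c file so that only its prototype is visible to the inline function. Within a single translation unit a compiler may still choose to inline it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Out-of-line slow path: a generic loop for the rare, longer sequences.
 * Only this prototype would appear in the header. */
uint32_t decode_rare(const uint8_t *s, size_t len);

/* Inlinable fast path: handles only the common short lengths itself, so
 * its body stays small, and calls out for everything else. */
static inline uint32_t decode_common(const uint8_t *s, size_t len)
{
    switch (len) {
    case 1:
        return s[0];
    case 2:
        return ((uint32_t)(s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
    default:
        return decode_rare(s, len);
    }
}

/* Definition of the slow path; it would normally live in a .c file. */
uint32_t decode_rare(const uint8_t *s, size_t len)
{
    assert(len >= 2 && len <= 6);
    uint32_t cp = s[0] & (0x7Fu >> len);   /* keep start-byte payload bits */
    for (size_t i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3Fu);
    return cp;
}
```

The design trade-off is the one the commit message measures: each extra case kept in the inline function makes the common path slightly faster for that length, but enlarges every call site and can push the compiler past its inlining threshold.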
