Commit 8c15ff3

utf8_to_bytes_: Calculate needed malloc size
Prior to this commit, the size malloced was simply the length of the input string, which is the worst case. This commit changes that: the new pass through the input (introduced in the previous commit) now also calculates the needed length. The additional cost of doing this is minimal, and the result is a smaller allocation for a very long string that contains many convertible sequences.
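The sizing idea can be sketched outside of Perl as follows. This is a minimal illustration, not the code in utf8.c: the function name `downgrade_exact` is hypothetical, it uses plain `malloc` rather than `Newx`, and it assumes an ASCII platform, where only the start bytes 0xC2 and 0xC3 introduce characters that fit in a single byte. Each such two-byte sequence shrinks to one byte, so the result needs the input length minus the number of those sequences, plus one for a trailing NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not Perl's API.  Converts UTF-8 whose code points are
 * all below 0x100 into single bytes, in a buffer of exactly the needed size.
 * Returns NULL if the input is not convertible or if allocation fails. */
unsigned char *downgrade_exact(const unsigned char *s, size_t len,
                               size_t *out_len)
{
    size_t variant_count = 0;

    /* Pass 1: validate, and count the two-byte sequences. */
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80) continue;      /* invariant: one byte either way */
        /* Assumes ASCII-platform UTF-8: only 0xC2 and 0xC3 start a
         * character in the range 0x80..0xFF. */
        if ((s[i] != 0xC2 && s[i] != 0xC3)
            || i + 1 >= len
            || (s[i + 1] & 0xC0) != 0x80)
        {
            return NULL;
        }
        i++;                            /* skip the continuation byte */
        variant_count++;
    }

    /* Each variant shrinks from two bytes to one; +1 for a trailing NUL. */
    *out_len = len - variant_count;
    unsigned char *d0 = malloc(*out_len + 1);
    if (d0 == NULL) return NULL;

    /* Pass 2: convert into the exactly sized buffer. */
    unsigned char *d = d0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            *d++ = s[i];
        }
        else {
            *d++ = (unsigned char) (((s[i] & 0x03) << 6) | (s[i + 1] & 0x3F));
            i++;
        }
    }
    assert(d == d0 + *out_len);         /* the computed size was exact */
    *d = '\0';
    return d0;
}
```

For a five-byte input such as "caf" followed by 0xC3 0xA9, this allocates five bytes (four of content plus the NUL) where a worst-case allocation would take six.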
1 parent 0a5edc8 commit 8c15ff3

1 file changed, +17 -2 lines changed

utf8.c

Lines changed: 17 additions & 2 deletions
@@ -2404,11 +2404,12 @@ Perl_utf8_to_bytes_(pTHX_ U8 **s_ptr, STRLEN *lenp, U8 ** free_me,
     const U8 * const send = s0 + *lenp;
     U8 * s = first_variant;
     Size_t invariant_length = first_variant - s0;
+    Size_t variant_count = 0;

 #ifndef EBCDIC /* The below relies on the bit patterns of UTF-8 */

     /* Do a first pass through the string to see if it actually is translatable
-     * into bytes. On long strings this is
+     * into bytes, and if so, how big the result is. On long strings this is
      * done a word at a time, so is relatively quick. (There is some
      * start-up/tear-down overhead with the per-word algorithm, so no real gain
      * unless the remaining portion of the string is long enough. The current
@@ -2435,8 +2436,11 @@ Perl_utf8_to_bytes_(pTHX_ U8 **s_ptr, STRLEN *lenp, U8 ** free_me,
             if (! UTF8_IS_NEXT_CHAR_DOWNGRADEABLE(s, send)) {
                 return false;
             }
+
             s++;
+            variant_count++;
         }
+
         s++;
     }

@@ -2486,6 +2490,12 @@ Perl_utf8_to_bytes_(pTHX_ U8 **s_ptr, STRLEN *lenp, U8 ** free_me,
                 return false;
             }

+            /* Commit 03c1e4ab1d6ee9062fb3f94b0ba31db6698724b1 contains an
+               explanation of how this works */
+            variant_count +=
+                (Size_t) (((((start_bytes)) >> 7) * PERL_COUNT_MULTIPLIER)
+                            >> ((PERL_WORDSIZE - 1) * CHARBITS));
+
             s += PERL_WORDSIZE;
         } while (s + PERL_WORDSIZE <= send);

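The counting expression in this hunk can be illustrated in isolation. The sketch below is an assumption-laden stand-in, not the Perl source: it fixes the word size at 8 bytes, assumes `PERL_COUNT_MULTIPLIER` is a word with 0x01 in every byte, and derives `start_bytes` in a way this diff does not show (bit 7 set in each byte that is a UTF-8 start byte). The function name `count_start_bytes_in_word` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define WORDSIZE    8                              /* stand-in for PERL_WORDSIZE */
#define HIGH_BITS   UINT64_C(0x8080808080808080)
#define MULTIPLIER  UINT64_C(0x0101010101010101)   /* assumed value of PERL_COUNT_MULTIPLIER */

/* Count the UTF-8 start bytes (11xxxxxx) among the 8 bytes at p. */
unsigned count_start_bytes_in_word(const unsigned char *p)
{
    uint64_t word;
    memcpy(&word, p, sizeof word);    /* sidesteps alignment concerns */

    /* Assumed derivation: a start byte has bits 7 and 6 both set, while a
     * continuation byte has bit 7 only.  Shifting left by one lines bit 6
     * up with bit 7; the mask discards everything else. */
    uint64_t start_bytes = word & (word << 1) & HIGH_BITS;

    /* Move each flag down to bit 0 of its byte, then multiply: the top
     * byte of the product is the sum of all eight flags (at most 8, so
     * nothing overflows).  The final shift extracts that top byte. */
    return (unsigned) (((start_bytes >> 7) * MULTIPLIER)
                                        >> ((WORDSIZE - 1) * 8));
}
```

Because every step operates on each byte independently, the count does not depend on the byte order of the machine.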
@@ -2494,6 +2504,7 @@ Perl_utf8_to_bytes_(pTHX_ U8 **s_ptr, STRLEN *lenp, U8 ** free_me,
          * first byte of the character */
         if (s > first_variant && UTF8_IS_START(*(s-1))) {
             s--;
+            variant_count--;
         }
     }

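The decrement in this hunk appears to guard against double counting: if the last byte consumed by the per-word loop is a start byte, the pointer backs up so the per-byte loop sees the whole character, and that loop will count it again. The following sketch shows the pattern under stated assumptions: the function name `count_variants` is hypothetical, the word size is fixed at 8 bytes, and the input is assumed to be already validated as downgradeable UTF-8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hybrid counter; assumes s[0..len) is valid, downgradeable
 * UTF-8 (every byte >= 0x80 is either a start byte or its continuation). */
size_t count_variants(const unsigned char *s, size_t len)
{
    const unsigned char *p = s;
    const unsigned char * const send = s + len;
    size_t variant_count = 0;

    /* Per-word portion: count start bytes eight bytes at a time. */
    while ((size_t) (send - p) >= 8) {
        uint64_t word;
        memcpy(&word, p, sizeof word);
        uint64_t start_bytes = word & (word << 1)
                                    & UINT64_C(0x8080808080808080);
        variant_count += (size_t) (((start_bytes >> 7)
                                    * UINT64_C(0x0101010101010101)) >> 56);
        p += 8;
    }

    /* If the last byte consumed was a start byte, its character straddles
     * the boundary.  Back up so the loop below sees the whole character,
     * and undo the count it has already received. */
    if (p > s && (p[-1] & 0xC0) == 0xC0) {
        p--;
        variant_count--;
    }

    /* Per-byte portion for whatever is left. */
    while (p < send) {
        if (*p >= 0x80) {       /* start byte: step over its continuation */
            p++;
            variant_count++;
        }
        p++;
    }
    return variant_count;
}
```

Without the adjustment, a nine-byte input of seven ASCII bytes followed by 0xC3 0xA9 would have its continuation byte mistaken for a start byte by the per-byte loop, and the character would be counted twice.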
@@ -2505,16 +2516,20 @@ Perl_utf8_to_bytes_(pTHX_ U8 **s_ptr, STRLEN *lenp, U8 ** free_me,
                 return false;
             }
             s++;
+            variant_count++;
         }
         s++;
     }

+    /* Here, we passed the tests above and know how many UTF-8 variant
+     * characters there are, which allows us to calculate the size to malloc
+     * for the non-destructive case */
     U8 *d0;
     if (result_as == PL_utf8_to_bytes_overwrite) {
         d0 = s0;
     }
     else {
-        Newx(d0, *lenp + 1, U8);
+        Newx(d0, (*lenp) + 1 - variant_count, U8);
         Copy(s0, d0, invariant_length, U8);
     }
