Skip to content

Commit 7ad406a

Browse files
committed
Fix crash in mb_substr with MacJapanese encoding
Thanks to the GitHub user vi3tL0u1s (Viet Hoang Luu) for reporting this issue. The MacJapanese legacy text encoding has a very unusual property; it is possible for a string to encode more codepoints than it has bytes. In some corner cases, this resulted in a situation where the implementation code for mb_substr() would allocate a buffer of size -1. As you can probably imagine, that doesn't end well. Fixes phpGH-20832.
1 parent 03ca089 commit 7ad406a

File tree

2 files changed

+37
-1
lines changed

2 files changed

+37
-1
lines changed

ext/mbstring/mbstring.c

Lines changed: 27 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2107,8 +2107,34 @@ static zend_string* mb_get_substr_slow(unsigned char *in, size_t in_len, size_t
21072107
uint32_t wchar_buf[128];
21082108
unsigned int state = 0;
21092109

2110+
/* For the below call to mb_convert_buf_init, we need to estimate how many bytes the output of this operation will need.
2111+
* If possible, we want to initialize the output buffer with enough space, so it is not necessary to grow it dynamically;
2112+
* At the same time, we don't want to make it huge and waste a lot of memory.
2113+
*
2114+
* `len` is the requested number of codepoints; we optimistically guess that each codepoint can be encoded in one byte.
2115+
* However, the caller may have requested a huge number of codepoints, more than are actually present in the input string;
2116+
* so we also use a 2nd estimate to avoid unnecessary, huge allocations:
2117+
*
2118+
* `in_len` is the number of input bytes, `from` is the number of codepoints to skip; again, if each leading codepoint is
2119+
* encoded in one byte, then there may be as many as `in_len - from` bytes remaining after skipping leading codepoints;
2120+
* that gives our 2nd estimate of the needed output buffer size.
2121+
*
2122+
* If `len == MBFL_SUBSTR_UNTIL_END`, then `len` will be the largest possible `size_t` value, and the `in_len - from`
2123+
* estimate will certainly be used instead. */
2124+
2125+
size_t initial_buf_size;
2126+
if (from > in_len) {
2127+
/* Normally, if `from > in_len`, then the output will definitely be empty, and in fact, `mb_get_substr` uses
2128+
* this fact to (usually) just return an empty string in such situations.
2129+
* But for SJIS-Mac, one byte can decode to more than one codepoint, so we can't assume the output will
2130+
* definitely be empty. If it's not... then our output buffer will be dynamically resized. */
2131+
initial_buf_size = 0;
2132+
} else {
2133+
initial_buf_size = MIN(len, in_len - from);
2134+
}
2135+
21102136
mb_convert_buf buf;
2111-
mb_convert_buf_init(&buf, MIN(len, in_len - from), MBSTRG(current_filter_illegal_substchar), MBSTRG(current_filter_illegal_mode));
2137+
mb_convert_buf_init(&buf, initial_buf_size, MBSTRG(current_filter_illegal_substchar), MBSTRG(current_filter_illegal_mode));
21122138

21132139
while (in_len && len) {
21142140
size_t out_len = enc->to_wchar(&in, &in_len, wchar_buf, 128, &state);

ext/mbstring/tests/gh20832.phpt

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
--TEST--
2+
Ensure mb_substr does not crash with MacJapanese input, when codepoints to skip are more than number of bytes in input string
3+
--EXTENSIONS--
4+
mbstring
5+
--FILE--
6+
<?php
7+
var_dump(mb_substr("\x85\xAC", -1, encoding: 'MacJapanese'));
8+
?>
9+
--EXPECT--
10+
string(1) "V"

0 commit comments

Comments
 (0)