Commit bb6ed3b

Add wp_is_valid_utf8() for normalizing UTF-8 checks.
There are several existing mechanisms in Core for determining whether a given string contains valid UTF-8 bytes. They are spread out, and their behavior depends on which extensions are installed on the running system and on the value of `blog_charset`. The `seems_utf8()` function is one of these mechanisms. Unfortunately, `seems_utf8()` does not properly validate UTF-8, it is slow, and its purpose is obscured by its name and historic legacy.

This patch deprecates `seems_utf8()` and introduces `wp_is_valid_utf8()`, a new, spec-compliant, efficient, and focused UTF-8 validator. The new validator defers to `mb_check_encoding()` where present and otherwise validates with a pure-PHP implementation. This makes the spec-compliant validator available on all systems regardless of their runtime environment.

Developed in #9317
Discussed in https://core.trac.wordpress.org/ticket/38044

Props dmsnell, jonsurrell, jorbin.
Fixes #38044.

git-svn-id: https://develop.svn.wordpress.org/trunk@60630 602fd350-edb4-49c9-b593-d223f7449a82
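The "defer to the extension, otherwise fall back" dispatch described above can be sketched in isolation. This is a minimal illustration rather than Core's code: `is_valid_utf8_sketch()` is a hypothetical name, and the `preg_match( '//u', ... )` fallback is a swapped-in stand-in for the pure-PHP validator this commit adds. It assumes PCRE was built with UTF-8 support.

```php
<?php
/**
 * Hypothetical stand-in for wp_is_valid_utf8(), for illustration only.
 *
 * @param string $bytes String which might contain text encoded as UTF-8.
 * @return bool Whether the bytes are well-formed UTF-8.
 */
function is_valid_utf8_sketch( string $bytes ): bool {
	// Prefer the mbstring extension when it is loaded.
	if ( function_exists( 'mb_check_encoding' ) ) {
		return mb_check_encoding( $bytes, 'UTF-8' );
	}

	// Assumption: in /u mode PCRE refuses ill-formed UTF-8 subjects,
	// so preg_match() returns false instead of 1 for invalid input.
	return 1 === @preg_match( '//u', $bytes );
}

var_dump( is_valid_utf8_sketch( 'just a test' ) );
var_dump( is_valid_utf8_sketch( "B\xFCch" ) ); // ISO-8859-1 "ü" as a lone 0xFC byte.
```

Either branch gives callers the same yes-or-no answer, which is the property the commit message is after: the result should not depend on which extensions happen to be installed.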
1 parent fc78e19 commit bb6ed3b

8 files changed: +1166 -51 lines changed
src/wp-admin/includes/export.php

Lines changed: 1 addition & 1 deletion

@@ -243,7 +243,7 @@ function export_wp( $args = array() ) {
 	 * @return string
 	 */
 	function wxr_cdata( $str ) {
-		if ( ! seems_utf8( $str ) ) {
+		if ( ! wp_is_valid_utf8( $str ) ) {
 			$str = utf8_encode( $str );
 		}
 		// $str = ent2ncr(esc_html($str));
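
The call sites touched by this commit share one pattern: validate the string, and if it is not valid UTF-8, treat it as ISO-8859-1 and convert it. The sketch below illustrates that pattern on its own, under stated assumptions. `to_utf8_sketch()` is a hypothetical helper, the `preg_match( '//u', ... )` check stands in for `wp_is_valid_utf8()`, and the manual byte expansion stands in for `utf8_encode()`, which is deprecated as of PHP 8.2.

```php
<?php
/**
 * Hypothetical helper: returns valid UTF-8 unchanged and re-encodes
 * anything else on the assumption that it is ISO-8859-1.
 */
function to_utf8_sketch( string $str ): string {
	// Stand-in validity check; inside WordPress this would be wp_is_valid_utf8().
	if ( 1 === @preg_match( '//u', $str ) ) {
		return $str;
	}

	// Every ISO-8859-1 byte 0x80..0xFF maps to code point U+0080..U+00FF,
	// which UTF-8 encodes as two bytes: 110000xx 10xxxxxx.
	return preg_replace_callback(
		'/[\x80-\xFF]/',
		static function ( array $m ): string {
			$byte = ord( $m[0] );
			return chr( 0xC0 | ( $byte >> 6 ) ) . chr( 0x80 | ( $byte & 0x3F ) );
		},
		$str
	);
}

echo bin2hex( to_utf8_sketch( "B\xFCch" ) ), "\n"; // 42c3bc6368
```

A stricter validator matters here because a false "valid" verdict skips the conversion and passes ill-formed bytes downstream.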

src/wp-admin/includes/image.php

Lines changed: 2 additions & 2 deletions

@@ -1039,13 +1039,13 @@ function wp_read_image_metadata( $file ) {
 	}
 
 	foreach ( array( 'title', 'caption', 'credit', 'copyright', 'camera', 'iso' ) as $key ) {
-		if ( $meta[ $key ] && ! seems_utf8( $meta[ $key ] ) ) {
+		if ( $meta[ $key ] && ! wp_is_valid_utf8( $meta[ $key ] ) ) {
 			$meta[ $key ] = utf8_encode( $meta[ $key ] );
 		}
 	}
 
 	foreach ( $meta['keywords'] as $key => $keyword ) {
-		if ( ! seems_utf8( $keyword ) ) {
+		if ( ! wp_is_valid_utf8( $keyword ) ) {
 			$meta['keywords'][ $key ] = utf8_encode( $keyword );
 		}
 	}

src/wp-includes/formatting.php

Lines changed: 177 additions & 3 deletions

@@ -876,11 +876,14 @@ function shortcode_unautop( $text ) {
  *
  * @author bmorel at ssi dot fr (modified)
  * @since 1.2.1
+ * @deprecated 6.9.0 Use {@see wp_is_valid_utf8()} instead.
  *
  * @param string $str The string to be checked.
  * @return bool True if $str fits a UTF-8 model, false otherwise.
  */
 function seems_utf8( $str ) {
+	_deprecated_function( __FUNCTION__, '6.9.0', 'wp_is_valid_utf8()' );
+
 	mbstring_binary_safe_encoding();
 	$length = strlen( $str );
 	reset_mbstring_encoding();
@@ -914,6 +917,177 @@ function seems_utf8( $str ) {
 	return true;
 }
 
+/**
+ * Determines if a given byte string represents a valid UTF-8 encoding.
+ *
+ * Note that it’s unlikely for non-UTF-8 data to validate as UTF-8, but
+ * it is still possible. Many texts are simultaneously valid UTF-8,
+ * valid US-ASCII, and valid ISO-8859-1 (`latin1`).
+ *
+ * Example:
+ *
+ *     true === wp_is_valid_utf8( '' );
+ *     true === wp_is_valid_utf8( 'just a test' );
+ *     true === wp_is_valid_utf8( "\xE2\x9C\x8F" );    // Pencil, U+270F.
+ *     true === wp_is_valid_utf8( "\u{270F}" );        // Pencil, U+270F.
+ *     true === wp_is_valid_utf8( '✏' );              // Pencil, U+270F.
+ *
+ *     false === wp_is_valid_utf8( "just \xC0 test" ); // Invalid bytes.
+ *     false === wp_is_valid_utf8( "\xE2\x9C" );       // Invalid/incomplete sequences.
+ *     false === wp_is_valid_utf8( "\xC1\xBF" );       // Overlong sequences.
+ *     false === wp_is_valid_utf8( "\xED\xB0\x80" );   // Surrogate halves.
+ *     false === wp_is_valid_utf8( "B\xFCch" );        // ISO-8859-1 high-bytes.
+ *                                                     // E.g. The “ü” in ISO-8859-1 is a single byte 0xFC,
+ *                                                     // but in UTF-8 is the two-byte sequence 0xC3 0xBC.
+ *
+ * @see _wp_is_valid_utf8_fallback
+ *
+ * @since 6.9.0
+ *
+ * @param string $bytes String which might contain text encoded as UTF-8.
+ * @return bool Whether the provided bytes can decode as valid UTF-8.
+ */
+function wp_is_valid_utf8( string $bytes ): bool {
+	/*
+	 * Since PHP 8.3.0 the UTF-8 validity is cached internally
+	 * on string objects, making this a direct property lookup.
+	 *
+	 * This is to be preferred exclusively once PHP 8.3.0 is
+	 * the minimum supported version, because even when the
+	 * status isn’t cached, it uses highly-optimized code to
+	 * validate the byte stream.
+	 */
+	return function_exists( 'mb_check_encoding' )
+		? mb_check_encoding( $bytes, 'UTF-8' )
+		: _wp_is_valid_utf8_fallback( $bytes );
+}
+
+/**
+ * Fallback mechanism for safely validating UTF-8 bytes.
+ *
+ * By implementing a raw method here the code will behave in the same way on
+ * all installed systems, regardless of what extensions are installed.
+ *
+ * @see wp_is_valid_utf8
+ *
+ * @since 6.9.0
+ * @access private
+ *
+ * @param string $bytes String which might contain text encoded as UTF-8.
+ * @return bool Whether the provided bytes can decode as valid UTF-8.
+ */
+function _wp_is_valid_utf8_fallback( string $bytes ): bool {
+	$end = strlen( $bytes );
+
+	for ( $i = 0; $i < $end; $i++ ) {
+		/*
+		 * Quickly skip past US-ASCII bytes, all of which are valid UTF-8.
+		 *
+		 * This optimization step improves the speed from 10x to 100x
+		 * depending on whether the JIT has optimized the function.
+		 */
+		$i += strspn(
+			$bytes,
+			"\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" .
+			"\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" .
+			" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f",
+			$i
+		);
+		if ( $i >= $end ) {
+			break;
+		}
+
+		/**
+		 * The above fast-track handled all single-byte UTF-8 characters. What
+		 * follows MUST be a multibyte sequence otherwise there’s invalid UTF-8.
+		 *
+		 * Therefore everything past here is checking those multibyte sequences.
+		 * Because it’s possible that there are truncated characters, the use of
+		 * the null-coalescing operator with "\xC0" is a convenience for skipping
+		 * length checks on every continuation byte. This works because 0xC0 is
+		 * always invalid in a UTF-8 string, meaning that if the string has been
+		 * truncated, it will find 0xC0 and reject as invalid UTF-8.
+		 *
+		 * > [The following table] lists all of the byte sequences that are well-formed
+		 * > in UTF-8. A range of byte values such as A0..BF indicates that any byte
+		 * > from A0 to BF (inclusive) is well-formed in that position. Any byte value
+		 * > outside of the ranges listed is ill-formed.
+		 *
+		 * > Table 3-7. Well-Formed UTF-8 Byte Sequences
+		 * ╭─────────────────────┬────────────┬──────────────┬─────────────┬──────────────╮
+		 * │ Code Points         │ First Byte │ Second Byte  │ Third Byte  │ Fourth Byte  │
+		 * ├─────────────────────┼────────────┼──────────────┼─────────────┼──────────────┤
+		 * │ U+0000..U+007F      │ 00..7F     │              │             │              │
+		 * │ U+0080..U+07FF      │ C2..DF     │ 80..BF       │             │              │
+		 * │ U+0800..U+0FFF      │ E0         │ A0..BF       │ 80..BF      │              │
+		 * │ U+1000..U+CFFF      │ E1..EC     │ 80..BF       │ 80..BF      │              │
+		 * │ U+D000..U+D7FF      │ ED         │ 80..9F       │ 80..BF      │              │
+		 * │ U+E000..U+FFFF      │ EE..EF     │ 80..BF       │ 80..BF      │              │
+		 * │ U+10000..U+3FFFF    │ F0         │ 90..BF       │ 80..BF      │ 80..BF       │
+		 * │ U+40000..U+FFFFF    │ F1..F3     │ 80..BF       │ 80..BF      │ 80..BF       │
+		 * │ U+100000..U+10FFFF  │ F4         │ 80..8F       │ 80..BF      │ 80..BF       │
+		 * ╰─────────────────────┴────────────┴──────────────┴─────────────┴──────────────╯
+		 *
+		 * Notice that all valid third and fourth bytes are in the range 80..BF. This
+		 * validator takes advantage of that to only check the range of those bytes once.
+		 *
+		 * @see https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8/
+		 * @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G27506
+		 */
+
+		$b1 = ord( $bytes[ $i ] );
+		$b2 = ord( $bytes[ $i + 1 ] ?? "\xC0" );
+
+		// Valid two-byte code points.
+
+		if ( $b1 >= 0xC2 && $b1 <= 0xDF && $b2 >= 0x80 && $b2 <= 0xBF ) {
+			$i++;
+			continue;
+		}
+
+		$b3 = ord( $bytes[ $i + 2 ] ?? "\xC0" );
+
+		// Valid three-byte code points.
+
+		if ( $b3 < 0x80 || $b3 > 0xBF ) {
+			return false;
+		}
+
+		if (
+			( 0xE0 === $b1 && $b2 >= 0xA0 && $b2 <= 0xBF ) ||
+			( $b1 >= 0xE1 && $b1 <= 0xEC && $b2 >= 0x80 && $b2 <= 0xBF ) ||
+			( 0xED === $b1 && $b2 >= 0x80 && $b2 <= 0x9F ) ||
+			( $b1 >= 0xEE && $b1 <= 0xEF && $b2 >= 0x80 && $b2 <= 0xBF )
+		) {
+			$i += 2;
+			continue;
+		}
+
+		$b4 = ord( $bytes[ $i + 3 ] ?? "\xC0" );
+
+		// Valid four-byte code points.
+
+		if ( $b4 < 0x80 || $b4 > 0xBF ) {
+			return false;
+		}
+
+		if (
+			( 0xF0 === $b1 && $b2 >= 0x90 && $b2 <= 0xBF ) ||
+			( $b1 >= 0xF1 && $b1 <= 0xF3 && $b2 >= 0x80 && $b2 <= 0xBF ) ||
+			( 0xF4 === $b1 && $b2 >= 0x80 && $b2 <= 0x8F )
+		) {
+			$i += 3;
+			continue;
+		}
+
+		// Any other sequence is invalid.
+		return false;
+	}
+
+	// Reaching the end implies validating every byte.
+	return true;
+}
+
 /**
  * Converts a number of special characters into their HTML entities.
  *
@@ -1597,7 +1771,7 @@ function remove_accents( $text, $locale = '' ) {
 		return $text;
 	}
 
-	if ( seems_utf8( $text ) ) {
+	if ( wp_is_valid_utf8( $text ) ) {
 
 		/*
 		 * Unicode sequence normalization from NFD (Normalization Form Decomposed)
@@ -2028,7 +2202,7 @@ function sanitize_file_name( $filename ) {
 		$utf8_pcre = @preg_match( '/^./u', 'a' );
 	}
 
-	if ( ! seems_utf8( $filename ) ) {
+	if ( ! wp_is_valid_utf8( $filename ) ) {
 		$_ext     = pathinfo( $filename, PATHINFO_EXTENSION );
 		$_name    = pathinfo( $filename, PATHINFO_FILENAME );
 		$filename = sanitize_title_with_dashes( $_name ) . '.' . $_ext;
@@ -2277,7 +2451,7 @@ function sanitize_title_with_dashes( $title, $raw_title = '', $context = 'displa
 	// Restore octets.
 	$title = preg_replace( '|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title );
 
-	if ( seems_utf8( $title ) ) {
+	if ( wp_is_valid_utf8( $title ) ) {
 		if ( function_exists( 'mb_strtolower' ) ) {
 			$title = mb_strtolower( $title, 'UTF-8' );
 		}
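
The fallback validator's comment describes reading continuation bytes with the null-coalescing operator and a `"\xC0"` default, so that a truncated sequence fails the range check without a separate length test. The sketch below isolates that one idea. `byte_or_sentinel_sketch()` is a hypothetical helper and is not part of the commit.

```php
<?php
/**
 * Hypothetical helper: returns the byte value at $offset, or 0xC0 when the
 * offset is past the end of the string. 0xC0 never appears in valid UTF-8,
 * so any later range check on the result rejects it.
 */
function byte_or_sentinel_sketch( string $bytes, int $offset ): int {
	// `??` uses isset() semantics, so an out-of-range offset raises no warning.
	return ord( $bytes[ $offset ] ?? "\xC0" );
}

// The pencil (U+270F) is E2 9C 8F; here its last byte has been cut off.
$truncated = "\xE2\x9C";

$b3 = byte_or_sentinel_sketch( $truncated, 2 );
printf( "0x%02X\n", $b3 );              // 0xC0
var_dump( $b3 >= 0x80 && $b3 <= 0xBF ); // bool(false)
```

Because the sentinel lies outside 80..BF, the existing continuation-byte comparisons double as the bounds check.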
Lines changed: 21 additions & 0 deletions

@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2021 flenniken
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
+# utf8tests
+
+This directory contains a third-party test suite used for testing UTF-8 functionality.
+It primarily provides a set of tests containing various valid and invalid UTF-8 byte sequences.
+
+`utf8tests` can be found on GitHub at [flenniken/utf8tests](https://github.com/flenniken/utf8tests/).
+
+The necessary files have been copied to this directory:
+
+- `LICENSE`
+- `utf8tests.txt`
+
+The version of these files was taken from the git commit with
+SHA [`52cbdf830f3603047036070b086a1e5196df94d1`](https://github.com/flenniken/utf8tests/blob/52cbdf830f3603047036070b086a1e5196df94d1).
+
+## Updating
+
+If there have been changes to the `utf8tests` repository, this test suite can be updated. In
+order to update:
+
+1. Check out the latest version of the git repository mentioned above.
+1. Copy the files listed above into this directory.
+1. Update the SHA mentioned in this README file with the new `utf8tests` SHA.
