Commit cfcb4e9

Charset: Add explanatory note about what constitutes “valid” UTF-8.
This patch adds a clarifying note about what constitutes a valid UTF-8 byte stream. This was brought up in review as a potentially ambiguous term, so a link to the spec has been provided to fix the behavior to the standard.

Developed in #9716
Discussed in https://core.trac.wordpress.org/ticket/38044

Follow-up to [60630].

Props dmsnell, agulbra.
See #38044.

git-svn-id: https://develop.svn.wordpress.org/trunk@60702 602fd350-edb4-49c9-b593-d223f7449a82
1 parent 413b7ea commit cfcb4e9

1 file changed: +7 -0 lines changed

src/wp-includes/formatting.php

Lines changed: 7 additions & 0 deletions
@@ -940,6 +940,13 @@ function seems_utf8( $str ) {
  * // E.g. The “ü” in ISO-8859-1 is a single byte 0xFC,
  * // but in UTF-8 is the two-byte sequence 0xC3 0xBC.
  *
+ * A “valid” string consists of “well-formed UTF-8 code unit sequence[s],” meaning
+ * that the bytes conform to the UTF-8 encoding scheme, all characters use the minimal
+ * byte sequence required by UTF-8, and that no sequence encodes a UTF-16 surrogate
+ * code point or any character above the representable range.
+ *
+ * @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G32860
+ *
  * @see _wp_is_valid_utf8_fallback
  *
  * @since 6.9.0
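
The three conditions named in the added note (minimal-length encoding, no surrogates, nothing above the representable range) can be illustrated with a small byte scanner. The following is only a sketch, not the code from this commit or from WordPress: the function name `sketch_is_well_formed_utf8` is hypothetical, and the byte ranges are taken from the well-formed UTF-8 byte sequence table in Chapter 3 of the Unicode Standard, with U+10FFFF assumed as the upper bound of the representable range.

```php
<?php
/**
 * Hypothetical illustration only; this is not WordPress's implementation.
 *
 * Returns true when $bytes consists solely of well-formed UTF-8 sequences.
 */
function sketch_is_well_formed_utf8( string $bytes ): bool {
	$length = strlen( $bytes );
	$i      = 0;

	while ( $i < $length ) {
		$lead = ord( $bytes[ $i ] );

		// Single-byte (ASCII) sequence.
		if ( $lead <= 0x7F ) {
			$i++;
			continue;
		}

		// Pick the number of continuation bytes and the allowed range of the
		// second byte. Narrowing the second byte is what rejects overlong
		// forms, surrogates, and code points above U+10FFFF.
		if ( $lead >= 0xC2 && $lead <= 0xDF ) {
			$extra = 1;
			$min   = 0x80;
			$max   = 0xBF;
		} elseif ( $lead >= 0xE0 && $lead <= 0xEF ) {
			$extra = 2;
			$min   = ( 0xE0 === $lead ) ? 0xA0 : 0x80; // Below 0xA0 would be overlong.
			$max   = ( 0xED === $lead ) ? 0x9F : 0xBF; // Above 0x9F would be a surrogate.
		} elseif ( $lead >= 0xF0 && $lead <= 0xF4 ) {
			$extra = 3;
			$min   = ( 0xF0 === $lead ) ? 0x90 : 0x80; // Below 0x90 would be overlong.
			$max   = ( 0xF4 === $lead ) ? 0x8F : 0xBF; // Above 0x8F would exceed U+10FFFF.
		} else {
			// 0x80-0xBF are stray continuation bytes, 0xC0-0xC1 can only start
			// overlong forms, and 0xF5-0xFF never appear in UTF-8.
			return false;
		}

		// The sequence must not run past the end of the string.
		if ( $i + $extra >= $length ) {
			return false;
		}

		$second = ord( $bytes[ $i + 1 ] );
		if ( $second < $min || $second > $max ) {
			return false;
		}

		// Any remaining continuation bytes must be in 0x80-0xBF.
		for ( $k = 2; $k <= $extra; $k++ ) {
			$next = ord( $bytes[ $i + $k ] );
			if ( $next < 0x80 || $next > 0xBF ) {
				return false;
			}
		}

		$i += $extra + 1;
	}

	return true;
}

var_dump( sketch_is_well_formed_utf8( "\xC3\xBC" ) );         // bool(true): “ü” in UTF-8.
var_dump( sketch_is_well_formed_utf8( "\xFC" ) );             // bool(false): “ü” in ISO-8859-1.
var_dump( sketch_is_well_formed_utf8( "\xC0\xAF" ) );         // bool(false): overlong “/”.
var_dump( sketch_is_well_formed_utf8( "\xED\xA0\x80" ) );     // bool(false): surrogate U+D800.
var_dump( sketch_is_well_formed_utf8( "\xF4\x90\x80\x80" ) ); // bool(false): above U+10FFFF.
```

The first two calls mirror the “ü” example in the existing docblock; the last three each violate one of the conditions listed in the new note.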
