Charset: Add explanatory note about what constitutes “valid” UTF-8.

dmsnell · dmsnell · commit cfcb4e9e168b · 2025-09-02T23:51:40.000Z
This patch adds a clarifying note about what constitutes a valid UTF-8 byte stream. This was brought up in review as a potentially ambiguous term, so a link to the spec has been provided to fix the behavior to the standard.

Developed in #9716
Discussed in https://core.trac.wordpress.org/ticket/38044

Follow-up to [60630].

Props dmsnell, agulbra.
See #38044.

git-svn-id: https://develop.svn.wordpress.org/trunk@60702 602fd350-edb4-49c9-b593-d223f7449a82
diff --git a/src/wp-includes/formatting.php b/src/wp-includes/formatting.php
@@ -940,6 +940,13 @@ function seems_utf8( $str ) {
  *                                                     // E.g. The “ü” in ISO-8859-1 is a single byte 0xFC,
  *                                                     // but in UTF-8 is the two-byte sequence 0xC3 0xBC.
  *
+ * A “valid” string consists of “well-formed UTF-8 code unit sequence[s],” meaning
+ * that the bytes conform to the UTF-8 encoding scheme, all characters use the minimal
+ * byte sequence required by UTF-8, and that no sequence encodes a UTF-16 surrogate
+ * code point or any character above the representable range.
+ *
+ * @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G32860
+ *
  * @see _wp_is_valid_utf8_fallback
  *
  * @since 6.9.0
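
The three conditions named in the new docblock (conforming byte structure, minimal-length encodings, and no surrogates or out-of-range code points) can be illustrated with a small standalone checker. This is only a sketch based on the well-formed byte ranges in the linked Unicode chapter; the function name `sketch_is_well_formed_utf8` is hypothetical and this is not the code WordPress ships in `_wp_is_valid_utf8_fallback`.

```php
<?php
/**
 * Hypothetical illustration only, not part of WordPress.
 *
 * Returns true if $bytes is a well-formed UTF-8 code unit sequence.
 */
function sketch_is_well_formed_utf8( string $bytes ): bool {
	$length = strlen( $bytes );
	$i      = 0;

	while ( $i < $length ) {
		$lead = ord( $bytes[ $i ] );

		// ASCII: a single byte.
		if ( $lead <= 0x7F ) {
			$i++;
			continue;
		}

		// Pick the number of trailing bytes and the allowed range
		// for the first trailing byte, based on the lead byte.
		if ( $lead >= 0xC2 && $lead <= 0xDF ) {
			list( $trailing, $min, $max ) = array( 1, 0x80, 0xBF );
		} elseif ( 0xE0 === $lead ) {
			// 0xA0 minimum rejects overlong three-byte forms.
			list( $trailing, $min, $max ) = array( 2, 0xA0, 0xBF );
		} elseif ( 0xED === $lead ) {
			// 0x9F maximum rejects surrogates U+D800..U+DFFF.
			list( $trailing, $min, $max ) = array( 2, 0x80, 0x9F );
		} elseif ( $lead >= 0xE1 && $lead <= 0xEF ) {
			list( $trailing, $min, $max ) = array( 2, 0x80, 0xBF );
		} elseif ( 0xF0 === $lead ) {
			// 0x90 minimum rejects overlong four-byte forms.
			list( $trailing, $min, $max ) = array( 3, 0x90, 0xBF );
		} elseif ( 0xF4 === $lead ) {
			// 0x8F maximum rejects code points above U+10FFFF.
			list( $trailing, $min, $max ) = array( 3, 0x80, 0x8F );
		} elseif ( $lead >= 0xF1 && $lead <= 0xF3 ) {
			list( $trailing, $min, $max ) = array( 3, 0x80, 0xBF );
		} else {
			// Stray continuation bytes, 0xC0, 0xC1, and 0xF5..0xFF.
			return false;
		}

		// The sequence is truncated if its last byte is missing.
		if ( $i + $trailing >= $length ) {
			return false;
		}

		$second = ord( $bytes[ $i + 1 ] );
		if ( $second < $min || $second > $max ) {
			return false;
		}

		// Any remaining trailing bytes must be plain continuation bytes.
		for ( $j = 2; $j <= $trailing; $j++ ) {
			$byte = ord( $bytes[ $i + $j ] );
			if ( $byte < 0x80 || $byte > 0xBF ) {
				return false;
			}
		}

		$i += $trailing + 1;
	}

	return true;
}

var_dump( sketch_is_well_formed_utf8( "\xC3\xBC" ) );         // bool(true):  "ü" in UTF-8
var_dump( sketch_is_well_formed_utf8( "\xFC" ) );             // bool(false): "ü" in ISO-8859-1
var_dump( sketch_is_well_formed_utf8( "\xC0\xAF" ) );         // bool(false): overlong "/"
var_dump( sketch_is_well_formed_utf8( "\xED\xA0\x80" ) );     // bool(false): surrogate U+D800
var_dump( sketch_is_well_formed_utf8( "\xF4\x90\x80\x80" ) ); // bool(false): above U+10FFFF
```

The first two calls reuse the “ü” example from the docblock; the last three each violate one of the conditions the new note spells out.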
