-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Formatting: Introduce normalizing function for escaped HTML. #10600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -992,6 +992,148 @@ function _wp_specialchars( $text, $quote_style = ENT_NOQUOTES, $charset = false, | |||||||||||||
| return $text; | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| /** | ||||||||||||||
| * Normalize the escaping for content within an HTML string. | ||||||||||||||
| * | ||||||||||||||
| * @since {WP_VERSION} | ||||||||||||||
| * | ||||||||||||||
| * @param string $context "attribute" for strings comprising a full HTML attribute value, | ||||||||||||||
| * or "data" for text nodes. | ||||||||||||||
| * @param string $text string containing HTML-escaped or escapable content, in UTF-8. | ||||||||||||||
| * @return string version of input where all appropriate characters and escapes | ||||||||||||||
| * are standard and predictable. | ||||||||||||||
| */ | ||||||||||||||
| function wp_normalize_escaped_html_text( string $context, string $text ): string { | ||||||||||||||
| $normalized = array(); | ||||||||||||||
| $end = strlen( $text ); | ||||||||||||||
| $at = 0; | ||||||||||||||
| $was_at = 0; | ||||||||||||||
| $token_length = 0; | ||||||||||||||
|
|
||||||||||||||
| while ( $at < $end ) { | ||||||||||||||
| $next_character_reference_at = strpos( $text, '&', $at ); | ||||||||||||||
| if ( false === $next_character_reference_at ) { | ||||||||||||||
| break; | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| $character_reference = WP_HTML_Decoder::read_character_reference( $context, $text, $next_character_reference_at, $token_length ); | ||||||||||||||
|
|
||||||||||||||
| // This is an un-escaped ampersand character, so encode it. | ||||||||||||||
| if ( ! isset( $character_reference ) ) { | ||||||||||||||
| $normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at ) . '&'; | ||||||||||||||
| $at = $next_character_reference_at + 1; | ||||||||||||||
| $was_at = $at; | ||||||||||||||
| continue; | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| // Some characters are best left visible to the human mind. | ||||||||||||||
| $should_unhide = 1 === strspn( $character_reference, ',%()0123456789:[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz{}' ); | ||||||||||||||
|
Comment on lines
+1029
to
+1030
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is interesting and requires some careful thought.
Generally, I wonder about being so prescriptive about what is prevented from being escaped. For example, we'd need to consider interaction with common templating systems. Something like blade relies on ASCII alphanumerics seem a bit safer to un-escape, but I'm reluctant to override how HTML was authored without good reason.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the “proper” way to escape shortcodes is through the use of shortcode escaping. use I think of this as an outbound-to-browser function, so templating systems that only run in JavaScript would be affected, and I would hope they have some escape. This entire unhiding behavior is purely a security concession to make it easier for humans to identify malicious inputs and also to make downstream security checks more reliable, since they often assume malicious inputs arrive in a friendly way. I do agree that this requires careful thought. It’s a powerful and exciting (to me) aspect made possible via the HTML API, but not integral to this function.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. could be a filter. /**
* Selects which US-ASCII characters to enforce rendering as the byte itself
* rather than as any HTML character reference.
*
* Must be single-byte US-ASCII characters only.
*/
$unhidden_ascii = apply_filters( 'always_raw_escaped_html_ascii', ',%()0123456789…' );Then again, this might be best unfilterable. |
||||||||||||||
| if ( $should_unhide ) { | ||||||||||||||
| $normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at ) . $character_reference; | ||||||||||||||
| $at = $next_character_reference_at + $token_length; | ||||||||||||||
| $was_at = $at; | ||||||||||||||
| continue; | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| $is_syntax = 1 === strspn( $character_reference, '&"\'<>' ); | ||||||||||||||
| if ( $is_syntax && '#' === $text[ $next_character_reference_at + 1 ] ) { | ||||||||||||||
| $named_form = strtr( | ||||||||||||||
| $character_reference, | ||||||||||||||
| array( | ||||||||||||||
| '&' => '&', | ||||||||||||||
| '"' => '"', | ||||||||||||||
| "'" => ''', | ||||||||||||||
| '<' => '<', | ||||||||||||||
| '>' => '>', | ||||||||||||||
| ) | ||||||||||||||
| ); | ||||||||||||||
| $normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at ) . $named_form; | ||||||||||||||
| $at = $next_character_reference_at + $token_length; | ||||||||||||||
| $was_at = $at; | ||||||||||||||
| continue; | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| // This is a valid character reference, but it might not be normative. | ||||||||||||||
| $needs_semicolon = ';' !== $text[ $next_character_reference_at + $token_length - 1 ]; | ||||||||||||||
|
|
||||||||||||||
| // This is a named character reference. | ||||||||||||||
| if ( '#' !== $text[ $next_character_reference_at + 1 ] ) { | ||||||||||||||
| // Nothing to do for already-normalized named character references. | ||||||||||||||
| if ( ! $needs_semicolon ) { | ||||||||||||||
| $at = $next_character_reference_at + $token_length; | ||||||||||||||
| continue; | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| // Add the missing semicolon. | ||||||||||||||
| $normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at + $token_length ) . ';'; | ||||||||||||||
| $at = $next_character_reference_at + $token_length; | ||||||||||||||
| $was_at = $at; | ||||||||||||||
| continue; | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| /* | ||||||||||||||
| * While named character references have only a single form and are case sensitive, | ||||||||||||||
| * numeric character references may contain upper or lowercase hex values and may | ||||||||||||||
| * contain unlimited preceding zeros. | ||||||||||||||
| */ | ||||||||||||||
| $is_hex = 'x' === $text[ $next_character_reference_at + 2 ] || 'X' === $text[ $next_character_reference_at + 2 ]; | ||||||||||||||
| $digits_at = $next_character_reference_at + ( $is_hex ? 3 : 2 ); | ||||||||||||||
| $leading_zeros = '0' === $text[ $digits_at ] ? strspn( $text, '0', $digits_at ) : 0; | ||||||||||||||
|
Comment on lines
+1074
to
+1081
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The hex
|
||||||||||||||
|
|
||||||||||||||
| if ( ! $needs_semicolon && ! $is_hex && 0 === $leading_zeros ) { | ||||||||||||||
| // Nothing to do for already-normalized decimal numeric character references. | ||||||||||||||
| $at = $next_character_reference_at + $token_length; | ||||||||||||||
| continue; | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| $digits = substr( $text, $digits_at + $leading_zeros, $next_character_reference_at + $token_length - $digits_at - $leading_zeros - ( $needs_semicolon ? 0 : 1 ) ); | ||||||||||||||
| if ( $is_hex ) { | ||||||||||||||
| $lower_digits = strtolower( $digits ); | ||||||||||||||
|
|
||||||||||||||
| // Nothing to do for already-normalized hexadecimal numeric character references. | ||||||||||||||
| if ( $lower_digits === $digits && ! $needs_semicolon && 0 === $leading_zeros ) { | ||||||||||||||
| $at = $next_character_reference_at + $token_length; | ||||||||||||||
| continue; | ||||||||||||||
|
Comment on lines
+1089
to
+1096
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems like we could avoid or defer until absolutely necessary the digits substring and lowercasing with something like |
||||||||||||||
| } | ||||||||||||||
|
Comment on lines
+1091
to
+1097
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Lowercasing seems fine. Stylistically, I prefer the hex-number part to be uppercase, like wordpress-develop/src/wp-includes/blocks.php Lines 1629 to 1634 in 2929fe2
Interestingly, KSES entity normalization doesn't normalize hex cases: wp_kses_normalize_entities('< >');
// "< >" |
||||||||||||||
|
|
||||||||||||||
| $normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at ) . "&#x{$lower_digits};"; | ||||||||||||||
| $at = $next_character_reference_at + $token_length; | ||||||||||||||
| $was_at = $at; | ||||||||||||||
| continue; | ||||||||||||||
| } else { | ||||||||||||||
| $normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at ) . "&#{$digits};"; | ||||||||||||||
| $at = $next_character_reference_at + $token_length; | ||||||||||||||
| $was_at = $at; | ||||||||||||||
| continue; | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| die( 'should not have arrived here' ); | ||||||||||||||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. for dev-time, because this should be unreachable. need to confirm before removing. |
||||||||||||||
| ++$at; | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| if ( 0 === $was_at ) { | ||||||||||||||
| $normalized_text = strtr( $text, '&', '&' ); | ||||||||||||||
| } else { | ||||||||||||||
| $normalized[] = substr( $text, $was_at, $end - $was_at ); | ||||||||||||||
| $normalized_text = implode( '', $normalized ); | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| return strtr( | ||||||||||||||
| $normalized_text, | ||||||||||||||
| array( | ||||||||||||||
| '<' => '<', | ||||||||||||||
| '>' => '>', | ||||||||||||||
| '"' => '"', | ||||||||||||||
| "'" => ''', | ||||||||||||||
| /* | ||||||||||||||
| * Stray ampersand "&" characters have already been replaced above, | ||||||||||||||
| * so it’s inappropriate to replace again here, as all remaining | ||||||||||||||
| * instances should be part of a normalized character reference. | ||||||||||||||
| */ | ||||||||||||||
| ) | ||||||||||||||
| ); | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| /** | ||||||||||||||
| * Converts a number of HTML entities into their special characters. | ||||||||||||||
| * | ||||||||||||||
|
|
||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| <?php | ||
|
|
||
| /** | ||
| * @group formatting | ||
| * | ||
| * @covers \wp_normalize_escaped_html_text() | ||
| */ | ||
| class Tests_Formatting_NormalizeEscapedHtmlText extends WP_UnitTestCase { | ||
| /** | ||
| * Ensures that HTML test is properly normalized. | ||
| * | ||
| * @dataProvider data_example_datasets | ||
| * | ||
| * @param string $context | ||
| * @param string $text | ||
| * @param string $expected | ||
| */ | ||
| public function test_example_datasets( $context, $text, $expected ) { | ||
| $this->assertEquals( | ||
| $expected, | ||
| wp_normalize_escaped_html_text( $context, $text ) | ||
| ); | ||
| } | ||
|
|
||
| public static function data_example_datasets() { | ||
| return array( | ||
| array( 'attribute', 'test', 'test' ), | ||
| array( 'attribute', 'test & done', 'test & done' ), | ||
| array( 'attribute', 'þ is not iron', 'þ is not iron' ), | ||
| array( 'attribute', 'spec > guess', 'spec > guess' ), | ||
| array( 'attribute', 'art & copy', 'art & copy' ), | ||
| array( 'attribute', '🅰', '🅰' ), | ||
| array( 'attribute', '🅰 ', '🅰 ' ), | ||
|
|
||
| array( 'data', 'test', 'test' ), | ||
| array( 'data', 'test & done', 'test & done' ), | ||
| array( 'data', 'þ is not iron', 'þ is not iron' ), | ||
| array( 'data', 'spec > guess', 'spec > guess' ), | ||
| array( 'data', 'art & copy', 'art & copy' ), | ||
| array( 'data', '🅰', '🅰' ), | ||
| array( 'data', '🅰 ', '🅰 ' ), | ||
|
|
||
| // The “ambiguous ampersand” has different rules in the attribute value and data states. | ||
| array( 'attribute', '¬myproblem', '&notmyproblem' ), | ||
| array( 'data', '¬myproblem', '¬myproblem' ), | ||
|
|
||
| // Certain characters should remain plaintext. | ||
| array( 'attribute', 'eat 3 apples', 'eat 3 apples' ), | ||
| array( 'data', 'eat 3 apples', 'eat 3 apples' ), | ||
| array( 'data', '<script>', '<script>' ), | ||
| array( 'attribute', 'javascript:alert({"test"})', 'javascript:alert({"test"})' ), | ||
|
|
||
| // Syntax characters should be represented uniformly. | ||
| array( 'attribute', '<IMG>', '<IMG>' ), | ||
| array( 'data', '<IMG>', '<IMG>' ), | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could add starting and ending ranges, as bytes or as characters, but probably best as bytes. Either way, if starting in the middle of an escape sequence we can walk back and find the nearest boundary before the starting offset, but that is computationally more expensive than requiring callers to send appropriate byte offsets.
If callers do this, it could be nontrivial to find the boundaries (e.g. a numeric character reference with 1 MB of leading zeros). Benefit of doing that in this function is that we can determine if we’re in the middle of one. Worst-case is the leading-zero numeric character reference, but even then it would not likely be too expensive…
Moving forward is obviously easier.
If we make offsets “characters” (meaning code points probably) then the computation is more expensive because we have to count code point string length, which will either be slow or allocating.