Draft
142 changes: 142 additions & 0 deletions src/wp-includes/formatting.php
@@ -992,6 +992,148 @@ function _wp_specialchars( $text, $quote_style = ENT_NOQUOTES, $charset = false,
	return $text;
}

/**
 * Normalize the escaping for content within an HTML string.
 *
 * @since {WP_VERSION}
 *
 * @param string $context "attribute" for strings comprising a full HTML attribute value,
 *                        or "data" for text nodes.
 * @param string $text    String containing HTML-escaped or escapable content, in UTF-8.
 * @return string Version of input where all appropriate characters and escapes
 *                are standard and predictable.
 */
function wp_normalize_escaped_html_text( string $context, string $text ): string {
Member Author

We could add starting and ending offsets, as bytes or as characters, though bytes are probably best. Either way, if the start lands in the middle of an escape sequence we can walk back to the nearest boundary before the starting offset, but that is computationally more expensive than requiring callers to send appropriate byte offsets.

If callers do this, finding the boundaries could be nontrivial (e.g. a numeric character reference with 1 MB of leading zeros). The benefit of doing it in this function is that we can determine whether we’re in the middle of one. The worst case is the leading-zero numeric character reference, but even then it would not likely be too expensive…

if Char at Start is ';' or HexDigit:
	$last_hash = strrpos( $input, '#' )
	if false === $last_hash:
		bail

	$is_all_digits = strspn( $input, '0123456789ABCDEFabcdef', $last_hash + IsHexAdjustment );
	if $is_all_digits === ( Start - $last_hash ):
		IsABoundary

Moving forward is obviously easier.

If we make offsets “characters” (meaning code points probably) then the computation is more expensive because we have to count code point string length, which will either be slow or allocating.
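
One way to make the walk-back check concrete, as a rough sketch assuming byte offsets. The helper name `is_inside_numeric_character_reference()` is hypothetical and not part of this patch. It only considers numeric character references, and it searches backward from the offset for the nearest `#` rather than scanning the whole input.

```php
<?php
/**
 * Hypothetical helper (not part of this patch): reports whether the byte at
 * $start continues a numeric character reference that began before it.
 */
function is_inside_numeric_character_reference( string $text, int $start ): bool {
	$length = strlen( $text );
	if ( $start <= 0 || $start >= $length ) {
		return false;
	}

	// A negative offset makes strrpos() search backward: nearest "#" at or before $start.
	$hash_at = strrpos( $text, '#', $start - $length );
	if ( false === $hash_at || 0 === $hash_at || '&' !== $text[ $hash_at - 1 ] ) {
		return false;
	}

	$after_hash = $hash_at + 1;
	if ( $after_hash >= $length ) {
		return false;
	}

	$is_hex    = 'x' === $text[ $after_hash ] || 'X' === $text[ $after_hash ];
	$digits_at = $after_hash + ( $is_hex ? 1 : 0 );
	$digit_set = $is_hex ? '0123456789ABCDEFabcdef' : '0123456789';

	// Worst case is linear in the reference length, e.g. a long run of leading zeros.
	$digit_count = $digits_at < $length ? strspn( $text, $digit_set, $digits_at ) : 0;
	if ( 0 === $digit_count ) {
		return false;
	}

	$digits_end = $digits_at + $digit_count;

	// Inside when the offset precedes the end of the digits, or sits on the closing semicolon.
	return $start < $digits_end || ( $start === $digits_end && ';' === $text[ $start ] );
}
```

When this returns true, the nearest safe boundary before the offset is the `&` immediately preceding the `#` that was found.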

	$normalized = array();
	$end = strlen( $text );
	$at = 0;
	$was_at = 0;
	$token_length = 0;

	while ( $at < $end ) {
		$next_character_reference_at = strpos( $text, '&', $at );
		if ( false === $next_character_reference_at ) {
			break;
		}

		$character_reference = WP_HTML_Decoder::read_character_reference( $context, $text, $next_character_reference_at, $token_length );

		// This is an un-escaped ampersand character, so encode it.
		if ( ! isset( $character_reference ) ) {
			$normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at ) . '&amp;';
			$at = $next_character_reference_at + 1;
			$was_at = $at;
			continue;
		}

		// Some characters are best left visible to the human mind.
		$should_unhide = 1 === strspn( $character_reference, ',%()0123456789:[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz{}' );
Comment on lines +1029 to +1030
Member

This is interesting and requires some careful thought.

[] concerns me because literal [] are used for shortcodes and escaping [] is one way to prevent shortcode matching.

Generally, I wonder about being so prescriptive about which characters are prevented from being escaped. For example, we'd need to consider interactions with common templating systems. Something like Blade relies on {{ … }} for templating, and presumably encoding {} is a way to prevent template processing of a literal {{ … }}.

ASCII alphanumerics seem a bit safer to un-escape, but I'm reluctant to override how HTML was authored without good reason.

Member Author

The “proper” way to escape shortcodes is shortcode escaping: use [[gallery]] instead of [gallery].

I think of this as an outbound-to-browser function, so templating systems that run only in JavaScript would be affected, and I would hope they have some escaping mechanism of their own.

This entire unhiding behavior is purely a security concession to make it easier for humans to identify malicious inputs and also to make downstream security checks more reliable, since they often assume malicious inputs arrive in a friendly way.

I do agree that this requires careful thought. It’s a powerful and exciting (to me) aspect made possible via the HTML API, but not integral to this function.
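
To illustrate the concern about hidden characters: a downstream check that looks for a literal `javascript:` prefix can miss the same text when parts of it are written as character references. The sketch below uses PHP's built-in `html_entity_decode()` as a stand-in for what a browser would see. It is not the function under review, and the sample string is made up.

```php
<?php
// Made-up input: "j" and ":" are hidden behind numeric character references.
$encoded = '&#x6a;avascript&#58;alert(1)';

// A naive check on the raw text finds nothing suspicious.
$raw_matches = 0 === strpos( $encoded, 'javascript:' );

// Decoding approximates what the browser sees in an attribute value.
$decoded         = html_entity_decode( $encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
$decoded_matches = 0 === strpos( $decoded, 'javascript:' );

var_dump( $raw_matches, $decoded_matches );
```

Rendering such characters as themselves would let the naive check see the same text the browser does.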

Member Author

This could be a filter.

/**
 * Selects which US-ASCII characters to enforce rendering as the byte itself
 * rather than as any HTML character reference.
 *
 * Must be single-byte US-ASCII characters only.
 */
$unhidden_ascii = apply_filters( 'always_raw_escaped_html_ascii', ',%()0123456789…' );

Then again, this might be best unfilterable.

		if ( $should_unhide ) {
			$normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at ) . $character_reference;
			$at = $next_character_reference_at + $token_length;
			$was_at = $at;
			continue;
		}

		$is_syntax = 1 === strspn( $character_reference, '&"\'<>' );
		if ( $is_syntax && '#' === $text[ $next_character_reference_at + 1 ] ) {
			$named_form = strtr(
				$character_reference,
				array(
					'&' => '&amp;',
					'"' => '&quot;',
					"'" => '&apos;',
					'<' => '&lt;',
					'>' => '&gt;',
				)
			);
			$normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at ) . $named_form;
			$at = $next_character_reference_at + $token_length;
			$was_at = $at;
			continue;
		}

		// This is a valid character reference, but it might not be normative.
		$needs_semicolon = ';' !== $text[ $next_character_reference_at + $token_length - 1 ];

		// This is a named character reference.
		if ( '#' !== $text[ $next_character_reference_at + 1 ] ) {
			// Nothing to do for already-normalized named character references.
			if ( ! $needs_semicolon ) {
				$at = $next_character_reference_at + $token_length;
				continue;
			}

			// Add the missing semicolon.
			$normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at + $token_length ) . ';';
			$at = $next_character_reference_at + $token_length;
			$was_at = $at;
			continue;
		}

		/*
		 * While named character references have only a single form and are case sensitive,
		 * numeric character references may contain upper or lowercase hex values and may
		 * contain unlimited preceding zeros.
		 */
		$is_hex = 'x' === $text[ $next_character_reference_at + 2 ] || 'X' === $text[ $next_character_reference_at + 2 ];
		$digits_at = $next_character_reference_at + ( $is_hex ? 3 : 2 );
		$leading_zeros = '0' === $text[ $digits_at ] ? strspn( $text, '0', $digits_at ) : 0;
Comment on lines +1074 to +1081
Member

The hex x prefix is not normalized to lowercase if other normalizations are not required.

  • &#XFe; -> &#xfe;
  • &#X0fe; -> &#xfe;
  • &#Xfe; -> &#Xfe;


		if ( ! $needs_semicolon && ! $is_hex && 0 === $leading_zeros ) {
			// Nothing to do for already-normalized decimal numeric character references.
			$at = $next_character_reference_at + $token_length;
			continue;
		}

		$digits = substr( $text, $digits_at + $leading_zeros, $next_character_reference_at + $token_length - $digits_at - $leading_zeros - ( $needs_semicolon ? 0 : 1 ) );
		if ( $is_hex ) {
			$lower_digits = strtolower( $digits );

			// Nothing to do for already-normalized hexadecimal numeric character references.
			if ( $lower_digits === $digits && ! $needs_semicolon && 0 === $leading_zeros ) {
				$at = $next_character_reference_at + $token_length;
				continue;
Comment on lines +1089 to +1096
Member

It seems like we could avoid or defer until absolutely necessary the digits substring and lowercasing with something like strspn($s, '0123456789abcdef') === strlen($s) (using the appropriate offsets and lengths).
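
A minimal sketch of that idea, assuming the caller already knows the byte offset and length of a hexadecimal reference. The helper name `is_normalized_hex_reference()` is hypothetical and not part of this patch. It uses the offset and length parameters of `strspn()` so that no substring or lowercase copy is allocated for the common already-normalized case, and it also checks that the `x` prefix is lowercase.

```php
<?php
/**
 * Hypothetical helper (not part of this patch): true when the hexadecimal
 * character reference spanning $token_length bytes from $ref_at is already
 * of the form "&#x" + lowercase digits without leading zeros + ";".
 */
function is_normalized_hex_reference( string $text, int $ref_at, int $token_length ): bool {
	// The shortest complete form is "&#x" + one digit + ";".
	if ( $token_length < 5 ) {
		return false;
	}

	$digits_at = $ref_at + 3; // Skip "&#x" or "&#X".
	$last_at   = $ref_at + $token_length - 1;

	if ( ';' !== $text[ $last_at ] || 'x' !== $text[ $ref_at + 2 ] || '0' === $text[ $digits_at ] ) {
		return false;
	}

	// Count lowercase hex digits in place, limited to the digit span.
	$digit_length = $last_at - $digits_at;

	return strspn( $text, '0123456789abcdef', $digits_at, $digit_length ) === $digit_length;
}
```

Only when this returns false would the digits need to be extracted and lowercased.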

			}
Comment on lines +1091 to +1097
Member

@sirreal Dec 5, 2025

Lowercasing seems fine.

Stylistically, I prefer the hex-number part to be uppercase, like &#x3C;. I looked for precedent in the codebase, and if there is any, it's probably for lowercase:

'\\\\' => '\\u005c',
'--' => '\\u002d\\u002d',
'<' => '\\u003c',
'>' => '\\u003e',
'&' => '\\u0026',
'\\"' => '\\u0022',

Interestingly, KSES entity normalization doesn't normalize hex case:

wp_kses_normalize_entities('&#x0000003C; &#x0000003e;');
// "&#x3C; &#x3e;"


			$normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at ) . "&#x{$lower_digits};";
			$at = $next_character_reference_at + $token_length;
			$was_at = $at;
			continue;
		} else {
			$normalized[] = substr( $text, $was_at, $next_character_reference_at - $was_at ) . "&#{$digits};";
			$at = $next_character_reference_at + $token_length;
			$was_at = $at;
			continue;
		}

		die( 'should not have arrived here' );
Member Author

This is for dev-time, because it should be unreachable. Need to confirm before removing.

		++$at;
	}

	if ( 0 === $was_at ) {
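		/*
		 * Nothing was queued for replacement, so every "&" already begins a
		 * normalized character reference and must not be encoded again.
		 * (Three-argument strtr() maps byte-for-byte, so '&' => '&amp;' was a no-op.)
		 */
		$normalized_text = $text;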
	} else {
		$normalized[] = substr( $text, $was_at, $end - $was_at );
		$normalized_text = implode( '', $normalized );
	}

	return strtr(
		$normalized_text,
		array(
			'<' => '&lt;',
			'>' => '&gt;',
			'"' => '&quot;',
			"'" => '&apos;',
			/*
			 * Stray ampersand "&" characters have already been replaced above,
			 * so it’s inappropriate to replace again here, as all remaining
			 * instances should be part of a normalized character reference.
			 */
		)
	);
}

/**
 * Converts a number of HTML entities into their special characters.
 *
58 changes: 58 additions & 0 deletions tests/phpunit/tests/formatting/normalizeEscapedHtmlText.php
@@ -0,0 +1,58 @@
<?php

/**
 * @group formatting
 *
 * @covers \wp_normalize_escaped_html_text()
 */
class Tests_Formatting_NormalizeEscapedHtmlText extends WP_UnitTestCase {
	/**
	 * Ensures that HTML text is properly normalized.
	 *
	 * @dataProvider data_example_datasets
	 *
	 * @param string $context
	 * @param string $text
	 * @param string $expected
	 */
	public function test_example_datasets( $context, $text, $expected ) {
		$this->assertEquals(
			$expected,
			wp_normalize_escaped_html_text( $context, $text )
		);
	}

	public static function data_example_datasets() {
		return array(
			array( 'attribute', 'test', 'test' ),
			array( 'attribute', 'test & done', 'test &amp; done' ),
			array( 'attribute', '&#XFe; is not iron', '&#xfe; is not iron' ),
			array( 'attribute', 'spec > guess', 'spec &gt; guess' ),
			array( 'attribute', 'art & copy', 'art &amp; copy' ),
			array( 'attribute', '&#x1F170', '&#x1f170;' ),
			array( 'attribute', '&#x1F170 ', '&#x1f170; ' ),

			array( 'data', 'test', 'test' ),
			array( 'data', 'test & done', 'test &amp; done' ),
			array( 'data', '&#XFe; is not iron', '&#xfe; is not iron' ),
			array( 'data', 'spec > guess', 'spec &gt; guess' ),
			array( 'data', 'art & copy', 'art &amp; copy' ),
			array( 'data', '&#x1F170', '&#x1f170;' ),
			array( 'data', '&#x1F170 ', '&#x1f170; ' ),

			// The “ambiguous ampersand” has different rules in the attribute value and data states.
			array( 'attribute', '&notmyproblem', '&amp;notmyproblem' ),
			array( 'data', '&notmyproblem', '&not;myproblem' ),

			// Certain characters should remain plaintext.
			array( 'attribute', 'eat &#x000033; apples', 'eat 3 apples' ),
			array( 'data', 'eat &#x000033; apples', 'eat 3 apples' ),
			array( 'data', '<&#x00073;cr&#0105pt&gt;', '&lt;script&gt;' ),
			array( 'attribute', '&#x6a;avascript&#58alert&#40;&#x0000007b"test&quot;&#125;&#41;', 'javascript:alert({&quot;test&quot;})' ),

			// Syntax characters should be represented uniformly.
			array( 'attribute', '&#X3CIMG&#00062', '&lt;IMG&gt;' ),
			array( 'data', '&#X3CIMG&#00062', '&lt;IMG&gt;' ),
		);
	}
}