Skip to content
Closed
Show file tree
Hide file tree
Changes from 12 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 36 additions & 12 deletions src/wp-includes/html-api/class-wp-html-tag-processor.php
Original file line number Diff line number Diff line change
Expand Up @@ -1496,13 +1496,48 @@ private function skip_script_data(): bool {
while ( false !== $at && $at < $doc_length ) {
$at += strcspn( $html, '-<', $at );

/*
* Optimization: Terminating a complete script element requires at least eight
* additional bytes in the document. Some checks below may cause local escaped
* state transitions when processing shorter strings, but those transitions are
* irrelevant if the script tag is incomplete and the function must return false.
*
* This may need updating if those transitions become significant or exported from
* this function in some way, such as when building safe methods to embed JavaScript
* or data inside a SCRIPT element.
*
* $at may be here.
* ↓
* ...</script>
* ╰──┬───╯
* $at + 8 additional bytes are required for a non-false return value.
*
* This single check eliminates the need to check lengths for the shorter spans:
*
* $at may be here.
* ↓
* <script><!-- --></script>
* ├╯
* $at + 2 additional characters does not require a length check.
*
* The transition from "escaped" to "unescaped" is not relevant if the document ends:
*
* $at may be here.
* ↓
* <script><!-- -->[[END-OF-DOCUMENT]]
* ╰──┬───╯
* $at + 8 additional bytes is not satisfied, return false.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this patch is good, but before we merge, can we also note in this comment that we’re looking for other strings and they are covered here? I think the comment at the moment is slightly misleading, but we also check for <!-- for instance, and it’s currently incidental I think that terminating false is the same for all places within the SCRIPT.

imagine, however, that we wanted to signal something different if we failed to exit the SCRIPT because there was no end-tag vs. that we ended inside some escaped state? then, by returning early at this point we removed our ability to find the <!-- that transitions state.

I believe in review that the code is solid, but I want to make sure we leave this note for the future as a warning to reconsider the check if changing things. I don’t know that these are all currently caught in the tests (which of course would be another great way to cover this risk, by adding new tests for each of these cases, if we can properly think them up).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I've updated the comment to clarify things.

I do have a number of tests to add to this. I can push them to this branch.

Given how complicated script tag parsing is, I'm tempted to extract them to their own test suite like wpHtmlTagProcessor-scriptParsing. Do you have thoughts on that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled in your comment changes from #9397 and made some further tweaks. I'm pretty happy with it now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your third example in the comment is perfect. I was probably lazy in not suggesting that in my comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract them to their own test suite like wpHtmlTagProcessor-scriptParsing. Do you have thoughts on that?

not that I consider it a huge priority, but I think this is a great idea and would make it clearer for us to be more comprehensive in the tests. I think for every branch of the state machine in the HTML spec we could have data providers generating example inputs for them.

if ( $at + 8 >= $doc_length ) {
return false;
}

/*
* For all script states a "-->" transitions
* back into the normal unescaped script mode,
* even if that's the current state.
*/
if (
$at + 2 < $doc_length &&
'-' === $html[ $at ] &&
'-' === $html[ $at + 1 ] &&
'>' === $html[ $at + 2 ]
Expand All @@ -1512,10 +1547,6 @@ private function skip_script_data(): bool {
continue;
}

if ( $at + 1 >= $doc_length ) {
return false;
}

/*
* Everything of interest past here starts with "<".
* Check this character and advance position regardless.
Expand All @@ -1537,7 +1568,6 @@ private function skip_script_data(): bool {
* parsing after updating the state.
*/
if (
$at + 2 < $doc_length &&
'!' === $html[ $at ] &&
'-' === $html[ $at + 1 ] &&
'-' === $html[ $at + 2 ]
Expand All @@ -1561,7 +1591,6 @@ private function skip_script_data(): bool {
* proceed scanning to the next potential token in the text.
*/
if ( ! (
$at + 6 < $doc_length &&
( 's' === $html[ $at ] || 'S' === $html[ $at ] ) &&
( 'c' === $html[ $at + 1 ] || 'C' === $html[ $at + 1 ] ) &&
( 'r' === $html[ $at + 2 ] || 'R' === $html[ $at + 2 ] ) &&
Expand All @@ -1579,9 +1608,6 @@ private function skip_script_data(): bool {
* "<script123" should not end a script region even though
* "<script" is found within the text.
*/
if ( $at + 6 >= $doc_length ) {
continue;
}
$at += 6;
$c = $html[ $at ];
if ( ' ' !== $c && "\t" !== $c && "\r" !== $c && "\n" !== $c && '/' !== $c && '>' !== $c ) {
Expand Down Expand Up @@ -1611,8 +1637,6 @@ private function skip_script_data(): bool {
}

if ( $this->bytes_already_parsed >= $doc_length ) {
$this->parser_state = self::STATE_INCOMPLETE_INPUT;

return false;
}

Expand Down
43 changes: 43 additions & 0 deletions tests/phpunit/tests/html-api/wpHtmlTagProcessor.php
Original file line number Diff line number Diff line change
Expand Up @@ -1981,6 +1981,49 @@ public static function data_next_tag_ignores_script_tag_contents() {
);
}

/**
* Test that script tags are parsed correctly.
*
* Script tag parsing is very complicated, see the following resources for more details:
*
* - https://html.spec.whatwg.org/multipage/parsing.html#script-data-state
* - https://html.spec.whatwg.org/multipage/scripting.html#restrictions-for-contents-of-script-elements
*
* @ticket 63738
*
* @dataProvider data_script_tag
*/
public function test_script_tag_parsing( string $input, bool $closes ) {
$processor = new WP_HTML_Tag_Processor( $input );

if ( $closes ) {
$this->assertTrue( $processor->next_token(), 'Expected to find complete script tag.' );
$this->assertSame( 'SCRIPT', $processor->get_tag() );
return;
}

$this->assertFalse( $processor->next_token(), 'Expected to fail next_token().' );
$this->assertTrue( $processor->paused_at_incomplete_token(), 'Expected an incomplete SCRIPT tag token.' );
}

/**
* Data provider.
*/
public static function data_script_tag(): array {
return array(
'Basic script tag' => array( '<script></script>', true ),
'Script with type attribute' => array( '<script type="text/javascript"></script>', true ),
'Script data escaped' => array( '<script><!--</script>', true ),
'Script data double-escaped exit (comment)' => array( '<script><!--<script>--></script>', true ),
'Script data double-escaped exit (closed)' => array( '<script><!--<script></script></script>', true ),
'Script data double-escaped exit (closed/truncated)' => array( '<script><!--<script></script </script>', true ),
'Script data no double-escape' => array( '<script><!-- --><script></script>', true ),

'Script tag with self-close flag (ignored)' => array( '<script />', false ),
'Script data double-escaped' => array( '<script><!--<script></script>', false ),
);
}

/**
* Invalid tag names are comments on tag closers.
*
Expand Down
Loading