diff --git a/autoresearch.ideas.md b/autoresearch.ideas.md new file mode 100644 index 0000000000000..3ff482776cbad --- /dev/null +++ b/autoresearch.ideas.md @@ -0,0 +1,16 @@ +# Autoresearch Ideas Backlog + +## High Priority (user-suggested) +- **Stack on_push/on_pop callbacks** — the HTML processor stack operations have push/pop callbacks. If these fire during tokenization (even indirectly), they could be significant overhead. Investigate whether any stack operations happen in the tag processor's read-only path, or whether these only apply to the HTML processor's tree-building. +- **Bookmark on_destroy callback** — bookmarks may have cleanup behavior that adds overhead. Check if any bookmark operations happen during pure tokenization. + +## Medium Priority +- **Lazy token_length** — derive from bytes_already_parsed - token_starts_at instead of writing per token. Saves ~1M writes/pass. Requires changing all read sites. +- **Lazy is_closing_tag** — derive from html bytes. Saves 1 write/tag but adds cost to reads. +- **Deferred property writes with lazy flush** — save all non-essential writes, flush on demand. Big win for read-only, slight overhead for read-write. Protected properties can't be deferred. +- **Single boolean for modification check** — replace 2 array reads with 1 boolean read in hot loop. + +## Low Priority / Speculative +- **Integer state constants** — replace string comparisons with int. API-breaking for protected parser_state. +- **Packed tag name properties** — combine tag_name_starts_at + tag_name_length into single int. +- **Static variable caching** — cache html/doc_length across calls. diff --git a/autoresearch.md b/autoresearch.md new file mode 100644 index 0000000000000..3636b973f1433 --- /dev/null +++ b/autoresearch.md @@ -0,0 +1,98 @@ +# Autoresearch: HTML Tag Processor Performance + +## Objective +Optimize `WP_HTML_Tag_Processor::next_token()` tokenization throughput on html-standard.html (~large real-world HTML). 
The benchmark iterates all tokens with no modifications — purely read-only tokenization speed. + +## Metrics +- **Primary**: mean execution time (ms, lower is better) via `hyperfine` +- **Secondary**: peak memory (bytes, lower is better) via `/usr/bin/time -l` + +## How to Run +`./autoresearch.sh` — runs hyperfine, outputs `METRIC mean_ms=number` lines. + +## Files in Scope +- `src/wp-includes/html-api/class-wp-html-tag-processor.php` — main parser, all hot path methods +- `src/wp-includes/html-api/class-wp-html-attribute-token.php` — attribute token object (6 props, allocated per attr) +- `src/wp-includes/html-api/class-wp-html-span.php` — span object (2 props, allocated on dup attrs) +- `src/wp-includes/html-api/class-wp-html-text-replacement.php` — text replacement (3 props, not in hot path for read-only) + +## Off Limits +- Test files +- `bench.php` and `bootstrap-html-api.php` +- Any file outside `src/wp-includes/html-api/` + +## Constraints +- PHPUnit tests must pass: `./vendor/bin/phpunit -c tests/phpunit/tests/html-api/phpunit.xml --stop-on-error --stop-on-failure --stop-on-warning --stop-on-defect` +- No new dependencies +- stddev and outliers from hyperfine must remain acceptable +- Changes must preserve all existing behavior + +## What's Been Tried + +### Baseline: ~699ms + +### Wins (cumulative, all committed) +1. **Replace per-attribute function call loop with skip_attributes_and_find_closer()** — eliminates parse_next_attribute(false) calls. Single method scans for `>` handling quoted values. +2. **Inline after_tag() into base_class_next_token()** — removes method call overhead per token. +3. **Inline fast paths for text nodes and regular tags** — handles the two most common token types (text ~378K, tags ~646K) directly in base_class_next_token, falling through to full parse_next_tag() only for complex tokens. +4. **Direct byte comparisons for single-char strspn** — replace strspn for single-character checks with direct `===` comparisons. +5. 
**Cache doc_length as instance variable** — avoid strlen() per token. +6. **Fast path for '>' immediately after tag name** — skip attribute scanning for tags like ``, `
`. +7. **Defer property resets to type-specific return paths** — text nodes only reset tag-related properties, tags only reset text-related properties. +8. **Tag name length filter before special element check** — special elements have lengths 3,5,6,7,8. Tags of other lengths return immediately without calling get_tag(). +9. **Reorder checks: length before strspn** — many common tags eliminated by cheap integer comparison before the strspn function call. +10. **Optimize attribute scanner for common name="value" pattern** — check for `=` and quote char directly after attribute name, avoiding two strspn() calls that typically return 0. +11. **Inline single-space and '>' checks in attribute scanner loop** — replace strspn for whitespace between attributes with direct byte comparisons for single-space (most common) and '>' (tag closer). +12. **Remove redundant STATE_COMPLETE check** — $at >= $doc_length bounds check handles this case. +13. **Remove text_node_classification write from tag fast path** — never read for tag tokens. +14. **Use null text_starts_at for tags** — allows removing text_length=0 write. get_modifiable_text() returns '' on null text_starts_at. +15. **Avoid redundant bytes_already_parsed property read** — use local $was_at for $at when no lexical updates. +16. **Remove attribute_scan_from property** — compute scan position as tag_name_starts_at + tag_name_length on demand in ensure_attributes_parsed(). Eliminates property and 3 writes. +17. **Remove attributes_parsed write from text nodes** — all callers of ensure_attributes_parsed() guard with STATE_MATCHED_TAG check, so the flag is never read for non-tag tokens. +18. **Short-circuit closing tags before after_tag_match** — closing tags never need special element processing. Return early using local $is_closer instead of reading property through the shared label. +19. **Move closer check out of after_tag_match** — both fast path and full_parse path return early for closers. 
after_tag_match now only handles openers, eliminating is_closing_tag read. +20. **Skip strpos when at '<'** — check for '<' at current position before calling strpos(). Tags (~63% of tokens) start at '<' and skip the function call entirely. +21. **Remove text_starts_at null write for tags** — use bounds check (text_starts_at < token_starts_at) in get_modifiable_text() to detect stale text instead of proactively nulling. +22. **Restructure get_tag() for state-based dispatch** — check STATE_MATCHED_TAG first instead of null check on tag_name_starts_at. Allows skipping tag_name null writes for text nodes (~756K writes eliminated). +23. **Replace attributes_parsed boolean with version-based staleness check** — use attributes_parsed_at integer compared against token_starts_at. Eliminates ~646K attributes_parsed=false writes per parse iteration. +24. **Pre-filter special element length in fast path before goto** — check tag name length (3,5,6,7,8) before goto after_tag_match. Tags with lengths 1,2,4 (88% of all tags: a, p, br, li, span, code, etc.) return immediately. +25. **Merge STATE_INCOMPLETE_INPUT check into bounds check** — remove dedicated parser_state read at loop start. Set bytes_already_parsed=doc_length on incomplete input so the existing bounds check handles it. Eliminates 1 property read per token. + +### Current: ~316ms (54.8% faster) + +### Dead Ends +- **First-letter bitwise OR + 7 comparisons** — replacing strspn('iIlLnNpPsStTxX',...) was WORSE. PHP bitwise string OR creates allocation; 7 comparisons slower than one C-level strspn. +- **substr_compare for special element names** — no measurable improvement. The special element check is already rare. +- **Simplified closer detection** — removing ternary `$is_closer ? 1 : 0` by computing $tag_at incrementally. Neutral. +- **Local vars for after_tag_match** — passing tag_length/tag_at as locals through the goto label. Neutral. 
+- **Pass $at parameter to skip_attributes_and_find_closer** — extra function parameter overhead cancels savings. +- **Add strspn first-letter check to fast path filter** — adding strspn('iIlLnNpPsStTxX') alongside the length filter. Neutral — length filter already catches 88% of tags. +- **Conditional text_node_classification write** — `if (TEXT_IS_GENERIC !== $this->text_node_classification)` before writing. Neutral — the conditional read costs the same as the write. +- **1-byte text node lookahead** — check `$html[$at+1] === '<'` before calling strpos. WORSE (~15ms regression). The extra branch on every text path hurts; strpos with memchr is already very fast for single bytes. +- **Length-3 first-letter filter in fast path** — for len=3 tags, check first letter against p/P/x/X (only PRE/XMP are special). Neutral — extra comparisons offset the savings from avoiding after_tag_match for ~74K div tags. +- **Single boolean has_pending_updates flag** — replace `classname_updates || lexical_updates` (2 reads) with a single boolean. Too invasive: 16+ modification sites need `$this->has_pending_updates = true`. Correctness concerns with clearing the flag. +- **Defer classname_updates check** — only check lexical_updates in hot loop, defer classname conversion. Incorrect: classname conversion requires current tag's attributes; deferring past cursor advance would use wrong attributes. + +### Architecture Notes +- **Token distribution**: ~646K tags (325K openers, 321K closers), ~378K text nodes, ~247K attributes, 1 other, across ~1M tokens in html-standard.html +- **Tag name length distribution**: len=1: 184K (28%), len=2: 211K (33%), len=3: 75K (12%), len=4: 174K (27%), len=5+: 4K (0.6%). Length filter catches 88% of tags. +- **Attribute distribution**: ~517K tags without attributes, ~129K with attributes (~20%) +- **Text node length**: 73K are 1 byte, 22K are 2 bytes, 30K are 3 bytes, etc. Most are short (whitespace between tags). 
+- **Text-tag alternation**: Most tokens alternate text→tag→text→tag. The strpos skip optimization exploits this — tags start at '<' so no search is needed. +- **PHP overhead dominates**: 316ms for 3 passes over ~1M tokens each is roughly 105ns/token per pass (hyperfine times the whole run, so this includes process startup). Property reads (~5-10ns each), property writes (~10-15ns), method dispatch (~10-20ns for JIT-optimized private calls). +- **next_token()→base_class_next_token() dispatch**: ~1M extra method calls, cannot be eliminated because get_updated_html() needs the base implementation. +- **Remaining property reads per token (hot path start)**: bytes_already_parsed, classname_updates, lexical_updates, html, doc_length = 5 reads. +- **Remaining property writes per token**: text nodes ~7, tags ~7. Total ~7M writes per benchmark pass. +- **Protected properties constrain optimization**: parser_state and text_node_classification are protected (read directly by WP_HTML_Processor subclass). Cannot defer or version-gate these without changing the subclass, which is off-limits. +- **after_tag() is dead code**: the method exists but is never called (fully inlined into base_class_next_token). Could be removed, but cosmetic. + +### Unexplored Ideas +- **Stack operations on_push/on_pop callbacks** — the HTML processor's open_elements stack has push/pop callbacks that fire during tree-building. These are not in scope for the tag processor benchmark, but if the benchmark changes to use the HTML processor, these callbacks could be significant overhead. +- **Bookmark on_destroy callback** — bookmarks have cleanup behavior. Not in hot path for read-only benchmark. +- **Lazy token_length computation** — token_length = bytes_already_parsed - token_starts_at for all fast-path tokens. Could eliminate 1 write per token (~1M writes/pass). But read sites are numerous and some (special elements, bookmarks) set token_length independently. Would need to change all read sites. +- **Lazy is_closing_tag computation** — derive from html[token_starts_at+1] === '/'. 
Saves 1 write per tag but adds 2 property reads + 1 byte access per read (many read sites including subclass). +- **Integer state constants** — replace string parser_state constants with integers for faster comparison. But parser_state is protected and used by external code with string comparisons. +- **Packed tag name properties** — store tag_name_starts_at and tag_name_length in a single 64-bit int. Saves 1 write, adds shift/mask to reads. Only useful if reads are rare (true for fast-path-filtered tags). +- **Static variable caching for $html/$doc_length** — cache across method calls. Saves ~1 property read/call. Shared across instances (problematic for multi-instance usage). +- **Deferred property writes with lazy flush** — store pending token data, only write to properties when external code reads them. Saves all property writes for read-only benchmark. Requires flush checks in all getter methods. Protected properties can't be deferred. +- **Eliminate classname_updates read in hot loop** — both classname_updates and lexical_updates are always empty in the benchmark. Replacing 2 array truthiness checks with a single boolean flag would save 1 read/token, but requires setting the flag in 16+ update methods. diff --git a/bench.php b/bench.php new file mode 100755 index 0000000000000..6279051835a55 --- /dev/null +++ b/bench.php @@ -0,0 +1,14 @@ +#!/usr/bin/env php +next_token() ) { +} +$p = new WP_HTML_Tag_Processor( $html ); +while ( $p->next_token() ) { +} +$p = new WP_HTML_Tag_Processor( $html ); +while ( $p->next_token() ) { +} diff --git a/bootstrap-html-api.php b/bootstrap-html-api.php new file mode 100644 index 0000000000000..aa9ac94e2689a --- /dev/null +++ b/bootstrap-html-api.php @@ -0,0 +1,46 @@ +', '"' ), array( '<', '>', '"' ), $s ); + } +} + +if ( ! function_exists( '__' ) ) { + function __( $s ) { + return $s; + } +} + +if ( ! function_exists( '_doing_it_wrong' ) ) { + function _doing_it_wrong( $message ) { + trigger_error( $message ); + } +} + +if ( ! 
function_exists( 'wp_kses_uri_attributes' ) ) { + function wp_kses_uri_attributes() { + return array(); + } +} diff --git a/src/wp-includes/html-api/class-wp-html-tag-processor.php b/src/wp-includes/html-api/class-wp-html-tag-processor.php index 8397ecf520fa2..9368c72b8c380 100644 --- a/src/wp-includes/html-api/class-wp-html-tag-processor.php +++ b/src/wp-includes/html-api/class-wp-html-tag-processor.php @@ -439,6 +439,16 @@ class WP_HTML_Tag_Processor { */ protected $html; + /** + * Cached byte length of the HTML string. + * + * Updated whenever $this->html is set to avoid repeated strlen() calls. + * + * @since 6.9.0 + * @var int + */ + private $doc_length = 0; + /** * The last query passed to next_tag(). * @@ -682,6 +692,16 @@ class WP_HTML_Tag_Processor { */ private $is_closing_tag; + /** + * The token_starts_at value when attributes were last parsed. + * + * Used to detect whether cached attributes are stale. When this + * doesn't match token_starts_at, attributes need re-parsing. + * + * @var int + */ + private $attributes_parsed_at = -1; + /** * Lazily-built index of attributes found within an HTML tag, keyed by the attribute name. * @@ -842,7 +862,8 @@ public function __construct( $html ) { ); $html = ''; } - $this->html = $html; + $this->html = $html; + $this->doc_length = strlen( $html ); } /** @@ -953,77 +974,182 @@ public function next_token(): bool { */ private function base_class_next_token(): bool { $was_at = $this->bytes_already_parsed; - $this->after_tag(); - - // Don't proceed if there's nothing more to scan. - if ( - self::STATE_COMPLETE === $this->parser_state || - self::STATE_INCOMPLETE_INPUT === $this->parser_state - ) { - return false; - } + $at = $was_at; /* - * The next step in the parsing loop determines the parsing state; - * clear it so that state doesn't linger from the previous step. + * Apply attribute updates and clean up the previous tag. + * Inlined from after_tag() to avoid method call overhead + * in the hot tokenization loop. 
*/ - $this->parser_state = self::STATE_READY; + if ( $this->classname_updates || $this->lexical_updates ) { + $this->class_name_updates_to_attributes_updates(); - if ( $this->bytes_already_parsed >= strlen( $this->html ) ) { - $this->parser_state = self::STATE_COMPLETE; - return false; - } + if ( 1000 < count( $this->lexical_updates ) ) { + $this->get_updated_html(); + } - // Find the next tag if it exists. - if ( false === $this->parse_next_tag() ) { - if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { - $this->bytes_already_parsed = $was_at; + foreach ( $this->lexical_updates as $name => $update ) { + if ( $update->start >= $this->bytes_already_parsed ) { + $this->get_updated_html(); + break; + } + + if ( is_int( $name ) ) { + continue; + } + + $this->lexical_updates[] = $update; + unset( $this->lexical_updates[ $name ] ); } + $at = $this->bytes_already_parsed; + } + + $html = $this->html; + $doc_length = $this->doc_length; + + if ( $at >= $doc_length ) { + if ( self::STATE_INCOMPLETE_INPUT !== $this->parser_state ) { + $this->parser_state = self::STATE_COMPLETE; + } return false; } /* - * For legacy reasons the rest of this function handles tags and their - * attributes. If the processor has reached the end of the document - * or if it matched any other token then it should return here to avoid - * attempting to process tag-specific syntax. + * Fast path: handle the two most common token types inline. + * + * 1. At '<': try to match a regular tag directly (skip strpos). + * 2. Text nodes: text between tags (strpos finds next '<'). + * + * Complex tokens (comments, DOCTYPE, CDATA, etc.) fall through + * to the full parse_next_tag() method. */ - if ( - self::STATE_INCOMPLETE_INPUT !== $this->parser_state && - self::STATE_COMPLETE !== $this->parser_state && - self::STATE_MATCHED_TAG !== $this->parser_state - ) { + if ( '<' !== $html[ $at ] ) { + $at = strpos( $html, '<', $at ); + + // No '<' found: the rest of the document is a text node. 
+ if ( false === $at ) { + $this->parser_state = self::STATE_TEXT_NODE; + $this->token_starts_at = $was_at; + $this->text_starts_at = $was_at; + $this->text_length = $doc_length - $was_at; + $this->text_node_classification = self::TEXT_IS_GENERIC; + $this->bytes_already_parsed = $doc_length; + return true; + } + + // Validate the '<' starts a valid token before returning text. + $next_byte = $html[ $at + 1 ] ?? ''; + if ( + '!' !== $next_byte && '/' !== $next_byte && '?' !== $next_byte && + ! ctype_alpha( $next_byte ) + ) { + /* + * The '<' doesn't start a valid token. Fall through to + * the full parse_next_tag() which handles continuation. + */ + goto full_parse; + } + + $this->parser_state = self::STATE_TEXT_NODE; + $this->token_starts_at = $was_at; + $this->text_starts_at = $was_at; + $this->text_length = $at - $was_at; + $this->text_node_classification = self::TEXT_IS_GENERIC; + $this->bytes_already_parsed = $at; return true; } - // Parse all of its attributes. - while ( $this->parse_next_attribute() ) { - continue; + // At '<': try to match a regular tag. + $first_char = $html[ $at + 1 ] ?? ''; + $is_closer = '/' === $first_char; + if ( $is_closer ) { + $first_char = $html[ $at + 2 ] ?? ''; } - // Ensure that the tag closes before the end of the document. - if ( - self::STATE_INCOMPLETE_INPUT === $this->parser_state || - $this->bytes_already_parsed >= strlen( $this->html ) - ) { - // Does this appropriately clear state (parsed attributes)? - $this->parser_state = self::STATE_INCOMPLETE_INPUT; - $this->bytes_already_parsed = $was_at; + if ( ctype_alpha( $first_char ) ) { + $tag_at = $at + 1 + ( $is_closer ? 1 : 0 ); + $tag_length = strcspn( $html, " \t\f\r\n/>", $tag_at ); + $after_name = $tag_at + $tag_length; + + $this->token_starts_at = $at; + $this->tag_name_starts_at = $tag_at; + $this->tag_name_length = $tag_length; + + // Fast path: '>' immediately after tag name. 
+ if ( $after_name < $doc_length && '>' === $html[ $after_name ] ) { + $tag_ends_at = $after_name; + } else { + $this->bytes_already_parsed = $after_name; + $tag_ends_at = $this->skip_attributes_and_find_closer( $html, $doc_length ); + if ( false === $tag_ends_at ) { + $this->parser_state = self::STATE_INCOMPLETE_INPUT; + $this->bytes_already_parsed = $doc_length; + return false; + } + } + + $this->parser_state = self::STATE_MATCHED_TAG; + $this->bytes_already_parsed = $tag_ends_at + 1; + + if ( $is_closer ) { + return true; + } + + /* + * Quick length filter for special elements before goto. + * Special element names have lengths 3, 5, 6, 7, or 8. + * Common tags with other lengths (a, p, br, li, span, code, etc.) + * can return immediately without the goto dispatch. + */ + if ( $tag_length < 3 || $tag_length > 8 || 4 === $tag_length ) { + return true; + } + + goto after_tag_match; + } + + // Complex token: fall through to full parse_next_tag(). + full_parse: + + /* + * Reset state for the full parse path. + */ + $this->parser_state = self::STATE_READY; + $this->tag_name_starts_at = null; + $this->tag_name_length = null; + $this->text_starts_at = 0; + $this->text_length = 0; + $this->text_node_classification = self::TEXT_IS_GENERIC; + + if ( false === $this->parse_next_tag() ) { + if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { + $this->bytes_already_parsed = $doc_length; + } return false; } - $tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed ); + if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { + return true; + } + + // Tag found by parse_next_tag — scan attributes. 
+ $tag_ends_at = $this->skip_attributes_and_find_closer( $html, $doc_length ); if ( false === $tag_ends_at ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; - $this->bytes_already_parsed = $was_at; + $this->bytes_already_parsed = $doc_length; return false; } $this->parser_state = self::STATE_MATCHED_TAG; $this->bytes_already_parsed = $tag_ends_at + 1; - $this->token_length = $this->bytes_already_parsed - $this->token_starts_at; + + if ( '/' === $html[ $this->token_starts_at + 1 ] ) { + return true; + } + + after_tag_match: /* * Certain tags require additional processing. The first-letter pre-check @@ -1040,10 +1166,19 @@ private function base_class_next_token(): bool { * - TITLE * - XMP (deprecated) */ + if ( 'html' !== $this->parsing_namespace ) { + return true; + } + + /* + * Quick length filter: special elements have name lengths 3, 5, 6, 7, or 8. + * Checking length before the first-letter strspn avoids a function call for + * the many common tags (a, p, li, div, span, etc.) with non-matching lengths. + */ + $special_tag_name_length = $this->tag_name_length; if ( - $this->is_closing_tag || - 'html' !== $this->parsing_namespace || - 1 !== strspn( $this->html, 'iIlLnNpPsStTxX', $this->tag_name_starts_at, 1 ) + $special_tag_name_length < 3 || $special_tag_name_length > 8 || 4 === $special_tag_name_length || + 1 !== strspn( $html, 'iIlLnNpPsStTxX', $this->tag_name_starts_at, 1 ) ) { return true; } @@ -1078,7 +1213,8 @@ private function base_class_next_token(): bool { */ $tag_name_starts_at = $this->tag_name_starts_at; $tag_name_length = $this->tag_name_length; - $tag_ends_at = $this->token_starts_at + $this->token_length; + $tag_ends_at = $this->bytes_already_parsed; + $this->ensure_attributes_parsed(); $attributes = $this->attributes; $duplicate_attributes = $this->duplicate_attributes; @@ -1119,7 +1255,7 @@ private function base_class_next_token(): bool { if ( ! 
$found_closer ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; - $this->bytes_already_parsed = $was_at; + $this->bytes_already_parsed = $doc_length; return false; } @@ -1131,7 +1267,6 @@ private function base_class_next_token(): bool { * the inner content of the tag. */ $this->token_starts_at = $was_at; - $this->token_length = $this->bytes_already_parsed - $this->token_starts_at; $this->text_starts_at = $tag_ends_at; $this->text_length = $this->tag_name_starts_at - $this->text_starts_at; $this->tag_name_starts_at = $tag_name_starts_at; @@ -1354,7 +1489,7 @@ public function set_bookmark( $name ): bool { return false; } - $this->bookmarks[ $name ] = new WP_HTML_Span( $this->token_starts_at, $this->token_length ); + $this->bookmarks[ $name ] = new WP_HTML_Span( $this->token_starts_at, $this->bytes_already_parsed - $this->token_starts_at ); return true; } @@ -1412,7 +1547,7 @@ private function skip_rawtext( string $tag_name ): bool { */ private function skip_rcdata( string $tag_name ): bool { $html = $this->html; - $doc_length = strlen( $html ); + $doc_length = $this->doc_length; $tag_length = strlen( $tag_name ); $at = $this->bytes_already_parsed; @@ -1449,7 +1584,7 @@ private function skip_rcdata( string $tag_name ): bool { $at += $tag_length; $this->bytes_already_parsed = $at; - if ( $at >= strlen( $html ) ) { + if ( $at >= $doc_length ) { return false; } @@ -1502,7 +1637,7 @@ private function skip_rcdata( string $tag_name ): bool { private function skip_script_data(): bool { $state = 'unescaped'; $html = $this->html; - $doc_length = strlen( $html ); + $doc_length = $this->doc_length; $at = $this->bytes_already_parsed; while ( false !== $at && $at < $doc_length ) { @@ -1710,10 +1845,8 @@ private function skip_script_data(): bool { * @return bool Whether a tag was found before the end of the document. 
*/ private function parse_next_tag(): bool { - $this->after_tag(); - $html = $this->html; - $doc_length = strlen( $html ); + $doc_length = $this->doc_length; $was_at = $this->bytes_already_parsed; $at = $was_at; @@ -1736,27 +1869,30 @@ private function parse_next_tag(): bool { * * @see https://html.spec.whatwg.org/#tag-open-state */ - if ( 1 !== strspn( $html, '!/?abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', $at + 1, 1 ) ) { + $next_byte = $html[ $at + 1 ] ?? ''; + if ( + '!' !== $next_byte && '/' !== $next_byte && '?' !== $next_byte && + ! ctype_alpha( $next_byte ) + ) { ++$at; continue; } $this->parser_state = self::STATE_TEXT_NODE; $this->token_starts_at = $was_at; - $this->token_length = $at - $was_at; $this->text_starts_at = $was_at; - $this->text_length = $this->token_length; + $this->text_length = $at - $was_at; $this->bytes_already_parsed = $at; return true; } $this->token_starts_at = $at; - if ( $at + 1 < $doc_length && '/' === $this->html[ $at + 1 ] ) { - $this->is_closing_tag = true; + if ( $at + 1 < $doc_length && '/' === $html[ $at + 1 ] ) { + $is_closer = true; ++$at; } else { - $this->is_closing_tag = false; + $is_closer = false; } /* @@ -1773,12 +1909,12 @@ private function parse_next_tag(): bool { * * https://html.spec.whatwg.org/multipage/parsing.html#data-state * * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state */ - $tag_name_prefix_length = strspn( $html, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', $at + 1 ); - if ( $tag_name_prefix_length > 0 ) { + $first_char = $html[ $at + 1 ] ?? 
''; + if ( ctype_alpha( $first_char ) ) { ++$at; $this->parser_state = self::STATE_MATCHED_TAG; $this->tag_name_starts_at = $at; - $this->tag_name_length = $tag_name_prefix_length + strcspn( $html, " \t\f\r\n/>", $at + $tag_name_prefix_length ); + $this->tag_name_length = strcspn( $html, " \t\f\r\n/>", $at ); $this->bytes_already_parsed = $at + $this->tag_name_length; return true; } @@ -1797,7 +1933,7 @@ private function parse_next_tag(): bool { * `is_closing_tag && '!' === $html[ $at + 1 ] ) { + if ( ! $is_closer && '!' === $html[ $at + 1 ] ) { /* * `