parse_next_tag() is only called from base_class_next_token(), which already calls after_tag() before invoking parse_next_tag(). The second call is redundant — all state has already been cleaned up. ~8% improvement in tokenization throughput (1250ms → 1150ms).
parse_next_attribute() accesses $this->html 8+ times per call via object property lookup. Local variable access is faster in PHP. ~4% improvement (1150ms → 1105ms).
Replace two skip_whitespace() method calls with inline strspn() using the local $html variable. This avoids function call overhead and allows PHP's JIT to optimize the entire parse_next_attribute() as one unit. Massive improvement: 1105ms → 316ms (~71% faster).
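The local-variable and inline-strspn() pattern described above can be sketched as follows. This is an illustrative standalone function, not the shipped WP_HTML_Tag_Processor code; the names `$html` and `$at` mirror the local variables mentioned in the commit messages.

```php
<?php
// Hypothetical sketch: skip whitespace with an inline strspn() on a locally
// cached copy of the document, instead of a skip_whitespace() method call.
// The character set covers the HTML space characters.
function skip_whitespace_inline( string $html, int $at ): int {
	// strspn() returns the length of the initial run of the given bytes,
	// so this advances $at past any contiguous whitespace in one C call.
	return $at + strspn( $html, " \t\f\r\n", $at );
}

$html = '<div   class="x">';
$at   = 4; // position just after the tag name "div"
$at   = skip_whitespace_inline( $html, $at ); // now points at 'c' of "class"
```

Because the loop body becomes a handful of local reads and one internal function call, PHP's JIT can keep the whole scan in one compiled unit, which is consistent with the large speedup reported above.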
Avoid repeated property lookups and strlen() calls by caching in local variables. Eliminates 2 strlen() calls and 1 property access per token.
Guard class_name_updates_to_attributes_updates() and lexical_updates processing behind empty() checks. In read-only mode (the common case), these arrays are always empty, avoiding unnecessary function calls. ~7% improvement (312ms → 291ms).
Replace ~15 $this->bytes_already_parsed property accesses with a local $at variable, writing back once at the end. Local variable access is significantly faster than object property access in PHP's VM. ~2.3% improvement (291ms → 284ms).
parse_next_tag() is only called from base_class_next_token(), which already calls after_tag() immediately before. The second call does redundant work resetting state that was just reset. 852ms → 757ms (-11%)
For read-only tokenization, classname_updates and lexical_updates are always empty. Guard the update processing with a count check to avoid calling class_name_updates_to_attributes_updates() and iterating an empty array. 757ms → 723ms (-4.5%)
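The guard described above can be sketched like this. The function and parameter names are assumptions for illustration; the real processor applies these updates to internal state rather than counting them.

```php
<?php
// Sketch: skip the update-application step entirely when no updates were
// enqueued, which is always the case during read-only tokenization.
function apply_updates_if_any( array $classname_updates, array $lexical_updates ): int {
	$applied = 0;
	// A truthiness check on an array is cheaper than count( $x ) > 0 and
	// avoids the processing calls entirely in the common empty case.
	if ( $classname_updates || $lexical_updates ) {
		foreach ( $lexical_updates as $update ) {
			$applied++;
		}
	}
	return $applied;
}
```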
Avoids repeated property lookups and strlen() calls in the main tokenization loop.
Replace repeated $this->bytes_already_parsed and $this->html property access with local variables. Also inlines skip_whitespace() calls to avoid method call overhead and use the local $html variable. 728ms → 702ms (-3.6%)
For closing tags, attribute names are parsed but never stored, so the substr() and strtolower() calls are wasted. Move them after the closing tag early-return. 702ms → 696ms
Add skip_attributes_and_find_closer(). Replaces the parse_next_attribute(false) while loop + strpos('>') with a single method call that scans past all attributes and finds the tag-closing '>' in one pass. Eliminates N function call overheads per tag during read-only tokenization.
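A minimal sketch of such a single-pass scan is below. It assumes well-formed, double-quoted attributes for illustration; the real method must also handle single quotes, unquoted values, and invalid markup.

```php
<?php
// Sketch of a combined "skip attributes and find the tag closer" scan.
// Returns the index of the closing '>' or null if input is incomplete.
function skip_attributes_and_find_closer( string $html, int $at ): ?int {
	$length = strlen( $html );
	while ( $at < $length ) {
		$at += strspn( $html, " \t\f\r\n/", $at ); // skip separators
		if ( $at >= $length ) {
			return null;
		}
		if ( '>' === $html[ $at ] ) {
			return $at; // found the tag closer
		}
		// Skip the attribute name.
		$at += strcspn( $html, "=/> \t\f\r\n", $at );
		if ( $at < $length && '=' === $html[ $at ] && '"' === ( $html[ $at + 1 ] ?? '' ) ) {
			// Double-quoted value: jump straight to the closing quote.
			$close = strpos( $html, '"', $at + 2 );
			if ( false === $close ) {
				return null;
			}
			$at = $close + 1;
		}
	}
	return null;
}
```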
- Use local $html variable instead of $this->html for array access in parse_next_tag()
- Avoid property read for text_length by computing $at - $was_at directly
- Use truthiness check instead of count() > 0 for empty array detection in after_tag()
- Move attributes/duplicate_attributes reset from after_tag() to ensure_attributes_parsed() to avoid empty array allocation on every token during read-only tokenization
Fast path for tags with an immediate '>'. Tags without attributes (closing tags, simple void tags) have '>' immediately after the tag name. A direct byte check before entering the attribute-scanning loop avoids strspn/strcspn overhead for these common cases.
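The check itself is a single byte comparison; a hedged sketch (names are illustrative, not the shipped implementation):

```php
<?php
// Peek at the byte right after the tag name before entering the
// attribute-scanning machinery.
function tag_closes_immediately( string $html, int $after_tag_name ): bool {
	return isset( $html[ $after_tag_name ] ) && '>' === $html[ $after_tag_name ];
}

// "</div>": the '>' follows the name directly, so attribute scanning
// can be skipped entirely for this token.
```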
Direct character comparisons in parse_next_tag(). Replace strspn() calls used for single-character validation with direct character range comparisons. Also simplify the tag name length calculation to use a single strcspn() instead of strspn() + strcspn(), since the first-character alpha check is now done via direct comparison.
Inline after_tag() and remove redundant property resets. Inline the single-callsite after_tag() method into base_class_next_token() to eliminate method call overhead in the hot tokenization loop. Also remove 4 property resets (token_starts_at, token_length, is_closing_tag, comment_type) that are always overwritten before use or guarded by parser_state checks.
Inline the most common token paths (text nodes and regular tags) directly into base_class_next_token(), eliminating the parse_next_tag() function call for ~95% of tokens. Complex tokens (comments, DOCTYPE, CDATA) still use the full parse_next_tag() method. Also integrates the skip_attributes_and_find_closer fast path for tags with '>' immediately after the tag name, avoiding an additional method call for closing tags and attribute-less tags.
Lazy property resets in base_class_next_token(). Instead of resetting all token properties at the top of every iteration, only reset the properties relevant to each token type at its return point. This eliminates ~3.4M unnecessary property writes per benchmark run.
Special elements (SCRIPT, STYLE, TEXTAREA, etc.) have name lengths of 3, 5, 6, 7, or 8 characters. Adding a quick length check before calling get_tag() avoids substr() + strtoupper() allocations for the many common tags (p, li, span, div, section, etc.) that pass the first-letter check but can never be special elements.
Move the tag name length filter before the strspn first-letter check. This avoids a function call for the many common tags with non-matching lengths (a, p, li, div, span, etc.) that would pass the closing-tag check but be eliminated by strspn.
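The length pre-filter from the two commits above can be sketched with a bitmask; the mask trick is an illustration of the idea, not necessarily the shipped implementation. Special element names (e.g. XMP, STYLE, SCRIPT, IFRAME, NOEMBED, TEXTAREA, NOSCRIPT) have lengths of only 3, 5, 6, 7, or 8 bytes.

```php
<?php
// Bit N is set when a special element name of length N exists.
const SPECIAL_LENGTH_MASK = ( 1 << 3 ) | ( 1 << 5 ) | ( 1 << 6 ) | ( 1 << 7 ) | ( 1 << 8 );

// Tags whose name length can never match a special element (a, p, br,
// li, span, code, ...) skip the substr() + strtoupper() comparison.
function could_be_special_element( int $tag_name_length ): bool {
	return $tag_name_length <= 8
		&& (bool) ( ( SPECIAL_LENGTH_MASK >> $tag_name_length ) & 1 );
}
```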
Restructure skip_attributes_and_find_closer() to check for '=' and quote characters directly after the attribute name, avoiding two strspn() calls per attribute that almost always return 0. Well-formed attributes like class="foo" now go through a fast path with only byte-level comparisons + a single strpos for the closing quote.
Replace the strspn() call at the top of the attribute scanning loop with direct byte comparisons for the two most common cases: a single space separator between attributes and the '>' tag closer. Only falls back to strspn() for uncommon whitespace characters (tabs, newlines, form feeds, slashes).
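A sketch of those common-case byte checks, under the assumption of a standalone helper (the real code is inlined in the scanning loop):

```php
<?php
// Test the two most common cases directly: the '>' tag closer and a
// single-space separator; fall back to strspn() only for runs of
// uncommon whitespace (tabs, newlines, form feeds, slashes).
function advance_past_separator( string $html, int $at ): int {
	$byte = $html[ $at ] ?? '';
	if ( '>' === $byte ) {
		return $at; // tag closer: nothing to skip
	}
	if ( ' ' === $byte ) {
		$next = $html[ $at + 1 ] ?? '';
		if ( 0 === strspn( $next, " \t\f\r\n/" ) ) {
			return $at + 1; // single-space separator fast path
		}
	}
	return $at + strspn( $html, " \t\f\r\n/", $at );
}
```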
Remove the STATE_COMPLETE check from base_class_next_token() since the $at >= $doc_length bounds check already handles this case. Also remove the text_node_classification property write from the tag fast path since this value is never read for tag tokens.
Set text_starts_at to null instead of 0 for tag tokens in the fast path, and remove the text_length = 0 write. get_modifiable_text() already returns '' when text_starts_at is null, making the text_length write redundant. Saves one property write per tag token.
Cache bytes_already_parsed in a local variable at function entry and only re-read the property when lexical updates have been applied, as those may adjust the position. Saves one property read per token in the common read-only tokenization case.
Eliminates the attribute_scan_from property entirely. The scan position is now computed as tag_name_starts_at + tag_name_length in ensure_attributes_parsed() on demand, avoiding ~646K property writes per benchmark iteration.
All callers of ensure_attributes_parsed() guard with STATE_MATCHED_TAG check, so the attributes_parsed flag is never read for text nodes. Removing this write saves ~378K property writes per benchmark iteration.
Text nodes don't need to null out tag_name_starts_at and tag_name_length. Instead, restructure get_tag() to check parser_state first, which correctly returns null for non-tag tokens without requiring the null sentinel. Saves ~756K property writes per benchmark iteration.
Closing tags never need special element processing, so return immediately from the fast path instead of goto after_tag_match. Avoids property reads for is_closing_tag and parsing_namespace checks for ~300K+ closing tags. Also avoid re-reading bytes_already_parsed for token_length computation.
Both fast path and full_parse path now return early for closing tags before reaching after_tag_match, which now only handles opening tags. Eliminates is_closing_tag property read from the shared after_tag_match section.
Check for '<' at current position before calling strpos(). When tokens alternate text→tag, the tag iteration starts at '<' and can skip strpos entirely. Since ~63% of tokens are tags, this eliminates ~646K strpos calls per benchmark iteration.
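The idea reduces to one byte comparison before the scan; a minimal sketch (the helper name is hypothetical):

```php
<?php
// When the cursor already sits on '<' (a tag immediately follows the
// previous token), the strpos() scan can be skipped entirely.
// Returns the index of the next '<', or false if none remains.
function find_next_tag_opener( string $html, int $at ) {
	if ( isset( $html[ $at ] ) && '<' === $html[ $at ] ) {
		return $at; // already at a tag: no scan needed
	}
	return strpos( $html, '<', $at );
}
```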
Instead of setting text_starts_at = null for every tag to prevent stale text in get_modifiable_text(), add a bounds check that detects stale text_starts_at (from previous text nodes) by comparing against token_starts_at. Saves ~646K property writes.
Current: ~330ms (52.8% faster than 699ms baseline)
Replace the attributes_parsed flag with a version check. Use attributes_parsed_at, an integer compared against token_starts_at, to detect stale attribute data, eliminating ~646K attributes_parsed=false writes per parse iteration. The version check in ensure_attributes_parsed() automatically invalidates when a new token is parsed.
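The version-check invalidation can be sketched as follows; the class and property names mirror the commit message but this is an illustrative reduction, not the real processor.

```php
<?php
// Instead of resetting a boolean flag for every token, compare an
// integer "parsed at" marker against the current token's start byte.
class AttributeCache {
	public int $token_starts_at      = 0;
	public int $attributes_parsed_at = -1;

	public function attributes_are_fresh(): bool {
		// Stale data from a previous token fails this comparison
		// automatically once a new token is parsed, so no per-token
		// reset write is needed.
		return $this->attributes_parsed_at === $this->token_starts_at;
	}

	public function mark_attributes_parsed(): void {
		$this->attributes_parsed_at = $this->token_starts_at;
	}
}
```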
Check tag name length against special element lengths (3,5,6,7,8) before goto after_tag_match. Common tags with lengths 1,2,4 (a, p, br, li, span, code, etc.) return immediately, avoiding goto dispatch + namespace read + property reads for the majority of opener tags.
Remove the dedicated parser_state read for STATE_INCOMPLETE_INPUT at the top of the hot loop. Instead, set bytes_already_parsed = doc_length when incomplete input is detected, so the bounds check ($at >= $doc_length) handles it. This eliminates one property read per token (~1M reads per pass).
Update research notes and ideas backlog. Document new wins (attributes_parsed_at, fast-path length filter, merged incomplete check), failed experiments, and detailed architecture notes from profiling data, and create an ideas backlog for the next session.
token_length = bytes_already_parsed - token_starts_at holds at every write site. Derive the value on demand at the ~6 read sites instead of writing it at ~20 write sites (~1M writes/pass eliminated).
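A sketch of deriving the value from the stated invariant; the class is a reduction for illustration, not the real processor.

```php
<?php
// token_length = bytes_already_parsed - token_starts_at holds at every
// write site, so the value can be computed on demand at the few read
// sites instead of being written each time the parser advances.
class TokenSpan {
	public int $token_starts_at      = 0;
	public int $bytes_already_parsed = 0;

	public function get_token_length(): int {
		return $this->bytes_already_parsed - $this->token_starts_at;
	}
}
```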
Eliminate is_closing_tag property writes (~646K per pass) by computing '/' === html[token_starts_at + 1] at the ~8 read sites. Uses local variable in parse_next_tag() for the slow path.
Replace inline range comparisons with ctype_alpha() which dispatches to a single C-level function call. Applies to all four sites: fast path tag check, fast path text validation, parse_next_tag() tag check, and parse_next_tag() text validation.
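The two forms of the check are shown side by side below; for single-byte strings they agree, and ctype_alpha() dispatches to one C-level call.

```php
<?php
// Inline range comparisons: two to four comparisons in PHP opcodes.
function is_tag_name_start_inline( string $byte ): bool {
	return ( 'a' <= $byte && 'z' >= $byte ) || ( 'A' <= $byte && 'Z' >= $byte );
}

// ctype_alpha(): a single internal function call. Note its documented
// quirks with integer arguments; here it is only ever given a
// single-character string.
function is_tag_name_start_ctype( string $byte ): bool {
	return ctype_alpha( $byte );
}
```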
Summary of Changes (Gemini Code Assist): This pull request delivers substantial performance enhancements to the HTML Tag Processor by implementing a series of micro-optimizations within the core tokenization logic. The changes focus on reducing overhead, streamlining attribute parsing, and optimizing common code paths, leading to a significant speedup in processing HTML. The comprehensive approach is well documented with detailed research and benchmarking results.
Code Review
This pull request introduces significant performance optimizations to the WP_HTML_Tag_Processor, primarily within the base_class_next_token method. The changes include inlining function calls, adding fast paths for common tokens, caching values like document length, and implementing lazy parsing for attributes. These optimizations are well-documented in the new autoresearch.md file, which provides excellent context for the changes. The code modifications are extensive but appear to be correct and consistent with the stated performance goals. I have one suggestion for improving the readability of the new benchmark script.
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_token() ) {
}
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_token() ) {
}
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_token() ) {
}
The benchmark script repeats the tokenization process three times. This repetition can be made more concise and maintainable by using a for loop. This avoids code duplication and makes it easier to change the number of iterations in the future.
for ( $i = 0; $i < 3; $i++ ) {
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_token() ) {
}
}