parse_next_tag() is only called from base_class_next_token(), which already calls after_tag() before invoking parse_next_tag(). The second call is redundant — all state has already been cleaned up. ~8% improvement in tokenization throughput (1250ms → 1150ms).
parse_next_attribute() accesses $this->html 8+ times per call via object property lookup. Local variable access is faster in PHP. ~4% improvement (1150ms → 1105ms).
Replace two skip_whitespace() method calls with inline strspn() using the local $html variable. This avoids function call overhead and allows PHP's JIT to optimize the entire parse_next_attribute() as one unit. Massive improvement: 1105ms → 316ms (~71% faster).
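The local-variable and inline-strspn() pattern described above can be sketched as follows. This is an illustrative standalone function, not the shipped WP_HTML_Tag_Processor code; the names `$html` and `$at` mirror the local variables mentioned in the commit messages.

```php
<?php
// Hypothetical sketch: skip whitespace with an inline strspn() on a locally
// cached copy of the document, instead of a skip_whitespace() method call.
// The character set covers the HTML space characters.
function skip_whitespace_inline( string $html, int $at ): int {
	// strspn() returns the length of the initial run of the given bytes,
	// so this advances $at past any contiguous whitespace in one C call.
	return $at + strspn( $html, " \t\f\r\n", $at );
}

$html = '<div   class="x">';
$at   = 4; // position just after the tag name "div"
$at   = skip_whitespace_inline( $html, $at ); // now points at 'c' of "class"
```

Because the loop body becomes a handful of local reads and one internal function call, PHP's JIT can keep the whole scan in one compiled unit, which is consistent with the large speedup reported above.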
Avoid repeated property lookups and strlen() calls by caching in local variables. Eliminates 2 strlen() calls and 1 property access per token.
Guard class_name_updates_to_attributes_updates() and lexical_updates processing behind empty() checks. In read-only mode (the common case), these arrays are always empty, avoiding unnecessary function calls. ~7% improvement (312ms → 291ms).
Replace ~15 $this->bytes_already_parsed property accesses with a local $at variable, writing back once at the end. Local variable access is significantly faster than object property access in PHP's VM. ~2.3% improvement (291ms → 284ms).
parse_next_tag() is only called from base_class_next_token(), which already calls after_tag() immediately before. The second call does redundant work resetting state that was just reset. 852ms → 757ms (-11%)
For read-only tokenization, classname_updates and lexical_updates are always empty. Guard the update processing with a count check to avoid calling class_name_updates_to_attributes_updates() and iterating an empty array. 757ms → 723ms (-4.5%)
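The guard described above can be sketched like this. The function and parameter names are assumptions for illustration; the real processor applies these updates to internal state rather than counting them.

```php
<?php
// Sketch: skip the update-application step entirely when no updates were
// enqueued, which is always the case during read-only tokenization.
function apply_updates_if_any( array $classname_updates, array $lexical_updates ): int {
	$applied = 0;
	// A truthiness check on an array is cheaper than count( $x ) > 0 and
	// avoids the processing calls entirely in the common empty case.
	if ( $classname_updates || $lexical_updates ) {
		foreach ( $lexical_updates as $update ) {
			$applied++;
		}
	}
	return $applied;
}
```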
Avoids repeated property lookups and strlen() calls in the main tokenization loop.
Replace repeated $this->bytes_already_parsed and $this->html property access with local variables. Also inlines skip_whitespace() calls to avoid method call overhead and use the local $html variable. 728ms → 702ms (-3.6%)
For closing tags, attribute names are parsed but never stored, so the substr() and strtolower() calls are wasted. Move them after the closing tag early-return. 702ms → 696ms
Add skip_attributes_and_find_closer(). Replaces the parse_next_attribute(false) while loop + strpos('>') with a single method call that scans past all attributes and finds the tag-closing '>' in one pass. Eliminates N function call overheads per tag during read-only tokenization.
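A minimal sketch of such a single-pass scan is below. It assumes well-formed, double-quoted attributes for illustration; the real method must also handle single quotes, unquoted values, and invalid markup.

```php
<?php
// Sketch of a combined "skip attributes and find the tag closer" scan.
// Returns the index of the closing '>' or null if input is incomplete.
function skip_attributes_and_find_closer( string $html, int $at ): ?int {
	$length = strlen( $html );
	while ( $at < $length ) {
		$at += strspn( $html, " \t\f\r\n/", $at ); // skip separators
		if ( $at >= $length ) {
			return null;
		}
		if ( '>' === $html[ $at ] ) {
			return $at; // found the tag closer
		}
		// Skip the attribute name.
		$at += strcspn( $html, "=/> \t\f\r\n", $at );
		if ( $at < $length && '=' === $html[ $at ] && '"' === ( $html[ $at + 1 ] ?? '' ) ) {
			// Double-quoted value: jump straight to the closing quote.
			$close = strpos( $html, '"', $at + 2 );
			if ( false === $close ) {
				return null;
			}
			$at = $close + 1;
		}
	}
	return null;
}
```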
- Use local $html variable instead of $this->html for array access in parse_next_tag()
- Avoid property read for text_length by computing $at - $was_at directly
- Use truthiness check instead of count() > 0 for empty array detection in after_tag()
- Move attributes/duplicate_attributes reset from after_tag() to ensure_attributes_parsed() to avoid empty array allocation on every token during read-only tokenization
Fast path for tags with an immediate '>'. Tags without attributes (closing tags, simple void tags) have '>' immediately after the tag name. A direct byte check before entering the attribute-scanning loop avoids strspn/strcspn overhead for these common cases.
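The check itself is a single byte comparison; a hedged sketch (names are illustrative, not the shipped implementation):

```php
<?php
// Peek at the byte right after the tag name before entering the
// attribute-scanning machinery.
function tag_closes_immediately( string $html, int $after_tag_name ): bool {
	return isset( $html[ $after_tag_name ] ) && '>' === $html[ $after_tag_name ];
}

// "</div>": the '>' follows the name directly, so attribute scanning
// can be skipped entirely for this token.
```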
Direct character comparisons in parse_next_tag(). Replace strspn() calls used for single-character validation with direct character range comparisons. Also simplify the tag name length calculation to use a single strcspn() instead of strspn() + strcspn(), since the first-character alpha check is now done via direct comparison.
Inline after_tag() and remove redundant property resets. Inline the single-callsite after_tag() method into base_class_next_token() to eliminate method call overhead in the hot tokenization loop. Also remove 4 property resets (token_starts_at, token_length, is_closing_tag, comment_type) that are always overwritten before use or guarded by parser_state checks.
Inline the most common token paths (text nodes and regular tags) directly into base_class_next_token(), eliminating the parse_next_tag() function call for ~95% of tokens. Complex tokens (comments, DOCTYPE, CDATA) still use the full parse_next_tag() method. Also integrates the skip_attributes_and_find_closer fast path for tags with '>' immediately after the tag name, avoiding an additional method call for closing tags and attribute-less tags.
Lazy property resets in base_class_next_token(). Instead of resetting all token properties at the top of every iteration, only reset the properties relevant to each token type at its return point. This eliminates ~3.4M unnecessary property writes per benchmark run.
Special elements (SCRIPT, STYLE, TEXTAREA, etc.) have name lengths of 3, 5, 6, 7, or 8 characters. Adding a quick length check before calling get_tag() avoids substr() + strtoupper() allocations for the many common tags (p, li, span, div, section, etc.) that pass the first-letter check but can never be special elements.
Move the tag name length filter before the strspn first-letter check. This avoids a function call for the many common tags with non-matching lengths (a, p, li, div, span, etc.) that would pass the closing-tag check but be eliminated by strspn.
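The length pre-filter from the two commits above can be sketched with a bitmask; the mask trick is an illustration of the idea, not necessarily the shipped implementation. Special element names (e.g. XMP, STYLE, SCRIPT, IFRAME, NOEMBED, TEXTAREA, NOSCRIPT) have lengths of only 3, 5, 6, 7, or 8 bytes.

```php
<?php
// Bit N is set when a special element name of length N exists.
const SPECIAL_LENGTH_MASK = ( 1 << 3 ) | ( 1 << 5 ) | ( 1 << 6 ) | ( 1 << 7 ) | ( 1 << 8 );

// Tags whose name length can never match a special element (a, p, br,
// li, span, code, ...) skip the substr() + strtoupper() comparison.
function could_be_special_element( int $tag_name_length ): bool {
	return $tag_name_length <= 8
		&& (bool) ( ( SPECIAL_LENGTH_MASK >> $tag_name_length ) & 1 );
}
```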
Restructure skip_attributes_and_find_closer() to check for '=' and quote characters directly after the attribute name, avoiding two strspn() calls per attribute that almost always return 0. Well-formed attributes like class="foo" now go through a fast path with only byte-level comparisons + a single strpos for the closing quote.
Replace the strspn() call at the top of the attribute scanning loop with direct byte comparisons for the two most common cases: a single space separator between attributes and the '>' tag closer. Only falls back to strspn() for uncommon whitespace characters (tabs, newlines, form feeds, slashes).
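A sketch of those common-case byte checks, under the assumption of a standalone helper (the real code is inlined in the scanning loop):

```php
<?php
// Test the two most common cases directly: the '>' tag closer and a
// single-space separator; fall back to strspn() only for runs of
// uncommon whitespace (tabs, newlines, form feeds, slashes).
function advance_past_separator( string $html, int $at ): int {
	$byte = $html[ $at ] ?? '';
	if ( '>' === $byte ) {
		return $at; // tag closer: nothing to skip
	}
	if ( ' ' === $byte ) {
		$next = $html[ $at + 1 ] ?? '';
		if ( 0 === strspn( $next, " \t\f\r\n/" ) ) {
			return $at + 1; // single-space separator fast path
		}
	}
	return $at + strspn( $html, " \t\f\r\n/", $at );
}
```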
Remove the STATE_COMPLETE check from base_class_next_token() since the $at >= $doc_length bounds check already handles this case. Also remove the text_node_classification property write from the tag fast path since this value is never read for tag tokens.
Set text_starts_at to null instead of 0 for tag tokens in the fast path, and remove the text_length = 0 write. get_modifiable_text() already returns '' when text_starts_at is null, making the text_length write redundant. Saves one property write per tag token.
Cache bytes_already_parsed in a local variable at function entry and only re-read the property when lexical updates have been applied, as those may adjust the position. Saves one property read per token in the common read-only tokenization case.
Eliminates the attribute_scan_from property entirely. The scan position is now computed as tag_name_starts_at + tag_name_length in ensure_attributes_parsed() on demand, avoiding ~646K property writes per benchmark iteration.
All callers of ensure_attributes_parsed() guard with STATE_MATCHED_TAG check, so the attributes_parsed flag is never read for text nodes. Removing this write saves ~378K property writes per benchmark iteration.
Text nodes don't need to null out tag_name_starts_at and tag_name_length. Instead, restructure get_tag() to check parser_state first, which correctly returns null for non-tag tokens without requiring the null sentinel. Saves ~756K property writes per benchmark iteration.
Closing tags never need special element processing, so return immediately from the fast path instead of goto after_tag_match. Avoids property reads for is_closing_tag and parsing_namespace checks for ~300K+ closing tags. Also avoid re-reading bytes_already_parsed for token_length computation.
Both fast path and full_parse path now return early for closing tags before reaching after_tag_match, which now only handles opening tags. Eliminates is_closing_tag property read from the shared after_tag_match section.
Check for '<' at current position before calling strpos(). When tokens alternate text→tag, the tag iteration starts at '<' and can skip strpos entirely. Since ~63% of tokens are tags, this eliminates ~646K strpos calls per benchmark iteration.
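The idea reduces to one byte comparison before the scan; a minimal sketch (the helper name is hypothetical):

```php
<?php
// When the cursor already sits on '<' (a tag immediately follows the
// previous token), the strpos() scan can be skipped entirely.
// Returns the index of the next '<', or false if none remains.
function find_next_tag_opener( string $html, int $at ) {
	if ( isset( $html[ $at ] ) && '<' === $html[ $at ] ) {
		return $at; // already at a tag: no scan needed
	}
	return strpos( $html, '<', $at );
}
```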
Instead of setting text_starts_at = null for every tag to prevent stale text in get_modifiable_text(), add a bounds check that detects stale text_starts_at (from previous text nodes) by comparing against token_starts_at. Saves ~646K property writes.
Current: ~330ms (52.8% faster than 699ms baseline)
Replace the attributes_parsed flag with a version check. Use attributes_parsed_at, an integer compared against token_starts_at, to detect stale attribute data, eliminating ~646K attributes_parsed=false writes per parse iteration. The version check in ensure_attributes_parsed() automatically invalidates when a new token is parsed.
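The version-check invalidation can be sketched as follows; the class and property names mirror the commit message but this is an illustrative reduction, not the real processor.

```php
<?php
// Instead of resetting a boolean flag for every token, compare an
// integer "parsed at" marker against the current token's start byte.
class AttributeCache {
	public int $token_starts_at      = 0;
	public int $attributes_parsed_at = -1;

	public function attributes_are_fresh(): bool {
		// Stale data from a previous token fails this comparison
		// automatically once a new token is parsed, so no per-token
		// reset write is needed.
		return $this->attributes_parsed_at === $this->token_starts_at;
	}

	public function mark_attributes_parsed(): void {
		$this->attributes_parsed_at = $this->token_starts_at;
	}
}
```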
Check tag name length against special element lengths (3,5,6,7,8) before goto after_tag_match. Common tags with lengths 1,2,4 (a, p, br, li, span, code, etc.) return immediately, avoiding goto dispatch + namespace read + property reads for the majority of opener tags.
Remove the dedicated parser_state read for STATE_INCOMPLETE_INPUT at the top of the hot loop. Instead, set bytes_already_parsed = doc_length when incomplete input is detected, so the bounds check ($at >= $doc_length) handles it. This eliminates one property read per token (~1M reads per pass).
Update research notes and ideas backlog. Document new wins (attributes_parsed_at, fast-path length filter, merged incomplete check), failed experiments, and detailed architecture notes from profiling data, and create an ideas backlog for the next session.
token_length = bytes_already_parsed - token_starts_at holds at every write site. Derive the value on demand at the ~6 read sites instead of writing it at ~20 write sites (~1M writes/pass eliminated).
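A sketch of deriving the value from the stated invariant; the class is a reduction for illustration, not the real processor.

```php
<?php
// token_length = bytes_already_parsed - token_starts_at holds at every
// write site, so the value can be computed on demand at the few read
// sites instead of being written each time the parser advances.
class TokenSpan {
	public int $token_starts_at      = 0;
	public int $bytes_already_parsed = 0;

	public function get_token_length(): int {
		return $this->bytes_already_parsed - $this->token_starts_at;
	}
}
```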
Eliminate is_closing_tag property writes (~646K per pass) by computing '/' === html[token_starts_at + 1] at the ~8 read sites. Uses local variable in parse_next_tag() for the slow path.
Replace inline range comparisons with ctype_alpha() which dispatches to a single C-level function call. Applies to all four sites: fast path tag check, fast path text validation, parse_next_tag() tag check, and parse_next_tag() text validation.
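The two forms of the check are shown side by side below; for single-byte strings they agree, and ctype_alpha() dispatches to one C-level call.

```php
<?php
// Inline range comparisons: two to four comparisons in PHP opcodes.
function is_tag_name_start_inline( string $byte ): bool {
	return ( 'a' <= $byte && 'z' >= $byte ) || ( 'A' <= $byte && 'Z' >= $byte );
}

// ctype_alpha(): a single internal function call. Note its documented
// quirks with integer arguments; here it is only ever given a
// single-character string.
function is_tag_name_start_ctype( string $byte ): bool {
	return ctype_alpha( $byte );
}
```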
Summary of Changes (Gemini Code Assist): This pull request delivers substantial performance enhancements to the HTML Tag Processor by implementing a series of micro-optimizations within the core tokenization logic. The changes focus on reducing overhead, streamlining attribute parsing, and optimizing common code paths, leading to a significant speedup in processing HTML. The comprehensive approach is well documented with detailed research and benchmarking results.
Code Review
This pull request introduces significant performance optimizations to the WP_HTML_Tag_Processor, primarily within the base_class_next_token method. The changes include inlining function calls, adding fast paths for common tokens, caching values like document length, and implementing lazy parsing for attributes. These optimizations are well-documented in the new autoresearch.md file, which provides excellent context for the changes. The code modifications are extensive but appear to be correct and consistent with the stated performance goals. I have one suggestion for improving the readability of the new benchmark script.
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_token() ) {
}
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_token() ) {
}
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_token() ) {
}
The benchmark script repeats the tokenization process three times. This repetition can be made more concise and maintainable by using a for loop. This avoids code duplication and makes it easier to change the number of iterations in the future.
for ( $i = 0; $i < 3; $i++ ) {
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_token() ) {
}
}