Skip to content

Html api ai optimizer#31

Draft
sirreal wants to merge 43 commits intotrunkfrom
html-api-ai-optimizer
Draft

Html api ai optimizer#31
sirreal wants to merge 43 commits intotrunkfrom
html-api-ai-optimizer

Conversation

@sirreal
Copy link
Owner

@sirreal sirreal commented Mar 13, 2026

No description provided.

sirreal added 30 commits March 13, 2026 14:50
parse_next_tag() is only called from base_class_next_token(), which
already calls after_tag() before invoking parse_next_tag(). The second
call is redundant — all state has already been cleaned up.

~8% improvement in tokenization throughput (1250ms → 1150ms).
parse_next_attribute() accesses $this->html 8+ times per call via
object property lookup. Local variable access is faster in PHP.

~4% improvement (1150ms → 1105ms).
Replace two skip_whitespace() method calls with inline strspn() using
the local $html variable. This avoids function call overhead and allows
PHP's JIT to optimize the entire parse_next_attribute() as one unit.

Massive improvement: 1105ms → 316ms (~71% faster).
Avoid repeated property lookups and strlen() calls by caching in local
variables. Eliminates 2 strlen() calls and 1 property access per token.
Guard class_name_updates_to_attributes_updates() and lexical_updates
processing behind empty() checks. In read-only mode (the common case),
these arrays are always empty, avoiding unnecessary function calls.

~7% improvement (312ms → 291ms).
Replace ~15 $this->bytes_already_parsed property accesses with a local
$at variable, writing back once at the end. Local variable access is
significantly faster than object property access in PHP's VM.

~2.3% improvement (291ms → 284ms).
parse_next_tag() is only called from base_class_next_token(), which
already calls after_tag() immediately before. The second call does
redundant work resetting state that was just reset.

852ms → 757ms (-11%)
For read-only tokenization, classname_updates and lexical_updates are
always empty. Guard the update processing with a count check to avoid
calling class_name_updates_to_attributes_updates() and iterating an
empty array.

757ms → 723ms (-4.5%)
Avoids repeated property lookups and strlen() calls in the main
tokenization loop.
Replace repeated $this->bytes_already_parsed and $this->html property
access with local variables. Also inlines skip_whitespace() calls to
avoid method call overhead and use the local $html variable.

728ms → 702ms (-3.6%)
For closing tags, attribute names are parsed but never stored, so the
substr() and strtolower() calls are wasted. Move them after the
closing tag early-return.

702ms → 696ms
…nd_find_closer()

Replaces the parse_next_attribute(false) while loop + strpos('>') with a
single method call that scans past all attributes and finds the tag-closing
'>' in one pass. Eliminates N function call overheads per tag during
read-only tokenization.
- Use local $html variable instead of $this->html for array access in parse_next_tag()
- Avoid property read for text_length by computing $at - $was_at directly
- Use truthiness check instead of count() > 0 for empty array detection in after_tag()
- Move attributes/duplicate_attributes reset from after_tag() to ensure_attributes_parsed()
  to avoid empty array allocation on every token during read-only tokenization
…e '>'

Tags without attributes (closing tags, simple void tags) have '>' immediately
after the tag name. A direct byte check before entering the attribute-scanning
loop avoids strspn/strcspn overhead for these common cases.
…in parse_next_tag()

Replace strspn() calls used for single-character validation with direct
character range comparisons. Also simplify tag name length calculation
to use a single strcspn() instead of strspn() + strcspn(), since the
first-character alpha check is now done via direct comparison.
…erty resets

Inline the single-callsite after_tag() method into base_class_next_token()
to eliminate method call overhead in the hot tokenization loop. Also remove
4 property resets (token_starts_at, token_length, is_closing_tag, comment_type)
that are always overwritten before use or guarded by parser_state checks.
Inline the most common token paths (text nodes and regular tags) directly
into base_class_next_token(), eliminating the parse_next_tag() function
call for ~95% of tokens. Complex tokens (comments, DOCTYPE, CDATA) still
use the full parse_next_tag() method.

Also integrates the skip_attributes_and_find_closer fast path for tags
with '>' immediately after the tag name, avoiding an additional method
call for closing tags and attribute-less tags.
…ss_next_token()

Instead of resetting all token properties at the top of every iteration,
only reset the properties relevant to each token type at its return point.
This eliminates ~3.4M unnecessary property writes per benchmark run.
Special elements (SCRIPT, STYLE, TEXTAREA, etc.) have name lengths of
3, 5, 6, 7, or 8 characters. Adding a quick length check before calling
get_tag() avoids substr() + strtoupper() allocations for the many common
tags (p, li, span, div, section, etc.) that pass the first-letter check
but can never be special elements.
Move the tag name length filter before the strspn first-letter check.
This avoids a function call for the many common tags with non-matching
lengths (a, p, li, div, span, etc.) that would pass the closing-tag
check but be eliminated by strspn.
Restructure skip_attributes_and_find_closer() to check for '=' and
quote characters directly after the attribute name, avoiding two
strspn() calls per attribute that almost always return 0. Well-formed
attributes like class="foo" now go through a fast path with only
byte-level comparisons + a single strpos for the closing quote.
Replace the strspn() call at the top of the attribute scanning loop
with direct byte comparisons for the two most common cases: a single
space separator between attributes and the '>' tag closer. Only falls
back to strspn() for uncommon whitespace characters (tabs, newlines,
form feeds, slashes).
Remove the STATE_COMPLETE check from base_class_next_token() since the
$at >= $doc_length bounds check already handles this case. Also remove
the text_node_classification property write from the tag fast path since
this value is never read for tag tokens.
Set text_starts_at to null instead of 0 for tag tokens in the fast path,
and remove the text_length = 0 write. get_modifiable_text() already
returns '' when text_starts_at is null, making the text_length write
redundant. Saves one property write per tag token.
Cache bytes_already_parsed in a local variable at function entry and
only re-read the property when lexical updates have been applied, as
those may adjust the position. Saves one property read per token in
the common read-only tokenization case.
Eliminates the attribute_scan_from property entirely. The scan
position is now computed as tag_name_starts_at + tag_name_length
in ensure_attributes_parsed() on demand, avoiding ~646K property
writes per benchmark iteration.
All callers of ensure_attributes_parsed() guard with
STATE_MATCHED_TAG check, so the attributes_parsed flag is never
read for text nodes. Removing this write saves ~378K property
writes per benchmark iteration.
sirreal added 13 commits March 13, 2026 17:10
Text nodes don't need to null out tag_name_starts_at and
tag_name_length. Instead, restructure get_tag() to check
parser_state first, which correctly returns null for non-tag
tokens without requiring the null sentinel. Saves ~756K property
writes per benchmark iteration.
Closing tags never need special element processing, so return
immediately from the fast path instead of goto after_tag_match.
Avoids property reads for is_closing_tag and parsing_namespace
checks for ~300K+ closing tags. Also avoid re-reading
bytes_already_parsed for token_length computation.
Both fast path and full_parse path now return early for closing
tags before reaching after_tag_match, which now only handles
opening tags. Eliminates is_closing_tag property read from the
shared after_tag_match section.
Check for '<' at current position before calling strpos(). When
tokens alternate text→tag, the tag iteration starts at '<' and
can skip strpos entirely. Since ~63% of tokens are tags, this
eliminates ~646K strpos calls per benchmark iteration.
Instead of setting text_starts_at = null for every tag to prevent
stale text in get_modifiable_text(), add a bounds check that
detects stale text_starts_at (from previous text nodes) by
comparing against token_starts_at. Saves ~646K property writes.
Current: ~330ms (52.8% faster than 699ms baseline)
…check

Use attributes_parsed_at integer compared against token_starts_at to detect
stale attribute data, eliminating ~646K attributes_parsed=false writes per
parse iteration. The version check in ensure_attributes_parsed() automatically
invalidates when a new token is parsed.
Check tag name length against special element lengths (3,5,6,7,8) before
goto after_tag_match. Common tags with lengths 1,2,4 (a, p, br, li, span,
code, etc.) return immediately, avoiding goto dispatch + namespace read +
property reads for the majority of opener tags.
Remove the dedicated parser_state read for STATE_INCOMPLETE_INPUT at the
top of the hot loop. Instead, set bytes_already_parsed = doc_length when
incomplete input is detected, so the bounds check ($at >= $doc_length)
handles it. This eliminates one property read per token (~1M reads per pass).
… backlog

Document new wins (attributes_parsed_at, fast-path length filter, merged
incomplete check), failed experiments, detailed architecture notes from
profiling data, and create ideas backlog for next session.
token_length = bytes_already_parsed - token_starts_at holds at every
write site. Derive the value on demand at the ~6 read sites instead of
writing it at ~20 write sites (~1M writes/pass eliminated).
Eliminate is_closing_tag property writes (~646K per pass) by computing
'/' === html[token_starts_at + 1] at the ~8 read sites. Uses local
variable in parse_next_tag() for the slow path.
Replace inline range comparisons with ctype_alpha() which dispatches
to a single C-level function call. Applies to all four sites: fast path
tag check, fast path text validation, parse_next_tag() tag check, and
parse_next_tag() text validation.
@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers substantial performance enhancements to the HTML Tag Processor by implementing a series of micro-optimizations within the core tokenization logic. The changes focus on reducing overhead, streamlining attribute parsing, and optimizing common code paths, leading to a significant speedup in processing HTML. The comprehensive approach is well-documented with detailed research and benchmarking results.

Highlights

  • Performance Optimization: Significantly optimized the WP_HTML_Tag_Processor::next_token() method, resulting in a 54.8% speed improvement in read-only tokenization benchmarks.
  • Inlined Hot Paths: Inlined the after_tag() logic and introduced fast paths for common token types (text nodes and regular tags) directly into base_class_next_token() to reduce method call overhead.
  • Lazy Attribute Parsing: Implemented lazy parsing for attributes using an attributes_parsed_at property and an ensure_attributes_parsed() method, avoiding unnecessary object creation and processing.
  • Efficient Attribute Scanning: Replaced the per-attribute function call loop with a new skip_attributes_and_find_closer() method for faster read-only attribute scanning.
  • Reduced Property Writes: Eliminated the token_length property and derived its value on demand, and removed the is_closing_tag property by using direct character checks, reducing property writes per token.
  • Benchmarking and Documentation: Added detailed documentation (autoresearch.md) outlining the optimization objective, metrics, successful strategies, dead ends, and architectural notes, along with a new benchmark script (bench.php).

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • autoresearch.ideas.md
    • Added a backlog of ideas for future HTML API optimization efforts.
  • autoresearch.md
    • Added a detailed document outlining the objective, metrics, methodology, successful optimizations, dead ends, and architectural notes for HTML Tag Processor performance.
  • bench.php
    • Added a new benchmark script to measure the performance of WP_HTML_Tag_Processor::next_token().
  • bootstrap-html-api.php
    • Added a new bootstrap file to include necessary HTML API classes and define missing WordPress functions for benchmarking.
  • src/wp-includes/html-api/class-wp-html-tag-processor.php
    • Added doc_length property to cache HTML string length and updated the constructor to initialize it.
    • Added attributes_parsed_at property to track attribute parsing state for lazy loading.
    • Refactored base_class_next_token() to inline after_tag() logic, introduce fast paths for text and tag tokens, and use skip_attributes_and_find_closer() for attribute scanning.
    • Modified special element length filtering in base_class_next_token() for efficiency.
    • Removed token_length property and calculated its value on demand.
    • Updated set_bookmark() to calculate token length dynamically.
    • Replaced strlen($this->html) with $this->doc_length in skip_rcdata, skip_script_data, and parse_next_tag for performance.
    • Removed is_closing_tag property and replaced its usage with direct character checks on html[token_starts_at + 1].
    • Refactored parse_next_tag() to remove redundant after_tag() call and simplify token length calculations.
    • Modified parse_next_attribute() to accept a $store parameter and use a local $at variable for cursor management.
    • Introduced skip_attributes_and_find_closer() method for efficient attribute scanning in read-only paths.
    • Added ensure_attributes_parsed() method for lazy attribute parsing.
    • Updated after_tag() to reset attributes_parsed_at and remove is_closing_tag and token_length resets.
    • Added ensure_attributes_parsed() call in class_name_updates_to_attributes_updates().
    • Updated apply_attributes_updates() to update doc_length after HTML modification.
    • Updated seek() to calculate token length dynamically.
    • Added ensure_attributes_parsed() calls in get_attribute(), get_attribute_names_with_prefix(), set_attribute(), and remove_attribute().
    • Restructured get_tag() logic for state-based dispatch and removed redundant tag_name variable.
    • Updated has_self_closing_flag() to use bytes_already_parsed for length calculation.
    • Updated is_tag_closer() to use direct character check instead of is_closing_tag property.
    • Removed token_length assignments in subdivide_text_appropriately().
    • Modified get_modifiable_text() to include a bounds check for text_starts_at.
    • Reset attributes_parsed_at in get_updated_html().
Activity
  • The author, sirreal, created this pull request to optimize the HTML API.
  • New files autoresearch.ideas.md, autoresearch.md, bench.php, and bootstrap-html-api.php were added to document optimization efforts and provide benchmarking tools.
  • Extensive modifications were made to src/wp-includes/html-api/class-wp-html-tag-processor.php, implementing numerous performance improvements.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant performance optimizations to the WP_HTML_Tag_Processor, primarily within the base_class_next_token method. The changes include inlining function calls, adding fast paths for common tokens, caching values like document length, and implementing lazy parsing for attributes. These optimizations are well-documented in the new autoresearch.md file, which provides excellent context for the changes. The code modifications are extensive but appear to be correct and consistent with the stated performance goals. I have one suggestion for improving the readability of the new benchmark script.

Comment on lines +6 to +14
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_token() ) {
}
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_token() ) {
}
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_token() ) {
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The benchmark script repeats the tokenization process three times. This repetition can be made more concise and maintainable by using a for loop. This avoids code duplication and makes it easier to change the number of iterations in the future.

for ( $i = 0; $i < 3; $i++ ) {
	$p = new WP_HTML_Tag_Processor( $html );
	while ( $p->next_token() ) {
	}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant