@nielsdos nielsdos commented Oct 11, 2024

This patch adds a fast path to HTML serialization for the case where the
output encoding is UTF-8. Because the DOM internally represents all strings
as UTF-8, no transcoding is needed in that case; we only need to validate the data.
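
To illustrate the idea, here is a minimal, self-contained sketch of a validate-only pass. It is not the code from this patch, which works with the DOM extension's own encoding routines. The names `utf8_seq_len` and `serialize_utf8_fast_path` are hypothetical, and replacing each invalid byte with one U+FFFD is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Is b a UTF-8 continuation byte (10xxxxxx)? */
static int is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Length (1..4) of the well-formed UTF-8 sequence at p, or 0 if the bytes
 * at p are not a well-formed sequence. Requires avail >= 1. */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned char c = p[0];
    if (c < 0x80) {
        return 1;
    }
    if (c >= 0xC2 && c <= 0xDF) {
        return (avail >= 2 && is_cont(p[1])) ? 2 : 0;
    }
    if (c >= 0xE0 && c <= 0xEF) {
        if (avail < 3 || !is_cont(p[1]) || !is_cont(p[2])) return 0;
        if (c == 0xE0 && p[1] < 0xA0) return 0; /* overlong encoding */
        if (c == 0xED && p[1] > 0x9F) return 0; /* UTF-16 surrogate range */
        return 3;
    }
    if (c >= 0xF0 && c <= 0xF4) {
        if (avail < 4 || !is_cont(p[1]) || !is_cont(p[2]) || !is_cont(p[3])) return 0;
        if (c == 0xF0 && p[1] < 0x90) return 0; /* overlong encoding */
        if (c == 0xF4 && p[1] > 0x8F) return 0; /* above U+10FFFF */
        return 4;
    }
    return 0; /* stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF */
}

/* Copies in[0..len) to out. Valid runs are passed through untouched (no
 * decode/re-encode round trip); each invalid byte becomes U+FFFD.
 * out must have room for 3 * len bytes. Returns the bytes written. */
static size_t serialize_utf8_fast_path(const unsigned char *in, size_t len,
                                       unsigned char *out)
{
    static const unsigned char replacement[3] = {0xEF, 0xBF, 0xBD};
    assert(in != NULL && out != NULL);

    const unsigned char *p = in;
    const unsigned char *end = in + len;
    const unsigned char *last_output = in; /* start of the pending valid run */
    size_t written = 0;

    while (p < end) {
        size_t n = utf8_seq_len(p, (size_t)(end - p));
        if (n != 0) {
            p += n; /* still valid: just extend the pending run */
            continue;
        }
        /* Flush the valid run that ends right before the invalid byte. */
        memcpy(out + written, last_output, (size_t)(p - last_output));
        written += (size_t)(p - last_output);
        memcpy(out + written, replacement, sizeof(replacement));
        written += sizeof(replacement);
        p++; /* skip the invalid byte */
        last_output = p;
    }
    /* Flush the trailing valid run. */
    memcpy(out + written, last_output, (size_t)(end - last_output));
    written += (size_t)(end - last_output);
    return written;
}
```

For input that is already valid UTF-8, which is the common case here, the loop only scans the bytes and the whole buffer is emitted with a single copy at the end.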

Tested on Wikipedia English home page on an i7-4790, serializing the page 1000 times:

Benchmark 1: ./sapi/cli/php x.php
  Time (mean ± σ):     516.0 ms ±   6.4 ms    [User: 511.2 ms, System: 3.5 ms]
  Range (min … max):   506.0 ms … 527.1 ms    10 runs

Benchmark 2: ./sapi/cli/php_old x.php
  Time (mean ± σ):     682.8 ms ±   6.5 ms    [User: 676.8 ms, System: 3.8 ms]
  Range (min … max):   675.8 ms … 695.6 ms    10 runs

Summary
  ./sapi/cli/php x.php ran
    1.32 ± 0.02 times faster than ./sapi/cli/php_old x.php

(And if you're interested: it takes over a second on my machine using the old DOMDocument class)

Future optimizations are certainly possible, but let's start here.

@nielsdos nielsdos changed the title from "Dom optimized serialize html" to "Optimize DOM HTML serialization for UTF-8" Oct 11, 2024
@nielsdos nielsdos requested a review from Girgias October 20, 2024 20:05
@Girgias Girgias left a comment

One small question, but looks fine to me

Comment on lines -548 to +552
- size_t skip = buf_ref - buf_ref_backup; /* Skip invalid data, it's replaced by the UTF-8 replacement bytes */
  if (!dom_process_parse_chunk(
      ctx,
      document,
      parser,
-     buf_ref - last_output - skip,
+     buf_ref_backup - last_output,

Is this unrelated to the perf optimisation commit?

Member Author

The fast path has the same structure as this code, and I noticed the skip variable was useless. So yes, it's more of a cleanup.
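
A small sketch of why the variable is redundant: since skip is defined as buf_ref - buf_ref_backup, subtracting it from buf_ref - last_output leaves buf_ref_backup - last_output. The helper names below are hypothetical, and the pointer ordering is an assumption read off the commented lines rather than taken from the full source.

```c
#include <assert.h>
#include <stddef.h>

/* Old form: measure up to buf_ref, then subtract the skipped invalid bytes.
 * Assumes last_output <= buf_ref_backup <= buf_ref within one buffer. */
static size_t chunk_len_with_skip(const char *last_output,
                                  const char *buf_ref_backup,
                                  const char *buf_ref)
{
    assert(last_output <= buf_ref_backup && buf_ref_backup <= buf_ref);
    size_t skip = (size_t)(buf_ref - buf_ref_backup);
    return (size_t)(buf_ref - last_output) - skip;
}

/* New form: measure directly up to where the invalid data starts. */
static size_t chunk_len_direct(const char *last_output,
                               const char *buf_ref_backup)
{
    assert(last_output <= buf_ref_backup);
    return (size_t)(buf_ref_backup - last_output);
}
```

Both helpers return the same length for any pointers that satisfy the assumed ordering, so the change does not alter behavior.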

@nielsdos nielsdos merged commit 935fef2 into php:master Oct 22, 2024
9 of 10 checks passed