Commit b8a65d2

Add Rust HTML rewriter benchmark
Adds a benchmark testing HTML parsing and rewriting using lol_html, a streaming HTML rewriter developed by Cloudflare. The benchmark processes ~114KB of realistic HTML, performing multiple transformations: adding CSS classes, modifying links, injecting scripts, and adding security attributes. lol_html is used in production by Cloudflare Workers for edge HTML processing and represents real-world HTML transformation workloads.
1 parent d28bbb9 commit b8a65d2

11 files changed, +2918 -0 lines changed
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
benchmark.wasm
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
../Dockerfile.rust
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# Rust HTML Rewriter Benchmark

This benchmark tests HTML parsing and rewriting performance using `lol_html`, a streaming HTML rewriter/parser developed by Cloudflare and used in Cloudflare Workers.

## What it tests

The benchmark performs multiple HTML transformation operations on a realistic HTML document:

1. **Add CSS classes** - Adds a "rewritten" class to all `<div>` elements
2. **Modify links** - Adds tracking parameters to all `<a href>` attributes
3. **Security attributes** - Adds `rel="noopener noreferrer"` to external links
4. **Script injection** - Injects an analytics script before `</head>`
5. **Image processing** - Adds `data-processed="true"` to all `<img>` elements

These operations represent common use cases for HTML rewriting:
- Adding analytics/tracking
- Injecting scripts and content
- Modifying attributes for security or functionality
- Processing images for lazy loading or CDN rewriting
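
The rewriter's Rust handlers are not shown in this commit. As a rough illustration of what transformations 2 and 3 involve, the Python sketch below rewrites the attributes of a single link. It is not `lol_html` code. The tracking parameter (`utm_source=benchmark`), the rule that a link is external when it has an `http`/`https` scheme and a host, and the decision to leave `mailto:` and `tel:` links untouched are all assumptions made for the example; the benchmark itself may behave differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameter; the benchmark's real value is not shown in this commit.
TRACKING = ("utm_source", "benchmark")


def is_external(href: str) -> bool:
    """Assumption: a link is external when it has an http(s) scheme and a host."""
    parts = urlsplit(href)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def add_tracking(href: str) -> str:
    """Append the tracking parameter, preserving any existing query string."""
    parts = urlsplit(href)
    if parts.scheme in ("mailto", "tel"):
        return href  # assumption: non-navigational links are left untouched
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(TRACKING)
    return urlunsplit(parts._replace(query=urlencode(query)))


def rewrite_link(attrs: dict) -> dict:
    """Return a copy of an <a> element's attributes with both rewrites applied."""
    out = dict(attrs)
    href = out.get("href")
    if href is None:
        return out
    if is_external(href):
        out["rel"] = "noopener noreferrer"
    out["href"] = add_tracking(href)
    return out
```

Applied to the generated input, a share link such as `https://twitter.com/share?url=post-3` would gain both the `rel` attribute and the extra query parameter, while an internal link such as `/posts/3` would only gain the query parameter.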

## Input Data

The `default.input` file (~114 KB) contains a realistic HTML document with:
- Navigation menus with internal and external links
- 50 article sections with images, text, and metadata
- Sidebar widgets
- Footer with contact information and links
- Multiple link types (internal, external, mailto, tel)
- Social sharing buttons

## Implementation

Uses:
- `lol_html` 2.1 - Streaming HTML rewriter
- Selector-based element matching (CSS selectors)
- Low-memory streaming parser (doesn't load entire DOM)
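
To make the streaming idea concrete, here is a minimal sketch using Python's standard-library `html.parser` rather than `lol_html`. It only counts start tags instead of rewriting them, and the names `TagCounter`, `count_tags_streaming`, and `chunked` are made up for this example. The point it illustrates is that the parser can be fed the document in small pieces and never holds a tree of the whole page.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags as they are parsed; no document tree is retained."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def chunked(text, size):
    """Yield successive slices of `text`, each at most `size` characters long."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def count_tags_streaming(chunks):
    """Feed the parser one chunk at a time, as a network stream would arrive."""
    parser = TagCounter()
    for chunk in chunks:
        parser.feed(chunk)  # incomplete tags are buffered until the next chunk
    parser.close()
    return parser.counts
```

Feeding the same markup in 7-character chunks or as one string yields the same counts, which is the property a streaming rewriter relies on when chunk boundaries fall in the middle of a tag.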

## Performance Notes

lol_html is designed for:
- Streaming HTML transformation (processes as it parses)
- Low memory overhead (no DOM tree)
- Production use in edge computing (Cloudflare Workers)

The benchmark exercises a realistic workload similar to edge-function HTML transformations.
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
[rust-html-rewriter] input size: 113919 bytes
[rust-html-rewriter] output size: 134151 bytes
[rust-html-rewriter] size change: +20232 bytes
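
The three expected lines are internally consistent: 134151 - 113919 = 20232, so the rewritten document is roughly 17.8% larger than the input. The short Python sketch below reproduces the lines from the two sizes. The function name `size_report` and the format strings are inferred from the expected output above, not taken from the benchmark's Rust source.

```python
def size_report(name: str, input_size: int, output_size: int) -> list[str]:
    """Build the three report lines; the delta is printed with an explicit sign."""
    delta = output_size - input_size
    return [
        f"[{name}] input size: {input_size} bytes",
        f"[{name}] output size: {output_size} bytes",
        f"[{name}] size change: {delta:+d} bytes",
    ]


if __name__ == "__main__":
    for line in size_report("rust-html-rewriter", 113919, 134151):
        print(line)
```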

benchmarks/rust-html-rewriter/benchmark.stdout.expected

Whitespace-only changes.

benchmarks/rust-html-rewriter/default.input

Lines changed: 2243 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
#!/usr/bin/env python3
"""Generate a realistic HTML file for benchmarking HTML rewriting."""


def generate_html():
    """Generate a realistic HTML document with various elements."""

    html = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Benchmark HTML Document</title>
    <link rel="stylesheet" href="/styles/main.css">
    <link rel="stylesheet" href="/styles/theme.css">
</head>
<body>
    <header>
        <nav class="navbar">
            <div class="container">
                <a href="/" class="logo">Benchmark Site</a>
                <ul class="nav-menu">
                    <li><a href="/home">Home</a></li>
                    <li><a href="/about">About</a></li>
                    <li><a href="/products">Products</a></li>
                    <li><a href="https://external.com">External Link</a></li>
                    <li><a href="/contact">Contact</a></li>
                </ul>
            </div>
        </nav>
    </header>

    <main class="content">
"""

    # Generate multiple article sections
    for i in range(50):
        html += f"""
        <article class="post" id="post-{i}">
            <div class="post-header">
                <h2><a href="/posts/{i}">Article Title {i}</a></h2>
                <div class="meta">
                    <span class="author">Author {i % 10}</span>
                    <span class="date">2024-12-{(i % 28) + 1:02d}</span>
                </div>
            </div>

            <div class="post-content">
                <img src="/images/article-{i}.jpg" alt="Article {i} image" class="featured-image">
                <p>This is the introduction paragraph for article {i}. It contains some introductory text that describes what the article is about.</p>

                <div class="post-body">
                    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
                    <p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>

                    <div class="callout">
                        <p>This is an important callout box with additional information.</p>
                    </div>

                    <ul class="features">
                        <li><a href="/feature/{i}-1">Feature 1</a></li>
                        <li><a href="/feature/{i}-2">Feature 2</a></li>
                        <li><a href="/feature/{i}-3">Feature 3</a></li>
                    </ul>
                </div>

                <div class="post-footer">
                    <div class="tags">
                        <a href="/tag/technology" class="tag">Technology</a>
                        <a href="/tag/programming" class="tag">Programming</a>
                        <a href="/tag/web" class="tag">Web</a>
                    </div>
                    <div class="social-share">
                        <a href="https://twitter.com/share?url=post-{i}" class="share-btn twitter">Share</a>
                        <a href="https://facebook.com/sharer?url=post-{i}" class="share-btn facebook">Share</a>
                        <a href="https://linkedin.com/share?url=post-{i}" class="share-btn linkedin">Share</a>
                    </div>
                </div>
            </div>
        </article>
"""

    # Add footer
    html += """
    </main>

    <aside class="sidebar">
        <div class="widget">
            <h3>Popular Posts</h3>
            <ul>
                <li><a href="/popular/1">Popular Post 1</a></li>
                <li><a href="/popular/2">Popular Post 2</a></li>
                <li><a href="/popular/3">Popular Post 3</a></li>
                <li><a href="/popular/4">Popular Post 4</a></li>
                <li><a href="/popular/5">Popular Post 5</a></li>
            </ul>
        </div>

        <div class="widget">
            <h3>Categories</h3>
            <ul>
                <li><a href="/category/tech">Technology</a></li>
                <li><a href="/category/science">Science</a></li>
                <li><a href="/category/business">Business</a></li>
                <li><a href="/category/lifestyle">Lifestyle</a></li>
            </ul>
        </div>

        <div class="widget ad-widget">
            <img src="/ads/sidebar-ad.jpg" alt="Advertisement">
            <a href="https://advertiser.com/product">Learn More</a>
        </div>
    </aside>

    <footer>
        <div class="container">
            <div class="footer-content">
                <div class="footer-section">
                    <h4>About Us</h4>
                    <p>We are a benchmark site for testing HTML rewriting performance.</p>
                    <a href="/about">Read more</a>
                </div>

                <div class="footer-section">
                    <h4>Quick Links</h4>
                    <ul>
                        <li><a href="/privacy">Privacy Policy</a></li>
                        <li><a href="/terms">Terms of Service</a></li>
                        <li><a href="/sitemap">Sitemap</a></li>
                        <li><a href="https://github.com/example">GitHub</a></li>
                    </ul>
                </div>

                <div class="footer-section">
                    <h4>Contact</h4>
                    <p>Email: <a href="mailto:[email protected]">[email protected]</a></p>
                    <p>Phone: <a href="tel:+1234567890">+1 (234) 567-890</a></p>
                </div>
            </div>

            <div class="footer-bottom">
                <p>&copy; 2024 Benchmark Site. All rights reserved.</p>
            </div>
        </div>
    </footer>

    <script src="/js/main.js"></script>
    <script src="/js/analytics.js"></script>
</body>
</html>
"""

    return html


def main():
    html = generate_html()

    with open("default.input", "w") as f:
        f.write(html)

    print(f"Generated HTML file: {len(html)} bytes ({len(html) / 1024:.1f} KB)")


if __name__ == "__main__":
    main()
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
target
