
Commit 0959c9c

add callbacks emitter and update readme (#91)
1 parent 884e961 commit 0959c9c

19 files changed, +1428 -948 lines changed

.github/dependabot.yml

Lines changed: 6 additions & 0 deletions
@@ -10,6 +10,12 @@ updates:
     schedule:
       interval: daily
     open-pull-requests-limit: 10
+
+  - package-ecosystem: cargo
+    directory: "/fuzz"
+    schedule:
+      interval: daily
+    open-pull-requests-limit: 10
 
   - package-ecosystem: gitsubmodule
     directory: "/"

Cargo.toml

Lines changed: 6 additions & 0 deletions
@@ -59,6 +59,12 @@ harness = false
 name = "build_tree"
 required-features = ["tree-builder"]
 
+[[example]]
+name = "custom_emitter"
+
+[[example]]
+name = "callback_emitter"
+
 [[example]]
 name = "scraper"
 required-features = ["tree-builder"]

README.md

Lines changed: 12 additions & 20 deletions
@@ -30,6 +30,13 @@ for token in Tokenizer::new(html).infallible() {
 assert_eq!(new_html, "<title>hello world</title>");
 ```
 
+`html5gum` provides multiple kinds of APIs:
+
+* Iterating over tokens as shown above.
+* Implementing your own `Emitter` for maximum performance, see [the `custom_emitter.rs` example](examples/custom_emitter.rs).
+* A callbacks-based API for a middle ground between convenience and performance, see [the `callback_emitter.rs` example](examples/callback_emitter.rs).
+* With the `tree-builder` feature, html5gum can be integrated with `html5ever` and `scraper`. See [the `scraper.rs` example](examples/scraper.rs).
+
 ## What a tokenizer does and what it does not do
 
 `html5gum` fully implements [13.2.5 of the WHATWG HTML
@@ -42,9 +49,6 @@ test suite](https://github.com/html5lib/html5lib-tests/tree/master/tokenizer). S
   gracefully from invalid UTF-8.
 * `html5gum` **does not** [correct mis-nested
   tags.](https://html.spec.whatwg.org/#an-introduction-to-error-handling-and-strange-cases-in-the-parser)
-* `html5gum` **does not** recognize implicitly self-closing elements like
-  `<img>`, as a tokenizer it will simply emit a start token. It does however
-  emit a self-closing tag for `<img .. />`.
 * `html5gum` doesn't implement the DOM, and unfortunately in the HTML spec,
   constructing the DOM ("tree construction") influences how tokenization is
   done. For an example of which problems this causes see [this example
@@ -54,23 +58,9 @@ test suite](https://github.com/html5lib/html5lib-tests/tree/master/tokenizer). S
   21](https://github.com/untitaker/html5gum/issues/21).
 
 With those caveats in mind, `html5gum` can pretty much ~parse~ _tokenize_
-anything that browsers can.
-
-## The `Emitter` trait
-
-A distinguishing feature of `html5gum` is that you can bring your own token
-datastructure and hook into token creation by implementing the `Emitter` trait.
-This allows you to:
-
-* Rewrite all per-HTML-tag allocations to use a custom allocator or datastructure.
-
-* Efficiently filter out uninteresting categories data without ever allocating
-  for it. For example if any plaintext between tokens is not of interest to
-  you, you can implement the respective trait methods as noop and therefore
-  avoid any overhead creating plaintext tokens.
-
-See [the `custom_emitter` example][examples/custom_emitter.rs] for how this
-looks like in practice.
+anything that browsers can. However, using the experimental `tree-builder`
+feature, html5gum can be integrated with `html5ever` and `scraper`. See [the
+`scraper.rs` example](examples/scraper.rs).
 
 ## Other features
 
@@ -116,3 +106,5 @@ Licensed under the MIT license, see [`./LICENSE`][LICENSE].
 [LICENSE]: ./LICENSE
 [examples/tokenize_with_state_switches.rs]: ./examples/tokenize_with_state_switches.rs
 [examples/custom_emitter.rs]: ./examples/custom_emitter.rs
+[examples/callback_emitter.rs]: ./examples/callback_emitter.rs
+[examples/scraper.rs]: ./examples/scraper.rs
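
The README now describes the callbacks-based API as sitting between token iteration and a hand-written `Emitter`. As a rough illustration of that shape, here is a standalone sketch that does not depend on `html5gum` at all. `Event`, `run_callbacks` and `extract_links` are hypothetical names invented for this sketch; the crate's real `CallbackEvent` works on byte slices and is not reproduced here.

```rust
/// Hypothetical event type for this sketch only (not html5gum's `CallbackEvent`).
#[derive(Debug, Clone, Copy)]
enum Event<'a> {
    OpenStartTag { name: &'a str },
    AttributeName { name: &'a str },
    AttributeValue { value: &'a str },
}

/// Feeds every event to `callback` in order and collects each `Some` it returns.
fn run_callbacks<'a, T>(
    events: &[Event<'a>],
    mut callback: impl FnMut(Event<'a>) -> Option<T>,
) -> Vec<T> {
    events.iter().copied().filter_map(|e| callback(e)).collect()
}

/// Link extraction where all parser state lives in variables captured by the closure.
fn extract_links(events: &[Event<'_>]) -> Vec<String> {
    let mut is_anchor_tag = false;
    let mut is_href_attr = false;

    run_callbacks(events, move |event| match event {
        Event::OpenStartTag { name } => {
            // A new tag starts: remember whether it is `<a>` and reset attribute state.
            is_anchor_tag = name == "a";
            is_href_attr = false;
            None
        }
        Event::AttributeName { name } => {
            is_href_attr = name == "href";
            None
        }
        Event::AttributeValue { value } if is_anchor_tag && is_href_attr => {
            Some(value.to_owned())
        }
        _ => None,
    })
}

fn main() {
    let events = [
        Event::OpenStartTag { name: "h1" },
        Event::OpenStartTag { name: "a" },
        Event::AttributeName { name: "href" },
        Event::AttributeValue { value: "foo" },
    ];
    for link in extract_links(&events) {
        println!("link: {}", link);
    }
}
```

The point of the pattern is that the caller writes one closure and never defines a token type; the cost is that every event goes through a dynamic `match`, which is presumably where the "less performant" trade-off mentioned in the commit comes from.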

examples/build_tree.rs

Lines changed: 1 addition & 3 deletions
@@ -2,16 +2,14 @@
 /// building logic and DOM implementation. The result is a technically complete HTML5 parser.
 ///
 /// You may want to refer to `examples/scraper.rs` for better ergonomics.
-use std::iter::repeat;
-
 use html5ever::tree_builder::TreeBuilder;
 use html5gum::{Html5everEmitter, IoReader, Tokenizer};
 use markup5ever_rcdom::{Handle, NodeData, RcDom};
 
 fn walk(indent: usize, handle: &Handle) {
     let node = handle;
     // FIXME: don't allocate
-    print!("{}", repeat(" ").take(indent).collect::<String>());
+    print!("{}", " ".repeat(indent));
     match node.data {
         NodeData::Document => println!("#Document"),
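
The change in `walk` swaps one way of building the indentation string for another. A minimal sketch, using only the standard library, that checks both forms produce the same string and shows one possible way to address the remaining `FIXME` with a runtime format width. The helper names `indent_old`, `indent_new` and `indented_line` are made up for this illustration and are not part of the commit.

```rust
use std::iter::repeat;

/// The form removed by the commit.
fn indent_old(indent: usize) -> String {
    repeat(" ").take(indent).collect::<String>()
}

/// The form added by the commit.
fn indent_new(indent: usize) -> String {
    " ".repeat(indent)
}

/// Pads an empty string to `indent` columns via a runtime width.
/// Used directly inside `print!`, this pattern needs no temporary `String`
/// for the padding; it returns a `String` here only so the result can be checked.
fn indented_line(indent: usize, text: &str) -> String {
    format!("{:width$}{}", "", text, width = indent)
}

fn main() {
    for indent in [0, 2, 4] {
        // Both spellings build the same indentation string.
        assert_eq!(indent_old(indent), indent_new(indent));
        println!("{}", indented_line(indent, "#Document"));
    }
}
```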

examples/callback_emitter.rs

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+//! A slightly simpler, but less performant version of the link extractor that can be found in
+//! `examples/custom_emitter.rs`.
+//!
+//! ```text
+//! printf '<h1>Hello world!</h1><a href="foo">bar</a>' | cargo run --example=callback_emitter
+//! ```
+//!
+//! Output:
+//!
+//! ```text
+//! link: foo
+//! ```
+use html5gum::callbacks::{CallbackEmitter, CallbackEvent};
+use html5gum::{Emitter, IoReader, Tokenizer};
+
+fn get_emitter() -> impl Emitter<Token = String> {
+    let mut is_anchor_tag = false;
+    let mut is_href_attr = false;
+
+    CallbackEmitter::new(move |event: CallbackEvent<'_>| match event {
+        CallbackEvent::OpenStartTag { name } => {
+            is_anchor_tag = name == b"a";
+            is_href_attr = false;
+            None
+        }
+        CallbackEvent::AttributeName { name } => {
+            is_href_attr = name == b"href";
+            None
+        }
+        CallbackEvent::AttributeValue { value } if is_anchor_tag && is_href_attr => {
+            Some(String::from_utf8_lossy(value).into_owned())
+        }
+        _ => None,
+    })
+}
+
+fn main() {
+    for token in
+        Tokenizer::new_with_emitter(IoReader::new(std::io::stdin().lock()), get_emitter()).flatten()
+    {
+        println!("link: {}", token);
+    }
+}
+
+#[test]
+fn basic() {
+    let tokens: Vec<_> =
+        Tokenizer::new_with_emitter("<h1>Hello world</h1><a href=foo>bar</a>", get_emitter())
+            .flatten()
+            .collect();
+
+    assert_eq!(tokens, vec!["foo".to_owned()]);
+}
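
Two standard-library behaviours in this example are easy to overlook: `.flatten()` on an iterator of fallible items, and `String::from_utf8_lossy` on the attribute value. The sketch below demonstrates both in isolation. It assumes the tokenizer yields `Result` items, which the `.flatten()` calls suggest but which this diff does not show; `keep_ok` and `lossy` are hypothetical helper names.

```rust
/// `Result` implements `IntoIterator`, so `flatten` yields each `Ok` value
/// and silently skips every `Err`.
fn keep_ok(results: Vec<Result<String, String>>) -> Vec<String> {
    results.into_iter().flatten().collect()
}

/// Lossy conversion: invalid UTF-8 sequences become U+FFFD instead of failing.
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let items = vec![
        Ok("foo".to_owned()),
        Err("read error".to_owned()),
        Ok("bar".to_owned()),
    ];
    // The error in the middle is dropped without any report.
    println!("{:?}", keep_ok(items));

    // The stray 0xFF byte is replaced rather than rejected.
    println!("{}", lossy(b"fo\xFFo"));
}
```

If the same holds for the tokenizer, a read error from stdin in `main` would be discarded rather than reported, which is acceptable for an example but worth handling explicitly in real code.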

examples/scraper.rs

Lines changed: 3 additions & 1 deletion
@@ -6,7 +6,9 @@
 /// echo '<h1><span class=hello>Hello</span></h1>' | cargo run --all-features --example scraper
 /// ```
 ///
-/// Essentially, your HTML parsing will be powered by a combination of html5gum and html5ever.
+/// Essentially, your HTML parsing will be powered by a combination of html5gum and html5ever. This
+/// has no immediate benefit over using scraper normally and is mostly done as a transitionary step
+/// until html5gum has its own implementation of tree building and the DOM.
 ///
 /// Requires the tree-builder feature.
 use std::io::{stdin, Read};
