# html5tokenizer

[![docs.rs](https://img.shields.io/docsrs/html5tokenizer)](https://docs.rs/html5tokenizer)
[![crates.io](https://img.shields.io/crates/l/html5tokenizer.svg)](https://crates.io/crates/html5tokenizer)

`html5tokenizer` is a WHATWG-compliant HTML tokenizer
(forked from [html5gum] with added code span support).

```rust
use std::fmt::Write;
use html5tokenizer::{DefaultEmitter, Tokenizer, Token};

let html = "<title   >hello world</title>";
let emitter = DefaultEmitter::default();
let mut new_html = String::new();

for token in Tokenizer::new(html, emitter).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```

## What a tokenizer does and what it does not do
`html5tokenizer` fully implements [13.2.5 of the WHATWG HTML spec][tokenization],
i.e. it is able to tokenize HTML documents, and it passes [html5lib's tokenizer test suite].

Since it is just a tokenizer, this means:

* `html5tokenizer` **does not** implement [charset detection].
This implementation requires all input to be Rust strings and therefore valid UTF-8.
* `html5tokenizer` **does not** correct [misnested tags].
* `html5tokenizer` **does not** recognize implicitly self-closing elements like
`<img>`. As a tokenizer, it will simply emit a start tag token. It does, however,
emit a self-closing tag for `<img .. />`.
* `html5tokenizer` **does not** generally qualify as a browser-grade HTML *parser* as
per the WHATWG spec. This may change in the future.
With those caveats in mind, `html5tokenizer` can pretty much ~parse~ _tokenize_
anything that browsers can.
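
Because the tokenizer leaves void-element handling to its consumer, code built on top of it may want its own lookup of element names that never take an end tag. The following is a minimal sketch under that assumption. `VOID_ELEMENTS` and `is_void_element` are hypothetical names that are not part of `html5tokenizer`, and the list mirrors the void elements named in the WHATWG HTML spec at the time of writing:

```rust
/// Void elements as listed in the WHATWG HTML spec.
/// Hypothetical constant for illustration; not provided by `html5tokenizer`.
const VOID_ELEMENTS: [&str; 13] = [
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
];

/// Returns `true` if a start tag with this name has no matching end tag.
/// The comparison ignores ASCII case as a defensive measure.
fn is_void_element(tag_name: &str) -> bool {
    VOID_ELEMENTS
        .iter()
        .any(|void_name| void_name.eq_ignore_ascii_case(tag_name))
}
```

A consumer could call `is_void_element` with the name of each start tag it receives and treat the element as already closed when it returns `true`.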
## The `Emitter` trait
A distinguishing feature of `html5tokenizer` is that you can bring your own token
data structure and hook into token creation by implementing the `Emitter` trait.
This allows you to:
* Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.
* Efficiently filter out uninteresting categories of data without ever allocating
for them. For example, if any plaintext between tokens is not of interest to
you, you can implement the respective trait methods as no-ops and therefore
avoid any overhead of creating plaintext tokens.
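
The filtering idea can be sketched without the crate itself. The example below is only an illustration of the pattern: `SketchEmitter`, `TagCounter`, `Event`, `drive`, and `count_title_tags` are all hypothetical names, and the real `Emitter` trait has more methods and different signatures, so consult the crate documentation before implementing it.

```rust
/// Hypothetical, heavily simplified emitter interface for illustration only.
/// This is NOT the `Emitter` trait from `html5tokenizer`.
trait SketchEmitter {
    fn start_tag(&mut self, name: &str);
    fn end_tag(&mut self, name: &str);
    fn text(&mut self, text: &str);
}

/// An emitter that only counts tags and never stores any token data.
#[derive(Default)]
struct TagCounter {
    start_tags: usize,
    end_tags: usize,
}

impl SketchEmitter for TagCounter {
    fn start_tag(&mut self, _name: &str) {
        self.start_tags += 1;
    }

    fn end_tag(&mut self, _name: &str) {
        self.end_tags += 1;
    }

    // Plaintext is of no interest here, so this method is a no-op:
    // no token is built and nothing is allocated for the text.
    fn text(&mut self, _text: &str) {}
}

/// Toy stand-in for a tokenizer: it replays already-split events into the
/// emitter instead of parsing HTML.
enum Event<'a> {
    Start(&'a str),
    End(&'a str),
    Text(&'a str),
}

/// Forwards each event to the matching emitter method.
fn drive<E: SketchEmitter>(events: &[Event<'_>], emitter: &mut E) {
    for event in events {
        match event {
            Event::Start(name) => emitter.start_tag(name),
            Event::End(name) => emitter.end_tag(name),
            Event::Text(text) => emitter.text(text),
        }
    }
}

/// Counts start and end tags in the events for `<title>hello world</title>`.
fn count_title_tags() -> (usize, usize) {
    let events = [
        Event::Start("title"),
        Event::Text("hello world"),
        Event::End("title"),
    ];
    let mut counter = TagCounter::default();
    drive(&events, &mut counter);
    (counter.start_tags, counter.end_tags)
}
```

Since the emitter decides what to keep, the driver never has to build a token for data the caller discards, which is the property the list above describes.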
## License
Licensed under the MIT license, see [`./LICENSE`](./LICENSE).

[html5gum]: https://crates.io/crates/html5gum
[tokenization]: https://html.spec.whatwg.org/multipage/parsing.html#tokenization
[html5lib's tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
[charset detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
[misnested tags]: https://html.spec.whatwg.org/multipage/parsing.html#an-introduction-to-error-handling-and-strange-cases-in-the-parser