# html5tokenizer

[![docs.rs](https://img.shields.io/docsrs/html5tokenizer)](https://docs.rs/html5tokenizer)
[![crates.io](https://img.shields.io/crates/l/html5tokenizer.svg)](https://crates.io/crates/html5tokenizer)

Spec-compliant HTML parsing [requires both tokenization and tree construction][parsing model].
This crate implements a spec-compliant HTML tokenizer but does not implement
tree construction. Instead, it provides a `NaiveParser`, which can be used as follows:

```
use std::fmt::Write;
use html5tokenizer::{NaiveParser, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

for token in NaiveParser::new(html).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```

## Compared to html5gum

`html5tokenizer` was forked from [html5gum] 0.2.1.

* Code span support has been added.
* The API has been revised.

For details please refer to the [changelog].

html5gum has since switched its parsing to operate on bytes,
which `html5tokenizer` doesn't yet support.
`html5tokenizer` **does not** implement [charset detection]:
it requires all input to be Rust strings, and therefore valid UTF-8.
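
Since the input has to be a Rust string, bytes of unknown origin need to be validated or converted before tokenizing. The following is a minimal sketch using only the standard library; the helper names `html_from_bytes_strict` and `html_from_bytes_lossy` are hypothetical and not part of this crate. Note that neither helper performs charset detection: both simply assume the bytes are meant to be UTF-8.

```rust
use std::borrow::Cow;

/// Hypothetical helper: accept the bytes only if they are valid UTF-8.
/// Returns `None` for anything else, leaving the decision to the caller.
fn html_from_bytes_strict(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

/// Hypothetical helper: replace every invalid sequence with U+FFFD.
/// Borrows the input when it is already valid UTF-8, allocates otherwise.
fn html_from_bytes_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

Either result can be borrowed as a `&str` and passed to `NaiveParser::new` as in the example at the top of this README.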

Both crates pass the [html5lib tokenizer test suite].

Both crates have an `Emitter` trait that lets you bring your own token data
structure and hook into token creation.
Implementing it allows you to:

* Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.

* Efficiently filter out uninteresting categories of data without ever allocating
  for them. For example, if the plaintext between tokens is not of interest to
  you, you can implement the respective trait methods as no-ops and thereby
  avoid any overhead from creating plaintext tokens.
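
To illustrate the filtering idea, here is a self-contained sketch. It deliberately does **not** use the real `Emitter` trait, whose methods differ between versions and between the two crates; `SketchEmitter`, `TagNamesOnly`, `drive` and `collect_tag_names` are hypothetical names invented for this example, and `drive` is a toy stand-in for a tokenizer that reports a fixed sequence of events.

```rust
/// Hypothetical stand-in for an emitter interface.
/// This is NOT the crate's actual `Emitter` trait.
trait SketchEmitter {
    fn emit_start_tag(&mut self, name: &str);
    fn emit_end_tag(&mut self, name: &str);
    fn emit_text(&mut self, text: &str);
}

/// Records start tag names and ignores everything else.
#[derive(Default)]
struct TagNamesOnly {
    names: Vec<String>,
}

impl SketchEmitter for TagNamesOnly {
    fn emit_start_tag(&mut self, name: &str) {
        self.names.push(name.to_owned());
    }

    // No-ops: nothing is allocated for end tags or plaintext.
    fn emit_end_tag(&mut self, _name: &str) {}
    fn emit_text(&mut self, _text: &str) {}
}

/// Toy driver standing in for a tokenizer: it reports the events
/// for `<title>hello world</title>` to whatever emitter it is given.
fn drive<E: SketchEmitter>(emitter: &mut E) {
    emitter.emit_start_tag("title");
    emitter.emit_text("hello world");
    emitter.emit_end_tag("title");
}

/// Runs the toy driver with the filtering emitter and returns what it kept.
fn collect_tag_names() -> Vec<String> {
    let mut emitter = TagNamesOnly::default();
    drive(&mut emitter);
    emitter.names
}
```

Because the driver is generic over the emitter, the no-op methods can be inlined away, so the discarded text never costs an allocation. For the real trait and its method signatures, consult the API documentation of the crate version you are using.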

## License

Licensed under the MIT license; see [the LICENSE file].


[parsing model]: https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model
[html5gum]: https://crates.io/crates/html5gum
[html5lib tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
[charset detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
[changelog]: ./CHANGELOG.md
[the LICENSE file]: ./LICENSE