# html5tokenizer

[![docs.rs](https://img.shields.io/docsrs/html5tokenizer)](https://docs.rs/html5tokenizer)
[![crates.io](https://img.shields.io/crates/l/html5tokenizer.svg)](https://crates.io/crates/html5tokenizer)

Spec-compliant HTML parsing [requires both tokenization and tree construction][parsing model].
This crate implements a spec-compliant HTML tokenizer but no tree construction.
Instead, it provides a `NaiveParser`, which may be used as follows:

```rust
use std::fmt::Write;
use html5tokenizer::{NaiveParser, Token};

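// Note the extra whitespace inside the start tag; it does not survive tokenization.
let html = "<title   >hello world</title>";
let mut new_html = String::new();

// `flatten()` skips any `Err` items, so only successfully produced tokens are handled.
for token in NaiveParser::new(html).flatten() {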
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```

## Limitations

* This crate does not yet implement tree construction  
  (which is necessary for spec-compliant HTML parsing).

* This crate does not yet implement [character encoding detection].

* The span logic assumes UTF-8 encoding.

* This crate does not yet implement spans for character tokens.
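
To illustrate why the encoding assumption matters, here is a minimal sketch using only the standard library. It assumes that a span is a range of byte offsets into the UTF-8 source, which is an assumption made for this example and not something stated above. The helpers `byte_span` and `char_offset` are hypothetical and not part of this crate.

```rust
/// Hypothetical helper: byte range of the first occurrence of `needle` in `source`.
fn byte_span(source: &str, needle: &str) -> Option<std::ops::Range<usize>> {
    let start = source.find(needle)?;
    Some(start..start + needle.len())
}

/// Hypothetical helper: number of `char`s that precede the given byte offset.
/// Panics if `byte_offset` is not on a character boundary.
fn char_offset(source: &str, byte_offset: usize) -> usize {
    source[..byte_offset].chars().count()
}

/// Byte offset and character offset of `<b>` in a source containing one
/// two-byte character (U+00E9).
fn offsets_of_b_tag() -> (usize, usize) {
    let html = "<p>\u{e9}</p><b>";
    let span = byte_span(html, "<b>").expect("the tag is present");
    (span.start, char_offset(html, span.start))
}
```

Once the input contains non-ASCII text, the byte offset and the character offset of the same position differ, so a byte-based span cannot be applied unchanged to a differently encoded copy of the input.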

## Compliance & testing

The tokenizer passes the [html5lib tokenizer test suite].
The library is not yet fuzz tested.

## Compared to html5gum

`html5tokenizer` was forked from [html5gum] 0.2.1.

* Code span support has been added.
* The API has been revised.

For details please refer to the [changelog].

Both crates have an `Emitter` trait that lets you bring your own token data
structure and hook into token creation.
Implementing it allows you to:

* Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.

* Efficiently filter out uninteresting categories of data without ever allocating
  for them. For example, if the plaintext between tags is not of interest to
  you, you can implement the respective trait methods as no-ops and thereby
  avoid the overhead of creating plaintext tokens.
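
The filtering idea can be sketched in isolation as follows. The `SketchEmitter` trait, the `TagNameCollector` type and the `drive` function are hypothetical, heavily simplified stand-ins invented for this example. They are not the crate's actual `Emitter` API, whose methods are described in the documentation on docs.rs.

```rust
/// Hypothetical, simplified stand-in for an emitter trait (NOT the real `Emitter`).
trait SketchEmitter {
    fn emit_start_tag(&mut self, name: &str);
    fn emit_end_tag(&mut self, name: &str);
    /// Called for plaintext between tags.
    fn emit_text(&mut self, text: &str);
}

/// Records start tag names only; everything else is discarded without allocating.
#[derive(Default)]
struct TagNameCollector {
    start_tags: Vec<String>,
}

impl SketchEmitter for TagNameCollector {
    fn emit_start_tag(&mut self, name: &str) {
        self.start_tags.push(name.to_owned());
    }

    // No-ops: end tags and plaintext are of no interest here,
    // so no token is ever built for them.
    fn emit_end_tag(&mut self, _name: &str) {}
    fn emit_text(&mut self, _text: &str) {}
}

/// Toy driver standing in for a tokenizer: it feeds the events for
/// `<title>hello world</title>` to the given emitter.
fn drive<E: SketchEmitter>(emitter: &mut E) {
    emitter.emit_start_tag("title");
    emitter.emit_text("hello world");
    emitter.emit_end_tag("title");
}

/// Runs the toy driver and returns the collected start tag names.
fn collect_start_tags() -> Vec<String> {
    let mut emitter = TagNameCollector::default();
    drive(&mut emitter);
    emitter.start_tags
}
```

Because the caller decides what each hook does, the cost of uninteresting data is limited to a method call that returns immediately.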

## License

Licensed under the MIT license, see [the LICENSE file].


[parsing model]: https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model
[character encoding detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
[html5lib tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
[html5gum]: https://crates.io/crates/html5gum
[changelog]: ./CHANGELOG.md
[the LICENSE file]: ./LICENSE