README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88

# html5tokenizer

[![docs.rs](https://img.shields.io/docsrs/html5tokenizer)](https://docs.rs/html5tokenizer)
[![crates.io](https://img.shields.io/crates/l/html5tokenizer.svg)](https://crates.io/crates/html5tokenizer)

Spec-compliant HTML parsing [requires both tokenization and tree-construction][parsing model].
While this crate implements a spec-compliant HTML tokenizer it does not implement any
tree-construction. Instead it just provides a `NaiveParser` that may be used as follows:

```rust
use std::fmt::Write;
use html5tokenizer::{NaiveParser, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

for token in NaiveParser::new(html).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::Char(c) => {
            write!(new_html, "{c}").unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        Token::EndOfFile => {},
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```

This library can provide source spans. For an example, see
[`examples/spans.rs`], which produces the following output:

```output id=spans
note:
  ┌─ file.html:1:2
  │
1 │ <img src=example.jpg alt="some description">
  │  ^^^ ^^^ ^^^^^^^^^^^ ^^^  ^^^^^^^^^^^^^^^^ attr value
  │  │   │   │           │
  │  │   │   │           attr name
  │  │   │   attr value
  │  │   attr name
  │  tag name
```

## Limitations

* This crate does not yet implement tree construction  
  (which is necessary for spec-compliant HTML parsing).

* This crate does not yet implement [character encoding detection].

* This crate does not yet implement spans for character tokens.

## Compliance & testing

The tokenizer passes the [html5lib tokenizer test suite].
The library is not yet fuzz tested.

## Credits

html5tokenizer was forked from [html5gum] 0.2.1, which was created by
Markus Unterwaditzer who deserves major props for implementing all 80 (!)
tokenizer states.

* Code span support has been added.
* The API has been revised.

For details please refer to the [changelog].

## License

Licensed under the MIT license, see [the LICENSE file].


[parsing model]: https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model
[`examples/spans.rs`]: ./examples/spans.rs
[character encoding detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
[html5lib tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
[html5gum]: https://crates.io/crates/html5gum
[changelog]: ./CHANGELOG.md
[the LICENSE file]: ./LICENSE