# html5tokenizer
[![docs.rs](https://img.shields.io/docsrs/html5tokenizer)](https://docs.rs/html5tokenizer)
[![crates.io](https://img.shields.io/crates/l/html5tokenizer.svg)](https://crates.io/crates/html5tokenizer)

Spec-compliant HTML parsing [requires both tokenization and tree construction][parsing model].
While this crate implements a spec-compliant HTML tokenizer, it does not implement
tree construction. Instead, it provides a `NaiveParser` that may be used as follows:
```rust
use std::fmt::Write;
use html5tokenizer::{NaiveParser, Token};

// The extra whitespace inside the start tag is dropped by the tokenizer.
let html = "<title   >hello world</title>";
let mut new_html = String::new();

// `flatten()` skips over any `Err` items and yields only the tokens.
for token in NaiveParser::new(html).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::Char(c) => {
            write!(new_html, "{c}").unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```
This library can provide source spans. For an example, see
[`examples/spans.rs`], which produces the following output:
```output id=spans
note:
  ┌─ file.html:1:2
  │
1 │ <img src=example.jpg alt="some description">
  │  ^^^ ^^^ ^^^^^^^^^^^ ^^^  ^^^^^^^^^^^^^^^^ attr value
  │  │   │   │           │
  │  │   │   │           attr name
  │  │   │   attr value
  │  │   attr name
  │  tag name
```
## Limitations
* This crate does not yet implement tree construction
(which is necessary for spec-compliant HTML parsing).
* This crate does not yet implement [character encoding detection].
* This crate does not yet implement spans for character tokens.
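
To illustrate the first limitation: a flat token stream carries no nesting information, so pairing start and end tags is left to the consumer. The following is only a minimal sketch under stated assumptions. `Tok` and `is_balanced` are hypothetical names defined here for illustration; `Tok` is a stand-in and not this crate's `Token` type, and the stack check is far simpler than the tree construction described in the spec.

```rust
// Hypothetical stand-in for tokenizer output; NOT the crate's `Token` type.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Start(String),
    End(String),
    Char(char),
}

/// Returns true if every end tag closes the most recently opened start tag
/// and no start tag is left open. Real tree construction also handles
/// implied end tags, void elements, and error recovery; none of that is
/// attempted here.
fn is_balanced(tokens: &[Tok]) -> bool {
    let mut open: Vec<&str> = Vec::new();
    for tok in tokens {
        match tok {
            Tok::Start(name) => open.push(name.as_str()),
            Tok::End(name) => {
                // An end tag must match the innermost open element.
                if open.pop() != Some(name.as_str()) {
                    return false;
                }
            }
            Tok::Char(_) => {}
        }
    }
    open.is_empty()
}

fn main() {
    let nested = [Tok::Start("b".into()), Tok::Char('x'), Tok::End("b".into())];
    let crossed = [
        Tok::Start("b".into()),
        Tok::Start("i".into()),
        Tok::End("b".into()),
        Tok::End("i".into()),
    ];
    assert!(is_balanced(&nested));
    assert!(!is_balanced(&crossed));
}
```

A tokenizer emits both sequences without complaint; deciding what the crossed sequence means is the job of tree construction.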
## Compliance & testing
The tokenizer passes the [html5lib tokenizer test suite].
The library is not yet fuzz tested.
## Credits
html5tokenizer was forked from [html5gum] 0.2.1, which was created by
Markus Unterwaditzer who deserves major props for implementing all 80 (!)
tokenizer states.

* Code span support has been added.
* The API has been revised.

For details please refer to the [changelog].
## License
Licensed under the MIT license, see [the LICENSE file].

[parsing model]: https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model
[`examples/spans.rs`]: ./examples/spans.rs
[character encoding detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
[html5lib tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
[html5gum]: https://crates.io/crates/html5gum
[changelog]: ./CHANGELOG.md
[the LICENSE file]: ./LICENSE