# html5tokenizer

[![docs.rs](https://img.shields.io/docsrs/html5tokenizer)](https://docs.rs/html5tokenizer)
[![crates.io](https://img.shields.io/crates/l/html5tokenizer.svg)](https://crates.io/crates/html5tokenizer)

Spec-compliant HTML parsing [requires both tokenization and tree-construction][parsing model].
While this crate implements a spec-compliant HTML tokenizer, it does not implement any
tree-construction. Instead, it just provides a `NaiveParser` that may be used as follows:

```rust
use std::fmt::Write;
use html5tokenizer::{NaiveParser, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

// `flatten()` skips tokenizer errors and yields only the successfully parsed tokens.
for token in NaiveParser::new(html).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::Char(c) => {
            write!(new_html, "{c}").unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        Token::EndOfFile => {},
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```

This library can provide source spans. For an example, see [`examples/spans.rs`],
which produces the following output:

```output id=spans
note:
  ┌─ file.html:1:2
  │
1 │ <img src=example.jpg alt="some description">
  │  ^^^ ^^^ ^^^^^^^^^^^ ^^^  ^^^^^^^^^^^^^^^^ attr value
  │  │   │   │           │
  │  │   │   │           attr name
  │  │   │   attr value
  │  │   attr name
  │  tag name
```

## Limitations

* This crate does not yet implement tree construction
  (which is necessary for spec-compliant HTML parsing).

* This crate does not yet implement [character encoding detection].

## Compliance & testing

The tokenizer passes the [html5lib tokenizer test suite].
The library is not yet fuzz tested.

## Credits

html5tokenizer was forked from [html5gum] 0.2.1, which was created by
Markus Unterwaditzer, who deserves major props for implementing all 80 (!)
tokenizer states.

Since the fork:

* Code span support has been added.
* The API has been revised.

For details, please refer to the [changelog].

## License

Licensed under the MIT license, see [the LICENSE file].

[parsing model]: https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model
[`examples/spans.rs`]: ./examples/spans.rs
[character encoding detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
[html5lib tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
[html5gum]: https://crates.io/crates/html5gum
[changelog]: ./CHANGELOG.md
[the LICENSE file]: ./LICENSE