Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 26 |
1 files changed, 15 insertions, 11 deletions
@@ -3,8 +3,8 @@
 [![docs.rs](https://img.shields.io/docsrs/html5tokenizer)](https://docs.rs/html5tokenizer)
 [![crates.io](https://img.shields.io/crates/l/html5tokenizer.svg)](https://crates.io/crates/html5tokenizer)
 
-`html5tokenizer` is a WHATWG-compliant HTML tokenizer (forked from
-[html5gum](https://crates.io/crates/html5gum) with added code span support).
+`html5tokenizer` is a WHATWG-compliant HTML tokenizer
+(forked from [html5gum] with added code span support).
 
 ```rust
 use std::fmt::Write;
@@ -34,16 +34,13 @@ assert_eq!(new_html, "<title>hello world</title>");
 
 ## What a tokenizer does and what it does not do
 
-`html5tokenizer` fully implements [13.2.5 of the WHATWG HTML
-spec](https://html.spec.whatwg.org/#tokenization), i.e. is able to tokenize HTML documents and passes [html5lib's tokenizer
-test suite](https://github.com/html5lib/html5lib-tests/tree/master/tokenizer). Since it is just a tokenizer, this means:
+`html5tokenizer` fully implements [13.2.5 of the WHATWG HTML spec][tokenization],
+i.e. is able to tokenize HTML documents and passes [html5lib's tokenizer test suite].
+Since it is just a tokenizer, this means:
 
-* `html5tokenizer` **does not** [implement charset
-  detection.](https://html.spec.whatwg.org/#determining-the-character-encoding)
-  This implementation requires all input to be Rust strings and therefore valid
-  UTF-8.
-* `html5tokenizer` **does not** [correct mis-nested
-  tags.](https://html.spec.whatwg.org/#an-introduction-to-error-handling-and-strange-cases-in-the-parser)
+* `html5tokenizer` **does not** implement [charset detection].
+  This implementation requires all input to be Rust strings and therefore valid UTF-8.
+* `html5tokenizer` **does not** correct [misnested tags].
 * `html5tokenizer` **does not** recognize implicitly self-closing elements like
   `<img>`, as a tokenizer it will simply emit a start token. It does however
   emit a self-closing tag for `<img .. />`.
@@ -69,3 +66,10 @@ This allows you to:
 ## License
 
 Licensed under the MIT license, see [`./LICENSE`](./LICENSE).
+
+
+[html5gum]: https://crates.io/crates/html5gum
+[tokenization]: https://html.spec.whatwg.org/#tokenization
+[html5lib's tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
+[charset detection]: https://html.spec.whatwg.org/#determining-the-character-encoding
+[misnested tags]: https://html.spec.whatwg.org/#an-introduction-to-error-handling-and-strange-cases-in-the-parser