From cdfaceb7eb60f1458ca323f85bfe51bd46beea50 Mon Sep 17 00:00:00 2001
From: Markus Unterwaditzer
Date: Sun, 28 Nov 2021 00:54:23 +0100
Subject: restructure readme

---
 README.md | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 18752b7..c608735 100644
--- a/README.md
+++ b/README.md
@@ -30,22 +30,29 @@ for token in Tokenizer::new(html).infallible() {
 assert_eq!(new_html, "hello world");
 ```

-It fully implements [13.2.5 of the WHATWG HTML
+## What a tokenizer does and what it does not do
+
+`html5gum` fully implements [13.2.5 of the WHATWG HTML
 spec](https://html.spec.whatwg.org/#tokenization), i.e. is able to tokenize HTML
 documents and passes [html5lib's tokenizer test
 suite](https://github.com/html5lib/html5lib-tests/tree/master/tokenizer). Since
 it is just a tokenizer, this means:

-* html5gum **does not** [implement charset
+* `html5gum` **does not** [implement charset
   detection.](https://html.spec.whatwg.org/#determining-the-character-encoding)
   This implementation requires all input to be Rust strings and therefore valid
   UTF-8.
-* html5gum **does not** [correct mis-nested
+* `html5gum` **does not** [correct mis-nested
   tags.](https://html.spec.whatwg.org/#an-introduction-to-error-handling-and-strange-cases-in-the-parser)
-* html5gum **does not** recognize implicitly self-closing elements like
+* `html5gum` **does not** recognize implicitly self-closing elements like
   ``, as a tokenizer it will simply emit a start token. It does however
   emit a self-closing tag for ``.
-* html5gum **does not** generally qualify as a browser-grade HTML *parser* as
+* `html5gum` **does not** generally qualify as a browser-grade HTML *parser* as
   per the WHATWG spec. This can change in the future.

+With those caveats in mind, `html5gum` can pretty much parse any syntactical
+mess that browsers can, because that's what a tokenizer does.
+
+## The `Emitter` trait
+
 A distinguishing feature of `html5gum` is that you can bring your own token
 datastructure and hook into token creation by implementing the `Emitter` trait.
 This allows you to:
@@ -57,6 +64,8 @@ This allows you to:
   you, you can implement the respective trait methods as noop and therefore
   avoid any overhead creating plaintext tokens.

+## Alternative HTML parsers
+
 `html5gum` was created out of a need to parse HTML tag soup efficiently.
 Previous options were to:
 * use [quick-xml](https://github.com/tafia/quick-xml/) or
--
cgit v1.2.3