restructure readme

author: Markus Unterwaditzer <markus-honeypot@unterwaditzer.net> 2021-11-28 00:54:23 +0100
committer: Markus Unterwaditzer <markus-honeypot@unterwaditzer.net> 2021-11-28 00:54:23 +0100
commit: cdfaceb7eb60f1458ca323f85bfe51bd46beea50 (patch)
tree: 63d81040b6ec3bb302e68b45e76cd3a5df89bae2
parent: 994cb8f1399c0a517c1cb54be2533c26d0f1c4bc (diff)
1 files changed, 14 insertions, 5 deletions
diff --git a/README.md b/README.md
index 18752b7..c608735 100644
--- a/README.md
+++ b/README.md
@@ -30,22 +30,29 @@ for token in Tokenizer::new(html).infallible() {
 assert_eq!(new_html, "<title>hello world</title>");
 ```
 
-It fully implements [13.2.5 of the WHATWG HTML
+## What a tokenizer does and what it does not do
+
+`html5gum` fully implements [13.2.5 of the WHATWG HTML
 spec](https://html.spec.whatwg.org/#tokenization), i.e. is able to tokenize HTML documents and passes [html5lib's tokenizer
 test suite](https://github.com/html5lib/html5lib-tests/tree/master/tokenizer). Since it is just a tokenizer, this means:
 
-* html5gum **does not** [implement charset
+* `html5gum` **does not** [implement charset
   detection.](https://html.spec.whatwg.org/#determining-the-character-encoding)
   This implementation requires all input to be Rust strings and therefore valid
   UTF-8.
-* html5gum **does not** [correct mis-nested
+* `html5gum` **does not** [correct mis-nested
   tags.](https://html.spec.whatwg.org/#an-introduction-to-error-handling-and-strange-cases-in-the-parser)
-* html5gum **does not** recognize implicitly self-closing elements like
+* `html5gum` **does not** recognize implicitly self-closing elements like
   `<img>`, as a tokenizer it will simply emit a start token. It does however
   emit a self-closing tag for `<img .. />`.
-* html5gum **does not** generally qualify as a browser-grade HTML *parser* as
+* `html5gum` **does not** generally qualify as a browser-grade HTML *parser* as
   per the WHATWG spec. This can change in the future.
 
+With those caveats in mind, `html5gum` can pretty much parse any syntactical
+mess that browsers can, because that's what a tokenizer does.
+
+## The `Emitter` trait
+
 A distinguishing feature of `html5gum` is that you can bring your own token
 datastructure and hook into token creation by implementing the `Emitter` trait.
 This allows you to:
@@ -57,6 +64,8 @@ This allows you to:
   you, you can implement the respective trait methods as noop and therefore
   avoid any overhead creating plaintext tokens.
 
+## Alternative HTML parsers
+
 `html5gum` was created out of a need to parse HTML tag soup efficiently. Previous options were to:
 
 * use [quick-xml](https://github.com/tafia/quick-xml/) or
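
The new "The `Emitter` trait" section in this patch describes hooking into token creation and turning unwanted callbacks into no-ops. The sketch below illustrates that general idea only. It is not `html5gum`'s API: `ToyEmitter`, `toy_tokenize`, `StartTagCounter`, and `count_start_tags` are hypothetical names, and the scanner is a toy that recognizes only `<name>`, `</name>`, and plain text rather than the WHATWG tokenization rules.

```rust
/// Hypothetical callback trait for illustration; NOT html5gum's `Emitter`.
trait ToyEmitter {
    fn start_tag(&mut self, name: &str);
    fn end_tag(&mut self, name: &str);
    fn text(&mut self, text: &str);
}

/// Toy scanner (assumption: input contains only `<name>`, `</name>`, text).
/// It calls into the emitter instead of building token values itself.
fn toy_tokenize<E: ToyEmitter>(input: &str, emitter: &mut E) {
    let mut rest = input;
    while !rest.is_empty() {
        match rest.find('<') {
            // A tag starts here: read up to the closing `>`.
            Some(0) => match rest.find('>') {
                Some(end) => {
                    let inner = &rest[1..end];
                    if let Some(name) = inner.strip_prefix('/') {
                        emitter.end_tag(name);
                    } else {
                        emitter.start_tag(inner);
                    }
                    rest = &rest[end + 1..];
                }
                // Unterminated tag: treat the remainder as text.
                None => {
                    emitter.text(rest);
                    rest = "";
                }
            },
            // Text up to the next `<`.
            Some(pos) => {
                emitter.text(&rest[..pos]);
                rest = &rest[pos..];
            }
            // No more tags: the remainder is text.
            None => {
                emitter.text(rest);
                rest = "";
            }
        }
    }
}

/// An emitter that only counts start tags and allocates nothing.
#[derive(Default)]
struct StartTagCounter {
    count: usize,
}

impl ToyEmitter for StartTagCounter {
    fn start_tag(&mut self, _name: &str) {
        self.count += 1;
    }
    // End tags and text are irrelevant here, so these are no-ops.
    fn end_tag(&mut self, _name: &str) {}
    fn text(&mut self, _text: &str) {}
}

/// Convenience wrapper: run the toy scanner with the counting emitter.
fn count_start_tags(html: &str) -> usize {
    let mut counter = StartTagCounter::default();
    toy_tokenize(html, &mut counter);
    counter.count
}
```

Because the emitter decides what to store, a caller interested only in tag counts never pays for building text tokens. Calling `count_start_tags` on the README's `<title>hello world</title>` string visits one start tag, one text run, and one end tag, and only the first changes any state.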