author    Martin Fischer <martin@push-f.com>    2023-08-12 12:58:08 +0200
committer Martin Fischer <martin@push-f.com>    2023-08-19 13:53:58 +0200
commit    0d9cd9ed44b676ccd4991cea27dc620b94ebe7e7 (patch)
tree      aba2bff89958bbe4516a49caba5edffc866c64af /README.md
parent    b125bec9914bd211d77719bd60bc5a23bd9db579 (diff)
feat: introduce NaiveParser
Diffstat (limited to 'README.md')
-rw-r--r--  README.md  48
1 file changed, 20 insertions, 28 deletions
diff --git a/README.md b/README.md
index ce68663..54cf1bd 100644
--- a/README.md
+++ b/README.md
@@ -3,19 +3,18 @@
[![docs.rs](https://img.shields.io/docsrs/html5tokenizer)](https://docs.rs/html5tokenizer)
[![crates.io](https://img.shields.io/crates/l/html5tokenizer.svg)](https://crates.io/crates/html5tokenizer)
-`html5tokenizer` is a WHATWG-compliant HTML tokenizer
-(forked from [html5gum] with added code span support).
+Spec-compliant HTML parsing [requires both tokenization and tree-construction][parsing model].
+While this crate implements a spec-compliant HTML tokenizer, it does not implement any
+tree construction. Instead, it just provides a `NaiveParser` that may be used as follows:
-<!-- TODO: update to use NaiveParser API -->
-```ignore
+```
use std::fmt::Write;
-use html5tokenizer::{DefaultEmitter, Tokenizer, Token};
+use html5tokenizer::{NaiveParser, Token};
let html = "<title >hello world</title>";
-let emitter = DefaultEmitter::default();
let mut new_html = String::new();
-for token in Tokenizer::new(html, emitter).flatten() {
+for token in NaiveParser::new(html).flatten() {
match token {
Token::StartTag(tag) => {
write!(new_html, "<{}>", tag.name).unwrap();
@@ -33,31 +32,25 @@ for token in Tokenizer::new(html, emitter).flatten() {
assert_eq!(new_html, "<title>hello world</title>");
```
-## What a tokenizer does and what it does not do
+## Compared to html5gum
-`html5tokenizer` fully implements [13.2.5 of the WHATWG HTML spec][tokenization],
-i.e. is able to tokenize HTML documents and passes [html5lib's tokenizer test suite].
-Since it is just a tokenizer, this means:
+`html5tokenizer` was forked from [html5gum] 0.2.1.
-* `html5tokenizer` **does not** implement [charset detection].
- This implementation requires all input to be Rust strings and therefore valid UTF-8.
-* `html5tokenizer` **does not** correct [misnested tags].
-* `html5tokenizer` **does not** recognize implicitly self-closing elements like
- `<img>`, as a tokenizer it will simply emit a start token. It does however
- emit a self-closing tag for `<img .. />`.
-* `html5tokenizer` **does not** generally qualify as a browser-grade HTML *parser* as
- per the WHATWG spec. This can change in the future.
+* Code span support has been added.
+* The API has been revised.
-With those caveats in mind, `html5tokenizer` can pretty much ~parse~ _tokenize_
-anything that browsers can.
+html5gum has since switched its parsing to operate on bytes,
+which html5tokenizer doesn't yet support.
+`html5tokenizer` **does not** implement [charset detection].
+This implementation requires all input to be Rust strings and therefore valid UTF-8.
-## The `Emitter` trait
+Both crates pass the [html5lib tokenizer test suite].
-A distinguishing feature of `html5tokenizer` is that you can bring your own token
-datastructure and hook into token creation by implementing the `Emitter` trait.
+Both crates have an `Emitter` trait that lets you bring your own token data
+structure and hook into token creation.
This allows you to:
-* Rewrite all per-HTML-tag allocations to use a custom allocator or datastructure.
+* Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.
* Efficiently filter out uninteresting categories of data without ever allocating
 for it. For example, if any plaintext between tokens is not of interest to
@@ -69,9 +62,8 @@ This allows you to:
Licensed under the MIT license, see [the LICENSE file].
+[parsing model]: https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model
[html5gum]: https://crates.io/crates/html5gum
-[tokenization]: https://html.spec.whatwg.org/multipage/parsing.html#tokenization
-[html5lib's tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
+[html5lib tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
[charset detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
-[misnested tags]: https://html.spec.whatwg.org/multipage/parsing.html#an-introduction-to-error-handling-and-strange-cases-in-the-parser
[the LICENSE file]: ./LICENSE
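
The comparison section in the diff above notes that html5tokenizer does not do charset
detection and only accepts Rust strings. A minimal standalone sketch of what that means
for byte input, reusing only the `NaiveParser`/`Token` API visible in the README example
in this diff (the byte source and the error handling here are illustrative assumptions,
not part of the crate):

```rust
use html5tokenizer::{NaiveParser, Token};

fn main() {
    // No charset detection: the caller has to decode bytes to valid UTF-8
    // before handing the input to the parser.
    let bytes: &[u8] = b"<title >hello world</title>";
    let html = std::str::from_utf8(bytes).expect("input must be valid UTF-8");

    // Same iteration pattern as in the README example above.
    for token in NaiveParser::new(html).flatten() {
        if let Token::StartTag(tag) = token {
            println!("start tag: {}", tag.name);
        }
    }
}
```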
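
The `Emitter` paragraphs in the diff above describe bringing your own token data structure
and skipping allocations for uninteresting data. The sketch below only illustrates that
general pattern with a made-up `TokenSink` trait and `TagCounter` sink; it is not
html5tokenizer's actual `Emitter` trait, whose methods are not shown in this diff:

```rust
// Hypothetical, self-contained illustration of the "bring your own token
// data structure" idea; NOT html5tokenizer's real `Emitter` trait.
trait TokenSink {
    fn start_tag(&mut self, name: &str);
    fn end_tag(&mut self, name: &str);
    fn text(&mut self, text: &str);
}

/// A sink that only counts tags and drops all text,
/// so uninteresting data never causes an allocation.
#[derive(Default)]
struct TagCounter {
    start_tags: usize,
    end_tags: usize,
}

impl TokenSink for TagCounter {
    fn start_tag(&mut self, _name: &str) {
        self.start_tags += 1;
    }
    fn end_tag(&mut self, _name: &str) {
        self.end_tags += 1;
    }
    fn text(&mut self, _text: &str) {
        // Plaintext is ignored without being copied anywhere.
    }
}

fn main() {
    let mut sink = TagCounter::default();
    // A real tokenizer would drive these calls while scanning its input;
    // they are invoked by hand here only to show the control flow.
    sink.start_tag("title");
    sink.text("hello world");
    sink.end_tag("title");
    assert_eq!((sink.start_tags, sink.end_tags), (1, 1));
}
```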