author    Martin Fischer <martin@push-f.com>    2023-08-12 12:58:08 +0200
committer Martin Fischer <martin@push-f.com>    2023-08-19 13:53:58 +0200
commit    0d9cd9ed44b676ccd4991cea27dc620b94ebe7e7 (patch)
tree      aba2bff89958bbe4516a49caba5edffc866c64af /README.md
parent    b125bec9914bd211d77719bd60bc5a23bd9db579 (diff)
feat: introduce NaiveParser
Diffstat (limited to 'README.md')
-rw-r--r--  README.md  48
1 file changed, 20 insertions, 28 deletions
diff --git a/README.md b/README.md
index ce68663..54cf1bd 100644
--- a/README.md
+++ b/README.md
@@ -3,19 +3,18 @@
[![docs.rs](https://img.shields.io/docsrs/html5tokenizer)](https://docs.rs/html5tokenizer)
[![crates.io](https://img.shields.io/crates/l/html5tokenizer.svg)](https://crates.io/crates/html5tokenizer)
-`html5tokenizer` is a WHATWG-compliant HTML tokenizer
-(forked from [html5gum] with added code span support).
+Spec-compliant HTML parsing [requires both tokenization and tree-construction][parsing model].
+While this crate implements a spec-compliant HTML tokenizer, it does not implement any
+tree construction. Instead, it just provides a `NaiveParser` that may be used as follows:
-<!-- TODO: update to use NaiveParser API -->
-```ignore
+```
use std::fmt::Write;
-use html5tokenizer::{DefaultEmitter, Tokenizer, Token};
+use html5tokenizer::{NaiveParser, Token};
let html = "<title >hello world</title>";
-let emitter = DefaultEmitter::default();
let mut new_html = String::new();
-for token in Tokenizer::new(html, emitter).flatten() {
+for token in NaiveParser::new(html).flatten() {
match token {
Token::StartTag(tag) => {
write!(new_html, "<{}>", tag.name).unwrap();
@@ -33,31 +32,25 @@ for token in Tokenizer::new(html, emitter).flatten() {
assert_eq!(new_html, "<title>hello world</title>");
```
-## What a tokenizer does and what it does not do
+## Compared to html5gum
-`html5tokenizer` fully implements [13.2.5 of the WHATWG HTML spec][tokenization],
-i.e. is able to tokenize HTML documents and passes [html5lib's tokenizer test suite].
-Since it is just a tokenizer, this means:
+`html5tokenizer` was forked from [html5gum] 0.2.1.
-* `html5tokenizer` **does not** implement [charset detection].
- This implementation requires all input to be Rust strings and therefore valid UTF-8.
-* `html5tokenizer` **does not** correct [misnested tags].
-* `html5tokenizer` **does not** recognize implicitly self-closing elements like
- `<img>`, as a tokenizer it will simply emit a start token. It does however
- emit a self-closing tag for `<img .. />`.
-* `html5tokenizer` **does not** generally qualify as a browser-grade HTML *parser* as
- per the WHATWG spec. This can change in the future.
+* Code span support has been added.
+* The API has been revised.
-With those caveats in mind, `html5tokenizer` can pretty much ~parse~ _tokenize_
-anything that browsers can.
+html5gum has since switched its parsing to operate on bytes,
+which html5tokenizer doesn't yet support.
+`html5tokenizer` **does not** implement [charset detection].
+This implementation requires all input to be Rust strings and therefore valid UTF-8.
-## The `Emitter` trait
+Both crates pass the [html5lib tokenizer test suite].
-A distinguishing feature of `html5tokenizer` is that you can bring your own token
-datastructure and hook into token creation by implementing the `Emitter` trait.
+Both crates have an `Emitter` trait that lets you bring your own token data
+structure and hook into token creation.
This allows you to:
-* Rewrite all per-HTML-tag allocations to use a custom allocator or datastructure.
+* Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.
* Efficiently filter out uninteresting categories of data without ever allocating
 for it. For example, if any plaintext between tokens is not of interest to
@@ -69,9 +62,8 @@ This allows you to:
Licensed under the MIT license, see [the LICENSE file].
+[parsing model]: https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model
[html5gum]: https://crates.io/crates/html5gum
-[tokenization]: https://html.spec.whatwg.org/multipage/parsing.html#tokenization
-[html5lib's tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
+[html5lib tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
[charset detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
-[misnested tags]: https://html.spec.whatwg.org/multipage/parsing.html#an-introduction-to-error-handling-and-strange-cases-in-the-parser
[the LICENSE file]: ./LICENSE
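
The comparison section in the diff above notes that html5tokenizer does not do charset
detection and only accepts Rust strings. A minimal standalone sketch of what that means
for byte input, reusing only the `NaiveParser`/`Token` API visible in the README example
in this diff (the byte source and the error handling here are illustrative assumptions,
not part of the crate):

```rust
use html5tokenizer::{NaiveParser, Token};

fn main() {
    // No charset detection: the caller has to decode bytes to valid UTF-8
    // before handing the input to the parser.
    let bytes: &[u8] = b"<title >hello world</title>";
    let html = std::str::from_utf8(bytes).expect("input must be valid UTF-8");

    // Same iteration pattern as in the README example above.
    for token in NaiveParser::new(html).flatten() {
        if let Token::StartTag(tag) = token {
            println!("start tag: {}", tag.name);
        }
    }
}
```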
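
The `Emitter` paragraphs in the diff above describe bringing your own token data structure
and skipping allocations for uninteresting data. The sketch below only illustrates that
general pattern with a made-up `TokenSink` trait and `TagCounter` sink; it is not
html5tokenizer's actual `Emitter` trait, whose methods are not shown in this diff:

```rust
// Hypothetical, self-contained illustration of the "bring your own token
// data structure" idea; NOT html5tokenizer's real `Emitter` trait.
trait TokenSink {
    fn start_tag(&mut self, name: &str);
    fn end_tag(&mut self, name: &str);
    fn text(&mut self, text: &str);
}

/// A sink that only counts tags and drops all text,
/// so uninteresting data never causes an allocation.
#[derive(Default)]
struct TagCounter {
    start_tags: usize,
    end_tags: usize,
}

impl TokenSink for TagCounter {
    fn start_tag(&mut self, _name: &str) {
        self.start_tags += 1;
    }
    fn end_tag(&mut self, _name: &str) {
        self.end_tags += 1;
    }
    fn text(&mut self, _text: &str) {
        // Plaintext is ignored without being copied anywhere.
    }
}

fn main() {
    let mut sink = TagCounter::default();
    // A real tokenizer would drive these calls while scanning its input;
    // they are invoked by hand here only to show the control flow.
    sink.start_tag("title");
    sink.text("hello world");
    sink.end_tag("title");
    assert_eq!((sink.start_tags, sink.end_tags), (1, 1));
}
```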