author: Martin Fischer <martin@push-f.com> 2021-12-05 03:19:03 +0100
committer: Martin Fischer <martin@push-f.com> 2021-12-05 03:52:22 +0100
commit: b00714411306ee6500e4ee34a81bd7f4a111169e
tree: 66c4bfb3b6d898672b8bfe408080e3914b82af57
parent: b17d8055dfe0d57865fbad9419a07e30be378c67
rename to html5tokenizer, bump version (tag: v0.4.0)
Diffstat (limited to 'README.md')
-rw-r--r-- README.md | 55
1 files changed, 13 insertions, 42 deletions
diff --git a/README.md b/README.md
index 8f4567c..a97abb5 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,14 @@
-# html5gum
+# html5tokenizer
-[![docs.rs](https://img.shields.io/docsrs/html5gum)](https://docs.rs/html5gum)
-[![crates.io](https://img.shields.io/crates/l/html5gum.svg)](https://crates.io/crates/html5gum)
+[![docs.rs](https://img.shields.io/docsrs/html5tokenizer)](https://docs.rs/html5tokenizer)
+[![crates.io](https://img.shields.io/crates/l/html5tokenizer.svg)](https://crates.io/crates/html5tokenizer)
-`html5gum` is a WHATWG-compliant HTML tokenizer.
+`html5tokenizer` is a WHATWG-compliant HTML tokenizer (forked from
+[html5gum](https://crates.io/crates/html5gum) with added code span support).
```rust
use std::fmt::Write;
-use html5gum::{Tokenizer, Token};
+use html5tokenizer::{Tokenizer, Token};
let html = "<title >hello world</title>";
let mut new_html = String::new();
@@ -32,28 +33,28 @@ assert_eq!(new_html, "<title>hello world</title>");
## What a tokenizer does and what it does not do
-`html5gum` fully implements [13.2.5 of the WHATWG HTML
+`html5tokenizer` fully implements [13.2.5 of the WHATWG HTML
spec](https://html.spec.whatwg.org/#tokenization), i.e. is able to tokenize HTML documents and passes [html5lib's tokenizer
test suite](https://github.com/html5lib/html5lib-tests/tree/master/tokenizer). Since it is just a tokenizer, this means:
-* `html5gum` **does not** [implement charset
+* `html5tokenizer` **does not** [implement charset
detection.](https://html.spec.whatwg.org/#determining-the-character-encoding)
This implementation requires all input to be Rust strings and therefore valid
UTF-8.
-* `html5gum` **does not** [correct mis-nested
+* `html5tokenizer` **does not** [correct mis-nested
tags.](https://html.spec.whatwg.org/#an-introduction-to-error-handling-and-strange-cases-in-the-parser)
-* `html5gum` **does not** recognize implicitly self-closing elements like
+* `html5tokenizer` **does not** recognize implicitly self-closing elements like
`<img>`, as a tokenizer it will simply emit a start token. It does however
emit a self-closing tag for `<img .. />`.
-* `html5gum` **does not** generally qualify as a browser-grade HTML *parser* as
+* `html5tokenizer` **does not** generally qualify as a browser-grade HTML *parser* as
per the WHATWG spec. This can change in the future.
-With those caveats in mind, `html5gum` can pretty much ~parse~ _tokenize_
+With those caveats in mind, `html5tokenizer` can pretty much ~parse~ _tokenize_
anything that browsers can.
## The `Emitter` trait
-A distinguishing feature of `html5gum` is that you can bring your own token
+A distinguishing feature of `html5tokenizer` is that you can bring your own token
datastructure and hook into token creation by implementing the `Emitter` trait.
This allows you to:
@@ -64,36 +65,6 @@ This allows you to:
you, you can implement the respective trait methods as noop and therefore
avoid any overhead creating plaintext tokens.
-## Alternative HTML parsers
-
-`html5gum` was created out of a need to parse HTML tag soup efficiently. Previous options were to:
-
-* use [quick-xml](https://github.com/tafia/quick-xml/) or
- [xmlparser](https://github.com/RazrFalcon/xmlparser) with some hacks to make
- either one not choke on bad HTML. For some (rather large) set of HTML input
- this works well (particularly `quick-xml` can be configured to be very
- lenient about parsing errors) and parsing speed is stellar. But neither can
- parse all HTML.
-
- For my own usecase `html5gum` is about 2x slower than `quick-xml`.
-
-* use [html5ever's own
- tokenizer](https://docs.rs/html5ever/0.25.1/html5ever/tokenizer/index.html)
- to avoid as much tree-building overhead as possible. This was functional but
- had poor performance for my own usecase (10-15x slower than `quick-xml`).
-
-* use [lol-html](https://github.com/cloudflare/lol-html), which would probably
- perform at least as well as `html5gum`, but comes with a closure-based API
- that I didn't manage to get working for my usecase.
-
-## Etymology
-
-Why is this library called `html5gum`?
-
-* G.U.M: **G**iant **U**nreadable **M**atch-statement
-
-* \<insert "how it feels to <s>chew 5 gum</s> _parse HTML_" meme here\>
-
## License
Licensed under the MIT license, see [`./LICENSE`](./LICENSE).