author    Markus Unterwaditzer <markus-honeypot@unterwaditzer.net>  2021-11-24 20:44:08 +0100
committer Markus Unterwaditzer <markus-honeypot@unterwaditzer.net>  2021-11-24 20:51:21 +0100
commit    9909fc4580855a58a10eb84f0d143d1b3b3f464a (patch)
tree      36941a6a714a10b9ce554ba249975108e6a17274 /README.md
hello world
Diffstat (limited to 'README.md')
-rw-r--r--  README.md  81
1 file changed, 81 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..d7cbf50
--- /dev/null
+++ b/README.md
@@ -0,0 +1,81 @@
+# html5gum
+
+[![docs.rs](https://img.shields.io/docsrs/html5gum)](https://docs.rs/html5gum)
+[![crates.io](https://img.shields.io/crates/l/html5gum.svg)](https://crates.io/crates/html5gum)
+
+`html5gum` is a WHATWG-compliant HTML tokenizer.
+
+```rust
+use std::fmt::Write;
+use html5gum::{Tokenizer, Token};
+
+let html = "<title >hello world</title>";
+let mut new_html = String::new();
+
+for token in Tokenizer::new(html) {
+ match token {
+ Token::StartTag(tag) => {
+ write!(new_html, "<{}>", tag.name).unwrap();
+ }
+ Token::String(hello_world) => {
+ write!(new_html, "{}", hello_world).unwrap();
+ }
+ Token::EndTag(tag) => {
+ write!(new_html, "</{}>", tag.name).unwrap();
+ }
+ _ => panic!("unexpected input"),
+ }
+}
+
+assert_eq!(new_html, "<title>hello world</title>");
+```
+
+It fully implements [section 13.2 of the WHATWG HTML
+spec](https://html.spec.whatwg.org/#parsing) and passes [html5lib's tokenizer
+test suite](https://github.com/html5lib/html5lib-tests/tree/master/tokenizer),
+except that this implementation requires all input to be Rust strings and
+therefore valid UTF-8. There is no charset detection or handling of invalid
+surrogates, and the relevant html5lib tests are skipped in CI.
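+
+If your input is raw bytes rather than a string, converting to UTF-8 has to
+happen on the caller's side. A minimal sketch of one way to do that (assuming
+lossy conversion with replacement characters is acceptable for your use case):
+
+```rust
+use html5gum::{Tokenizer, Token};
+
+// Bytes that are not valid UTF-8 (note the 0xff); html5gum only accepts
+// &str, so we convert lossily before tokenizing.
+let bytes: &[u8] = b"<p>hello \xff world</p>";
+let html = String::from_utf8_lossy(bytes);
+
+for token in Tokenizer::new(&*html) {
+    if let Token::String(text) = token {
+        // Invalid bytes show up as U+FFFD replacement characters here.
+        println!("{}", text);
+    }
+}
+```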
+
+A distinguishing feature of `html5gum` is that you can bring your own token
+data structure and hook into token creation by implementing the `Emitter`
+trait. This allows you to:
+
+* Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.
+
+* Efficiently filter out uninteresting categories of data without ever
+  allocating for them. For example, if the plaintext between tags is of no
+  interest to you, you can implement the respective trait methods as no-ops
+  and avoid the overhead of creating plaintext tokens entirely (the sketch
+  after this list illustrates the idea).
+
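+The exact shape of the `Emitter` trait is documented on
+[docs.rs](https://docs.rs/html5gum). The sketch below does *not* use the real
+trait; it is a self-contained toy that only illustrates the pattern: the
+tokenizer drives fine-grained callbacks, and an emitter can turn callbacks for
+data it does not care about into no-ops, so nothing is ever allocated for it.
+
+```rust
+// Toy illustration only; this is NOT html5gum's actual `Emitter` trait.
+trait ToyEmitter {
+    fn push_tag_name(&mut self, name: &str);
+    fn emit_text(&mut self, text: &str);
+}
+
+// Collects tag names, ignores all text between tags.
+#[derive(Default)]
+struct TagNamesOnly {
+    names: Vec<String>,
+}
+
+impl ToyEmitter for TagNamesOnly {
+    fn push_tag_name(&mut self, name: &str) {
+        self.names.push(name.to_owned());
+    }
+
+    // No-op: the text is never copied, so no allocation happens for it.
+    fn emit_text(&mut self, _text: &str) {}
+}
+
+let mut emitter = TagNamesOnly::default();
+// A real tokenizer would drive these calls while scanning the input.
+emitter.push_tag_name("title");
+emitter.emit_text("hello world");
+assert_eq!(emitter.names, vec!["title"]);
+```
+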
+`html5gum` was created out of a need to parse HTML tag soup efficiently. Previous options were to:
+
+* use [quick-xml](https://github.com/tafia/quick-xml/) or
+  [xmlparser](https://github.com/RazrFalcon/xmlparser) with some hacks to make
+  either one not choke on bad HTML. For a rather large subset of HTML input
+  this works well (in particular, `quick-xml` can be configured to be very
+  lenient about parsing errors) and parsing speed is stellar. But neither can
+  parse all HTML.
+
+  For my own use case, `html5gum` is about 2x slower than `quick-xml`.
+
+* use [html5ever's own
+  tokenizer](https://docs.rs/html5ever/0.25.1/html5ever/tokenizer/index.html)
+  to avoid as much tree-building overhead as possible. This was functional but
+  had poor performance for my own use case (10-15x slower than `quick-xml`).
+
+* use [lol-html](https://github.com/cloudflare/lol-html), which would probably
+  perform at least as well as `html5gum`, but comes with a closure-based API
+  that I didn't manage to get working for my use case.
+
+## Etymology
+
+Why is this library called `html5gum`?
+
+* G.U.M: **G**iant **U**nreadable **M**atch-statement
+
+* \<insert "how it feels to <s>chew 5 gum</s> _parse HTML_" meme here\>
+
+## License
+
+Licensed under the MIT license, see [`./LICENSE`](./LICENSE).