Diffstat (limited to 'README.md')
-rw-r--r-- README.md 16
1 files changed, 8 insertions, 8 deletions
diff --git a/README.md b/README.md
index 425ee18..c27190c 100644
--- a/README.md
+++ b/README.md
@@ -30,16 +30,16 @@ for token in Tokenizer::new(html).infallible() {
assert_eq!(new_html, "<title>hello world</title>");
```
-It fully implements [13.2 of the WHATWG HTML
-spec](https://html.spec.whatwg.org/#parsing) and passes [html5lib's tokenizer
-test suite](https://github.com/html5lib/html5lib-tests/tree/master/tokenizer),
-except that:
+It fully implements [13.2.5 of the WHATWG HTML
+spec](https://html.spec.whatwg.org/#tokenization), i.e. it can tokenize HTML documents, and passes [html5lib's tokenizer
+test suite](https://github.com/html5lib/html5lib-tests/tree/master/tokenizer). Most importantly, it does not:
-* this implementation requires all input to be Rust strings and therefore valid
- UTF-8. There is no charset detection or handling of invalid surrogates, and
- the relevant html5lib tests are skipped in CI.
+* [Implement charset detection.](https://html.spec.whatwg.org/#determining-the-character-encoding) This implementation requires all input to be
+ Rust strings and therefore valid UTF-8.
-* there's some remaining testcases to be decided on at [issue 5](https://github.com/untitaker/html5gum/issues/5).
+* [Correct mis-nested tags](https://html.spec.whatwg.org/#an-introduction-to-error-handling-and-strange-cases-in-the-parser)
+
+* Generally qualify as a complete HTML *parser* as per the WHATWG spec (yet).
A distinguishing feature of `html5gum` is that you can bring your own token
datastructure and hook into token creation by implementing the `Emitter` trait.