Diffstat (limited to 'README.md')
-rw-r--r-- README.md | 18
1 files changed, 12 insertions, 6 deletions
diff --git a/README.md b/README.md
index 5da2675..61801d1 100644
--- a/README.md
+++ b/README.md
@@ -32,6 +32,17 @@ for token in NaiveParser::new(html).flatten() {
assert_eq!(new_html, "<title>hello world</title>");
```
+## Limitations
+
+* This crate does not yet implement tree construction
+ (which is necessary for spec-compliant HTML parsing).
+
+* This crate does not yet implement [character encoding detection].
+
+* The span logic assumes UTF-8 encoding.
+
+* This crate does not yet implement spans for character tokens.
+
## Compared to html5gum
`html5tokenizer` was forked from [html5gum] 0.2.1.
@@ -41,11 +52,6 @@ assert_eq!(new_html, "<title>hello world</title>");
For details please refer to the [changelog].
-html5gum has since switched its parsing to operate on bytes,
-which html5tokenizer doesn't yet support.
-`html5tokenizer` **does not** implement [charset detection].
-This implementation requires all input to be Rust strings and therefore valid UTF-8.
-
Both crates pass the [html5lib tokenizer test suite].
Both crates have an `Emitter` trait that lets you bring your own token data
@@ -65,8 +71,8 @@ Licensed under the MIT license, see [the LICENSE file].
[parsing model]: https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model
+[character encoding detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
[html5gum]: https://crates.io/crates/html5gum
[html5lib tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
-[charset detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
[changelog]: ./CHANGELOG.md
[the LICENSE file]: ./LICENSE