author | Martin Fischer <martin@push-f.com> | 2023-09-03 12:40:42 +0200 |
---|---|---|
committer | Martin Fischer <martin@push-f.com> | 2023-09-03 22:59:25 +0200 |
commit | a2be091994247181086eb34dcda0857bd5435fe4 (patch) | |
tree | d7d8ae95025c8546365022a8d00ef3553ef1d9b8 /README.md | |
parent | efd25742b9a24152873a4d066943bdcecfcfd2ab (diff) |
docs: add Limitations section to readme
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 18 |
1 files changed, 12 insertions, 6 deletions
@@ -32,6 +32,17 @@ for token in NaiveParser::new(html).flatten() {
 assert_eq!(new_html, "<title>hello world</title>");
 ```
 
+## Limitations
+
+* This crate does not yet implement tree construction
+  (which is necessary for spec-compliant HTML parsing).
+
+* This crate does not yet implement [character encoding detection].
+
+* The span logic assumes UTF-8 encoding.
+
+* This crate does not yet implement spans for character tokens.
+
 ## Compared to html5gum
 
 `html5tokenizer` was forked from [html5gum] 0.2.1.
@@ -41,11 +52,6 @@ assert_eq!(new_html, "<title>hello world</title>");
 
 For details please refer to the [changelog].
 
-html5gum has since switched its parsing to operate on bytes,
-which html5tokenizer doesn't yet support.
-`html5tokenizer` **does not** implement [charset detection].
-This implementation requires all input to be Rust strings and therefore valid UTF-8.
-
 Both crates pass the [html5lib tokenizer test suite].
 
 Both crates have an `Emitter` trait that lets you bring your own token data
@@ -65,8 +71,8 @@ Licensed under the MIT license, see [the LICENSE file].
 
 
 [parsing model]: https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model
+[character encoding detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
 [html5gum]: https://crates.io/crates/html5gum
 [html5lib tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
-[charset detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
 [changelog]: ./CHANGELOG.md
 [the LICENSE file]: ./LICENSE
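
The removed README text notes that input must be a Rust string and therefore valid UTF-8, and the new Limitations section says character encoding detection is not implemented. As a rough illustration of what that means for a caller holding raw bytes, the sketch below uses only the Rust standard library; the helper names `decode_utf8_input` and `decode_utf8_lossy` are hypothetical and are not part of `html5tokenizer`.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Hypothetical helper: strictly validate raw bytes as UTF-8 so they can be
/// passed on as a `&str`. No encoding detection happens here.
fn decode_utf8_input(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Hypothetical fallback: replace invalid sequences with U+FFFD instead of failing.
fn decode_utf8_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() {
    // Valid UTF-8 passes through unchanged.
    let valid = "<title>héllo</title>".as_bytes();
    assert_eq!(decode_utf8_input(valid).unwrap(), "<title>héllo</title>");

    // 0xE9 is "é" in Latin-1 but is not valid UTF-8 when followed by ASCII.
    let latin1 = b"<title>h\xE9llo</title>";
    assert!(decode_utf8_input(latin1).is_err());
    assert_eq!(decode_utf8_lossy(latin1), "<title>h\u{FFFD}llo</title>");
}
```

Input in another encoding would have to be transcoded to UTF-8 by the caller before tokenizing; how best to do that is outside what this commit describes.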