author | Martin Fischer <martin@push-f.com> | 2023-08-19 22:03:13 +0200 |
---|---|---|
committer | Martin Fischer <martin@push-f.com> | 2023-09-03 13:21:37 +0200 |
commit | 975b2206adb0250cedcfd28598e5b3098b239754 | |
tree | 00aefdc17a98c9cd5975893acd3f8db007867340 | |
parent | 330b802d5fb6dbdfd9b7f12de6e5d5acb31ed560 | |
docs: add changelog
-rw-r--r-- | CHANGELOG.md | 128 |
-rw-r--r-- | README.md | 3 |
-rw-r--r-- | src/lib.rs | 1 |
3 files changed, 132 insertions, 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..8af61ff
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,128 @@
+# html5tokenizer changelog
+
+### 0.5.0 - 2023-08-19
+
+#### Features
+
+* Added the `NaiveParser` API.
+
+* Added the `AttributeMap` type
+  (with some convenient `IntoIter` and `FromIterator` implementations).
+
+* Added spans to comments and doctypes.
+
+* Added all-inclusive spans to tags.
+
+* The attribute value syntax is now recognizable.
+
+#### Breaking changes
+
+Many ... but the API is now much better.
+
+* `Emitter` trait
+
+  * is now generic over offset rather than reader
+
+  * methods now get the offset directly rather than having to get it
+    from the reader (which was very awkward since the current reader
+    position depends on machine implementation details)
+
+  * Removed `set_last_start_tag`.
+    (was only used internally for testing the DefaultEmitter
+    and should never have been part of the trait)
+
+  * Added `Emitter::adjusted_current_node_present_and_not_in_html_namespace`.
+    (Previously the machine just hard-coded this.) Implementing this method
+    based on tree construction is necessary for correct CDATA handling.
+
+  * `current_is_appropriate_end_tag_token` has been removed.
+    So correct state transitions (aside from the CDATA caveat mentioned
+    above) no longer depend on a correct implementation of this method.
+
+  * The methods for doctype identifiers have been renamed.
+    `set_` has become `init_` and no longer gets passed a string
+    (since it was only ever the empty string anyway).
+    And the method names now end in `_id` instead of `_identifier`.
+
+* The `DefaultEmitter` has been made private since it now implements
+  `adjusted_current_node_present_and_not_in_html_namespace` by always returning
+  false, which results in all CDATA sections being tokenized as bogus comments.
+
+* Likewise `StartTag::next_state` has been removed since having to
+  manually match yielded tokens for start tags and remembering
+  to call that function, is just too easy-to-forget.
+
+* `Tokenizer::new` now requires you to specify the emitter and
+  `Tokenizer::new_with_emitter` has been removed since this change
+  made it redundant.
+
+* Removed the `Span` trait in favor of just using `Range<O>`,
+  where `O` is a type implementing the new `Offset` trait.
+  `Offset` is currently implemented for `usize` and
+  `NoopOffset`, which is a zero-sized no-op implementation.
+
+* The `GetPos` trait has been renamed to `Position` and made
+  generic over an offset type. `Position<NoopOffset>` is implemented
+  for every reader via a blanket implementation.
+
+* Removed `Default` implementations for `StartTag` and `EndTag`
+  (since tags with empty tag names are invalid and Default
+  implementations shouldn't create invalid values).
+
+* Removed `Display` implementation for `Error` which returned
+  the kebap-case error code in favor of a new `Error::code` method.
+  (So that we can introduce a proper human-friendly `Display`
+  implementation in the future without that being a breaking change.)
+
+* Renamed the `Readable` trait to `IntoReader`.
+  The reader traits are now no longer re-exported at the crate-level
+  but have to be used from the `reader` module (which has been made public).
+
+* Renamed `PosTracker` to `PosTrackingReader`.
+
+And more ... for details please refer to the git log.
+
+#### Internal changes
+
+* `cargo test` now just works.
+  (previously you had to supply `--features integration-tests` or `--all-features`)
+
+### 0.4.0 - 2021-12-05
+
+Started over by forking [html5gum] instead (which at the time was 0.2.1).
+The html5ever tokenizer code was littered with macros, which made it
+quite unreadable. The "giant unreadable match" (G.U.M) expression
+that Markus Unterwaditzer had implemented was much more readable.
+
+I made PR to add support for code spans but we came to a disagreement
+about commit practice. I didn't want my commits to be squashed.
+In hindsight my commits weren't that beautiful back then but I still
+think that I made the right call in preserving most of these changes
+individually in the git history (by forking html5gum).
+
+[html5gum]: https://crates.io/crates/html5gum
+
+## html5tokenizer forked from html5ever
+
+The git history before the switch to html5gum
+can be found in the [html5ever-fork] branch.
+
+[html5ever-fork]: https://git.push-f.com/html5tokenizer/log/?h=html5ever-fork
+
+### 0.3.0 - 2021-11-30
+
+Added some naive state switching based on
+start tag name and cleaned up the API a bit.
+
+### 0.2.0 - 2021-11-19
+
+Fixed that named entities weren't resolved
+(which again added a dependency on phf).
+
+### 0.1.0 - 2021-04-08
+
+I forked the tokenizer from [html5ever] and removed all
+of its dependencies (markup5ever, tendril, mac & log),
+which spared you 56 build dependencies.
+
+[html5ever]: https://crates.io/crates/html5ever
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -39,6 +39,8 @@ assert_eq!(new_html, "<title>hello world</title>");
 * Code span support has been added.
 * The API has been revised.
 
+For details please refer to the [changelog].
+
 html5gum has since switched its parsing to operate on bytes,
 which html5tokenizer doesn't yet support.
 `html5tokenizer` **does not** implement [charset detection].
@@ -66,4 +68,5 @@ Licensed under the MIT license, see [the LICENSE file].
 [html5gum]: https://crates.io/crates/html5gum
 [html5lib tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
 [charset detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
+[changelog]: ./CHANGELOG.md
 [the LICENSE file]: ./LICENSE
diff --git a/src/lib.rs b/src/lib.rs
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -1,6 +1,7 @@
 #![warn(missing_docs)]
 // This is an HTML parser. HTML can be untrusted input from the internet.
 #![forbid(unsafe_code)]
+#![doc = concat!("[changelog]: ", file_url!("CHANGELOG.md"))]
 #![doc = concat!("[the LICENSE file]: ", file_url!("LICENSE"))]
 #![doc = include_str!("../README.md")]
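
The changelog added by this commit says that spans are now plain `Range<O>` values, where `O` is either `usize` or a zero-sized `NoopOffset`. A minimal sketch of that idea follows. The trait body, its `advance` method, and the `span_of` helper are hypothetical stand-ins defined locally for illustration; they are assumptions and not the crate's actual definitions.

```rust
use std::mem::size_of;
use std::ops::Range;

/// Hypothetical stand-in for an offset abstraction (not the crate's real trait).
trait Offset: Copy {
    /// Returns this offset moved forward by `n` units.
    fn advance(self, n: usize) -> Self;
}

/// Real position tracking: offsets are plain indices.
impl Offset for usize {
    fn advance(self, n: usize) -> Self {
        self + n
    }
}

/// Zero-sized offset that records nothing (illustrative local type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct NoopOffset;

impl Offset for NoopOffset {
    fn advance(self, _n: usize) -> Self {
        NoopOffset
    }
}

/// Builds the span of a token that starts at `start` and covers `len` units.
fn span_of<O: Offset>(start: O, len: usize) -> Range<O> {
    start..start.advance(len)
}

fn main() {
    // With `usize`, the span carries real positions.
    let tracked = span_of(3usize, 5);
    assert_eq!(tracked, 3..8);

    // With the zero-sized offset, the span occupies no memory at all.
    let untracked = span_of(NoopOffset, 5);
    assert_eq!(size_of::<Range<NoopOffset>>(), 0);

    println!("{:?} {:?}", tracked, untracked);
}
```

Under these assumptions, code that is generic over the offset type pays for position tracking only when it is instantiated with `usize`.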