docs: add changelog

author: Martin Fischer <martin@push-f.com> 2023-08-19 22:03:13 +0200
committer: Martin Fischer <martin@push-f.com> 2023-09-03 13:21:37 +0200
commit: 975b2206adb0250cedcfd28598e5b3098b239754 (patch)
tree: 00aefdc17a98c9cd5975893acd3f8db007867340
parent: 330b802d5fb6dbdfd9b7f12de6e5d5acb31ed560 (diff)
3 files changed, 132 insertions, 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..8af61ff
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,128 @@
+# html5tokenizer changelog
+
+### 0.5.0 - 2023-08-19
+
+#### Features
+
+* Added the `NaiveParser` API.
+
+* Added the `AttributeMap` type
+  (with some convenient `IntoIter` and `FromIterator` implementations).
+
+* Added spans to comments and doctypes.
+
+* Added all-inclusive spans to tags.
+
+* The attribute value syntax is now recognizable.
+
+#### Breaking changes
+
+Many ... but the API is now much better.
+
+* `Emitter` trait
+
+  * is now generic over offset rather than reader
+
+  * methods now get the offset directly rather than having to get it
+    from the reader (which was very awkward since the current reader
+    position depends on machine implementation details)
+
+  * Removed `set_last_start_tag`.  
+    (was only used internally for testing the DefaultEmitter
+    and should never have been part of the trait)
+
+  * Added `Emitter::adjusted_current_node_present_and_not_in_html_namespace`.
+    (Previously the machine just hard-coded this.) Implementing this method
+    based on tree construction is necessary for correct CDATA handling.
+
+  * `current_is_appropriate_end_tag_token` has been removed.
+    So correct state transitions (aside from the CDATA caveat mentioned
+    above) no longer depend on a correct implementation of this method.
+
+  * The methods for doctype identifiers have been renamed.
+    `set_` has become `init_` and no longer gets passed a string
+    (since it was only ever the empty string anyway).
+    And the method names now end in `_id` instead of `_identifier`.
+
+* The `DefaultEmitter` has been made private since it now implements
+  `adjusted_current_node_present_and_not_in_html_namespace` by always returning
+  false, which results in all CDATA sections being tokenized as bogus comments.
+
+* Likewise `StartTag::next_state` has been removed, since manually
+  matching yielded tokens for start tags and then calling
+  that function is just too easy to forget.
+
+* `Tokenizer::new` now requires you to specify the emitter and
+  `Tokenizer::new_with_emitter` has been removed since this change
+  made it redundant.
+
+* Removed the `Span` trait in favor of just using `Range<O>`,
+  where `O` is a type implementing the new `Offset` trait.
+  `Offset` is currently implemented for `usize` and
+  `NoopOffset`, which is a zero-sized no-op implementation.
+
+* The `GetPos` trait has been renamed to `Position` and made
+  generic over an offset type. `Position<NoopOffset>` is implemented
+  for every reader via a blanket implementation.
+
+* Removed `Default` implementations for `StartTag` and `EndTag`
+  (since tags with empty tag names are invalid and Default
+  implementations shouldn't create invalid values).
+
+* Removed `Display` implementation for `Error` which returned
+  the kebab-case error code in favor of a new `Error::code` method.
+  (So that we can introduce a proper human-friendly `Display`
+  implementation in the future without that being a breaking change.)
+
+* Renamed the `Readable` trait to `IntoReader`.
+  The reader traits are now no longer re-exported at the crate-level
+  but have to be used from the `reader` module (which has been made public).
+
+* Renamed `PosTracker` to `PosTrackingReader`.
+
+And more ... for details please refer to the git log.
+
+#### Internal changes
+
+* `cargo test` now just works.  
+  (previously you had to supply `--features integration-tests` or `--all-features`)
+
+### 0.4.0 - 2021-12-05
+
+Started over by forking [html5gum] instead (which at the time was 0.2.1).
+The html5ever tokenizer code was littered with macros, which made it
+quite unreadable. The "giant unreadable match" (G.U.M) expression
+that Markus Unterwaditzer had implemented was much more readable.
+
+I made a PR to add support for code spans, but we came to a disagreement
+about commit practice. I didn't want my commits to be squashed.
+In hindsight my commits weren't that beautiful back then but I still
+think that I made the right call in preserving most of these changes
+individually in the git history (by forking html5gum).
+
+[html5gum]: https://crates.io/crates/html5gum
+
+## html5tokenizer forked from html5ever
+
+The git history before the switch to html5gum
+can be found in the [html5ever-fork] branch.
+
+[html5ever-fork]: https://git.push-f.com/html5tokenizer/log/?h=html5ever-fork
+
+### 0.3.0 - 2021-11-30
+
+Added some naive state switching based on
+start tag name and cleaned up the API a bit.
+
+### 0.2.0 - 2021-11-19
+
+Fixed that named entities weren't resolved
+(which again added a dependency on phf).
+
+### 0.1.0 - 2021-04-08
+
+I forked the tokenizer from [html5ever] and removed all
+of its dependencies (markup5ever, tendril, mac & log),
+which spared you 56 build dependencies.
+
+[html5ever]: https://crates.io/crates/html5ever
diff --git a/README.md b/README.md
index 54cf1bd..a1f8a82 100644
--- a/README.md
+++ b/README.md
@@ -39,6 +39,8 @@ assert_eq!(new_html, "<title>hello world</title>");
 * Code span support has been added.
 * The API has been revised.
 
+For details please refer to the [changelog].
+
 html5gum has since switched its parsing to operate on bytes,
 which html5tokenizer doesn't yet support.
 `html5tokenizer` **does not** implement [charset detection].
@@ -66,4 +68,5 @@ Licensed under the MIT license, see [the LICENSE file].
 [html5gum]: https://crates.io/crates/html5gum
 [html5lib tokenizer test suite]: https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
 [charset detection]: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
+[changelog]: ./CHANGELOG.md
 [the LICENSE file]: ./LICENSE
diff --git a/src/lib.rs b/src/lib.rs
index 1cfb7c9..151bb98 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -1,6 +1,7 @@
 #![warn(missing_docs)]
 // This is an HTML parser. HTML can be untrusted input from the internet.
 #![forbid(unsafe_code)]
+#![doc = concat!("[changelog]: ", file_url!("CHANGELOG.md"))]
 #![doc = concat!("[the LICENSE file]: ", file_url!("LICENSE"))]
 #![doc = include_str!("../README.md")]