# html5tokenizer changelog

### [unreleased]

#### Features

* `BufReadReader` can now operate on any `std::io::Read` implementation
  and no longer requires the reader to implement `std::io::BufRead`.
* The `DefaultEmitter` is now public again.
  (Since `adjusted_current_node_present_and_not_in_html_namespace` has been removed,
  the DefaultEmitter is now spec-compliant and can be exposed in good conscience.)

#### Breaking changes

* Iterating over `Tokenizer` now yields values of a new `Event` enum.
  `Event::CdataOpen` signals that `Tokenizer::handle_cdata_open` has to be called.
* `Emitter` trait
  * Removed `adjusted_current_node_present_and_not_in_html_namespace`.
  * `emit_error` now takes a span instead of an offset.
* token types
  * `AttributeOwned`: The `value_offset` field has been replaced with `value_span`.
* Added missing `R: Position` bounds for the `Tokenizer`/`NaiveParser` constructors.
  (If you are able to construct a Tokenizer/NaiveParser, you should be able to iterate over it.)

#### Fixes

* Fixed `BufReadReader` skipping the whole line if it contained invalid UTF-8.
* Fixed attribute value spans being wrong for values containing character references.
* Fixed most error spans mistakenly being empty.
* Fixed some error spans being off-by-one (`eof-*`).

### 0.5.0 - 2023-08-19

#### Features

* Added the `NaiveParser` API.
* Added the `AttributeMap` type
  (with some convenient `IntoIter` and `FromIterator` implementations).
* Added spans to comments and doctypes.
* Added all-inclusive spans to tags.
* The attribute value syntax is now recognizable.

#### Breaking changes

Many ... but the API is now much better.

* `Emitter` trait
  * is now generic over the offset type rather than the reader
  * methods now get the offset directly rather than having to get it from the reader
    (which was very awkward since the current reader position depends on machine implementation details)
  * Removed `set_last_start_tag`.
    (It was only used internally for testing the DefaultEmitter and should never have been part of the trait.)
  * Added `Emitter::adjusted_current_node_present_and_not_in_html_namespace`.
    (Previously the machine just hard-coded this.)
    Implementing this method based on tree construction is necessary for correct CDATA handling.
  * Removed `current_is_appropriate_end_tag_token`, so correct state transitions
    (aside from the CDATA caveat mentioned above) no longer depend on a correct
    implementation of this method.
  * The methods for doctype identifiers have been renamed.
    `set_` has become `init_` and no longer gets passed a string
    (since it was only ever the empty string anyway).
    And the method names now end in `_id` instead of `_identifier`.
* The `DefaultEmitter` has been made private since it now implements
  `adjusted_current_node_present_and_not_in_html_namespace` by always returning false,
  which results in all CDATA sections being tokenized as bogus comments.
* Likewise `StartTag::next_state` has been removed, since having to manually match
  yielded tokens for start tags and remember to call that function is just too easy to forget.
* `Tokenizer::new` now requires you to specify the emitter, and
  `Tokenizer::new_with_emitter` has been removed since this change made it redundant.
* Removed the `Span` trait in favor of just using `Range<O>`,
  where `O` is a type implementing the new `Offset` trait.
  `Offset` is currently implemented for `usize` and for `NoopOffset`,
  which is a zero-sized no-op implementation.
* The `GetPos` trait has been renamed to `Position` and made generic over an offset type.
  `Position` is implemented for every reader via a blanket implementation.
* Removed the `Default` implementations for `StartTag` and `EndTag`
  (since tags with empty tag names are invalid and Default implementations
  shouldn't create invalid values).
* Removed the `Display` implementation for `Error`, which returned the kebab-case
  error code, in favor of a new `Error::code` method.
  (So that we can introduce a proper human-friendly `Display` implementation
  in the future without that being a breaking change.)
* Renamed the `Readable` trait to `IntoReader`.
  The reader traits are no longer re-exported at the crate level but have to be
  used from the `reader` module (which has been made public).
* Renamed `PosTracker` to `PosTrackingReader`.

And more ... for details please refer to the git log.

#### Internal changes

* `cargo test` now just works.
  (Previously you had to supply `--features integration-tests` or `--all-features`.)

### 0.4.0 - 2021-12-05

Started over by forking [html5gum] instead (which at the time was at version 0.2.1).
The html5ever tokenizer code was littered with macros, which made it quite unreadable.
The "giant unreadable match" (G.U.M) expression that Markus Unterwaditzer had
implemented was much more readable.

I made a PR to add support for code spans, but we came to a disagreement about
commit practice: I didn't want my commits to be squashed.
In hindsight my commits weren't that beautiful back then, but I still think
I made the right call in preserving most of these changes individually
in the git history (by forking html5gum).

[html5gum]: https://crates.io/crates/html5gum

## html5tokenizer forked from html5ever

The git history before the switch to html5gum can be found in the
[html5ever-fork] branch.

[html5ever-fork]: https://git.push-f.com/html5tokenizer/log/?h=html5ever-fork

### 0.3.0 - 2021-11-30

Added some naive state switching based on the start tag name and cleaned up the API a bit.

### 0.2.0 - 2021-11-19

Fixed that named entities weren't resolved (which again added a dependency on phf).

### 0.1.0 - 2021-04-08

I forked the tokenizer from [html5ever] and removed all of its dependencies
(markup5ever, tendril, mac & log), which spared you 56 build dependencies.

[html5ever]: https://crates.io/crates/html5ever