html5tokenizer changelog

0.5.2 - 2023-09-28

This is a fairly large release that prepares for the introduction of the tree builder in the next release. That release will be published under a different crate name, since at that point the crate will have become a full-blown parser. (There will still be one more html5tokenizer release to point users to the new crate.)

Features

Added spans for character tokens.
Added offsets for end-of-file tokens.
Implemented Clone for Token (and all contained types) and Event.
Prettier Debug formatting for AttributeMap and Attribute.
Added a blanket implementation to implement Reader for boxed readers.
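
The blanket implementation for boxed readers follows a common Rust pattern, sketched below. The CharSource trait, VecSource and read_all are stand-ins invented for this illustration; the crate's real Reader trait has a different shape.

```rust
// Stand-in trait for illustration only; this is NOT the crate's `Reader` trait.
trait CharSource {
    /// Returns the next character, or `None` at the end of input.
    fn next_char(&mut self) -> Option<char>;
}

/// A simple source backed by an owned list of characters (hypothetical).
struct VecSource {
    chars: std::vec::IntoIter<char>,
}

impl VecSource {
    fn new(input: &str) -> Self {
        VecSource {
            chars: input.chars().collect::<Vec<_>>().into_iter(),
        }
    }
}

impl CharSource for VecSource {
    fn next_char(&mut self) -> Option<char> {
        self.chars.next()
    }
}

// The blanket implementation: any boxed source, including a trait object
// such as `Box<dyn CharSource>`, is itself a source. It simply delegates
// to the boxed value.
impl<S: CharSource + ?Sized> CharSource for Box<S> {
    fn next_char(&mut self) -> Option<char> {
        (**self).next_char()
    }
}

/// Generic consumer that only knows about the trait bound.
fn read_all<S: CharSource>(mut source: S) -> String {
    let mut out = String::new();
    while let Some(c) = source.next_char() {
        out.push(c);
    }
    out
}

fn main() {
    // Without the blanket implementation this call would not compile,
    // because `Box<dyn CharSource>` would not satisfy `S: CharSource`.
    let boxed: Box<dyn CharSource> = Box::new(VecSource::new("<p>"));
    println!("{}", read_all(boxed));
}
```

The `?Sized` bound is what lets the implementation cover trait objects, so generic code can accept a reader whose concrete type is only chosen at runtime.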

Fixes

Removed incorrect lowercasing of char tokens when an eof-in-tag error occurred in a </script> tag.
This changelog file is now included in the published .crate file.

Breaking changes

Byte offsets were moved out of the Token enum into a new Trace enum.
Token enum
- Removed the Error variant.
  (Errors now have to be queried separately with BasicEmitter::drain_errors or TracingEmitter::drain_errors.)
- Replaced the String variant with a new Char variant.
  (The tokenizer now emits chars instead of strings.)
- Added the EndOfFile variant.
The DefaultEmitter has been removed. There are now two emitters:
- the BasicEmitter which yields just Token
- the TracingEmitter which yields (Token, Trace)
Emitter trait
- Removed pop_token method and Token associated type. std::iter::Iterator is used instead now.
- Renamed emit_error to report_error.
- Replaced emit_string with emit_char.
- Added an offset parameter to emit_eof.
Removed CdataAction and changed handle_cdata_open to just take a boolean instead.
AttributeMap::get now just returns Option<&str> as you would expect. To obtain the value and attribute trace, you now have to use the newly added AttributeMap::value_and_trace_idx method.
Three variants of the State enum have been renamed according to the Rust API guidelines (RcData to Rcdata, RawText to Rawtext and PlainText to Plaintext).
Removed State::ScriptDataEscaped and State::ScriptDataDoubleEscaped variants.
NaiveParser: Removed new_with_spans.
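
To give a rough picture of how these changes fit together, here is a minimal self-contained sketch of an emitter that is a plain iterator over (token, trace) pairs and keeps errors out of the token stream. ToyToken, ToyTrace and ToyTracingEmitter are stand-ins invented for this illustration, and the method signatures are assumptions; only the method names emit_char, emit_eof, report_error and drain_errors are taken from the notes above.

```rust
use std::collections::VecDeque;
use std::ops::Range;

// Stand-in types for illustration only; the crate's real `Token` and `Trace`
// enums have more variants and different payloads.
#[derive(Debug, Clone, PartialEq)]
enum ToyToken {
    Char(char),
    EndOfFile,
}

#[derive(Debug, Clone, PartialEq)]
enum ToyTrace {
    Char(Range<usize>),
    EndOfFile(usize),
}

/// A toy emitter: queues (token, trace) pairs and collects errors separately.
#[derive(Default)]
struct ToyTracingEmitter {
    queue: VecDeque<(ToyToken, ToyTrace)>,
    errors: Vec<(&'static str, Range<usize>)>,
}

impl ToyTracingEmitter {
    fn emit_char(&mut self, c: char, span: Range<usize>) {
        self.queue.push_back((ToyToken::Char(c), ToyTrace::Char(span)));
    }

    fn emit_eof(&mut self, offset: usize) {
        self.queue
            .push_back((ToyToken::EndOfFile, ToyTrace::EndOfFile(offset)));
    }

    // Hypothetical signature: an error code plus the span it applies to.
    fn report_error(&mut self, code: &'static str, span: Range<usize>) {
        self.errors.push((code, span));
    }

    /// Errors are not tokens; the caller drains them separately.
    fn drain_errors(&mut self) -> impl Iterator<Item = (&'static str, Range<usize>)> + '_ {
        self.errors.drain(..)
    }
}

// Instead of a dedicated `pop_token` method, the emitter is an iterator.
impl Iterator for ToyTracingEmitter {
    type Item = (ToyToken, ToyTrace);

    fn next(&mut self) -> Option<Self::Item> {
        self.queue.pop_front()
    }
}

/// Feeds a tiny hand-written event sequence into the toy emitter.
fn emit_demo() -> ToyTracingEmitter {
    let mut emitter = ToyTracingEmitter::default();
    emitter.emit_char('a', 0..1);
    emitter.report_error("unexpected-null-character", 1..2);
    emitter.emit_eof(2);
    emitter
}

fn main() {
    let mut emitter = emit_demo();
    let errors: Vec<_> = emitter.drain_errors().collect();
    for (token, trace) in emitter.by_ref() {
        println!("{:?} at {:?}", token, trace);
    }
    println!("errors: {:?}", errors);
}
```

Keeping the position data in a separate value means that code which only cares about tokens can ignore it, while tooling that needs source positions gets them alongside each token.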

0.5.1 - 2023-09-03

Features

BufReadReader can now operate on any std::io::Read implementation and no longer requires the reader to implement std::io::BufRead.
The DefaultEmitter is now public again.
(Since adjusted_current_node_present_and_not_in_html_namespace has been removed, the DefaultEmitter is now spec-compliant and can be exposed in good conscience.)
Added Doctype::name_span.

Breaking changes

Iterating over Tokenizer now yields values of a new Event enum. Event::CdataOpen signals that Tokenizer::handle_cdata_open has to be called.
Emitter trait
- Removed adjusted_current_node_present_and_not_in_html_namespace.
- emit_error and set_self_closing now take a span instead of an offset.
- Added a name_offset parameter to init_start_tag and init_end_tag.
- Several provided offsets have been changed to be more sensible.
  Affected are: init_start_tag, init_end_tag, emit_current_tag, emit_current_comment
token types
- StartTag/EndTag: Added name_span fields (and removed the same-named methods).
- Comment: The data_offset field has been replaced with data_span.
- Doctype: The name field is now optional.
- AttributeOwned: The name_offset and value_offset fields have been replaced with name_span and value_span respectively.
Added required len_of_char_in_current_encoding method to Reader trait.
Added missing R: Position<O> bounds for Tokenizer/NaiveParser constructors.
(If you are able to construct a Tokenizer/NaiveParser, you should be able to iterate over it.)

Fixes

Fixed BufReadReader skipping the whole line if it contained invalid UTF-8.
Fixed span logic assuming UTF-8 (the logic is now character encoding independent).
Fixed attribute value spans being wrong for values containing character references.
Fixed most error spans mistakenly being empty.
Fixed some error spans being off-by-one
(eof-*, end-tag-with-trailing-solidus, missing-semicolon-after-character-reference).
Fixed most error spans about character references being too small.
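
The encoding-related changes in this release rest on the fact that the number of bytes a character occupies depends on the character encoding. The following sketch shows this using only the standard library. The Encoding enum and the functions len_of_char and span_of_nth_char are hypothetical helpers for this illustration, not part of the crate.

```rust
use std::ops::Range;

/// Hypothetical encoding selector, used only for this illustration.
#[derive(Clone, Copy)]
enum Encoding {
    Utf8,
    Utf16,
}

/// Number of bytes `c` occupies in the given encoding.
fn len_of_char(c: char, encoding: Encoding) -> usize {
    match encoding {
        Encoding::Utf8 => c.len_utf8(),
        // `len_utf16` counts 16-bit code units, so multiply by two for bytes.
        Encoding::Utf16 => c.len_utf16() * 2,
    }
}

/// Byte span of the `n`-th character of `text` in the given encoding,
/// or `None` if `text` has fewer than `n + 1` characters.
fn span_of_nth_char(text: &str, n: usize, encoding: Encoding) -> Option<Range<usize>> {
    let mut offset = 0;
    for (i, c) in text.chars().enumerate() {
        let len = len_of_char(c, encoding);
        if i == n {
            return Some(offset..offset + len);
        }
        offset += len;
    }
    None
}

fn main() {
    // 'a', U+00E4 (a with diaeresis), U+1F600 (an emoji), 'b'
    let text = "a\u{e4}\u{1F600}b";
    for encoding in [Encoding::Utf8, Encoding::Utf16] {
        // The same character ends up at different byte spans per encoding.
        println!("{:?}", span_of_nth_char(text, 3, encoding));
    }
}
```

Span logic that hard-codes the UTF-8 length therefore produces wrong offsets as soon as the input uses another encoding, which is why the length has to come from the reader.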

0.5.0 - 2023-08-19

Features

Added the NaiveParser API.
Added the AttributeMap type (with some convenient IntoIter and FromIterator implementations).
Added spans to comments and doctypes.
Added all-inclusive spans to tags.
The attribute value syntax is now recognizable.

Breaking changes

Many ... but the API is now much better.

Emitter trait
- is now generic over offset rather than reader
- methods now get the offset directly rather than having to get it from the reader (which was very awkward since the current reader position depends on machine implementation details)
- Removed set_last_start_tag.
  (was only used internally for testing the DefaultEmitter and should never have been part of the trait)
- Added Emitter::adjusted_current_node_present_and_not_in_html_namespace. (Previously the machine just hard-coded this.) Implementing this method based on tree construction is necessary for correct CDATA handling.
- current_is_appropriate_end_tag_token has been removed, so correct state transitions (aside from the CDATA caveat mentioned above) no longer depend on a correct implementation of this method.
- The methods for doctype identifiers have been renamed: set_ has become init_, and the methods no longer get passed a string (since it was only ever the empty string anyway). The method names now also end in _id instead of _identifier.
The DefaultEmitter has been made private since it now implements adjusted_current_node_present_and_not_in_html_namespace by always returning false, which results in all CDATA sections being tokenized as bogus comments.
Likewise, StartTag::next_state has been removed, since having to manually match the yielded tokens for start tags and then remember to call that function is too easy to forget.
Tokenizer::new now requires you to specify the emitter and Tokenizer::new_with_emitter has been removed since this change made it redundant.
Removed the Span trait in favor of just using Range<O>, where O is a type implementing the new Offset trait. Offset is currently implemented for usize and NoopOffset, which is a zero-sized no-op implementation.
The GetPos trait has been renamed to Position and made generic over an offset type. Position<NoopOffset> is implemented for every reader via a blanket implementation.
Removed Default implementations for StartTag and EndTag (since tags with empty tag names are invalid and Default implementations shouldn't create invalid values).
Removed the Display implementation for Error, which returned the kebab-case error code, in favor of a new Error::code method. (So that we can introduce a proper human-friendly Display implementation in the future without that being a breaking change.)
Renamed the Readable trait to IntoReader. The reader traits are now no longer re-exported at the crate-level but have to be used from the reader module (which has been made public).
Renamed PosTracker to PosTrackingReader.

And more ... for details please refer to the git log.
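
As a rough illustration of the span change listed above, the sketch below shows how code can be generic over an offset type and use Range<O> for spans, with a zero-sized offset making position tracking free. ToyOffset and ToyNoopOffset are stand-ins invented for this example, and the advance method is purely hypothetical, since this changelog does not list the methods of the real Offset trait.

```rust
use std::mem::size_of;
use std::ops::Range;

// Stand-in for the crate's `Offset` trait; the `advance` method is an
// assumption made for this illustration.
trait ToyOffset: Copy + Default {
    /// Returns the offset moved forward by `len` units.
    fn advance(self, len: usize) -> Self;
}

impl ToyOffset for usize {
    fn advance(self, len: usize) -> Self {
        self + len
    }
}

/// Zero-sized offset that tracks nothing (stand-in for a no-op offset).
#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct ToyNoopOffset;

impl ToyOffset for ToyNoopOffset {
    fn advance(self, _len: usize) -> Self {
        self
    }
}

/// Computes the span of a name of `len` units starting at `start`.
/// The same code works for real offsets and for the no-op offset.
fn name_span<O: ToyOffset>(start: O, len: usize) -> Range<O> {
    start..start.advance(len)
}

fn main() {
    println!("{:?}", name_span(1usize, 3));
    println!("{:?}", name_span(ToyNoopOffset, 3));
    // A range of zero-sized offsets occupies no memory at all.
    println!("{} bytes", size_of::<Range<ToyNoopOffset>>());
}
```

With this design, users who do not need spans pay nothing for them, while everyone else gets ordinary Range values that work with slicing and the rest of the standard library.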

Internal changes

cargo test now just works.
(previously you had to supply --features integration-tests or --all-features)

0.4.0 - 2021-12-05

Started over by forking html5gum instead (which at the time was 0.2.1). The html5ever tokenizer code was littered with macros, which made it quite unreadable. The "giant unreadable match" (G.U.M) expression that Markus Unterwaditzer had implemented was much more readable.

I made a PR to add support for code spans but we came to a disagreement about commit practice. I didn't want my commits to be squashed. In hindsight my commits weren't that beautiful back then but I still think that I made the right call in preserving most of these changes individually in the git history (by forking html5gum).

Features

Added code spans to StartTag, EndTag and Error tokens and attributes.

html5tokenizer forked from html5ever

The git history before the switch to html5gum can be found in the html5ever-fork branch.

0.3.0 - 2021-11-30

Added some naive state switching based on start tag name and cleaned up the API a bit.

2021-11-24

Markus Unterwaditzer published the first version of html5gum.

0.2.0 - 2021-11-19

Fixed named entities not being resolved (which again added a dependency on phf).

0.1.0 - 2021-04-08

I forked the tokenizer from html5ever and removed all of its dependencies (markup5ever, tendril, mac & log), which spared you 56 build dependencies.