html5tokenizer changelog
0.5.2 - 2023-09-28
This is quite a big release that prepares everything for the introduction of the tree builder in the next release. That release will, however, be published under a different crate name, since at that point the crate will have become a full-blown parser. (There will still be one more html5tokenizer release to point users to the new crate.)
Features
Added spans for character tokens.
Added offsets for end-of-file tokens.
Implemented Clone for Token (and all contained types) and Event.
Prettier Debug formatting for AttributeMap and Attribute.
Added a blanket implementation to implement Reader for boxed readers.
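As a rough illustration of what a blanket implementation for boxed readers buys you, here is a minimal, self-contained sketch. The `Reader` trait below is a hypothetical one-method stand-in (with an assumed `read_char` method), not the crate's actual trait; `StringReader` and `read_all` are likewise made up for the example.

```rust
// Hypothetical, simplified stand-in for the crate's `Reader` trait.
// The real trait has a different set of methods.
trait Reader {
    /// Returns the next character, or `None` at end of input.
    fn read_char(&mut self) -> Option<char>;
}

/// A toy reader over an in-memory string (assumption, for illustration only).
struct StringReader {
    chars: std::vec::IntoIter<char>,
}

impl StringReader {
    fn new(input: &str) -> Self {
        StringReader {
            chars: input.chars().collect::<Vec<_>>().into_iter(),
        }
    }
}

impl Reader for StringReader {
    fn read_char(&mut self) -> Option<char> {
        self.chars.next()
    }
}

// The blanket implementation: any boxed reader (including `Box<dyn Reader>`)
// is itself a reader, by forwarding to the boxed value.
impl<R: Reader + ?Sized> Reader for Box<R> {
    fn read_char(&mut self) -> Option<char> {
        (**self).read_char()
    }
}

/// Generic consumer that only knows about the `Reader` bound.
fn read_all<R: Reader>(mut reader: R) -> String {
    let mut out = String::new();
    while let Some(c) = reader.read_char() {
        out.push(c);
    }
    out
}

fn main() {
    // Without the blanket impl, `Box<dyn Reader>` would not satisfy `R: Reader`.
    let boxed: Box<dyn Reader> = Box::new(StringReader::new("<p>hi</p>"));
    println!("{}", read_all(boxed));
}
```

The `?Sized` bound is what lets the implementation cover trait objects such as `Box<dyn Reader>`, so the concrete reader type can be chosen at runtime.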
Fixes
Removed incorrect lowercasing of char tokens when an eof-in-tag error occurred in a </script> tag.
This changelog file is now included in the published .crate file.
Breaking changes
Byte offsets were moved out of the Token enum into a new Trace enum.
Token enum
- Removed the Error variant. (Errors now have to be queried separately with BasicEmitter::drain_errors or TracingEmitter::drain_errors.)
- Replaced the String variant with a new Char variant. (The tokenizer now emits chars instead of strings.)
- Added the EndOfFile variant.
The DefaultEmitter has been removed. There are now:
- the BasicEmitter which yields just Token
- the TracingEmitter which yields (Token, Trace)
Emitter trait
- Removed pop_token method and Token associated type. std::iter::Iterator is used instead now.
- Renamed emit_error to report_error.
- Replaced emit_string with emit_char.
- Added an offset parameter to emit_eof.
Removed CdataAction and changed handle_cdata_open to just take a boolean instead.
AttributeMap::get now just returns Option<&str> as you would expect. To obtain the value and attribute trace, you now have to use the newly added AttributeMap::value_and_trace_idx method.
Three variants of the State enum have been renamed according to the Rust API guidelines (RcData to Rcdata, RawText to Rawtext and PlainText to Plaintext).
Removed State::ScriptDataEscaped and State::ScriptDataDoubleEscaped variants.
NaiveParser: Removed new_with_spans.
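To give a feel for the new shape of the emitter API described above (tokens come out through std::iter::Iterator, errors are queried separately), here is a small self-contained sketch. It does not use the crate: `SketchEmitter`, the `Error` struct and all method signatures are assumptions made for illustration. Only the names `Char`, `EndOfFile`, `emit_char`, `emit_eof`, `report_error` and `drain_errors` are taken from the notes above.

```rust
use std::collections::VecDeque;

// Hypothetical, heavily simplified token type. The real `Token` enum has
// more variants (tags, comments, doctypes).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Char(char),
    EndOfFile,
}

// Hypothetical error record; the real error type differs.
#[derive(Debug, Clone, PartialEq)]
struct Error {
    code: &'static str,
    offset: usize,
}

// Mock emitter that queues tokens and errors separately.
#[derive(Default)]
struct SketchEmitter {
    tokens: VecDeque<Token>,
    errors: VecDeque<Error>,
}

impl SketchEmitter {
    fn emit_char(&mut self, c: char) {
        self.tokens.push_back(Token::Char(c));
    }

    // The offset is ignored here; a tracing emitter would record it.
    fn emit_eof(&mut self, _offset: usize) {
        self.tokens.push_back(Token::EndOfFile);
    }

    // Errors are no longer tokens; they go into their own queue.
    fn report_error(&mut self, code: &'static str, offset: usize) {
        self.errors.push_back(Error { code, offset });
    }

    // Errors are pulled out separately from the token stream.
    fn drain_errors(&mut self) -> impl Iterator<Item = Error> + '_ {
        self.errors.drain(..)
    }
}

// Tokens are obtained through the standard `Iterator` trait
// rather than through a dedicated `pop_token` method.
impl Iterator for SketchEmitter {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        self.tokens.pop_front()
    }
}

fn demo() -> (Vec<Token>, Vec<Error>) {
    let mut emitter = SketchEmitter::default();
    emitter.emit_char('a');
    emitter.report_error("unexpected-null-character", 1);
    emitter.emit_eof(2);
    let tokens: Vec<Token> = emitter.by_ref().collect();
    let errors: Vec<Error> = emitter.drain_errors().collect();
    (tokens, errors)
}

fn main() {
    let (tokens, errors) = demo();
    println!("{:?} {:?}", tokens, errors);
}
```

Keeping errors out of the token stream means consumers that only care about tokens can ignore `drain_errors` entirely.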
0.5.1 - 2023-09-03
Features
BufReadReader can now operate on any std::io::Read implementation and no longer requires the reader to implement std::io::BufRead.
The DefaultEmitter is now public again. (Since adjusted_current_node_present_and_not_in_html_namespace has been removed, the DefaultEmitter is now spec-compliant and can be exposed in good conscience.)
Added Doctype::name_span.
Breaking changes
Iterating over Tokenizer now yields values of a new Event enum. Event::CdataOpen signals that Tokenizer::handle_cdata_open has to be called.
Emitter trait
- Removed adjusted_current_node_present_and_not_in_html_namespace.
- emit_error and set_self_closing now take a span instead of an offset.
- Added a name_offset parameter to init_start_tag and init_end_tag.
- Several provided offsets have been changed to be more sensible. Affected are: init_start_tag, init_end_tag, emit_current_tag, emit_current_comment
Token types
- StartTag/EndTag: Added name_span fields (and removed the same-named methods).
- Comment: The data_offset field has been replaced with data_span.
- Doctype: The name field is now optional.
- AttributeOwned: The name_offset and value_offset fields have been replaced with name_span and value_span respectively.
Added required len_of_char_in_current_encoding method to Reader trait.
Added missing R: Position<O> bounds for Tokenizer/NaiveParser constructors. (If you are able to construct a Tokenizer/NaiveParser, you should be able to iterate over it.)
Fixes
Fixed BufReadReader skipping the whole line if it contained invalid UTF-8.
Fixed span logic assuming UTF-8 (the logic is now character encoding independent).
Fixed attribute value spans being wrong for values containing character references.
Fixed most error spans mistakenly being empty.
Fixed some error spans being off-by-one (eof-*, end-tag-with-trailing-solidus, missing-semicolon-after-character-reference).
Fixed most error spans about character references being too small.
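The following sketch shows why span logic must not assume UTF-8: the same character has a different byte length depending on the encoding, so offsets have to be advanced by the length in the encoding actually being read. It is a standalone illustration. The `Encoding` enum and the `len_of_char` and `char_span` functions are hypothetical and not part of the crate; the len_of_char_in_current_encoding method mentioned above may well look different.

```rust
use std::ops::Range;

// Hypothetical set of encodings, for illustration only.
#[derive(Clone, Copy)]
enum Encoding {
    Utf8,
    Utf16,
}

/// Number of bytes `c` occupies in the given encoding.
fn len_of_char(c: char, encoding: Encoding) -> usize {
    match encoding {
        Encoding::Utf8 => c.len_utf8(),
        // `len_utf16` counts 16-bit code units, so multiply by 2 for bytes.
        Encoding::Utf16 => c.len_utf16() * 2,
    }
}

/// Byte span of the first occurrence of `needle` in `text`,
/// measured in the given encoding.
fn char_span(text: &str, needle: char, encoding: Encoding) -> Option<Range<usize>> {
    let mut offset = 0;
    for c in text.chars() {
        let len = len_of_char(c, encoding);
        if c == needle {
            return Some(offset..offset + len);
        }
        offset += len;
    }
    None
}

fn main() {
    // '€' takes 3 bytes in UTF-8 but only 2 bytes in UTF-16,
    // while '>' takes 1 byte in UTF-8 and 2 bytes in UTF-16.
    println!("{:?}", char_span("€>", '>', Encoding::Utf8));
    println!("{:?}", char_span("€>", '>', Encoding::Utf16));
}
```

For the input `€>`, the span of `>` is `3..4` when measured in UTF-8 and `2..4` when measured in UTF-16, which is the kind of difference that hard-coded UTF-8 lengths get wrong.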
0.5.0 - 2023-08-19
Features
Added the NaiveParser API.
Added the AttributeMap type (with some convenient IntoIter and FromIterator implementations).
Added spans to comments and doctypes.
Added all-inclusive spans to tags.
The attribute value syntax is now recognizable.
Breaking changes
Many ... but the API is now much better.
Emitter trait
- is now generic over offset rather than reader
- methods now get the offset directly rather than having to get it from the reader (which was very awkward since the current reader position depends on machine implementation details)
- Removed set_last_start_tag. (was only used internally for testing the DefaultEmitter and should never have been part of the trait)
- Added Emitter::adjusted_current_node_present_and_not_in_html_namespace. (Previously the machine just hard-coded this.) Implementing this method based on tree construction is necessary for correct CDATA handling.
- current_is_appropriate_end_tag_token has been removed. So correct state transitions (aside from the CDATA caveat mentioned above) no longer depend on a correct implementation of this method.
- The methods for doctype identifiers have been renamed. set_ has become init_ and no longer gets passed a string (since it was only ever the empty string anyway). And the method names now end in _id instead of _identifier.
The DefaultEmitter has been made private since it now implements adjusted_current_node_present_and_not_in_html_namespace by always returning false, which results in all CDATA sections being tokenized as bogus comments.
Likewise StartTag::next_state has been removed, since having to manually match yielded tokens for start tags and remember to call that function is just too easy to forget.
Tokenizer::new now requires you to specify the emitter and Tokenizer::new_with_emitter has been removed since this change made it redundant.
Removed the Span trait in favor of just using Range<O>, where O is a type implementing the new Offset trait. Offset is currently implemented for usize and NoopOffset, which is a zero-sized no-op implementation.
The GetPos trait has been renamed to Position and made generic over an offset type. Position<NoopOffset> is implemented for every reader via a blanket implementation.
Removed Default implementations for StartTag and EndTag (since tags with empty tag names are invalid and Default implementations shouldn't create invalid values).
Removed Display implementation for Error which returned the kebab-case error code in favor of a new Error::code method. (So that we can introduce a proper human-friendly Display implementation in the future without that being a breaking change.)
Renamed the Readable trait to IntoReader. The reader traits are now no longer re-exported at the crate-level but have to be used from the reader module (which has been made public).
Renamed PosTracker to PosTrackingReader.
And more ... for details please refer to the git log.
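The idea behind being generic over an offset type, with usize for real byte offsets and a zero-sized no-op type when spans are not needed, can be sketched as follows. This is an illustration under assumptions: the `Offset` trait below with its `advance` method, and the `word_spans` helper, are invented for the example and are not the crate's actual definitions.

```rust
use std::ops::Range;

// Hypothetical sketch of an offset abstraction; the real `Offset` trait
// may have different bounds and methods.
trait Offset: Copy + Default {
    /// Returns the offset advanced by `len` bytes.
    fn advance(self, len: usize) -> Self;
}

impl Offset for usize {
    fn advance(self, len: usize) -> Self {
        self + len
    }
}

/// Zero-sized offset that tracks nothing.
#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct NoopOffset;

impl Offset for NoopOffset {
    fn advance(self, _len: usize) -> Self {
        NoopOffset
    }
}

/// Computes a span for each whitespace-separated word,
/// generic over the offset type.
fn word_spans<O: Offset>(text: &str) -> Vec<Range<O>> {
    let mut spans = Vec::new();
    let mut pos = O::default();
    let mut start: Option<O> = None;
    for c in text.chars() {
        if c.is_whitespace() {
            // A word just ended: close its span at the current position.
            if let Some(s) = start.take() {
                spans.push(s..pos);
            }
        } else if start.is_none() {
            start = Some(pos);
        }
        pos = pos.advance(c.len_utf8());
    }
    if let Some(s) = start {
        spans.push(s..pos);
    }
    spans
}

fn main() {
    // With `usize` the spans carry real byte offsets.
    println!("{:?}", word_spans::<usize>("a bc"));
    // With `NoopOffset` the same code runs but each span occupies no memory.
    println!("{}", std::mem::size_of::<Range<NoopOffset>>());
}
```

Because `Range<NoopOffset>` is zero-sized, code that does not care about positions pays nothing for the span fields, while the same generic code yields real `Range<usize>` spans when they are wanted.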
Internal changes
cargo test now just works. (Previously you had to supply --features integration-tests or --all-features.)
0.4.0 - 2021-12-05
Started over by forking html5gum instead (which at the time was 0.2.1). The html5ever tokenizer code was littered with macros, which made it quite unreadable. The "giant unreadable match" (G.U.M) expression that Markus Unterwaditzer had implemented was much more readable.
I made a PR to add support for code spans but we came to a disagreement about commit practice. I didn't want my commits to be squashed. In hindsight my commits weren't that beautiful back then but I still think that I made the right call in preserving most of these changes individually in the git history (by forking html5gum).
Features
- Added code spans to StartTag, EndTag and Error tokens and attributes.
html5tokenizer forked from html5ever
The git history before the switch to html5gum can be found in the html5ever-fork branch.
0.3.0 - 2021-11-30
Added some naive state switching based on start tag name and cleaned up the API a bit.
2021-11-24
Markus Unterwaditzer published the first version of html5gum.
0.2.0 - 2021-11-19
Fixed that named entities weren't resolved (which again added a dependency on phf).
0.1.0 - 2021-04-08
I forked the tokenizer from html5ever and removed all of its dependencies (markup5ever, tendril, mac & log), which spared you 56 build dependencies.