# html5tokenizer changelog

### [unreleased]

#### Features

* Added spans for character tokens.

* Added offsets for end-of-file tokens.

* Implemented `Clone` for `Token` (and all contained types) and `Event`.

* Prettier `Debug` formatting for `AttributeMap` and `Attribute`.

* Added a blanket implementation to implement `Reader` for boxed readers.
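
A blanket implementation for boxed readers means that a `Box<dyn Reader>` can be passed wherever a generic `R: Reader` is expected. The following is a minimal sketch of that pattern, not the crate's actual code: the one-method `Reader` trait, `StringReader` and `read_all` are simplified stand-ins invented for illustration.

```rust
/// Simplified stand-in for a reader trait (assumption: a single method).
pub trait Reader {
    /// Returns the next character, or `None` at the end of the input.
    fn read_char(&mut self) -> Option<char>;
}

/// Toy reader over the characters of a string (hypothetical helper).
pub struct StringReader {
    chars: std::vec::IntoIter<char>,
}

impl StringReader {
    pub fn new(input: &str) -> Self {
        StringReader {
            chars: input.chars().collect::<Vec<_>>().into_iter(),
        }
    }
}

impl Reader for StringReader {
    fn read_char(&mut self) -> Option<char> {
        self.chars.next()
    }
}

/// The blanket implementation: every boxed reader, including the
/// unsized `Box<dyn Reader>`, is itself a reader.
impl<R: Reader + ?Sized> Reader for Box<R> {
    fn read_char(&mut self) -> Option<char> {
        // Forward the call to the reader inside the box.
        (**self).read_char()
    }
}

/// Generic code that only knows about `R: Reader`.
pub fn read_all<R: Reader>(mut reader: R) -> String {
    let mut out = String::new();
    while let Some(c) = reader.read_char() {
        out.push(c);
    }
    out
}
```

With this in place, `read_all(Box::new(StringReader::new("<p>")) as Box<dyn Reader>)` compiles, so the concrete reader type can be chosen at runtime.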

#### Breaking changes

* Byte offsets were moved out of the `Token` enum into a new `Trace` enum.

* `Token` enum

  * Removed the `Error` variant.  
    (Errors now have to be queried separately with
    `BasicEmitter::drain_errors` or `TracingEmitter::drain_errors`.)

  * Replaced the `String` variant with a new `Char` variant.  
    (The tokenizer now emits chars instead of strings.)

  * Added the `EndOfFile` variant.

* The `DefaultEmitter` has been removed. There are now two emitters:

  * the `BasicEmitter` which yields just `Token`
  * the `TracingEmitter` which yields `(Token, Trace)`

* `Emitter` trait

  * Removed `pop_token` method and `Token` associated type.
    `std::iter::Iterator` is used instead now.

  * Renamed `emit_error` to `report_error`.

  * Replaced `emit_string` with `emit_char`.

  * Added an offset parameter to `emit_eof`.

* Removed `CdataAction` and changed `handle_cdata_open` to just take a boolean instead.

* `AttributeMap::get` now just returns `Option<&str>` as you would expect.
   To obtain the value and attribute trace, you now have to use the
   newly added `AttributeMap::value_and_trace_idx` method.

* Three variants of the `State` enum have been renamed according to the Rust API guidelines
  (`RcData` to `Rcdata`, `RawText` to `Rawtext` and `PlainText` to `Plaintext`).

* Removed `State::ScriptDataEscaped` and `State::ScriptDataDoubleEscaped` variants.

* `NaiveParser`: Removed `new_with_spans`.
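
To illustrate the shape of the new emitter API (tokens pulled through `Iterator`, errors drained separately), here is a self-contained sketch. `SketchEmitter` and the two-variant `Token` are hypothetical stand-ins, and the method signatures are simplified guesses that leave out offsets and spans. The method names are taken from the entries above.

```rust
use std::collections::VecDeque;

/// Heavily reduced token type (assumption: the real `Token` has more variants).
#[derive(Clone, Debug, PartialEq)]
pub enum Token {
    Char(char),
    EndOfFile,
}

/// Hypothetical emitter: tokens are yielded via `Iterator`,
/// errors are collected on the side instead of being tokens.
#[derive(Default)]
pub struct SketchEmitter {
    tokens: VecDeque<Token>,
    errors: VecDeque<String>,
}

impl SketchEmitter {
    pub fn emit_char(&mut self, c: char) {
        self.tokens.push_back(Token::Char(c));
    }

    pub fn emit_eof(&mut self) {
        self.tokens.push_back(Token::EndOfFile);
    }

    pub fn report_error(&mut self, code: &str) {
        self.errors.push_back(code.to_string());
    }

    /// Removes and returns all errors reported so far.
    pub fn drain_errors(&mut self) -> Vec<String> {
        self.errors.drain(..).collect()
    }
}

/// Replaces a dedicated `pop_token` method: the emitter is an iterator.
impl Iterator for SketchEmitter {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        self.tokens.pop_front()
    }
}
```

Because errors no longer travel in the token stream, a consumer that ignores them can simply iterate, while a stricter consumer calls `drain_errors` after each token or at the end.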

### 0.5.1 - 2023-09-03

#### Features

* `BufReadReader` can now operate on any `std::io::Read` implementation
  and no longer requires the reader to implement `std::io::BufRead`.

* The `DefaultEmitter` is now public again.  
  (Since `adjusted_current_node_present_and_not_in_html_namespace` has been removed,
  the DefaultEmitter is now spec-compliant and can be exposed in good conscience.)

* Added `Doctype::name_span`.

#### Breaking changes

* Iterating over `Tokenizer` now yields values of a new `Event` enum.
  `Event::CdataOpen` signals that `Tokenizer::handle_cdata_open` has to be called.

* `Emitter` trait

  * Removed `adjusted_current_node_present_and_not_in_html_namespace`.

  * `emit_error` and `set_self_closing` now take a span instead of an offset.

  * Added a `name_offset` parameter to `init_start_tag` and `init_end_tag`.

  * Several provided offsets have been changed to be more sensible.  
    Affected are: `init_start_tag`, `init_end_tag`, `emit_current_tag`
    and `emit_current_comment`.

* token types

  * `StartTag`/`EndTag`: Added `name_span` fields
    (and removed the same-named methods).

  * `Comment`: The `data_offset` field has been replaced with `data_span`.

  * `Doctype`: The `name` field is now optional.

  * `AttributeOwned`: The `name_offset` and `value_offset` fields have
    been replaced with `name_span` and `value_span` respectively.

* Added required `len_of_char_in_current_encoding` method to `Reader` trait.

* Added missing `R: Position<O>` bounds for `Tokenizer`/`NaiveParser` constructors.  
  (If you are able to construct a Tokenizer/NaiveParser,
  you should be able to iterate over it.)

#### Fixes

* Fixed `BufReadReader` skipping the whole line if it contained invalid UTF-8.

* Fixed span logic assuming UTF-8 (the logic is now character encoding independent).

* Fixed attribute value spans being wrong for values containing character references.

* Fixed most error spans mistakenly being empty.

* Fixed some error spans being off-by-one  
  (`eof-*`, `end-tag-with-trailing-solidus`,
   `missing-semicolon-after-character-reference`).

* Fixed most error spans about character references being too small.
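
Why span logic needs to know the length of a character in the current encoding can be seen in a small sketch. The `Encoding` enum and both functions below are hypothetical and only illustrate the idea behind `len_of_char_in_current_encoding`. They are not the crate's implementation.

```rust
use std::ops::Range;

/// Encodings this sketch knows about (hypothetical, for illustration only).
#[derive(Clone, Copy)]
pub enum Encoding {
    Utf8,
    Utf16,
}

/// Number of bytes `c` occupies in the given encoding.
pub fn len_of_char(c: char, encoding: Encoding) -> usize {
    match encoding {
        Encoding::Utf8 => c.len_utf8(),
        // `len_utf16` counts 16-bit code units, so double it to get bytes.
        Encoding::Utf16 => c.len_utf16() * 2,
    }
}

/// Byte span of a character, given the offset just past its end.
pub fn span_ending_at(end: usize, c: char, encoding: Encoding) -> Range<usize> {
    end - len_of_char(c, encoding)..end
}
```

Hard-coding the UTF-8 length would give a three-byte span for `'€'` even when the input is UTF-16, where the character occupies two bytes.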

### 0.5.0 - 2023-08-19

#### Features

* Added the `NaiveParser` API.

* Added the `AttributeMap` type
  (with some convenient `IntoIter` and `FromIterator` implementations).

* Added spans to comments and doctypes.

* Added all-inclusive spans to tags.

* The attribute value syntax is now recognizable.

#### Breaking changes

Many ... but the API is now much better.

* `Emitter` trait

  * is now generic over offset rather than reader

  * methods now get the offset directly rather than having to get it
    from the reader (which was very awkward since the current reader
    position depends on machine implementation details)

  * Removed `set_last_start_tag`.  
    (was only used internally for testing the DefaultEmitter
    and should never have been part of the trait)

  * Added `Emitter::adjusted_current_node_present_and_not_in_html_namespace`.
    (Previously the machine just hard-coded this.) Implementing this method
    based on tree construction is necessary for correct CDATA handling.

  * `current_is_appropriate_end_tag_token` has been removed.
    So correct state transitions (aside from the CDATA caveat mentioned
    above) no longer depend on a correct implementation of this method.

  * The methods for doctype identifiers have been renamed.
    `set_` has become `init_` and no longer gets passed a string
    (since it was only ever the empty string anyway).
    And the method names now end in `_id` instead of `_identifier`.

* The `DefaultEmitter` has been made private since it now implements
  `adjusted_current_node_present_and_not_in_html_namespace` by always returning
  false, which results in all CDATA sections being tokenized as bogus comments.

* Likewise, `StartTag::next_state` has been removed since having to
  manually match yielded tokens for start tags and remember
  to call that function is just too easy to forget.

* `Tokenizer::new` now requires you to specify the emitter and
  `Tokenizer::new_with_emitter` has been removed since this change
  made it redundant.

* Removed the `Span` trait in favor of just using `Range<O>`,
  where `O` is a type implementing the new `Offset` trait.
  `Offset` is currently implemented for `usize` and
  `NoopOffset`, which is a zero-sized no-op implementation.

* The `GetPos` trait has been renamed to `Position` and made
  generic over an offset type. `Position<NoopOffset>` is implemented
  for every reader via a blanket implementation.

* Removed `Default` implementations for `StartTag` and `EndTag`
  (since tags with empty tag names are invalid and Default
  implementations shouldn't create invalid values).

* Removed the `Display` implementation for `Error`, which returned
  the kebab-case error code, in favor of a new `Error::code` method.
  (So that we can introduce a proper human-friendly `Display`
  implementation in the future without that being a breaking change.)

* Renamed the `Readable` trait to `IntoReader`.
  The reader traits are now no longer re-exported at the crate-level
  but have to be used from the `reader` module (which has been made public).

* Renamed `PosTracker` to `PosTrackingReader`.

And more ... for details please refer to the git log.
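
The idea of spans as `Range<O>` with a zero-sized no-op offset can be sketched as follows. The trait bounds and the `span_from` helper are assumptions made for this example. The crate's real `Offset` trait may require different operations.

```rust
use std::fmt::Debug;
use std::ops::{Add, Range};

/// Reduced offset trait (assumption: only advancing by a length is needed).
pub trait Offset: Copy + Default + PartialEq + Debug + Add<usize, Output = Self> {}

/// Real byte offsets.
impl Offset for usize {}

/// Zero-sized offset that ignores all arithmetic.
#[derive(Clone, Copy, Default, PartialEq, Debug)]
pub struct NoopOffset;

impl Add<usize> for NoopOffset {
    type Output = Self;

    fn add(self, _rhs: usize) -> Self {
        self
    }
}

impl Offset for NoopOffset {}

/// Span covering `len` units starting at `start` (hypothetical helper).
pub fn span_from<O: Offset>(start: O, len: usize) -> Range<O> {
    start..start + len
}
```

Code written against `O: Offset` tracks positions when instantiated with `usize` and carries no position data at all with `NoopOffset`, since a `Range<NoopOffset>` has size zero.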

#### Internal changes

* `cargo test` now just works.  
  (previously you had to supply `--features integration-tests` or `--all-features`)

### 0.4.0 - 2021-12-05

Started over by forking [html5gum] instead (which at the time was 0.2.1).
The html5ever tokenizer code was littered with macros, which made it
quite unreadable. The "giant unreadable match" (G.U.M) expression
that Markus Unterwaditzer had implemented was much more readable.

I made a PR to add support for code spans but we came to a disagreement
about commit practice. I didn't want my commits to be squashed.
In hindsight my commits weren't that beautiful back then but I still
think that I made the right call in preserving most of these changes
individually in the git history (by forking html5gum).

[html5gum]: https://crates.io/crates/html5gum

## html5tokenizer forked from html5ever

The git history before the switch to html5gum
can be found in the [html5ever-fork] branch.

[html5ever-fork]: https://git.push-f.com/html5tokenizer/log/?h=html5ever-fork

### 0.3.0 - 2021-11-30

Added some naive state switching based on
start tag name and cleaned up the API a bit.

### 0.2.0 - 2021-11-19

Fixed named entities not being resolved
(which again added a dependency on phf).

### 0.1.0 - 2021-04-08

I forked the tokenizer from [html5ever] and removed all
of its dependencies (markup5ever, tendril, mac & log),
which spared you 56 build dependencies.

[html5ever]: https://crates.io/crates/html5ever