The HTML spec specifies that the tokenizer emits character tokens.
html5gum always emitted strings instead, probably just to make
token consumption more convenient. For tree construction, however,
character tokens are actually more convenient than string tokens,
since the spec defines that specific character tokens should be
ignored in specific states (and character tokens let us avoid
string manipulation for these conditions).
This should also make the DefaultEmitter more performant in cases
where you don't actually need the strings at all (or only a few),
since it avoids string allocations, though I haven't benchmarked it.
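
A minimal sketch of why per-character tokens suit tree construction: the spec's "before head" insertion mode ignores whitespace character tokens, and with character tokens that check is a single match on a `char`. The `Token` enum and both functions below are hypothetical illustrations, not html5gum's actual types.

```rust
/// Hypothetical token type for illustration; not html5gum's actual API.
#[derive(Debug, PartialEq)]
enum Token {
    Char(char),
    StartTag(String),
}

/// The whitespace characters the HTML spec tells tree construction to
/// ignore in insertion modes such as "before head":
/// tab, line feed, form feed, carriage return and space.
fn is_ignorable_whitespace(c: char) -> bool {
    matches!(c, '\t' | '\n' | '\x0C' | '\r' | ' ')
}

/// Drops the leading whitespace character tokens and keeps everything
/// from the first other token onwards. No string is sliced or copied
/// to make that decision.
fn skip_ignored(tokens: Vec<Token>) -> Vec<Token> {
    tokens
        .into_iter()
        .skip_while(|t| matches!(t, Token::Char(c) if is_ignorable_whitespace(*c)))
        .collect()
}
```

With string tokens the same rule would need the leading whitespace trimmed off a string, and possibly the string split in two, before the remainder could be processed.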
|
Implementing Emitter methods as no-ops works great with the NaiveParser
but less so when you want spec-compliant HTML parsing since that
requires tree construction and most Emitter methods to be implemented.
Ideally we'll implement both tree construction and a new way of avoiding
unnecessary allocations (without having to implement your own Emitter).
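
The no-op approach mentioned above can be sketched roughly as follows. `SketchEmitter`, its method names, and `TagCounter` are assumptions made for this illustration; the real Emitter trait has more methods and different signatures.

```rust
/// Hypothetical, simplified emitter trait. Every method defaults to a
/// no-op, so an implementor only overrides what it cares about.
trait SketchEmitter {
    fn emit_char(&mut self, _c: char) {}
    fn init_start_tag(&mut self) {}
    fn push_tag_name(&mut self, _s: &str) {}
}

/// Counts start tags and ignores everything else, so it never
/// allocates a string.
#[derive(Default)]
struct TagCounter {
    start_tags: usize,
}

impl SketchEmitter for TagCounter {
    fn init_start_tag(&mut self) {
        self.start_tags += 1;
    }
}

/// Stand-in for a tokenizer: feeds the events for `<p>hi` to the emitter.
fn drive<E: SketchEmitter>(emitter: &mut E) {
    emitter.init_start_tag();
    emitter.push_tag_name("p");
    emitter.emit_char('h');
    emitter.emit_char('i');
}
```

This is enough for a consumer that only inspects tokens, but tree construction needs tag names and attributes, which is why most methods can no longer stay no-ops there.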
|
I accidentally lost it in b125bec9914bd211d77719bd60bc5a23bd9db579.
(I should have changed the info string to ```rust ignore.)
|
Emitters should not have access to the reader at all. Also, the
current position of the reader at the time an Emitter method is
called very much depends on machine implementation details, such
as whether `Tokenizer::unread_char` is used. Having the Emitter
methods take offsets lets the machine take care of providing
the right offsets, as evidenced by the next commit.
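
A rough sketch of the idea, assuming a toy machine that splits input on spaces: the emitter never sees the reader and only receives the offsets the machine computed. `OffsetEmitter`, `SpanCollector` and `run` are hypothetical names for this illustration and do not reflect the actual method signatures.

```rust
/// Hypothetical emitter trait: methods receive offsets, not the reader.
trait OffsetEmitter {
    fn emit_word(&mut self, start: usize, end: usize);
}

/// Records the (start, end) byte offsets of every emitted word.
#[derive(Default)]
struct SpanCollector {
    spans: Vec<(usize, usize)>,
}

impl OffsetEmitter for SpanCollector {
    fn emit_word(&mut self, start: usize, end: usize) {
        self.spans.push((start, end));
    }
}

/// Toy machine that emits one span per space-separated word.
fn run<E: OffsetEmitter>(input: &str, emitter: &mut E) {
    let mut start = None;
    for (pos, c) in input.char_indices() {
        match (c == ' ', start) {
            (false, None) => start = Some(pos),
            (true, Some(s)) => {
                // The reader has already consumed the space at `pos`,
                // so its own position is one past the end of the word.
                // Only the machine knows the word ended at `pos`.
                emitter.emit_word(s, pos);
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        emitter.emit_word(s, input.len());
    }
}
```

If the emitter queried the reader's position instead, the reported end of each word would depend on how far the machine happened to read ahead.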
|
The Tokenizer does not perform any state switching, since
proper state switching requires a feedback loop between
tokenization and DOM tree building. Using the Tokenizer
directly is therefore a bit of a pitfall, since you might
not expect it to e.g. tokenize `<script><b>` as:

    StartTag(StartTag { name: "script", .. })
    StartTag(StartTag { name: "b", .. })

Since we don't want to make walking into pitfalls
particularly easy, this commit changes the Tokenizer::new
method so that you have to specify the Emitter.
Since this makes new_with_emitter redundant, it is removed.
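
The pitfall and the shape of the new constructor can be sketched with a toy tokenizer. Everything here (`ToyTokenizer`, `TagEmitter`, `NameCollector`) is a hypothetical stand-in written for this illustration, not the crate's actual API; the only point taken from the commit is that the constructor requires the caller to pass an emitter.

```rust
/// Hypothetical emitter trait for this sketch only.
trait TagEmitter {
    fn emit_start_tag(&mut self, name: &str);
}

/// Collects the names of all emitted start tags.
#[derive(Default)]
struct NameCollector {
    names: Vec<String>,
}

impl TagEmitter for NameCollector {
    fn emit_start_tag(&mut self, name: &str) {
        self.names.push(name.to_string());
    }
}

/// Toy tokenizer whose only constructor requires an emitter, so the
/// caller has to make that choice explicitly.
struct ToyTokenizer<'a, E: TagEmitter> {
    input: &'a str,
    emitter: E,
}

impl<'a, E: TagEmitter> ToyTokenizer<'a, E> {
    fn new(input: &'a str, emitter: E) -> Self {
        ToyTokenizer { input, emitter }
    }

    /// Treats every `<name>` as a start tag and never switches state.
    fn run(mut self) -> E {
        let mut rest = self.input;
        while let Some(open) = rest.find('<') {
            let after = &rest[open + 1..];
            match after.find('>') {
                Some(close) => {
                    self.emitter.emit_start_tag(&after[..close]);
                    rest = &after[close + 1..];
                }
                None => break,
            }
        }
        self.emitter
    }
}
```

Running `ToyTokenizer::new("<script><b>", NameCollector::default()).run()` collects the names `script` and `b`. A spec-compliant parser would instead switch the tokenizer to the script data state after `<script>`, so `<b>` would be treated as script text rather than a tag.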
|