Age, Commit message, Author
2023-08-19perf: only store start offsets for attribute spansMartin Fischer
This saves two usizes per AttrInternal<Range<usize>>. On a 64-bit target, where a usize is 8 bytes, that is 16 bytes of memory per attribute (if spans are enabled; for Token<()> this obviously doesn't change anything). The DefaultEmitter also no longer has to update the spans on each Emitter::push_attribute_(name|value) call. The spans are now calculated on demand by the Attribute methods, which is fine since the assumption is that API users are only interested in a few specific spans (rather than all spans).
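A minimal sketch of the idea, not the crate's actual code: the struct names, the fields and the `saved_bytes` helper are hypothetical. It assumes the attribute value appears verbatim in the source (no character references), so a span can be derived from a start offset plus a byte length.

```rust
use std::mem::size_of;
use std::ops::Range;

/// Hypothetical "before" layout: both spans are stored in full.
#[allow(dead_code)]
struct AttrWithSpans {
    value: String,
    name_span: Range<usize>,
    value_span: Range<usize>,
}

/// Hypothetical "after" layout: only the start offsets are stored.
struct AttrWithOffsets {
    value: String,
    name_offset: usize,
    value_offset: usize,
}

impl AttrWithOffsets {
    /// Derives the name span from the start offset and the name's byte length.
    fn name_span(&self, name: &str) -> Range<usize> {
        self.name_offset..self.name_offset + name.len()
    }

    /// Assumption: the value is stored exactly as it appears in the source,
    /// so its byte length equals the length of the span.
    fn value_span(&self) -> Range<usize> {
        self.value_offset..self.value_offset + self.value.len()
    }
}

/// Bytes saved per attribute by dropping the two end offsets.
fn saved_bytes() -> usize {
    size_of::<AttrWithSpans>() - size_of::<AttrWithOffsets>()
}

fn main() {
    // Offsets for the attribute in `<a href="x">`.
    let attr = AttrWithOffsets { value: "x".to_string(), name_offset: 3, value_offset: 9 };
    println!("name span: {:?}", attr.name_span("href"));
    println!("value span: {:?}", attr.value_span());
    println!("saved bytes: {}", saved_bytes());
}
```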
2023-08-19chore: clarify variable namesMartin Fischer
2023-08-19feat: impl IntoIterator for AttributeMapMartin Fischer
Making this change made me realize that adding an `impl IntoIterator for T` can be a breaking change if `impl IntoIterator for &T` already exists. See also the cargo-semver-checks issue[1] I filed about that. [1]: https://github.com/obi1kenobi/cargo-semver-checks/issues/518
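To illustrate why this can break callers, here is a sketch with a hypothetical `Bag` type standing in for `T`. With only the `&Bag` impl, `bag.into_iter()` auto-references and yields `&String`; once the by-value impl exists, the same call resolves to it, consumes `bag` and yields `String`.

```rust
/// Hypothetical collection type, standing in for any `T`.
struct Bag(Vec<String>);

// The impl that already existed: iterate by reference.
impl<'a> IntoIterator for &'a Bag {
    type Item = &'a String;
    type IntoIter = std::slice::Iter<'a, String>;

    fn into_iter(self) -> Self::IntoIter {
        self.0.iter()
    }
}

// The newly added impl: iterate by value.
impl IntoIterator for Bag {
    type Item = String;
    type IntoIter = std::vec::IntoIter<String>;

    fn into_iter(self) -> Self::IntoIter {
        self.0.into_iter()
    }
}

/// On an owned `Bag`, method resolution tries the by-value receiver first,
/// so this now picks `impl IntoIterator for Bag` and moves `bag`.
fn first_by_value(bag: Bag) -> Option<String> {
    bag.into_iter().next()
}

/// On a `&Bag`, the receiver type is `&Bag`, so the by-reference impl is used.
fn first_by_ref(bag: &Bag) -> Option<&String> {
    bag.into_iter().next()
}

fn main() {
    let bag = Bag(vec!["a".to_string(), "b".to_string()]);
    println!("{:?}", first_by_ref(&bag));
    println!("{:?}", first_by_value(bag));
}
```

Code that called `bag.into_iter()` and then kept using `bag`, or that relied on the items being references, stops compiling once the second impl is added.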
2023-08-19break!: introduce AttributeMapMartin Fischer
This has a number of benefits:
* it hides the implementation of the map
* it hides the type used for the map values (which lets us e.g. change name_span to name_offset while still being able to provide a convenient `Attribute::name_span` method.)
* it lets us provide convenience impls for the map such as `FromIterator<(String, String)>`
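The benefits listed above can be sketched roughly as follows. This is an illustration under assumptions, not the crate's implementation: the backing `BTreeMap`, the fields of `AttrInternal`, and the `get` and `value` methods are hypothetical; only the names `AttributeMap`, `Attribute`, `AttrInternal`, `name_span` and the `FromIterator<(String, String)>` impl come from the commit messages.

```rust
use std::collections::BTreeMap;
use std::iter::FromIterator;
use std::ops::Range;

/// Private value type (fields are assumed): callers never see it.
struct AttrInternal {
    value: String,
    name_offset: usize,
}

/// Wrapper type; the backing map is an implementation detail.
#[derive(Default)]
pub struct AttributeMap {
    inner: BTreeMap<String, AttrInternal>,
}

/// Borrowed view handed out to callers.
pub struct Attribute<'a> {
    name: &'a str,
    internal: &'a AttrInternal,
}

impl<'a> Attribute<'a> {
    pub fn value(&self) -> &'a str {
        self.internal.value.as_str()
    }

    /// Still available even though only the start offset is stored.
    pub fn name_span(&self) -> Range<usize> {
        self.internal.name_offset..self.internal.name_offset + self.name.len()
    }
}

impl AttributeMap {
    /// Hypothetical lookup method returning the borrowed view.
    pub fn get(&self, name: &str) -> Option<Attribute<'_>> {
        self.inner
            .get_key_value(name)
            .map(|(name, internal)| Attribute { name: name.as_str(), internal })
    }
}

impl FromIterator<(String, String)> for AttributeMap {
    fn from_iter<I: IntoIterator<Item = (String, String)>>(iter: I) -> Self {
        let inner = iter
            .into_iter()
            // No source positions are known here, so the offset defaults to 0.
            .map(|(name, value)| (name, AttrInternal { value, name_offset: 0 }))
            .collect();
        AttributeMap { inner }
    }
}

fn main() {
    let attrs: AttributeMap = vec![("href".to_string(), "x".to_string())]
        .into_iter()
        .collect();
    let href = attrs.get("href").unwrap();
    println!("{} {:?}", href.value(), href.name_span());
}
```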
2023-08-19chore: move Attribute to attr moduleMartin Fischer
This is done separately so that the following commit has a cleaner diff.
2023-08-19feat!: add all-inclusive spans to tagsMartin Fischer
Also more performant since we no longer have to update the name span on every Emitter::push_tag_name call.
2023-08-19docs: add warning to DefaultEmitterMartin Fischer
2023-08-19fix: fix lots of position off-by-onesMartin Fischer
Previously the PosTrackingReader always mysteriously subtracted 1 from the current position. This wasn't sound at all: the machine just happens to often call `Tokenizer::unread_char`, but not always. E.g. for proper comments it didn't, which resulted in their offsets and spans being off by one. This commit fixes that (see test_spans.rs).
2023-08-19refactor!: make Emitter generic over offset instead of readerMartin Fischer
Emitters should not have access to the reader at all. Also, the current position of the reader at the time an Emitter method is called very much depends on machine implementation details, such as whether `Tokenizer::unread_char` is used. Having the Emitter methods take offsets lets the machine take care of providing the right offsets, as evidenced by the next commit.
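A heavily simplified sketch of that design, with hypothetical names throughout (`init_comment`, `CommentOffsets` and `scan_comments` are illustrative, not the crate's API): the emitter is generic over the offset type and never sees the reader, while the scanning code decides which offset to report.

```rust
/// Simplified emitter trait (assumed shape): generic over the offset type `O`.
trait Emitter<O> {
    fn init_comment(&mut self, offset: O);
}

/// Hypothetical emitter that records where each comment started.
#[derive(Default)]
struct CommentOffsets(Vec<usize>);

impl Emitter<usize> for CommentOffsets {
    fn init_comment(&mut self, offset: usize) {
        self.0.push(offset);
    }
}

/// Stand-in for the state machine: it alone knows how the reader position
/// relates to the start of a token, so it computes the offset itself.
fn scan_comments<E: Emitter<usize>>(input: &str, emitter: &mut E) {
    for (offset, _) in input.match_indices("<!--") {
        emitter.init_comment(offset);
    }
}

fn main() {
    let mut emitter = CommentOffsets::default();
    scan_comments("a<!--b--><!--c-->", &mut emitter);
    println!("{:?}", emitter.0);
}
```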
2023-08-19chore: move type param bounds to where clauseMartin Fischer
2023-08-19feat!: add offset to commentsMartin Fischer
2023-08-19refactor!: remove Span trait, just use RangeMartin Fischer
`std::mem::size_of::<Range<NoopOffset>>()` is 0 so there's no need to abstract over Range.
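The size claim can be checked with a short sketch. It assumes `NoopOffset` is a zero-sized unit struct; the definition below is a stand-in, not copied from the crate.

```rust
use std::mem::size_of;
use std::ops::Range;

/// Assumed definition: a zero-sized offset type used when spans are disabled.
#[allow(dead_code)]
struct NoopOffset;

/// Returns the sizes of a no-op range and of a real byte range.
fn range_sizes() -> (usize, usize) {
    (size_of::<Range<NoopOffset>>(), size_of::<Range<usize>>())
}

fn main() {
    let (noop, real) = range_sizes();
    println!("Range<NoopOffset>: {noop} bytes, Range<usize>: {real} bytes");
}
```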
2023-08-19refactor!: make Position generic over offset typeMartin Fischer
Previously Span was generic over R just so that it could provide the method:

    fn from_reader(reader: &R) -> Self;

and properly implementing that method again relied on R implementing the Position trait:

    impl<P: Position> Span<P> for Range<usize> { .. }

which was a very roundabout and awkward way of doing things. It makes much more sense to make the Position trait generic over the return type of its method (which previously always had to be usize). Which lets us provide a blanket implementation:

    impl<R: Reader> Position<NoopOffset> for R { .. }
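The resulting shape might look roughly like the sketch below. The `Reader` trait, `StrReader` and the `PosTrackingReader` fields are simplified stand-ins; only the trait and type names and the blanket impl signature come from the commit message.

```rust
/// Hypothetical minimal reader trait.
trait Reader {
    fn read_char(&mut self) -> Option<char>;
}

/// Assumed definition: a zero-sized offset type.
struct NoopOffset;

/// Generic over the offset type returned by `position`.
trait Position<T> {
    fn position(&self) -> T;
}

// Blanket impl from the commit message: every reader can report a no-op offset.
impl<R: Reader> Position<NoopOffset> for R {
    fn position(&self) -> NoopOffset {
        NoopOffset
    }
}

/// Simplified stand-in for a reader wrapper that counts consumed bytes.
struct PosTrackingReader<R> {
    inner: R,
    position: usize,
}

impl<R: Reader> Reader for PosTrackingReader<R> {
    fn read_char(&mut self) -> Option<char> {
        let c = self.inner.read_char()?;
        self.position += c.len_utf8();
        Some(c)
    }
}

// `Position<usize>` and `Position<NoopOffset>` are distinct traits,
// so this impl does not overlap with the blanket impl above.
impl<R: Reader> Position<usize> for PosTrackingReader<R> {
    fn position(&self) -> usize {
        self.position
    }
}

/// Hypothetical reader over a string slice.
struct StrReader<'a>(std::str::Chars<'a>);

impl<'a> Reader for StrReader<'a> {
    fn read_char(&mut self) -> Option<char> {
        self.0.next()
    }
}

fn main() {
    let mut reader = PosTrackingReader { inner: StrReader("hé".chars()), position: 0 };
    while reader.read_char().is_some() {}
    // Both impls apply to this type, so the offset type must be named explicitly.
    println!("{}", Position::<usize>::position(&reader));
}
```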
2023-08-19chore: demote missing_docs lint to warnMartin Fischer
`#![deny(missing_docs)]` makes `cargo test` abort immediately if any public API member is missing a doc comment, which is quite annoying when experimenting with API designs. Also, refactor commits (such as the very next commit) sometimes introduce new types that are removed again immediately afterwards. This should be possible without having to add a `/// TODO` (which, contrary to a compiler warning, is easy to miss).
2023-08-19break!: rename GetPos trait to PositionMartin Fischer
More in line with RFC 344.[1] [1]: https://rust-lang.github.io/rfcs/0344-conventions-galore.html#gettersetter-apis
2023-08-19refactor: add default for S type param of DefaultEmitterMartin Fischer
2023-08-19test: split up span testMartin Fischer
2023-08-19refactor!: remove current_is_appropriate_end_tag_token from EmitterMartin Fischer
2023-08-19refactor: proxy essential Emitter methods through TokenizerMartin Fischer
2023-08-19break!: stop re-exporting reader traits & typesMartin Fischer
This is primarily done to make the rustdoc more readable (by grouping Reader, IntoReader, StringReader and BufReadReader in the reader module). Ideally IntoReader is already implemented for your input type and you don't have to concern yourself with these traits / types at all.
2023-08-19docs: remove `crate::` from link labelsMartin Fischer
2023-08-19docs: move `produce ("emit")` clue to Emitter docMartin Fischer
2023-08-19break!: merge Tokenizer::new_with_emitter into Tokenizer::newMartin Fischer
The Tokenizer does not perform any state switching, since proper state switching requires a feedback loop between tokenization and DOM tree building. Using the Tokenizer directly therefore is a bit of a pitfall, since you might not expect it to e.g. tokenize `<script><b>` as:

    StartTag(StartTag { name: "script", .. })
    StartTag(StartTag { name: "b", .. })

Since we don't want to make walking into pitfalls particularly easy, this commit changes the Tokenizer::new method so that you have to specify the Emitter. Since this makes new_with_emitter redundant it is removed.
2023-08-19docs: move note about Reader impls to Reader traitMartin Fischer
2023-08-19break!: remove Default impl for AttributeMartin Fischer
2023-08-19break!: remove Default impls for StartTag and EndTagMartin Fischer
2023-08-19refactor: decouple html5lib_tests from html5tokenizerMartin Fischer
Previously we mapped the test tokens to our own token type. Now we do the reverse, which makes more sense as it enables us to easily add more detailed fields to our own token variants without having to worry about these fields not being present in the html5lib test data. (An alternative would be to normalize the values of these fields to some arbitrary value so that PartialEq still holds but seeing such normalized fields in the diff printed by pretty_assertions on a test failure would be quite confusing).
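The direction of the mapping can be sketched like this. Both enums are hypothetical, heavily simplified stand-ins for the crate's token type and the html5lib test token type; the point is only that extra fields such as spans are dropped during conversion rather than normalized.

```rust
use std::ops::Range;

/// Simplified stand-in for the crate's own token: it carries a span.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Token {
    Comment { data: String, span: Range<usize> },
    Char(char),
}

/// Simplified stand-in for a token as described by the test data: no spans.
#[derive(Debug, PartialEq)]
enum TestToken {
    Comment(String),
    Char(char),
}

impl From<Token> for TestToken {
    fn from(token: Token) -> Self {
        match token {
            // Fields unknown to the test data are simply discarded.
            Token::Comment { data, .. } => TestToken::Comment(data),
            Token::Char(c) => TestToken::Char(c),
        }
    }
}

fn main() {
    let ours = Token::Comment { data: "x".to_string(), span: 4..5 };
    println!("{:?}", TestToken::from(ours));
}
```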
2023-08-19chore(html5lib_tests): simplify control flowMartin Fischer
2023-08-19refactor: split off reusable html5lib_tests crateMartin Fischer
2023-08-19refactor: separate test logic from html5lib-test parsingMartin Fischer
2023-08-19break!: privatize PosTrackingReader fieldsMartin Fischer
2023-08-19break!: rename PosTracker to PosTrackingReaderMartin Fischer
2023-08-19break!: remove Never in favor of std::convert::InfallibleMartin Fischer
This change is a backport of 04e6cbe[1] from html5gum. [1]: https://github.com/untitaker/html5gum/commit/04e6cbe44bb7a388bd61d1c9cfe4c618eb3b0e29
2023-08-19break!: remove InfallibleTokenizer in favor of Iterator::flattenMartin Fischer
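As a sketch of why `Iterator::flatten` can stand in for a dedicated wrapper: `Result<T, E>` implements `IntoIterator`, yielding the value when it is `Ok`, so flattening an iterator of `Result<T, Infallible>` yields every `T`. The `tokens` and `unwrapped` functions and the string items are hypothetical stand-ins for a tokenizer and its tokens.

```rust
use std::convert::Infallible;

/// Stand-in for a tokenizer whose reader cannot fail.
fn tokens() -> impl Iterator<Item = Result<&'static str, Infallible>> {
    let items: Vec<Result<&'static str, Infallible>> =
        vec![Ok("<p>"), Ok("text"), Ok("</p>")];
    items.into_iter()
}

/// `flatten` unwraps each `Ok`; an `Err` can never occur with `Infallible`.
fn unwrapped() -> Vec<&'static str> {
    tokens().flatten().collect()
}

fn main() {
    println!("{:?}", unwrapped());
}
```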
2023-08-19docs: remove Tokenizer::new examples from Reader docsMartin Fischer
2023-08-19break!: rename Readable to IntoReaderMartin Fischer
The standard library's trait is likewise called IntoIterator, not Iterable.
2023-08-19fix(docs): remove outdated list of Readable implsMartin Fischer
dced8066f77f570dd3e396ec3570c71aa86c454e introduced a Readable impl for std::io::BufReader. Manually listing impls in a doc comment is a bad idea since such lists will just get out of date and there's no need for that since rustdoc automatically lists all implementations on the trait page.
2023-08-19fix(docs): fix Error variant doc saying `$literal`Martin Fischer
2023-08-19fix(docs): Span is a byte range (not character range)Martin Fischer
2023-08-19fix(docs): StartTag is a start tagMartin Fischer
2023-08-19fix(docs): Error::EndTagWithAttributes should be emitted by emit_current_tagMartin Fischer
2023-08-19test: enable previously skipped tokenizer testMartin Fischer
2023-08-19break!: remove StartTag::next_stateMartin Fischer
You shouldn't have to manually match tokens yielded by the tokenizer iterator just to correctly handle state transitions. A better NaiveParser API will be introduced.
2023-08-19break!: remove set_last_start_tag from EmitterMartin Fischer
2023-08-19refactor: move html5lib test to own crate to fix `cargo test`Martin Fischer
Previously `cargo test` failed because it ran the test_html5lib integration test, which depends on the integration-tests feature (so you always had to run `cargo test` with `--features integration-tests` or `--all-features`, which was annoying). This commit moves the integration tests to another crate, where the dependency on the feature can be declared properly, so that `cargo test` just works and runs the test.
2023-08-19chore: drop test-generator dev-dependencyMartin Fischer
I want to move the test_html5lib integration test to a separate crate so that it can properly depend on the integration-tests feature in a way so that `cargo test` just works and runs the integration test. (Currently `cargo test` fails since test_html5lib depends on that feature.) However test_html5lib currently depends on the test-generator crate and test-generator doesn't support Cargo workspaces[1] and appears to be unmaintained. This commit therefore drops the test-generator dev-dependency. [1]: https://github.com/frehberg/test-generator/issues/6
2021-12-05rename to html5tokenizer, bump versionv0.4.0Martin Fischer
2021-12-05spans: get rid of code duplication by introducing Span traitMartin Fischer
2021-12-05spans: refactor to avoid one clone()Martin Fischer
2021-12-05rename internal emit_error to push_error (to avoid confusion with trait method)Martin Fischer