author | Markus Unterwaditzer <markus-honeypot@unterwaditzer.net> | 2021-11-24 20:44:08 +0100 |
---|---|---|
committer | Markus Unterwaditzer <markus-honeypot@unterwaditzer.net> | 2021-11-24 20:51:21 +0100 |
commit | 9909fc4580855a58a10eb84f0d143d1b3b3f464a (patch) | |
tree | 36941a6a714a10b9ce554ba249975108e6a17274 /tests/html5lib-tests/tokenizer/README.md | |
hello world
Diffstat (limited to 'tests/html5lib-tests/tokenizer/README.md')
-rw-r--r-- | tests/html5lib-tests/tokenizer/README.md | 107 |
1 file changed, 107 insertions, 0 deletions
diff --git a/tests/html5lib-tests/tokenizer/README.md b/tests/html5lib-tests/tokenizer/README.md
new file mode 100644
index 0000000..66b81e8
--- /dev/null
+++ b/tests/html5lib-tests/tokenizer/README.md
@@ -0,0 +1,107 @@

Tokenizer tests
===============

The test format is [JSON](http://www.json.org/). This has the advantage
that the syntax allows backward-compatible extensions to the tests and
the disadvantage that it is relatively verbose.

Basic Structure
---------------

    {"tests": [
        {"description": "Test description",
         "input": "input_string",
         "output": [expected_output_tokens],
         "initialStates": [initial_states],
         "lastStartTag": last_start_tag,
         "errors": [parse_errors]
        }
    ]}

Multiple tests per file are allowed simply by adding more objects to the
"tests" list.

Each parse error is an object containing an error `code` and one-based
error location indices, `line` and `col`.

`description`, `input` and `output` are always present. The other values
are optional.

### Test set-up

`test.input` is a string containing the characters to pass to the
tokenizer. Specifically, it represents the characters of the **input
stream**, and so implementations are expected to perform the processing
described in the spec's **Preprocessing the input stream** section
before feeding the result to the tokenizer.

If `test.doubleEscaped` is present and `true`, then `test.input` is not
quite as described above. Instead, it must first be subjected to another
round of unescaping (i.e., in addition to any unescaping involved in the
JSON import), and the result of *that* represents the characters of the
input stream. Currently, the only unescaping required by this option is
to convert each sequence of the form \\uHHHH (where H is a hex digit)
into the corresponding Unicode code point. (Note that this option also
affects the interpretation of `test.output`.)

`test.initialStates` is a list of strings, each being the name of a
tokenizer state, which can be one of the following:

- `Data state`
- `PLAINTEXT state`
- `RCDATA state`
- `RAWTEXT state`
- `Script data state`
- `CDATA section state`

The test should be run once for each string, using it to set the
tokenizer's initial state for that run. If `test.initialStates` is
omitted, it defaults to `["Data state"]`.

`test.lastStartTag` is a lowercase string that should be used as "the
tag name of the last start tag to have been emitted from this
tokenizer", as referenced in the spec's definition of **appropriate end
tag token**. If it is omitted, it is treated as if "no start tag has
been emitted from this tokenizer".

### Test results

`test.output` is a list of tokens, ordered so that the first token
produced by the tokenizer is the first (leftmost) entry in the list. The
list must match the **complete** list of tokens that the tokenizer
should produce. Valid tokens are:

    ["DOCTYPE", name, public_id, system_id, correctness]
    ["StartTag", name, {attributes}*, true*]
    ["StartTag", name, {attributes}]
    ["EndTag", name]
    ["Comment", data]
    ["Character", data]

`public_id` and `system_id` are either strings or `null`. `correctness`
is either `true` or `false`; `true` corresponds to the force-quirks flag
being false, and vice versa.

When the self-closing flag is set, the `StartTag` array has `true` as
its fourth entry. When the flag is not set, the array has only three
entries for backwards compatibility.

All adjacent character tokens are coalesced into a single
`["Character", data]` token.
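
To make the set-up and comparison rules above concrete, here is a minimal
sketch of a test runner in Python. It is not part of html5lib-tests: the
`tokenize()` function is a placeholder for whatever tokenizer is being
exercised, and loading the test files (including `doubleEscaped` handling)
is assumed to happen in a separate step, sketched at the end of this
document.

    def tokenize(chars, initial_state, last_start_tag):
        """Placeholder for the tokenizer under test (not part of html5lib-tests).

        Assumed to return the complete token list in the format described
        above, with adjacent character tokens already coalesced.
        """
        raise NotImplementedError

    def run_tests(tests):
        """Run already-loaded test objects against the tokenizer above."""
        for test in tests:
            # Omitted members fall back to the defaults described above.
            states = test.get("initialStates", ["Data state"])
            last_start_tag = test.get("lastStartTag")  # None: no start tag emitted yet
            for state in states:
                # One run per requested initial state; input-stream
                # preprocessing is assumed to happen inside tokenize().
                got = tokenize(test["input"], state, last_start_tag)
                # The expected output is the complete token list, so a plain
                # equality check suffices once doubleEscaped strings have
                # been unescaped during loading.
                assert got == test["output"], (test["description"], state, got)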
If `test.doubleEscaped` is present and `true`, then every string within
`test.output` must be further unescaped (as described above) before
comparing with the tokenizer's output.

xmlViolation tests
------------------

`tokenizer/xmlViolation.test` differs from the above in a couple of
ways:

- The name of the single member of the top-level JSON object is
  "xmlViolationTests" instead of "tests".
- Each test's expected output assumes that the implementation is
  applying the tweaks given in the spec's "Coercing an HTML DOM into an
  infoset" section.
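
As a companion to the runner sketched earlier, the following Python loading
step is likewise only an illustration (the helper names are not defined by
html5lib-tests). It applies the extra \\uHHHH unescaping round to both
`input` and `output` when `doubleEscaped` is set, and accepts either
top-level member name so that `tokenizer/xmlViolation.test` can be read
with the same code.

    import json
    import re

    _DOUBLE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

    def double_unescape(value):
        """Apply the extra \\uHHHH unescaping round to every string in a value.

        Recurses into lists and objects so that tag names, attribute names
        and attribute values inside `output` are covered as well.
        """
        if isinstance(value, str):
            return _DOUBLE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)
        if isinstance(value, list):
            return [double_unescape(item) for item in value]
        if isinstance(value, dict):
            return {double_unescape(k): double_unescape(v) for k, v in value.items()}
        return value  # numbers, booleans, null

    def load_tests(path):
        """Yield normalised test objects from a regular or xmlViolation file."""
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        # xmlViolation.test uses "xmlViolationTests" as its single top-level member.
        for test in data.get("tests", data.get("xmlViolationTests", [])):
            if test.get("doubleEscaped"):
                test["input"] = double_unescape(test["input"])
                test["output"] = double_unescape(test["output"])
            yield test

A whole file can then be exercised as `run_tests(load_tests(path))`, with
`run_tests` as in the earlier runner sketch.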