From 57e7eefcbe6fb8c3dc4b01c707be9de4c34963a7 Mon Sep 17 00:00:00 2001
From: Martin Fischer
Date: Thu, 8 Apr 2021 08:42:01 +0200
Subject: import https://github.com/servo/html5ever commit d1206daa740305f55a5fa159e43eb33afc359cb4

---
 data/bench/medium-fragment.html | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
 create mode 100644 data/bench/medium-fragment.html

diff --git a/data/bench/medium-fragment.html b/data/bench/medium-fragment.html
new file mode 100644
index 0000000..570bef2
--- /dev/null
+++ b/data/bench/medium-fragment.html
@@ -0,0 +1,24 @@
+History[edit]

+By early 1992 the search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF-1
+ that provided a byte-stream encoding of its 32-bit code points. This
+encoding was not satisfactory on performance grounds, but did introduce
+the notion that bytes in the range of 0–127 continue representing the
+ASCII characters in UTF, thereby providing backward compatibility with
+ASCII.

+In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories
+ submitted a proposal for one that had faster implementation
+characteristics and introduced the improvement that 7-bit ASCII
+characters would only represent themselves; all multibyte
+sequences would include only bytes where the high bit was set. This
+original proposal, FSS-UTF (File System Safe UCS Transformation Format),
+ was similar in concept to UTF-8, but lacked the crucial property of self-synchronization.[7][8]

+In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Labs
+ then made a small but crucial modification to the encoding, making it
+very slightly less bit-efficient than the previous proposal but allowing
+ it to be self-synchronizing,
+ meaning that it was no longer necessary to read from the beginning of
+the string to find code point boundaries. Thompson's design was outlined
+ on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open.[7]

+UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993.

+Google reported that in 2008 UTF-8 (misleadingly labelled "Unicode") became the most common encoding for HTML files.[9][10]

+Description[edit]

-- 
cgit v1.2.3