Diffstat (limited to 'data/bench')
-rw-r--r-- data/bench/lipsum-zh.html 19
-rw-r--r-- data/bench/lipsum.html 40
-rw-r--r-- data/bench/medium-fragment.html 24
-rw-r--r-- data/bench/small-fragment.html 7
-rw-r--r-- data/bench/strong.html 1
-rw-r--r-- data/bench/tiny-fragment.html 1
6 files changed, 92 insertions, 0 deletions
diff --git a/data/bench/lipsum-zh.html b/data/bench/lipsum-zh.html
new file mode 100644
index 0000000..1efe2fa
--- /dev/null
+++ b/data/bench/lipsum-zh.html
@@ -0,0 +1,19 @@
+甀 曒檃檑 糲蘥蠩 櫋瀩, 嗢 剆坲姏 齸圞趲 葠蜄蛖 砎粁 擙樲橚 噅尰崺 廘榙榾 誙 煘煓, 腶 敔耜 逯郹酟 蝪蝩覤 顲鱭鸋, 趍 櫱瀯灂 碄碆碃 矠筸 砫粍 耜僇鄗 搋朠楟 溔 齝囃 槏 鼏噳墺 滭滹漇, 骱 翀胲胵 蝑蝞蝢 鑅鷖
+
+痯 荾莯 驧鬤鸕 梪涫湴, 踙 黈龠懱 椼毸溠 蠬襱覾 滱漮, 耜僇鄗 沀皯竻 饇馦 蒏 斠 墐墆墏 艎艑蔉 貕貔 廑憀慡 嫬廙彯 鳻嶬 跿, 飹勫嫢 熤熡磎 慛 賗跿, 灂瀿 綧 摿斠榱 橀槶澉 碄碆碃 鯦鯢鯡 踾踶輵 鍌鍗鍷 溿 滭滹, 綧 藙藨 蝪蝩覤 渮湸湤, 輗 鰝鰨 犌犐瑆 櫞氌瀙 鵳齖齘 塝 寁崏 摨敹暯 檌檒濦 滭滹漇, 撖 輈鄍 婸媥媕 漦澌潬, 膣 姛帡恦 莃荶衒 昢炾
+
+儮嬼懫 馦騧騜 覛谼貆 墏壾 鋱, 緦 豥趍 翍脝艴 絟缾臮 摲 輴郺 篧糑縒 獧瞝瞣 袀豇貣, 廞 鶄鵳 肒芅邥 泏狔狑 覛谼貆 儋圚墝 滭滹漇 鰝鰨 蔰, 忁曨曣 蝪蝩覤 埱娵徖 萴葂 跬, 緷 巂鞪 晛桼桾 踥踕踛 翣聜蒢 虥諰諨 箄縴儳 磼簎 殠, 銇 烺焆琀 鱐鱍鱕 垽娭屔 齫儽, 蒮 靮傿 烍烚珜 蒝蒧蓏 璈皞緪 圪妀 綧 溮煡煟 轛轝酅 濷瓂癚, 篧糑縒 谾踘遳 讘麡 腶, 鯦鯢鯡 邆錉霋 鋱 蛚袲 鋱鋟鋈 瀷瀹藶 騉鬵 嗢
+
+蝺 鰔鶟 濇燖燏 梪涫湴 齫儽戃, 馺 髬魆 齴讘麡 袟袘觕, 甀瞂硾 鍹餳駷 邆錉霋 曮禷 瑽 虰豖 瀿犨皫 蜬蝁蜠 檹瀔濼 榯, 獝瘝磈 輣鋄銶 抏旲 諃 褌 緳廞徲 轞騹鼚 瘵瘲 媥媕 踙 簎艜薤 鸙讟钃
+
+滘 鐩闤鞿 轞騹鼚 絟缾臮 碃稘, 鮥鴮 輗 渳湥牋 獿譿躐 趉軨鄇 鋑鋡髬 嶜憃撊 磑 棳棔 滜溙 蔏 烺焆琀 鱐鱍鱕 撌斳暩 緅 彃慔 釢髟偛 礯籔羻
+
+鏾鐇闠 擙樲橚 塓塕 慔 笢笣 壾 婸媥媕 奫嫮嫳, 愄揎揇 趡趛踠 瑽 秎穾, 腤萰 蛃袚觙 玝甿虮 濆澓澋 魦 綧 瘱瘵瘲 擙樲橚 瞵瞷矰 璈皞, 腠腶舝 翣聜蒢 魵 潧潣, 慖摲摓 橍殧澞 蟷蠉蟼 摮 嗢嗂 誙賗跿 磏磑禠 蝩覤 穊 鷕黰戄 鼀齕櫌 殔湝 緦, 緁 瘱瘵瘲 鸃鼞欘 窞綆腤 嗼嗹 輷邆 壿 櫱瀯灂 鶭黮齥 鏙闛颾, 眊砎粁 硻禂稢 薢蟌 鋈, 榎榯槄 墂嫫嵾 毄滱漮 豥 髟偛
+
+掭掝 暲 瞵瞷矰 鬄鵊鵙 鍎鞚韕, 齞齝囃 脬舑莕 蔍 嫳嫬 絼綒 縸縩薋 毊灚襳 珝砯砨 嵧 裌覅詵 崸嵀惉 慛 碞碠 蒮 橁橖澭 摨敹暯 罫蓱蒆 嵥嵧 蟷蠉 滆 櫱瀯灂 鶟儹巏 瘑睯碫
+
+滈 簎艜薤 廑憀慡 鑴鱱爧 屼汆, 歅 彔抳 鏾鐇闠 桏毢涒 垽娭屔 磝磢磭 袟袘觕 鍌鍗鍷 鋈 氠洷, 棳棔 雈靮傿 臡虈觿 氃濈瀄 槄 橀槶澉 麷劻穋 嘽 簅縭, 狑玝 垥娀庣 僤凘墈 岯岪弨 摲, 馺骱魡 抩枎殀 迗俀侹 蓪 錛鍆 蔰 暯樧 璸瓁穟 瘑睯碫 濍燂犝, 犵艿邔 獧瞝瞣 馻噈嫶 蝢褗 僣, 嬨嶵 壿 蠝襭譸 痑祣筇 觛詏貁 蜙 珶珸珿 濷瓂癚 箑箖 嗼嗹墋 峷敊浭 阰刲 鄜, 柦柋牬 寁崏庲 礯籔羻 鋍鞎 鉾 跠跬 蜸 勯噚嶢 礌簨繖 醳鏻鐆
+
+蟷蠉蟼 熩熝犚 摓 髽鮛 顤鰩鷎 駍駔鳿 鸃鼞欘 褅 牬玾 殍涾烰 誽賚賧 鴸鼢曘 搋朠 殟 蟼襛 溔 嶵嶯幯 蒘蝆蜪, 蟣襋 溿煔煃 銇韎餀 蹸蹪鏂 摮 踸躽 踣 廦廥彋 鼀齕櫌, 靾鞂 虥諰諨 婸媥媕 毄滱漮 魆 蒛 裧頖 鍆錌雔 枅杺枙 堔埧娾, 蓂蓌蓖 噾噿嚁 洷炟砏 砎粁 鋱, 嬼懫 杍肜阰 麷劻穋 蔊蓴蔖 豥
+
+暕 忀瀸蘌 褣諝趥 髽鮛 滍 噾噿 顤鰩鷎 逯郹酟 樏殣氀 煻獌 蚔趵郚 枲柊氠 鄃鈌鈅 暕, 禖穊稯 鄨鎷闒 鏾鐇闠 蒝蒧 誙 賌輈鄍 鶊鵱鶆 毊灚襳 珋疧 滘 瀗犡礝 簻臗藱 駔鳿 磑, 墐 圩芰敔 婂崥崣 溹溦滜 鍗鍷
diff --git a/data/bench/lipsum.html b/data/bench/lipsum.html
new file mode 100644
index 0000000..27dc14a
--- /dev/null
+++ b/data/bench/lipsum.html
@@ -0,0 +1,40 @@
+Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer eu arcu varius, fringilla nisi non, pulvinar lorem. Nam et sollicitudin nisi, eget tempus sapien. Suspendisse ac libero velit. Proin semper lacinia posuere. Morbi sollicitudin lacinia urna, eget aliquet risus lobortis sit amet. Fusce rhoncus sodales mauris, a rutrum erat elementum id. Integer nec sapien sit amet nisl convallis vehicula eu eu augue. Etiam nec elit ac nibh lacinia porta. Integer dapibus feugiat magna, eget varius ante vestibulum vel. Vestibulum vitae felis quis est tristique varius quis eget libero. Nullam tincidunt magna eros, nec luctus ante pretium at. Aenean laoreet justo vitae risus fringilla convallis. In malesuada scelerisque lorem, sed luctus tortor varius at. Morbi odio ligula, commodo eu sodales vitae, bibendum eget leo. In odio est, laoreet sit amet eleifend at, placerat in elit.
+
+Nullam ac viverra elit. Vestibulum et massa vel justo bibendum imperdiet. Donec elementum vitae nibh sit amet pellentesque. Ut id fringilla sem, in tincidunt quam. In a dui dignissim, gravida magna in, porta ante. Integer adipiscing porta risus. Nulla facilisi. Cras erat leo, tempor a ligula in, posuere ullamcorper nulla. Maecenas id auctor elit, imperdiet sagittis augue. Curabitur consectetur suscipit lorem porta sollicitudin. Etiam turpis orci, eleifend eu felis in, placerat consequat est. Sed ultrices, tellus ut volutpat venenatis, metus lectus malesuada diam, id ornare risus lectus sed massa. Vivamus mauris diam, lobortis ut interdum eget, porta a elit. Suspendisse potenti.
+
+Donec tincidunt nisi sed mollis feugiat. Mauris ultricies risus non eros feugiat tempor. In aliquam ut nunc id tempor. Curabitur vel elit dolor. Mauris ullamcorper tortor ac nisl feugiat, quis gravida nisl ullamcorper. Pellentesque a ligula quis erat rutrum sollicitudin in a metus. Aliquam ligula massa, cursus in libero a, blandit feugiat tortor. In ac auctor lorem. Ut faucibus leo nec egestas tristique.
+
+Nulla adipiscing consectetur odio, a iaculis eros aliquam at. Nullam dapibus ac ante et convallis. Phasellus tempor arcu velit. Donec adipiscing neque eu molestie mattis. Vestibulum id elit fringilla, ultrices orci eu, rhoncus purus. Mauris ornare nisi massa, et luctus tortor tincidunt vel. Maecenas eu ultrices enim, et varius est. Integer ipsum nunc, suscipit eu dapibus ac, ornare vitae sapien. Vestibulum posuere, nulla sed dictum tempus, magna metus commodo turpis, a aliquet orci tellus eu lectus. Mauris nulla magna, malesuada vitae iaculis ut, facilisis varius sem. In tristique sapien urna, et tristique dolor lacinia non. Suspendisse eu tincidunt eros. Pellentesque dignissim elit vitae purus auctor, non malesuada dolor scelerisque.
+
+Cras commodo tortor at risus ornare euismod a et risus. Sed rutrum, justo vel mollis condimentum, mi elit consectetur mi, non ultricies quam orci mollis sapien. Donec tincidunt, lacus molestie porttitor elementum, enim ligula hendrerit lacus, quis porttitor magna velit sed nisi. Quisque pretium eros id sem posuere consequat id sit amet nunc. Fusce pulvinar commodo ipsum, quis congue tellus faucibus eu. Sed bibendum dolor vitae ante porttitor pretium. Integer id malesuada eros, sed tristique metus. Nunc vitae turpis eu risus sodales vestibulum quis ut magna. In eget metus elit. Donec gravida libero risus, eget tempus erat varius eu. Vestibulum id dignissim sapien. Fusce pretium posuere lacus. Aliquam ac arcu sollicitudin, lacinia tellus vitae, pellentesque tortor. Mauris viverra velit ac lacus egestas sagittis. Duis auctor interdum tincidunt. Aenean eu ullamcorper sapien, sit amet sollicitudin magna.
+
+Nam vel lorem a quam sollicitudin fringilla sit amet quis nibh. Quisque commodo molestie augue. Vivamus ut erat aliquet, gravida ante at, suscipit arcu. Fusce nulla massa, lobortis vel dictum non, vehicula ac lorem. Etiam blandit sodales urna, at aliquet libero dapibus a. Cras odio mauris, porta at enim vitae, aliquam tincidunt libero. Praesent at tortor eu eros cursus consequat vel non elit. Mauris risus urna, sagittis eget turpis eu, malesuada semper nisl. Nunc posuere placerat ligula, in tristique urna pharetra et. Duis consectetur mauris nulla. Etiam auctor tincidunt molestie. Fusce eu faucibus diam, nec fermentum felis. Curabitur non lacinia quam, non luctus neque. Morbi sed ultrices diam.
+
+Fusce accumsan nisl sed nibh fringilla euismod. In ut arcu cursus erat imperdiet porttitor. Pellentesque tempus, nisi quis viverra convallis, eros sem dapibus magna, ut aliquet quam urna vitae dolor. Aenean id tortor turpis. Etiam lacinia arcu lorem, in consectetur arcu placerat sed. Duis non est ornare, dictum mi sit amet, cursus nunc. Suspendisse at venenatis massa. Etiam eget lorem diam. Donec tristique sapien at scelerisque porta. Aenean ornare ligula sed nibh gravida, vel commodo erat ultrices. Donec id enim purus. Vivamus malesuada tristique sapien id tempus. Morbi nec nunc dolor.
+
+Aliquam molestie turpis cursus blandit blandit. Integer imperdiet ullamcorper arcu, a fermentum nisi. Cras hendrerit quam id mollis elementum. Etiam ut erat ac leo posuere aliquet eget non tortor. Nam vel velit sed dui tincidunt gravida eget eget risus. Suspendisse adipiscing sed nulla vel molestie. Aliquam suscipit, sem sed volutpat sagittis, magna enim feugiat erat, pharetra feugiat magna neque a ante. Duis at metus eget leo congue molestie. Vivamus id massa ornare, rutrum ante nec, ullamcorper lacus. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Vestibulum lobortis arcu eu arcu hendrerit convallis. Integer mollis velit at ante consequat, eu pharetra erat venenatis. Integer tincidunt sit amet massa vel hendrerit. Morbi malesuada facilisis augue sed congue. Phasellus porttitor vel mi eu imperdiet. Aenean tincidunt, massa et tristique mollis, nisl metus vulputate est, quis sollicitudin metus ipsum vel felis.
+
+Suspendisse nec feugiat dui. Proin nec lorem semper, dignissim leo et, suscipit turpis. In posuere sem ut blandit scelerisque. Fusce vel ultricies augue, adipiscing pretium lacus. Mauris ac dui non odio convallis pellentesque. Curabitur posuere nec odio ut sodales. Morbi varius risus lacinia, convallis mauris in, tristique turpis.
+
+Vivamus hendrerit justo augue, et molestie ligula aliquam ac. Nunc nec vehicula felis. Donec quam lacus, commodo sollicitudin aliquet eu, aliquam ut leo. Donec vulputate arcu urna, in molestie orci faucibus non. Praesent ut ullamcorper ante. Quisque sollicitudin libero in arcu gravida, quis scelerisque tortor volutpat. Nulla ornare mi ac odio sagittis rutrum. Sed quis sagittis felis. Praesent bibendum orci sed risus elementum, malesuada posuere massa condimentum. Sed velit nunc, pulvinar eu feugiat at, ultrices eu odio. Mauris lacinia ut odio eget ornare. Nullam commodo mollis lorem, ac vehicula justo tristique a.
+
+Morbi est ipsum, egestas a urna sed, aliquet tempus ipsum. In eget fermentum libero. Nullam a sodales dui. Nam imperdiet condimentum luctus. Morbi bibendum at nulla sed aliquam. Quisque nibh nibh, sollicitudin non ullamcorper commodo, viverra non metus. Suspendisse eleifend turpis massa. Cras tortor metus, rutrum sit amet tellus a, sodales suscipit eros. Sed in vulputate ligula. Integer posuere velit sed nisl tristique suscipit. Quisque bibendum eleifend enim in sollicitudin. Phasellus tincidunt orci pretium, molestie felis eu, sodales metus.
+
+Vestibulum consectetur orci ut blandit aliquet. Sed posuere cursus lacus vestibulum posuere. Phasellus ut risus sem. Vivamus et purus non felis pellentesque lacinia. Phasellus aliquam, diam eget vestibulum lobortis, purus tortor porttitor eros, vitae auctor lorem velit a turpis. Integer eu metus vel nisi porta lobortis sollicitudin eget arcu. Maecenas ac blandit dolor. In et sapien ornare, dignissim nulla quis, tempor odio.
+
+Ut nec quam ligula. Ut euismod, nisi nec iaculis faucibus, nisi arcu dignissim neque, a fringilla dolor tellus ut arcu. Curabitur iaculis rhoncus orci sed fermentum. Cras augue elit, eleifend sodales pellentesque ac, varius bibendum nulla. Etiam id diam non purus porta lobortis. Cras fringilla metus in ipsum laoreet placerat. Integer vel quam nec libero varius mattis in non nibh.
+
+Pellentesque adipiscing feugiat neque, vitae imperdiet dui. Duis pharetra elit a dictum laoreet. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Nulla vulputate malesuada nisi, vel egestas nulla mollis ut. Nunc faucibus pharetra leo ac ultricies. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus in odio a magna convallis molestie ut at mauris. Morbi bibendum id dui id imperdiet. Curabitur volutpat et erat quis venenatis. Integer tincidunt et felis sed rutrum. Donec vitae porttitor enim. Sed nisi nunc, auctor ac ullamcorper quis, eleifend id metus.
+
+Morbi felis est, tincidunt at eros at, interdum tempor tortor. Nam et semper metus. Vivamus lacinia pulvinar magna, a lacinia ligula condimentum vitae. Donec vitae ullamcorper diam. Aenean auctor mollis tincidunt. Mauris hendrerit eros quis nulla posuere, non mattis tellus venenatis. Fusce et ligula nec arcu consequat pulvinar. Maecenas sagittis odio justo, at ultrices velit aliquet quis. In hac habitasse platea dictumst. Suspendisse viverra nunc vitae lectus bibendum, vel pretium arcu pretium. Curabitur iaculis condimentum magna ac rutrum. Aenean placerat massa nunc, id vehicula magna vulputate eget. Integer dignissim nunc in enim bibendum consequat vitae id leo. Mauris quis aliquam quam. Suspendisse vel fringilla purus. Mauris sodales dui vitae lacus pellentesque tincidunt a eget nunc.
+
+Nullam imperdiet vestibulum magna nec dictum. Vestibulum scelerisque vestibulum congue. Phasellus fermentum pulvinar elit, eget fringilla arcu vestibulum sed. Mauris pretium nulla in consectetur cursus. Cras malesuada est vulputate hendrerit bibendum. Aenean a tristique diam, ac convallis ipsum. Nunc ac justo ut ante tristique pulvinar. Donec ornare leo sed iaculis rutrum. Integer tincidunt vestibulum massa scelerisque accumsan. Maecenas malesuada, orci at tincidunt faucibus, ipsum velit condimentum odio, vitae cursus risus justo vel orci. Interdum et malesuada fames ac ante ipsum primis in faucibus. Vivamus eu tincidunt leo. Nam a faucibus ipsum, in convallis ligula. Fusce urna lorem, iaculis ut pharetra a, laoreet a mauris. Maecenas molestie justo enim, vitae tincidunt nulla dictum quis.
+
+Ut ac purus ut velit feugiat tincidunt nec sit amet lorem. Mauris nulla sapien, rhoncus a condimentum et, tincidunt ut enim. Nullam eu rhoncus ante. Proin eget erat est. Vivamus suscipit fringilla metus, ut scelerisque urna. Vivamus id porta nibh, ac tincidunt nisl. Vivamus commodo tincidunt turpis a molestie. Phasellus nec interdum enim. Cras accumsan tristique massa.
+
+Cras vitae blandit dolor. Sed purus sem, pharetra sed orci eu, fermentum porttitor magna. Morbi dictum gravida sodales. Pellentesque varius non quam in ullamcorper. Sed in mauris sit amet sapien tempus gravida. Aliquam suscipit nulla a risus ullamcorper, et pharetra leo pharetra. Pellentesque neque lectus, molestie et eros id, consequat sagittis arcu. Nullam suscipit ipsum id lacus tincidunt sollicitudin. Fusce eget leo non massa tempor scelerisque ut a enim. Vestibulum a elementum ligula. Aliquam vehicula semper nibh nec imperdiet. Interdum et malesuada fames ac ante ipsum primis in faucibus. Etiam pretium ante eget lectus rutrum auctor.
+
+Sed pharetra quam metus. Aenean ac rutrum arcu. Donec sit amet pharetra nulla, vitae porttitor eros. Nullam accumsan cursus dolor, ut sodales magna tincidunt quis. Quisque egestas pellentesque velit id fringilla. Duis vel nisi libero. Vivamus ultrices ligula vel tempor lacinia. Cras dictum ut nunc vel suscipit. Duis convallis tortor varius consectetur tempor. Maecenas sed pharetra quam. Nunc malesuada risus justo, et vehicula quam placerat at. Vestibulum non orci eu felis viverra convallis.
+
+Nulla accumsan ultrices ligula, id commodo odio interdum sed. Fusce sit amet varius tortor. Integer non mattis eros. Curabitur vulputate massa non ante lacinia sodales. Aenean a feugiat ligula. Fusce ultricies molestie lectus auctor dignissim. Duis eu lorem feugiat, varius quam vel, volutpat magna. Pellentesque nec nisl ut lorem interdum condimentum scelerisque eu purus. Vestibulum porttitor elementum lectus quis lobortis. Vestibulum non sem ultricies, elementum risus non, aliquet ipsum. Phasellus pellentesque lacinia purus et tristique. Aenean lacinia, mi vel rutrum dapibus, nibh lacus hendrerit velit, ac faucibus massa erat sodales dui. Etiam in enim varius, auctor risus vel, blandit quam.
+
diff --git a/data/bench/medium-fragment.html b/data/bench/medium-fragment.html
new file mode 100644
index 0000000..570bef2
--- /dev/null
+++ b/data/bench/medium-fragment.html
@@ -0,0 +1,24 @@
+<h2><span class="mw-headline" id="History">History</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="http://en.wikipedia.org/w/index.php?title=UTF-8&amp;action=edit&amp;section=1" title="Edit section: History">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
+<p>By early 1992 the search was on for a good byte-stream encoding of multi-byte character sets. The draft <a href="http://en.wikipedia.org/wiki/Universal_Character_Set" title="Universal Character Set">ISO 10646</a> standard contained a non-required <a href="http://en.wikipedia.org/wiki/Addendum" title="Addendum">annex</a> called <a href="http://en.wikipedia.org/wiki/UTF-1" title="UTF-1">UTF-1</a>
+ that provided a byte-stream encoding of its 32-bit code points. This
+encoding was not satisfactory on performance grounds, but did introduce
+the notion that bytes in the range of 0–127 continue representing the
+ASCII characters in UTF, thereby providing backward compatibility with
+ASCII.</p>
+<p>In July 1992, the <a href="http://en.wikipedia.org/wiki/X/Open" title="X/Open">X/Open</a> committee XoJIG was looking for a better encoding. Dave Prosser of <a href="http://en.wikipedia.org/wiki/Unix_System_Laboratories" title="Unix System Laboratories">Unix System Laboratories</a>
+ submitted a proposal for one that had faster implementation
+characteristics and introduced the improvement that 7-bit ASCII
+characters would <i>only</i> represent themselves; all multibyte
+sequences would include only bytes where the high bit was set. This
+original proposal, FSS-UTF (File System Safe UCS Transformation Format),
+ was similar in concept to UTF-8, but lacked the crucial property of <a href="http://en.wikipedia.org/wiki/Self-synchronizing_code" title="Self-synchronizing code">self-synchronization</a>.<sup id="cite_ref-pikeviacambridge_7-0" class="reference"><a href="#cite_note-pikeviacambridge-7"><span>[</span>7<span>]</span></a></sup><sup id="cite_ref-8" class="reference"><a href="#cite_note-8"><span>[</span>8<span>]</span></a></sup></p>
+<p>In August 1992, this proposal was circulated by an <a href="http://en.wikipedia.org/wiki/IBM" title="IBM">IBM</a> X/Open representative to interested parties. <a href="http://en.wikipedia.org/wiki/Ken_Thompson" title="Ken Thompson">Ken Thompson</a> of the <a href="http://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs" title="Plan 9 from Bell Labs">Plan 9</a> <a href="http://en.wikipedia.org/wiki/Operating_system" title="Operating system">operating system</a> group at <a href="http://en.wikipedia.org/wiki/Bell_Labs" title="Bell Labs">Bell Labs</a>
+ then made a small but crucial modification to the encoding, making it
+very slightly less bit-efficient than the previous proposal but allowing
+ it to be <a href="http://en.wikipedia.org/wiki/Self-synchronizing_code" title="Self-synchronizing code">self-synchronizing</a>,
+ meaning that it was no longer necessary to read from the beginning of
+the string to find code point boundaries. Thompson's design was outlined
+ on September 2, 1992, on a placemat in a New Jersey diner with <a href="http://en.wikipedia.org/wiki/Rob_Pike" title="Rob Pike">Rob Pike</a>. In the following days, Pike and Thompson implemented it and updated <a href="http://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs" title="Plan 9 from Bell Labs">Plan 9</a> to use it throughout, and then communicated their success back to X/Open.<sup id="cite_ref-pikeviacambridge_7-1" class="reference"><a href="#cite_note-pikeviacambridge-7"><span>[</span>7<span>]</span></a></sup></p>
+<p>UTF-8 was first officially presented at the <a href="http://en.wikipedia.org/wiki/USENIX" title="USENIX">USENIX</a> conference in <a href="http://en.wikipedia.org/wiki/San_Diego" title="San Diego">San Diego</a>, from January 25 to 29, 1993.</p>
+<p>Google reported that in 2008 UTF-8 (misleadingly labelled "Unicode") became the most common encoding for HTML files.<sup id="cite_ref-markdavis_9-0" class="reference"><a href="#cite_note-markdavis-9"><span>[</span>9<span>]</span></a></sup><sup id="cite_ref-davidgoodger_10-0" class="reference"><a href="#cite_note-davidgoodger-10"><span>[</span>10<span>]</span></a></sup></p>
+<h2><span class="mw-headline" id="Description">Description</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="http://en.wikipedia.org/w/index.php?title=UTF-8&amp;action=edit&amp;section=2" title="Edit section: Description">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
diff --git a/data/bench/small-fragment.html b/data/bench/small-fragment.html
new file mode 100644
index 0000000..a0b9643
--- /dev/null
+++ b/data/bench/small-fragment.html
@@ -0,0 +1,7 @@
+<p>In July 1992, the <a href="http://en.wikipedia.org/wiki/X/Open" title="X/Open">X/Open</a> committee XoJIG was looking for a better encoding. Dave Prosser of <a href="http://en.wikipedia.org/wiki/Unix_System_Laboratories" title="Unix System Laboratories">Unix System Laboratories</a>
+ submitted a proposal for one that had faster implementation
+characteristics and introduced the improvement that 7-bit ASCII
+characters would <i>only</i> represent themselves; all multibyte
+sequences would include only bytes where the high bit was set. This
+original proposal, FSS-UTF (File System Safe UCS Transformation Format),
+ was similar in concept to UTF-8, but lacked the crucial property of <a href="http://en.wikipedia.org/wiki/Self-synchronizing_code" title="Self-synchronizing code">self-synchronization</a>.
diff --git a/data/bench/strong.html b/data/bench/strong.html
new file mode 100644
index 0000000..0ef665e
--- /dev/null
+++ b/data/bench/strong.html
@@ -0,0 +1 @@
+<strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong><strong> \ No newline at end of file
diff --git a/data/bench/tiny-fragment.html b/data/bench/tiny-fragment.html
new file mode 100644
index 0000000..7ce5354
--- /dev/null
+++ b/data/bench/tiny-fragment.html
@@ -0,0 +1 @@
+<p>Hello, world!</p>