[webkit-dev] HTML5 tokenizer landing soon

Geoffrey Garen ggaren at apple.com
Mon Jun 14 15:38:19 PDT 2010

> On Mon, Jun 14, 2010 at 12:44 PM, Geoffrey Garen <ggaren at apple.com> wrote:
>> Measurements like this are more valuable.
>> Not all HTML on the web is like the HTML in the HTML5 spec, though. Am I
>> right that the parser test you're using doesn't test invalid HTML at all?
> If you have another corpus you'd like to include in the benchmark, feel
> free to either add it yourself or send me a link.  We need a large
> document to make the code spend a measurable amount of time in the
> parser, and we also need a document under a WebKit-compatible license.

http://bclary.com/2004/11/07/ could be helpful.

Its license seems compatible:

> According to Ecma formal publications,
>> Ecma Standards and Technical Reports are made available to all interested persons or organizations, free of charge and copyright, in printed form and, as files in Acrobat ® PDF format.
> This version was created by Bob Clary and includes errata. It is released under the same terms as the original ECMAScript Language Specification and is free of charge and copyright.

However, it too is valid XHTML 1.0 Transitional. Where's some invalid HTML when you need it?! :)

Here are two suggestions for performance-testing invalid HTML before turning on the new parser:

1. Write a fuzzer-like script to invalidate these valid HTML files by leaving off some closing tags, adding invalid parent nodes, removing the <head> tag and putting its contents directly in the <body>, and things like that.

2. When you're ready, ask someone at Apple to try the PLT with the new parser. It's not redistributable, but there's plenty of invalid HTML in it.
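
A minimal sketch of what the script in suggestion 1 might look like, in Python. Everything here is an illustrative assumption rather than existing WebKit tooling: the function names, the regexes, and the `rate` and `seed` parameters are all hypothetical. It works on the markup as plain text with regular expressions, which is crude but acceptable here because the goal is to produce broken markup, not to parse it correctly.

```python
import random
import re

# Hypothetical patterns; a real corpus may need broader ones.
CLOSE_TAG = re.compile(r"</[a-zA-Z][a-zA-Z0-9]*\s*>")
BLOCK_OPEN = re.compile(r"<(?:p|div)\b[^>]*>", re.IGNORECASE)
HEAD = re.compile(r"<head\b[^>]*>(.*?)</head\s*>", re.IGNORECASE | re.DOTALL)
BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)


def drop_closing_tags(html, rng, rate):
    """Delete each closing tag with probability `rate`."""
    def maybe_drop(match):
        return "" if rng.random() < rate else match.group(0)
    return CLOSE_TAG.sub(maybe_drop, html)


def add_invalid_parents(html, rng, rate):
    """With probability `rate`, open an unclosed <span> in front of a
    <p> or <div>, so the block element ends up inside an inline parent."""
    def maybe_wrap(match):
        tag = match.group(0)
        return "<span>" + tag if rng.random() < rate else tag
    return BLOCK_OPEN.sub(maybe_wrap, html)


def move_head_into_body(html):
    """Remove the <head> element and put its contents at the start of <body>."""
    head = HEAD.search(html)
    if head is None:
        return html
    without_head = html[:head.start()] + html[head.end():]
    body = BODY_OPEN.search(without_head)
    if body is None:
        return html  # no <body> to move the contents into; leave input alone
    return (without_head[:body.end()] + head.group(1)
            + without_head[body.end():])


def invalidate(html, seed=0, rate=0.2):
    """Apply all three mutations; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    html = move_head_into_body(html)
    html = add_invalid_parents(html, rng, rate)
    return drop_closing_tags(html, rng, rate)
```

Seeding the generator matters for a benchmark: the same seed and input always yield the same broken document, so timings from different parser builds stay comparable.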
