[webkit-dev] HTML5 tokenizer landing soon

Mon Jun 14 15:57:14 PDT 2010

On Mon, Jun 14, 2010 at 3:38 PM, Geoffrey Garen <ggaren at apple.com> wrote:
>> On Mon, Jun 14, 2010 at 12:44 PM, Geoffrey Garen <ggaren at apple.com> wrote:
>>> Measurements like this are more valuable.
>>>
>>> Not all HTML on the web is like the HTML in the HTML5 spec, though. Am I
>>> right that the parser test you're using doesn't test invalid HTML at all?
>>
>> If you have another corpus you'd like to include in the benchmark, feel
>> free to either add it yourself or send me a link.  We need a large
>> document to make the code spend a measurable amount of time in the
>> parser, and we also need a document under a WebKit-compatible license.
>
> http://bclary.com/2004/11/07/ could be helpful.
>
> Its license seems compatible:
>
>> According to Ecma formal publications,
>>
>>> Ecma Standards and Technical Reports are made available to all interested persons or organizations, free of charge and copyright, in printed form and, as files in Acrobat ® PDF format.
>>
>> This version was created by Bob Clary and includes errata. It is released under the same terms as the original ECMAScript Language Specification and is free of charge and copyright.
>
> However it, too, is valid XHTML 1.0 Transitional. Where's some invalid HTML when you need it?! :)

Thanks.  That document looks machine generated, which isn't great
because it will only hit a very narrow slice of the parser.  The HTML5
spec, at least, was authored by hand.

> Here are two suggestions for performance-testing invalid HTML before turning on the new parser:
>
> 1. Write a fuzzer-like script to invalidate these valid HTML files by leaving off some closing tags, adding invalid parent nodes, removing the <head> tag and putting its contents directly in the <body>, and things like that.

That might be worth doing when we're ready to work on the tree
constructor.  At the moment, we're re-using the old tree constructor,
so these code paths aren't changing.  What is changing is the
tokenizer, so we'd need some token-level modifications.
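
As a rough illustration of what "token-level modifications" could mean, here
is a minimal Python sketch that damages individual tags in an otherwise valid
document. Everything in it (the function names, the set of mutations, the
rate and seed parameters) is hypothetical and not part of any WebKit tool. The
regular expression is a deliberately crude approximation of a tag, not a real
HTML tokenizer.

```python
import random
import re

# Hypothetical token-level mutations. Each takes the text of one tag
# (e.g. '<p class="a">') and returns a malformed variant that a tokenizer
# has to recover from.
def drop_closing_bracket(tag):
    return tag[:-1]                 # '<p class="a"'

def unquote_attributes(tag):
    return tag.replace('"', '')     # '<p class=a>'

def duplicate_opening_bracket(tag):
    return '<' + tag                # '<<p class="a">'

def uppercase_tag(tag):
    return tag.upper()              # '<P CLASS="A">'

MUTATIONS = [
    drop_closing_bracket,
    unquote_attributes,
    duplicate_opening_bracket,
    uppercase_tag,
]

# Crude approximation of a tag: '<', then anything except '<' or '>', then '>'.
TAG_RE = re.compile(r'<[^<>]+>')

def mutate_html(html, rate=0.05, seed=0):
    """Return html with roughly `rate` of its tags damaged.

    A fixed seed keeps the output reproducible, so the mutated file can be
    used as a stable benchmark input.
    """
    rng = random.Random(seed)

    def maybe_mutate(match):
        tag = match.group(0)
        if rng.random() >= rate:
            return tag
        return rng.choice(MUTATIONS)(tag)

    return TAG_RE.sub(maybe_mutate, html)
```

Running something like this once over a large valid document and saving the
result would give a fixed input, so the same damaged markup is measured on
every run.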

> 2. When you're ready, ask someone at Apple to try the PLT with the new parser. It's not redistributable, but there's plenty of invalid HTML in it.

Thanks, I appreciate the offer.  Rather than bother someone at Apple,
another option is to land the change and make use of the public PLT
benchmarks that the Chromium project runs on each WebKit checkin:

http://build.chromium.org/buildbot/perf/linux-release-webkit-latest/moz/report.html?history=150
http://build.chromium.org/buildbot/perf/linux-release-webkit-latest/morejs/report.html?history=150
http://build.chromium.org/buildbot/perf/linux-release-webkit-latest/intl1/report.html?history=150
http://build.chromium.org/buildbot/perf/linux-release-webkit-latest/intl2/report.html?history=150

There are similar graphs for the other platforms.  I've often wished
build.webkit.org had similar graphs so we could keep more detailed
history about PLT regressions.

Adam