[webkit-dev] HTML5 tokenizer landing soon

Mon Jun 14 13:48:02 PDT 2010

On Mon, Jun 14, 2010 at 12:44 PM, Geoffrey Garen <ggaren at apple.com> wrote:
> Measurements like this are more valuable.
>
> Not all HTML on the web is like the HTML in the HTML5 spec, though. Am I
> right that the parser test you're using doesn't test invalid HTML at all?

If you have another corpus you'd like to include in the benchmark, feel
free to either add it yourself or send me a link.  We need a large
document to make the code spend a measurable amount of time in the
parser, and we also need a document under a WebKit-compatible license.

On Mon, Jun 14, 2010 at 12:58 PM, Mike Marchywka <marchywka at hotmail.com> wrote:
> I'm starting to fear that the next blink of my disk light will cause me
> to go into a fit. One thing you can consider right away is,
> "plays nice with the other kids on a variety of playground equipment.."
> That is, it may be great when it has unlimited memory but does
> it start thrashing as soon as part of it is in VM. Not
> sure how to test this entirely but this is such a huge problem I
> just thought I would mention it again. Essentially it
> comes down to memory coherence.

I believe the new tokenizer has a similar memory footprint to the old
code, but I don't have a good way to measure that.  The bulk of the
memory is used by the "data buffer," which is about 2k bytes in both.

On Mon, Jun 14, 2010 at 1:06 PM, David Hyatt <hyatt at apple.com> wrote:
> I really do consider the current code to be "barely hackable," so any new
> code that follows the HTML5 spec (especially one that has a document.write /
> pending script model that is easier to understand) is a huge win in my book.

We ended up using the same algorithm as the old tokenizer to manage
insertion points; however, we moved all the work into a separate
InputStream data structure:

http://trac.webkit.org/browser/trunk/WebCore/html/HTML5DocumentParser.h#L75

The old code was actually pretty clever once I figured out what it was
doing.  We're considering moving InputStream into its own file instead
of keeping it as an inner class of the document parser.
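
To make the insertion-point idea concrete, here is a minimal sketch of
an input stream that keeps network data and script-written text in
order.  It is not the code at the link above: SketchInputStream, its
method names, the drain() helper, and the single-string buffer are all
hypothetical, chosen only for illustration.  It also ignores nested
scripts, which would need to save and restore the insertion point.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, simplified illustration of an input stream with an
// insertion point. Not WebKit's InputStream.
class SketchInputStream {
public:
    // Data arriving from the network always goes at the very end.
    void appendToEnd(const std::string& text) { m_buffer += text; }

    // Text written by a script (e.g. via document.write) goes at the
    // insertion point, ahead of unconsumed network data. The point then
    // moves past the new text so consecutive writes keep their order.
    void insertAtInsertionPoint(const std::string& text)
    {
        m_buffer.insert(m_insertionPoint, text);
        m_insertionPoint += text.size();
    }

    // Called when a script starts running: script output should land
    // just before the next unconsumed character.
    void resetInsertionPoint() { m_insertionPoint = m_position; }

    bool atEnd() const { return m_position >= m_buffer.size(); }

    // Consume and return the next character.
    char advance()
    {
        assert(!atEnd());
        char c = m_buffer[m_position++];
        // The insertion point never falls behind the read position.
        if (m_insertionPoint < m_position)
            m_insertionPoint = m_position;
        return c;
    }

private:
    std::string m_buffer;             // consumed + unconsumed characters
    std::size_t m_position = 0;       // index of next character to read
    std::size_t m_insertionPoint = 0; // where script-written text goes
};

// Read everything that is left in the stream.
std::string drain(SketchInputStream& stream)
{
    std::string out;
    while (!stream.atEnd())
        out += stream.advance();
    return out;
}
```

In this sketch, if the tokenizer has consumed "<p>" from
"<p>network</p>" and a script then writes "A" followed by "B", the
remaining input reads "ABnetwork</p>": the written text comes first,
in write order, and later network data still lands at the end.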

Adam