[webkit-dev] Should we create an 8-bit path from the network stack to the parser?

Mon Mar 11 09:54:17 PDT 2013

Oh, OK.  I misunderstood your original message to say that the project
as a whole had reached this conclusion, which certainly isn't the
case, rather than that you personally had reached that conclusion.

As for the long-term direction of the HTML parser, my guess is that
the optimum design will be to deliver the network bytes to the parser
directly on the parser thread.  On the parser thread, we can merge
charset decoding, input stream pre-processing, and tokenization to
move directly from network bytes to CompactHTMLTokens.  That approach
removes a number of copies as well as 8-bit-to-16-bit and 16-bit-to-8-bit
conversions.  Parsing directly into CompactHTMLTokens also means we
won't have to do any copies or conversions at all for well-known
strings (e.g., "div" and friends from HTMLNames).
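
To make the idea concrete, here is a minimal sketch of going from raw
bytes to compact tokens in one pass.  It is not WebKit code: TagId,
CompactToken, lookupTag, and tokenizeBytes are hypothetical names, the
input is assumed to be ASCII-compatible 8-bit data, and attributes,
entities, comments, and real charset decoding are all left out.  The
point it illustrates is that a well-known tag name becomes a small
integer ID, so no string is copied or widened for it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for well-known names such as those in HTMLNames.
enum class TagId : std::uint8_t { Unknown, Div, P, Span };

// Hypothetical compact token; not WebKit's actual CompactHTMLToken.
struct CompactToken {
    enum class Type : std::uint8_t { StartTag, EndTag, Text } type;
    TagId tag;        // set for well-known tag names, Unknown otherwise
    std::string data; // text content, or the name of an unknown tag
};

// Maps a lowercased tag name to its ID without allocating.
inline TagId lookupTag(std::string_view name)
{
    if (name == "div")
        return TagId::Div;
    if (name == "p")
        return TagId::P;
    if (name == "span")
        return TagId::Span;
    return TagId::Unknown;
}

// Tokenizes 8-bit input in a single pass, never widening to 16 bits.
// Assumption: bytes are ASCII-compatible and tags carry no attributes.
inline std::vector<CompactToken> tokenizeBytes(const std::uint8_t* bytes, std::size_t length)
{
    assert(bytes || !length);
    std::vector<CompactToken> tokens;
    std::size_t i = 0;
    while (i < length) {
        if (bytes[i] == '<') {
            std::size_t j = i + 1;
            const bool isEndTag = j < length && bytes[j] == '/';
            if (isEndTag)
                ++j;
            std::string name;
            while (j < length && bytes[j] != '>') {
                std::uint8_t c = bytes[j++];
                if (c >= 'A' && c <= 'Z') // ASCII lowercase on the fly
                    c = static_cast<std::uint8_t>(c + ('a' - 'A'));
                name.push_back(static_cast<char>(c));
            }
            if (j < length)
                ++j; // skip '>'
            const TagId id = lookupTag(name);
            tokens.push_back({ isEndTag ? CompactToken::Type::EndTag : CompactToken::Type::StartTag,
                id, id == TagId::Unknown ? name : std::string() });
            i = j;
        } else {
            std::size_t j = i;
            while (j < length && bytes[j] != '<')
                ++j;
            tokens.push_back({ CompactToken::Type::Text, TagId::Unknown,
                std::string(reinterpret_cast<const char*>(bytes + i), j - i) });
            i = j;
        }
    }
    return tokens;
}
```

Feeding it the bytes of "<DIV>hi</div>" yields a start tag and an end
tag that both carry TagId::Div with an empty data string; only the text
run "hi" is copied.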

If you're about to reply complaining about the above, please save your
complaints for another time.  I realize that some parts of that design
will be difficult or impossible to implement on some ports due to
limitations on how they interact with their networking stack.  In any
case, I don't plan to implement that design anytime soon, and I'm sure
we'll have plenty of time to discuss its merits in the future.

Adam

On Mon, Mar 11, 2013 at 8:56 AM, Michael Saboff <msaboff at apple.com> wrote:
> Maciej,
>
> *I* deemed using a character type template for the HTMLTokenizer as being
> unwieldy.  Given there was the existing SegmentedString input abstraction,
> it made logical sense to put the 8/16 bit coding there.  If I had moved
> the 8/16 logic into the tokenizer itself, we might have needed to do
> 8->16 up conversions when a SegmentedString had mixed bit-ness in the
> contained substrings.  Even if that wasn't the case, the patch would have
> been far larger and would likely have included tricky code for escapes.
>
> As I got into the middle of the 8-bit strings, I realized that not only
> could I keep performance parity, but some of the techniques I came up with
> offered good performance improvement.  The HTMLTokenizer ended up being one
> of those cases.  This patch required a couple of reworks for performance
> reasons and garnered a lot of discussion from various parts of the WebKit
> community.  See https://bugs.webkit.org/show_bug.cgi?id=90321 for the trail.
> Ryosuke noted that this patch was responsible for a 24% improvement in the
> url-parser test on their bots (comment 47).  My final performance results
> are in comment 43 and show a progression of between 1% and 9% on the various
> HTML parser tests.
>
> Adam, if you believe there is more work to be done in the HTMLTokenizer,
> file a bug and cc me.  I'm interested in hearing your thoughts.
>
> - Michael
>
> On Mar 9, 2013, at 4:24 PM, Maciej Stachowiak <mjs at apple.com> wrote:
>
>
> On Mar 9, 2013, at 3:05 PM, Adam Barth <abarth at webkit.org> wrote:
>
>
> In retrospect, I think what I was reacting to was msaboff's statement
> that an unnamed group of people had decided that the HTML tokenizer
> was too unwieldy to have a dedicated 8-bit path.  In particular, it's
> unclear to me who made that decision.  I certainly do not consider the
> matter decided.
>
>
> It would be good to find out who it was that said that (or more
> specifically: "Using a character type template approach was deemed to be too
> unwieldy for the HTML tokenizer.") so you can talk to them about it.
>
> Michael?
>
> Regards,
> Maciej
>
>