[webkit-dev] Should we create an 8-bit path from the network stack to the parser?

Mon Mar 11 09:59:32 PDT 2013

No complaints with the long term direction.  I agree that it is a tall order to implement.

- Michael

On Mar 11, 2013, at 9:54 AM, Adam Barth <abarth at webkit.org> wrote:

> Oh, Ok.  I misunderstood your original message to say that the project
> as a whole had reached this conclusion, which certainly isn't the
> case, rather than that you personally had reached that conclusion.
> 
> As for the long-term direction of the HTML parser, my guess is that
> the optimum design will be to deliver the network bytes to the parser
> directly on the parser thread.  On the parser thread, we can merge
> charset decoding, input stream pre-processing, and tokenization to
> move directly from network bytes to CompactHTMLTokens.  That approach
> removes a number of copies, 8-bit-to-16-bit, and 16-bit-to-8-bit
> conversions.  Parsing directly into CompactHTMLTokens also means we
> won't have to do any copies or conversions at all for well-known
> strings (e.g., "div" and friends from HTMLNames).
> 
> If you're about to reply complaining about the above, please save your
> complaints for another time.  I realize that some parts of that design
> will be difficult or impossible to implement on some ports due to
> limitations on how they interact with their networking stack.  In any
> case, I don't plan to implement that design anytime soon, and I'm sure
> we'll have plenty of time to discuss its merits in the future.
> 
> Adam
> 
> 
> On Mon, Mar 11, 2013 at 8:56 AM, Michael Saboff <msaboff at apple.com> wrote:
>> Maciej,
>> 
>> *I* deemed using a character type template for the HTMLTokenizer as being
>> unwieldy.  Given there was the existing SegmentedString input abstraction,
>> it made logical sense to put the 8/16 bit coding there.  If I had moved
>> the 8/16 logic into the tokenizer itself, we might have needed to do
>> 8->16 up conversions when a SegmentedString had mixed bit-ness in the
>> contained substrings.  Even if that wasn't the case, the patch would have
>> been far larger and likely include tricky code for escapes.
>> 
>> As I got into the middle of the 8-bit strings, I realized that not only
>> could I keep performance parity, but some of the techniques I came up with
>> offered good performance improvement.  The HTMLTokenizer ended up being one
>> of those cases.  This patch required a couple of reworks for performance
>> reasons and garnered a lot of discussion from various parts of the WebKit
>> community.  See https://bugs.webkit.org/show_bug.cgi?id=90321 for the trail.
>> Ryosuke noted that this patch was responsible for a 24% improvement in the
>> url-parser test in their bots (comment 47).  My final performance results
>> are in comment 43 and show between 1 and 9% progression on the various HTML
>> parser tests.
>> 
>> Adam, if you believe there is more work to be done in the HTMLTokenizer,
>> file a bug and cc me.  I'm interested in hearing your thoughts.
>> 
>> - Michael
>> 
>> On Mar 9, 2013, at 4:24 PM, Maciej Stachowiak <mjs at apple.com> wrote:
>> 
>> 
>> On Mar 9, 2013, at 3:05 PM, Adam Barth <abarth at webkit.org> wrote:
>> 
>> 
>> In retrospect, I think what I was reacting to was msaboff's statement
>> that an unnamed group of people had decided that the HTML tokenizer
>> was too unwieldy to have a dedicated 8-bit path.  In particular, it's
>> unclear to me who made that decision.  I certainly do not consider the
>> matter decided.
>> 
>> 
>> It would be good to find out who it was that said that (or more
>> specifically: "Using a character type template approach was deemed to be too
>> unwieldy for the HTML tokenizer.") so you can talk to them about it.
>> 
>> Michael?
>> 
>> Regards,
>> Maciej
>> 
>>
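
For readers following the thread, the two designs weighed above can be
sketched roughly as follows.  This is only an illustration under stated
assumptions: Segment, MixedInput, countTagOpens and
countTagOpensTemplated are hypothetical names, and none of this is
WebKit's actual SegmentedString or HTMLTokenizer code.  Approach A
templates the scanning loop on the character type; approach B keeps the
8/16-bit distinction inside the input abstraction so the scanner has a
single code path even when substrings of different widths are mixed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in code unit types for this sketch (assumptions, not WebKit's headers).
using LChar = std::uint8_t; // 8-bit (Latin-1) code unit
using UChar = char16_t;     // 16-bit (UTF-16) code unit

// Approach A: the scanner is templated on the character type, so the
// compiler emits one copy of the loop per width.  The caller must already
// know the width of the whole buffer.
template <typename CharType>
std::size_t countTagOpensTemplated(const CharType* data, std::size_t length)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (data[i] == static_cast<CharType>('<'))
            ++count;
    }
    return count;
}

// Approach B: each substring remembers its own width and widens 8-bit
// code units on read, so callers only ever see UChar.
struct Segment {
    std::vector<LChar> narrow; // used when is8Bit is true
    std::u16string wide;       // used when is8Bit is false
    bool is8Bit;

    std::size_t length() const { return is8Bit ? narrow.size() : wide.size(); }
    UChar at(std::size_t i) const
    {
        return is8Bit ? static_cast<UChar>(narrow[i]) : wide[i];
    }
};

// Hypothetical input abstraction that may hold substrings of mixed width.
struct MixedInput {
    std::vector<Segment> segments;

    void append8(const std::string& s)
    {
        segments.push_back({ std::vector<LChar>(s.begin(), s.end()), std::u16string(), true });
    }
    void append16(const std::u16string& s)
    {
        segments.push_back({ std::vector<LChar>(), s, false });
    }
};

// Single, non-templated scanner: the width check lives in Segment::at(),
// so mixed input needs no up-front 8->16 conversion of whole substrings.
std::size_t countTagOpens(const MixedInput& input)
{
    std::size_t count = 0;
    for (const Segment& segment : input.segments) {
        for (std::size_t i = 0; i < segment.length(); ++i) {
            if (segment.at(i) == u'<')
                ++count;
        }
    }
    return count;
}
```

The sketch shows only the shape of the trade-off: approach A avoids a
per-character width check but needs a uniform-width buffer, while
approach B pays for that check on every read and in exchange handles
mixed-width input with one scanner.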