[webkit-dev] Should we create an 8-bit path from the network stack to the parser?

Thu Mar 7 14:37:37 PST 2013

> On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff <msaboff at apple.com> wrote:
>> The various tokenizers / lexers handle LChar versus UChar input streams in various ways.  Most of the other tokenizers are templatized on input character type.  In the case of HTML, the tokenizer handles a UChar character at a time.  For 8 bit input streams, the zero extension of an LChar to a UChar is zero cost.  There may be additional performance to be gained by doing all other possible handling in 8 bits, but an 8 bit stream can still contain escapes that need a UChar representation, as you point out.  Using a character type template approach was deemed to be too unwieldy for the HTML tokenizer.  The HTML tokenizer uses SegmentedStrings that can consist of sub strings of either LChar or UChar.  That is where the LChar to UChar zero extension happens for an 8 bit sub string.
>> 
>> My research at the time showed that there were very few UTF-16 only resources (<<5% IIRC), although I expect the number to grow.
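
For readers following along, the zero extension described above can be sketched roughly as follows. This is a minimal illustration only, not WebKit's actual SegmentedString: the SegmentedInput class, its methods, and consumeAll are hypothetical names made up for this example, and it assumes C++17 for std::variant. The point it tries to show is that a consumer can read one UChar at a time while each sub string stays in whatever width it arrived in.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

using LChar = unsigned char; // 8-bit code unit
using UChar = char16_t;      // 16-bit code unit

// Hypothetical stand-in for a segmented input stream; not WebKit's class.
class SegmentedInput {
public:
    // Empty sub strings are skipped so current() always has a character.
    void append(std::vector<LChar> s) { if (!s.empty()) m_segments.emplace_back(std::move(s)); }
    void append(std::u16string s) { if (!s.empty()) m_segments.emplace_back(std::move(s)); }

    bool atEnd() const { return m_segment == m_segments.size(); }

    // The consumer always sees a UChar, whatever the sub string's width.
    UChar current() const
    {
        const auto& segment = m_segments[m_segment];
        if (const auto* narrow = std::get_if<std::vector<LChar>>(&segment))
            return static_cast<UChar>((*narrow)[m_offset]); // zero extension
        return std::get<std::u16string>(segment)[m_offset];
    }

    // Steps to the next character, moving to the next sub string if needed.
    void advance()
    {
        const auto& segment = m_segments[m_segment];
        std::size_t length = std::visit(
            [](const auto& s) { return static_cast<std::size_t>(s.size()); }, segment);
        if (++m_offset == length) {
            ++m_segment;
            m_offset = 0;
        }
    }

private:
    std::vector<std::variant<std::vector<LChar>, std::u16string>> m_segments;
    std::size_t m_segment = 0;
    std::size_t m_offset = 0;
};

// Drains the stream one UChar at a time, as a tokenizer loop might.
std::u16string consumeAll(SegmentedInput& input)
{
    std::u16string out;
    for (; !input.atEnd(); input.advance())
        out.push_back(input.current());
    return out;
}
```

In this sketch the 8 bit data is never copied into a 16 bit buffer up front; widening happens only at the point where a character is handed to the consumer.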

On Mar 7, 2013, at 2:16 PM, Adam Barth <abarth at webkit.org> wrote:
> Yes, I understand how the HTML tokenizer works.  :)

I didn't understand these details, and I really appreciate Michael describing them.  I'm also glad others on the mailing list had an opportunity to get something out of this.

~Brady