[webkit-dev] Should we create an 8-bit path from the network stack to the parser?

Adam Barth abarth at webkit.org
Sat Mar 9 15:05:12 PST 2013

On Sat, Mar 9, 2013 at 12:48 PM, Luis de Bethencourt
<luis at debethencourt.com> wrote:
> On Mar 7, 2013 10:37 PM, "Brady Eidson" <beidson at apple.com> wrote:
>> > On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff <msaboff at apple.com>
>> > wrote:
>> >> The various tokenizers / lexers handle LChar versus UChar input
>> >> streams in different ways.  Most of the other tokenizers are templatized on
>> >> input character type. In the case of HTML, the tokenizer handles one UChar
>> >> character at a time.  For 8-bit input streams, the zero extension of an LChar
>> >> to a UChar is zero cost.  There may be additional performance to be gained
>> >> by doing all other possible handling in 8 bits, but an 8-bit stream can
>> >> still contain escapes that need a UChar representation as you point out.
>> >> Using a character type template approach was deemed to be too unwieldy for
>> >> the HTML tokenizer.  The HTML tokenizer uses SegmentedStrings that can
>> >> consist of substrings of either LChar or UChar.  That is where the LChar
>> >> to UChar zero extension happens for an 8-bit substring.
>> >>
>> >> My research at the time showed that there were very few
>> >> UTF-16 only resources (<<5% IIRC), although I expect the number to grow.
>> On Mar 7, 2013, at 2:16 PM, Adam Barth <abarth at webkit.org> wrote:
>> > Yes, I understand how the HTML tokenizer works.  :)
>> I didn't understand these details, and I really appreciate Michael
>> describing them.  I'm also glad others on the mailing list had an
>> opportunity to get something out of this.
> I agree with Brady. I got some interesting learning out of this thread.
> Always nice to read explanations and documentation about how things work.
> Valuable content.

In retrospect, I think what I was reacting to was msaboff's statement
that an unnamed group of people had decided that the HTML tokenizer
was too unwieldy to have a dedicated 8-bit path.  In particular, it's
unclear to me who made that decision.  I certainly do not consider the
matter decided.
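
For readers following the thread, the zero extension Michael describes can
be sketched roughly as below. This is only an illustration under stated
assumptions, not WebKit's actual SegmentedString: the names SegmentedSource,
Substring, append, advance, and drain are hypothetical, and LChar and UChar
are modeled as plain 8-bit and 16-bit unsigned types. The point is that the
consumer always sees UChar, whichever width the current substring stores.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the character types discussed in the thread.
using LChar = std::uint8_t;  // 8-bit code unit
using UChar = char16_t;      // 16-bit code unit

// Each substring stores either 8-bit or 16-bit data, never a mix.
using Substring = std::variant<std::vector<LChar>, std::vector<UChar>>;

// Illustrative queue of substrings that hands out one UChar at a time.
class SegmentedSource {
public:
    void append(Substring s)
    {
        if (length(s) > 0)  // skip empty substrings so front() is always readable
            m_substrings.push_back(std::move(s));
    }

    bool isEmpty() const { return m_substrings.empty(); }

    // Returns the next code unit as a UChar. Precondition: !isEmpty().
    UChar advance()
    {
        Substring& front = m_substrings.front();
        UChar c;
        if (auto* narrow = std::get_if<std::vector<LChar>>(&front))
            c = static_cast<UChar>((*narrow)[m_offset]);  // LChar -> UChar zero extension
        else
            c = std::get<std::vector<UChar>>(front)[m_offset];

        if (++m_offset == length(front)) {  // substring exhausted, move to the next one
            m_substrings.pop_front();
            m_offset = 0;
        }
        return c;
    }

private:
    static std::size_t length(const Substring& s)
    {
        return std::visit([](const auto& v) { return v.size(); }, s);
    }

    std::deque<Substring> m_substrings;
    std::size_t m_offset = 0;
};

// Helper for the sketch: reads everything out as 16-bit code units.
std::u16string drain(SegmentedSource& source)
{
    std::u16string out;
    while (!source.isEmpty())
        out.push_back(source.advance());
    return out;
}
```

Appending an 8-bit substring followed by a 16-bit one and then calling
drain() yields a single 16-bit sequence, with each 8-bit unit widened by
filling the upper byte with zeros.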

