[webkit-dev] Should we create an 8-bit path from the network stack to the parser?

Michael Saboff msaboff at apple.com
Thu Mar 7 14:14:30 PST 2013

The various tokenizers / lexers handle LChar versus UChar input streams in different ways.  Most of the other tokenizers are templatized on the input character type.  In the case of HTML, the tokenizer handles one UChar at a time.  For 8-bit input streams, the zero extension of an LChar to a UChar is zero cost.  There may be additional performance to be gained by doing all other possible handling in 8 bits, but an 8-bit stream can still contain escapes that need a UChar representation, as you point out.  A character-type template approach was deemed too unwieldy for the HTML tokenizer.  Instead, the HTML tokenizer uses SegmentedStrings, which can consist of substrings of either LChar or UChar.  That is where the LChar-to-UChar zero extension happens for an 8-bit substring.
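As a rough sketch of the mixed-substring design described above (hypothetical types standing in for WTF's LChar/UChar and SegmentedString, not WebKit's actual classes), a consumer can pull UChars out of a sequence of 8-bit and 16-bit segments, with the 8-bit case widening by zero extension:

```cpp
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for WTF's code-unit types.
using LChar = uint8_t;   // Latin-1 code unit
using UChar = char16_t;  // UTF-16 code unit

// Simplified stand-in for SegmentedString: a run of substrings,
// each stored either 8-bit or 16-bit.
struct Segment {
    std::variant<std::vector<LChar>, std::vector<UChar>> data;
};

// Consume the stream one UChar at a time; for 8-bit segments the
// LChar -> UChar widening is just a zero extension.
std::u16string consume(const std::vector<Segment>& segments) {
    std::u16string out;
    for (const auto& seg : segments) {
        if (auto* eightBit = std::get_if<std::vector<LChar>>(&seg.data)) {
            for (LChar c : *eightBit)
                out.push_back(static_cast<UChar>(c)); // zero extension
        } else {
            const auto& sixteenBit = std::get<std::vector<UChar>>(seg.data);
            out.append(sixteenBit.begin(), sixteenBit.end());
        }
    }
    return out;
}
```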

My research at the time showed that there were very few UTF-16-only resources (<<5% IIRC), although I expect the number to grow.

- Michael

On Mar 7, 2013, at 11:11 AM, Adam Barth <abarth at webkit.org> wrote:

> The HTMLTokenizer still works in UChars.  There's likely some
> performance to be gained by moving it to an 8-bit character type.
> There's some trickiness involved because HTML entities can expand to
> characters outside of Latin-1. Also, it's unclear if we want two
> tokenizers (one that's 8 bits wide and another that's 16 bits wide) or
> if we should find a way for the 8-bit tokenizer to handle, for
> example, UTF-16 encoded network responses.
> Adam
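The character-type-templated approach used by the other tokenizers can be sketched as follows (hypothetical names and a toy whitespace tokenizer for illustration, not the actual HTMLTokenizer API); one implementation is instantiated per code-unit width:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using LChar = uint8_t;   // hypothetical stand-in for WTF's LChar
using UChar = char16_t;  // hypothetical stand-in for WTF's UChar

// Tokenizer templated on the input code-unit type, so 8-bit and
// 16-bit streams share one implementation.
template<typename CharType>
std::vector<std::vector<CharType>> tokenizeWords(const CharType* chars, size_t length) {
    std::vector<std::vector<CharType>> tokens;
    std::vector<CharType> current;
    for (size_t i = 0; i < length; ++i) {
        CharType c = chars[i];
        if (c == CharType(' ')) {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}

// Instantiated once per character width:
//   tokenizeWords<LChar>(...)  — 8-bit path, no widening needed
//   tokenizeWords<UChar>(...)  — 16-bit path
```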
> On Thu, Mar 7, 2013 at 10:11 AM, Darin Adler <darin at apple.com> wrote:
>> No. I retract my question. Sounds like we already have it right! Thanks for setting me straight.
>> Maybe some day we could make a non copying code path that points directly at the data in the SharedBuffer, but I have no idea if that'd be beneficial.
>> -- Darin
>> Sent from my iPhone
>> On Mar 7, 2013, at 10:01 AM, Michael Saboff <msaboff at apple.com> wrote:
>>> There is an all-ASCII case in TextCodecUTF8::decode().  It should be keeping all ASCII data as 8 bit.  TextCodecWindowsLatin1::decode() has not only an all-ASCII case, but it only up-converts to 16 bit in a couple of rare cases.  Is there some other case you think we are not handling?
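The shape of such an all-ASCII fast path can be sketched like this (a hypothetical illustration, not WebKit's actual TextCodecUTF8::decode()): scan for any byte >= 0x80, and only fall back to producing UTF-16 when one is found.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical sketch: the decoded result stays 8-bit when the
// input is pure ASCII, and becomes UTF-16 otherwise.
using DecodedString = std::variant<std::string, std::u16string>;

DecodedString decodeUTF8(const uint8_t* bytes, size_t length) {
    bool allASCII = true;
    for (size_t i = 0; i < length; ++i) {
        if (bytes[i] >= 0x80) {
            allASCII = false;
            break;
        }
    }
    if (allASCII) // fast path: keep the result 8-bit
        return std::string(reinterpret_cast<const char*>(bytes), length);

    // Slow path: decode to UTF-16. Minimal handling of 1-3 byte
    // sequences only; a real decoder also handles 4-byte sequences
    // (surrogate pairs) and validates malformed input.
    std::u16string out;
    for (size_t i = 0; i < length; ) {
        uint8_t b = bytes[i];
        if (b < 0x80) {
            out.push_back(b);
            ++i;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < length) {
            out.push_back(((b & 0x1F) << 6) | (bytes[i + 1] & 0x3F));
            i += 2;
        } else if ((b & 0xF0) == 0xE0 && i + 2 < length) {
            out.push_back(((b & 0x0F) << 12) | ((bytes[i + 1] & 0x3F) << 6)
                          | (bytes[i + 2] & 0x3F));
            i += 3;
        } else {
            out.push_back(0xFFFD); // U+FFFD replacement character
            ++i;
        }
    }
    return out;
}
```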
>>> - Michael
>>> On Mar 7, 2013, at 9:29 AM, Darin Adler <darin at apple.com> wrote:
>>>> Hi folks.
>>>> Today, bytes that come in from the network get turned into UTF-16 by the decoding process. We then turn some of them back into Latin-1 during the parsing process. Should we make changes so there’s an 8-bit path? It might be as simple as writing code that has more of an all-ASCII special case in TextCodecUTF8 and something similar in TextCodecWindowsLatin1.
>>>> Is there something significant to be gained here? I’ve been wondering this for a while, so I thought I’d ask the rest of the WebKit contributors.
>>>> -- Darin
>>>> _______________________________________________
>>>> webkit-dev mailing list
>>>> webkit-dev at lists.webkit.org
>>>> https://lists.webkit.org/mailman/listinfo/webkit-dev
