[webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread

Wed Jan 9 19:07:32 PST 2013

On Wed, Jan 9, 2013 at 6:38 PM, Oliver Hunt <oliver at apple.com> wrote:
> How will we ensure thread safety?  Even at just the tokenizing level don't we use AtomicString?  AtomicString isn't threadsafe wrt StringImpl IIRC so this seems like it would add a world of hurt.

AtomicString is already usable from other threads
(http://trac.webkit.org/changeset/38094), but you are correct that
this is the core concern!  PickledToken (or whatever it's called) will
have to be written very carefully in order to minimize/eliminate
copies, while still guaranteeing thread safety.  The correct design
and handling of PickledToken is the central question of this whole
endeavor.
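
To make the copy-versus-safety tension concrete, here is a minimal
sketch, not the actual PickledToken (which has not been designed yet).
It assumes std::string as a stand-in for WebKit's string classes, and
the names PickledTokenSketch and TokenQueueSketch are hypothetical.
Each token owns its character data outright, so nothing is shared with
the tokenizer thread, and tokens are moved (not copied) through a
mutex-protected queue.

```cpp
#include <mutex>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a "pickled" token. It owns all of its character
// data, so no reference-counted string buffer is shared between threads.
struct PickledTokenSketch {
    enum class Type { StartTag, EndTag, Character };
    Type type = Type::Character;
    std::string name; // tag name, or text for Character tokens
    std::vector<std::pair<std::string, std::string>> attributes;
};

// Hypothetical hand-off queue. The tokenizer thread calls push(); the main
// thread calls tryPop() and never blocks waiting for the tokenizer.
class TokenQueueSketch {
public:
    void push(PickledTokenSketch token)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_tokens.push(std::move(token)); // moved in, not copied
    }

    // Returns false immediately if no token is available yet.
    bool tryPop(PickledTokenSketch& out)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_tokens.empty())
            return false;
        out = std::move(m_tokens.front()); // moved out, not copied
        m_tokens.pop();
        return true;
    }

private:
    std::mutex m_mutex;
    std::queue<PickledTokenSketch> m_tokens;
};
```

The one unavoidable copy in this sketch happens when the tokenizer
first fills in a token's strings; after that, ownership transfers by
move.  Whether a real design can avoid even that copy is exactly the
open question.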

> I realise it's been a long time since I've worked on this so it's completely possible that I'm not aware of the current behavior.
>
> That aside I question what the benefit of this will be.  All those cases where we've started parsing html are intrinsically tied to the web's general "single thread of execution" model, which implies that even if we do push parsing into a separate thread we'll just end up with the ui thread blocked on the parsing thread which doesn't seem hugely superior.
>
> What is the objective here? To improve performance, add parallelism, or reduce latency?

The core goal is to reduce latency -- to free up the main thread for
JavaScript and UI interaction -- which as you correctly note, cannot
be moved off of the main thread due to the "single thread of
execution" model of the web.

One could view the preload scanner as a layman's attempt at this
type of "tokenize asynchronously" approach.  This model gets preload
scanning for free, and can also easily answer the wkb.ug/90751 request
for speculative tokenizing of the entire document.  (We just have to
save markers before every <script> token, because if the script uses
document.write, any tokens after </script> become invalid.)
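
The marker idea can be sketched roughly as follows.  This is an
illustration only: it assumes tokens are plain strings, and
SpeculationSketch, speculate and discardAfterScript are hypothetical
names, not WebKit API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of a speculative tokenizing pass.
struct SpeculationSketch {
    std::vector<std::string> tokens;         // simplified: one string per token
    std::vector<std::size_t> scriptMarkers;  // index in `tokens` of each "<script>"
};

// Tokenize ahead over the whole stream, saving a marker before each script.
SpeculationSketch speculate(const std::vector<std::string>& tokenStream)
{
    SpeculationSketch result;
    for (const std::string& token : tokenStream) {
        if (token == "<script>")
            result.scriptMarkers.push_back(result.tokens.size());
        result.tokens.push_back(token);
    }
    return result;
}

// Call when the script at scriptMarkers[markerIndex] used document.write.
// Keeps everything up to and including that script's "</script>" and drops
// the now-invalid tokens after it. Returns how many tokens were discarded.
// Assumes markerIndex < scriptMarkers.size().
std::size_t discardAfterScript(SpeculationSketch& s, std::size_t markerIndex)
{
    std::size_t end = s.scriptMarkers[markerIndex];
    while (end < s.tokens.size() && s.tokens[end] != "</script>")
        ++end;
    if (end == s.tokens.size())
        return 0; // closing tag not tokenized yet; nothing after it to drop

    const std::size_t keep = end + 1;
    const std::size_t discarded = s.tokens.size() - keep;
    s.tokens.resize(keep);
    // Markers for later scripts pointed into the discarded region.
    s.scriptMarkers.resize(markerIndex + 1);
    return discarded;
}
```

After a discard, tokenizing would resume from the input position just
past </script>, with whatever document.write inserted placed first.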

I should also note that not all HTML parsing can be moved off of the
main thread.  innerHTML, for example, would still be done entirely on
the main thread.  I would imagine that when we land this on trunk, it
will be behind a feature flag so that ports can opt in to the
threaded-parsing path, since we must maintain the main-thread parsing
ability for innerHTML anyway.

> --Oliver
>
> On Jan 9, 2013, at 6:10 PM, Adam Barth <abarth at webkit.org> wrote:
>
>> On Wed, Jan 9, 2013 at 6:00 PM, Eric Seidel <eric at webkit.org> wrote:
>>> We're planning to move parts of the HTML Parser off of the main thread:
>>> https://bugs.webkit.org/show_bug.cgi?id=106127
>>>
>>> This is driven by our testing showing that HTML parsing on mobile is
>>> slow and long (causing user-visible delays averaging 10 frames /
>>> 150ms).
>>> https://bug-106127-attachments.webkit.org/attachment.cgi?id=182002
>>> Complete data can be found at [1].
>>
>> In case it's not clear from that link, the "ParseHTML" column is the
>> total amount of time the web inspector attributes to HTML parsing when
>> loading those URLs on a Nexus 7 using a top-of-tree build of
>> Chromium's content_shell (similar to WebKitTestRunner).
>>
>> The HTML parser parses data a chunk at a time, which means the total
>> time doesn't tell the whole story.  The "ParseHTML_max" column shows
>> the largest single block of time spent in the HTML parser, which is
>> more of a measure of the main thread "jank" caused by the parser.
>>
>> Antti has pointed out that the inspector isn't the best source of
>> data.  He measured total time using Instruments, and got numbers that
>> are consistent (within a factor of 2) with the inspector measurements.
>> (We were using different data sets, so we wouldn't expect perfect
>> agreement even if we were measuring precisely the same thing.)
>>
>> Adam
>>
>>
>>> Mozilla moved their parser onto a separate thread during their HTML5
>>> parser re-write:
>>> https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading
>>>
>>> We plan to take a slightly simpler approach, moving only Tokenizing
>>> off of the main thread:
>>> https://docs.google.com/drawings/d/1hwYyvkT7HFLAtTX_7LQp2lxA6LkaEWkXONmjtGCQjK0/edit
>>> The left is our current design, the middle is a tokenizer-only design,
>>> and the right is more like mozilla's threaded-parser design.
>>>
>>> Profiling shows Tokenizing accounts for about 10x the number of
>>> samples as TreeBuilding.  Including Antti's recent testing (.5% vs.
>>> 3%):
>>> https://bugs.webkit.org/show_bug.cgi?id=106127#c10
>>> If after we do this we measure and find ourselves still spending a lot
>>> of main-thread time parsing, we'll move the TreeBuilder too. :)  (This
>>> work is a nicely separable sub-set of larger work needed to move the
>>> TreeBuilder.)
>>>
>>> We welcome your thoughts and comments.
>>>
>>>
>>> 1. https://docs.google.com/spreadsheet/ccc?key=0AlC4tS7Ao1fIdGtJTWlSaUItQ1hYaDFDcWkzeVAxOGc#gid=0
>>> (Epic thanks to Nat Duca for helping us collect that data.)