[webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread

Maciej Stachowiak mjs at apple.com
Fri Jan 11 02:02:00 PST 2013

Your comments here make me feel more positively towards this project. In particular, I'm happy that:
- There actually will be meaningful testing.
- You're prepared to abandon this approach if it doesn't meet its perf goals (presumably at minimum no regression to page load time or responsiveness while loading, and meaningful improvement to at least one of these).
- The shared-nothing message-passing approach to threading sounds likely to be a relatively less complex/fragile approach to threading than most others.

Thanks for following up. 

I have a comment on a tangential point that I'll split into another thread.


On Jan 11, 2013, at 12:21 AM, Adam Barth <abarth at webkit.org> wrote:

> On Thu, Jan 10, 2013 at 9:19 PM, Maciej Stachowiak <mjs at apple.com> wrote:
>> On Jan 10, 2013, at 12:07 PM, Adam Barth <abarth at webkit.org> wrote:
>>> On Thu, Jan 10, 2013 at 12:37 AM, Maciej Stachowiak <mjs at apple.com> wrote:
>>>> I presume from your other comments that the goal of this work is responsiveness, rather than page load speed as such. I'm excited about the potential to improve responsiveness during page loading.
>>> The goals are described in the first link Eric gave in his email:
>>> <https://bugs.webkit.org/show_bug.cgi?id=106127#c0>.  Specifically:
>>> ---8<---
>>> 1) Moving parsing off the main thread could make web pages more
>>> responsive because the main thread is available for handling input
>>> events and executing JavaScript.
>>> 2) Moving parsing off the main thread could make web pages load more
>>> quickly because WebCore can do other work in parallel with parsing
>>> HTML (such as parsing CSS or attaching elements to the render tree).
>>> --->8---
>> OK - what test (if any) will be used to test whether the page load speed goal is achieved?
> All of them.  :)
> More seriously, Chromium runs a very large battery of performance
> tests continuously on a matrix of different platforms, including
> desktop and mobile.  You can see one of the overview dashboards here:
> http://build.chromium.org/f/chromium/perf/dashboard/overview.html
> The ones that are particularly relevant to this work are the various
> page load tests, both with simulated network delays and without
> network delays.  For iterative benchmarking, we plan to use Chromium's
> Telemetry framework <http://www.chromium.org/developers/telemetry>.
> Specifically, I expect we'll work with the top_25 dataset
> <http://src.chromium.org/viewvc/chrome/trunk/src/tools/perf/page_sets/top_25.json?view=markup>,
> but we might use some other data sets if there are particular areas we
> want to measure more carefully.
>>>> One question: what tests are you planning to use to validate whether this approach achieves its goals of better responsiveness?
>>> The tests we've run so far are also described in the first link Eric
>>> gave in his email: <https://bugs.webkit.org/show_bug.cgi?id=106127>.
>>> They suggest that there's a good deal of room for improvement in this
>>> area.  After we have a working implementation, we'll likely re-run
>>> those experiments and run other experiments to do an A/B comparison of
>>> the two approaches.  As Filip points out, we'll likely end up with a
>>> hybrid of the two designs that's optimized for handling various work
>>> loads.
>> I agree the test suggests there is room for improvement. From the description of how the test is run, I can think of two potential ways to improve how well it correlates with actual user-perceived responsiveness:
>> (1) It seems to look at the max parsing pause time without considering whether there's any content being shown that it's possible to interact with. If the longest pauses happen before meaningful content is visible, then reducing those pauses is unlikely to actually materially improve responsiveness, at least in models where web content processing happens in a separate process or thread from the UI. One possibility is to track the max parsing pause time starting from the first visually non-empty layout. That would better approximate how much actual user interaction is blocked.
> Consider, also, that pages might be parsing in the same process in
> another tab, or in a frame in the current tab.
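The metric proposed in point (1) could be computed from a timeline trace roughly as follows. This is a sketch only; the event structure and names are hypothetical, not an existing WebKit API:

```cpp
#include <cassert>
#include <vector>

// Hypothetical trace record for one parser pause: when it started and how
// long the main thread was blocked. Names are illustrative only.
struct ParsePause {
    double startMs;
    double durationMs;
};

// Max parsing pause, counting only pauses that begin at or after the first
// visually non-empty layout -- the point at which a user could, in
// principle, start interacting with the page.
double maxPauseAfterFirstLayout(const std::vector<ParsePause>& pauses,
                                double firstNonEmptyLayoutMs)
{
    double maxPause = 0;
    for (const ParsePause& pause : pauses) {
        if (pause.startMs >= firstNonEmptyLayoutMs && pause.durationMs > maxPause)
            maxPause = pause.durationMs;
    }
    return maxPause;
}
```

Filtering by the first non-empty layout means a long pause before anything is visible (which blocks no interaction) no longer dominates the metric.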
>> (2) It might be helpful to track max and average pause time from non-parsing sources, for the sake of comparison.
> If you looked at the information Eric provided in his initial email,
> you might have noticed
> <https://docs.google.com/spreadsheet/ccc?key=0AlC4tS7Ao1fIdGtJTWlSaUItQ1hYaDFDcWkzeVAxOGc#gid=0>,
> which is precisely that.
>> These might result in a more accurate assessment of the benefits.
>>>> The reason I ask is that this sounds like a significant increase in complexity, so we should be very confident that there is a real and major benefit. One thing I wonder about is how common it is to have enough of the page processed that the user could interact with it in principle, yet still have large parsing chunks remaining which would prevent that interaction from being smooth.
>>> If you're interested in reducing the complexity of the parser, I'd
>>> recommend removing the NEW_XML code.  As previously discussed, that
>>> code creates significant complexity for zero benefit.
>> Tu quoque fallacy. From your glib reply, I get the impression that you are not giving the complexity cost of multithreading due consideration. I hope that is not actually the case and I merely caught you at a bad moment or something.
> I'm quite aware of the complexity of multithreaded code having written
> a great deal of it for Chromium.
> One of the things I hope comes out of this project is a good example
> of how to do multithreaded processing in WebCore.  Currently, every
> subsystem seems to "roll its own" threading abstractions, I think
> largely because there hasn't been a strong technical example of how to
> do threading well.  By contrast, Chromium uses multithreading quite
> extensively and has put a lot of engineering effort into building
> high-quality threading abstractions:
> http://dev.chromium.org/developers/design-documents/threading
> I hope that this work will let us transfer some of that knowledge into
> WebCore.  Obviously cleaning up all of WebCore's threading
> abstractions is a big task, and one that will probably happen slowly
> over time.
> Anyway, I can understand why you're worried about introducing more
> threading into WebCore given the current mishmash of threading
> abstractions that we have already.  The good news is that the folks
> working on this project probably have more experience writing
> multithreaded code than the average WebKit developer.  :)
>> (And also we agreed to a drop-dead date to remove the code, which has either passed or is very close.)
>>>> Another thing I wonder about is whether yielding to the event loop more aggressively could achieve a similar benefit at a much lower complexity cost.
>>> Yielding to the event loop more could reduce the "ParseHTML_max" time,
>>> but it cannot reduce the "ParseHTML" time.  Generally speaking,
>>> yielding to the event loop is a trade-off between throughput (i.e.,
>>> page load time) and responsiveness.  Moving work to a background
>>> thread should let us achieve a better trade-off between these
>>> quantities than we're likely to be able to achieve by tuning the yield
>>> parameter alone.
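The trade-off Adam describes can be illustrated with a toy cost model. All numbers and the model itself are hypothetical; this is not how WebKit's scheduler actually accounts for time:

```cpp
#include <cassert>
#include <cmath>

// Toy model: parse totalWorkMs of work in chunks of at most yieldAfterMs,
// paying overheadMs each time we yield back to the event loop. Yielding
// more often caps the longest pause but inflates total parse time.
struct ParseCost {
    double maxPauseMs; // longest uninterrupted parse chunk ("ParseHTML_max")
    double totalMs;    // total time spent parsing ("ParseHTML")
};

ParseCost simulateChunkedParse(double totalWorkMs, double yieldAfterMs,
                               double overheadMs)
{
    int chunks = static_cast<int>(std::ceil(totalWorkMs / yieldAfterMs));
    if (chunks < 1)
        chunks = 1;
    int yields = chunks - 1; // no yield needed after the final chunk
    double maxPause = totalWorkMs < yieldAfterMs ? totalWorkMs : yieldAfterMs;
    return { maxPause, totalWorkMs + yields * overheadMs };
}
```

In this model, yielding every 10ms instead of every 50ms cuts the longest pause from 50ms to 10ms but raises total parse time from 101ms to 109ms: tuning the yield parameter alone trades throughput for responsiveness rather than improving both.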
>> I agree that is possible. But it also seems like improvements that don't impose the complexity and hazards of multithreading in this area are worth trying first: things such as retuning yielding and replacing the preload scanner with (non-threaded) speculative pre-tokenizing, as suggested by Antti. That would let us better assess the benefits of the threading itself.
> Given what I know about HTML parsing, I don't believe tuning the yield
> parameter will give nearly as good a result as moving tokenization to
> a background thread.  After we switched over to the new parser, we
> went through and tuned a bunch of these parameters (albeit not for
> maximum stop time as that wasn't on our radar at the time), so I have
> a good sense for what you can and can't achieve with these knobs.
>>>> Having a test to drive the work would allow us to answer these types of questions. (It may also be that the test data you cited would already answer these questions but I didn't sufficiently understand it; if so, further explanation would be appreciated.)
>>> If you're interested in building such a test, I would be interested in
>>> hearing the results.  We don't plan to build such a test at this time.
>> If you're actually planning to make a significant complexity-imposing architectural change for performance reasons, without any way to test whether it delivers the claimed performance benefits, or how it compares to less complex approaches, then why should any rational person agree with that approach? When attempting to improve performance, the burden of proof is on the person proposing the performance improvement, not on others to create a test to figure out if the performance improvement works. It's not valid to respond to a request for performance testing info with the equivalent of "patches welcome".
> Is that really the case?  If so, I'm surprised that the patches for
> the shared memory cache and the NetworkProcess landed.  I raised
> similar questions to the ones you're raising now, but the folks
> proposing those changes basically ignored me and landed their patches
> anyway.
>> But you and others have actually cited a performance test and described how to run it, and perhaps even shown openness to improving it. So here again, I hope your words merely sound more negative than what you actually mean. I don't understand why you would say you don't plan to build a test when above you cited a test and your plans to run it before and after. Am I missing some nuance?
> I didn't quite follow this last paragraph.
> On Thu, Jan 10, 2013 at 11:00 PM, Filip Pizlo <fpizlo at apple.com> wrote:
>> Thanks for your detailed reply. Seems like you guys have a pretty good plan in place.
> Thanks.
>> I hope this works and produces a performance improvement. That being said this does look like a sufficiently complex work item that success is far from guaranteed. So to play devil's advocate, what is your plan for if this doesn't work out?
> If it doesn't work, we'll rip it out.
>> I.e. are we talking about adding a bunch of threading support code in the optimistic hope that it makes things run fast, and then forgetting about it if it doesn't?  Or are you prepared to rip out any complexity that got landed if this does not ultimately live up to its promise?  Or is this going to be one giant patch that only lands if it works?
> I'm happy to rip out the feature if it turns out not to work.  There's
> a risk that we'll disagree about whether the feature is a win, but I
> imagine we'll come to a decision somehow.  :)
> I would prefer not to land the code in one giant patch.  That's not
> typically how we approach WebKit development.  Instead, we plan to
> work behind an ENABLE flag, which will make it easier to identify what
> needs to be ripped out (and also make it easier to do A/B
> comparisons).
>> I'm also trying to understand what would happen during the interim when this work is incomplete, we have thread-related goop in some critical paths, and we don't yet know if the WIP code is ever going to result in a speedup. And also, what will happen sometime from now if that code is never successfully optimized to the point where it is worth enabling.
> Any threading goop will be behind an ENABLE flag until it's ready for
> prime time (i.e., until it is better than what we have currently).  As
> a project, we don't have a great track record of ripping out WIP code
> that's languishing (see, for example, the discussion about NEW_XML
> that I alluded to earlier), but I tend to be one of the people pushing
> to rip out unused code, so I'd lose a bit of my credibility if I
> wasn't willing to rip out my own code.  :)
>> I appreciate that this sort of question can be asked of any performance work but in this particular case my gut tells me that this is going to result in significantly more complexity than the usual incremental performance work. So it's good to understand what plan B is.
>> Probably a good answer to this sort of question would address some fears that people may have. If this work does lead to a performance win then probably everyone will be happy. But if it doesn't then it would be great to have a "plan of retreat".
> One thing that I've learned from this thread is that the community has
> more fear of multithreaded code than I expected.  In retrospect, this
> makes a lot of sense given that multithreaded code is rare in WebKit
> and that every subsystem rolls its own abstractions.
> Our current plan is to use a simple design with two message queues:
> one queue of byte vectors being sent from the main thread to the
> background thread and one queue of PickledTokens being sent from the
> background thread to the main thread.  We're not planning on sharing
> any memory between the threads or on using any locking mechanisms
> (other than the ones that are already implemented in
> WTF::MessageQueue).  This general approach is widely used in the
> Chromium project and tends to be both high performance and relatively
> easy to reason about.  I'm happy to discuss this in more detail if
> you're interested.
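The two-queue, shared-nothing design Adam describes can be sketched with standard-library primitives. This is a simplified stand-in for WTF::MessageQueue; the Token type and the "tokenization" itself are placeholders, not the real parser:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Minimal blocking FIFO: each message is moved across, never shared, so the
// only locking lives inside the queue (mirroring WTF::MessageQueue's role).
template <typename T>
class MessageQueue {
public:
    void append(T value)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_queue.push(std::move(value));
        m_condition.notify_one();
    }
    T waitForMessage()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_condition.wait(lock, [this] { return !m_queue.empty(); });
        T value = std::move(m_queue.front());
        m_queue.pop();
        return value;
    }
private:
    std::mutex m_mutex;
    std::condition_variable m_condition;
    std::queue<T> m_queue;
};

using Bytes = std::vector<char>;
using Token = std::string; // placeholder for a PickledToken-style value type

// Background thread: consume byte chunks, produce tokens. An empty chunk is
// the end-of-input marker; here "tokenizing" just copies the bytes.
void tokenizerThread(MessageQueue<Bytes>& input, MessageQueue<Token>& output)
{
    for (;;) {
        Bytes chunk = input.waitForMessage();
        if (chunk.empty()) {
            output.append(Token()); // forward end-of-input to the main thread
            return;
        }
        output.append(Token(chunk.begin(), chunk.end()));
    }
}

// Main thread: feed chunks to the background thread, collect tokens back.
std::vector<Token> parseOffMainThread(const std::vector<std::string>& chunks)
{
    MessageQueue<Bytes> toBackground;
    MessageQueue<Token> toMain;
    std::thread background(tokenizerThread, std::ref(toBackground), std::ref(toMain));

    for (const std::string& chunk : chunks)
        toBackground.append(Bytes(chunk.begin(), chunk.end()));
    toBackground.append(Bytes()); // end of input

    std::vector<Token> tokens;
    for (;;) {
        Token token = toMain.waitForMessage();
        if (token.empty())
            break;
        tokens.push_back(std::move(token));
    }
    background.join();
    return tokens;
}
```

Because ownership of every message transfers with the message, neither thread ever reads memory the other might be writing, which is what makes this style of threading comparatively easy to reason about.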
> To return to your question, plan B is to rip out the code and continue
> using the current implementation.  If you have something else in mind,
> I'm happy to discuss that as well.
> Adam
