[webkit-dev] Feature Announcement: Moving HTML Parser off the Main Thread

Thu Jan 10 12:07:18 PST 2013

Thanks everyone for your feedback.  Detailed responses inline.

On Wed, Jan 9, 2013 at 9:41 PM, Filip Pizlo <fpizlo at apple.com> wrote:
> I think your biggest challenge will be ensuring that the latency of shoving things to another core and then shoving them back will be smaller than the latency of processing those same things on the main thread.

Yes.  That's something we know we have to worry about.  Given that we
need to retain the ability to parse HTML on the main thread to handle
document.write and innerHTML, we should be able to easily do A/B
comparisons to make sure we understand any performance trade-offs that
might arise.

> For small documents, I expect concurrent tokenization to be a pure regression because the latency of waking up another thread to do just a small bit of work, plus the added cost of whatever synchronization operations will be needed to ensure safety, will involve more total work than just tokenizing locally.

Once we have the ability to tokenize on a background thread, we can
examine cases like these and heuristically decide whether to use the
background thread or not at runtime.  As I wrote above, we'll need
this ability anyway, so keeping the ability to optimize these cases
shouldn't add any new constraints to the design.
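
As a rough illustration of such a runtime heuristic (a sketch, not
WebKit code), the decision could be as simple as the function below.
The name choose_tokenizer, the 4096-byte cutoff, and the
needs_synchronous_result flag are all hypothetical; a real cutoff
would come out of the A/B measurements mentioned above.

```python
# Hypothetical cutoff; in practice this would be tuned from A/B measurements.
BACKGROUND_THRESHOLD_BYTES = 4096


def choose_tokenizer(input_length: int, needs_synchronous_result: bool) -> str:
    """Return "main" or "background" for a chunk of HTML input.

    needs_synchronous_result models callers such as document.write and
    innerHTML, which have to be parsed on the main thread.
    """
    if needs_synchronous_result:
        return "main"
    if input_length < BACKGROUND_THRESHOLD_BYTES:
        # Small input: the thread hand-off would likely cost more than
        # tokenizing in place.
        return "main"
    return "background"
```

With this shape, small documents never pay the cross-thread cost, and
the threshold is the only knob that needs tuning.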

> We certainly see this in the JSC parallel GC, and in line with traditional parallel GC design, we ensure that parallel threads only kick in when the main thread is unable to keep up with the work that it has created for itself.
>
> Do you have a vision for how to implement a similar self-throttling, where tokenizing continues on the main thread so long as it is cheap to do so?

It's certainly something we can tune in the optimization phase.  I
don't think we need a particular vision to be able to do it.  Given
that we want to implement speculative parsing (to replace preload
scanning---more on this below), we'll already have the ability to
checkpoint and restore the tokenizer state across threads.  Once you
have that primitive, it's easy to decide whether to continue
tokenization on the main thread or on a background thread.
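
To make the checkpoint/restore primitive concrete, here is a minimal
sketch with a toy tokenizer (not the WebKit tokenizer). ToyTokenizer,
Checkpoint, and the two-state "data"/"tag" machine are assumptions
made up for the example; the point is only that a snapshot taken
mid-token is enough for a second tokenizer instance (standing in for
another thread) to pick up where the first one stopped.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Checkpoint:
    position: int  # index of the next unread character
    state: str     # "data" or "tag"
    buffer: str    # characters of the token currently being built


class ToyTokenizer:
    """Hypothetical two-state tokenizer: text runs and <tags>."""

    def __init__(self, source: str):
        self.source = source
        self.position = 0
        self.state = "data"
        self.buffer = ""

    def checkpoint(self) -> Checkpoint:
        return Checkpoint(self.position, self.state, self.buffer)

    def restore(self, cp: Checkpoint) -> None:
        self.position, self.state, self.buffer = cp.position, cp.state, cp.buffer

    def run(self, max_chars: int) -> List[Tuple[str, str]]:
        """Consume up to max_chars characters and return completed tokens."""
        tokens = []
        end = min(len(self.source), self.position + max_chars)
        while self.position < end:
            ch = self.source[self.position]
            self.position += 1
            if self.state == "data" and ch == "<":
                if self.buffer:
                    tokens.append(("text", self.buffer))
                self.buffer, self.state = "", "tag"
            elif self.state == "tag" and ch == ">":
                tokens.append(("tag", self.buffer))
                self.buffer, self.state = "", "data"
            else:
                self.buffer += ch
        return tokens


source = "<p>hello</p><b>x</b>"
whole = ToyTokenizer(source).run(len(source))

first = ToyTokenizer(source)
head = first.run(5)               # stops mid-token, inside "hello"
snapshot = first.checkpoint()

second = ToyTokenizer(source)     # stands in for a tokenizer on another thread
second.restore(snapshot)
tail = second.run(len(source))

assert head + tail == whole       # same token stream either way
```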

On Wed, Jan 9, 2013 at 10:04 PM, Ian Hickson <ian at hixie.ch> wrote:
> Parsing and (maybe to a lesser extent) compiling JS can be moved off the
> main thread, though, right? That's probably worth examining too, if it
> hasn't already been done.

Yes, once we have the tokenizer running on a background thread, that
opens up the possibility of parsing other sorts of data on the
background thread as well.  For example, when the tokenizer encounters
an inline script block, you could imagine parsing the script on the
background thread as well so that the main thread has less work to do.
(You could also imagine making these optimizations without a background
tokenizer, but the design constraints would be a bit different.)

On Thu, Jan 10, 2013 at 12:11 AM, Zoltan Herczeg <zherczeg at webkit.org> wrote:
> Parsing, especially JS parsing still takes a large amount of time on page
> loading. We tried to improve the preload scanner by moving it into
> another thread, but there was no gain (except some special cases).
> Synchronization between threads is surprisingly (ridiculously) costly,
> usually worth for those tasks, which needs quite a few million
> instructions to be executed (and tokenization takes far less in most
> cases). For smaller tasks, SIMD instruction sets can help, which is
> basically a parallel execution on a single thread. Anyway it is worth
> trying, but it is really challenging to make it work in practice. Good
> luck!

This is something we're worried about and will need to be careful
about.  In the design we're proposing, preload scanning is replaced by
speculative parsing, so the overhead of the preload scanner is removed
entirely.  The way this works is as follows:

When running on the background thread, the tokenizer produces a queue
of PickledTokens.  As these tokens are queued, we can scan them to
kick off any preloads that we find.  Whenever the tokenizer queues a
token that creates a new insertion point (in the terminology of the
HTML specification), the tokenizer checkpoints itself but continues
tokenizing speculatively.  (Notice that tokens produced in this
situation are still scanned for preloads but might not ever actually
result in DOM being constructed.)

After the main thread has processed the token that created the
insertion point, if no characters were inserted, the main thread
continues processing PickledTokens that were created speculatively.  If
some characters were inserted, the main thread instead instructs the
tokenizer to roll back to that checkpoint and continue tokenizing in a
new state.  In this case, the queue of speculative tokens is
discarded.
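
The control flow of the two paragraphs above can be sketched as
follows. This is a single-threaded simulation with a toy tokenizer,
not the proposed implementation: whitespace-separated words stand in
for PickledTokens, a token starting with "script:" stands in for one
that creates an insertion point, and run_script is a hypothetical
callback that returns whatever characters the script inserted (the
empty string if none).

```python
import re
from collections import deque
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[Tuple[str, int]]:
    """Toy tokenizer: return (token, offset just past the token) pairs.

    The offset doubles as the checkpoint taken after each token.
    """
    return [(m.group(), m.end()) for m in re.finditer(r"\S+", text)]


def parse(source: str, run_script: Callable[[str], str]) -> Tuple[List[str], int]:
    """Return (tokens used for tree building, speculative tokens discarded)."""
    built: List[str] = []
    discarded = 0
    text = source
    while text:
        # The "background thread" tokenizes everything it has, speculatively.
        queue = deque(tokenize(text))
        remaining = ""
        while queue:
            token, checkpoint = queue.popleft()
            built.append(token)
            if token.startswith("script:"):
                inserted = run_script(token)
                if inserted:
                    # Characters were inserted: drop the speculative tokens,
                    # roll back to the checkpoint, and retokenize from there.
                    discarded += len(queue)
                    queue.clear()
                    remaining = inserted + text[checkpoint:]
        text = remaining
    return built, discarded


def writes_on_w(token: str) -> str:
    # Hypothetical script behavior: only "script:w" inserts characters.
    return "x y" if token == "script:w" else ""


common_case = parse("a script:n b c", writes_on_w)
rollback_case = parse("a script:w b c", writes_on_w)
```

In common_case nothing is discarded and the speculative tokens are
used as-is; in rollback_case the two tokens queued after the script
are thrown away and produced again after the inserted characters.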

Notice that in the common case, we execute JavaScript and tokenize
in parallel, something that's not possible with a main-thread
tokenizer.  Once the script is done executing, we expect it to be
common to be able to resume tree building immediately as the tokenizer
will have already tokenized much of the subsequent data.

On Thu, Jan 10, 2013 at 12:37 AM, Maciej Stachowiak <mjs at apple.com> wrote:
> I presume from your other comments that the goal of this work is responsiveness, rather than page load speed as such. I'm excited about the potential to improve responsiveness during page loading.

The goals are described in the first link Eric gave in his email:
<https://bugs.webkit.org/show_bug.cgi?id=106127#c0>.  Specifically:

---8<---
1) Moving parsing off the main thread could make web pages more
responsive because the main thread is available for handling input
events and executing JavaScript.
2) Moving parsing off the main thread could make web pages load more
quickly because WebCore can do other work in parallel with parsing
HTML (such as parsing CSS or attaching elements to the render tree).
--->8---

> One question: what tests are you planning to use to validate whether this approach achieves its goals of better responsiveness?

The tests we've run so far are also described in the first link Eric
gave in his email: <https://bugs.webkit.org/show_bug.cgi?id=106127>.
They suggest that there's a good deal of room for improvement in this
area.  After we have a working implementation, we'll likely re-run
those experiments and run other experiments to do an A/B comparison of
the two approaches.  As Filip points out, we'll likely end up with a
hybrid of the two designs that's optimized for handling various
workloads.

> The reason I ask is that this sounds like a significant increase in complexity, so we should be very confident that there is a real and major benefit. One thing I wonder about is how common it is to have enough of the page processed that the user could interact with it in principle, yet still have large parsing chunks remaining which would prevent that interaction from being smooth.

If you're interested in reducing the complexity of the parser, I'd
recommend removing the NEW_XML code.  As previously discussed, that
code creates significant complexity for zero benefit.

> Another thing I wonder about is whether yielding to the event loop more aggressively could achieve a similar benefit at a much lower complexity cost.

Yielding to the event loop more could reduce the "ParseHTML_max" time,
but it cannot reduce the "ParseHTML" time.  Generally speaking,
yielding to the event loop is a trade-off between throughput (i.e.,
page load time) and responsiveness.  Moving work to a background
thread should let us achieve a better trade-off between these
quantities than we're likely to be able to achieve by tuning the yield
parameter alone.
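
A back-of-the-envelope model of that trade-off, under assumptions that
are mine rather than measured: parsing is split into equal chunks, and
each yield to the event loop costs a fixed overhead. The function name
and the numbers are hypothetical.

```python
import math
from typing import Tuple


def yield_model(parse_ms: float, chunk_ms: float,
                yield_overhead_ms: float) -> Tuple[float, float]:
    """Return (longest uninterrupted parse, total elapsed time) in ms.

    Assumes main-thread parsing split into chunks of chunk_ms, with a
    fixed cost per yield. Both assumptions are simplifications.
    """
    chunks = math.ceil(parse_ms / chunk_ms)
    longest_pause = min(chunk_ms, parse_ms)
    total_elapsed = parse_ms + (chunks - 1) * yield_overhead_ms
    return longest_pause, total_elapsed


coarse = yield_model(parse_ms=100, chunk_ms=50, yield_overhead_ms=1)
fine = yield_model(parse_ms=100, chunk_ms=10, yield_overhead_ms=1)
```

In this model, smaller chunks shrink the longest pause (50 ms down to
10 ms) while the 100 ms of parsing work stays on the main thread and
the elapsed time grows slightly (101 ms up to 109 ms).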

> Having a test to drive the work would allow us to answer these types of questions. (It may also be that the test data you cited would already answer these questions but I didn't sufficiently understand it; if so, further explanation would be appreciated.)

If you're interested in building such a test, I would be interested in
hearing the results.  We don't plan to build such a test at this time.

On Thu, Jan 10, 2013 at 1:44 AM, Antti Koivisto <koivisto at iki.fi> wrote:
> When loading web pages we are very frequently in a situation where we
> already have the source data (HTML text here but the same applies to
> preloaded Javascript, CSS, images, ...) and know we are likely to need it in
> soon, but can't actually utilize it for indeterminate time. This happens
> because pending external JS resources blocks the main parser (and pending
> CSS resources block JS execution) for web compatibility reasons. In this
> situation it makes sense to start processing resources we have to forms that
> are faster to use when they are eventually actually needed (like token
> stream here).

Indeed.

> One thing we already do when the main parser gets blocked is preload
> scanning. We look through the unparsed HTML source we have and trigger loads
> for any resources found. It would be beneficial if this happened off the
> main thread. We could do it when new data arrives in parallel with JS
> execution and other time consuming engine work, potentially triggering
> resource loads earlier.

A couple of people have tried to move preload scanning to a background
thread, but they haven't had much success.  Given that moving the
parser to a background thread gets us background preload scanning for
free, I don't think it's worth investing effort in moving just the
preload scanner anymore.

> I think a good first step here would be to share the tokens between the
> preload scanner and the main parser and worry about the threading part
> afterwards. We often parse the HTML source more or less twice so this is an
> unquestionable win.

We've discussed doing that for a number of years, but no one has
actually succeeded in doing it.  Given that moving the parsing to a
background thread gets us token reuse for free (because of the switch
from preload scanning to speculative tokenization), I don't think it's
worth investing effort in reusing the preload scanner's tokens
anymore.

Adam