[webkit-dev] Is there a plan for supporting multi-process and WebCL in webkit
fpizlo at apple.com
Wed Apr 10 13:15:06 PDT 2013
On Apr 10, 2013, at 12:37 PM, Dirk Pranke <dpranke at chromium.org> wrote:
> On Wed, Apr 10, 2013 at 7:37 AM, Filip Pizlo <fpizlo at apple.com> wrote:
> On Apr 10, 2013, at 2:29 AM, "Pozdnyakov, Mikhail" <mikhail.pozdnyakov at intel.com> wrote:
> >> Why not infer ParallelArrays automatically?
> > Sorry, did not get it. Could you please elaborate?
> Normal JS arrays. You use them with pure computation. Profiler notices this. And then the same things that you have in the ParallelArrays proposal now work except you don't need a new type.
> Compiler writers have been chasing this problem for 50 years, with (AFAIK) very limited success for languages that don't give you compile-time annotations to help you along.
Lets demystify this a bit. It's easy to say "OMG they did things for 50 years", and we're all guilty of saying things like this sometimes, but, it's a bit more interesting to identify what these people did for those 50 years, where they got, and what the remaining problems are. It is already possible to autoparallelize loops, but there are challenges that remain that make it impractical for "real code". The problems are with:
A) Identifying loops that are candidates for sound parallelization. In formal-speak they look for the lack of "loop-carried side-effects", i.e. something you'd do in one iteration that would be visible in a subsequent iteration - this represents a dependency that inhibits parallelization. But if you prove that no loop-carried side-effects exist, then you're off to a good start: you know that you could fork off some threads to run the loop's body in parallel, and the program would still be guaranteed to get identical results. (I'm oversimplifying a bit - if you tried to auto-parallelize a loop in a program that already had threads, you'd be in a world of hurt from a memory model standpoint - but we don't have that problem in JS because all code in JS is semantically serial and so the parallelization you did wouldn't interfere with existing uses of threads, since there aren't any.)
B) Identifying loops that are candidates for profitable parallelization. Spinning up a thread carries a fixed cost. If you find a loop with no loop-carried side-effects, but that loop actually only executes for a short time, parallelizing it will be a performance regression - and likely a steep one at that. Finding which loops are profitable to parallelize is thought to be a hard problem. The traditional attempts to solve it centered around making threads cheaper (i.e. Cilk, CML, and similar), or having stronger inference of profitability (I think this is what http://ciao-lang.org does), or pushing parallelizability hints into the language (X10, OpenMP, and these ParallelArrays).
(A) is not a problem that is solved by ParallelArrays. ParallelArrays guarantee sequential equivalence; i.e. ParallelArray.prototype.forEach() will behave like Array.prototype.forEach() if the closure you pass it has the equivalent of "loop-carried side-effects". I.e. if you believe that (A) is not a solvable problem, then this constitutes an argument against ParallelArrays, and not against inferring that a normal array should behave like a ParallelArray. If you can do the inference that allows ParallelArrays.prototype.forEach() to soundly parallelize, then you can do that inference for Array.prototype.forEach(), as well. I happen to think that (A) is a big problem with ParallelArrays; I don't like it that these guys are proposing a language construct called "ParallelArray" but then secretly they will probably not parallelize a ton of code because the user didn't realize that storing to a global variable shafts their inference for side-effects.
(B) is a problem that ParallelArrays attempt to solve. But here's what we know about the different approaches to solving (B):
- Making threads faster hasn't really worked. Cilk's threads are fast enough that if you use them yourself, manually, you'll be really happy, sort of, except that it sort of sometimes doesn't work like you'd expect because you've got user-level scheduling - but the point is, they're not fast enough that you could spin them up for _any_ loop in a program. You'd get a regression if you did.
- Stronger inference of profitability is a promising area, which isn't adequately explored, IMO. The approaches with which I'm most familiar focus too much on static inference.
- Adding hints to the language results in programming models that nobody uses. Have you ever used OpenMP? Probably not, and if you have, you probably wanted to puke all over yourself. It's a disgusting programming model. There's something offensive about writing what looks like a serial loop, and then adding decorations to it to "declare" to the compiler what are the set of things it could do to the loop. Eventually these declarations get so sophisticated that you realize that you made a poor life decision by using them - you could have been more productive if you wrote the thread code yourself. There is a similar, though lesser, problem with X10; in X10 at least you can do like in Cilk and just spin up a bunch of light-weight threads directly. But even X10 has so far had only limited success, and a bunch of super smart people have been working on it for a decade. In short, hardly anybody uses these programming hints. I'm opposed to adding such a hint-based model to JS because we already know that other such similar models are used by almost nobody - these are niche programming models that you don't want to pick up unless you've literally run out of all other sensible options.
So the most promising thing appears to be strengthening automagic inference, which brings me to:
> I'm not aware of any work on JIT compilers doing this.
Neither am I!
> Why do you think we could do better?
Because we're doing it in a JIT compiler. The problem with (B) is that you don't know, in a static ahead of time compiler, how long a loop will execute. And even if you did have a guess, if your guess turned out to be wrong and you emitted code to call pthread_create or similar, you'd get really terrible performance. This is why traditional autoparallelization techniques fail in static compilers: static compiler writers, like all other performance geeks, don't like to add features that allege to improve performance but in fact regress performance.
But in a JIT, we already have to do two things that the static people can't *quite* do:
1) Runtime feedback-driven decision making that is acutely aware of the running time of loops. Both JSC and V8 do already know your average trip count through any loop; they may not know it explicitly (i.e. for example in JSC we don't have a data structure that will immediately tell you this information) but we do know it implicitly (we happen to already have a loop trip counter for deciding when to tier up, even though that counter is currently not designed for the level of accuracy that autoparallelization would need). Point is, if we wanted to have the level of granularity of understanding of loop execution times that is necessary for inferring when it is profitable to parallelize a loop that you've already proven to be sound to parallelize, then we could add it, at not much cost - we'd likely somehow hijack existing loop profiling functionality.
2) JITs are designed around sometimes making dump decisions, bailing out when we detect that we've made a dumb decision, reconstituting the state of the world so that we're in a state "as if" we had never made that decision, recording that we were dumb to have made that decision, and then trying again without making that same dumb decision. This closed-loop feedback is what allows us to infer types, reshape objects, make arithmetic fast, etc. We - both us here in JSC and those V8 dudes - do it all the time, because we're awesome at our jobs. So, if we had some statistical profiling that told us that it's likely that a loop takes a long time, we could add guards that assert that this forever continues to be true; if it turns out to be false then we deoptimize back to a serial world. Conversely, if we had some profiling that told us that it is _NOT_ profitable to parallelize a loop, we could similarly guard this and "deoptimize" (more like *optimize*) into a parallel loop mode.
The irony here is that the overheads we already pay in JITs to enable even basic type inference already give us most of the functionality we would need to detect when it is profitable to autoparallelize.
This, to me, appears like a comprehensive solution to the profitability problem. It still leaves the soundness problem. Here we have some help from the JIT-crazy as well: for example we wouldn't have to *prove* that a loop has no carried side-effects; we could simply *speculate* that it doesn't and insert appropriate checks. This is easier. I don't know if it's easy and practical enough to put into a browser, though, which brings me to the central point:
If it is impractical to infer that a forEach closure passed to ParallelArray is safe-to-parallelize, then it follows that ParallelArrays are impractical, and we should R- this whole programming model and look for something else - like adding threads and locks to JS and calling it a day. But if inferring that a forEach closure is safe-to-parallelize is doable, then it is _no_more_ doable in a world where you have the ParallelArray construct - since that construct only helps with the profitable-to-parallelize part of the equation. We could already do it for any closure, or any loop body. And then we could use existing JIT madness to take care of the profitability question.
> (Honest question, I'm not trying to be snarky, nor am I an expert).
Hope my answers help! :-)
Anyway, I was the one being snarky here. ;-) I don't actually think that this is a good feature for the web. It would be better to just give JS a more general-purpose concurrency construct - something that would work even if you *did* have side effects! I like threads and locks, but then again, I've been told that I'm insane for liking them. But I do know that to do the work to infer the safe-to-parallelize that ParallelArrays need would be a *huge* opportunity cost for the project - and if it was successful then it would be better to go all the way and just infer the profitable-to-parallelize bit as well. It's better to infer things for the programmer, than to have the programmer tell you things, if at all possible!
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the webkit-dev