[webkit-dev] Iterating SunSpider

Sat Jul 4 15:27:57 PDT 2009

On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote:

> I'd like to understand what's going to happen with SunSpider in the  
> future.  Here is a set of questions and criticisms.  I'm interested  
> in how these can be addressed.
>
> There are 3 areas I'd like to see improved in SunSpider, some of  
> which we've discussed before:
>
> #1: SunSpider is currently version 0.9.  Will SunSpider ever  
> change?  Or is it static?
> I believe that benchmarks need to be able to move with the times.   
> As JS engines change and improve, and as new areas need to be  
> benchmarked, we need to be able to roll the version, fix bugs, and  
> benchmark new features.  The SunSpider version has not changed for  
> ~2yrs.  How can we change this situation?  Are there plans for a new  
> version already underway?

I've been thinking about updating SunSpider for some time. There are  
two categories of changes I've thought about:

1) Quality-of-implementation changes to the harness. Among these might  
be ability to use the harness with multiple test sets. That would be  
1.0.

2) An updated set of tests - the current tests are too short, and  
don't adequately cover some areas of the language. I'd like to make  
the tests take at least 100ms each on modern browsers on recent  
hardware. I'd also be interested in incorporating some of the tests  
from the v8 benchmark suite, if the v8 developers were ok with this.  
That would be SunSpider 2.0.

The reason I've been hesitant to make any changes is that the press  
and independent analysts latched on to SunSpider as a way of comparing  
JavaScript implementations. Originally, it was primarily intended to  
be a tool for the WebKit team to help us make our JavaScript faster.  
However, now that third parties are relying on it, there are two things I  
want to be really careful about:

a) I don't want to invalidate people's published data, so significant  
changes to the test content would need to be published as a clearly  
separate version.

b) I want to avoid accidentally or intentionally making changes that  
are biased in favor of Safari or WebKit-based browsers in general, or  
that even give that impression. That would hurt the test's  
credibility. When we first made SunSpider, Safari actually didn't do  
that great on it, which I think helped people believe that the test  
wasn't designed to make us look good; it was designed to be a  
relatively unbiased comparison.

Thus, any change to the content would need to be scrutinized in some  
way. I'm not sure what it would take to get widespread agreement that  
a 2.0 content set is fair, but I agree it's time to make one soonish  
(before the end of the year probably). Thoughts on this are welcome.

>
> #2: Use of summing as a scoring mechanism is problematic
> Unfortunately, the sum-based scoring techniques do not withstand the  
> test of time as browsers improve.  When the benchmark was first  
> introduced, each test was equally weighted and reasonably large.   
> Over time, however, the test becomes dominated by the slowest tests  
> - basically the weighting of the individual tests is variable based  
> on the performance of the JS engine under test.  Today's engines  
> spend ~50% of their time on just string and date tests.  The other  
> tests are largely irrelevant at this point, and becoming less  
> relevant every day.  Eventually many of the tests will take near- 
> zero time, and the benchmark will have to be scrapped unless we  
> figure out a better way to score it.  Benchmarking research which  
> long pre-dates SunSpider confirms that geometric means provide a  
> better basis for comparison:  http://portal.acm.org/citation.cfm?id=5673 
>  Can future versions of the SunSpider driver be made so that they  
> won't become irrelevant over time?

Use of summation instead of geometric mean was a considered choice.  
The intent is that engines should focus on whatever is slowest. A  
simplified example: let's say it's estimated that the likely workload in  
the field will consist of 50% Operation A and 50% Operation B, and  
I can benchmark them in isolation. Now let's say that in implementation Foo  
these operations are equally fast, while in implementation Bar,  
Operation A is 4x as fast as in Foo, while Operation B is 4x as slow  
as in Foo. A comparison by geometric means would imply that Foo and  
Bar are equally good, but Bar would actually be about twice as slow on the  
intended workload.
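
The Foo/Bar comparison can be worked through numerically. What follows is  
only a minimal sketch in Python, assuming hypothetical timings of 1.0 unit  
per operation in Foo (so 0.25 and 4.0 in Bar) and an equal 50/50 weighting;  
the function names are illustrative and not part of the SunSpider harness.

```python
from math import prod

def geometric_mean(times):
    """Geometric mean of per-operation times."""
    return prod(times) ** (1.0 / len(times))

def weighted_total(times, weights):
    """Time spent on a workload that mixes the operations by weight."""
    return sum(w * t for w, t in zip(weights, times))

# Hypothetical timings (arbitrary units) for [Operation A, Operation B].
foo = [1.0, 1.0]    # both operations equally fast
bar = [0.25, 4.0]   # A is 4x as fast as in Foo, B is 4x as slow
weights = [0.5, 0.5]  # assumed workload: 50% A, 50% B

print("geometric mean:", geometric_mean(foo), geometric_mean(bar))
print("weighted total:", weighted_total(foo, weights),
      weighted_total(bar, weights))
```

Under these assumed numbers the geometric means are identical, while the  
summed time for Bar comes out at 2.125 times that of Foo.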

Of course, doing this requires a judgment call on reasonable balance  
of different kinds of code, and that balance needs to be re-evaluated  
periodically. But tests based on geometric means also make an implied  
judgment call. The operations comprising each individual test are  
added linearly. The test then judges that these particular  
combinations are each equally important.

>
> #3: The SunSpider harness has a variance problem due to CPU power  
> savings modes.
> Because the test runs a tiny amount of Javascript (often under 10ms)  
> followed by a 500ms sleep, CPUs will go into power savings modes  
> between test runs.  This radically changes the performance  
> measurements and makes it so that comparison between two runs is  
> dependent on the user's power savings mode.  To demonstrate this,  
> run SunSpider on two machines- one with the Windows  
> "balanced" (default) setting for power, and then again with "high  
> performance".  It's easy to see skews of 30% between these two  
> modes.  I think we should change the test harness to avoid such  
> accidental effects.
>
> (BTW - if you change SunSpider's sleep from 500ms  to 10ms, the test  
> runs in just a few seconds.  It is unclear to me why the pauses are  
> so large.  My browser gets a 650ms score, so run 5 times, that test  
> should take ~3000ms.  But due to the pauses, it takes over 1 minute  
> to run the test, leaving the CPU ~96% idle).

I think the pauses were large in an attempt to get stable, repeatable  
results, but are probably longer than necessary to achieve this. I  
agree with you that the artifacts in "balanced" power mode are a  
problem. Do you know what timer thresholds avoid the effect? I think  
this would be a reasonable "1.0" kind of change.
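
As a rough check on the idle-time figures quoted above, here is a small  
Python sketch. It assumes 26 individual tests in the suite and one sleep  
per test per run; both are assumptions made for illustration, not details  
stated in this thread. The 5 runs, the 500ms and 10ms sleeps, and the 650ms  
score are the numbers from the quoted message.

```python
def run_breakdown(num_tests, runs, sleep_ms, score_ms):
    """Return (busy_ms, idle_ms, idle_fraction) for a full harness run.

    Assumes one sleep of sleep_ms after every test in every run, and that
    score_ms is the total JavaScript execution time for one pass.
    """
    busy_ms = runs * score_ms
    idle_ms = num_tests * runs * sleep_ms
    return busy_ms, idle_ms, idle_ms / (busy_ms + idle_ms)

NUM_TESTS = 26  # assumed test count, not taken from the thread

for sleep_ms in (500, 10):
    busy, idle, frac = run_breakdown(NUM_TESTS, runs=5,
                                     sleep_ms=sleep_ms, score_ms=650)
    print(f"sleep={sleep_ms}ms: total={(busy + idle) / 1000:.2f}s, "
          f"idle={frac:.1%}")
```

Under these assumptions a 500ms sleep gives a run of just over a minute  
that is roughly 95% idle, and a 10ms sleep brings the whole run down to a  
few seconds.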

>
> Possible solution:
> The dromaeo test suite already incorporates the SunSpider individual  
> tests under a new benchmark harness which fixes all 3 of the above  
> issues.   Thus, one approach would be to retire SunSpider 0.9 in  
> favor of Dromaeo.   http://dromaeo.com/?sunspider  Dromaeo has also  
> done a lot of good work to ensure statistical significance of the  
> results.  Once we have a better benchmarking framework, it would be  
> great to build a new microbenchmark mix which more realistically  
> exercises today's JavaScript.

In my experience, Dromaeo gives significantly more variable  
results than SunSpider. I don't entirely trust the test to avoid  
interference from non-JS browser code executing, and I am not sure  
their statistical analysis is sound. In addition, using sum instead of  
geometric mean was a considered choice. It would be easy to fix in  
SunSpider if we wanted to, but I don't think we should. Also, I don't  
think Dromaeo has a pure command-line harness, and it depends on the  
server so it can't easily be used offline or with the network disabled.

Many things about the way the SunSpider harness works are designed to  
give precise and repeatable results. That's very important to us,  
because when doing performance work we often want to gauge the impact  
of changes that have a small performance effect. With Dromaeo there is  
too much noise to do this effectively, at least in my past experience.

Regards,
Maciej
