[webkit-dev] Iterating SunSpider

Mike Belshe mike at belshe.com
Tue Jul 7 15:11:24 PDT 2009

On Sat, Jul 4, 2009 at 3:27 PM, Maciej Stachowiak <mjs at apple.com> wrote:

> On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote:
> I'd like to understand what's going to happen with SunSpider in the future.
>  Here is a set of questions and criticisms.  I'm interested in how these can
> be addressed.
> There are 3 areas I'd like to see improved in
> SunSpider, some of which we've discussed before:
> #1: SunSpider is currently version 0.9.  Will SunSpider ever change?  Or is it static?
> I believe that benchmarks need to be able to
> move with the times.  As JS engines change and improve, and as new areas need
> to be benchmarked, we need to be able to roll the version, fix bugs, and
> benchmark new features.  The SunSpider version has not changed for ~2yrs.
>  How can we change this situation?  Are there plans for a new version
> already underway?
> I've been thinking about updating SunSpider for some time. There are two
> categories of changes I've thought about:
> 1) Quality-of-implementation changes to the harness. Among these might be
> ability to use the harness with multiple test sets. That would be 1.0.
> 2) An updated set of tests - the current tests are too short, and don't
> adequately cover some areas of the language. I'd like to make the tests take
> at least 100ms each on modern browsers on recent hardware. I'd also be
> interested in incorporating some of the tests from the v8 benchmark suite,
> if the v8 developers were ok with this. That would be SunSpider 2.0.
> The reason I've been hesitant to make any changes is that the press and
> independent analysts latched on to SunSpider as a way of comparing
> JavaScript implementations. Originally, it was primarily intended to be a
> tool for the WebKit team to help us make our JavaScript faster. However, now
> that third parties are relying on it, there are two things I want to be really
> careful about:
> a) I don't want to invalidate people's published data, so significant
> changes to the test content would need to be published as a clearly separate
> version.
> b) I want to avoid accidentally or intentionally making changes that are
> biased in favor of Safari or WebKit-based browsers in general, or that even
> give that impression. That would hurt the test's credibility. When we first
> made SunSpider, Safari actually didn't do that great on it, which I think
> helped people believe that the test wasn't designed to make us look good, it
> was designed to be a relatively unbiased comparison.
> Thus, any change to the content would need to be scrutinized in some way.
> I'm not sure what it would take to get widespread agreement that a 2.0
> content set is fair, but I agree it's time to make one soonish (before the
> end of the year probably). Thoughts on this are welcome.
> #2: Use of summing as a scoring mechanism is problematic
> Unfortunately, the sum-based scoring techniques do not withstand the test
> of time as browsers improve.  When the benchmark was first introduced, each
> test was equally weighted and reasonably large.  Over time, however, the
> test becomes dominated by the slowest tests - basically the weighting of the
> individual tests is variable based on the performance of the JS engine under
> test.  Today's engines spend ~50% of their time on just string and date
> tests.  The other tests are largely irrelevant at this point, and becoming
> less relevant every day.  Eventually many of the tests will take near-zero
> time, and the benchmark will have to be scrapped unless we figure out a
> better way to score it.  Benchmarking research which long pre-dates
> SunSpider confirms that geometric means provide a better basis for
> comparison:  http://portal.acm.org/citation.cfm?id=5673 Can future
> versions of the SunSpider driver be made so that they won't become
> irrelevant over time?
> Use of summation instead of geometric mean was a considered choice. The
> intent is that engines should focus on whatever is slowest. A simplified
> example: let's say it's estimated that likely workload in the field will
> consist of 50% Operation A, and 50% of Operation B, and I can benchmark them
> in isolation. Now let's say in implementation Foo these operations are
> equally fast, while in implementation Bar, Operation A is 4x as fast as in
> Foo, while Operation B is 4x as slow as in Foo. A comparison by geometric
> means would imply that Foo and Bar are equally good, but Bar would actually
> be twice as slow on the intended workload.

BTW - the way to work around this is to have enough sub-benchmarks such that
this just doesn't happen.  If we have the right test coverage, it seems
unlikely to me that a code change would dramatically improve exactly one
test at an exponential expense of exactly one other test.  I'm not saying it
is impossible - just that code changes don't generally cause that behavior.
 To combat this we can implement a broader base of benchmarks as well as
longer-running tests that are not "too micro".
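
To make the quoted Foo/Bar example concrete, here is a minimal sketch in Python. The timings are hypothetical units chosen to match the 4x figures in the example, not measured data, and the function names are mine.

```python
import math

# Hypothetical per-operation times in arbitrary units (not measured data).
# Foo runs A and B equally fast; Bar is 4x faster on A and 4x slower on B.
foo = {"A": 1.0, "B": 1.0}
bar = {"A": 0.25, "B": 4.0}


def total(times):
    """Sum-based score: total time across all operations."""
    return sum(times.values())


def geometric_mean(times):
    """Geometric-mean score, computed via logs (times must be positive)."""
    logs = [math.log(t) for t in times.values()]
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    # The sums differ by roughly 2x (2.0 vs 4.25)...
    print("sum:     foo =", total(foo), " bar =", total(bar))
    # ...while the geometric means come out approximately equal.
    print("geomean: foo =", geometric_mean(foo), " bar =", geometric_mean(bar))
```

With only two sub-benchmarks, one 4x regression is fully cancelled by one 4x improvement under the geometric mean, which is the case the quoted example describes.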

This brings up another problem with summation.  The only case where
summation 'works' is if the benchmark workload is *the right workload* to
measure what browsers do.  In this case, your argument that speeding up one
portion of the benchmark at the expense of another should be measured is
reasonable.  But, I think the benchmark should be capable of adding more
benchmarks over time - potentially even covering corner cases and
less-frequented code.  These types of micro benchmarks have no place in a
summation based scoring model because we can't weight them accurately.
 Using geometric means, I could still weight the low-priority benchmark at
1/2 (or whatever) the weight of other benchmarks and have meaningful overall
scores.  However, in a sum based model, a browser which does really badly on
one low-priority test can get a horrible score even if it is better than all
the other browsers on every other benchmark.
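
A rough sketch of the weighting idea, assuming a weighted geometric mean of the form exp(sum(w_i * ln t_i) / sum(w_i)). The test names, timings, and weights are all hypothetical and only there to show the effect.

```python
import math


def weighted_geometric_mean(times, weights):
    """exp(sum(w_i * ln(t_i)) / sum(w_i)); all times must be positive."""
    total_weight = sum(weights[name] for name in times)
    log_sum = sum(weights[name] * math.log(t) for name, t in times.items())
    return math.exp(log_sum / total_weight)


# Hypothetical weights: three core tests at full weight, one corner case at 1/2.
weights = {"core1": 1.0, "core2": 1.0, "core3": 1.0, "corner": 0.5}

# Hypothetical timings in ms. Browser X is 2x faster on every core test
# but does very badly on the low-priority corner-case test.
browser_x = {"core1": 10.0, "core2": 10.0, "core3": 10.0, "corner": 1000.0}
browser_y = {"core1": 20.0, "core2": 20.0, "core3": 20.0, "corner": 20.0}

if __name__ == "__main__":
    for name, times in (("X", browser_x), ("Y", browser_y)):
        print(name,
              "sum =", sum(times.values()),
              "weighted geomean =", round(weighted_geometric_mean(times, weights), 2))
```

Under these made-up numbers the sum puts X far behind Y (1030 vs 80), entirely because of the corner-case test, while the weighted geometric mean puts the two close together with X slightly ahead.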


> Of course, doing this requires a judgment call on reasonable balance of
> different kinds of code, and that balance needs to be re-evaluated
> periodically. But tests based on geometric means also make an implied
> judgment call. The operations comprising each individual test are added
> linearly. The test then judges that these particular combinations are each
> equally important.
> #3: The SunSpider harness has a variance problem due to CPU power savings
> modes.
> Because the test runs a tiny amount of JavaScript (often under 10ms)
> followed by a 500ms sleep, CPUs will go into power savings modes between
> test runs.  This radically changes the performance measurements and makes it
> so that comparison between two runs is dependent on the user's power savings
> mode.  To demonstrate this, run SunSpider on two machines- one with the
> Windows "balanced" (default) setting for power, and then again with "high
> performance".  It's easy to see skews of 30% between these two modes.  I
> think we should change the test harness to avoid such accidental effects.
> (BTW - if you change SunSpider's sleep from 500ms  to 10ms, the test runs
> in just a few seconds.  It is unclear to me why the pauses are so large.  My
> browser gets a 650ms score, so run 5 times, that test should take ~3000ms.
>  But due to the pauses, it takes over 1 minute to run the test, leaving the CPU
> ~96% idle).
> I think the pauses were large in an attempt to get stable, repeatable
> results, but are probably longer than necessary to achieve this. I agree
> with you that the artifacts in "balanced" power mode are a problem. Do you
> know what timer thresholds avoid the effect? I think this would be a
> reasonable "1.0" kind of change.
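
The idle-time arithmetic above can be checked with a back-of-the-envelope model. This sketch assumes 26 individual tests in the suite and one 500ms sleep per test per run; both are my assumptions about the harness, not something stated in this thread. The 650ms score and 5 runs come from the message.

```python
def idle_fraction(score_ms, runs, tests, sleep_ms):
    """Fraction of wall-clock time spent sleeping rather than running tests.

    Assumes each run executes every test once and sleeps sleep_ms after each.
    """
    busy_ms = score_ms * runs
    idle_ms = sleep_ms * tests * runs
    return idle_ms / (busy_ms + idle_ms)


if __name__ == "__main__":
    # 650ms score and 5 runs are from the message; 26 tests is an assumption.
    with_500ms = idle_fraction(score_ms=650, runs=5, tests=26, sleep_ms=500)
    with_10ms = idle_fraction(score_ms=650, runs=5, tests=26, sleep_ms=10)
    print(f"500ms sleep: {with_500ms:.1%} idle")
    print(f"10ms sleep:  {with_10ms:.1%} idle")
```

Under those assumptions the model gives about 95% idle with the 500ms sleep (65s sleeping against 3.25s running), roughly in line with the "over 1 minute" and "~96% idle" figures quoted above.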
> Possible solution:
> The dromaeo test suite already incorporates the SunSpider individual tests
> under a new benchmark harness which fixes all 3 of the above issues.   Thus,
> one approach would be to retire SunSpider 0.9 in favor of Dromaeo.
> http://dromaeo.com/?sunspider  Dromaeo has also done a lot of good work to
> ensure statistical significance of the results.  Once we have a better
> benchmarking framework, it would be great to build a new microbenchmark mix
> which more realistically exercises today's JavaScript.
> In my experience, Dromaeo gives significantly more variable results
> than SunSpider. I don't entirely trust the test to avoid interference from
> non-JS browser code executing, and I am not sure their statistical analysis
> is sound. In addition, using sum instead of geometric mean was a considered
> choice. It would be easy to fix in SunSpider if we wanted to, but I don't
> think we should. Also, I don't think Dromaeo has a pure command-line
> harness, and it depends on the server so it can't easily be used offline or
> with the network disabled.
> Many things about the way the SunSpider harness works are designed to give
> precise and repeatable results. That's very important to us, because when
> doing performance work we often want to gauge the impact of changes that
> have a small performance effect. With Dromaeo there is too much noise to do
> this effectively, at least in my past experience.
> Regards,
> Maciej