[webkit-dev] Iterating SunSpider

Mike Belshe mike at belshe.com
Sun Jul 5 01:59:37 PDT 2009


On Sat, Jul 4, 2009 at 3:27 PM, Maciej Stachowiak <mjs at apple.com> wrote:

>
> On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote:
>
> I'd like to understand what's going to happen with SunSpider in the future.
>  Here is a set of questions and criticisms.  I'm interested in how these can
> be addressed.
>
> There are 3 areas I'd like to see improved in
> SunSpider, some of which we've discussed before:
>
>
> #1: SunSpider is currently version 0.9.  Will SunSpider ever change?  Or is it static?
> I believe that benchmarks need to be able to
> move with the times.  As JS engines change and improve, and as new areas need
> to be benchmarked, we need to be able to roll the version, fix bugs, and
> benchmark new features.  The SunSpider version has not changed for ~2 years.
> How can we change this situation?  Are there plans for a new version
> already underway?
>
>
> I've been thinking about updating SunSpider for some time. There are two
> categories of changes I've thought about:
>
> 1) Quality-of-implementation changes to the harness. Among these might be the
> ability to use the harness with multiple test sets. That would be 1.0.
>

Cool


>
> 2) An updated set of tests - the current tests are too short, and don't
> adequately cover some areas of the language. I'd like to make the tests take
> at least 100ms each on modern browsers on recent hardware. I'd also be
> interested in incorporating some of the tests from the v8 benchmark suite,
> if the v8 developers were ok with this. That would be SunSpider 2.0.
>

Cool.  Use of v8 tests is just fine; they're all open source.


>
> The reason I've been hesitant to make any changes is that the press and
> independent analysts latched on to SunSpider as a way of comparing
> JavaScript implementations. Originally, it was primarily intended to be a
> tool for the WebKit team to help us make our JavaScript faster. However, now
> that third parties are relying on it, there are two things I want to be really
> careful about:
>
> a) I don't want to invalidate people's published data, so significant
> changes to the test content would need to be published as a clearly separate
> version.
>

Of course.  Small UI nit - the current SunSpider benchmark doesn't make the
version very prominent at all.  It would be nice to make it more salient.


>
> b) I want to avoid accidentally or intentionally making changes that are
> biased in favor of Safari or WebKit-based browsers in general, or that even
> give that impression. That would hurt the test's credibility. When we first
> made SunSpider, Safari actually didn't do that great on it, which I think
> helped people believe that the test wasn't designed to make us look good but
> to be a relatively unbiased comparison.
>

Of course.


>
> Thus, any change to the content would need to be scrutinized in some way.
> I'm not sure what it would take to get widespread agreement that a 2.0
> content set is fair, but I agree it's time to make one soonish (before the
> end of the year probably). Thoughts on this are welcome.
>
>
> #2: Use of summing as a scoring mechanism is problematic
> Unfortunately, the sum-based scoring techniques do not withstand the test
> of time as browsers improve.  When the benchmark was first introduced, each
> test was equally weighted and reasonably large.  Over time, however, the
> test becomes dominated by the slowest tests - basically the weighting of the
> individual tests is variable based on the performance of the JS engine under
> test.  Today's engines spend ~50% of their time on just string and date
> tests.  The other tests are largely irrelevant at this point, and becoming
> less relevant every day.  Eventually many of the tests will take near-zero
> time, and the benchmark will have to be scrapped unless we figure out a
> better way to score it.  Benchmarking research that long pre-dates
> SunSpider confirms that geometric means provide a better basis for
> comparison: http://portal.acm.org/citation.cfm?id=5673
> Can future versions of the SunSpider driver be made so that they won't
> become irrelevant over time?
>
>
> Use of summation instead of geometric mean was a considered choice. The
> intent is that engines should focus on whatever is slowest. A simplified
> example: let's say it's estimated that the likely workload in the field will
> consist of 50% Operation A and 50% Operation B, and I can benchmark them
> in isolation. Now let's say that in implementation Foo these operations are
> equally fast, while in implementation Bar, Operation A is 4x as fast as in
> Foo and Operation B is 4x as slow as in Foo. A comparison by geometric
> means would imply that Foo and Bar are equally good, but Bar would actually
> be twice as slow on the intended workload.
>
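
(Restating that example numerically - just a sketch, with a made-up 10ms
baseline for each operation in Foo:)

  // Hypothetical per-operation times in ms for a 50/50 A-and-B workload.
  var foo = { a: 10,     b: 10     };  // A and B equally fast
  var bar = { a: 10 / 4, b: 10 * 4 };  // A 4x faster, B 4x slower than Foo

  function sum(t)     { return t.a + t.b; }             // SunSpider-style total
  function geomean(t) { return Math.sqrt(t.a * t.b); }  // geometric mean

  sum(foo);      // 20
  sum(bar);      // 42.5 -> roughly twice as slow on the intended workload
  geomean(foo);  // 10
  geomean(bar);  // 10   -> "equally good"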

I could almost buy this if:
   a)  we had a really, really representative workload of what web pages do,
broken down into exactly the correct proportions.
   b)  that workload remained representative over time.

I'll argue that we'll never be very good at (a), and that (b) is impossible.

So what you end up with, after a couple of years, is that the slowest test in
the suite is the most significant part of the score.  Further, I'll predict
that the slowest test will most likely be the least relevant test, because the
truly important parts of JS engines were already optimized.  This has
happened with SunSpider 0.9 - the regex portions of the test became the
dominant factor, even though they were not nearly as prominent in the real
world as they were in the benchmark.  This leads to implementors optimizing
for the benchmark - and that is not what we want to encourage.

Do you consider summation an unchangeable aspect of the test, even as we move
to a new version?  Are you only willing to accept a summed score?



>
> Of course, doing this requires a judgment call on a reasonable balance of
> different kinds of code, and that balance needs to be re-evaluated
> periodically. But tests based on geometric means also make an implied
> judgment call. The operations comprising each individual test are added
> linearly. The test then judges that these particular combinations are each
> equally important.
>

Actually, this has nothing to do with geometric-mean versus sum-based scoring,
and in fact it points to a weakness of summation.  What you're talking about
here is really the weighting of the tests.  SunSpider didn't originally weight
the tests at all - they were all equal.  But were they all equally important?
Probably not, but like most benchmarks, it weighted them all equally because
there was no better weighting available (so I think equal weighting was a good
choice).  Over time, however, the tests diverged.  Slower tests became more
important than faster tests.  As a result, when implementors make changes
today, they are no longer improving against an equally weighted set of tests.
Isn't it better to maintain the weighting over time than to let the scoring
drift until the least-important tests count the most?

With geometric means, this dynamic weighting problem goes away.  If the
tests are weighted equally in the beginning, they are still weighted equally
over time.  This is the property that makes geometric means so important.

So you are right - any benchmark makes a judgment call about the balance of
code, and that call is independent of whether scoring uses summation or a
geometric mean.
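
To make the drift concrete, here's a sketch with made-up numbers: three tests
that start out equal, where a newer engine has sped up two of them by 10x.

  var oldEngine = [ 100, 100, 100 ];  // equal weight at the start
  var newEngine = [  10,  10, 100 ];  // tests 1 and 2 are now 10x faster

  function sum(ts) {
    var s = 0;
    for (var i = 0; i < ts.length; i++) s += ts[i];
    return s;
  }
  function geomean(ts) {
    var p = 1;
    for (var i = 0; i < ts.length; i++) p *= ts[i];
    return Math.pow(p, 1 / ts.length);
  }

  // Summed score: test 3 is now 100/120 of the total, so further work on
  // tests 1 and 2 barely moves the score.
  sum(newEngine) / sum(oldEngine);          // 0.4
  // Geometric mean: halving any one test's time improves the score by the
  // same factor (2^(1/3) here), no matter how fast that test already is.
  geomean(newEngine) / geomean(oldEngine);  // ~0.215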

> #3: The SunSpider harness has a variance problem due to CPU power savings
> modes.
> Because the test runs a tiny amount of JavaScript (often under 10ms)
> followed by a 500ms sleep, CPUs will go into power savings modes between
> test runs.  This radically changes the performance measurements and makes
> comparisons between two runs dependent on the user's power savings mode.
> To demonstrate this, run SunSpider on two machines - one with the Windows
> "balanced" (default) power setting and one with "high performance".  It's
> easy to see skews of 30% between these two modes.  I think we should change
> the test harness to avoid such accidental effects.
>
> (BTW - if you change SunSpider's sleep from 500ms to 10ms, the test runs
> in just a few seconds.  It is unclear to me why the pauses are so large.  My
> browser gets a 650ms score, so run 5 times, the test should take ~3000ms.
> But due to the pauses, it takes over 1 minute to run, leaving the CPU
> ~96% idle.)
>
>
> I think the pauses were large in an attempt to get stable, repeatable
> results, but are probably longer than necessary to achieve this. I agree
> with you that the artifacts in "balanced" power mode are a problem. Do you
> know what timer thresholds avoid the effect? I think this would be a
> reasonable "1.0" kind of change.
>

It's going to vary based on CPU and user settings.  Are pauses required in
order to get stable results?  I haven't studied that, but I wouldn't
expect pauses to make results more stable.
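
For concreteness, here is a rough sketch of the kind of driver loop we're
discussing (not the actual SunSpider harness - just an illustration of where
the inter-run pause comes in):

  var PAUSE_MS = 500;  // SunSpider-style pause; try 10 to see the difference

  function timeOneRun(test) {
    var start = new Date().getTime();
    test();
    return new Date().getTime() - start;
  }

  function runSuite(tests, runsPerTest, done) {
    var results = [];
    var total = tests.length * runsPerTest;
    function next() {
      if (results.length === total) { done(results); return; }
      results.push(timeOneRun(tests[results.length % tests.length]));
      // During this pause the CPU is idle; under a "balanced" power plan it
      // can drop to a lower clock, so the next run starts at reduced speed.
      setTimeout(next, PAUSE_MS);
    }
    next();
  }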



> Possible solution:
> The Dromaeo test suite already incorporates the individual SunSpider tests
> under a new benchmark harness that fixes all 3 of the above issues.  Thus,
> one approach would be to retire SunSpider 0.9 in favor of Dromaeo:
> http://dromaeo.com/?sunspider  Dromaeo has also done a lot of good work to
> ensure statistical significance of the results.  Once we have a better
> benchmarking framework, it would be great to build a new microbenchmark mix
> that more realistically exercises today's JavaScript.
>
>
> In my experience, Dromaeo gives significantly more variable results
> than SunSpider. I don't entirely trust the test to avoid interference from
> non-JS browser code executing, and I am not sure their statistical analysis
> is sound. In addition, using sum instead of geometric mean was a considered
> choice. It would be easy to fix in SunSpider if we wanted to, but I don't
> think we should. Also, I don't think Dromaeo has a pure command-line
> harness, and it depends on the server so it can't easily be used offline or
> with the network disabled.
>


>
> Many things about the way the SunSpider harness works are designed to give
> precise and repeatable results. That's very important to us, because when
> doing performance work we often want to gauge the impact of changes that
> have a small performance effect. With Dromaeo there is too much noise to do
> this effectively, at least in my past experience.
>

Agreed, we need repeatable results; I'm not sure whether there is more
variance in Dromaeo or not.

Mike


>
> Regards,
> Maciej
>