[webkit-dev] An update on new-run-webkit-tests

Dirk Pranke dpranke at chromium.org
Thu Apr 7 12:55:39 PDT 2011


Hi Maciej,

Thanks for clarifying your concerns. I'll address them a little out of
order, because I think we probably agree on the important things even
if we disagree on the less important or more theoretical ones.

First, as far as the "future" discoveries go, I agree we should try to
fix as many of the known issues as possible before cutting over. It
may be that many or all of them have already been fixed, as I still
need to verify some of these bugs.

I definitely think the best way forward is to get NRWT bots up and see
how things are working in practice; that way we should have a lot of
data to tell us about the pros and cons of each tool.

That said,

On Thu, Apr 7, 2011 at 4:40 AM, Maciej Stachowiak <mjs at apple.com> wrote:
> If the new test tool causes more failures, or worse yet causes more tests to give unpredictable results, then that makes our testing system worse. The main benefit of new-run-webkit-tests, as I understand it, is that it can run the tests a lot faster. But I don't think it's a good tradeoff to run the tests a lot faster on the buildbot, if the results we get will be less reliable. I'm actually kind of shocked that anyone would consider replacing the test script with one that is known to make our existing tests less reliable.
>

Ideally, a test harness would be stable, consistent, and fast, and
would expose as many bugs as possible in a way that is as reproducible
as possible, and we should be shooting for that. But, just as our code
should be bug-free and isn't, the test harness may not be able to be
ideal either, at which point you have to prioritize some aspects of
its behavior over others.

For example, ORWT runs the tests in the same order every time in a
single thread, and uses a long timeout. This makes the test results
very stable and consistent, at the cost of potentially hiding some
bugs (tests getting slower without actually timing out, or tests that
depend on previous tests having run).

NRWT, at least the way we run it by default in Chromium, uses a much
shorter timeout and of course runs things in parallel. This exposes
those bugs, at the cost of making things appear flakier. We have
attempted to build tooling to help with this, because we generally
value finding more bugs over completely stable test runs. For example,
we have an entirely separate tool called the "flakiness dashboard"
that can help track the behavior of tests over time.

So, your "less reliable" might actually be my "finding more bugs" :)
NRWT has at least a couple of hooks that ORWT does not for helping to
tune to your desired preferences, and we can configure them on a
port-by-port basis.
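For example, a port that wants ORWT-like behavior should be able to
run something like this (flag names from memory; check --help for the
exact spelling):

  new-run-webkit-tests --child-processes=1 --time-out-ms=35000

That serializes the run into a single worker and stretches the timeout
back out, trading speed for stability.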

> I don't really care why tests would turn flaky. It's entirely possible that these are bugs in the tests themselves, or in the code they are testing. That should still be fixed.

Of course. I'm certainly not suggesting that we shouldn't fix bugs.
But, practically speaking, we are obviously okay with some tests
failing, because we list them in Skipped files today. It may also be
that some of those tests would now pass under NRWT, and we don't know
that either (NRWT re-uses the existing Skipped files as-is; at some
point we might want to change this).
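(For reference, a Skipped file is just a flat list of tests or
directories that get skipped wholesale; the paths below are made up:

  # comments start with a hash
  fast/example/made-up-test.html
  http/tests/example-directory

There's no way to say anything more specific than "don't run this".)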

You mentioned that the "main benefit" of NRWT is that it is faster,
but another benefit is that it can classify the expected failures in a
finer-grained way and detect real changes in the failure mode. If a
test that used to render pixels incorrectly now actually produces a
different render tree, we'll catch that. If it starts crashing, we'll
catch that. ORWT only has "run and expect it to pass" or "skip and
potentially miss something changing". I actually consider this more
useful than the speed improvements.
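Concretely, a line in NRWT's test_expectations.txt looks roughly like
this (syntax from memory, and the test name is made up):

  BUGWK12345 : fast/example/made-up-test.html = IMAGE

That says "we expect a pixel-only failure here"; if the test starts
producing a text diff, timing out, or crashing instead, the run flags
it rather than silently eating the change.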

>
> Nor do I think that marking tests as flaky in the expectations file means we are not losing test coverage. If a test can randomly pass or fail, and we know that and the tool expects it, we are nonetheless not getting the benefit of learning when the test starts failing.

See above. The data is all there; it's largely a question of what you
want to surface where. We have tried to build tools that get the best
of both worlds. Maybe this is partly a question of what you would like
the main waterfall and consoles to tell you, and perhaps I don't fully
understand how different ports would answer that question?

-- Dirk

