[webkit-dev] Skipping Flakey Tests

Tue Dec 22 16:27:44 PST 2009

On Tue, Dec 22, 2009 at 10:31 AM, Darin Adler <darin at apple.com> wrote:
> On Dec 21, 2009, at 6:14 PM, Dirk Pranke wrote:
>
>> Given all that, Darin, what were you suggesting when you said "Let's fix that"?
>
> Let's add a feature so something in the tests tree can indicate a Chromium Windows result should be the base result, rather than the platform/win one. We can debate exactly how it should work, but let's come up with a design and do it.
>

For a given test, either the test produces generic results (and the
results are checked in alongside the test), the test produces "mostly
generic" results (meaning most platforms/ports can use the generic
results, but some intentionally diverge), or the test produces
completely platform-specific results.

In the completely generic case, I hope we are not checking in
incorrect results. Are we concerned about this case?

I think the "mostly generic" case is probably a variant of the
"generic" case, and should have the same policy.

That leaves the "platform-specific" case. In this case, marking any
particular platform as "right" doesn't make a lot of sense, because
what's right for one platform may or may not be right for another. The
problem comes up in ports like Chromium that use a search path for
results. I would not suggest that we change anything here - if
platform/win/foo-expected.txt is "wrong", we should probably just
check in an override in platform/chromium-win/foo-expected.txt . If
too many of these situations occur, we're probably just better off
dropping platform/win from the search path (which is what I think we
actually should do in our win port, but I leave that as an
exercise for me to determine).
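
To make the search-path idea concrete, here is a minimal sketch in
Python of how a port might resolve a baseline. It is an illustration
under stated assumptions, not the code of any existing tool: the
function name find_baseline and the example search path are
hypothetical.

```python
import os


def find_baseline(layout_tests_dir, test_name, search_path):
    """Return the first existing expected-results file for test_name.

    search_path is an ordered list of platform directory names
    (hypothetical example: ["chromium-win", "win"]). The generic result
    checked in alongside the test is the final fallback. Returns None
    if no baseline exists anywhere.
    """
    base, _ = os.path.splitext(test_name)
    expected = base + "-expected.txt"

    # Earlier entries in the search path override later ones.
    for platform in search_path:
        candidate = os.path.join(layout_tests_dir, "platform", platform, expected)
        if os.path.exists(candidate):
            return candidate

    # Fall back to the generic result next to the test itself.
    generic = os.path.join(layout_tests_dir, expected)
    return generic if os.path.exists(generic) else None
```

With a search path of ["chromium-win", "win"], checking in
platform/chromium-win/foo-expected.txt overrides the platform/win
result; passing ["chromium-win"] alone corresponds to dropping
platform/win from the search path.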

So, I don't think we need to change anything to address the above issues.

There are one or two other points of design.

First, there's the question of whether or not "intentionally
incorrect" results should ever be checked in. One reason to do this is
because run-webkit-tests doesn't have a "FAIL" concept, just a
"SKIPPED" concept. Adding a "FAIL" concept would be easy, and probably
the best way is to add a "Failures" file alongside the "Skipped"
file, using the same syntax. An alternative would be to move to the
more general syntax (and hopefully, just move to the tool) that
Chromium uses.
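
As a rough sketch of the "Failures" file idea, in Python: I am
assuming the shared syntax is one test path or directory per line,
with blank lines and lines starting with "#" ignored. The function
names and the result labels below are hypothetical, not part of
run-webkit-tests.

```python
def parse_test_list(text):
    """Parse a Skipped-style list: one test or directory per line.

    Assumption: blank lines and lines starting with '#' are ignored.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


def is_listed(test_name, entries):
    """True if test_name is listed directly or lives under a listed directory."""
    for entry in entries:
        prefix = entry.rstrip("/") + "/"
        if test_name == entry or test_name.startswith(prefix):
            return True
    return False


def classify(test_name, output_matches_expected, skipped, failures):
    """Combine the two lists with the outcome of a run (labels are hypothetical)."""
    if is_listed(test_name, skipped):
        return "SKIPPED"
    if is_listed(test_name, failures):
        # A listed failure that starts passing is worth flagging too.
        return "UNEXPECTED PASS" if output_matches_expected else "EXPECTED FAIL"
    return "PASS" if output_matches_expected else "UNEXPECTED FAIL"
```

The point of the extra file is that a test listed in "Failures" still
runs, so a crash or an unexpected pass is noticed, whereas a skipped
test tells us nothing.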

Second, there's the question of whether or not you want to track what
the "expected incorrect" results are, separate from what the "expected
correct" results are. That way, you can detect when a test fails
*differently* than it has in the past. It is an open question how
useful this would be and how much maintenance it would require. If
we were to do it, I would suggest adding something like
"foo-failure.txt" files alongside the "foo-expected.txt" files.

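A sketch of what that comparison could look like, in Python. The
function name compare_output and the result labels are hypothetical;
the only idea taken from the proposal is that a "-failure" baseline
records the known-incorrect output next to the correct one.

```python
def compare_output(actual, expected, known_failure=None):
    """Classify a test's output against correct and known-incorrect baselines.

    expected is the contents of foo-expected.txt; known_failure is the
    contents of a foo-failure.txt file, or None if none is checked in.
    """
    if actual == expected:
        return "PASS"
    if known_failure is None:
        # No failure baseline: all we know is that the test failed.
        return "FAIL"
    if actual == known_failure:
        return "FAIL (same as before)"
    # The test is still failing, but not in the recorded way.
    return "FAIL (differently)"
```

The last case is the one the extra baseline buys us: without
foo-failure.txt, a test that regresses further is indistinguishable
from one that fails the way it always has.
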
To sum up:

(1) For platform-specific failures, we should either (a) check in new
overriding baselines or (b) fix the baseline search path. No
significant code changes are needed.

(2) For generic failures, we can (a) add a "Failures" file, (b)
implement Chromium's test-expectations syntax, (c) move to Chromium's
tools (getting b along the way), or (d) check in incorrect output as
the "expected results" and add platform-specific baselines for
platforms that "get it right".

(3) If you want to capture "expected incorrect" *and* "expected
correct", add a "-failure" set of expectations and mod the tools
accordingly.

I would vote for (1a) or (1b) (basically status quo), and (2c). I
really don't like (2d), and (2b) seems like a waste of effort compared
to (2c). If it's unclear whether (2c) is really valuable, I would
volunteer to implement (2a) as a stopgap (although it won't happen
until after the holidays). I would not bother to implement (3) at this
point, but I won't stop someone else from doing it, either.

-- Dirk