[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Thu Aug 16 17:18:15 PDT 2012

On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney <schenney at chromium.org> wrote:
> On Thu, Aug 16, 2012 at 6:15 PM, Ryosuke Niwa <rniwa at webkit.org> wrote:
>>
>> On Thu, Aug 16, 2012 at 3:04 PM, Dirk Pranke <dpranke at chromium.org> wrote:
>>>
>>> On Thu, Aug 16, 2012 at 2:23 PM, Ryosuke Niwa <rniwa at webkit.org> wrote:
>>> > Like Filip, I'm extremely concerned about the prospect of us
>>> > introducing
>>> > yet-another-way-of-doing-things, and not be able to get rid of it
>>> > later.
>>>
>>> Presumably the only way we'd be not able to get rid of it would be if
>>> some port actually liked it, in which case, why would we get rid of it?
>>
>>
>> Because we already have too many ways to deal with test failures.
>>
>> Why don't we at least merge Skipped and TestExpectations first.
>>
>
> I agree with the priorities above, at least. I also agree with the overall
> goal of making our implementation match our philosophy on testing.
>
> Ryosuke has raised a very valid point: it is not clear what a Gardener
> should do with a test that has changed behavior. Personally, the area I deal
> with is very susceptible to the "single pixel differences matter" issue, and
> I would prefer gardeners to flag updates in some way so that I can look at
> it myself (or one of the very few other experts can look at it). Maybe I'm
> just a control freak.
>
> In the current situation, gardeners are not likely to roll out a patch short
> of build failures, obvious test or perf regressions, or security issues. For
> the bulk of cases where a test result has changed, the gardener will either
> add it to expectations or go ahead and update the result.
>
> The result is a mixture of incorrect expected results and files listed in
> TestExpectations that may or may not be correct. The way I deal with this
> right now is to maintain a snapshot of TestExpectations and occasionally go
> through and look for newly added results, and check if they are in fact
> correct. And every now and then go look at older entries to see if they have
> been updated. Or I get lucky and notice that a new expected result has been
> checked in that is incorrect (I can mostly catch that by following checkins
> in the code).
>
> The proposed change does not alter this fundamental dynamic. Depending on
> what policy gardeners adopt, they might now rename the result as failing and
> remove the expected, or maybe they'll just update the expected. I'm still in
> a situation where I don't know the exact status of any given result, and
> where I have no easy way of knowing if a gardener has correctly updated an
> expected or failing image.

So, if we had -passing/-failing, then people who knew what the results
were supposed to be in a given directory (call them "owners") could
rename the existing -expected files into one category or another, and
then a gardener could add newly-failing tests as -expected; it would
then be trivial for the owner to look for new -expected files in that
directory.

Right now, an owner would have to periodically scan the directory for
files changed since the last time they scanned the directory, and
would have no way of knowing whether an updated file was "correct" or
not (perhaps you could filter out by who committed the change, but
it's certainly harder than ls *-expected).

So, I would contend that my proposal would make it easier for us to
have a process by which gardeners could punt on changes they're unsure
about and for owners to subsequently review them. I think you could
have a similar model if we just checked in *all* new results into
-expected, but it would be harder to implement (perhaps not
impossible, though, I haven't given it much thought yet).

On the other hand, if you follow the chromium model of just
suppressing the file, then we have no idea if the failure you
currently see has any relationship to the failure that was originally
seen, or if the file that is currently checked in has any relationship
to what "correct" might now need to be (we know that the file was
correct at some point but is now stale). So, in the chromium model, at
least, you can do a grep through TestExpectations to look for failing
tests, but you don't have the history you would get by looking at a
history of changes to -expected/-passing/-failing.

In a world where all failures are triaged promptly by qualified
owners, all of this mechanism is unnecessary. Unfortunately, we don't
live in that world at the moment.

-- Dirk
>
> Two questions come out of this:
> 1) What policy do we want gardeners to follow? Giving "one more option" is
> not helpful as far as I can tell. The new method does nothing that
> TestExpectations could not already do, so I suppose the goal is to not put
> known failing tests in expectations.

I believe it does do something TestExpectations cannot do, see above,
and yes, we would then not put known failing tests in
TestExpectations.

> 2) How will it be simpler to know if something is really failing, really
> correct, or whether a non-expert gardener made an understandable mistake?
>

This part I'm not sure about but I believe that it would be easier to
guess whether the new result is newly failing, potentially failing
differently, or potentially passing, if we know how the prior result
was categorized, assuming it was categorized correctly. Perhaps this
implies a much stronger level of saying that you shouldn't categorize
something as -passing or -failing unless you're *sure* about what
you're doing.

I think I have already addressed your remaining comments ...

-- Dirk