[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Thu Aug 16 18:11:13 PDT 2012

On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa <rniwa at webkit.org> wrote:
> On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke <dpranke at chromium.org> wrote:
>>
>> On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney <schenney at chromium.org>
>> wrote:
>> > I agree with the priorities above, at least. I also agree with the
>> > overall
>> > goal of making our implementation match our philosophy on testing.
>> >
>> > Ryosuke has raised a very valid point: it is not clear what a Gardener
>> > should do with a test that has changed behavior. Personally, the area I
>> > deal
>> > with is very susceptible to the "single pixel differences matter" issue,
>> > and
>> > I would prefer gardeners to flag updates in some way so that I can look
>> > at
>> > it myself (or one of the very few other experts can look at it). Maybe
>> > I'm
>> > just a control freak.
>> >
>> > In the current situation, gardeners are not likely to roll out a patch
>> > short
>> > of build failures, obvious test or perf regressions, or security issues.
>> > For
>> > the bulk of cases where a test result has changed, the gardener will
>> > either
>> > add it to expectations or go ahead and update the result.
>> >
>> > The result is a mixture of incorrect expected results and files listed
>> > in
>> > TestExpectations that may or may not be correct. The way I deal with
>> > this
>> > right now is to maintain a snapshot of TestExpectations and occasionally
>> > go
>> > through and look for newly added results, and check if they are in fact
>> > correct. And every now and then go look at older entries to see if they
>> > have
>> > been updated. Or I get lucky and notice that a new expected result has
>> > been
>> > checked in that is incorrect (I can mostly catch that by following
>> > checkins
>> > in the code).
>> >
>> > The proposed change does not alter this fundamental dynamic. Depending
>> > on
>> > what policy gardeners adopt, they might now rename the result as failing
>> > and
>> > remove the expected, or maybe they'll just update the expected. I'm
>> > still in
>> > a situation where I don't know the exact status of any given result, and
>> > where I have no easy way of knowing if a gardener has correctly updated
>> > an
>> > expected or failing image.
>>
>> So, if we had -passing/-failing, then people who knew what the results
>> were supposed to be in a given directory (call them "owners") could
>> rename the existing -expected files into one category or another, and
>> then a gardener could add newly-failing tests as -expected; it would
>> then be trivial for the owner to look for new -expected files in that
>> directory.
>>
>> Right now, an owner would have to periodically scan the directory for
>> files changed since the last time they scanned the directory, and
>> would have no way of knowing whether an updated file was "correct" or
>> not (perhaps you could filter out by who committed the change, but
>> it's certainly harder than ls *-expected).
>>
>> So, I would contend that my proposal would make it easier for us to
>> have a process by which gardeners could punt on changes they're unsure
>> about and for owners to subsequently review them. I think you could
>> have a similar model if we just checked in *all* new results into
>> -expected, but it would be harder to implement (perhaps not
>> impossible, though, I haven't given it much thought yet).
>
>
> That seems like a much better model.

To clarify: what does "That" refer to? My idea for gardeners to check
in "-expected" and owners to move to "-passing/failing", or the idea
that we check in "new results into -expected" and have some other way
of reviewing new baselines? (or something else)?

> Right now, one of the biggest problem
> is that people rebaseline tests or add test expectations without
> understanding what the correct output is. It happens for almost all ports.
>
> Having -correct.png doesn't necessarily help that because even a seemingly
> innocent one-pixel difference could be a bug (a patch may only add a
> baseline to one platform), and it's really hard to tell what the expected
> result is unless you have worked on the rendering code for a long time.
>

Yes. I hate gardening/sheriffing because I feel like I usually don't
know whether to rebaseline or not :) (Although I suspect having
-correct would be better than not having it, at least).

>> In a world where all failures are triaged promptly by qualified
>> owners, all of this mechanism is unnecessary. Unfortunately, we don't
>> live in that world at the moment.
>
>
> I'd argue that this is a bigger problem worth solving. Differentiating
> expected failure and correct results isn't going to help if gardeners were
> to keep making mistakes.
>

It is a bigger problem worth solving, but it is also much harder to
solve. As long as we have the combination of

1) changes can cause lots of tests to change expectations and need new baselines
2) it's acceptable to land changes without the accompanying updated baselines
3) there is pressure to keep the bots green (not only for its own
sake, but so that we can more easily identify *new* failures)
4) insufficient resources watching the tree (which is really true for
every port today, even chromium)

i think we're pretty much screwed :). I'm working on things that
*might* help with 1), but I don't see 2) or 4) changing in the
near-to-foreseeable future (unless my changes also might help w/ 2),
and I think 3) is just good practice.

Put differently, I think my proposal gives us the most manageable
process for keeping the bots green / keeping new failures visible
while still being able to identify older failures.

But I'm not wedded to it; if anyone has better ideas, I would
certainly like to hear them.

-- Dirk