[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Dirk Pranke dpranke at chromium.org
Fri Aug 17 11:06:39 PDT 2012


On Fri, Aug 17, 2012 at 8:07 AM, Ryosuke Niwa <rniwa at webkit.org> wrote:
> On Thu, Aug 16, 2012 at 6:11 PM, Dirk Pranke <dpranke at chromium.org> wrote:
>>
>> On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa <rniwa at webkit.org> wrote:
>> > On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke <dpranke at chromium.org>
>> > wrote:
>> >>
>> >> On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney
>> >> <schenney at chromium.org>
>> >> wrote:
>> >> > I agree with the priorities above, at least. I also agree with the
>> >> > overall goal of making our implementation match our philosophy on
>> >> > testing.
>> >> >
>> >> > Ryosuke has raised a very valid point: it is not clear what a
>> >> > Gardener should do with a test that has changed behavior.
>> >> > Personally, the area I deal with is very susceptible to the
>> >> > "single pixel differences matter" issue, and I would prefer
>> >> > gardeners to flag updates in some way so that I can look at it
>> >> > myself (or one of the very few other experts can look at it).
>> >> > Maybe I'm just a control freak.
>> >> >
>> >> > In the current situation, gardeners are not likely to roll out a
>> >> > patch short of build failures, obvious test or perf regressions,
>> >> > or security issues. For the bulk of cases where a test result has
>> >> > changed, the gardener will either add it to expectations or go
>> >> > ahead and update the result.
>> >> >
>> >> > The result is a mixture of incorrect expected results and files
>> >> > listed in TestExpectations that may or may not be correct. The way
>> >> > I deal with this right now is to maintain a snapshot of
>> >> > TestExpectations and occasionally go through and look for newly
>> >> > added results, and check if they are in fact correct. And every now
>> >> > and then go look at older entries to see if they have been updated.
>> >> > Or I get lucky and notice that a new expected result has been
>> >> > checked in that is incorrect (I can mostly catch that by following
>> >> > checkins in the code).
>> >> >
>> >> > The proposed change does not alter this fundamental dynamic.
>> >> > Depending on what policy gardeners adopt, they might now rename the
>> >> > result as failing and remove the expected, or maybe they'll just
>> >> > update the expected. I'm still in a situation where I don't know
>> >> > the exact status of any given result, and where I have no easy way
>> >> > of knowing if a gardener has correctly updated an expected or
>> >> > failing image.
>> >>
>> >> So, if we had -passing/-failing, then people who knew what the results
>> >> were supposed to be in a given directory (call them "owners") could
>> >> rename the existing -expected files into one category or another, and
>> >> then a gardener could add newly-failing tests as -expected; it would
>> >> then be trivial for the owner to look for new -expected files in that
>> >> directory.
>> >>
>> >> Right now, an owner would have to periodically scan the directory for
>> >> files changed since the last time they scanned the directory, and
>> >> would have no way of knowing whether an updated file was "correct" or
>> >> not (perhaps you could filter out by who committed the change, but
>> >> it's certainly harder than ls *-expected).
>> >>
>> >> So, I would contend that my proposal would make it easier for us to
>> >> have a process by which gardeners could punt on changes they're unsure
>> >> about and for owners to subsequently review them. I think you could
>> >> have a similar model if we just checked in *all* new results into
>> >> -expected, but it would be harder to implement (perhaps not
>> >> impossible, though, I haven't given it much thought yet).
>> >
>> >
>> > That seems like a much better model.
>>
>> To clarify: what does "That" refer to? My idea for gardeners to check
>> in "-expected" and owners to move to "-passing/-failing", or the idea
>> that we check in "new results into -expected" and have some other way
>> of reviewing new baselines? (or something else)?
>
>
> The former. In particular, the idea that only experts of a given area can
> check in -correct/-failing is promising, because there are many cases where
> pixel tests fail due to completely unrelated bugs and we still want to
> consider the test correct. But we can implement this policy without that
> proposal. All we need to do is have gardeners add test expectations and
> wait for experts to come in and decide whether we need a rebaseline, a bug
> fix, or a rollout.

A problem with just adding test expectations is that we lose
visibility into whether the test is failing the same way or a
different way. That is one of the motivations (if not the primary one)
for my proposal ...
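
To make that concrete, here is a rough sketch of what a directory
might look like under the proposal (the file names below are purely
illustrative):

    LayoutTests/editing/foo.html
    LayoutTests/editing/foo-passing.png   # vetted by an owner as correct
    LayoutTests/editing/bar.html
    LayoutTests/editing/bar-failing.png   # known-wrong output we still diff against
    LayoutTests/editing/baz.html
    LayoutTests/editing/baz-expected.png  # checked in by a gardener, not yet triaged

An owner can then find everything awaiting triage with a plain
"ls editing/*-expected.*", and because the failing output is checked in,
the bots can tell when a test starts failing in a *different* way rather
than in the known way.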

>
> On the other hand, the pixel test output that's correct to one expert may
> not be correct to another expert. For example, I might think that one
> editing test's output is correct because it shows the feature we're testing
> in the test is working properly. But Stephen might realize that this
> -expected.png contains an off-by-one Skia bug. So categorizing -correct.png
> and -failure.png may require multiple experts looking at the output, which
> may or may not be practical.

Perhaps. Obviously (a) there's a limit to what you can do here, and
(b) a test that requires multiple experts to verify its correctness
is, IMO, a bad test :).

> I think we should just check in whatever result we're currently seeing as
> -expected.png, because then at least we wouldn't have any ambiguity in the
> process. We just check in whatever we're currently seeing, file a bug if we
> see a problem with the new result, and possibly roll out the patch after
> talking with the author/reviewer.

This is basically saying we should just follow the "existing
non-Chromium" process, right? This would seem to bring us back to step
1: it doesn't address the problem that I identified with the "existing
non-Chromium" process, namely that a non-expert can't tell by looking
at the checkout what tests are believed to be passing or not. I don't
think searching bugzilla (as it is currently used) is a workable
alternative.

> The new result we check in may not be 100% right but experts — e.g. me for
> editing and Stephen for Skia — can come in and look at recent changes to
> triage any new failures.
>
> In fact, it might be worthwhile for us to invest our time in improving tools
> to do this. For example, can we have a portal where I can see new
> rebaselines that happened in LayoutTests/editing and
> LayoutTests/platform/*/editing since the last time I visited the portal?
> e.g. it could show a chronological timeline of rebaselines along with a
> possible cause (a list of changesets, maybe?) for each one.

We could build such a portal, sure. I would be interested to hear from
others whether such a thing would be more or less useful than my
proposal.

Of course, you could just set up a watchlist for new expectations
today. Probably not quite as polished as we could get with a portal,
but dirt simple ...
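
For example (just a sketch; the definition name, regexp, and cc address
below are all made up), adding something like this to
Tools/Scripts/webkitpy/common/config/watchlist would cc an owner on any
patch that touches editing baselines:

    # Hypothetical entry -- adjust the pattern and address to taste.
    "DEFINITIONS": {
        "EditingBaselines": {
            "filename": r"LayoutTests/(platform/.*/)?editing/.*-expected\.",
        },
    },
    "CC_RULES": {
        "EditingBaselines": [ "some-editing-owner@webkit.org" ],
    },

As far as I know that only fires when a patch is uploaded for review, so
baselines landed directly by a gardener could still slip through, but it
gets you most of the way there with almost no work.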

-- Dirk
