[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Thu Aug 16 17:41:11 PDT 2012

   1. On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke <dpranke at chromium.org>wrote:

On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney <schenney at chromium.org>
> wrote:
> > I agree with the priorities above, at least. I also agree with the
> overall
> > goal of making our implementation match our philosophy on testing.
> >
> > Ryosuke has raised a very valid point: it is not clear what a Gardener
> > should do with a test that has changed behavior. Personally, the area I
> deal
> > with is very susceptible to the "single pixel differences matter" issue,
> and
> > I would prefer gardeners to flag updates in some way so that I can look
> at
> > it myself (or one of the very few other experts can look at it). Maybe
> I'm
> > just a control freak.
> >
> > In the current situation, gardeners are not likely to roll out a patch
> short
> > of build failures, obvious test or perf regressions, or security issues.
> For
> > the bulk of cases where a test result has changed, the gardener will
> either
> > add it to expectations or go ahead and update the result.
> >
> > The result is a mixture of incorrect expected results and files listed in
> > TestExpectations that may or may not be correct. The way I deal with this
> > right now is to maintain a snapshot of TestExpectations and occasionally
> go
> > through and look for newly added results, and check if they are in fact
> > correct. And every now and then go look at older entries to see if they
> have
> > been updated. Or I get lucky and notice that a new expected result has
> been
> > checked in that is incorrect (I can mostly catch that by following
> checkins
> > in the code).
> >
> > The proposed change does not alter this fundamental dynamic. Depending on
> > what policy gardeners adopt, they might now rename the result as failing
> and
> > remove the expected, or maybe they'll just update the expected. I'm
> still in
> > a situation where I don't know the exact status of any given result, and
> > where I have no easy way of knowing if a gardener has correctly updated
> an
> > expected or failing image.
>
> So, if we had -passing/-failing, then people who knew what the results
> were supposed to be in a given directory (call them "owners") could
> rename the existing -expected files into one category or another, and
> then a gardener could add newly-failing tests as -expected; it would
> then be trivial for the owner to look for new -expected files in that
> directory.
>
> Right now, an owner would have to periodically scan the directory for
> files changed since the last time they scanned the directory, and
> would have no way of knowing whether an updated file was "correct" or
> not (perhaps you could filter out by who committed the change, but
> it's certainly harder than ls *-expected).
>
> So, I would contend that my proposal would make it easier for us to
> have a process by which gardeners could punt on changes they're unsure
> about and for owners to subsequently review them. I think you could
> have a similar model if we just checked in *all* new results into
> -expected, but it would be harder to implement (perhaps not
> impossible, though, I haven't given it much thought yet).
>

That seems like a much better model. Right now, one of the biggest problem
is that people rebaseline tests or add test expectations without
understanding what the correct output is. It happens for almost all ports.

Having -correct.png doesn't necessarily help that because even a seemingly
innocent one-pixel difference could be a bug (a patch may only add a
baseline to one platform), and it's really hard to tell what the expected
result is unless you have worked on the rendering code for a long time.

On the other hand, if you follow the chromium model of just
> suppressing the file, then we have no idea if the failure you
> currently see has any relationship to the failure that was originally
> seen, or if the file that is currently checked in has any relationship
> to what "correct" might now need to be (we know that the file was
> correct at some point but is now stale). So, in the chromium model, at
> least, you can do a grep through TestExpectations to look for failing
> tests, but you don't have the history you would get by looking at a
> history of changes to -expected/-passing/-failing.
>
> In a world where all failures are triaged promptly by qualified
> owners, all of this mechanism is unnecessary. Unfortunately, we don't
> live in that world at the moment.
>

I'd argue that this is a bigger problem worth solving. Differentiating
expected failure and correct results isn't going to help if gardeners were
to keep making mistakes.

- Ryosuke
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.webkit.org/pipermail/webkit-dev/attachments/20120816/cfea20f7/attachment.html>