[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Ryosuke Niwa rniwa at webkit.org
Fri Aug 17 08:07:43 PDT 2012

On Thu, Aug 16, 2012 at 6:11 PM, Dirk Pranke <dpranke at chromium.org> wrote:

> On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa <rniwa at webkit.org> wrote:
> > On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke <dpranke at chromium.org> wrote:
> >>
> >> On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney <schenney at chromium.org> wrote:
> >> > I agree with the priorities above, at least. I also agree with the
> >> > overall goal of making our implementation match our philosophy on
> >> > testing.
> >> >
> >> > Ryosuke has raised a very valid point: it is not clear what a Gardener
> >> > should do with a test that has changed behavior. Personally, the area I
> >> > deal with is very susceptible to the "single pixel differences matter"
> >> > issue, and I would prefer gardeners to flag updates in some way so that
> >> > I can look at it myself (or one of the very few other experts can look
> >> > at it). Maybe I'm just a control freak.
> >> >
> >> > In the current situation, gardeners are not likely to roll out a patch
> >> > short of build failures, obvious test or perf regressions, or security
> >> > issues. For the bulk of cases where a test result has changed, the
> >> > gardener will either add it to expectations or go ahead and update the
> >> > result.
> >> >
> >> > The result is a mixture of incorrect expected results and files listed
> >> > in TestExpectations that may or may not be correct. The way I deal with
> >> > this right now is to maintain a snapshot of TestExpectations and
> >> > occasionally go through and look for newly added results, and check if
> >> > they are in fact correct. And every now and then go look at older
> >> > entries to see if they have been updated. Or I get lucky and notice
> >> > that a new expected result has been checked in that is incorrect (I can
> >> > mostly catch that by following checkins in the code).
> >> >
> >> > The proposed change does not alter this fundamental dynamic. Depending
> >> > on what policy gardeners adopt, they might now rename the result as
> >> > failing and remove the expected, or maybe they'll just update the
> >> > expected. I'm still in a situation where I don't know the exact status
> >> > of any given result, and where I have no easy way of knowing if a
> >> > gardener has correctly updated an expected or failing image.
> >>
> >> So, if we had -passing/-failing, then people who knew what the results
> >> were supposed to be in a given directory (call them "owners") could
> >> rename the existing -expected files into one category or another, and
> >> then a gardener could add newly-failing tests as -expected; it would
> >> then be trivial for the owner to look for new -expected files in that
> >> directory.
> >>
> >> Right now, an owner would have to periodically scan the directory for
> >> files changed since the last time they scanned the directory, and
> >> would have no way of knowing whether an updated file was "correct" or
> >> not (perhaps you could filter out by who committed the change, but
> >> it's certainly harder than ls *-expected).
> >>
> >> So, I would contend that my proposal would make it easier for us to
> >> have a process by which gardeners could punt on changes they're unsure
> >> about and for owners to subsequently review them. I think you could
> >> have a similar model if we just checked in *all* new results into
> >> -expected, but it would be harder to implement (perhaps not
> >> impossible, though, I haven't given it much thought yet).
> >
> >
> > That seems like a much better model.
> To clarify: what does "That" refer to? My idea for gardeners to check
> in "-expected" and owners to move to "-passing/failing", or the idea
> that we check in "new results into -expected" and have some other way
> of reviewing new baselines? (Or something else?)

The former. In particular, the idea that only experts in a given area can
check in -correct/-failing is promising because there are many cases where
pixel tests fail due to completely unrelated bugs and we still want to
consider the output correct. But we can implement this policy without the
proposed change. All we need is for gardeners to add test expectations and
wait for experts to come in and decide whether we need a rebaseline, a bug
fix, or a rollout.

On the other hand, pixel test output that's correct to one expert may not
be correct to another. For example, I might think that an editing test's
output is correct because it shows that the feature the test exercises is
working properly, but Stephen might realize that the same -expected.png
contains an off-by-one Skia bug. So categorizing results as -correct.png
or -failure.png may require multiple experts looking at the same output,
which may or may not be practical.

Furthermore, some areas may not have an adequate number of experts, or any
experts who can triage -expected.png, which means the process will fragment
further depending on how many experts a given area has.
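Whatever the staffing, the mechanical part of the triage is easy to script.
A minimal sketch, assuming the scheme where gardeners check in plain
-expected files and owners rename them once reviewed (the helper name is
hypothetical):

```python
# Hypothetical owner-triage helper: under the -passing/-failing scheme,
# any file still named "-expected" is a baseline a gardener checked in
# that an owner has not yet classified. This lists those files.
import fnmatch
import os


def untriaged_baselines(directory):
    """Return sorted paths of *-expected.* files, i.e. baselines
    still awaiting an owner's review."""
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if fnmatch.fnmatch(name, '*-expected.*'):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

An owner would run this over their directory (say, LayoutTests/editing)
instead of diffing the whole tree against a private snapshot.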

> >> In a world where all failures are triaged promptly by qualified
> >> owners, all of this mechanism is unnecessary. Unfortunately, we don't
> >> live in that world at the moment.
> >
> >
> > I'd argue that this is a bigger problem worth solving. Differentiating
> > expected failures and correct results isn't going to help if gardeners
> > keep making mistakes.
> >
> It is a bigger problem worth solving, but it is also much harder to
> solve.

I agree. This is why I think we should just check in whatever result we're
currently seeing as -expected.png; at least then there would be no
ambiguity in the process. We check in whatever we're currently seeing,
file a bug if we see a problem with the new result, and possibly roll out
the patch after talking with the author/reviewer.

The new result we check in may not be 100% right but experts — e.g. me for
editing and Stephen for Skia — can come in and look at recent changes to
triage any new failures.

In fact, it might be worthwhile to invest our time in improving tools to do
this. For example, could we have a portal where I can see new rebaselines
that happened in LayoutTests/editing and LayoutTests/platform/*/editing
since the last time I visited? It could show a chronological timeline of
baselines along with the possible cause of each (a list of changesets,
maybe?).
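The core of such a portal could be quite small. A rough sketch, using file
modification times as a stand-in for the commit history a real tool would
read (all names here are hypothetical):

```python
# Sketch of the "portal" idea: report baseline files under the watched
# directories that changed since the owner's last visit, newest first.
# Real tooling would consult the repository's changesets; mtimes stand
# in for that here.
import fnmatch
import os


def recent_rebaselines(directories, last_visit):
    """Return (mtime, path) pairs for baselines modified after the
    last_visit timestamp, sorted newest first: a chronological
    timeline of rebaselines."""
    timeline = []
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                if not fnmatch.fnmatch(name, '*-expected.*'):
                    continue
                path = os.path.join(root, name)
                mtime = os.path.getmtime(path)
                if mtime > last_visit:
                    timeline.append((mtime, path))
    return sorted(timeline, reverse=True)
```

Mapping each entry back to the changeset that caused it is the harder part,
but even this listing would save an owner from manually diffing snapshots.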

- Ryosuke