[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Fri Aug 17 11:52:41 PDT 2012

On Fri, Aug 17, 2012 at 11:29 AM, Ryosuke Niwa <rniwa at webkit.org> wrote:
> On Fri, Aug 17, 2012 at 11:06 AM, Dirk Pranke <dpranke at chromium.org> wrote:
>>
>> > On the other hand, the pixel test output that's correct to one expert
>> > may not be correct to another expert. For example, I might think that
>> > one editing test's output is correct because it shows the feature we're
>> > testing in the test is working properly. But Stephen might realize that
>> > this -expected.png contains an off-by-one Skia bug. So categorizing
>> > -correct.png and -failure.png may require multiple experts looking at
>> > the output, which may or may not be practical.
>>
>> Perhaps. Obviously (a) there's a limit to what you can do here, and
>> (b) a test that requires multiple experts to verify its correctness
>> is, IMO, a bad test :).
>
>
> With that argument, almost all pixel tests are bad tests because pixel tests
> in editing, for example, involve editing, rendering, and graphics code.

If you need to understand how all of that stuff works in order to tell
that a pixel test is correct, then, yes, it's a bad test. It can fail
in too many different ways, and it is testing too many different things
at once. As Filip might suggest, it would be nice if we could
split such tests up. That said, I will freely grant that in many cases
we can't easily do better given the way things are currently
structured, and splitting up such tests would be an enormous amount of
work.

If the pixel test is testing whether a rectangle is actually green or
actually red, such a test is fine, doesn't need much subject matter
expertise, and it is hard to imagine how you'd test such a thing some
other way.

> I don't think any single person can comprehend the entire stack to tell with
> 100% confidence that the test result is exactly and precisely correct.

Sure. Such a high bar should be avoided.

>> > I think we should just check in whatever result we're currently seeing
>> > as -expected.png, because then we would at least have no ambiguity in
>> > the process. We just check in whatever we're currently seeing, file a
>> > bug if we see a problem with the new result, and possibly roll out the
>> > patch after talking with the author/reviewer.
>>
>> This is basically saying we should just follow the "existing
>> non-Chromium" process, right?
>
>
> Yes. In addition, doing so will significantly reduce the complexity of the
> current process.
>
>> This would seem to bring us back to step
>> 1: it doesn't address the problem that I identified with the "existing
>> non-Chromium" process, namely that a non-expert can't tell by looking
>> at the checkout what tests are believed to be passing or not.
>
>
> What is the use case for this? I've been working on WebKit for more than 3
> years, and I've never had to think about whether a test for an area outside
> of my expertise has the correct output or not other than when I was
> gardening. And having -correct / -failing wouldn't have helped me know
> what the correct output was when I was gardening anyway, because the new
> output may just as well be a new -correct or -failing result.

I've done this frequently when gardening, when simply trying to learn
how a given chunk of code works and how a given chunk of tests work
(or don't work), and when trying to get a sense of how well our
product is or isn't passing tests.

Perhaps this is the case because I tend to work more on infrastructure
and testing, and look at stuff shallowly across the whole tree rather
than in depth in particular areas as you do.

>> I don't think searching bugzilla (as it is currently used) is a workable
>> alternative.
>
>
> Why not? Bugzilla is the tool we use to triage and track bugs. I don't see a
> need for an alternative method to keep track of bugs.

The way we currently use bugzilla, it is difficult if not impossible
to find a concise and accurate list of all the failing layout tests
meeting any sort of filename- or directory-based criteria (maybe you
can do it just for editing, I don't know). The layout test summary
reports that Ojan sends out to the chromium devs are an example of
this: he generates them from the TestExpectations files; doing so from
bugzilla is not currently feasible.
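
As a rough illustration of why a directory-based list of failing tests is
easy to derive from an expectations file, here is a minimal Python sketch.
It assumes a simplified, hypothetical line format of the form
"[bug-id] path/to/test.html [ Expectation ... ]"; it is not the actual
TestExpectations parser, and the function names and sample bug ids below
are made up for the example.

```python
def parse_line(line):
    """Parse one line of the simplified, hypothetical expectations format.

    Returns (test_path, expectations, bug_id) or None for blank, comment,
    or unrecognized lines.
    """
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if not line or "[" not in line or not line.endswith("]"):
        return None
    head, _, tail = line.partition("[")
    head_tokens = head.split()
    if not head_tokens:
        return None
    test_path = head_tokens[-1]
    # Assumption: an optional bug identifier precedes the test path.
    bug_id = head_tokens[0] if len(head_tokens) > 1 else None
    expectations = tail.rstrip("]").split()
    return test_path, expectations, bug_id


def failing_tests(text, prefix=""):
    """List tests under `prefix` with any expectation other than Pass."""
    failing = []
    for raw in text.splitlines():
        parsed = parse_line(raw)
        if parsed is None:
            continue
        test_path, expectations, _bug_id = parsed
        if test_path.startswith(prefix) and any(e != "Pass" for e in expectations):
            failing.append(test_path)
    return sorted(failing)


# Made-up sample data, only to exercise the functions above.
SAMPLE = """
# editing tests
webkit.org/b/1 editing/a.html [ Failure ]
editing/b.html [ Pass ]
fast/c.html [ ImageOnlyFailure ]
webkit.org/b/2 editing/sub/d.html [ Crash Pass ]
"""

if __name__ == "__main__":
    for path in failing_tests(SAMPLE, "editing/"):
        print(path)
```

The same query against bugzilla would depend on every bug consistently
naming the affected test paths, which is the part that the way we use it
today does not guarantee.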

Note that we could certainly extend bugzilla to make this easier, if
there were consensus to do so (and I would be in favor of this, but
that would also incur more process than we have today).

- Dirk