[webkit-dev] A simpler proposal for handling failing tests WAS: A proposal for handling "failing" layout tests and TestExpectations
ojan at chromium.org
Fri Aug 17 16:55:19 PDT 2012
Asserting a test case is 100% correct is nearly impossible for a large
percentage of tests. The main advantage it gives us is the ability to have
-expected mean "unsure".
Let's instead only add -failing (i.e. no -passing), leaving -expected to
mean roughly what it does today to Chromium folk: as best we can tell,
this test is passing. -failing means it's *probably* an incorrect
result but needs an expert to look at it, either to mark it correct (i.e.
rename it to -expected) or to figure out the root cause of the bug.
This actually matches exactly what Chromium gardeners do today, except
instead of putting a line in TestExpectations/Skipped to look at later,
they check in the -failing file to look at later, which has all the
advantages Dirk listed in the other thread.
Like Dirk's proposal, having both a -failing and a -expected in the same
directory for the same test will be disallowed by the tooling.
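To make the rule concrete, here is a minimal sketch (not the actual webkit tooling; the function name and directory-scan approach are my own) of how a check for conflicting baselines could work:

```python
import os
from collections import defaultdict

def find_conflicting_baselines(test_dir):
    """Return test names that have both an -expected and a -failing
    baseline in the same directory, which the proposal disallows."""
    by_test = defaultdict(set)
    for name in os.listdir(test_dir):
        base, _ext = os.path.splitext(name)
        for suffix in ("-expected", "-failing"):
            if base.endswith(suffix):
                # strip the suffix to recover the test's base name
                by_test[base[: -len(suffix)]].add(suffix)
    return sorted(test for test, suffixes in by_test.items()
                  if {"-expected", "-failing"} <= suffixes)
```

A presubmit or lint step could run this over each test directory and refuse the change if the list is non-empty.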
The reason I like this is that it's more use-case driven. -failing is a
clear todo list for anyone wanting to fix layout tests.
On Fri, Aug 17, 2012 at 11:52 AM, Dirk Pranke <dpranke at chromium.org> wrote:
> On Fri, Aug 17, 2012 at 11:29 AM, Ryosuke Niwa <rniwa at webkit.org> wrote:
> > On Fri, Aug 17, 2012 at 11:06 AM, Dirk Pranke <dpranke at chromium.org> wrote:
> >> > On the other hand, the pixel test output that's correct to one expert
> >> > may
> >> > not be correct to another expert. For example, I might think that one
> >> > editing test's output is correct because it shows the feature we're
> >> > testing
> >> > in the test is working properly. But Stephen might realize that this
> >> > -expected.png contains an off-by-one Skia bug. So categorizing
> >> > -expected.png and -failure.png may require multiple experts looking
> >> > at the output, which may or may not be practical.
> >> Perhaps. Obviously (a) there's a limit to what you can do here, and
> >> (b) a test that requires multiple experts to verify its correctness
> >> is, IMO, a bad test :).
> > With that argument, almost all pixel tests are bad tests because pixel
> > tests in editing, for example, involve editing, rendering, and graphics code.
> If in order to tell a pixel test is correct you need to be aware of
> how all of that stuff works, then, yes, it's a bad test. It can fail
> too many different ways, and is testing too many different bits of
> information. As Filip might suggest, it would be nice if we could
> split such tests up. That said, I will freely grant that in many cases
> we can't easily do better given the way things are currently
> structured, and splitting up such tests would be an enormous amount of
> work.
> If the pixel test is testing whether a rectangle is actually green or
> actually red, such a test is fine, doesn't need much subject matter
> expertise, and it is hard to imagine how you'd test such a thing some
> other way.
> > I don't think any single person can comprehend the entire stack to tell
> > with 100% confidence that the test result is exactly and precisely correct.
> Sure. Such a high bar should be avoided.
> >> > I think we should just check in whatever result we're
> >> > currently seeing as -expected.png because then we at least wouldn't
> >> > have ambiguity in the process. We just check in whatever we're
> >> > seeing and file a bug if we see a problem with the new result, and
> >> > possibly roll out the patch after talking with the author/reviewer.
> >> This is basically saying we should just follow the "existing
> >> non-Chromium" process, right?
> > Yes. In addition, doing so will significantly reduce the complexity of
> > the current process.
> >> This would seem to bring us back to step
> >> 1: it doesn't address the problem that I identified with the "existing
> >> non-Chromium" process, namely that a non-expert can't tell by looking
> >> at the checkout what tests are believed to be passing or not.
> > What is the use case of this? I've been working on WebKit for more than 3
> > years, and I've never had to think about whether a test for an area
> > of my expertise has the correct output or not other than when I was
> > gardening. And having -correct / -failing wouldn't have helped me know
> > what the correct output was when I was gardening anyway, because the new
> > result may as well be a new -correct or -failing result.
> I've done this frequently when gardening, when simply trying to learn
> how a given chunk of code works and how a given chunk of tests work
> (or don't work), and when trying to get a sense of how well our
> product is or isn't passing tests.
> Perhaps this is the case because I tend to work more on infrastructure
> and testing, and to look at stuff shallowly across the whole tree rather
> than in depth in particular areas as you do.
> >> I don't think searching bugzilla (as it is currently used) is a workable
> >> alternative.
> > Why not? Bugzilla is the tool we use to triage and track bugs. I don't
> > see a need for an alternative method to keep track of bugs.
> The way we currently use bugzilla, it is difficult if not impossible
> to find a concise and accurate list of all the failing layout tests
> meeting any sort of filename- or directory-based criteria (maybe you
> can do it just for editing, I don't know). The layout test summary
> reports that Ojan sends out to the chromium devs is an example of
> this: he generates that from the TestExpectations files; doing so from
> bugzilla is not currently feasible.
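The TestExpectations-driven summary Dirk describes could be sketched roughly as follows. This is my own illustration, not Ojan's actual script, and it assumes a simplified line format (bug tag, optional modifiers, a colon, a test path, `=`, and an outcome); the real file format is richer than this:

```python
from collections import Counter

def summarize_expectations(lines):
    """Count expected-failure entries per top-level test directory,
    given TestExpectations-style lines such as
    'BUGCR123 MAC : editing/selection/foo.html = FAIL'."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # skip blank lines and comments
        if not line or line.startswith("//"):
            continue
        if "=" not in line:
            continue
        # crude parse: the last token before '=' is the test path
        path = line.split("=")[0].split()[-1]
        counts[path.split("/")[0]] += 1
    return counts
```

Producing the same breakdown from bugzilla would require every failing test to be tracked in a bug with a machine-readable test path, which is exactly what we don't have today.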
> Note that we could certainly extend bugzilla to make this easier, if
> there was consensus to do so (and I would be in favor of this, but
> that would also incur more process than we have today).
> - Dirk