[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Wed Aug 15 16:02:47 PDT 2012

On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo <fpizlo at apple.com> wrote:
> Apparently I was somewhat unclear.  Let me restate.  We have the following mechanisms available when a test fails:
>
> 1) Check in a new -expected.* file.
>
> 2) Modify the test.
>
> 3) Modify a TestExpectations file.
>
> 4) Add the test to a Skipped file.
>
> 5) Remove the test entirely.
>
> I have no problem with (1) unless it is intended to mark the test as expected-to-fail-but-not-crash.  I agree that using -expected.* to accomplish what TestExpectations accomplishes is not valuable, but I further believe that even TestExpectations is not valuable.
>
> I broadly prefer (2) whenever possible.
>
> I believe that (3) and (4) are redundant, and I don't buy the value of (3).
>
> I don't like (5) but we should probably do more of it for tests that have a chronically low signal-to-noise ratio.
>

Thank you for clarifying. I had actually written an almost identical
list but didn't send it, so I think we're on the same page at least as
far as understanding the problem goes ...

So, I would describe my suggestion as an improved variant of the kind
of (1) that is used to mean "expected-to-fail-but-not-crash" (which
I'll call 1-fail), and we would use it in the cases where we use
(3), (4), or (1-fail) today.

I would also agree that we should do (2) where possible, but I don't
think this is easily possible for a large class of tests, especially
pixel tests, although I am currently working on other things that will
hopefully help here.

Chromium certainly does a lot of (3) today, and some (1-fail). Other
ports definitely use (1-fail) or (4) today, because (2) is rarely
possible for many, many tests.

We know that doing (1-fail), (3), or (4) causes real maintenance woes
down the road, but also that doing (1-fail) or (3) catches real
problems that simply skipping the test would not -- at some cost.
Whether the benefit is worth the cost is not known, of course, but I
believe it is. I am hoping that my suggestion will have a lower
overall cost than doing (1-fail) or (3).
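
To make the difference between skipping and expecting a failure
concrete, here is a minimal sketch in Python. It is only an
illustration, not how the test harness is actually implemented: the
names Outcome, Disposition, and is_reported are made up, and it
assumes a test has just three possible outcomes.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"    # ran to completion, but the output is wrong
    CRASH = "crash"


class Disposition(Enum):
    RUN_NORMALLY = "run normally"
    EXPECTED_FAIL = "expected to fail but not crash"  # (1-fail) or (3)
    SKIPPED = "skipped"                               # (4)


def is_reported(disposition, outcome):
    """Return True if this run would be flagged as unexpected."""
    if disposition is Disposition.SKIPPED:
        # The test never runs, so nothing can be flagged.
        return False
    if disposition is Disposition.EXPECTED_FAIL:
        # Anything other than the known failure is flagged, crashes included.
        return outcome is not Outcome.FAIL
    return outcome is not Outcome.PASS
```

Under these assumptions, a test that starts crashing is flagged when it
is marked as an expected failure, and goes unnoticed when it is skipped.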

> You're proposing a new mechanism. I'm arguing that given the sheer number of tests, and the overheads associated with maintaining them, (4) is the broadly more productive strategy in terms of bugs-fixed/person-hours.  And, increasing the number of mechanisms for dealing with tests by 20% is likely to reduce overall productivity rather than helping anyone.
>

Why do you believe this to be true? I'm not being flippant here ... I
think this is a very plausible argument, and it may well be true, but
I don't know what criteria we would use to evaluate it. Some of the
possible factors are:

* the complexity of the test infrastructure and the cognitive load it
imposes on developers
* the cost of bugs that are missed because we're skipping the tests
intended to catch those bugs
* the cost of looking at "regressions" and trying to figure out if the
regression is something you care about or not
* the cost of looking at the "-expected" results and trying to figure
out if what is "expected" is correct or not

There may be others as well. The last three are all very real in my
experience, and I believe they significantly outweigh the first one,
but I don't know how to assess that objectively (and I don't think
it's even possible, since different people/teams/ports will weigh
these things differently).

-- Dirk