[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Wed Aug 15 17:00:08 PDT 2012

On Aug 15, 2012, at 4:02 PM, Dirk Pranke <dpranke at chromium.org> wrote:

> On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo <fpizlo at apple.com> wrote:
>> Apparently I was somewhat unclear.  Let me restate.  We have the following mechanisms available when a test fails:
>> 
>> 1) Check in a new -expected.* file.
>> 
>> 2) Modify the test.
>> 
>> 3) Modify a TestExpectations file.
>> 
>> 4) Add the test to a Skipped file.
>> 
>> 5) Remove the test entirely.
>> 
>> I have no problem with (1) unless it is intended to mark the test as expected-to-fail-but-not-crash.  I agree that using -expected.* to accomplish what TestExpectations accomplishes is not valuable, but I further believe that even TestExpectations is not valuable.
>> 
>> I broadly prefer (2) whenever possible.
>> 
>> I believe that (3) and (4) are redundant, and I don't buy the value of (3).
>> 
>> I don't like (5) but we should probably do more of it for tests that have a chronically low signal-to-noise ratio.
>> 
> 
> Thank you for clarifying. I had actually written an almost identical
> list but didn't send it, so I think we're on the same page at least as
> far as understanding the problem goes ...
> 
> So, I would describe my suggestion as an improved variant of the kind
> of (1) that can be used as "expected-to-fail-but-not-crash" (which
> I'll call 1-fail), and that we would use this in cases where we use
> (3), (4), or (1-fail) today.
> 
> I would also agree that we should do (2) where possible, but I don't
> think this is easily possible for a large class of tests, especially
> pixel tests, although I am currently working on other things that will
> hopefully help here.
> 
> Chromium certainly does a lot of (3) today, and some (1-fail). Other
> ports definitely use (1-fail) or (4) today, because (2) is rarely
> possible for many, many tests.
> 
> We know that doing (1-fail), (3), or (4) causes real maintenance woes
> down the road, but also that doing (1-fail) or (3) catches real
> problems that simply skipping the test would not -- at some cost.
> Whether the benefit is worth the cost is not known, of course, but I
> believe it is. I am hoping that my suggestion will have a lower
> overall cost than doing (1-fail) or (3).

I also believe that the trade-off is known and, specifically, that having any tests in the (1-fail) or (3) states is more costly than having them in (4) or (5).

> 
>> You're proposing a new mechanism. I'm arguing that given the sheer number of tests, and the overheads associated with maintaining them, (4) is the broadly more productive strategy in terms of bugs-fixed/person-hours.  And, increasing the number of mechanisms for dealing with tests by 20% is likely to reduce overall productivity rather than helping anyone.
>> 
> 
> Why do you believe this to be true? I'm not being flippant here ... I
> think this is a very plausible argument, and it may well be true, but
> I don't know what the criteria we would use to evaluate it are. Some
> of the possible factors are:
> 
> * the complexity of the test infrastructure and the cognitive load it
> introduces on developers
> * the cost of bugs that are missed because we're skipping the tests
> intended to catch those bugs
> * the cost of looking at "regressions" and trying to figure out if the
> regression is something you care about or not
> * the cost of looking at the "-expected" results and trying to figure
> out if what is "expected" is correct or not
> 
> There may be others as well, but the last three are all very real in
> my experience, and I believe they significantly outweigh the first
> one, but I don't know how to objectively assess that (and I don't
> think it's even possible since different people/teams/ports will weigh
> these things differently).

I believe that the cognitive load is greater than any benefit from catching bugs incidentally by continuing to run a (1-fail) or (3) test and continuing to evaluate whether or not the expectation matches some notion of desired behavior.

And therein lies one possible source of disagreement.

But there is another source of disagreement: would adding a sixth facility that overlaps with (1-fail) or (3) help?  No, I don't believe it would.  It's just another mechanism leading to more possible arguments about which mechanism is better.

-F