[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Wed Aug 15 17:24:08 PDT 2012

On Wed, Aug 15, 2012 at 5:00 PM, Filip Pizlo <fpizlo at apple.com> wrote:
>
> On Aug 15, 2012, at 4:02 PM, Dirk Pranke <dpranke at chromium.org> wrote:
>
>> On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo <fpizlo at apple.com> wrote:
>>> Apparently I was somewhat unclear.  Let me restate.  We have the following mechanisms available when a test fails:
>>>
>>> 1) Check in a new -expected.* file.
>>>
>>> 2) Modify the test.
>>>
>>> 3) Modify a TestExpectations file.
>>>
>>> 4) Add the test to a Skipped file.
>>>
>>> 5) Remove the test entirely.
>>>
>>> I have no problem with (1) unless it is intended to mark the test as expected-to-fail-but-not-crash.  I agree that using -expected.* to accomplish what TestExpectations accomplishes is not valuable, but I further believe that even TestExpectations is not valuable.
>>>
>>> I broadly prefer (2) whenever possible.
>>>
>>> I believe that (3) and (4) are redundant, and I don't buy the value of (3).
>>>
>>> I don't like (5) but we should probably do more of it for tests that have a chronically low signal-to-noise ratio.
>>>
>>
>> Thank you for clarifying. I had actually written an almost identical
>> list but didn't send it, so I think we're on the same page at least as
>> far as understanding the problem goes ...
>>
>> So, I would describe my suggestion as an improved variant of the kind
>> of (1) that can be used as "expected-to-fail-but-not-crash" (which
>> I'll call 1-fail), and that we would use this in cases where we use
>> (3), (4), or (1-fail) today.
>>
>> I would also agree that we should do (2) where possible, but I don't
>> think this is easily possible for a large class of tests, especially
>> pixel tests, although I am currently working on other things that will
>> hopefully help here.
>>
>> Chromium certainly does a lot of (3) today, and some (1-fail). Other
>> ports definitely use (1-fail) or (4) today, because (2) is rarely
>> possible for many, many tests.
>>
>> We know that doing (1-fail), (3), or (4) causes real maintenance woes
>> down the road, but also that doing (1-fail) or (3) catches real
>> problems that simply skipping the test would not -- at some cost.
>> Whether the benefit is worth the cost is not known, of course, but I
>> believe it is. I am hoping that my suggestion will have a lower
>> overall cost than doing (1-fail) or (3).
>
> I also believe that the trade-off is known and, specifically, I believe that having any tests in the (1-fail) or (3) states is more costly than having them in (4) or (5).
>
>>
>>> You're proposing a new mechanism. I'm arguing that given the sheer number of tests, and the overheads associated with maintaining them, (4) is the broadly more productive strategy in terms of bugs-fixed/person-hours.  And, increasing the number of mechanisms for dealing with tests by 20% is likely to reduce overall productivity rather than helping anyone.
>>>
>>
>> Why do you believe this to be true? I'm not being flippant here ... I
>> think this is a very plausible argument, and it may well be true, but
>> I don't know what criteria we would use to evaluate it. Some
>> of the possible factors are:
>>
>> * the complexity of the test infrastructure and the cognitive load it
>> introduces on developers
>> * the cost of bugs that are missed because we're skipping the tests
>> intended to catch those bugs
>> * the cost of looking at "regressions" and trying to figure out if the
>> regression is something you care about or not
>> * the cost of looking at the "-expected" results and trying to figure
>> out if what is "expected" is correct or not
>>
>> There may be others as well, but the last three are all very real in
>> my experience, and I believe they significantly outweigh the first
>> one, but I don't know how to objectively assess that (and I don't
>> think it's even possible since different people/teams/ports will weigh
>> these things differently).
>
> I believe that the cognitive load is greater than any benefit from catching bugs incidentally by continuing to run a (1-fail) or (3) test, and continuing to evaluate whether or not the expectation matches some notions of desired behavior.
>
> And therein lies one possible source of disagreement.
>

Yes :)

> But there is another source of disagreement: would adding a sixth facility that overlaps with (1-fail) or (3) help?  No, I don't believe it would.  It's just another mechanism leading to more possible arguments about which mechanism is better.

Perhaps. I think it will, obviously, or I wouldn't be proposing this
in the first place.

I welcome other opinions on this as well ...

-- Dirk