[webkit-dev] A simpler proposal for handling failing tests WAS: A proposal for handling "failing" layout tests and TestExpectations

Wed Aug 22 14:00:07 PDT 2012

Sorry for the delays in getting back to this ... it's been a busy week.

On Mon, Aug 20, 2012 at 6:03 PM, Maciej Stachowiak <mjs at apple.com> wrote:
>
> Sorry, I overlooked these questions earlier.
>
> On Aug 17, 2012, at 7:36 PM, Dirk Pranke <dpranke at chromium.org> wrote:
>
>> I'm not sure if I like this idea or not. A couple of observations/questions ...
>>
>> 1) I wouldn't want to call it '-correct' unless we were sure it was
>> correct; '-previous' is better in that regard
>
> Good point. Though I think -unexpected-pass might be even better.
>

This suggestion has merit :).

>>
>> 2) the issue with keeping a '-correct' in the tree is that it's quite
>> possible for a previous correct expectation to need to change to a
>> different expectation to still be correct. i.e., they get stale.
>
> Sure, it's possible. At worst, when somebody goes to investigate expected failures, the old result will no longer be useful to them.
>

Yeah, it's not clear to me how useful the old result will be,
but we can try it and see.

>> I fear that this could quickly become more confusing than it's worth.
>> It's also not clear to me when -previous gets updated or removed?
>
> It would get removed if someone decides that -expected is actually better or no worse; or if someone fixes the bug that caused the expected failure in the first place, or if the result updates in such a way that neither result is relevant.
>

Okay, so it's the developer's discretion to remove the old file? This
strikes me as something that's not likely to happen reliably without
nagging, so maybe we should create a nag mechanism / report or
automatically file bugs or something.

Do we agree that the old result should not be allowed to persist
indefinitely (i.e., it should be removed at least when the bug is
fixed and -expected is believed to be correct/passing)?

>> 3) It also feels like '-previous' is something that we can just as
>> easily get from SVN/Git/whatever, in a completely foolproof and
>> automatic way. I grant that it's easier to just do a side by side
>> compare, but "diff against previous" isn't that hard to do and we
>> could easily write a wrapper to make it trivial ...
>
> True, but in addition to a convenience improvement, it would also leave a clear record of which tests are expected failures.
>

Right, that's a good point. Also, it occurred to me that "previous" is
harder if you're flipping back and forth between
-passing/-failing/-expected.
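
For reference, the "diff against previous" wrapper from (3) could look
roughly like the sketch below. It assumes a Git checkout and a path
given relative to the repository root; both function names are
hypothetical, and an SVN checkout would need different commands.

```python
import difflib
import subprocess


def previous_revision_text(path, repo_root="."):
    """Return the file's contents from before the last commit that touched it.

    Raises subprocess.CalledProcessError if the file has no earlier version.
    """
    last_commit = subprocess.run(
        ["git", "log", "-n", "1", "--format=%H", "--", path],
        cwd=repo_root, check=True, capture_output=True, text=True,
    ).stdout.strip()
    return subprocess.run(
        ["git", "show", f"{last_commit}^:{path}"],
        cwd=repo_root, check=True, capture_output=True, text=True,
    ).stdout


def diff_results(previous_text, current_text, name="result"):
    """Return a unified diff between the previous and current result text."""
    lines = difflib.unified_diff(
        previous_text.splitlines(keepends=True),
        current_text.splitlines(keepends=True),
        fromfile=name + " (previous)",
        tofile=name + " (current)",
    )
    return "".join(lines)
```

Note that this only finds the version before the most recent change,
which is exactly the limitation when a result flips back and forth.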

>
>> 4) I'd want to work through the various branches in the workflow more
>> before I felt comfortable with this. When I was coming up with my
>> original proposal I originally wanted to allow -passing and -expected
>> to live side-by-side, but all sorts of complications arose, so I'd be
>> worried that we'd have them here, too.
>
> Here's how I imagine the workflow when a sheriff or just innocent bystander notices a deterministically failing test. Follow this two-step algorithm:
>
> 1) Are you confident that the new result is an improvement or no worse? If so, then simply update -expected.txt.
> 2) Otherwise, copy the old result to -<whatever-we-call-the-unexpected-pass-result>.txt, and check in the new result as -<whatever-we-call-the-expected-failure-result>.txt.
>

Thanks for clarifying/restating. I still need to think through the
implications of this a bit more ... I will send another note with my
suggestions/preferences for the "whatever-we-call" files once I have
made sure the algorithm works in all the cases.
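
To make sure I'm reading the two steps the same way, here is a minimal
sketch of them. The two suffixes are placeholders for the
"whatever-we-call" names, the function name is hypothetical, and the
sketch leaves the existing -expected.txt alone in step 2 because the
steps don't say what happens to it.

```python
import shutil
from pathlib import Path

# Placeholder suffixes: the thread has not settled on names.
UNEXPECTED_PASS_SUFFIX = "-unexpected-pass.txt"
EXPECTED_FAILURE_SUFFIX = "-expected-failure.txt"


def record_new_result(test_dir, test_name, new_result, confident_no_worse):
    """Apply the two-step rule to a deterministically failing test.

    Returns the list of files that were written.
    """
    test_dir = Path(test_dir)
    expected = test_dir / (test_name + "-expected.txt")

    if confident_no_worse:
        # Step 1: the new result is an improvement or no worse,
        # so it simply becomes the expectation.
        expected.write_text(new_result)
        return [expected]

    # Step 2: keep a copy of the old result and check in the new
    # result as the expected failure.
    old_copy = test_dir / (test_name + UNEXPECTED_PASS_SUFFIX)
    failure = test_dir / (test_name + EXPECTED_FAILURE_SUFFIX)
    shutil.copyfile(expected, old_copy)
    failure.write_text(new_result)
    return [old_copy, failure]
```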

> This replaces all other approaches to marking expected failures, including the Skipped list, overwriting -expected even if you know the result is a regression, marking the test in TestExpectations as Skip, Wontfix, Image, Text, or Text+Image, or any of the other legacy techniques for marking an expected failure result.

This wouldn't replace Skip or Wontfix. Skip is/will be/should be
reserved for tests that, if we run them, cause other tests to fail (or
cause problems for the harness), so checking in a failing result won't
help here.

Tests marked Wontfix will be skipped in order to minimize checking in
baselines we don't care about, avoid other potential instabilities,
and in order to speed up the run by not running tests we don't care
about. The downside here is that we might miss some crash or timeout
bugs, but I think this is an acceptable (and preferable) tradeoff.

Also, we still need some mechanism to address flaky failures.

-- Dirk