[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Dirk Pranke dpranke at chromium.org
Thu Aug 16 15:04:13 PDT 2012

On Thu, Aug 16, 2012 at 2:23 PM, Ryosuke Niwa <rniwa at webkit.org> wrote:
> On Thu, Aug 16, 2012 at 2:05 PM, Dirk Pranke <dpranke at chromium.org> wrote:
>> I think your observations are correct, but at least my experience as a
>> gardener/sheriff leads me to a different conclusion. Namely, when I'm
>> looking at a newly failing test, it is difficult if not impossible for
>> me to know if the existing baseline was previously believed to be
>> correct or not, and thus it's hard for me to tell if the new baseline
>> should be considered worse, better, or different.
> How does the proposal solve this problem?

It doesn't solve the problem per se; sorry, I didn't mean to imply
that it did. What my proposal does do is give the sheriff more
information with which to decide what to do, namely:

1) the test was passing, but now it isn't. Is that okay? Do I need to
rebaseline, or should we revert?
2) the test was failing; now it's either failing differently, or it's
maybe passing ...
3) the test was -expected (status quo).

Granted, in all three cases the sheriff has to make a judgement call,
but my theory/claim is that this is easier than just (3).
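To make the three cases concrete, here is a minimal sketch of the triage hint a tool could surface for each baseline kind. The function name and the returned strings are purely illustrative, not actual webkitpy API:

```python
# Hypothetical triage helper, assuming baselines are suffixed
# "passing", "failing", or "expected" as in the proposal.

def classify(baseline_kind, new_result_matches_baseline):
    """Return a hint for the sheriff based on the checked-in baseline kind."""
    if baseline_kind == "passing":
        if new_result_matches_baseline:
            return "still passing; nothing to do"
        # Case (1): was passing, now isn't.
        return "regression? rebaseline or revert"
    if baseline_kind == "failing":
        if new_result_matches_baseline:
            return "still failing the same way"
        # Case (2): failing differently, or maybe now passing.
        return "failing differently, or maybe now passing"
    # Case (3): -expected carries no belief about correctness (status quo).
    return "judgement call: compare old and new output by hand"
```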

In addition, we gain an entirely new feature (for the non-Chromium
ports): the ability to tell whether a test is believed to be failing or
passing.

> The more options we have, the more likely people make mistakes.

I don't think this is necessarily true; it's possible that having more
options will actually make it easier to do the right thing.

>> This is why I want to test this theory :). It seems like if we got
>> experience with this on one (or more) ports for a couple of months we
>> would have a much more well-informed opinion, and I'm not seeing a
>> huge downside to at least trying this idea out.
> Sure. But if we're doing this experiment on trunk, thereby imposing
> significant cognitive load on every other port

How does this impose a cognitive load (of any size) on every other
port? If the files are only checked in to platform/X, every other
platform doesn't even need to know about them ...

> setting both an exact date at which we decide whether this approach is good
> or not and criteria by which we decide this before the experiment starts.

I like the theory of this but I think determining the exact criteria
will be difficult if not impossible. I'm not sure how we would
evaluate this other than "contributors are happier" (or not).

> Like Filip, I'm extremely concerned about the prospect of us introducing
> yet-another-way-of-doing-things, and not be able to get rid of it later.

Presumably the only way we'd be not able to get rid of it would be if
some port actually liked it, in which case, why would we get rid of it?

On Thu, Aug 16, 2012 at 2:32 PM, Filip Pizlo <fpizlo at apple.com> wrote:
> In what way do things become better?  Because other people will see what the sheriff believed about the result?

Yes, that's the theory; see above.

> Can you articulate some more about what happens when you have both -expected and -failing?
> My specific concern is that after someone checks in a fix, we will have some sheriff accidentally misjudge the change in behavior to be a regression, and check in a -failing file.  And then we end up in a world of confusion.

Ah! This is important ... you *can't have this scenario*. The tools
will prohibit you (via style checks or something) from having more
than one kind of result checked in. I absolutely agree that this would
be confusing, which is why I don't want to allow it (I originally
started out thinking we'd support both -correct and -failing, and
realized it made life horrible.)
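A hedged sketch of the kind of check described above: a presubmit/style check that flags any test that would end up with more than one kind of checked-in result. The file layout, suffixes, and function name here are assumptions for illustration, not the actual check-webkit-style code:

```python
import os
from collections import defaultdict

# Result suffixes are taken from the proposal; only one kind per test
# is allowed to be checked in at a time.
RESULT_SUFFIXES = ("-expected", "-failing")

def conflicting_baselines(paths):
    """Group baseline files by test name and return the tests that have
    more than one kind of result (e.g. both -expected and -failing)."""
    kinds = defaultdict(set)
    for path in paths:
        name, _ext = os.path.splitext(os.path.basename(path))
        for suffix in RESULT_SUFFIXES:
            if name.endswith(suffix):
                test = name[: -len(suffix)]
                kinds[test].add(suffix)
    return [test for test, seen in kinds.items() if len(seen) > 1]
```

A check like this would run before commit and reject the patch whenever the returned list is non-empty, which is what prevents the confusing both-kinds-checked-in state.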

(I'm going to skip the flaky test discussion ... I think it's
tangential and it's clearly a debatable topic in its own right; I only
brought it up because it still shows a purpose for TestExpectations.)

> 2) The WONTFIX mode in TestExpectations feels to me more like a statement that you're just trying to see if the test doesn't crash.  Correct?  Either way, it's confusing.

No, sorry, in the new world (new TestExpectations syntax) WONTFIX
files will be skipped automatically. They are like SKIP except that
they carry the connotation that there are no plans to unskip them.
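For illustration, the distinction in the new syntax would look something like the lines below (the bug URLs and test paths are made up, and the exact keyword spelling is an assumption about the new format):

```
webkit.org/b/12345 fast/example/skipped-for-now.html [ Skip ]
webkit.org/b/67890 fast/example/never-going-to-fix.html [ WontFix ]
```

Both entries cause the test not to run; only the WontFix entry signals that nobody intends to unskip it.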

> 3) Your new mechanism feels like it's already covered by our existing use of -expected files.  I'm not quite convinced that having -failing in addition to -expected files would be all that helpful.

Fair enough, I realize it's not an open-and-shut case. I'm not 100%
convinced it'll be useful myself, but I think it's worth trying.

> (3) concerns me the most, and it concerns me particularly because we're still not giving good answers to (1) or (2).

You lost me here. I plan to do (1) and (2) regardless of whether we
start using '-passing/-failing' or not. How is (3) related to (1) and
(2)?

-- Dirk