[webkit-dev] Reconsidering test expectations (PASS, TEXT, IMAGE, TEXT+IMAGE, TIMEOUT, CRASH, etc...)

Fri May 18 12:55:13 PDT 2012

On Fri, May 18, 2012 at 12:43 AM, Maciej Stachowiak <mjs at apple.com> wrote:
> I guess we do. I think there is no point to saying PASS, because if a test
> crashes, hangs or is skipped it's meaningless. And if it does not crash and
> none of those other things apply, then it shouldn't be listed. But I could
> be missing something.

I'm not sure what you mean by "if a test crashes, hangs, or is skipped
it's meaningless"? Do you mean that we can't conclude whether it was
text/image/whatever? This is certainly true.

Although we don't normally specify passing tests, obviously, there are
two cases where it's useful.

The first is used in conjunction with a line applying to a whole
directory, in order to specify suppressions for all tests except a
few. For example:

fast/html = FAIL
fast/html/article-element.html = PASS

The second is to indicate that a test is flaky, e.g. PASS TIMEOUT to
indicate that a test may either run fine or time out.
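
To illustrate how these two cases could fit together, here is a
minimal Python sketch. It is not NRWT's actual parser or data model:
the dictionary layout, the function name expected_outcomes, and the
path fast/js/flaky-test.html are all made up for illustration. The
idea is just that the most specific matching entry wins, and that an
entry can hold more than one acceptable outcome.

```python
# Hypothetical sketch only; not how NRWT actually stores expectations.
EXPECTATIONS = {
    "fast/html": {"FAIL"},                           # whole directory
    "fast/html/article-element.html": {"PASS"},      # exception to the above
    "fast/js/flaky-test.html": {"PASS", "TIMEOUT"},  # made-up flaky test
}


def expected_outcomes(test_path, expectations=EXPECTATIONS):
    """Return the outcome set of the most specific matching entry."""
    best = None
    for prefix in expectations:
        # An entry matches the test itself or any test below that directory.
        if test_path == prefix or test_path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    # Assumption: a test with no entry at all is expected to pass.
    return expectations[best] if best is not None else {"PASS"}
```

With these entries, fast/html/article-element.html resolves to PASS
while every other test under fast/html resolves to FAIL.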

> I think when there are regressions which do not cause a crash or hang, we
> should be checking in a new expectation, so further regressions to either
> text or image would be detected. More detail to come in an upcoming more
> comprehensive proposal.

How long should a test be allowed to produce output different from the
checked-in expectation? I think the answer should probably be at least
as long as the cycle time for the bots (so, upwards of an hour or
two), and it's reasonable to give developers a chance to triage
failures (which probably doesn't need to take longer than the cycle
time; after that, the change should be either reverted or new
expectations checked in, right?)

During that time, should the bots be red? If you don't want the bots
to be red, you add an entry to the expectations file, right? (I assume
the answer is yes here ...)

Are you saying that you don't care to detect changes in the output
between that point and the time the change is either reverted or new
expectations are checked in? Or that you don't care if a TEXT failure
becomes a TEXT+IMAGE failure, but other changes (e.g., IMAGE -> TEXT)
may be more interesting?

>> If one of the text tests or the image tests will fail but maybe not both,
>> that means the test is nondeterministic, so it should be marked as flaky and
>> its results should not affect greenness of the bots, so long as it does not
>> hang or crash. It doesn't seem like we currently have a FLAKY result
>> expectation based on the bots, you are supposed to indicate it by listing
>> all possible kinds of failures, but that seems unhelpful. Also, a flaky test
>> that sometimes entirely passes on multiple runs in a row will turn the bots
>> red, which seems bad. Let's just have FLAKY state instead where we don't get
>> upset whether the test passes or fails.
>

Just to rephrase this to make sure we're on the same page ... there
are two kinds of flakiness. The first is intra-run: a test which fails
but when immediately retried by NRWT, passes. Such flakiness does not
cause the bot to turn red (but will be reported as "unexpected
flakiness"). Note that CRASHes are never retried. The second is
inter-run: a test may pass most runs, but sometimes fail. If you mark
the test as PASS IMAGE (or whatever), it will not turn the bot red. I
believe this is desirable to you, right?
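
In case it helps, the behavior I'm describing could be sketched
roughly like this in Python. This is a simplification under my own
assumptions, not the actual NRWT code: the function name classify,
the retry callable, and the status strings are hypothetical, and I'm
assuming that only a retry that fully passes counts as flaky.

```python
def classify(first_result, retry, expected):
    """Return (status, bot_turns_red) for one test.

    first_result: outcome of the first run, e.g. "PASS", "IMAGE", "CRASH"
    retry: zero-argument callable that reruns the test, returns its outcome
    expected: set of outcomes listed for the test (inter-run flakiness is
              expressed by listing more than one, e.g. {"PASS", "IMAGE"})
    """
    if first_result in expected:
        return ("expected", False)
    if first_result == "CRASH":
        # Crashes are never retried.
        return ("unexpected failure", True)
    if retry() == "PASS":
        # Intra-run flakiness: reported, but the bot stays green.
        return ("unexpected flakiness", False)
    return ("unexpected failure", True)
```

So a test marked PASS IMAGE that produces an IMAGE failure is
"expected", while the same test producing a TEXT failure twice in a
row is an unexpected failure and turns the bot red.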

You then say that "listing all possible kinds of failures ... seems
unhelpful". Why? Surely it is interesting if a test that previously
had intermittent pixel failures starts having text failures as well
(or worse, crashes or times out as rniwa points out)? Wouldn't you
want that to turn the bot red?

> Tests that could randomly crash or time out should probably not be run until fixed.

Are you saying that marking a test as "PASS CRASH" should
automatically skip the test (much as WONTFIX would be equivalent to
SKIP?) Or are you suggesting that people should mark the test as SKIP
instead?

On a related note, would you say that a test that deterministically
(i.e., reliably) crashes should also be skipped?

> Tests that randomly fail in one of several ways, I am not sure it is super useful to list what files might be affected but nothing else.

I'm not sure I'm parsing this properly. By "what files might be
affected", do you mean indicating TEXT/IMAGE/both? As noted above, I
think some changes in behavior are interesting ...

> If a test gets one of N results, bugzilla is a fine way to document that in full detail.

We currently display the different expected outcomes in the flakiness
dashboard. If we were to move this information to a bug and not expose
some way of retrieving that information, that would be a step
backwards, I think. Perhaps we could use bug keywords for this?

-- Dirk