[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Wed Aug 15 12:22:02 PDT 2012

Hi all,

As many of you know, we normally treat the -expected files as
"regression" tests rather than "correctness" tests; they are intended
to capture the current behavior of the tree. As such, they
historically have not distinguished between a "correct failure" and an
"incorrect failure".

The chromium port, however, has historically often not checked in
expectations for tests that are currently failing (or even have been
failing for a long time), and instead listed them in the expectations
file. This was primarily motivated by us wanting an easy way to see
all of the tests that were "failing". However, this approach has its
own downsides.

I would like to move the project to a point where all of the ports
are using basically the same workflow/model, one that combines the
best features of each approach [1].

To that end, I propose that we allow tests to have expectations that
end in '-passing' and '-failing' as well as '-expected'.

The meanings for '-passing' and '-failing' should be obvious, and
'-expected' can continue the current meaning of either or both of
"what we expect to happen" and "I don't know if this is correct or
not" :).

A given test will be allowed to only have one of the three potential
results at any one time/revision in a checkout. [2]
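To make the one-of-three rule concrete, here is a rough sketch of how a
lookup might enforce it. This is only an illustration: the function name
find_baseline, the "-<kind>.txt" file naming (modeled on today's
"-expected.txt" baselines), and passing in a set of existing paths are all
assumptions, not how run-webkit-tests is actually structured.

```python
import os

# Suffixes from the proposal. The "-<kind>.txt" naming is an assumption
# modeled on the existing "-expected.txt" text baselines.
BASELINE_KINDS = ("expected", "passing", "failing")


def find_baseline(test_path, existing_files):
    """Return (kind, baseline_path) for a test, or None if it has no baseline.

    existing_files is any container of paths present in the checkout.
    Raises ValueError if more than one kind of baseline exists, since the
    proposal allows only one per test at a given revision.
    """
    root, _ext = os.path.splitext(test_path)
    found = []
    for kind in BASELINE_KINDS:
        candidate = "%s-%s.txt" % (root, kind)
        if candidate in existing_files:
            found.append((kind, candidate))
    if len(found) > 1:
        kinds = ", ".join(kind for kind, _path in found)
        raise ValueError("%s has conflicting baselines: %s" % (test_path, kinds))
    return found[0] if found else None
```

A lint or presubmit check could call something like this for every test and
report the ValueError cases, while the test runner would use the returned
kind to decide how to report a match.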

Because '-expected' will still be supported, this means that ports can
continue to work as they do today and we can try -passing/-failing on
a piecemeal basis to see if it's useful or not.

Ideally we will have some way (via a presubmit hook, or lint checks,
or something) to be able to generate a (checked-in) list (or perhaps
just a dashboard or web page) of all of the currently failing tests
and corresponding bugs from the "-failing" expectations, so that we
can keep one of the advantages that chromium has gotten out of their
TestExpectations files [3].
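As a rough sketch of the simplest version of such a list, a script could walk
the tests directory and collect every test that has a "-failing" baseline.
The function name list_failing_tests and the "-failing.txt" file naming are
assumptions for illustration; associating each entry with a bug is left out
because where that mapping would live is still an open question (see [3]).

```python
import os

# Assumed naming, by analogy with the existing "-expected.txt" baselines.
FAILING_SUFFIX = "-failing.txt"


def list_failing_tests(layout_tests_dir):
    """Return a sorted list of test names that have a -failing baseline.

    Names are relative to layout_tests_dir, use "/" separators, and omit
    the baseline suffix (e.g. "fast/css/a" for "fast/css/a-failing.txt").
    """
    failing = []
    for dirpath, _dirnames, filenames in os.walk(layout_tests_dir):
        for name in filenames:
            if name.endswith(FAILING_SUFFIX):
                stem = name[:-len(FAILING_SUFFIX)]
                rel = os.path.relpath(os.path.join(dirpath, stem),
                                      layout_tests_dir)
                failing.append(rel.replace(os.sep, "/"))
    return sorted(failing)
```

The output could be written to a checked-in file by a presubmit hook or fed
to a dashboard; either way the list is derived from the baselines rather than
maintained by hand.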

I will update all of the tools (run-webkit-tests, garden-o-matic,
flakiness dashboard, etc.) as needed to make managing these things as
easy as possible. [4]

Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.

-- Dirk

Notes:

[1] Both the "check in the failures" and the "suppress the failures"
approaches have advantages and disadvantages:

Advantages for checking in failures:

* you can tell when a test starts failing differently
* the need to distinguish between different types of failures (text
vs. image vs. image+text) in the expectations file drops; the baseline
tells you what to expect
* the TestExpectations file can be much smaller and easier to manage -
the current Chromium file is a massively unruly mess
* the history of a particular test can be found by looking at the repo history

Disadvantages for checking in failures (advantages for just suppressing them):
* given current practice (just using -expected) you can't tell if a
particular -expected file is supposed to be correct or not
* it may create more churn in the checked-in baselines in the repo
* you can't get a list of all of the failing tests associated with a
particular bug as easily

There are probably lots of ways one could attempt to design a solution
to these problems; I believe the approach outlined above is perhaps
the simplest possible and allows for us to try it in small parts of
the test suite (and only on some ports) to see if it's useful or not
without forcing it on everyone.

[2] I considered allowing "-passing" and "-failing" to co-exist, but a
risk is that the "passing" or correct result for a test will change,
and if a test is currently expected to fail, we won't notice when that
port's "passing" result becomes stale. In addition, managing the
transitions across multiple files becomes a lot more
complicated/tricky. There are tradeoffs here, and it's possible some
other logic will work better in practice, but I thought it might be
best to start simple.

[3] I haven't figured out the best way to do this yet, or whether it's
better to keep this list inside the TestExpectations files or in
separate files, or just have a dashboard separate from the repository
completely ...

[4] A rough sketch of the work that needs to be done here:
* update run-webkit-tests to look for all three suffixes, and to
generate new -failing and/or -passing results as well as new
-expected results
* update garden-o-matic so that when you want to rebaseline a file you
can (optionally) indicate whether the new baseline is passing or
failing
* update webkit-patch rebaseline-expectations to (optionally) indicate
if the new baselines are passing or failing
* pass the information from run-webkit-tests to the flakiness
dashboard (via results.json) to indicate whether we matched against
-expected/-passing/-failing so that the dashboard can display the
right baselines for comparison.
* figure out the solution to [3] :).