[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Dirk Pranke dpranke at chromium.org
Wed Aug 15 12:48:58 PDT 2012


I've received at least one bit of feedback off-list that it might not
have been clear what problems I was trying to solve and whether the
solution would add enough benefit to be worth it. Let me try to
restate things, in case this helps ...

The problems I'm attempting to solve are:

1) Chromium does things differently from the other WebKit ports, and
this seems bad :).

2) The TestExpectations syntax is too complicated.

3) When you suppress or skip failures (as chromium does), you can't
easily tell when a test changes behavior (starts failing differently).

4) We can't easily tell whether a test's -expected output is actually
believed to be correct. That makes it hard to decide whether to
rebaseline, and hard to tell whether a change that causes a test to
"fail" is introducing a new problem, just failing differently, or
even potentially fixing something.

I agree that it's unclear how much churn there will be and how much
benefit there will be, but we've been talking about these problems for
years, and without actually trying *something* we won't know whether
there's a better way to solve them. I personally think it's worth
trying, and the solution I've described is the least intrusive
mechanism I've yet come up with.

-- Dirk

On Wed, Aug 15, 2012 at 12:22 PM, Dirk Pranke <dpranke at chromium.org> wrote:
> Hi all,
>
> As many of you know, we normally treat the -expected files as
> "regression" tests rather than "correctness" tests; they are intended
> to capture the current behavior of the tree. As such, they
> historically have not distinguished between a "correct failure" and an
> "incorrect failure".
>
> The chromium port, however, has often not checked in -expected
> results for tests that are failing (or even have been failing for a
> long time), instead listing them in the TestExpectations file. This
> was primarily motivated by wanting an easy way to see all of the
> tests that were "failing". However, this approach has its own
> downsides.
>
> I would like to move the project to a point where all of the ports
> are basically using the same workflow/model, combining the best
> features of each approach [1].
>
> To that end, I propose that we allow tests to have expectations that
> end in '-passing' and '-failing' as well as '-expected'.
>
> The meanings for '-passing' and '-failing' should be obvious, and
> '-expected' can continue the current meaning of either or both of
> "what we expect to happen" and "I don't know if this is correct or
> not" :).
>
> A given test will be allowed to have only one of the three potential
> results at any one time/revision in a checkout. [2]
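>
> To make this concrete, here's a hypothetical sketch (in Python, not
> the actual webkitpy code) of the lookup this rule implies: for a
> given test, at most one of the three suffixes may have a checked-in
> baseline, and the harness can treat multiple baselines as an error.

```python
# Hypothetical sketch -- not the real webkitpy implementation.
# Finds the single checked-in baseline for a test and enforces the
# "only one of -passing/-failing/-expected per revision" rule.
import os

SUFFIXES = ('-passing', '-failing', '-expected')

def find_baseline(test_path):
    """Return (suffix, baseline_path), or (None, None) if no baseline.

    Raises ValueError if more than one suffix has a checked-in file,
    since a test may only have one result flavor at a time.
    """
    root, _ = os.path.splitext(test_path)  # e.g. 'fast/css/foo'
    found = [(s, root + s + '.txt') for s in SUFFIXES
             if os.path.exists(root + s + '.txt')]
    if len(found) > 1:
        raise ValueError('%s has multiple baselines: %s' %
                         (test_path, sorted(p for _, p in found)))
    return found[0] if found else (None, None)
```

> (Only .txt baselines are shown; real baselines can also be .png,
> .wav, etc.)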
>
> Because '-expected' will still be supported, this means that ports can
> continue to work as they do today and we can try -passing/-failing on
> a piecemeal basis to see if it's useful or not.
>
> Ideally we will have some way (via a presubmit hook, or lint checks,
> or something) to be able to generate a (checked-in) list (or perhaps
> just a dashboard or web page) of all of the currently failing tests
> and corresponding bugs from the "-failing" expectations, so that we
> can keep one of the advantages that chromium has gotten out of their
> TestExpectations files [3].
>
> I will update all of the tools (run-webkit-tests, garden-o-matic,
> flakiness dashboard, etc.) as needed to make managing these things as
> easy as possible. [4]
>
> Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.
>
> -- Dirk
>
> Notes:
>
> [1] Both the "check in the failures" and the "suppress the failures"
> approaches have advantages and disadvantages:
>
> Advantages for checking in failures:
>
> * you can tell when a test starts failing differently
> * the need to distinguish between different types of failures (text
> vs. image vs. image+text) in the expectations file drops; the baseline
> tells you what to expect
> * the TestExpectations file can be much smaller and easier to manage -
> the current Chromium file is a massively unruly mess
> * the history of a particular test can be found by looking at the repo history
>
> Disadvantages for checking in failures (advantages for just suppressing them):
> * given current practice (just using -expected) you can't tell if a
> particular -expected file is supposed to be correct or not
> * it may create more churn in the checked-in baselines in the repo
> * you can't get a list of all of the failing tests associated with a
> particular bug as easily
>
> There are probably lots of ways one could attempt to design a solution
> to these problems; I believe the approach outlined above is perhaps
> the simplest possible and allows for us to try it in small parts of
> the test suite (and only on some ports) to see if it's useful or not
> without forcing it on everyone.
>
> [2] I considered allowing "-passing" and "-failing" to co-exist, but a
> risk is that the "passing" or correct result for a test will change,
> and if a test is currently expected to fail, we won't notice when
> that port's "passing" result becomes stale. In addition, managing the
> transitions across multiple files becomes a lot more
> complicated/tricky. There are tradeoffs here, and it's possible some
> other logic will work better in practice, but I thought it might be
> best to start simple.
>
> [3] I haven't figured out the best way to do this yet, or whether it's
> better to keep this list inside the TestExpectations files or in
> separate files, or just have a dashboard separate from the repository
> completely ...
>
> [4] rough sketch of the work needed to be done here:
> * update run-webkit-tests to look for all three suffixes, and to
> generate new -failing and/or -passing results as well as new
> -expected results
> * update garden-o-matic so that when you want to rebaseline a file you
> can (optionally) indicate whether the new baseline is passing or
> failing
> * update webkit-patch rebaseline-expectations to (optionally) indicate
> if the new baselines are passing or failing
> * pass the information from run-webkit-tests to the flakiness
> dashboard (via results.json) to indicate whether we matched against
> -expected/-passing/-failing so that the dashboard can display the
> right baselines for comparison.
> * figure out the solution to [3] :).
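>
> As a rough illustration of the results.json piece above, here's a
> hypothetical sketch (again Python, not the real webkitpy code) of
> recording which baseline flavor a test's actual output matched:

```python
# Hypothetical sketch -- not the real webkitpy implementation.
# Records which baseline flavor (-passing/-failing/-expected) a
# test's actual output matched, so the flakiness dashboard can pick
# the right baseline to display for comparison.

def record_result(results, test_name, actual_text, baselines):
    """baselines maps a suffix ('-passing', '-failing', or
    '-expected') to its checked-in baseline text; per the proposal a
    test has at most one entry."""
    matched = None
    for suffix, baseline_text in baselines.items():
        if actual_text == baseline_text:
            matched = suffix
    results[test_name] = {
        'matched_suffix': matched,     # None => output matched nothing
        'is_regression': matched is None,
    }
    return results
```

> The per-test 'matched_suffix' field would then travel to the
> dashboard inside results.json.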

