[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Wed Aug 15 13:36:05 PDT 2012

It seems to me that there are two issues here.  One is Chromium specific about process conformity.  It seems to me that should stay a Chromium issue without making the mechanism more complex for all ports.  The other ports seem to make things work using the existing framework.

The other broader issue is failing tests.  If I understand part of Filip's concern it is a signal to noise issue.  We do not want the pass / fail signal to be lost in the noise of expected failures.  Failing tests should be fixed as appropriate for failing platform(s).  That fixing might involve splitting off or removing a failing sub-test so that the remaining test adds value once again.  Especially "a pass becoming a fail" edge.  For me, a test failing differently provides unknown value as the noise of it being a failing test likely exceeds the benefit of the different failure mode signal.  It takes a non-zero effort to filter that noise and that effort is likely better spent fixing the test.

- Michael

On Aug 15, 2012, at 12:48 PM, Dirk Pranke <dpranke at chromium.org> wrote:

> I've received at least one bit of feedback off-list that it might not
> have been clear what problems I was trying to solve and whether the
> solution would add enough benefit to be worth it. Let me try to
> restate things, in case this helps ...
> 
> The problems I'm attempting to solve are:
> 
> 1) Chromium does things differently from the other WebKit ports, and
> this seems bad :).
> 
> 2) the TestExpectations syntax is too complicated
> 
> 3) When you suppress or skip failures (as chromium does), you can't
> easily tell when test changes behavior (starts failing differently).
> 
> 4) We can't easily tell if a test's -expected output is actually
> believed to be correct or not, which makes figuring out whether to
> rebaseline or not difficult, and makes figuring out if your change
> that is causing a test to "fail" is actually introducing a new
> problem, just failing differently, or even potentially fixing
> something.
> 
> I agree that it's unclear how much churn there will be and how much
> benefit there will be, but we've been talking about these problems for
> years and I think without actually trying *something* we won't know if
> there's a better way to solve this or not, and I personally think it's
> worth trying. The solution I've described is the least intrusive
> mechanism we can try that I've yet come up with.
> 
> -- Dirk
> 
> On Wed, Aug 15, 2012 at 12:22 PM, Dirk Pranke <dpranke at chromium.org> wrote:
>> Hi all,
>> 
>> As many of you know, we normally treat the -expected files as
>> "regression" tests rather than "correctness" tests; they are intended
>> to capture the current behavior of the tree. As such, they
>> historically have not distinguished between a "correct failure" and an
>> "incorrect failure".
>> 
>> The chromium port, however, has historically often not checked in
>> expectations for tests that are currently failing (or even have been
>> failing for a long time), and instead listed them in the expectations
>> file. This was primarily motivated by us wanting to know easily all of
>> the tests that were "failing". However, this approach has its own
>> downsides.
>> 
>> I would like to move the project to a point where all of the ports
>> were basically using the same workflow/model, and combine the best
>> features of each approach [1].
>> 
>> To that end, I propose that we allow tests to have expectations that
>> end in '-passing' and '-failing' as well as '-expected'.
>> 
>> The meanings for '-passing' and '-failing' should be obvious, and
>> '-expected' can continue the current meaning of either or both of
>> "what we expect to happen" and "I don't know if this is correct or
>> not" :).
>> 
>> A given test will be allowed to only have one of the three potential
>> results at any one time/revision in a checkout. [2]
>> 
>> Because '-expected' will still be supported, this means that ports can
>> continue to work as they do today and we can try -passing/-failing on
>> a piecemeal basis to see if it's useful or not.
>> 
>> Ideally we will have some way (via a presubmit hook, or lint checks,
>> or something) to be able to generate a (checked-in) list (or perhaps
>> just a dashboard or web page) of all of the currently failing tests
>> and corresponding bugs from the "-failing" expectations, so that we
>> can keep one of the advantages that chromium has gotten out of their
>> TestExpectations files [3].
>> 
>> I will update all of the tools (run-webkit-tests, garden-o-matic,
>> flakiness dashboard, etc.) as needed to make managing these things as
>> easy as possible. [4]
>> 
>> Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.
>> 
>> -- Dirk
>> 
>> Notes:
>> 
>> [1] Both the "check in the failures" and the "suppress the failures"
>> approaches have advantages and disadvantages:
>> 
>> Both approaches have their advantages and disadvantages:
>> 
>> Advantages for checking in failures:
>> 
>> * you can tell when a test starts failing differently
>> * the need to distinguish between different types of failures (text
>> vs. image vs. image+text) in the expectations file drops; the baseline
>> tells you what to expect
>> * the TestExpectations file can be much smaller and easier to manage -
>> the current Chromium file is a massively unruly mess
>> * the history of a particular test can be found by looking at the repo history
>> 
>> Disadvantages for checking in failures (advantages for just suppressing them):
>> * given current practice (just using -expected) you can't tell if a
>> particular -expected file is supposed to be be correct or not
>> * it may create more churn in the checked-in baselines in the repo
>> * you can't get a list of all of the failing tests associated with a
>> particular bug as easily
>> 
>> There are probably lots of ways one could attempt to design a solution
>> to these problems; I believe the approach outlined above is perhaps
>> the simplest possible and allows for us to try it in small parts of
>> the test suite (and only on some ports) to see if it's useful or not
>> without forcing it on everyone.
>> 
>> [2] I considered allowing "-passing" and "-failing" to co-exist, but a
>> risk is that the "passing" or correct result for a test will change,
>> and if a test is currently expected to fail, we won't notice that that
>> port's "passing" result becomes stale. In addition, managing the
>> transitions across multiple files becomes a lot more
>> complicated/tricky. There are tradeoffs here, and it's possible some
>> other logic will work better in practice, but I thought it might be
>> best to start simple.
>> 
>> [3] I haven't figured out the best way to do this yet, or whether it's
>> better to keep this list inside the TestExpectations files or in
>> separate files, or just have a dashboard separate from the repository
>> completely ...
>> 
>> [4] rough sketch of the work needed to be done here:
>> * update run-webkit-tests to look for all three suffixes, and to
>> generate new -failing -failing and/or -passing results as well as new
>> -expected results
>> * update garden-o-matic so that when you want to rebaseline a file you
>> can (optionally) indicate whether the new baseline is passing or
>> failing
>> * update webkit-patch rebaseline-expectations to (optionally) indicate
>> if the new baselines are passing or failing
>> * pass the information from run-webkit-tests to the flakiness
>> dashboard (via results.json) to indicate whether we matched against
>> -expected/-passing/-failing so that the dashboard can display the
>> right baselines for comparison.
>> * figure out the solution to [3] :).
> _______________________________________________
> webkit-dev mailing list
> webkit-dev at lists.webkit.org
> http://lists.webkit.org/mailman/listinfo/webkit-dev