[webkit-dev] A proposal for handling "failing" layout tests and TestExpectations

Wed Aug 15 15:02:19 PDT 2012

On Aug 15, 2012, at 2:16 PM, Maciej Stachowiak <mjs at apple.com> wrote:

> 
> On Aug 15, 2012, at 12:27 PM, Filip Pizlo <fpizlo at apple.com> wrote:
> 
>> This sounds like it's adding even more complication to an already complicated system.
>> 
>> Given how many tests we currently have, I also don't buy that continuing to run a test that is already known to fail provides much benefit.  If the test covers two things and one fails while the other succeeds (example #1: it prints correct results but doesn't crash; example #2: it literally has tests for two distinct unrelated features, and one feature is broken) then the correct thing to do is to split up the test.
> 
> The case where this happens isn't so much testing unrelated features. It's more that a single test can face failures of multiple incidental aspects. Let me give an example. Let's say you have a test that's intended to verify JavaScript multiplication. Let's say it starts failing because of a regression in function call behavior. There may not be a clean way to split these two aspects. You may have to rewrite the multiplication test to work around this hypothetical function call bug. It may be nontrivial to figure out how to do that.
> 
> So I agree in principle that splitting orthogonal tests is good, but it's not always a good solution to avoiding chained regressions.
> 
> Note also that having more test files makes the test rights slower.
> 
>> 
>> So, I'd rather not continue to institutionalize this notion that we should have loads of incomprehensible machinery to reason about tests that have already given us all the information they were meant to give (i.e. they failed, end of story).
> 
> I think another aspect of your reasoning doesn't entirely fit with the history of WebKit regression testing. Our tests were not originally designed to test against some abstract notion of "correct" behavior. They were designed to detect behavior changes. It's definitely not the case that, once behavior has changed once, there's no value to detecting when it changes again. Some of these behavior changes might be good and should form a new baseline. Others might be bad but still should not inhibit detecting other testing while they are investigated.

Sorry, I was rather unclear.  I'm arguing against marking a test as expected-to-fail-but-not-crash.  I have no broad problem with rebasing the test, except if you're using -expected.* files to indicate that the test is in this expected-to-fail-but-not-crash state.  There is potentially a significant difference between tests that cover what is displayed and the loads of tests that cover discrete functionality (DOM APIs, JS behavior, etc).  Rebasing has very little benefit in the latter kinds of tests.

I'm also arguing against having yet another mechanism for indicating that a test is borked.

Finally, I think any strategy we use must take into account the fact that we currently have a *huge* number of tests.  While there may be examples of failing tests that still succeed in giving us meaningful coverage, I think there are far more examples of tests that provide no benefit, but require an evil troika of (1) CPU time to run, (2) network/disk time to check out and update, and (3) person time to rebase.  So, at this point, having sophisticated mechanisms for dealing with expected-to-fail-but-not-crash tests beyond filing a bug and skipping the test is likely to just be noise.

> 
> 
> (I should note that I don't yet have an opinion on the new proposal; I have not read it. But I am skeptical that splitting tests is a practical solution.)
> 
> 
> Regards,
> Maciej
> 
> 
>> 
>> -F
>> 
>> 
>> On Aug 15, 2012, at 12:22 PM, Dirk Pranke <dpranke at chromium.org> wrote:
>> 
>>> Hi all,
>>> 
>>> As many of you know, we normally treat the -expected files as
>>> "regression" tests rather than "correctness" tests; they are intended
>>> to capture the current behavior of the tree. As such, they
>>> historically have not distinguished between a "correct failure" and an
>>> "incorrect failure".
>>> 
>>> The chromium port, however, has historically often not checked in
>>> expectations for tests that are currently failing (or even have been
>>> failing for a long time), and instead listed them in the expectations
>>> file. This was primarily motivated by us wanting to know easily all of
>>> the tests that were "failing". However, this approach has its own
>>> downsides.
>>> 
>>> I would like to move the project to a point where all of the ports
>>> were basically using the same workflow/model, and combine the best
>>> features of each approach [1].
>>> 
>>> To that end, I propose that we allow tests to have expectations that
>>> end in '-passing' and '-failing' as well as '-expected'.
>>> 
>>> The meanings for '-passing' and '-failing' should be obvious, and
>>> '-expected' can continue the current meaning of either or both of
>>> "what we expect to happen" and "I don't know if this is correct or
>>> not" :).
>>> 
>>> A given test will be allowed to only have one of the three potential
>>> results at any one time/revision in a checkout. [2]
>>> 
>>> Because '-expected' will still be supported, this means that ports can
>>> continue to work as they do today and we can try -passing/-failing on
>>> a piecemeal basis to see if it's useful or not.
>>> 
>>> Ideally we will have some way (via a presubmit hook, or lint checks,
>>> or something) to be able to generate a (checked-in) list (or perhaps
>>> just a dashboard or web page) of all of the currently failing tests
>>> and corresponding bugs from the "-failing" expectations, so that we
>>> can keep one of the advantages that chromium has gotten out of their
>>> TestExpectations files [3].
>>> 
>>> I will update all of the tools (run-webkit-tests, garden-o-matic,
>>> flakiness dashboard, etc.) as needed to make managing these things as
>>> easy as possible. [4]
>>> 
>>> Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.
>>> 
>>> -- Dirk
>>> 
>>> Notes:
>>> 
>>> [1] Both the "check in the failures" and the "suppress the failures"
>>> approaches have advantages and disadvantages:
>>> 
>>> Both approaches have their advantages and disadvantages:
>>> 
>>> Advantages for checking in failures:
>>> 
>>> * you can tell when a test starts failing differently
>>> * the need to distinguish between different types of failures (text
>>> vs. image vs. image+text) in the expectations file drops; the baseline
>>> tells you what to expect
>>> * the TestExpectations file can be much smaller and easier to manage -
>>> the current Chromium file is a massively unruly mess
>>> * the history of a particular test can be found by looking at the repo history
>>> 
>>> Disadvantages for checking in failures (advantages for just suppressing them):
>>> * given current practice (just using -expected) you can't tell if a
>>> particular -expected file is supposed to be be correct or not
>>> * it may create more churn in the checked-in baselines in the repo
>>> * you can't get a list of all of the failing tests associated with a
>>> particular bug as easily
>>> 
>>> There are probably lots of ways one could attempt to design a solution
>>> to these problems; I believe the approach outlined above is perhaps
>>> the simplest possible and allows for us to try it in small parts of
>>> the test suite (and only on some ports) to see if it's useful or not
>>> without forcing it on everyone.
>>> 
>>> [2] I considered allowing "-passing" and "-failing" to co-exist, but a
>>> risk is that the "passing" or correct result for a test will change,
>>> and if a test is currently expected to fail, we won't notice that that
>>> port's "passing" result becomes stale. In addition, managing the
>>> transitions across multiple files becomes a lot more
>>> complicated/tricky. There are tradeoffs here, and it's possible some
>>> other logic will work better in practice, but I thought it might be
>>> best to start simple.
>>> 
>>> [3] I haven't figured out the best way to do this yet, or whether it's
>>> better to keep this list inside the TestExpectations files or in
>>> separate files, or just have a dashboard separate from the repository
>>> completely ...
>>> 
>>> [4] rough sketch of the work needed to be done here:
>>> * update run-webkit-tests to look for all three suffixes, and to
>>> generate new -failing -failing and/or -passing results as well as new
>>> -expected results
>>> * update garden-o-matic so that when you want to rebaseline a file you
>>> can (optionally) indicate whether the new baseline is passing or
>>> failing
>>> * update webkit-patch rebaseline-expectations to (optionally) indicate
>>> if the new baselines are passing or failing
>>> * pass the information from run-webkit-tests to the flakiness
>>> dashboard (via results.json) to indicate whether we matched against
>>> -expected/-passing/-failing so that the dashboard can display the
>>> right baselines for comparison.
>>> * figure out the solution to [3] :).
>>> _______________________________________________
>>> webkit-dev mailing list
>>> webkit-dev at lists.webkit.org
>>> http://lists.webkit.org/mailman/listinfo/webkit-dev
>> 
>> _______________________________________________
>> webkit-dev mailing list
>> webkit-dev at lists.webkit.org
>> http://lists.webkit.org/mailman/listinfo/webkit-dev
>