[webkit-dev] Does NRWT let you indicate that a test should fail with a particular failure diff?

Dirk Pranke dpranke at chromium.org
Tue Jul 5 12:29:09 PDT 2011


On Sun, Jul 3, 2011 at 10:07 PM, Hao Zheng <zhenghao at chromium.org> wrote:
>> There are at least two reasons for divergence: one is that the port is
>> actually doing the wrong thing, and the other is that the port is
>> doing the "right" thing but the output is different anyway (e.g., a
>> control is rendered differently). We cannot easily separate the two if
>> we have only a single convention (platform-specific -expected files),
>> but SKIPPING tests seems wrong for either category.
>
> Yes. I think separating the two categories is important. But we can do
> it without -failing files.
> 1. the port is doing the "right" thing but the output is different anyway.
>    We can 'rebaseline' these tests. ('rebaseline' means check in the
> -expected files)
> 2. the port is actually doing the wrong thing
>    We should NOT 'rebaseline' them. Instead, we should add them to
> test_expectations.txt with a bug number. We can easily track all the
> failures we have at a given time just by looking at
> test_expectations.txt, and open the related bug if we want a more
> detailed description.
>
> Both things can be done under the current test framework. Adding -failing
> files will make the huge layout test effort even more complicated.
> Anyway, we only want to know which tests are failing, but not to what
> extent they fail. If we want to know that, it means our tests are
> not reduced to the proper scale. Of course there are many 'big' tests,
> like acid tests, but I think the potential problems covered by these
> tests can also be covered by other small tests; if that's not the
> case, we just need to add some more small tests. So IMO -failing files
> are not necessary.
>
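For concreteness, the kind of test_expectations.txt entry being described
above looked roughly like this in the Chromium port at the time (the syntax
is approximate, and the bug numbers and test paths are made up for
illustration):

    BUGCR12345 WIN LINUX : fast/forms/some-control-test.html = TEXT
    BUGWK67890 MAC : editing/selection/some-selection-test.html = IMAGE+TEXT
    BUGCR23456 : plugins/some-crashing-test.html = CRASH

Each entry ties a known failure to a bug and to the kind of output that is
expected to differ, but it records nothing about what the failing output
actually looks like.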

The problem with your idea is, I think, what brought this idea up in the
first place: if you just track that the test is failing using the
test_expectations.txt file, but don't track *how* it is failing (by
using something like the -failing.txt idea, or a new -expected.txt
file), then you cannot tell when the failing output changes, and you
may miss significant regressions.
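
To make the distinction concrete, here is a minimal sketch (this is not
NRWT's actual code, and the function names are invented):

    # With only a test_expectations.txt entry, all we know is "this test
    # is expected to fail"; any failing output, old or new, looks the same.
    def result_with_expectations_only(actual_output, expected_to_fail):
        return "expected failure" if expected_to_fail else "unexpected result"

    # With a checked-in -failing.txt (or a port-specific -expected.txt),
    # the exact bad output is pinned, so a changed failure gets flagged.
    def result_with_failing_baseline(actual_output, failing_baseline):
        if actual_output == failing_baseline:
            return "expected failure (unchanged)"
        return "failure output changed -- possible new regression"

The second check is what lets the bots distinguish "still failing the same
way" from "failing in a new way".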

-- Dirk

>> It seems like -failing gives you the control you would want, no?
>> Obviously, it wouldn't help the thousands of -expected files that are
>> "wrong" but at least it could keep things from getting worse.
>>
>
> How to correct the thousands of wrong files is really a big problem...
>
> On Sat, Jul 2, 2011 at 6:37 AM, Dirk Pranke <dpranke at chromium.org> wrote:
>> On Fri, Jul 1, 2011 at 3:24 PM, Darin Fisher <darin at chromium.org> wrote:
>>> On Fri, Jul 1, 2011 at 3:04 PM, Darin Adler <darin at apple.com> wrote:
>>>>
>>>> On Jul 1, 2011, at 2:54 PM, Dirk Pranke wrote:
>>>>
>>>> > Does that apply to -expected.txt files in the base directories, or just
>>>> > platform-specific exceptions?
>>>>
>>>> Base directories.
>>>>
>>>> Expected files contain output reflecting the behavior of WebKit at the
>>>> time the test was checked in. The expected result when we re-run a test.
>>>> Many expected files contain text that says “FAIL” in them. The fact that
>>>> these expected results are not successes, but rather expected failures does
>>>> not seem to me to be a subtle point, but one of the basic things about how
>>>> these tests are set up.
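
For anyone who has not looked inside one of these files, a script-test
-expected.txt of the kind being described looks something like this (the
test content here is invented for illustration):

    This tests parsing of the transform property.

    PASS parseTransform("translate(10px, 20px)") is "translate(10px, 20px)"
    FAIL parseTransform("rotate(0.5turn)") should be "rotate(0.5turn)". Was "".
    PASS successfullyParsed is true

    TEST COMPLETE

The FAIL line is part of the checked-in expectation; re-running the test is
expected to reproduce it exactly.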
>>>
>>> Right, it helps us keep track of where we are, so that we don't regress, and
>>> only make forward progress.
>>>
>>>>
>>>> > I wonder how it is that I've been working (admittedly, mostly on
>>>> > tooling) in WebKit for more than two years and this is the first I'm hearing
>>>> > about this.
>>>>
>>>> I’m guessing it’s because you have been working on Chrome.
>>>>
>>>> The Chrome project came up with a different system for testing layered on
>>>> top of the original layout test machinery based on different concepts. I
>>>> don’t think anyone ever discussed that system with me; I was the one who
>>>> created the original layout test system, to help Dave Hyatt originally, and
>>>> then later the rest of the team started using it.
>>>
>>> The granular annotations (more than just SKIP) in test_expectations.txt were
>>> something we introduced back when Chrome was failing a large percentage of
>>> layout tests, and we needed a system to help us triage the failures.  It was
>>> useful to distinguish tests that crash from tests that generate bad results,
>>> for example.  We then focused on the crashing tests first.
>>>
>>> In addition, we wanted to understand how divergent we were from the standard
>>> WebKit port, and we wanted to know if we were failing to match text results
>>> or just image results.  This allowed us to measure our degree of
>>> incompatibility with standard WebKit.  We basically used this mechanism to
>>> classify differences that mattered and differences that didn't matter.
>>>
>>> I think that if we had just checked in a bunch of port-specific "failure"
>>> expectations as -expected files, then we would have had a hard time
>>> distinguishing failures we needed to fix for compat reasons from failures
>>> that were expected (e.g., because we have different-looking form controls).
>>>
>>> I'm not sure if we are at a point now where this mechanism isn't useful, but
>>> I kind of suspect that it will always be useful.  After all, it is not
>>> uncommon for a code change to result in different rendering behavior between
>>> the ports.  I think it is valuable to have a measure of divergence between
>>> the various WebKit ports.  We want to minimize such divergence from a web
>>> compat point of view, of course.  Maybe the count of SKIPPED tests is
>>> enough?  But, then we suffer from not running the tests at all.  At least by
>>> annotating expected IMAGE failures, we get to know that the TEXT output is
>>> the same and that we don't expect a CRASH.
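
The measurement being described here falls out of those annotations almost
for free. As a rough sketch of the idea (not actual Chromium tooling, and it
assumes the illustrative entry format shown earlier in the thread):

    import collections
    import re

    def tally_expected_failures(expectations_text):
        # Count entries by annotated failure type (TEXT, IMAGE, IMAGE+TEXT,
        # CRASH, TIMEOUT, ...) as a crude measure of divergence.
        counts = collections.Counter()
        for line in expectations_text.splitlines():
            line = line.split('//')[0].strip()  # drop comments and whitespace
            match = re.match(r'.+?:\s*\S+\s*=\s*(.+)$', line)
            if match:
                counts[match.group(1).strip()] += 1
        return counts

Counting SKIPPED tests would give a similar number, but then, as the
paragraph above notes, the tests would not be run at all.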
>>
>> There are at least two reasons for divergence: one is that the port is
>> actually doing the wrong thing, and the other is that the port is
>> doing the "right" thing but the output is different anyway (e.g., a
>> control is rendered differently). We cannot easily separate the two if
>> we have only a single convention (platform-specific -expected files),
>> but SKIPPING tests seems wrong for either category.
>>
>> It seems like -failing gives you the control you would want, no?
>> Obviously, it wouldn't help the thousands of -expected files that are
>> "wrong" but at least it could keep things from getting worse.
>>
>> I will note that reftests might solve some issues but not all of them
>> (since obviously code could render both pages "wrong").
>>
>> -- Dirk
>>
>>> I suspect this isn't the best solution to the problem though.
>>> -Darin
>>>
>>>
>>>>
>>>> > Are there reasons we [are] doing things this way[?]
>>>>
>>>> Sure. The idea of the layout test framework is to check if the code is
>>>> still behaving as it did when the test was created and last run; we want to
>>>> detect any changes in behavior that are not expected. When there are
>>>> expected changes in behavior, we change the contents of the expected results
>>>> files.
>>>>
>>>> It seems possibly helpful to augment the test system with editorial
>>>> comments about which tests show bugs that we’d want to fix. But I wouldn’t
>>>> want to stop running all regression tests where the output reflects the
>>>> effects of a bug or missing feature.
>>>>
>>>>    -- Darin
>>>>
>>>
>>>
>>>
>> _______________________________________________
>> webkit-dev mailing list
>> webkit-dev at lists.webkit.org
>> http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
>>
>

