[webkit-qt] Tests in Skipped List that Pass with Pixel Differences

Mon Apr 19 12:45:24 PDT 2010

Hi Andras!

[Just to clarify what I mean when I say 'pixel differences': I'm referring 
to differences in non-plaintext DumpRenderTree results, usually due to 
different font widths etc.]

On Sunday 18 April 2010 23:35:49 Andras Becsi wrote:
> Hi Robert,
> 
> this is a great plan.
> The only issue is that we are also facing these kinds of pixel differences
>   - between different releases of Qt: for example a Qt 4.6.2 build has
> ~300 failing tests on the bot which pass on a 4.6.1 build,
>   - between different distributions: our buildbot runs Debian Lenny but
> developers frequently note that they have 300-500 different results on
> other distributions or other versions of the same distributions, and we
> also experience this problem on our ARMv5 tester, which might or might
> not be related to package version differences too.
> 

Like nearly everyone else (by the sounds of it) I'm a victim of both of 
these problems. In the short to medium term this situation is unlikely to 
go away - at least until reftests happen.

So until then we are left with choosing a baseline and making that work so 
that we can establish which failures in the skipped list are actual bugs. I 
agree there are definite downsides to doing this, as you have identified, 
but provided a group of us (including a reviewer or two) can commit to 
working the plan, we will at least have one computer in the world 
passing all the layout tests it has a right to pass!

Am I right in thinking the linux release buildbot is 4.6.1? I thought I 
heard somewhere it was 4.5. Whichever it is, I propose we treat the buildbot 
as a baseline and use the exercise I've described to identify all the 
tests in the current Skipped list that can be unskipped because their pixel 
differences are not relevant to the outcome of the test. We will then be 
left with genuine bugs in trunk qtwebkit/qt 4.6.1.

Once that exercise is complete we move the bot to 4.7, run 
run-webkit-tests and find ourselves with a whole bunch of tests that now 
fail. These are either genuine regressions or unimportant pixel 
differences. We will then have to repeat the exercise above to get rid of 
the false positives.

That sounds like a horrible prospect but there are a couple of reasons it 
might be ok:

- I think there is a lower chance of tests that pass on the buildbot with 
4.6.1 failing on the buildbot with 4.7, unless they are genuine Qt bugs. 
That is because I believe the tests failing in other build environments are 
due to setup other than Qt/WebKit. Is there any way of confirming or 
disconfirming this?

- A lot of the failures are likely to be the same failures that we 
encountered when working through the pixel differences in 4.5.

So in summary:

- There are an awful lot of Skipped tests in fast/ and editing/ - we 
confine this effort to those.
- We block upgrading the buildbot until they are all unskipped or proven to 
be actual bugs. We keep a file in the platform/qt folder documenting the 
decision to unskip each test and why.
- Once the effort is complete and we upgrade, some subset of these will 
break again. We work through the failures against the file 
created/maintained in the step above and amend the expected results where 
we can see there has been no actual regression.
- Then repeat steps above for all new failures from the upgrade.
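
To make the post-upgrade step more concrete, here is a minimal Python 
sketch of checking new failures against the file that documents each 
unskip decision. Everything specific in it is an assumption made for 
illustration: the one-line-per-test "path | reason" format, the function 
names and the test paths are hypothetical, and no such file format is 
defined anywhere in this thread.

```python
# Hypothetical sketch: the manifest format and test paths below are
# assumptions for illustration, not an existing WebKit file format.

def parse_manifest(text):
    """Map each unskipped test path to the recorded reason.

    Assumed line format: '<test path> | <reason>'. Blank lines and lines
    starting with '#' are ignored.
    """
    reasons = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        path, _, reason = line.partition("|")
        reasons[path.strip()] = reason.strip()
    return reasons


def triage(failures, manifest):
    """Split post-upgrade failures into two sorted lists.

    recheck: tests already judged to differ only in pixels; their
        expected results are candidates for an update after a quick look.
    investigate: tests with no manifest entry; possible real regressions.
    """
    recheck = sorted(t for t in failures if t in manifest)
    investigate = sorted(t for t in failures if t not in manifest)
    return recheck, investigate


manifest = parse_manifest("""
# test path | why platform-specific results are acceptable
fast/css/example-a.html | text metrics differ, layout identical
editing/example-b.html | font width only
""")
failures = ["fast/css/example-a.html", "fast/dom/example-c.html"]
recheck, investigate = triage(failures, manifest)
print("recheck:", recheck)
print("investigate:", investigate)
```

Failures that already have a manifest entry still need a human look 
before their expected results are amended; the split only shows where 
the earlier reasoning can be reused and where triage starts from scratch.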

What do you think? There are 2100 tests skipped in fast and editing. I 
would say a first pass could result in 500 getting unskipped.

In the absence of a magic bullet that helps us eliminate the failures due 
to build environment differences this, or something like it, appears to be 
what we're left with going forward.

> I'm strongly in favor of unskipping these tests if they are stable
> enough (IIRC Ossy has a list somewhere), and rebaselining the results,
> but we also have to take into account that if we do not find the cause of
> the differences between environments or it is not a bug of Qt (which I
> suppose) we have to update a whole lot of test results each time we
> switch Qt versions, and the landed results would only be relevant on our
> buildbot which would be an awkward thing. These tests would also need
> comparing of pixel results to the Mac ones because the DRT dumps are not
> always reliable. Pixel testing is a time consuming activity.
> Another partly relevant thing is that we were thinking about a tool,
> which would automatically run all the tests in the skipped list say once
> a week, to check if there are possible changes, but there are a few
> blocker tests which send run-webkit-tests into an infinite loop. I took
> a note somewhere, I'll check and file a bug as a starting point. Fixing
> these would make some automation possible and ease the further actions
> on the skipped list.
> 
> BR,
> Andras
> 
> On 2010-04-18 23:30, Robert Hogan wrote:
> > There is a definite class of tests in the Qt skipped list that fail
> > only because of unimportant pixel differences.
> >
> > I propose to open a master bug for these, the purpose of which will be
> > to act as a staging pad for removing them from the Skipped list.
> >
> > I would see it working like this:
> >
> > - Identify a group of tests that only fail due to pixel differences
> > and for which it is clear pixel differences are not material
> > - Add a patch against LayoutTests/platform/qt/Skipped with the tests
> > removed.
> > - Along with the patch add a comment to bugzilla which will act as a
> > 'manifest' listing each test to be removed and why it is safe to use
> > platform specific results. It might even be best to keep this manifest
> > as a file in the platform/qt tree.
> > - Send a note to the buildbot team requesting them to try the patch.
> > - Wait for a member of the buildbot team to run the patch against the
> > buildbot. The buildbot team can then post an updated patch with the
> > changes to the Skipped list and the platform specific results. They
> > mark it for review.
> > - A reviewer comes along, satisfies themselves that platform specific
> > results are justified for each of the skipped tests and approves.
> >
> > Keeping all this under a single bugzilla entry might get cumbersome
> > after a while; when that happens a new one can be created to replace
> > it and linked from the old one. I think it's important to have one
> > go-to place for this effort though.
> >
> > Thoughts? Does the buildbot team think this is something that could work?
> > _______________________________________________
> > webkit-qt mailing list
> > webkit-qt at lists.webkit.org
> > http://lists.webkit.org/mailman/listinfo.cgi/webkit-qt
> 