[webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

Dirk Pranke dpranke at chromium.org
Mon Apr 9 15:21:04 PDT 2012


On Mon, Apr 9, 2012 at 3:02 PM, James Robinson <jamesr at google.com> wrote:
>> 3) Don't use test_expectations.txt to suppress failures across a
>> single cycle of the bot, just so you can gather updated baselines
>> without the tree going red. While it might seem that you're doing tree
>> maintainers a favor, in my experience this just makes things confusing
>> and it's better to let the tree go red until you can actually
>> rebaseline things. It's too easy to add a suppression "for now" and
>> then forget about it.
>
>
> How would you suggest someone contribute a patch that changes layout test
> results, if not by marking the test as failing?
>

Both this and the question in Dana's response are good ones that I
don't have great answers for. Let me explain my thinking in a little
more detail.

In my ideal world, you would be able to get updated baselines *prior*
to trying to land a patch. Of course, that isn't really possible today
for any test that fails on multiple ports with different results: it's
practically impossible to run more than a couple of ports by hand, and
we don't have bot infrastructure to help you here.

In my closer-to-the-real-world ideal, the test_expectations.txt file
would normally be *empty*, all of the ports would be using it, and
adding suppressions for the tests you expect to fail in your patch
would then be a good signal.
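
(For concreteness, a suppression in the current Chromium-style
expectations syntax looks roughly like the line below; the bug number,
ports, and test path here are invented for illustration:

    BUGWK12345 MAC LINUX : fast/dom/example-test.html = TEXT

i.e., a bug reference and port modifiers before the colon, then the
test and the kind of failure you expect.)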

In practice, however, marking a test as failing has a few problems:

1) if you don't get the list of suppressions right, then what shows up
on the bots can be confusing - you may not realize that more (or
different) tests are failing than the ones you suppressed. You can
also see failures on bots you didn't account for that don't use the
expectations file at all (e.g., you suppressed the Chromium failures
correctly, but the Apple bots go red).

2) adding suppressions to test_expectations.txt increases contention
on the file, confusing the gardeners / bot-watchers and making merges
harder. In practice it's difficult to tell the difference between "I
just added these lines as a temporary suppression" and "I added these
lines as a suppression two weeks ago but then forgot about them", and,
as Ojan will tell you, you end up with a bunch of lines in the file
that just have a comment saying "just needs rebaselining", with no way
to know whether that is still true (see the example after this list).

3) the current tool-of-the-day for managing rebaselining,
garden-o-matic, is much better suited to handling the "unexpected"
failures on the bots than the "expected" failures you've introduced
yourself.
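
(To illustrate point 2, a typical stale entry ends up looking
something like this - the bug number and test are made up:

    // Just needs rebaselining.
    BUGWK54321 : fast/css/stale-example.html = IMAGE+TEXT

and a few weeks later nobody can say whether the comment still holds.)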

That said, as Dana points out, my proposal makes it hard, if not
impossible, for the EWS to work at all; I'm biased because I hardly
ever post changes that are expected to make the EWS bots fail but
should still land. I don't know what to do about this, although
perhaps moving to my "near-ideal" model might work.

If there's consensus in the meantime that it is better on balance to
check in suppressions, perhaps we can figure out a better way to do
that. Maybe (shudder) a second test_expectations file? Or maybe it
would be better to check in suppressions explicitly marked as
REBASELINE (or something like that)?
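
(Sketching that last idea: such an entry might look something like

    BUGWK67890 REBASELINE : fast/forms/example-test.html = TEXT

- an invented example line - where the REBASELINE marker would make it
easy to find, and later garbage-collect, suppressions that only exist
while new baselines are being gathered.)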

-- Dirk

