[webkit-dev] Closing the loop on flaky tests (was Re: Flaky test hit list)

Tue Oct 19 11:16:03 PDT 2010

On Tue, Oct 19, 2010 at 8:42 AM, Alexey Proskuryakov <ap at webkit.org> wrote:
> On 15.10.2010, at 07:39, Eric Seidel wrote:
>
>> BTW, the commit-queue has started complaining publicly about flaky tests:
>>
>> https://bugs.webkit.org/show_bug.cgi?id=47698#c5
>>
>> Hopefully this will bring further awareness to the issue.
>
> I find this extremely annoying and offensive. Half of my bugmail is already about bugs that I'm not interested in.

Sorry Alexey, I certainly didn't intend to offend you.

The problem we're trying to solve is that there is currently no feedback
loop for authors of flaky tests.  If someone writes a flaky test,
there's no mechanism for them to find out about it.  It just sticks
around and causes pain for everyone else.  The idea behind this change
is to create a feedback loop whereby authors of flaky tests can
discover that their tests are flaky.

Looking back at the history since this feature was enabled, it looks
like you were CCed on 3 of the 4 bugs that encountered flaky tests.
Here are the tests that flaked out:

1x http://trac.webkit.org/browser/trunk/LayoutTests/http/tests/appcache/404-manifest.html
2x http://trac.webkit.org/browser/trunk/LayoutTests/http/tests/appcache/insert-html-element-with-manifest-2.html

According to SVN, you did write both of these tests, so the tool is
accurately computing the author.  This is triggering more often than we
expected.  I'm not sure whether that's a statistical aberration.
Here's how we calculated how much traffic this tool would generate:

According to webkit-patch find-flaky-tests, the flakiest test fails
about 7 times per 2000 revisions, which means it fails in about 0.35%
of test runs.  The commit-queue lands about 30 patches per day, so that
means the author of the flakiest test should get CCed on about one bug
every ten days.  Also, these bugs are close to the end of their
lifecycle (because their patch is about to land), so they shouldn't
generate more than 3 or 4 emails each.  That boils down to about one
or two emails per week for the flakiest test.
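
For anyone who wants to check the arithmetic, here is a minimal sketch of
the estimate above. It assumes one test run per landed patch and that every
flaky failure results in exactly one CC; the function and parameter names
are made up for illustration and are not part of webkit-patch. With the
upper bound of 3 to 4 emails per bug, it comes out at roughly two to three
emails per week, in the same ballpark as the rough figure above.

```python
# Back-of-the-envelope estimate of CC traffic for the author of the
# flakiest test. Inputs are the figures quoted in this message; the model
# (one test run per landed patch, one CC per flaky failure) is an assumption.

def estimate_flaky_mail(failures, revisions, patches_per_day, emails_per_bug):
    """Return (failure_rate, days_between_bugs, emails_per_week)."""
    failure_rate = failures / revisions            # fraction of runs that flake
    bugs_per_day = failure_rate * patches_per_day  # CCs per day for the author
    days_between_bugs = 1 / bugs_per_day
    emails_per_week = bugs_per_day * 7 * emails_per_bug
    return failure_rate, days_between_bugs, emails_per_week


if __name__ == "__main__":
    for emails_per_bug in (3, 4):
        rate, days, weekly = estimate_flaky_mail(7, 2000, 30, emails_per_bug)
        print(f"failure rate {rate:.2%}, one bug every {days:.1f} days, "
              f"{weekly:.1f} emails/week at {emails_per_bug} emails per bug")
```

The estimate is linear in every input, so if the observed failure rate or
the number of emails per bug turns out to be twice what we assumed, the
weekly traffic doubles as well.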

Now, that calculation is a very rough approximation, and we might have
missed some important factors.  We're certainly open to other
suggestions for how to close the loop on flaky tests if this approach
generates too much email.

Adam