[webkit-dev] [PSA] WebKitGTK layout testers available on the Bugzilla EWS bubbles

Fri Dec 24 10:23:55 PST 2021

On 24/12/2021 15:00, Michael Catanzaro via webkit-dev wrote:
> On Fri, Dec 24 2021 at 12:44:49 AM +0000, Carlos Alberto Lopez Perez via
> webkit-dev <webkit-dev at lists.webkit.org> wrote:
>> So we ended deploying a different version of the EWS that has a much
>> higher tolerance to pre-existent failures (up to 500 before exiting
>> early) and also that tries hard to discard pre-existent failures and
>> flakies by repeating each failure 10 times with patch and 10 times
>> without it. [1]
> 
> Mixed thoughts on this:
> 
> (1) Good job. Having layout tests on EWS is a great improvement. We've
> been talking about this for a long time, and you finally made it happen!
> 
> (2) That you needed to use such a big hammer to make the EWS work
> reliably suggests either that either WebKitGTK quality or WebKit test
> quality is quite low. I'm sure it's a mix of both, but mostly the
> former, because test flakiness is not this severe for Apple ports. This
> is not encouraging.
> 

Sorry, but I don't agree with your conclusion about quality.

So, let me explain in more detail the factors that contribute to this
issue with the tests:

  1) Number of unexpected failures on the clean tree

    The higher number of unexpected failures on the clean tree is caused
    mainly by the following reasons:

      1.1) Until now we didn't have an EWS. So it was pretty hard (if not
      impossible) for any developer to notice that a patch was going to
      break GTK tests. This did not help keep breaking patches from landing.

      1.2) We don't have a rule to roll back patches that break GTK tests.
      If a patch lands adding unexpected failures for GTK, those usually
      stay there until one of our gardeners has time to fix the issue
      or mark the new failure as expected. Also, having such a rule
      wouldn't have made sense before having an EWS that developers can use.

      1.3) We don't have anyone working full-time on gardening. We try
      to share the effort between us on a best-effort basis. So unexpected
      failures, once landed, can remain there for days until they are
      gardened.

      1.4) Patches landing via commit-queue run the layout tests on Mac
      before landing. So a patch won't land if it breaks layout tests on
      Mac. But it will land anyway if it breaks tests on GTK.

  2) Number of unexpected flaky tests

      2.1) It is true that we do have a higher number of flaky tests compared
      to Apple ports. But the flakiness issue is also a problem there.
      It is not unusual to see the standard EWS giving false positives due
      to some test being flaky.

      2.2) I'm not sure if our higher number of flaky tests is caused by
      some issue in the code of the port or just by the fact that we don't
      have enough manpower to stay on top of flaky tests on a daily basis
      and mark each flaky test as soon as it is detected.

And regarding quality or test quality:

  3) Having the results of the layout tests "green" is not a synonym for
  quality. Layout tests giving a "green" or "red" result is not about passing
  or failing the tests; it is just about giving the "expected" result (which
  can be a failure). A port can have a lot of failures marked as "expected
  failure" or a lot of flaky tests marked as "expected flaky" and be more
  green than another port that has fewer failures or fewer flaky tests, but
  has not marked them.
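
  To illustrate the point, here is a minimal sketch in Python. It is not
  the actual WebKit tooling: the function names, the test names and the
  "PASS"/"FAIL" outcome strings are all made up for the example. It only
  shows how a port with more real failures can still look greener.

```python
# Hypothetical sketch; none of these names come from WebKit's real scripts.

def bot_color(expectations, results):
    """Return "green" if every test gave an expected outcome, else "red".

    expectations maps a test name to the set of acceptable outcomes.
    Tests that are not listed are expected to pass.
    """
    for test, outcome in results.items():
        if outcome not in expectations.get(test, {"PASS"}):
            return "red"
    return "green"


def real_failures(results):
    """Count the tests that did not pass, ignoring any expectations."""
    return sum(1 for outcome in results.values() if outcome != "PASS")


# Port A fails three tests, but all of them are marked
# (two as expected failures, one as expected flaky).
port_a_results = {"a.html": "FAIL", "b.html": "FAIL",
                  "c.html": "FAIL", "d.html": "PASS"}
port_a_expectations = {"a.html": {"FAIL"}, "b.html": {"FAIL"},
                       "c.html": {"PASS", "FAIL"}}

# Port B fails a single test, but nobody has marked it yet.
port_b_results = {"a.html": "PASS", "b.html": "PASS",
                  "c.html": "PASS", "d.html": "FAIL"}
port_b_expectations = {}

print(bot_color(port_a_expectations, port_a_results),
      real_failures(port_a_results))
print(bot_color(port_b_expectations, port_b_results),
      real_failures(port_b_results))
```

  In this made-up case port A is green with three failing tests and port B
  is red with only one.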

  If you want to compare the quality of the ports, then maybe something like
  wpt.fyi [1] can be more useful than WebKit layout tests, because tests there
  can't be "expected failures". So a test will only be green if it passes.

And looking ahead to improve things:

  4) I expect the number of unexpected failures in the clean tree to start
  to be more controllable now that we have this EWS working and developers
  can be notified in advance of a breaking change before landing.

  5) The EWS now also has code to detect flaky tests when it does all those
  runs and repeats, and it sends mails to the bot watchers with the names
  of all the flaky tests that it detects. We will be gardening those with
  the idea of reducing the number of unexpected flakies.
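
  As a rough idea of how repeated runs can separate these cases, here is a
  small Python sketch. It is only an assumption about the general approach
  (repeat each failing test with and without the patch, then compare), not
  the code the EWS actually runs. The function names and the category
  labels are invented for the example.

```python
# Hypothetical sketch; the real EWS logic may differ in every detail.

def is_mixed(runs):
    """True if the repeated runs contain both passes and failures."""
    return any(runs) and not all(runs)


def classify(with_patch, without_patch):
    """Classify one test from repeated runs.

    Each argument is a non-empty list of booleans (True = the run passed),
    for example 10 runs with the patch and 10 runs without it.
    """
    if is_mixed(with_patch) or is_mixed(without_patch):
        return "flaky"
    fails_with = not any(with_patch)
    fails_without = not any(without_patch)
    if fails_with and fails_without:
        return "pre-existing failure"
    if fails_with:
        return "new failure"
    if fails_without:
        return "fixed by patch"
    return "pass"


runs = 10
print(classify([False] * runs, [True] * runs))
print(classify([False] * runs, [False] * runs))
print(classify([True] * 3 + [False] * 7, [True] * runs))
```

  Note that this simple version reports any inconsistent result as flaky,
  so it cannot tell a flaky test from a patch that itself introduces
  flakiness. A real checker would need a stricter rule for that case.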

[1] https://wpt.fyi/results/?label=master&label=experimental&product=safari&product=webkitgtk&aligned

> (3) Any plans for WPE?
> 

Yes. We look forward to adding WPE testers as soon as possible. Hopefully
it will happen in 2022-Q1.

Best regards and happy holidays!
--------------------------------
