[PSA] WebKitGTK layout testers available on the Bugzilla EWS bubbles
Hi,

Some of you might have noticed that, as of a few days ago, Bugzilla shows a new EWS bubble for GTK WK2 layout tests.

Until now the GTK port had no layout test coverage on the EWS. Adding these testers to the EWS has been a challenge because the GTK port layout tests are usually not 100% green: there are usually unexpected failures and also flakies. As a result, the EWS was usually either reporting false positives (due to flakies) or exiting early (because of more than 30 unexpected failures on the clean tree, or more than 30 accumulated between the ones caused by the patch and the ones in the clean tree).

We made extra gardening efforts to mark all the flakies and to keep the tree always green, but for several reasons we couldn't keep it in this always-green state for more than a few consecutive days. New flakies would also keep appearing.

So we ended up deploying a different version of the EWS that has a much higher tolerance for pre-existing failures (up to 500 before exiting early) and that also tries hard to discard pre-existing failures and flakies by repeating each failure 10 times with the patch and 10 times without it. [1]

This new version of the EWS WK2 layout tester is only used on the GTK port for now, and we plan to use it for the WPE port as well when we deploy layout tests on the EWS for it.

The advantage is that this EWS should not report false positives or flakies. Any failure reported by it should be a consistent new failure (it always fails with the patch but never without it, after 10 runs with the patch and 10 without it). If you see it reporting any false positive, please let me know.
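The rule described above (a failure is reported only if it fails in all 10 runs with the patch and in none of the 10 runs without it) can be sketched roughly as follows. This is only an illustration, not the actual EWS code: the names `Verdict` and `classify` are hypothetical, and the way the sketch separates "pre-existing" from "flaky" is an assumption, since the message only defines what counts as a new failure.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    NEW_FAILURE = "new failure"            # reported to the patch author
    PRE_EXISTING = "pre-existing failure"  # also fails on the clean tree
    FLAKY = "flaky"                        # inconsistent results, not reported


def classify(with_patch: Sequence[bool], without_patch: Sequence[bool]) -> Verdict:
    """Classify one test that failed in the first run.

    Each sequence holds one entry per repeat (10 each in the message);
    True means the test failed in that repeat.
    """
    if not with_patch or not without_patch:
        raise ValueError("need at least one repeat with and without the patch")
    # From the message: always fails with the patch, never without it.
    if all(with_patch) and not any(without_patch):
        return Verdict.NEW_FAILURE
    # Assumption: a test that fails consistently on both sides is pre-existing.
    if all(with_patch) and all(without_patch):
        return Verdict.PRE_EXISTING
    # Assumption: any other mix of results is treated as flaky.
    return Verdict.FLAKY


if __name__ == "__main__":
    print(classify([True] * 10, [False] * 10).value)
    print(classify([True] * 10, [True] * 10).value)
    print(classify([True] * 9 + [False], [False] * 10).value)
```

Under this sketch, a single pass among the 10 runs with the patch is enough to keep a test off the list of new failures, which matches the stated goal of avoiding false positives.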
However, it also has the disadvantage that the view showing the layout test results is a bit confusing, because it also lists failures unrelated to your patch (those are either pre-existing failures or flakies). My tip here is to pay attention only to the test failures reported by the EWS as new, which are the ones shown on the Bugzilla bubble when you hover the mouse over it (or the ones that appear in the e-mail that you receive from the EWS).

I'm thinking that a further improvement could be to add an extra step that does a final run with the patch just for the list of new failures, in order to present a clean view of the results with only the new failures. This would also be very useful for properly using the script update-test-expectations-from-bugzilla to automatically apply the new test expectations from the EWS locally.

Any comments welcome :)

Best regards!

[1] https://bugs.webkit.org/show_bug.cgi?id=231999
On Fri, Dec 24 2021 at 12:44:49 AM +0000, Carlos Alberto Lopez Perez via webkit-dev <webkit-dev@lists.webkit.org> wrote:
So we ended up deploying a different version of the EWS that has a much higher tolerance for pre-existing failures (up to 500 before exiting early) and that also tries hard to discard pre-existing failures and flakies by repeating each failure 10 times with the patch and 10 times without it. [1]
Mixed thoughts on this:

(1) Good job. Having layout tests on EWS is a great improvement. We've been talking about this for a long time, and you finally made it happen!

(2) That you needed to use such a big hammer to make the EWS work reliably suggests that either WebKitGTK quality or WebKit test quality is quite low. I'm sure it's a mix of both, but mostly the former, because test flakiness is not this severe for Apple ports. This is not encouraging.

(3) Any plans for WPE?

Anyway, I agree this was the best approach given the current situation.

Happy holidays,

Michael
On 24/12/2021 15:00, Michael Catanzaro via webkit-dev wrote:
On Fri, Dec 24 2021 at 12:44:49 AM +0000, Carlos Alberto Lopez Perez via webkit-dev <webkit-dev@lists.webkit.org> wrote:
So we ended up deploying a different version of the EWS that has a much higher tolerance for pre-existing failures (up to 500 before exiting early) and that also tries hard to discard pre-existing failures and flakies by repeating each failure 10 times with the patch and 10 times without it. [1]
Mixed thoughts on this:
(1) Good job. Having layout tests on EWS is a great improvement. We've been talking about this for a long time, and you finally made it happen!
(2) That you needed to use such a big hammer to make the EWS work reliably suggests that either WebKitGTK quality or WebKit test quality is quite low. I'm sure it's a mix of both, but mostly the former, because test flakiness is not this severe for Apple ports. This is not encouraging.
Sorry, but I don't agree with your conclusion about quality. Let me explain in more detail the factors that contribute to this issue with the tests:

1) Number of unexpected failures on the clean tree

The higher number of unexpected failures on the clean tree is caused mainly by the following reasons:

1.1) Until now we didn't have an EWS, so it was pretty hard (if not impossible) for any developer to notice that a patch was going to break GTK tests. This didn't help to prevent breaking patches from landing.

1.2) We don't have a rule to roll back patches that break GTK tests. If a patch lands adding unexpected failures for GTK, those usually stay there until one of our gardeners has time to fix the issue or mark the new failure as expected. Having such a rule also wouldn't have made sense before having an EWS that developers can use.

1.3) We don't have anyone working full-time on gardening. We try to share the effort between us on a best-effort basis, so unexpected failures, once landed, can remain there for days until they are gardened.

1.4) Patches landing via commit-queue run layout tests on Mac before landing. So a patch won't land if it breaks layout tests on Mac, but it will land anyway if it breaks tests on GTK.

2) Number of unexpected flaky tests

2.1) It is true that we do have a higher number of flaky tests compared to Apple ports. But flakiness is also a problem there. It is not unusual to see the standard EWS giving false positives due to some test being flaky.

2.2) I'm not sure whether our higher number of flaky tests is caused by some issue in the code of the port, or whether we just don't have enough manpower to stay on top of flaky tests on a daily basis and mark each flaky test as soon as it is detected.

And regarding quality or test quality:

3) Having the results of the layout tests "green" is not a synonym of quality.
Layout tests giving a "green" or "red" result is not about passing or failing the tests; it is just about giving the "expected" result (which can be a failure). A port can have a lot of failures marked as "expected failure" or a lot of flaky tests marked as "expected flaky" and be greener than another port that has fewer failures or fewer flaky tests that are not marked. If you want to compare the quality of the ports, then maybe something like wpt.fyi [1] can be more useful than the WebKit layout tests, because tests there can't be "expected failures": a result will only be green if the test passes.

And looking ahead to improve things:

4) I expect the number of unexpected failures in the clean tree to start to be more controllable now that we have this EWS working and developers can be notified of a breaking change in advance, before landing.

5) The EWS now also has code to detect flaky tests when it does all those runs and repeats, and it is sending mails to the bot watchers with the names of all the flaky tests that it detects. We will be gardening those with the idea of reducing the number of unexpected flakies.

[1] https://wpt.fyi/results/?label=master&label=experimental&product=safari&prod...
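To make point 3 above concrete, here is a small sketch of why "green" only means "matches expectations". It is a simplified model, not the real test-expectations machinery: the function `is_green`, the outcome strings and the test names are all made up for illustration.

```python
from typing import Dict, Set


def is_green(expected: Dict[str, Set[str]], actual: Dict[str, str]) -> bool:
    """A run is green when every actual outcome is among the expected ones.

    Tests without an explicit expectation are assumed to be expected to pass.
    """
    return all(
        outcome in expected.get(test, {"PASS"})
        for test, outcome in actual.items()
    )


# Hypothetical port A: three failing tests, all of them marked as expected.
port_a_actual = {"t1.html": "FAIL", "t2.html": "FAIL", "t3.html": "FAIL", "t4.html": "PASS"}
port_a_expected = {"t1.html": {"FAIL"}, "t2.html": {"FAIL"}, "t3.html": {"PASS", "FAIL"}}

# Hypothetical port B: a single failing test, but nobody has marked it yet.
port_b_actual = {"t1.html": "PASS", "t2.html": "PASS", "t3.html": "PASS", "t4.html": "FAIL"}
port_b_expected: Dict[str, Set[str]] = {}

print(is_green(port_a_expected, port_a_actual))  # True: three failures, still green
print(is_green(port_b_expected, port_b_actual))  # False: one failure, red
```

In this toy model the port with more failing tests is the green one, simply because its failures have been gardened.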
(3) Any plans for WPE?
Yes. We look forward to adding WPE testers as soon as possible. Hopefully it will happen in 2022-Q1.

Best regards and happy holidays!