[webkit-dev] DRT/WTR should clear the cache at the beginning of each test?

Sun Oct 28 06:32:39 PDT 2012

On Oct 26, 2012, at 11:11 PM, Ryosuke Niwa <rniwa at webkit.org> wrote:

> 
> I’m sure Antti, Alexey, and others who have worked on the loader and other parts of WebKit are happy to write those tests or list the kind of things they want to test. Heck, I don’t mind writing those tests if someone could make a list.
> 
> I totally sympathize with the sentiment to reduce the test flakiness but loader and cache code have historically been under-tested, and we’ve had a number of bugs detected only by running non-loader tests consecutively.
> 
> On the contrary, we’ve had this DRT behavior for ages. Is there any reason we can’t wait for another couple of weeks or months until we add more loader & cache tests before making the behavior change?

I think the nature of loader and cache code is that it's very hard to write tests that always fail deterministically, rather than randomly, when regressions are introduced. The reason is that bugs in these areas are often timing-dependent. I think this tendency to fail randomly is likely to hold whether the tests explicitly target the cache or only exercise it incidentally in the course of other things.

Unfortunately, it's very tempting when a test is failing randomly to blame the test rather than to investigate whether there is an actual regression affecting it. And sometimes it really is the test's fault. But sometimes it is a genuine bug in the code. 

On the other hand, nondeterministic test failures make it harder to use test infrastructure in general.

These are difficult things to reconcile. The original philosophy of WebKit tests is to test end-to-end under relatively realistic conditions, but at the same time unpredictability makes it hard to stay at zero regressions.

I think making different ports do testing under different conditions makes it more likely that some contributors will introduce regressions without noticing, leaving it for others to clean up. So it's regrettable if we go that way because we are unable to reach consensus. Creating some special opt-in --antti mode would be even worse, as it's almost certain that failures would creep into a mode that nobody runs.

What I personally would most wish for is good tools to catch when a test starts failing nondeterministically, and to identify the revision where the failures began. The reason we hate random failures is that they are hard to track down and diagnose. But some types of bugs are unlikely to manifest in a purely deterministic way. It would be good if we had a reliable and useful way to catch those types of bugs.
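
To make that wish a little more concrete, here is a rough sketch of the core check such a tool might perform. It is not an existing WebKit tool; the record format (pairs of revision number and pass/fail for a single test), the function names, and the min_clean_runs parameter are all assumptions made for illustration. Given repeated runs per revision, it reports the first revision where the test both passed and failed, along with the last revision that had enough clean runs to be trusted as good.

```python
from collections import defaultdict


def summarize(runs):
    """Group (revision, passed) records into {revision: [passes, failures]}."""
    counts = defaultdict(lambda: [0, 0])
    for revision, passed in runs:
        counts[revision][0 if passed else 1] += 1
    return counts


def find_flaky_onset(runs, min_clean_runs=3):
    """Return (last_clean_revision, first_flaky_revision), or None.

    Hypothetical helper: a revision counts as "flaky" when the same test
    has both passed and failed on it, and as "clean" only when it has at
    least min_clean_runs passes and no failures. Revisions with too few
    runs, or with only failures, are skipped rather than trusted.
    """
    counts = summarize(runs)
    last_clean = None
    for revision in sorted(counts):
        passes, failures = counts[revision]
        if passes and failures:
            # Mixed results on one revision: nondeterministic failure.
            return last_clean, revision
        if failures == 0 and passes >= min_clean_runs:
            last_clean = revision
    return None


# Made-up history for one test: three clean runs at revision 100,
# a single run at 101, and mixed results at 102.
example_runs = [
    (100, True), (100, True), (100, True),
    (101, True),
    (102, True), (102, False), (102, True),
]
suspect_range = find_flaky_onset(example_runs)
```

In this made-up history the single passing run at revision 101 is not enough to rule it out, so the suspect range stays as wide as 100 to 102. That is the trade-off with random failures: the more runs per revision, the narrower the range that can be reported with any confidence.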

Regards,
Maciej