[webkit-dev] DRT/WTR should clear the cache at the beginning of each test?
dpranke at chromium.org
Sun Oct 28 14:09:53 PDT 2012
> On Oct 26, 2012, at 11:11 PM, Ryosuke Niwa <rniwa at webkit.org> wrote:
>> I’m sure Antti, Alexey, and others who have worked on the loader and other parts of WebKit are happy to write those tests or list the kind of things they want to test. Heck, I don’t mind writing those tests if someone could make a list.
>> I totally sympathize with the sentiment to reduce the test flakiness but loader and cache code have historically been under-tested, and we’ve had a number of bugs detected only by running non-loader tests consecutively.
>> On the contrary, we’ve had this DRT behavior for ages. Is there any reason we can’t wait for another couple of weeks or months until we add more loader & cache tests before making the behavior change?
Please correct me if I'm misinformed, but it's been three months since
this issue was first raised, and it doesn't sound like they've been
writing those tests or are happy to do so, and despite people asking
on this thread, they haven't been listing the kinds of tests they
think they need.
Have we actually made any progress here, or was the issue dropped
until Ami raised it again? It seems like the latter to me ... again,
please correct me if this is being actively worked on, because that
would change the whole tenor of this debate.
On Sun, Oct 28, 2012 at 6:32 AM, Maciej Stachowiak <mjs at apple.com> wrote:
> I think the nature of loader and cache code is that it's very hard to make tests which always fail deterministically when regressions are introduced, as opposed to randomly. The reason for this is that bugs in these areas are often timing-dependent. I think it's likely this tendency to fail randomly will be the case whether or not the tests are trying to explicitly test the cache or are just incidentally doing so in the course of other things.
I am not familiar with the loader and caching code in webkit, but I
know enough about similar problem spaces to be puzzled by why it's
impossible to write tests that can adequately test the code. Is the
caching disk-based, and is running tests in parallel maybe screwing
with things? If so, then maybe the fact that we now run tests in
parallel is why this is a problem now and hasn't been before? Or maybe
the problem is that a given process doesn't always see the same tests
in the same order?
> Unfortunately, it's very tempting when a test is failing randomly to blame the test rather than to investigate whether there is an actual regression affecting it. And sometimes it really is the test's fault. But sometimes it is a genuine bug in the code.
> On the other hand, nondeterministic test failures make it harder to use test infrastructure in general.
> These are difficult things to reconcile. The original philosophy of WebKit tests is to test end-to-end under relatively realistic conditions, but at the same time unpredictability makes it hard to stay at zero regressions.
Exactly. Personally, the cost of unpredictability in the test
infrastructure is so much higher than the value we're getting
(implicitly) that this is a no-brainer to me. There are some tradeoffs
(like running tests in parallel) that are worth it, but this isn't one
of them. I am happy to explain further my thinking and standards if
anyone is interested.

Hopefully that partially answers Alexey's questions about where we
should draw the line in trying to make our tests deterministic and
hermetic: do everything you reasonably can. We're not picking on the
loader and cache code in particular here.
> I think making different ports do testing under different conditions makes it more likely that some contributors will introduce regressions without noticing, leaving it for others to clean up. So it's regrettable if we go that way because we are unable to reach consensus.
I agree that it is bad to have different ports behaving differently,
and I would like to avoid that as well. I don't want any port
suffering from flaky tests, but I also don't think it's reasonable to
have one group force that on everyone else indefinitely, either.
I am also fine with having some way to test the system more
non-deterministically so as to expose more bugs, but that needs to
be clearly separated from the other testing we do; otherwise it is an
unfair cost to impose on the rest of the system, one we should
tolerate only if we have no other choice. We have other choices.
> Creating some special opt-in --antti mode would be even worse, as it's almost certain that failures would creep into a mode that nobody runs.
This comment (and Antti's suggestion, below) makes me think that you
didn't understand my "virtual test suite" suggestion; that's not
surprising, since Apple doesn't actually use this feature of NRWT yet.
A virtual test suite is a way of saying (re-)run the tests under
directory X with command-line flags Y and Z, and put the results in a
new directory. For example, Chromium runs all of the tests in
fast/canvas twice, once "normally" using the regular software code
path, and once with a command-line flag for
--enable-accelerated-2d-canvas that forces things through the gpu
accelerated code paths (using osmesa for emulation).
So, all you would have to do would be to identify which tests you'd
like to run (or re-run) w/ caching enabled, add a command line flag,
and add two lines of code to NRWT.
This isn't a separate "opt-in --antti mode"; these tests are run twice
on every single run on the bots in every single config. You can keep
separate baselines for them or re-use the existing baselines, and you
can have separate TestExpectations (they aren't currently
reused/inherited, but that would also be easy to fix).
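For illustration only (the class and function names below are hypothetical stand-ins, not NRWT's actual API), a virtual test suite entry is essentially a (prefix, base directory, extra flags) triple that the harness expands into a second run of the matching tests:

```python
# Illustrative sketch of the "virtual test suite" idea. The names here
# (VirtualTestSuite, expand_tests, the flag) are hypothetical, not
# NRWT's actual API.
from dataclasses import dataclass
from typing import List

@dataclass
class VirtualTestSuite:
    prefix: str      # results/baselines go under virtual/<prefix>/...
    base: str        # directory of real tests to re-run
    args: List[str]  # extra command-line flags passed to DRT/WTR

def expand_tests(real_tests: List[str],
                 suites: List[VirtualTestSuite]) -> List[str]:
    """Return every real test plus a virtual copy for each suite
    whose base directory matches that test's path."""
    tests = list(real_tests)
    for suite in suites:
        for t in real_tests:
            if t.startswith(suite.base):
                tests.append('virtual/%s/%s' % (suite.prefix, t))
    return tests

# Hypothetical example: re-run http/tests/cache with a flag that keeps
# the cache warm across tests.
suites = [VirtualTestSuite('cache-enabled', 'http/tests/cache',
                           ['--enable-cache-between-tests'])]
print(expand_tests(['http/tests/cache/load.html',
                    'fast/js/basic.html'], suites))
```

The point is just that the marginal cost of a cache-enabled re-run of a chosen directory is a small, declarative addition to the port's Python code, and it runs on every bot in every config rather than in a mode nobody exercises.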
> What I personally would most wish for is good tools to catch when a test starts failing nondeterministically, and to identify the revision where the failures began. The reason we hate random failures is that they are hard to track down and diagnose. But some types of bugs are unlikely to manifest in a purely deterministic way. It would be good if we had a reliable and useful way to catch those types of bugs.
This is a fine idea -- and I'm always happy to talk about ways we can
improve our test tooling, please feel free to start a separate thread
on these issues -- but I don't want to lose sight of the main issue.
It sounds like we've identified three existing problems - please
correct me if I'm misstating them:
1. There appears to be a bug in the caching code that is causing tests
for other parts of the system to fail randomly.
2. DRT and WTR on some ports are implemented in a way that is causing
the system to be more fragile than some of us would like it to be, and
there doesn't seem to be an a priori need for this to be the case;
indeed some ports already don't do this.
3. We apparently don't have dedicated test coverage for caching and
the loader that people think is good enough, and getting such tests
might be "hard".
I would like for us to solve all three of these problems; solving only
one of them is at best a partial solution. I can trivially solve
(2). While I might be able to solve (1) and (3) given enough time
and dedication, I am hardly the best person to do so, nor is Ami.
And while I am sensitive to the idea that solving (2) might cause us
to miss test coverage, I have explained that that's a tradeoff I'm
perfectly fine with -- across all ports. Others might not be, and if
they don't want me to solve that problem on their port (yet, or even
at all), I won't.
So unless someone can convince me that there is actually a plan and a
timeline for resolving (1) and (3) that we can expect to happen and
that I should just wait a little while longer, I plan to R+ Ami's
change so he can land it for the ports that do want it. I believe we
are inflicting more harm on the project as a whole by not doing so.