[webkit-dev] DRT/WTR should clear the cache at the beginning of each test?

Dirk Pranke dpranke at chromium.org
Mon Oct 29 11:45:51 PDT 2012


On Mon, Oct 29, 2012 at 5:48 AM, Maciej Stachowiak <mjs at apple.com> wrote:
>
> On Oct 28, 2012, at 10:09 PM, Dirk Pranke <dpranke at chromium.org> wrote:
>
>>
>> On Sun, Oct 28, 2012 at 6:32 AM, Maciej Stachowiak <mjs at apple.com> wrote:
>>>
>>> I think the nature of loader and cache code is that it's very hard to make tests which always fail deterministically when regressions are introduced, as opposed to randomly. The reason for this is that bugs in these areas are often timing-dependent. I think it's likely this tendency to fail randomly will be the case whether or not the tests are trying to explicitly test the cache or are just incidentally doing so in the course of other things.
>>>
>>
>> I am not familiar with the loader and caching code in WebKit, but I
>> know enough about similar problem spaces to be puzzled by why it's
>> impossible to write tests that can adequately test the code.
>
> Has anyone claimed that? I think "impossible to write tests that can adequately test the code" is not a position that anyone in this thread has taken, certainly not me above.
>
> My claim is only that many classes of loader and cache bugs, when first introduced, are likely to cause nondeterministic test failures. And further, this is likely to be the case even if tests are written to target that subsystem. That's not the same as saying adequate tests are impossible.

I'm sorry, I didn't mean "impossible" literally. Please strike that,
as it sounds like it has just made a confusing situation worse.

But you did claim that it would be "very hard to make tests that
always fail deterministically", and I don't see why that's true.
Testing things that are timing-dependent only requires that you be
able to control or simulate time. That may be hard to do with layout
tests, but it's pretty straightforward with unit tests that let you
control the layers above and below the cache.
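To make that concrete, here's a minimal sketch of what I mean. This is
a toy cache, not anything from WebKit's actual loader, and all of the
names are made up -- the point is just that a cache which takes its
notion of time from an injected clock can have its expiry behavior
tested deterministically, with no sleeps and no races:

  #include <cassert>
  #include <map>
  #include <string>

  // The cache asks this interface for the time instead of the OS.
  struct Clock {
      virtual ~Clock() = default;
      virtual double now() const = 0; // seconds
  };

  // Test double: time only moves when the test says so.
  struct FakeClock : Clock {
      double current = 0;
      double now() const override { return current; }
      void advance(double seconds) { current += seconds; }
  };

  // Toy cache with per-entry expiry, parameterized on the clock.
  class Cache {
  public:
      Cache(const Clock& clock, double ttlSeconds)
          : m_clock(clock), m_ttl(ttlSeconds) { }

      void put(const std::string& key, const std::string& value)
      {
          m_entries[key] = { value, m_clock.now() };
      }

      // Returns true and fills |out| only if the entry hasn't expired.
      bool get(const std::string& key, std::string& out) const
      {
          auto it = m_entries.find(key);
          if (it == m_entries.end())
              return false;
          if (m_clock.now() - it->second.insertedAt > m_ttl)
              return false;
          out = it->second.value;
          return true;
      }

  private:
      struct Entry { std::string value; double insertedAt; };
      const Clock& m_clock;
      double m_ttl;
      std::map<std::string, Entry> m_entries;
  };

  int main()
  {
      FakeClock clock;
      Cache cache(clock, /* ttlSeconds */ 60);
      cache.put("resource", "payload");

      std::string value;
      assert(cache.get("resource", value));  // fresh: always a hit

      clock.advance(61);                     // simulate time passing
      assert(!cache.get("resource", value)); // expired: always a miss
      return 0;
  }

The test decides exactly when "time" passes, so an expiry regression
fails on every run, not just occasionally on a slow bot.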

> It just means to have good testing of some areas of the code, we need a good way of dealing with nondeterministic failures.

This is backwards. If you *don't* have good testing, more of your
failures are likely to show up sporadically, which is what drives you
to build tools to cope with them. Randomized testing is a helpful tool
to use *alongside* focused testing to broaden coverage, but it should
not be used as a replacement for it.

>>
>>> What I personally would most wish for is good tools to catch when a test starts failing nondeterministically, and to identify the revision where the failures began. The reason we hate random failures is that they are hard to track down and diagnose. But some types of bugs are unlikely to manifest in a purely deterministic way. It would be good if we had a reliable and useful way to catch those types of bugs.
>>
>> This is a fine idea -- and I'm always happy to talk about ways we can
>> improve our test tooling, please feel free to start a separate thread
>> on these issues -- but I don't want to lose sight of the main issue
>> here.
>
> I think the problem I identified -- that it's overly hard to track down and diagnose regressions that cause tests to fail only part of the time -- is more important and more fundamental than any of the three problems that you cite below. Our test infrastructure ultimately exists to help us notice and promptly fix regressions, and for some types of regressions, namely those that do not manifest 100% of the time, it is not working so well. The problems you mention are all secondary consequences of that fundamental problem, in my opinion.

First of all, this isn't an either/or situation. We should be capable
of addressing all of these issues in parallel.

Second, I don't see how the existence of bugs in the code, the lack of
test isolation, or the lack of good test coverage for certain layers
of the code follows from not having good tools to triage intermittent
failures. That seems like putting the cart before the horse.

Third, are you familiar with the flakiness dashboard?

http://test-results.appspot.com/dashboards/flakiness_dashboard.html#group=%40ToT%20-%20webkit.org&builder=Apple%20Lion%20Debug%20WK1%20(Tests)

Does it not do exactly what you're describing? Are there things that
you would like added? If it would be helpful for us to have a meeting
or something to help explain how this works, I'm sure we could set one
up.

>
>  - Maciej
>
>>
>> It sounds like we've identified three existing problems - please
>> correct me if I'm misstating them:
>>
>> 1. There appears to be a bug in the caching code that is causing tests
>> for other parts of the system to fail randomly.
>>
>> 2. DRT and WTR on some ports are implemented in a way that is causing
>> the system to be more fragile than some of us would like it to be, and
>> there doesn't seem to be an a priori need for this to be the case;
>> indeed some ports already don't do this.
>>
>> 3. We apparently don't have dedicated test coverage for caching and
>> the loader that people consider good enough, and getting such tests
>> might be "hard".
>
> P.S. I do think your problem statements are somewhat tendentious and not really supported by evidence provided in the thread.

How so? Ami cited the bug that identifies the first two problems. The
third I took from the comments in this thread. I actually thought I
was being pretty neutral here.

> But even granting them as written, I don't think any of these is the "main issue".

I'm trying to make sure we don't lose sight of the problems that
motivated this whole discussion (again!). As long as we don't do that,
feel free to talk about tooling and raise other problems all you like.

-- Dirk

