[webkit-dev] Process for importing third party tests

Tue May 8 13:04:54 PDT 2012

Thanks for the comments, Ryosuke. My replies are inline ...

On Tue, May 8, 2012 at 12:42 PM, Ryosuke Niwa <contact at rniwa.com> wrote:
> On Wed, Apr 25, 2012 at 2:18 PM, Dirk Pranke <dpranke at chromium.org> wrote:
>>
>> > 1b. Run suite locally in WebKit directory
>> >   * Ref Tests
>> >     * Pass - good, submit it
>> >     * Fail - create a testName-expected-failure.txt file
>>
>> We don't currently generate an "-expected-failure.txt" for reftests;
>> there's no concept of a baseline for a reftest at all.
>
>
> We need to such a concept because we want to be able to do:
>
>> Put differently, assuming the normal WebKit model of "the baseline is
>> what we currently produce", we don't actually have a way of capturing
>> "what do we current produce" for a reference test.
>
>
>> I am also more than a little leery of mixing -expected.{txt,png}
>> results with -expected.html results; I feel like it would be very
>> confusing, and it would lose many of the advantages of reftests, since
>> we'd presumably have to update the reference every time as often as we
>> update pixel tests. [We could create a "fake" reftest that just
>> contained the 800x600 pixel dump, but I'm not sure if that's better or
>> not].
>
>
> I don't think we will lose advantages. Without some sort of the current
> result, we will not be able to catch regressions.
> According to Maciej, we've caught many regressions by using conformance
> tests this way.
>

I have no doubt that you can catch regressions by using conformance
tests this way. My concern -- which is expressed throughout and which
you seemed to either miss or downplay -- is that adding more tests
creates more work, especially for pixel tests (and it slows down build
and test cycles, obviously). I don't think we should just add tests
because they exist somewhere on the web; they may provide no
additional coverage beyond the tests we already have.

I feel like we need a stronger mechanism to either check that new test
suites do cover more functionality or we move to obsolete tests we
already have.

To be clear, I am all for importing test suites when we believe they
are comprehensive or do cover things we don't cover well now. But, for
example, rather than having four different test suites for flexbox, I
would rather see us have one good one.

>> > ii. DRT / Pixel tests
>> >   * Expectations: create render tree dumps for each test and use that
>> > output as the new test expectation
>> >     * Potential regressions will be identified by changes to this output
>> >   * Proposal (open to discussion) - stop the production of .PNGs for
>> > these imported tests
>> >     * PROS
>> >       * Avoid the increase to the overall size of the repository caused
>> > by all the new PNGs
>> >       * Regressions will likely be caught by the render tree dumps
>> >       * Avoid maintenance of all associated PNGs
>> >     * CONS
>> >       * Regressions may be missed without the use of .PNGs
>> >       * May make test results harder to interpret
>>
>> I'm not particularly a fan of this. I think each port should follow
>> its own convention for pixel tests or no. i.e., if Chromium normally
>> runs pixel tests, it should run all of these tests as pixel tests; if
>> Gtk doesn't, than they should just check the render trees here as
>> well.
>
>
> Right. I think we should just generate PNG files.
>
>>
>> Also, I was under the impression that (a) the W3C is mostly focused on
>> ref tests going forward and (b) we had agreed in that discussion that
>> we wouldn't import non-ref tests? Did something change in a discussion
>> after that session?
>
>
> No. We had agreed to import all tests regardless of whether they're reftests
> or not because non-reftests can still catch future regressions or
> progressions. And the number of PNG files added to the repository wasn't
> considered as a valid counter-argument due to this utility.

Well, I certainly didn't agree to it :) My concern is not the # of
PNGs so much as the cost of maintenance.

>> > iii. JavaScript tests
>> >   * Pass - good, submit it (along with new expected.txt file - W3C does
>> > not use an expected.txt file for JS tests)
>> >   * Fail - Add to test_expectations file to avoid failures
>> >     * Over time, individual can clean up failing JS tests
>>
>> If they don't have expected.txt files, how is failure determined?
>>
>> Why would we want to add failures to test_expectations.txt here but
>> not for pixel tests or reftests? If anything, these text-only tests
>> are *more* amenable to checking in the "what we do now, even if it's
>> wrong" expectation.
>
>
> I agree we should just generate expected.txt.
>
>> > iv. Manual tests
>> >  * Submit in their current form
>> >    * Over time, convert to ref tests to be submitted back to W3C
>>
>> I don't know what "submit in their current form" means ... doesn't
>> submitting have to do with exporting tests (i.e., importing into the
>> w3c repos), and we're talking about importing tests?
>>
>> Are Manual tests somehow different from the other non-ref tests?
>
>
> I think what he meant is to just import as is. (They won't work as intended
> but we can't really do anything about it).
>
>> > 1. How should W3C tests that fail in WebKit be handled?
>> >   a. Failures should be checked in. Details in General Import Process
>> > above.
>>
>> We discussed this in the session, but I don't see this in the notes; I
>> would really like for us to move to model in our repo where it's
>> possible to look at the filename for the baselines and determine
>> whether the baseline is believed to be correct, incorrect, or unknown,
>> in addition to capturing what we "currently do" (these are independent
>> axes).
>
>
> I support that. e.g. for reftests, we can have expected.html and
> expected-failure.png/txt.
>
>> > 3. Can the approval process for previously reviewed W3C tests be
>> > streamlined?
>> >   a. No, the process should be proscribed
>> >   b. The intent is for the reviewer to confirm that the following type
>> > of actions were performed correctly: correct suites were imported, tests
>> > were run, updates made to test files, new output files generated,
>> > test_expectations updated, full test run after patch is clear of errors,
>> > etc.
>>
>> I don't think I understand this; in particular, I don't understand
>> your use of "proscribed" here. What are you trying to address?
>
>
> Meaning that a full code review should take place.
>
>> As I mentioned once already, I'm not comfortable with this as such a
>> blanket statement. Maintaining the tests (especially if they're not
>> ref tests) incurs a cost, and we should be clear that the value we get
>> from the tests outweighs that cost. In particular, I thought we had
>> agreed not to import test that relied on manual verification?
>>
>> I would also like for us to have some sort of process to identify the
>> overlap between a new test suite and tests that we already have -
>> where we can remove our tests because an official suite gets the same
>> coverage, that would be great.
>
>
> See my comments above.
>
>> > 6. How should we identify imported test suites?
>> >   a. Use a new directory structure
>>
>> If we need to create new baselines, and the new baselines are portable
>> (generic to all webkit implementations, at least, even if IE or FF
>> does something else), I would like to ask that we keep these baselines
>> in a new directory *separate* from the tests and *separate* from the
>> existing platform-specific directories. Think of this as a
>> "platform/webkit" that everyone would have in their fallback path
>> after their own platform-specific directories but before looking
>> alongside the test.
>
>
> I don't understand your proposal about adding platform/webkit. Why do we
> want that? As far as I know, there are no files in W3C test directories that
> end with -expected.txt or -expected.png.

The idea would be that no webkit-specific files would live in the test
directory, only files received from upstream. My thinking was that it
would make importing new versions easier and it would be easier to
understand what was ours vs. what was theirs. I don't feel that
strongly about this, though, it was just an idea.

>> I really don't want to have to look at a given directory and wonder
>> which files in it came from upstream and which didn't.
>
>
> I don't think this will be a problem given our naming convention although
> I'm not entirely opposed to it.
>
>> >   b. Start putting imported tests into this structure, but ultimately
>> > move all existing tests into new hierarchy (i.e. there are some existing
>> > directories that could be collected under a new, top-level CSS directory -
>> > e.g. flexbox)
>>
>> By "ultimately move all existing tests", I assume you're including
>> tests that are currently in LayoutTests that have not come from (or
>> been submitted to) the W3C, e.g., the tests in fast/ ?
>
>
> Yes.
>

I think reorganizing our existing test tree is an entirely different
discussion. I'm all for it, I just don't want to confuse it with the
discussion about importing test suites.

-- Dirk