[webkit-dev] Process for importing third party tests

Wed Apr 25 14:18:06 PDT 2012

On Mon, Apr 23, 2012 at 4:13 PM, Jacob Goldstein <jacobg at adobe.com> wrote:
> At the recent WebKit Contributors Meeting, a process was drafted for
> importing third party tests into WebKit.
>
> I created a wiki page that captures the process that we came up with here:
>
> http://trac.webkit.org/wiki/ImportingThirdPartyTests
>
> We'd like to get more input from the community on all aspects of this
> process.
>
> Please review and lets discuss further.
>

Hi Jacob,

I've only had a chance to glance over the document thus far, but thanks
for writing it up!

Two initial comments:

> 1a. Import entire test suite(s) from W3C repository: pixel tests, ref tests, manual tests, JS tests

You should probably change "Import" to "download" or something more
descriptive here, since the whole process is an "import" :)

> 1b. Run suite locally in WebKit directory
>   * Ref Tests
>     * Pass - good, submit it
>     * Fail - create a testName-expected-failure.txt file

We don't currently generate an "-expected-failure.txt" for reftests;
there's no concept of a baseline for a reftest at all.

Put differently, assuming the normal WebKit model of "the baseline is
what we currently produce", we don't actually have a way of capturing
"what do we currently produce" for a reference test.

I am also more than a little leery of mixing -expected.{txt,png}
results with -expected.html results; I feel like it would be very
confusing, and it would lose many of the advantages of reftests, since
we'd presumably have to update the references as often as we update
pixel tests. [We could create a "fake" reftest that just contained the
800x600 pixel dump, but I'm not sure if that's better or not.]

> ii. DRT / Pixel tests
>   * Expectations: create render tree dumps for each test and use that output as the new test expectation
>     * Potential regressions will be identified by changes to this output
>   * Proposal (open to discussion) - stop the production of .PNGs for these imported tests
>     * PROS
>       * Avoid the increase to the overall size of the repository caused by all the new PNGs
>       * Regressions will likely be caught by the render tree dumps
>       * Avoid maintenance of all associated PNGs
>     * CONS
>       * Regressions may be missed without the use of .PNGs
>       * May make test results harder to interpret

I'm not particularly a fan of this. I think each port should follow
its own convention for whether or not to run pixel tests; i.e., if
Chromium normally runs pixel tests, it should run all of these tests
as pixel tests; if Gtk doesn't, then they should just check the render
trees here as well.

Also, I was under the impression that (a) the W3C is mostly focused on
ref tests going forward and (b) we had agreed in that discussion that
we wouldn't import non-ref tests? Did something change in a discussion
after that session?

> iii. JavaScript tests
>   * Pass - good, submit it (along with new expected.txt file - W3C does not use an expected.txt file for JS tests)
>   * Fail - Add to test_expectations file to avoid failures
>     * Over time, individual can clean up failing JS tests

If they don't have expected.txt files, how is failure determined?

Why would we want to add failures to test_expectations.txt here but
not for pixel tests or reftests? If anything, these text-only tests
are *more* amenable to checking in the "what we do now, even if it's
wrong" expectation.

So, it seems like we have three different kinds of tests that you are
suggesting we treat three different ways. You can probably guess that
I don't like that :).

> iv. Manual tests
>  * Submit in their current form
>    * Over time, convert to ref tests to be submitted back to W3C

I don't know what "submit in their current form" means ... doesn't
submitting have to do with exporting tests (i.e., importing into the
w3c repos), and we're talking about importing tests?

Are Manual tests somehow different from the other non-ref tests?

> 1. How should W3C tests that fail in WebKit be handled?
>   a. Failures should be checked in. Details in General Import Process above.

We discussed this in the session, but I don't see this in the notes; I
would really like for us to move to a model in our repo where it's
possible to look at the filename for the baselines and determine
whether the baseline is believed to be correct, incorrect, or unknown,
in addition to capturing what we "currently do" (these are independent
axes).
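
To make the two axes a bit more concrete, here is a minimal Python
sketch of what "determine it from the filename" could look like. The
"-passing" and "-failing" markers are purely hypothetical names that I
made up for illustration (no naming scheme has been agreed on), and
classify_baseline is not an existing function in our tools.

```python
import re

# Hypothetical status markers, invented for illustration only.
# A plain "-expected" baseline records what we currently produce
# without saying whether that output is believed to be right.
_STATUS = {"": "unknown", "-passing": "correct", "-failing": "incorrect"}

_PATTERN = re.compile(
    r"^(?P<test>.+)-expected(?P<status>-passing|-failing)?\.(?P<ext>txt|png)$"
)


def classify_baseline(filename):
    """Return (test name, believed correctness, extension), or None
    if the filename does not look like a baseline at all."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    status = _STATUS[match.group("status") or ""]
    return match.group("test"), status, match.group("ext")
```

With something like this, a reviewer could tell at a glance that
"foo-expected-failing.txt" is what we currently do and is also
believed to be wrong.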

This might be a separate discussion -- and of course there are
complications that arise with this -- but I would like to establish it
before we go too far down the import path ... in particular, I think it
will be difficult to convince the chromium devs to move fully off
their current model of "checked in files are correct; if we currently
do something different, we suppress that".

> 2. Should a set frequency be used for importing tests?
>   a. No, frequency is up to the people who want to do this task.

I'm fine w/ this

> 3. Can the approval process for previously reviewed W3C tests be streamlined?
>   a. No, the process should be proscribed
>   b. The intent is for the reviewer to confirm that the following type of actions were performed correctly: correct suites were imported, tests were run, updates made to test files, new output files generated, test_expectations updated, full test run after patch is clear of errors, etc.
>

I don't think I understand this; in particular, I don't understand
your use of "proscribed" here. What are you trying to address?

> 4. Should other tests (from Mozilla, Microsoft, etc.) continue to be imported?
>   a. Still open to discussion.
>   b. One proposal: Yes, but only from W3C. We should encourage vendors to contribute their tests to W3C, we will then only import from W3C
>     * Under this scenario - would we wait until tests are approved (could be a long process) or import submitted tests as well?
>     * If we import submitted tests, would they go into a special directory to indicate that they are not yet approved?
>     * Import script would need to handle the case where a test goes from submitted to approved

I actually had something different in mind in the session ... while I would
encourage all of the test suites to end up in the W3C, I think it
would be unfortunate if we are gated on their approval process. I
would like to require that the tests use the W3C's test formats, but
if they do I think it could be reasonable to import them directly
while waiting for the W3C to sign off on them. Of course, the name
space and directory locations should be kept separate to be clear
about their origin.

> 5. Should W3C pixel tests be imported?
>   a. Yes. We should import entire test suites no matter what type of tests they contain

As I mentioned once already, I'm not comfortable with this as such a
blanket statement. Maintaining the tests (especially if they're not
ref tests) incurs a cost, and we should be clear that the value we get
from the tests outweighs that cost. In particular, I thought we had
agreed not to import tests that relied on manual verification?

I would also like for us to have some sort of process to identify the
overlap between a new test suite and tests that we already have -
where we can remove our tests because an official suite gets the same
coverage, that would be great.

> 6. How should we identify imported test suites?
>   a. Use a new directory structure

If we need to create new baselines, and the new baselines are portable
(generic to all webkit implementations, at least, even if IE or FF
does something else), I would like to ask that we keep these baselines
in a new directory *separate* from the tests and *separate* from the
existing platform-specific directories. Think of this as a
"platform/webkit" that everyone would have in their fallback path
after their own platform-specific directories but before looking
alongside the test.

I really don't want to have to look at a given directory and wonder
which files in it came from upstream and which didn't.
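
A rough Python sketch of the lookup order I have in mind follows. The
function name and the example directories are hypothetical, and this
is only an illustration of the ordering, not a description of how the
test runner is actually implemented; "platform/webkit" is the shared
directory proposed above.

```python
import posixpath


def baseline_candidates(test_path, platform_dirs, suffix="-expected.txt"):
    """Return the baseline locations to try for one test, in order.

    platform_dirs is the port's own list of platform-specific
    directories, most specific first (hypothetical example values).
    """
    root, _ = posixpath.splitext(test_path)
    baseline = root + suffix
    # 1. the port's own directories, 2. the shared "platform/webkit"
    # directory, 3. finally the location alongside the test itself.
    search_dirs = list(platform_dirs) + ["platform/webkit"]
    candidates = [posixpath.join(d, baseline) for d in search_dirs]
    candidates.append(baseline)
    return candidates
```

For example, baseline_candidates("css3/foo/bar.html",
["platform/chromium-mac", "platform/chromium"]) would try the two
chromium directories, then platform/webkit, then the test's own
directory.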

>   b. Start putting imported tests into this structure, but ultimately move all existing tests into new hierarchy (i.e. there are some existing directories that could be collected under a new, top-level CSS directory - e.g. flexbox)

By "ultimately move all existing tests", I assume you're including
tests that are currently in LayoutTests that have not come from (or
been submitted to) the W3C, e.g., the tests in fast/ ?

I think it would be a noble goal to rearrange our existing test
hierarchy so the purpose of each directory was more well-defined, but
I kinda feel like that's a whole different discussion, and probably
shouldn't be mixed into this.

> 7. Should we create a single, centralized test_expectations file that covers all common failures?
>   a. Individual ports would still have their own test_expectations file for port-specific failures. This would facilitate excluding failing imported tests (e.g. JavaScript tests)

I think you're missing a "yes, but" at the beginning of your answer.
In other words, you're suggesting that we should have a webkit-wide
test_expectations file *plus* each port could have one or more
additional files for port-specific expectations. I'm good with that
(and almost have the implementation needed for this done, independent
of this whole discussion). I think this point is largely independent
of the rest of the "import w3c test suite" discussion, except that we
might want to make it a necessary precondition.

Note however that I think the verdict is still a bit out on how many
different test_expectations.txt files we should have. My current
thinking is that we should have one generic file plus one file per
implementation (qt/gtk/efl/chromium/apple, etc.) but that multiple
platform/os-specific variants should share the same file (e.g., we
have one "chromium" test_expectations.txt rather than one chromium
plus one chromium-mac-lion plus one chromium-mac-leopard). I think
this is a bit easier to maintain than the variant-specific Skipped
files. I am open to further discussion on this, though, and this is a
tangent anyway.
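
As a sketch of that layout in Python: one generic file plus one file
per implementation, shared by all of that implementation's variants.
The file locations and the assumption that variant names look like
"<implementation>-<os>-<version>" are hypothetical, chosen only to
illustrate the idea.

```python
def expectations_files(port_name):
    """List the expectations files a port variant would read,
    generic file first.

    Assumes (hypothetically) that variant names start with the
    implementation name, e.g. "chromium-mac-lion" -> "chromium".
    """
    implementation = port_name.split("-", 1)[0]
    return [
        # Shared by every port (hypothetical location).
        "LayoutTests/test_expectations.txt",
        # One file per implementation, shared by all of its variants.
        "LayoutTests/platform/%s/test_expectations.txt" % implementation,
    ]
```

Under these assumptions "chromium-mac-lion" and "chromium-mac-leopard"
resolve to the same two files, which is the point: there is no
per-variant file to keep in sync.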

I hope this is helpful feedback. What do others think?

-- Dirk