An update on new-run-webkit-tests
Hi all,

I am getting increasingly close to having new-run-webkit-tests running correctly on the Apple Mac platform with full feature parity to old-run-webkit-tests. (And, of course, it continues to run on all of the Chromium bots as well.) I am hoping to close the remaining issues blocking full parity over the next couple of weeks.

You can track the issues where NRWT still falls behind ORWT here:

https://bugs.webkit.org/show_bug.cgi?id=34984

Note that I expect some tests to fail or be flaky under NRWT when it is running multiple processes/threads at a time. Running that way can both expose test dependencies on the environment and change the order of tests that are run by a given DumpRenderTree instance, and I don't consider those to be bugs that should stop people from using NRWT. The new tool has a "test_expectations" mechanism that can be used to track expected failures, and we should use it (IMO).

That said, if there are significant issues making the tool unstable, causing lots of tests to fail, or just missing functionality, now's the time to let me know.

Also, for anyone building / maintaining WebKit ports other than the Chromium and Apple ones, I strongly encourage you to take another look at getting NRWT running on your implementations. I hope to at least make some attempt to run some of the GTK and Qt bots myself, but I'm not about to sign up to test every build variant personally :)

Note that I believe that NRWT is fully usable today and people should seriously start using it. That said, I believe the following bugs should be fixed before we attempt to switch the Apple Mac port over. You may wish to wait for these fixes before switching your port over as well:

56731 new-run-webkit-tests doesn't support sample-on-timeout
56729 new-run-webkit-tests doesn't support WebKit2
55907 new-run-webkit-tests should upload crash logs
37739 new-run-webkit-tests does not report stderr output
37736 new-run-webkit-tests output is different from (and worse than) original run-webkit-tests

There are also a number of issues keeping the Apple Win port from working at the moment that I plan to work on shortly.

There are also a number of bugs currently listed as blocking that I don't think really qualify. Unless told otherwise, I'm planning to remove the blocking flag from the following on Monday 4/11 (if they haven't been fixed first):

57640 [GTK] overlapping drag&drop tests fail on NRWT
55909 new-run-webkit-tests --run-singly option is busted
55163 new-run-webkit-tests: enable multiple processes by default on Chromium Win
47240 new-run-webkit-tests: getting an "error 2" back from ImageDiff
37426 new-run-webkit-tests should use the ServerProcess abstraction in chromium.py
37007 fast/tokenizer/doctype-search-reset.html fails when run out of order (new-run-webkit-tests)
35359 fast/repaint/renderer-destruction-by-invalidateSelection-crash.html fails intermittently
35266 new-run-webkit-tests --platform=mac-leopard timeout limit should match run-webkit-tests
35049 http/tests/security/cross-frame-access-put.html fails intermittently under new-run-webkit-tests
35006 fast/dom/global-constructors.html is failing based on previous tests

Also, just because I don't think they should block a cutover, I do still think they should be fixed, so don't worry about that :)

Lastly, over the next couple of weeks I also plan to produce some better documentation for both how to use NRWT and how the code itself is structured, for anyone who wants to hack on it or port it to new platforms.

Comments definitely welcome,

-- Dirk
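For concreteness, the "test_expectations" mechanism mentioned above takes one line per test, listing the outcomes the port currently accepts. A sketch in the Chromium-era syntax, from memory; the exact keywords and modifiers may differ by version, and the outcomes shown below are invented for illustration (the bug numbers and test names are taken from the lists in this thread):

    BUGWK35006 MAC : fast/dom/global-constructors.html = TEXT PASS
    BUGWK35049 : http/tests/security/cross-frame-access-put.html = PASS TEXT
    BUGWK12345 WIN DEBUG : fast/some/hypothetical-test.html = TIMEOUT

Listing more than one outcome marks a test as flaky; the test keeps running, and the harness only complains when a result falls outside the listed set.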
On Wed, Apr 6, 2011 at 7:39 PM, Dirk Pranke <dpranke@chromium.org> wrote:
[...]
57640 [GTK] overlapping drag&drop tests fail on NRWT
Minor revision ... this one might actually qualify as a blocker :)

-- Dirk
On Apr 6, 2011, at 7:39 PM, Dirk Pranke wrote:
There are also a number of bugs currently listed as blocking that I don't think really qualify. Unless told otherwise, I'm plannning to remove the blocking flag from the following on Monday 4/11 (if they haven't been fixed first):
[...]
Also, just because I don't think they should block a cutover, I do still think they should be fixed, so don't worry about that :)
I think the ones that represent tests newly failing or becoming flaky should be fixed before cutting over. We wouldn't want to lose test coverage when we do the switch, right?

Regards,
Maciej
On Wed, Apr 6, 2011 at 9:01 PM, Maciej Stachowiak <mjs@apple.com> wrote:
[...]
I think the ones that represent tests newly failing or becoming flaky should be fixed before cutting over. We wouldn't want to lose test coverage when we do the switch, right?
Hi Maciej,

I'm not sure I understand you, but if I do, this is what I was attempting to talk about in the paragraph above, about expecting some tests to be flaky or failing under NRWT simply because NRWT isn't exactly identical to ORWT. NRWT may be exposing bugs in the code that ORWT didn't trigger (e.g., because tests ran in a slightly different order, or because of concurrency issues).

It may be that you're thinking that either we run the test and it fails, or we put the test in the Skipped file, because that was our only choice with ORWT. In the new system, we can mark the test as expected to fail in a particular way, but continue to run it (ensuring that the test doesn't get worse, and maintaining coverage).

Certainly running both systems in parallel for a while and shaking out bugs that the NRWT bots reveal prior to cutting over is a good idea, but I don't know that it's realistic to target all tests passing 100% of the time prior to cutover. Then again, it may be that I'm more used to the Chromium bots, where we have a large number of tests that aren't expected to pass for one reason or another; the Apple Mac port may be more stable and easier to converge on.

Does that address your concerns?

And, just to be clear, I am not presuming to decide when anyone can or should cut over (besides Chromium, of course). It's up to the respective bot owners to decide to reconfigure their bots and switch over if and when they're ready to do so. I'm just trying to make it look appealing :)

-- Dirk
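To make "expected to fail in a particular way, but continue to run it" concrete, here is a toy sketch of the bookkeeping involved. This is illustrative Python only, not NRWT's actual code; the test name comes from this thread, and the outcome keywords are examples:

    # Toy illustration, not NRWT's actual implementation. A result only
    # counts against the run when it is *unexpected*.
    EXPECTATIONS = {
        # test -> outcomes we currently accept (several outcomes = flaky)
        'fast/dom/global-constructors.html': {'PASS', 'TEXT'},
    }

    def is_unexpected(test, outcome):
        """Flag a regression only when a result falls outside expectations."""
        return outcome not in EXPECTATIONS.get(test, {'PASS'})

    # The test still runs every time, so a new failure mode still surfaces:
    assert not is_unexpected('fast/dom/global-constructors.html', 'TEXT')
    assert is_unexpected('fast/dom/global-constructors.html', 'CRASH')

Unlike a Skipped file, coverage is retained: a crash, or a different kind of diff than the one recorded, is still reported immediately.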
On Apr 6, 2011, at 10:33 PM, Dirk Pranke wrote:
I'm not sure I understand you, but if I do, this is what I was attempting to talk about in the paragraph above, about expecting some tests to be flaky or failing under NRWT simply because NRWT isn't exactly identical to ORWT. NRWT may be exposing bugs in the code that ORWT didn't trigger (e.g., because tests ran in a slightly different order, or because of the concurrency issues).
It may be that you're thinking that either we run the test and it fails, or we put the test in the Skipped file, because that was our only choice with ORWT. In the new system, we can mark the test as expected to fail in a particular way, but continue to run it (in order to ensure that the test doesn't get worse and maintaining coverage).
I think if there are changes in test behavior that give worse test results (either failures or flakiness), those should be fixed before cutting over. If the new test tool causes more failures, or worse yet causes more tests to give unpredictable results, then that makes our testing system worse.

The main benefit of new-run-webkit-tests, as I understand it, is that it can run the tests a lot faster. But I don't think it's a good tradeoff to run the tests a lot faster on the buildbot if the results we get will be less reliable. I'm actually kind of shocked that anyone would consider replacing the test script with one that is known to make our existing tests less reliable.

I don't really care why tests would turn flaky. It's entirely possible that these are bugs in the tests themselves, or in the code they are testing. That should still be fixed. Nor do I think that marking tests as flaky in the expectations file means we are not losing test coverage. If a test can randomly pass or fail, and we know that and the tool expects it, we are nonetheless not getting the benefit of learning when the test starts failing.
Certainly running both systems in parallel for a while and shaking out bugs that the NRWT bots reveal prior to cutting over is a good idea, but I don't know that it's realistic to target all tests passing 100% of the time prior to cutover. Then again, it may be that I'm more used to Chromium bots where we have a large number of tests that aren't expected to pass for one reason or another, and the Apple Mac port will be more stable and easier to converge on.
OK, but we are not talking about future discoveries here. We are talking about problems that have been in bugzilla for a year or so. And we're talking about (for now) a relatively short list. Of course, once we do a test run we may discover there are more problems, but I don't think that gives us license to ignore the problems we already know about.
Does that address your concerns?
Not really!

Regards,
Maciej
Hi Maciej,

Thanks for clarifying your concerns. I will address them a little out of order, because I think we probably agree on the important things even if we disagree on the less-important or more theoretical things.

First, as far as the "future" discoveries go, I agree we should try to fix as many of the known issues as possible before cutting over. It may be that many or all of them have already been fixed, as I still need to verify some of these bugs. I definitely think the best way forward is to get NRWT bots up and see how things are working in practice; that way we should have a lot of data to tell us about the pros and cons of each tool.

That said,

On Thu, Apr 7, 2011 at 4:40 AM, Maciej Stachowiak <mjs@apple.com> wrote:
If the new test tool causes more failures, or worse yet causes more tests to give unpredictable results, then that makes our testing system worse. The main benefit of new-run-webkit-tests, as I understand it, is that it can run the tests a lot faster. But I don't think it's a good tradeoff to run the tests a lot faster on the buildbot, if the results we get will be less reliable. I'm actually kind of shocked that anyone would consider replacing the test script with one that is known to make our existing tests less reliable.
Ideally, a test harness is stable, consistent, and fast, exposes as many bugs as possible, and exposes those bugs in a way that is as reproducible as possible, and we should be shooting for that. But, just as our code should be bug-free but isn't, the test harness may not be able to be ideal either, at which point you'll probably prioritize some aspects of its behavior over others.

For example, ORWT runs the tests in the same order every time, in a single thread, and uses a long timeout. This makes the test results very stable and consistent, at the cost of potentially hiding some bugs (tests getting slower but not actually timing out, or tests that depend on previous tests having run). NRWT, at least the way we run it by default in Chromium, uses a much shorter timeout and of course runs things in parallel. This exposes those bugs, at the cost of making things appear flakier.

We have attempted to build tooling to help with this, because we generally value finding more bugs over completely stable test runs. For example, we have an entirely separate tool called the "flakiness dashboard" that can help track the behavior of tests over time. So, your "less reliable" might actually be my "finding more bugs" :)

NRWT has at least a couple of hooks that ORWT does not for helping to tune to your desired preferences, and we can configure them on a port-by-port basis.
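As a rough illustration of the ordering point: with N worker processes, tests are interleaved across children, so the sequence any single child sees differs from a serial ORWT run, which is exactly what can expose order-dependent tests. A minimal sketch, not NRWT's actual scheduler; run_one is a hypothetical stand-in for launching DumpRenderTree:

    # Minimal sketch, not NRWT's actual scheduler.
    import multiprocessing

    def run_one(test):
        # Hypothetical stand-in for "feed the test to DumpRenderTree,
        # collect and compare its output".
        return (test, 'PASS')

    def run_tests(tests, num_workers=4):
        # Workers pull tests concurrently, so per-child test order (and
        # hence any hidden inter-test dependency) differs from a serial run.
        with multiprocessing.Pool(num_workers) as pool:
            return pool.map(run_one, tests)

    if __name__ == '__main__':
        print(run_tests(['test-%d.html' % i for i in range(8)]))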
I don't really care why tests would turn flaky. It's entirely possible that these are bugs in the tests themselves, or in the code they are testing. That should still be fixed.
Of course. I'm certainly not suggesting that we shouldn't fix bugs. But, practically speaking, we are obviously okay with some tests failing, because we list them in Skipped files today. It may be that some of those tests would pass under NRWT, and we don't know that either (because NRWT re-uses the existing Skipped files as-is; at some point we might want to change this).

You mentioned that the "main benefit" of NRWT is that it is faster, but another benefit is that you can classify the expected failures better using NRWT, and detect real changes in the failure. If a test that used to render pixels incorrectly now actually has a different render tree, we'll catch that. If it starts crashing, we'll catch that. ORWT only has "run and expect it to pass" or "skip and potentially miss something changing". I think I actually consider this more useful than the speed improvements.
Nor do I think that marking tests as flaky in the expectations file means we are not losing test coverage. If a test can randomly pass or fail, and we know that and the tool expects it, we are nonetheless not getting the benefit of learning when the test starts failing.
See above. The data is all there, and it's somewhat a question of what you want to surface where. We have largely attempted to build tools that get the best of both worlds. Maybe this is partially a question of what you would like the main waterfall and consoles to tell you, and perhaps I do not fully understand how different ports would answer this question?

-- Dirk
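As an illustration of the kind of surfacing the flakiness dashboard enables (a hypothetical sketch, not the actual tool; the outcome history below is invented), keeping each test's recent outcomes makes flakiness a measurable quantity rather than a surprise:

    # Hypothetical sketch, not the real flakiness dashboard. Given an
    # outcome history per test, quantify how flaky each test has been.
    from collections import Counter

    history = {
        'http/tests/security/cross-frame-access-put.html':
            ['PASS', 'PASS', 'TEXT', 'PASS', 'TEXT', 'PASS'],
    }

    def flakiness(outcomes):
        """Fraction of runs disagreeing with the most common outcome."""
        (_, most_common), = Counter(outcomes).most_common(1)
        return 1.0 - most_common / len(outcomes)

    for test, outcomes in history.items():
        print('%s: %.0f%% flaky' % (test, 100 * flakiness(outcomes)))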
Great to see the progress with new-run-webkit-tests! We have been testing the new script for a while now on the QtWebKit port:

http://webkit.sed.hu/buildbot/waterfall?show=x86-32%20Linux%20Qt%20Release%2...

We do not have a test_expectations file yet, but this should change soon.

On Thu, 07 Apr 2011 04:39:38 +0200, Dirk Pranke <dpranke@chromium.org> wrote [...]
37736 new-run-webkit-tests output is different from (and worse than) original run-webkit-tests [...]
I think this issue makes the test results much less human-readable compared to old-run-webkit-tests. In the output of NRWT it is much harder to find the information you need, partly because the stdio output is more verbose and arbitrary. But I see work is going on there.

BR,
Andras