[webkit-qt] The bot infrastructure and gardening.

Fri May 11 06:06:31 PDT 2012

Hi All,

Balazs Kelemen wrote:

[snip]

>> The biggest problem is that folks who don't use Ubuntu 11.10 get thousands of
>> failing tests because of minor font differences. In this case the best solution
>> isn't "I can't reproduce the results, so I won't run layout tests anymore."
>> It would be more valuable for the whole project if font(config) experts tried to
>> make WebKit, Qt, fontconfig or anything else use the same fonts. I don't know
>> if it is possible or not, I don't know anything about fonts. Is it possible somehow
>> to bundle a chosen fontconfig with Qt or WebKit and use it for regression testing
>> on all distros instead of sweating because of different system fontconfig versions?

> You are speaking about Linux, but it's not the only system where we want coverage.
It is a good idea to make the tests pass on Linux, Mac and Windows too. But it will
remain a dream as long as we can't make the tests pass on more than one Linux distro.

> For example on Mac fontconfig does not play a role in the font game. We could
> use it, but then we would lose the coverage for the real use case.
Right now the test coverage on Mac is terribly small: 66.2% (29579 tests/10000 skipped)
(3777 skipped on Qt 4.8 Linux - 87.1%)
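
As a quick sanity check of the Mac figure, here is a minimal Python sketch. It assumes that coverage simply means (total - skipped) / total, and the function name is made up for illustration. Only the Mac numbers are checked, because the total for the Qt 4.8 Linux run isn't given above.

```python
def coverage_percent(total_tests, skipped_tests):
    """Share of tests that are actually run, in percent (assumed definition)."""
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    return 100.0 * (total_tests - skipped_tests) / total_tests


# Mac numbers quoted above: 29579 tests, 10000 of them skipped.
mac = coverage_percent(29579, 10000)
print(f"Mac coverage: {mac:.1f}%")
```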

But this 66.2% test coverage doesn't mean anything now, because nobody watches
or maintains the Mac results. We skip new failing tests automatically, because
we don't have the manpower to maintain two sets of *expected.txt. And even if we had
more manpower, it would be impossible with the only Mac machine we have ...

Additionally, I can't remember anyone ever trying to reproduce the Mac
results we get from the Mac Lion builder and tester bot (hosted in Boston).

> Btw, there is some light in the dark land of fonts:
>     - I have done some work to unify test results between Linux and Mac,
> hopefully I can finish it in the near future.
>     - In Ubuntu 12.04, a strange bug has been fixed in freetype which made the Ahem
> font produce wrong metrics (WidthXHeight=NxN+1 instead of NxN). Ahem is used in a lot
> of tests in the css* directories. Currently our expectations are wrong, but if we fix
> them these metrics will match across distros (everybody has been using the newer freetype
> for a long time except our beloved, stable Debian :D )

Starting a distro war here is unfair. Who was talking about Debian, and how is this related
to Debian? You should know that the expected results match on Debian Squeeze, Ubuntu 11.10 and
Ubuntu 12.04. (But ~5-6 additional tests fail on Ubuntu 12.04.) Maybe on Ubuntu 11.04 too,
but I haven't checked it for a long time.

>> The missing gardening plan is not the only problem. In my
>> opinion introducing contribution rules would be more important.
>> For example:
>>  - Developers should build the patch and run tests before committing.
>>    (Or at least watch the bots after landing and fix/roll out quickly if
>>    something goes wrong)
>>  - What should I do if I broke the build / a layout test / API test ?
>>  - What should a gardener do if somebody doesn't care about the
>>    regression he/she caused ?
>>  - What should the boss do if somebody regularly and intentionally breaks
>>    the rules? :)
> 
> I have to protest a bit. As Ossy describes it, it's really simple and straightforward.
> When somebody breaks a test then it means his patch is buggy and he should find the error
> in his changes, and everything will be fine. In reality, this is not always the case. When
> you break a test, it could mean different things:
>
>     1. You did it wrong. Obviously you need to fix your patch.

I absolutely agree with 1.

>     2. There is a bug in the system that you triggered somehow (even with a change that is
> totally right on its own). Of course the right thing to do is to investigate the problem. But
> it could be very complex, maybe the bug exists in a different subsystem that you don't know
> well. I don't think it is always possible to find the manpower to fight these bugs.

This kind of bug can cause the biggest problems. Nowadays if a patch triggers a bug,
the author of the patch doesn't do anything but say: "My patch is correct."

This can be very dangerous. New flaky failures and crashes can make the regression testing
infrastructure unusable. Until they are fixed, we can't catch new regressions.
Or we can, but only with 2-3-4x the manpower. (And we don't have more manpower.)

>     3. There is a bug/imperfection in the test infrastructure that you triggered. Well, this is
> pretty annoying and relatively common. We should detect and solve these issues but it's not really
> fair to stop a good patch from landing until somebody fixes the tools. Note that working out of trunk
> on top of your previous work is possible but it's not fun because you have to struggle more with rebasing.

What do you mean? I don't remember many ORWT/NRWT bugs being triggered by a good patch.
If once in a while an absolutely good patch triggers a test infrastructure bug, our first priority
is fixing the test infrastructure. But leaving the tree broken for days and 100x commits isn't a
good idea in this case. It is no fun to hunt for the new regressions that occurred while the tree was broken. :-/
I think rolling out the good patch until a proper fix is ready (1-2-3 days) should be reasonable in this case.

>     4. You caused some change that is not really a bug.
> Like some pixel differences that actual users would not even notice. I would say if you make such
> a change then let's update the expectations, but it's not always possible since you cannot test your
> patch in each environment where we want coverage. (And if you don't use Ubuntu or Debian you
> cannot even produce results locally for Linux-desktop.)

We have arrived at the root problem again: the expected results aren't the same on every Linux distro.
And then folks say: "I can't reproduce the bot results, so I won't care about the layout tests."
That is the worst decision ...

In my opinion there are two possible solutions:
  - font(config) experts fix Qt/WebKit/... somehow to make (almost) all results match on the main Linux distros.
  - We (or some boss) choose one Linux distro, and all developers _must_ test their patches on it.

> After all, I think we should be careful about what rules we introduce.
> They should satisfy two requirements:
>     - we have to keep them. not just the first week, not just the first month, but always. :)
>     - they must not block the development too much. Who cares if we are
> rock stable if we cannot follow the evolution of the web?!

I absolutely agree with this. We have to keep the rules, but flexibly of course. :)
For example: there is a general WebKit rule that your patch must build on all platforms.
But everyone can make a mistake and accidentally break the build with a typo. If he/she
fixes it immediately or rolls out the buggy patch, it isn't a problem. But if the developer makes
a nasty mistake, he/she shouldn't expect others to fix it instead of simply rolling the patch out.
Of course if I have free time and the fix is trivial, I'll fix it, as I have done many times
in the past.

> I agree with Ossy in that we should allocate more effort to bug fixing
> / stabilisation but I don't agree that we should banish the
> skip list once and for all. Actually there is no stable port of WebKit
> where the skip list is unused. I would say, let's try to find a better
> balance between stability and the speed of development.

I didn't say that we should banish the skiplist, that is impossible. :) The main problem
is that it only keeps growing, because there haven't been many bug fixes in the past.

Let's be optimistic and see how the next bug fixing week goes. ;-)

[snip]

> Not strictly in connection to your points but another infrastructural thing:
> when will we be able to run tests in parallel? Is it reliable right now? Could we
> make it the default configuration of nrwt - except on bots, until it is really stable -
> so folks would not have to know the command line switch by heart (as far as I know it's
> not simple because you need to call the real nrwt and not the Perl wrapper and it's
> slightly different). It would be much more fun to run the tests before uploading / landing
> your patch if they did not run for years.

It is a good question. We are working hard to make it work reliably, but
unfortunately it isn't stable yet. Not stable enough for the buildbots.

After the following changes it works quite well (1-2 false crashes in 20-30% of NRWT runs):
https://trac.webkit.org/changeset/116134
https://trac.webkit.org/changeset/116211
https://trac.webkit.org/changeset/116212

Of course you can try it locally by passing the "--child-processes=4 --max-locked-shards=4"
options to Tools/Scripts/run-webkit-tests. (The second one is to run the http tests in parallel too.)
(And watch our experimental NRWT bot here: http://build.webkit.sed.hu/builders/x86-32%20Linux%20Qt%20Release%20NRWT )
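
If you don't want to remember the switches, the invocation can be wrapped in a few lines of Python. This is only a rough sketch: the helper name and the idea of using the same worker count for both options are my assumptions, only the script path and the two options come from the paragraph above. It just prints the command, so run it from the top of a WebKit checkout and paste the result.

```python
RUN_WEBKIT_TESTS = "Tools/Scripts/run-webkit-tests"


def nrwt_parallel_command(workers=4, extra_args=()):
    """Build the argument list for a parallel run (hypothetical helper)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [
        RUN_WEBKIT_TESTS,
        f"--child-processes={workers}",
        # Lets the http tests run in parallel too.
        f"--max-locked-shards={workers}",
        *extra_args,
    ]


# Print the command line instead of executing it.
print(" ".join(nrwt_parallel_command(4)))
```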

br,
Ossy