[webkit-qt] The bot infrastructure and gardening.

Thu May 10 08:01:05 PDT 2012

Hi All,

Alexis Menard wrote:
> Hi,
> 
> By reading the email of Simon about removing Qt4 I have seen there was
> plan to move to Amazon EC2.
> 
> State of art of gardening Qt :
> 
> - Mostly Ossy alone is gardener, which is unacceptable. Apple made a
> move towards improving their bots (when you see kling gardening it
> tells you they changed something), Google is already pretty good, GTK
> also, we need to be better. While at the summit people praised Qt bots
> being green all the time I do think it hides a terrible truth : our
> skiplist grows, grows, grows and nobody looks after it, which conflicts a
> bit with trying to release a stable trunk for Qt 5.0. How many mails
> we receive from Ossy complaining about quality?

It's not exactly true that I'm the only gardener. I work on gardening
together with my buildbot group. They are part-time developers, because
they are students with courses, exams, etc. Most of them aren't WebKit
committers yet, so their gardening patches are usually committed by me
or somebody else from here. (But you can find their names in the commit
logs and changelogs, of course. :) )

I have to agree with the other point: with this small group we only have the
resources (enough time) for fire-fighting: detecting which commit broke which
tests and who made it, updating expected results when needed, filing bug reports,
commenting on bugs, build fixes, etc. We don't have enough time to fix all the
bugs on behalf of those who caused them.

But gardening is very hard if most developers don't care about QA at all. When
I comment on a bug with "your patch broke X.Y. layout/API test, diff: ...", I
regularly get the question: "How can I run this test?" That isn't good, because it
means this developer has never run the tests before. (But everybody should before
committing.) Another problem is that many developers insist on keeping their buggy
patch in trunk, but don't care about fixing the bug. In this case all we can do is
skip the newly failing tests, because red bots with many failing tests would
make catching new regressions much more complex, sometimes impossible. But
in my opinion rolling out a buggy patch and relanding it after fixing would
cause less pain for everybody than a growing, growing and growing skiplist.
I don't know why folks hate rolling out patches. It doesn't mean that the
patch is wrong altogether. It isn't a death sentence for the patch or the author. :)
It only means that the patch caused some trouble/regression and should be fixed.
And fixing offline is less painful for others than leaving a buggy patch in trunk.
The Chromium guys usually roll out their own patches if they broke a test on the
Qt bot, before I even noticed. Really. We should follow their good practice. ;)

> - There is a huge delta machine-wise between what the bot is running and
> what people use to develop. The bot runs Ubuntu, many of us run
> ArchLinux/OpenSuse while some of us run Ubuntu. It leads to differences
> between what the bot produces and what you see on your machine.
> We have encountered many many many times people saying : "it passes on
> my machine but not on the bot" -> Added to the Skiplist because nobody
> can really see what's going on with the bot. Szeged tried their best to
> provide a virtual machine but it was a bit of a failure as the VM
> doesn't behave the same as the bot, and the VM behaves differently
> whether you run it on VMWare or VirtualBox.

Unfortunately the VMWare image wasn't the best solution. So we then
created a meta package for Ubuntu 11.10 which installs all the dependencies:
https://launchpad.net/~u-szeged/+archive/sedkit
With this meta package you can install a full QtWebKit development environment in an hour.

Now the direction is moving to an Amazon Ubuntu image. But I think it is still
papering over the problem. It is _very good_ (but expensive) for ensuring that
everybody can simply reproduce the bot results. But we don't develop for only one
platform. More platforms reveal more hidden and maybe serious bugs. If your patch
works fine on the one and only reference platform, it doesn't mean there isn't
any bug in it.

The biggest problem is that folks who don't use Ubuntu 11.10 get thousands of failing
tests because of minor font differences. In this case the best solution isn't to say
"I can't reproduce the results, so I won't run layout tests anymore." It would be
more valuable for the whole project if font(config) experts tried to make WebKit,
Qt, fontconfig or whatever else use the same fonts. I don't know if it is possible
or not, I don't know anything about fonts. Is it possible somehow to bundle a chosen
fontconfig with Qt or WebKit and use it for regression testing on all distros instead
of sweating because of different system fontconfig versions?

> - We don't have any gardening plan.

The missing gardening plan isn't the only problem. In my
opinion introducing contribution rules would be more important.
For example:
  - Developers should build the patch and run tests before committing.
    (Or at least watch the bots after landing and fix/roll out quickly if something goes wrong.)
  - What should I do if I broke the build / a layout test / an API test?
  - What should a gardener do if somebody doesn't care about the regression he/she caused?
  - What should the boss do if somebody regularly and intentionally breaks the rules? :)

> What could be improved :
> 
> - We need to make a gardening plan. We can't be serious about making
> web browsers/APIs without improving our coverage. I know we don't have
> many resources but I think it should be ok to have one person doing it
> for a week and then rotate. Really it's a week, maybe boring, but it's
> once every long time (almost one time every two-three months). This
> will make Ossy more free to do something else so Ossy can go back to
> proper coding. I can make that list if people agree. Also it needs to
> be enforced (maybe reviews could be the exception).

Gardening isn't so simple that one person can do it alone. One person can be
enough for fire-fighting: build fixes, updating expected files, reporting bugs,
fixing some trivial bugs. But it isn't enough to fix all the regressions caused
by others who don't take any responsibility, or regressions that occurred in a
part of WebKit you don't know anything about. Not to mention that there are many
complex tests where it isn't trivial to decide if the new result is correct or not.

I added our gardening timetable to this wiki:
https://trac.webkit.org/wiki/QtWebKitBuildBots

All new volunteers are very welcome. ;-) It would be great if you guys at INdT
could join, since you are close to the PDT timezone. Handling problems while they
are fresh is always simpler than waiting for the Hungarian morning and then trying
to solve dozens of new regressions, broken builds, assertions, flaky tests, ...

> - We need to be able to test/stress/break the bot environment. Today
> the fact that none of us can mess with the bot makes it hard to
> reproduce the failures of the bot that you can't see on your machine.
> While I do understand (and we don't want that) that Ossy doesn't give
> us the key to the bot, we still need to have one to mess around.

We have hacked on the layout test system many times in the past to make it able
to run more than one bot on the same 8-24 core machine. But this only works for
a single Linux user. We still have a strict limitation: another user trying to
run tests on the same machine can kill all the bots, so for now only one user is
allowed. Given this, it isn't a good idea for anybody to log in and hack on
something. When I have to do it I'm very, very careful, but sometimes I still
break everything accidentally.

> So if we are moving to EC2 could we create one instance there that would
> be the exact clone of the real bot (the only one running layout tests, WK2)
> and with a free access for devs? That would allow us to mess
> around/figure out problems, come up with a list of things to do on the
> main bot (that Ossy or whatever admin could do) and then we rollback
> the dev instance to a clone of the main one and it gets free for the
> next dev. 

EC2 is simpler. You don't have to roll back the instance you hacked on.
Just stop it and then delete it. The next developer starts a new instance.
Starting a new instance takes only several minutes. And you only pay for
runtime (all started hours) and a little bit for storage, IO and network.
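
As a rough illustration of the "all started hours" billing, here is a minimal
Python sketch. The hourly rate is a made-up placeholder, not a real EC2 price,
and the function names are only for illustration:

```python
import math

# HYPOTHETICAL_RATE_PER_HOUR is a placeholder for illustration, not a real EC2 price.
HYPOTHETICAL_RATE_PER_HOUR = 1.0


def billed_hours(runtime_minutes: float) -> int:
    """Every started hour is billed in full, so round the runtime up."""
    return math.ceil(runtime_minutes / 60)


def session_cost(runtime_minutes: float,
                 rate: float = HYPOTHETICAL_RATE_PER_HOUR) -> float:
    """Instance cost of one debugging session, ignoring storage, IO and network."""
    return billed_hours(runtime_minutes) * rate


# A 61-minute session is billed as two full hours.
print(billed_hours(61))
```

So a short debugging session on a throwaway instance stays cheap, as long as
the instance is stopped and deleted when the developer is done.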

But there are still some technical problems (policy, account, users, ...)
that need to be solved. We are working hard to find everything that is
necessary, and then I will talk to Simon about the details.

> This will allow every single of us to have the exact same
> environment with freedom to test what's going on. I know Linode
> supports cloning instances but does EC2 support it? Also Linode allows you
> to rollback the VM to any state you saved before (so you take the VM,
> save the state, do your testing, fix, then rollback for the next guy).

I don't know anything about Linode. But EC2 supports cloning AMI images.
We are going to maintain a master image, create a new AMI when it is
necessary, and then replace the bots with a few clicks instead of installing
all the new packages and security updates and updating Qt5 on each machine
separately.

> Also Ossy could you describe a bit that move to EC2? We're moving all
> the bots there (I'm not sure the build-only bots bring much
> value but whatever)?

I think moving all bots is impossible and absolutely unnecessary. Right now
we have only one EC2 instance (High-Memory Quadruple Extra Large) with
8 cores, and a clean build takes ~15 minutes. (It costs ~$7200 a year + IO +
network if you buy a reserved instance for a year.)

Migrating the build-only buildbots and EWS bots to Amazon would be absolutely
unnecessary and a waste of money ... Our build farm has 150-200 cores, and
building for all the testers isn't a big deal for it. And for testing, a
2-core machine is more than enough and is cheaper (~$850 a year). Now our
bots are two-in-one builders and testers, but separating them into a builder
and a tester is simple. Only the debug bots shouldn't be separated, because
uploading 1.2GB from Szeged to the Apple master and then downloading it from the
Apple master to Amazon EC2 would be very slow and very expensive.
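
To make the price difference concrete, a small Python sketch using the two
yearly prices quoted in this mail. The number of testers is only an assumption
for illustration, and IO and network costs are left out:

```python
# Yearly prices quoted in this mail (reserved instances, before IO and network).
BIG_INSTANCE_PER_YEAR = 7200   # 8-core High-Memory Quadruple Extra Large
SMALL_TESTER_PER_YEAR = 850    # 2-core machine, enough for running tests


def yearly_cost(testers: int, all_on_big_instances: bool) -> int:
    """Return the yearly instance cost in dollars for `testers` bots."""
    price = BIG_INSTANCE_PER_YEAR if all_on_big_instances else SMALL_TESTER_PER_YEAR
    return testers * price


# HYPOTHETICAL_TESTERS is an assumption for illustration, not a number from the mail.
HYPOTHETICAL_TESTERS = 4
combined = yearly_cost(HYPOTHETICAL_TESTERS, all_on_big_instances=True)
split = yearly_cost(HYPOTHETICAL_TESTERS, all_on_big_instances=False)
print(f"builder+tester on EC2: ${combined}/year")
print(f"tester-only on EC2:    ${split}/year")
print(f"difference:            ${combined - split}/year")
```

The gap grows linearly with the number of testers, which is the reason for
keeping the builds on the local build farm and moving only testers to EC2.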

Additionally, our ARM bots and Windows cross-compiler bots aren't ready to be
migrated to Amazon now. Of course it is possible in the future, but it needs more
work on them. And migrating the perf bots is impossible, they need dedicated hardware.

br,
Ossy