[webkit-dev] A post-mordem of today's tree redness
Adam Barth
abarth at webkit.org
Mon Apr 5 21:58:47 PDT 2010
We had some trouble today keeping the tree green. In this email, I
present a post-mordem analysis of what happened and what we can learn
from these events. I've removed most of the names from this account
because the purpose isn't to assign blame but to document what
happened in the hopes that we can learn from it.
== Timeline ==
04/03/10 10:27 am - r57051 lands, breaking five tests that enumerate
all the objects in the global scope. This breakage goes undetected
because an incremental build fails to rebuild the necessary files due
to bug 37102.
04/05/10 10:00 am - r57078 lands breaking Leopard Release, Leopard
Debug, and SnowLeopard, ALSO triggering a rebuild of DOMWindow.idl,
revealing that r57051 broke multiple platforms.
10:18 am - r57080 lands into a broken tree. The committer landing
this patch expected the change to break virtually all the bots and was
landing the change to gather expected results from platforms he didn't
have on hand. As expected, r57080 breaks essentially all of the bots.
10:27 am - r57081 lands, apparently innocuously amid the burning tree.
10:35 am - r57082 rolls out r57078, fixing 3 bots, but the
r57051-triggered failures remain.
10:36 am - r57083 lands, adding Qt expected results for r57080.
11:24 am - r57086 lands, adding Mac expected results for r57080.
11:27 am - r57087 lands, fixing expected results for r57051. (Further
expected result fixes follow in r57092 and r57097.)
12:28 pm - The dust settles and we realize that two tests are still
broken on Tiger. The blame list has five patches on it. Through some
(perhaps misguided) reasoning, we conclude that r57080 is still
responsible.
12:58 pm - We trigger a clean build of Tiger, hoping the clean build
will solve our problem.
2:33 pm - The clean rebuild rolls green. (Yes, a clean rebuild of
Tiger takes an hour and a half.)
2:52 pm - The next Tiger cycle reveals that the broken tests are still
plaguing the Tiger bot. We continue investigating assuming r57080 is
the culprit.
~4:00pm - Ossy mentions on #webkit that reverting r57081 locally fixes
these tests on Qt.
4:07 pm - r57100 rolls out r57081. James Robinson continues to investigate.
~4:30 pm - Tiger (and the rest of the core builders) roll green.
8:08 pm - r57116 lands with a correct version of r57081.
(I've omitted a pixel test regression during this time period, which
was invisible to the bots and rolled out, as well as a breakage later
in the day that we corrected shortly after landing.)
== Discussion ==
Dealing with the failures this morning was particularly difficult
because there were four overlapping failures revealed in the space of
27 minutes.
1) The HTMLProgressElement failures were difficult to track down
because they sat in the tree for days undetected because of incorrect
dependencies. Unfortunately, the failures were revealed at an
inopportune time in the midst of other failures. However, correcting
the failures was straightforward because the expected results simply
needed to be updated.
2) The failures caused by r57078 were disentangled 35 minutes later
because the patch was rolled out. I haven't followed up on this
change to see what the ultimate resolution of this issue was.
3) Landing r57080 into a broken tree during California business hours
seems like a questionable decision, especially given that the
committer knew the change would break essentially every bot.
4) Because the tree was on fire, it took us five and a half hours to
disentangle the failures caused by r57081 from the failures caused by
r57080.
Thanks go to Ossy for uncovering that r57081 (and not r57080) caused
the two failing tests on Tiger, and special thanks go to James
Robinson for continuing to investigate the r57081-related failures and
writing an improved version of the patch (even though the original
patch wasn't his).
== Recommendations ==
A) We should work to improve our dependencies so that breakages like
the one caused by r57051 show up immediately, when they are easier to
deal with, instead of days later in the midst of another fire-fight.
B) When landing a patch you know will break many or most bots,
consider landing the patch when you're sure the tree is green and
preferably outside of business hours in California (when the tree is
especially busy).
C) We should improve the EWS bots to produce expected results so folks
don't need to light the tree on fire to generate expected results for
platforms they don't personally have access to.
D) Instead of spending five and a half hours attempting to figure out
why r57080 caused the regressions, we should have confirmed our
hypothesis by rolling out the patch. We could then have eliminated
r57080 from suspicion and rolled it back into the tree much more
quickly.
Rolling changes out of the tree isn't the end of the story. We still
want the improvements in the patch, we just want them without the
regressions. Leaving failures in the tree make it difficult to track
down subsequent failures. Rolling out a change means more work for
you, but less costs imposed on the rest of the project.
Adam
More information about the webkit-dev
mailing list