[webkit-dev] Postmortum on webkit bug 71122

Wed Nov 2 09:06:10 PDT 2011

I just finished fixing http://crbug.com/75604 aka
https://bugs.webkit.org/show_bug.cgi?id=71122 , and it was a long
process, and pretty involved, and had a lot of cycles of change, wait
for crash data, etc...  So here's my postmortum:

TL;DR No matter how sure you are you fixed a crashing bug, keep
reading crash reports to be sure.

TL;DR2 Practice release crash dump debugging regularly.  Read
dissassembled code and never trust symbols.

TL;DR3 When in doubt, add more state to enhance crash dumps.

TL;DR4 WebKit can save stacks in memory now.  gavinp at webkit.org can
help you use this.

WHAT WAS HAPPENING

The ScriptRunner was crashing, and we were getting a lot of uploads.
We couldn't reproduce it by going to the same URIs, but it was
definitely happening a lot in the wild.  It didn't seem like other
WebKit ports were seeing this bug either, which was interesting.

WHAT WE DID

1. The crash in ScriptRunner was happening in
ScriptRunner::timerFired(), and it looked like a CachedScript in the
list of runnable scripts was NULL.  How did that happen?  So I added a
CRASH() to ScriptRunner::queueScriptForExecution().  This meant we
would crash before setting up the Async event, and maybe get a more
interesting stack on our crash dumps?
https://bugs.webkit.org/show_bug.cgi?id=65563

2. From this, we learned a few things.  Crash dumps told us we
definitely had a broken ScriptElement, and every stack showed a
network error was in progress.  The CachedScript in the ScriptElement
was bad, and we couldn't figure out why.  So, we added a hidden state
machine to ScriptElement to track down how it was being zeroed.  Was
it double notification, or was the load continuing after being
stopped?  Crash dumps would tell.
https://bugs.webkit.org/show_bug.cgi?id=66939

3. It was double notification.  Nobody could figure out how on earth
double notification happened.  Boy, I'd sure like to know the stack at
the time of the first ScriptElement::notifyFinished() call...  So we
added a new stack saving facility to WebKit, and used it here to save
a stack inside the ScriptElement on every call to
ScriptElement::notifyFinished().  Mosquito, meet our sledgehammer.
https://bugs.webkit.org/show_bug.cgi?id=69453

4. Oh, so the earlier stack was data received, but with a response
code >= 400?  How interesting...  Finally I could reproduce the bug,
and introduce a fix.
https://bugs.webkit.org/show_bug.cgi?id=71122

WHAT WORKED

1. Windows minidumps.  In Chrome, our dev and canary channels have
enabled a very useful option that gives you some heap.  In walking the
stack on Windows when crashing, any stack element interpretable as a
pointer gets the 1024byte range [p-256,p+768] saved into the dump.
This feature makes debugging from dev or canary windows minidumps
superior to anything else.  Learn how to use windb(a)g.

2. Progressive instrumentation.  As usual, once I had a reproduction a
fix was pretty easy.  But, while I didn't have a reproduction, it was
useful to add release assertions and watch crash dumps, to
progressively zone in on this bug.

3. Adding new facilities to WebKit.  You want stacks saved in memory
on every platform?  OK.
https://bugs.webkit.org/show_bug.cgi?id=69018

WHAT DIDN'T WORK

1. Debuggers.  Hackers who should have known better looked at release
minidumps and symbols, and couldn't understand the variables.  For a
good example of this mistake from me, see
http://code.google.com/p/chromium/issues/detail?id=75604#c21 .  Want
to see someone even smarter make the same mistake?
http://code.google.com/p/chromium/issues/detail?id=75604#c43

The ScriptElement wasn't bogus, but the symbols data was.  The symbols
indicated ECX had this, when this was actually in EAX at the time, or
[ESP].  Don't trust symbols in release builds; instead dissassemble
the function, and figure out where the data is.  The symbols probably
said ECX because at the point of first dereference or maybe at call
time, in that scope, this was probably in ECX.  But in optimised code,
that doesn't have much to do with anything anywhere else in the code.
Read the dissassembly.

2. Reading code.  I spent far, far too much time doing this.  You know
what, usually this works for me: I read code, consider the stack, have
an AHA! moment, and fix my bug.  Too bad this bug was beyond my
cognitive horizon, and reading code was just a great way to waste
time.

3. Not watching crashes.  See comments 30 and 31 in the bug:
http://code.google.com/p/chromium/issues/detail?id=75604#c30 .  We
should not have had confidence the problem was solved; since, hey, it
wasn't.

SPECIAL THANKS TO

ap at webkit.org for encouraging me to instrument where it helps.
cbentzel at chromium.org for putting up with my yak shaving.
eroman at chromium.org for showing me how to use windb(a)g.
japhet at chromium.org for never giving up hope.