[webkit-dev] Postmortum on webkit bug 71122
Gavin Peters (蓋文彼德斯)
gavinp at chromium.org
Wed Nov 2 09:18:03 PDT 2011
[resend with nicer formatting]
I just finished fixing http://crbug.com/75604 aka
https://bugs.webkit.org/show_bug.cgi?id=71122 , and it was a long process,
and pretty involved, and had a lot of cycles of change, wait for crash
data, etc... So here's my postmortum:
TL;DR No matter how sure you are you fixed a crashing bug, keep reading
crash reports to be sure.
TL;DR2 Practice release crash dump debugging regularly. Read dissassembled
code and never trust symbols.
TL;DR3 When in doubt, add more state to enhance crash dumps.
TL;DR4 WebKit can save stacks in memory now. gavinp at webkit.org can help
you use this.
WHAT WAS HAPPENING
The ScriptRunner was crashing, and we were getting a lot of uploads. We
couldn't reproduce it by going to the same URIs, but it was definitely
happening a lot in the wild. It didn't seem like other WebKit ports were
seeing this bug either, which was interesting.
WHAT WE DID
1. The crash in ScriptRunner was happening in ScriptRunner::timerFired(),
and it looked like a CachedScript in the list of runnable scripts was NULL.
How did that happen? So I added a CRASH() to
ScriptRunner::queueScriptForExecution(). This meant we would crash before
setting up the Async event, and maybe get a more interesting stack on our
crash dumps?
https://bugs.webkit.org/show_bug.cgi?id=65563
2. From this, we learned a few things. Crash dumps told us we definitely
had a broken ScriptElement, and every stack showed a network error was in
progress. The CachedScript in the ScriptElement was bad, and we couldn't
figure out why. So, we added a hidden state machine to ScriptElement to
track down how it was being zeroed. Was it double notification, or was the
load continuing after being stopped? Crash dumps would tell.
https://bugs.webkit.org/show_bug.cgi?id=66939
3. It was double notification. Nobody could figure out how on earth double
notification happened. Boy, I'd sure like to know the stack at the time of
the first ScriptElement::notifyFinished() call... So we added a new stack
saving facility to WebKit, and used it here to save a stack inside the
ScriptElement on every call to ScriptElement::notifyFinished(). Mosquito,
meet our sledgehammer.
https://bugs.webkit.org/show_bug.cgi?id=69453
4. Oh, so the earlier stack was data received, but with a response code >=
400? How interesting... Finally I could reproduce the bug, and introduce
a fix.
https://bugs.webkit.org/show_bug.cgi?id=71122
WHAT WORKED
1. Windows minidumps. In Chrome, our dev and canary channels have enabled
a very useful option that gives you some heap. In walking the stack on
Windows when crashing, any stack element interpretable as a pointer gets
the 1024byte range [p-256,p+768] saved into the dump. This feature makes
debugging from dev or canary windows minidumps superior to anything else.
Learn how to use windb(a)g.
2. Progressive instrumentation. As usual, once I had a reproduction a fix
was pretty easy. But, while I didn't have a reproduction, it was useful to
add release assertions and watch crash dumps, to progressively zone in on
this bug.
3. Adding new facilities to WebKit. You want stacks saved in memory on
every platform? OK.
https://bugs.webkit.org/show_bug.cgi?id=69018
WHAT DIDN'T WORK
1. Debuggers. Hackers who should have known better looked at
release minidumps and symbols, and couldn't understand the variables. For
a good example of this mistake from me, see
http://code.google.com/p/chromium/issues/detail?id=75604#c21 . Want to see
someone even smarter make the same mistake?
http://code.google.com/p/chromium/issues/detail?id=75604#c43
The ScriptElement wasn't bogus, but the symbols data was. The
symbols indicated ECX had this, when this was actually in EAX at the time,
or [ESP]. Don't trust symbols in release builds; instead dissassemble the
function, and figure out where the data is. The symbols probably said ECX
because at the point of first dereference or maybe at call time, in that
scope, this was probably in ECX. But in optimised code, that doesn't have
much to do with anything anywhere else in the code. Read the dissassembly.
2. Reading code. I spent far, far too much time doing this. You
know what, usually this works for me: I read code, consider the stack,
have an AHA! moment, and fix my bug. Too bad this bug was beyond
my cognitive horizon, and reading code was just a great way to waste time.
3. Not watching crashes. See comments 30 and 31 in the bug:
http://code.google.com/p/chromium/issues/detail?id=75604#c30 . We should
not have had confidence the problem was solved; since, hey, it wasn't.
SPECIAL THANKS TO
ap at webkit.org for encouraging me to instrument where it helps.
cbentzel at chromium.org for putting up with my yak shaving.
eroman at chromium.org for showing me how to use windb(a)g.
japhet at chromium.org for never giving up hope.
On 2 November 2011 12:06, Gavin Peters (蓋文彼德斯) <gavinp at chromium.org> wrote:
> I just finished fixing http://crbug.com/75604 aka
> https://bugs.webkit.org/show_bug.cgi?id=71122 , and it was a long
> process, and pretty involved, and had a lot of cycles of change, wait
> for crash data, etc... So here's my postmortum:
>
> TL;DR No matter how sure you are you fixed a crashing bug, keep
> reading crash reports to be sure.
>
> TL;DR2 Practice release crash dump debugging regularly. Read
> dissassembled code and never trust symbols.
>
> TL;DR3 When in doubt, add more state to enhance crash dumps.
>
> TL;DR4 WebKit can save stacks in memory now. gavinp at webkit.org can
> help you use this.
>
>
> WHAT WAS HAPPENING
>
> The ScriptRunner was crashing, and we were getting a lot of uploads.
> We couldn't reproduce it by going to the same URIs, but it was
> definitely happening a lot in the wild. It didn't seem like other
> WebKit ports were seeing this bug either, which was interesting.
>
>
> WHAT WE DID
>
> 1. The crash in ScriptRunner was happening in
> ScriptRunner::timerFired(), and it looked like a CachedScript in the
> list of runnable scripts was NULL. How did that happen? So I added a
> CRASH() to ScriptRunner::queueScriptForExecution(). This meant we
> would crash before setting up the Async event, and maybe get a more
> interesting stack on our crash dumps?
> https://bugs.webkit.org/show_bug.cgi?id=65563
>
> 2. From this, we learned a few things. Crash dumps told us we
> definitely had a broken ScriptElement, and every stack showed a
> network error was in progress. The CachedScript in the ScriptElement
> was bad, and we couldn't figure out why. So, we added a hidden state
> machine to ScriptElement to track down how it was being zeroed. Was
> it double notification, or was the load continuing after being
> stopped? Crash dumps would tell.
> https://bugs.webkit.org/show_bug.cgi?id=66939
>
> 3. It was double notification. Nobody could figure out how on earth
> double notification happened. Boy, I'd sure like to know the stack at
> the time of the first ScriptElement::notifyFinished() call... So we
> added a new stack saving facility to WebKit, and used it here to save
> a stack inside the ScriptElement on every call to
> ScriptElement::notifyFinished(). Mosquito, meet our sledgehammer.
> https://bugs.webkit.org/show_bug.cgi?id=69453
>
> 4. Oh, so the earlier stack was data received, but with a response
> code >= 400? How interesting... Finally I could reproduce the bug,
> and introduce a fix.
> https://bugs.webkit.org/show_bug.cgi?id=71122
>
>
> WHAT WORKED
>
> 1. Windows minidumps. In Chrome, our dev and canary channels have
> enabled a very useful option that gives you some heap. In walking the
> stack on Windows when crashing, any stack element interpretable as a
> pointer gets the 1024byte range [p-256,p+768] saved into the dump.
> This feature makes debugging from dev or canary windows minidumps
> superior to anything else. Learn how to use windb(a)g.
>
> 2. Progressive instrumentation. As usual, once I had a reproduction a
> fix was pretty easy. But, while I didn't have a reproduction, it was
> useful to add release assertions and watch crash dumps, to
> progressively zone in on this bug.
>
> 3. Adding new facilities to WebKit. You want stacks saved in memory
> on every platform? OK.
> https://bugs.webkit.org/show_bug.cgi?id=69018
>
>
> WHAT DIDN'T WORK
>
> 1. Debuggers. Hackers who should have known better looked at release
> minidumps and symbols, and couldn't understand the variables. For a
> good example of this mistake from me, see
> http://code.google.com/p/chromium/issues/detail?id=75604#c21 . Want
> to see someone even smarter make the same mistake?
> http://code.google.com/p/chromium/issues/detail?id=75604#c43
>
> The ScriptElement wasn't bogus, but the symbols data was. The symbols
> indicated ECX had this, when this was actually in EAX at the time, or
> [ESP]. Don't trust symbols in release builds; instead dissassemble
> the function, and figure out where the data is. The symbols probably
> said ECX because at the point of first dereference or maybe at call
> time, in that scope, this was probably in ECX. But in optimised code,
> that doesn't have much to do with anything anywhere else in the code.
> Read the dissassembly.
>
> 2. Reading code. I spent far, far too much time doing this. You know
> what, usually this works for me: I read code, consider the stack, have
> an AHA! moment, and fix my bug. Too bad this bug was beyond my
> cognitive horizon, and reading code was just a great way to waste
> time.
>
> 3. Not watching crashes. See comments 30 and 31 in the bug:
> http://code.google.com/p/chromium/issues/detail?id=75604#c30 . We
> should not have had confidence the problem was solved; since, hey, it
> wasn't.
>
>
> SPECIAL THANKS TO
>
> ap at webkit.org for encouraging me to instrument where it helps.
> cbentzel at chromium.org for putting up with my yak shaving.
> eroman at chromium.org for showing me how to use windb(a)g.
> japhet at chromium.org for never giving up hope.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.webkit.org/pipermail/webkit-dev/attachments/20111102/33d83e36/attachment.html>
More information about the webkit-dev
mailing list