[webkit-help] dylib strangeness

Tim Prepscius timprepscius at gmail.com
Sun May 30 23:16:51 PDT 2010


Well I wrote this e-mail to a friend, but perhaps one of you may read
it and see the solution,
any hints would be interesting:



I'm having a debugging problem like I've never had before with this
safari plugin.


Maybe explaining to you will help me gain some sort of insight.

Imagine this:
You have a program.  It is made of about 20 libraries.  Some are other
people's, some are yours.
The difference between the safari plugin, and an executable, is about
100 lines of startup code.  Maybe 0.01% of the entire program.

On both Windows and OS X, the executable functions flawlessly.  80% of
the program has been around for at least 5 years, 18% is from the last
2 years, 1-2% is from the last year.  So in other words, things have
been working for a long time.


As a safari plugin, there is a point A and a point B which crash.
Only in release build.

A crashes only 10% of the time.
I can increase the likelihood that A crashes by delaying the event by
about 10 seconds, at which point the likelihood is maybe 30%.  (But I'm
weirded out by this and don't trust this observation.)  I can create
this delay by just pausing the debugger, or pausing the server it is
talking to.
A crashes during a dynamic cast of an object.
The same dynamic cast occurred a few moments before.
If it makes it past point A, that same code/dynamic_cast will work perpetually.
This same code is called millions of times.  The object it is casting
is allocated in the very beginning, and deallocated at the very end.


When I turn on logging, the crash does not occur.
When I turn off optimization, the crash does not occur.  However if I
turn off optimization of only the module in which the crash occurs (or
the call to dynamic cast), it still occurs.

The code in which A crashes looks like this:
void Dynamic::event (const Object::Event::Base *event)
{
	LogDebug (SnowCrash::Object::Dynamic::event, "object receiving event " << this);

	std::list<Common::Object::Component *>::iterator i;

	for (i=orderedComponents.begin(); i!=orderedComponents.end(); ++i)
	{
		Common::Object::Component *_component = *i;
		LogDebug (SnowCrash::Object::Dynamic::event, "object distributing event to " << _component);
		LogDebug (SnowCrash::Object::Dynamic::event, "object distributing event to " << _component->getComponentID());

		Object::Component *component = CheckCastPtr(Object::Component, _component);
		if (component)
		{
			component->event (event);
		}
	}
}
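
For anyone reading along without the source: CheckCastPtr itself isn't
shown above.  Below is a minimal sketch of what such a checked cast might
boil down to, assuming it wraps dynamic_cast and yields NULL on a
mismatch.  BaseComponent, EventComponent, OtherComponent, checked_cast
and dispatch are hypothetical stand-ins, not the real code.

```cpp
#include <cassert>

// Hypothetical stand-ins for Common::Object::Component and
// Object::Component; the real classes are not shown in this mail.
struct BaseComponent {
    virtual ~BaseComponent() {}
};

struct EventComponent : BaseComponent {
    int eventsSeen;
    EventComponent() : eventsSeen(0) {}
    void event() { ++eventsSeen; }
};

struct OtherComponent : BaseComponent {};

// Assumed shape of the checked cast: a null pointer stays null, and a
// type mismatch on a valid object gives null rather than a crash.
template <typename To, typename From>
To *checked_cast(From *from)
{
    return from ? dynamic_cast<To *>(from) : nullptr;
}

// Mirrors the loop body: cast, then dispatch only if the cast succeeded.
bool dispatch(BaseComponent *candidate)
{
    EventComponent *component = checked_cast<EventComponent>(candidate);
    if (component) {
        component->event();
        return true;
    }
    return false;
}

void demo_checked_cast()
{
    EventComponent good;
    OtherComponent other;
    assert(dispatch(&good) && good.eventsSeen == 1);
    assert(!dispatch(&other));
}
```

On typical implementations dynamic_cast has to read the object's vtable
pointer to find its type information, so the null result above only
holds for an intact object.  If the bytes of the object have been
overwritten, the cast itself can crash.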



B has no pattern.
It occurs when a piece of memory is deallocated twice, when a smart
pointer decrements its reference count.  But this is impossible.
Unless either a copy constructor or a copy assignment operator is not
being called.  It could be the copy constructor of the object which
contains the smart pointer, or of the smart pointer itself.  Either
seems very unlikely.  Unfortunately this bug occurs so rarely it is
hard to catch.
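
To spell out why a skipped copy constructor would lead to a double
free, here is a small sketch.  It uses std::shared_ptr as a stand-in for
the smart pointer in question (an assumption, the real one is not shown
here), and Holder is a hypothetical name for the object that contains
it.

```cpp
#include <cassert>
#include <memory>

// Hypothetical holder, standing in for "the object which contains the
// smart pointer".
struct Holder {
    std::shared_ptr<int> payload;
};

// Returns the reference count seen while a copy of the holder is alive.
long count_while_copied(const Holder &original)
{
    Holder copy(original);            // copy constructor bumps the count
    return copy.payload.use_count();
}                                     // copy destroyed, count drops again

void demo_refcount()
{
    Holder h;
    h.payload = std::make_shared<int>(42);
    assert(h.payload.use_count() == 1);
    assert(count_while_copied(h) == 2);  // one extra owner during the copy
    assert(h.payload.use_count() == 1);  // back to a single owner

    // A raw byte copy of a Holder (memcpy, for instance) would create a
    // second owner WITHOUT the increment, so both destructors would
    // release the same memory.  Not executed here: it is undefined
    // behaviour.
}
```

Every properly made copy is paired with exactly one decrement, so the
count can only reach zero once.  A second release needs either a copy
that skipped the increment or something overwriting the count from
outside.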



--

So at first my theory was, well, let's see what is happening.
But after stepping through over and over, I can't see anything wrong
with the object it is trying to cast.  Obviously something is.


So then I thought, well, perhaps this is just a messed up build.  So I
rebuilt everything.  A messed up build happens to me sometimes on win32
if I change the virtual methods of a class but don't recompile the
modules that depend on it.


So then I thought..  Well given that the executables operate fine.
Maybe there is some sort of bug in static initializations.
But they *seem* to be occurring.  At least some of them are.
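
One cheap way to check whether static initialization ran in a given
module is a sentinel object.  This is only a sketch of the idea;
InitSentinel, g_sentinel and statics_initialized are hypothetical names,
and one such sentinel would be needed per library.

```cpp
#include <cassert>

// Hypothetical sentinel: its constructor runs during static
// initialization of whichever module defines the global below.
struct InitSentinel {
    unsigned magic;
    InitSentinel() : magic(0xC0FFEEu) {}
    bool ok() const { return magic == 0xC0FFEEu; }
};

// Static storage is zero-filled before any constructor runs, so if the
// constructor was skipped, magic is still 0 and ok() returns false.
static InitSentinel g_sentinel;

bool statics_initialized() { return g_sentinel.ok(); }

// Quick self-check for a debug build.  In a release build with NDEBUG
// defined, assert() compiles away, so log the bool instead.
void self_check() { assert(statics_initialized()); }
```

Calling statics_initialized() from the plugin's startup code and logging
the result would show whether that module's initializers ran before the
first event arrives.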


So then I thought, maybe there is some sort of discord between
objective-c and c++, with memory management.  And I investigated that
for a while.  However that would not explain the fact it always crashes
in the same place.  If it crashes at all.  It seems to me that enough
people are mixing objective-c and c++ that this should not be a
problem.


So then I thought..  Ok, I think that that memory is being modified,
either by safari or by my own threads (which function fine as an
executable).  And it is suspicious that this problem seems linked to
time.  So I wrote a memory watcher.  I overrode new and delete, kept
a set of memory, and did continuous CRCs on that memory, looking for
when bits changed.  [which it turns out is pretty interesting to watch
anyway]


However this new/delete overriding changed the timing of the program.
And it stopped crashing.
I tried restricting the watched area to a specific section only,
however it continues to not crash.
But when I turn that memory watching off, it crashes again.

Also, perhaps that memory watching causes more allocations, and
perhaps that changes the overall structure of the allocations.
Because a *single time*, this memory watcher/debugger crashed.  Saying
that it was watching NULL memory.  Which was impossible.
Cause basically I have this:

new:
lock memory-mutex
  make memory, make memory tag
  if either is NULL, return NULL
  else add it to the set of memory to watch.
unlock memory-mutex

delete:
lock memory-mutex
  if the memory is tagged
  remove it from set and delete it
  else just delete it
unlock memory-mutex

test:
lock memory-mutex
  evaluate crc's of memory, compare with tags
  if anything has changed, print out a message
unlock memory-mutex
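
The three steps above can be sketched in C++ roughly as follows.  This
is a simplified illustration under several assumptions, not the actual
watcher: it takes explicit watch/unwatch calls instead of hooking
operator new and delete, it uses std::mutex instead of boost+pthreads,
and the checksum is FNV-1a standing in for a real CRC.  MemoryWatcher
and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

// Simple checksum standing in for the CRC (FNV-1a, not a real CRC).
static std::uint32_t checksum(const void *data, std::size_t size)
{
    const unsigned char *p = static_cast<const unsigned char *>(data);
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < size; ++i) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

class MemoryWatcher {
public:
    // "new": tag a block and remember its current checksum.
    void watch(const void *block, std::size_t size)
    {
        if (!block) return;               // never track NULL memory
        std::lock_guard<std::mutex> lock(mutex_);
        Tag tag = { size, checksum(block, size) };
        tags_[block] = tag;
    }

    // "delete": drop the tag if the block was tagged.
    void unwatch(const void *block)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        tags_.erase(block);
    }

    // "test": return every watched block whose bytes changed since the
    // last call, then re-arm it with the new checksum.
    std::vector<const void *> test()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<const void *> changed;
        for (auto &entry : tags_) {
            std::uint32_t now = checksum(entry.first, entry.second.size);
            if (now != entry.second.sum) {
                changed.push_back(entry.first);
                entry.second.sum = now;
            }
        }
        return changed;
    }

private:
    struct Tag { std::size_t size; std::uint32_t sum; };
    std::mutex mutex_;
    std::map<const void *, Tag> tags_;
};
```

Taking explicit calls keeps the sketch free of one trap in the hooked
version: the map's own node allocations would otherwise go back through
the overridden operator new while the mutex is held.  Note also that any
legitimate write to a watched block shows up as a change, so this only
gives a clean signal for memory that is meant to stay constant.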


This crash of the memory watcher really weirded me out, cause it is
nearly impossible, unless boost+pthreads has problems on osx, so it
seems to me that some external process zeroed a segment of my memory.

Which would explain the crash on the smart pointer decrement, and also
the dynamic_cast failure.


So my current working theory is:
1.  a pointer somewhere, is initialized incorrectly, but always the same way.
2.  writing to it is zeroing out my memory.
3.  this pointer may or may not be within my dylib/process space


So my question to you is:

What would your approach to solving this be?  Cause my usual isn't
working.  Any magic bullets?
I'm up to maybe 50 hours on this bug.


-tim



On 5/28/10, Tim Prepscius <timprepscius at gmail.com> wrote:
> Greetings again,
>
> So I've been able to (perhaps) solve my opengl issues, by switching
> to cocoa basically.  I'm still using agl via the window ref of the cocoa
> window.  Seems to function, I wonder if it will fail with some update
> of safari.  On a side note, if anyone sees this post while
> investigating opengl problems, don't bother with xulrunner on mac!  It
> will just be a waste of time.   It took me a while to figure out that
> npapi was in webkit as well.
>
>
> But now I'm seeing some extreme strangeness in other areas.
>
>
> So I have a Client application.
> It is made up of about 20 libraries and a bit of connecting code.
>
> One version links as a windowed executable.
> One version links as a plugin.
> (depending on which bit of connecting code you use)
> However the rest of the code for the application in both cases is
> exactly the same.  99.999% of it.
>
>
> The strangeness I'm seeing is this:
> The application version functions without problem both debug and
> release.  (as it has done for quite a while).
> The plugin version crashes.  But only the optimized non debug build.
>
> And it crashes in weird ways that are reminiscent of out of sync
> linking problems.  For instance "dynamic_cast" is failing and causing
> a crash in an area nearly impossible.  And that area of code has
> existed without problem for 9 years.
>
> There seem to be initialization problems of variables.  Or perhaps a
> copy operator/constructor is not being called correctly.
>
>
>
> I've spent the last two days investigating what could be causing this.
>  It is a mystery, cause the normal application just hums along fine,
> while the plugin crashes, not immediately, however in the first 5
> seconds or so, as significant events occur.
>
> My leaning is to think there is a problem with gcc and optimized code
> in dylibs, perhaps their static initializations are not being
> completely performed?  But I must think that the chances of this are
> fairly small, as apple uses dylibs everywhere, so they would make sure
> that these function correctly.
>
>
> Has anyone else seen a situation where optimized code doesn't perform
> as a dylib, while as an executable it does?  What was the work around?
>
> Or, does anyone know of problems with mixing objective-c and c++ in a
> dylib?
>
>
>
> As of now, I'm trying to isolate the module which causes the problem
> in release build, and see if I can isolate the code segment, but it is
> slow going, and I'm not sure whether this error will manifest
> somewhere else.
>
> -tim
>

