Hi!

On Jan 29, 2008, at 10:11 PM, Mike Pall wrote:

Kay Roepke wrote:
After several lengthy gdb sessions and barking up entirely wrong trees, it
became apparent that 'gkey(mp)' in

  gkey(mp)->value = key->value; gkey(mp)->tt = key->tt;

was actually referring to dummynode (static const Node dummynode_ in
ltable.c:75) which was improperly aligned.

I think your analysis is incomplete. Please check the control
flow in newkey(). You should never get to the assignment you
quoted, if mp is equal to dummynode!

I did check the control flow, and that was exactly why I was wondering what happened.


Which in turn means the equality test for dummynode is failing.
This is the usual symptom, if you've linked two copies of the Lua
core into your application (causing two instances of dummynode to
appear).

And that's exactly what happened, yes.


A common error is to link C extension modules (shared libraries)
with the static library. The linker command line for extension
modules must not ever contain -llua or anything similar!

The code in question was recently refactored to support plugins, and the
-llua was carried into the plugin makefiles. Thus we ended up with several
dummynode objects, which obfuscated everything.

For instance, I used -DLUA_DEBUG to get luaH_isdummy in gdb, and that was
actually telling me that mp (which was equal to the dummynode I was seeing in
gdb) was not a dummy, thereby proving your point from above: there is at least
one other dummynode floating around.

I.e. check your build process. If you are unsure where the two
copies of the Lua core come from, grep the binaries for some
characteristic error message, like "table index is nil".

As I said, it was our own fault.
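
The grep suggestion above can also be scripted. The following is a minimal sketch in C that counts how often a marker string occurs in a binary stream; the name demo_count_marker is hypothetical and not part of Lua. Running it over the host executable and each plugin shows which files carry the string, and every file that does probably has its own copy of the Lua core linked in.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count occurrences of `marker` in the bytes readable
   from `fp`. Returns -1 on allocation or read failure. */
long demo_count_marker(FILE *fp, const char *marker)
{
    size_t mlen = strlen(marker);
    size_t cap = 4096, len = 0, n, i;
    long hits = 0;
    char *buf, *grown;

    if (mlen == 0) return 0;
    buf = malloc(cap);
    if (buf == NULL) return -1;

    /* Read the whole stream into memory; fine for a sketch. */
    while ((n = fread(buf + len, 1, cap - len, fp)) > 0) {
        len += n;
        if (len == cap) {
            grown = realloc(buf, cap * 2);
            if (grown == NULL) { free(buf); return -1; }
            buf = grown;
            cap *= 2;
        }
    }
    if (ferror(fp)) { free(buf); return -1; }

    /* Compare bytes with memcmp, because binaries contain embedded NULs
       that would stop strstr early. */
    for (i = 0; i + mlen <= len; i++) {
        if (memcmp(buf + i, marker, mlen) == 0) hits++;
    }
    free(buf);
    return hits;
}
```

Open each candidate file with fopen(path, "rb") and pass "table index is nil" as the marker. A non-zero count for a plugin is a strong hint that it was linked against the static library.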

(Aside: Can someone comment on the issue that mp actually is dummynode at this point? Would that be correct at all? Just so I know next time when I'm
in there...)

It's a minimal (and read-only!) node structure which is used in
case a table does not contain any entries in the hash part (yet).
It only contains a single nil key/value pair and no free slots.
So there's always a hash part for every table. This saves some
conditional jumps in several important code paths.

That's what I thought, thanks for the clarification!
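
To illustrate why a second copy of the core defeats the dummynode test, here is a minimal sketch in C. DemoNode, DemoTable and the demo_ functions are simplified, hypothetical stand-ins and not the real definitions from ltable.c. The point is only that the check is a pointer comparison against one static object, so a table initialised by one copy of the core does not look empty to another copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for Lua's Node and Table (assumption, not real Lua). */
typedef struct DemoNode {
    double value;  /* placeholder payload */
    int    tt;     /* placeholder type tag */
} DemoNode;

typedef struct DemoTable {
    const DemoNode *node;  /* hash part; points at a shared sentinel when empty */
} DemoTable;

/* Each statically linked copy of the core gets its own private sentinel. */
static const DemoNode dummy_core_a = {0.0, 0};
static const DemoNode dummy_core_b = {0.0, 0};

/* An empty table points at the sentinel of the core that created it. */
void demo_init_empty(DemoTable *t, const DemoNode *dummy)
{
    assert(t != NULL);
    t->node = dummy;
}

/* The "is the hash part empty?" test is plain pointer equality. */
bool demo_is_dummy(const DemoTable *t, const DemoNode *dummy)
{
    return t->node == dummy;
}

/* Table created by core A, inspected by core B: B does not recognise the
   sentinel and would go on to treat it as a writable node. */
bool demo_two_cores_disagree(void)
{
    DemoTable t;
    demo_init_empty(&t, &dummy_core_a);
    return demo_is_dummy(&t, &dummy_core_a) && !demo_is_dummy(&t, &dummy_core_b);
}
```

With a single copy of the core there is only one sentinel, so the comparison always succeeds for empty tables and the read-only node is never written to.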


The "fix" we did was to change the definition of dummynode_ to

  static volatile Node dummynode_ = {

which made the SIGBUS disappear.

This is not the right approach. You are only masking the
symptoms. The static dummynode should never, ever be written to
by the code quoted above!

"fix" ;) I was fully aware that this wasn't the correct approach, but was
curious why it actually worked.


Discussing this on #lua, we agreed that this smells like a compiler bug in
Apple's gcc and I will report it as a bug with them.

This is a different issue and you may have a point. I'm assuming
the two assignments are translated to an SSE2 instruction which
causes the alignment error (check the generated assembler code of
newkey for a movq or something similar).

This would be incorrect since the maximum alignment of the Node
struct is that of a double. Which is 4 (and not 8!) in both the
SYSV x86 ABI (used by Linux, BSD* and so on) and the (slightly
different) Mac OS X x86 ABI.

This means it's perfectly ok for dummynode to be only 4-byte
aligned. Check the assembler output for the section it's emitted
to and its alignment.

I haven't checked the assembly code yet, but will do so just to be sure.
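
Besides reading the assembler output, the alignment question can be probed from C. This is a rough sketch, assuming a C11 compiler for _Alignof; DemoNode is a hypothetical stand-in whose widest member is a double, not Lua's real Node. It reports the alignment the compiler requires for the struct and tests whether a given address satisfies a given alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Lua's Node: the widest member is a double. */
typedef struct DemoNode {
    double value;
    int    tt;
} DemoNode;

/* Alignment the compiler requires for the struct on the current target.
   The result depends on the ABI and on flags such as -malign-double. */
size_t demo_node_alignment(void)
{
    return _Alignof(DemoNode);
}

/* Nonzero if addr is a multiple of align (align must be a power of two). */
int demo_is_aligned(uintptr_t addr, size_t align)
{
    return (addr & (uintptr_t)(align - 1)) == 0;
}
```

An address such as 12 passes the check for 4-byte alignment but fails it for 8, which is the situation described above: an object that is legitimately 4-byte aligned, accessed by an instruction that expects 8.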


[That is unless you're using -malign-double inconsistently, which
is a really bad idea.]

:)


So the compiler is wrong to assume that Node is always 8-byte
aligned. Maybe there's some overly optimistic optimization which
takes a guess that Node * always points to heap-allocated objects
(which are always 16-byte aligned on OSX).

[Another theory to check: maybe the compiler does assume 8-byte
alignment, but either GCC or the MACH-O linker is moving
dummynode to the BSS (because it's all zero) and erroneously
drops the alignment.]

Good point, I'll include that when looking at it more closely.

Thanks for your excellent help folks! Removing the extraneous -llua in
the plugins solves the problem.

Leaves me wondering if there's anything Lua could check for this case and report an error message. This issue seems common enough to warrant a solution.

cheers,
-k

--
Kay Roepke, Software Engineer
MySQL GmbH, www.mysql.com
Office: +49 40 7889 16 51
Mobile: +49 171 75 72 503 (preferred)
