Hi *!

Over the last couple of days we have had problems with a SIGBUS in newkey (ltable.c) on Mac OS X (Leopard).

Please bear with me as I explain what happened:

I've been using Apple's GCC (i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5465)) on a Core 2 Duo, generating 32-bit x86 code. Our C code set up a bunch of tables/metatables in preparation for a setfenv call, but nothing really fancy. All attempts to reduce it to a manageable test case failed, though.

The relevant part of the stacktrace in question was:
#0  0x00053964 in newkey (L=0x50e030, t=0x50f3d0, key=0x510040) at ltable.c:425
#1  0x00053d05 in luaH_set (L=0x50e030, t=0x50f3d0, key=0x510040) at ltable.c:503
#2  0x0005520b in luaV_settable (L=0x50e030, t=0x813eb4, key=0x510040, val=0x813ec0) at lvm.c:142
#3  0x0005672f in luaV_execute (L=0x50e030, nexeccalls=1) at lvm.c:456
#4  0x00049131 in luaD_call (L=0x50e030, func=0x80d26c, nResults=1) at ldo.c:377
#5  0x000438b5 in lua_call (L=0x50e030, nargs=1, nresults=1) at lapi.c:778
#6  0x00064626 in ll_require (L=0x50e030) at loadlib.c:484
#7  0x004914a1 in luaD_precall (L=0x50e030, func=0x80d224, nresults=0) at ldo.c:319
#8  0x004a02d7 in luaV_execute (L=0x50e030, nexeccalls=1) at lvm.c:589
#9  0x00491701 in luaD_call (L=0x50e030, func=0x80d218, nResults=0) at ldo.c:377
#10 0x0048bed9 in f_call (L=0x50e030, ud=0xbfffe92c) at lapi.c:796
#11 0x0049096e in luaD_rawrunprotected (L=0x50e030, f=0x48beaf <f_call>, ud=0xbfffe92c) at ldo.c:116
#12 0x00491a50 in luaD_pcall (L=0x50e030, func=0x48beaf <f_call>, u=0xbfffe92c, old_top=24, ef=0) at ldo.c:461
#13 0x0048bf76 in lua_pcall (L=0x50e030, nargs=0, nresults=0, errfunc=0) at lapi.c:817
#14 0x004838e4 in lua_register_callback (con=0x50cb20) at plugin.c:1387

The Lua code that was loaded just set an entry in a table we prepared in the C code in lua_register_callback (our own code) prior to the lua_pcall.

After several lengthy gdb sessions and barking up entirely wrong trees, it became apparent that 'gkey(mp)' in

   gkey(mp)->value = key->value; gkey(mp)->tt = key->tt;

was actually referring to dummynode (static const Node dummynode_ in ltable.c:75) which was improperly aligned. (Aside: Can someone comment on the issue that mp actually is dummynode at this point? Would that be correct at all? Just so I know next time when I'm in there...)
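For anyone who wants to check this kind of thing without stepping through gdb, here is a minimal sketch of an alignment test on the address of a static const object. The DemoNode type and all names in it are hypothetical stand-ins, not the real Lua structures, and _Alignof assumes a C11 compiler (with an old GCC such as 4.0.1 you would presumably need the __alignof__ extension instead). It is only meant to illustrate the check, not to reproduce the crash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for Lua's value/node types; NOT the real layout. */
typedef union { void *p; double n; int b; } DemoValue;
typedef struct { DemoValue value; int tt; } DemoTValue;
typedef struct DemoNode {
    DemoTValue i_val;
    DemoTValue i_key;
    struct DemoNode *next;
} DemoNode;

/* A static const sentinel, analogous in spirit to dummynode_. */
static const DemoNode demo_dummynode = { {{NULL}, 0}, {{NULL}, 0}, NULL };

/* Returns 1 if addr is a multiple of align; align must be a power of two. */
static int is_aligned(const void *addr, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)addr & (align - 1)) == 0;
}

/* Checks the sentinel against the alignment the compiler reports for its type. */
static int demo_dummynode_is_aligned(void) {
    return is_aligned(&demo_dummynode, _Alignof(DemoNode));
}
```

Calling demo_dummynode_is_aligned() from a debug build (or evaluating the same expression on &dummynode_ in gdb) should tell you whether the object ended up where the compiler itself says it belongs.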

The "fix" we did was to change the definition of dummynode_ to

   static volatile Node dummynode_ = {

which made the SIGBUS disappear.

Discussing this on #lua, we agreed that this smells like a compiler bug in Apple's gcc, and I will report it to them as such. The volatile "fix" surely isn't the correct approach, but at least it allows further development with Lua on OS X x86, so it might be helpful for some people out there.

My next step will be to try compiling the whole thing with LLVM [1] to see whether or not it gets it right. If this problem is present on pre-Leopard versions of OS X, it might be worthwhile to try gcc-3.3 there (I can't do this on Leopard, though).

Sadly, I'm not familiar enough with the Lua internals to say where the misalignment of &dummynode_ makes things go wrong; maybe someone else can comment on it.

Just for people who find this in the archives, this probably is related:


Kay Roepke, Software Engineer

