Hi list,

I tried upgrading a 5.3-based project to 5.4.0 (alpha) today and hit some issues. After some minimal adaptations to the native code (the project embeds Lua in a Windows exe), everything looked fine, but some of the project's unit tests started failing in odd, non-deterministic ways, whereas on 5.3.5 they ran fine. Tests that passed in isolation would fail in weird ways when run one after another. This code gets hammered fairly heavily (dozens of automation runs of the 5.3-based code per day), so I'm reasonably sure it's not doing anything too egregiously stupid (although it's always possible...).

I narrowed it down to something going wrong with the value of nCcalls in a coroutine's lua_State. I managed to get it to underflow and wrap around, and once it has a value like 0xFFFFFFFF all sorts of interesting things go wrong (spurious warnings about yielding across a C boundary, out-of-stack errors, etc.).

When the issue occurs, the main lua_State is calling into a previously-yielded coroutine via a call to lua_resume(). The coroutine was previously yielded from a C function with a call to lua_yieldk() with a continuation function specified. The C callstack looks like this:

my.exe!luaE_shrinkCI(lua_State * L)
my.exe!luaD_shrinkstack(lua_State * L)
my.exe!recover(lua_State * L, int status)
my.exe!lua_resume(lua_State * L, lua_State * from, int nargs, int * nresults)
my.exe!my_call_into_coroutine(lua_State * L, int nargs, int nret)
my.exe!main(int argc, char * * argv)

An error() has just happened (inside a pcall, in the coroutine) and is being handled, by the looks of it. On entry to luaE_shrinkCI, L->nCcalls is 1 and L->nci is 14. At the end of the function, nci is 12 and nCcalls is 0xFFFFFFFF. From that point the damage is done, and exactly what goes wrong varies. I haven't worked through just what is happening in luaE_shrinkCI() to cause this, but it seems likely to be related to how I'm using the coroutine APIs, in particular yielding from and resuming into C functions. The Lua stack (both the current frame and the CallInfo::previous links in L->ci) looks sane, and the same code has been running fine with 5.3 for months.
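
The numbers above are at least consistent with the counter having been reduced by the number of CallInfo entries freed (14 - 12 = 2): in 32-bit unsigned arithmetic, 1 - 2 wraps around to 0xFFFFFFFF. The sketch below only illustrates that arithmetic. The function and parameter names are made up, and it is an assumption, not something established here, that luaE_shrinkCI() adjusts nCcalls this way in 5.4.0 (alpha).

```c
#include <stdint.h>

/* Hypothetical model, not Lua's code: a 32-bit unsigned counter that is
   reduced by the number of list entries freed during a shrink. */
uint32_t counter_after_shrink(uint32_t counter,
                              uint32_t entries_before,
                              uint32_t entries_after) {
    uint32_t freed = entries_before - entries_after;
    /* Unsigned subtraction is defined to wrap modulo 2^32, so a result
       "below zero" shows up as a very large value instead. */
    return counter - freed;
}
```

With the reported values, counter_after_shrink(1, 14, 12) yields 0xFFFFFFFF, whereas a starting count of 3 or more would have stayed in range.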

I haven't yet worked out a minimal reproduction of this problem (it reproduces reliably for me locally, but I can't share that code, and it's enormous anyway), and I definitely haven't ruled out the possibility that my native code is the problem, but I wanted to let people know now in case anyone else is seeing something similar. I'll report back if I get a minimal repro working, or indeed if I find it's a bug in my native code :-)

Cheers,

Tom