Hi, I found an interesting stack overflow crash in my project.

  Before we dive into the root cause of the crash, let's try some examples. I don't think the test environment matters, but if you cannot reproduce the results of the examples below, try my setup:
- OS: Ubuntu 20.04 LTS
- glibc: Ubuntu GLIBC 2.31
- Lua: Lua 5.4.4 (commit hash 0e5071b5fbcc244d9f8c4bae82e327ad59bccc3f)

---------------------------------------------------------------------------------------------------
[example1.lua] -- case normal

local function func()
  coroutine.wrap(func)()
end
func()

[result of example1.lua]
(Some repeated lines are skipped.)
.\example1.lua:2: .\example1.lua:2: .\example1.lua:2: .\example1.lua:2: C stack overflow
stack traceback:
        [C]: in ?
        .\example1.lua:2: in local 'func'
        .\example1.lua:4: in main chunk
        [C]: in ?

---------------------------------------------------------------------------------------------------
[example2.lua] -- case normal

local function func()
  print("Hello, Lua!")
  coroutine.wrap(func)()
end
func()

[result of example2.lua]
(Some repeated lines are skipped.)
Hello, Lua!
Hello, Lua!
Hello, Lua!
.\example2.lua:3: .\example2.lua:3: .\example2.lua:3: .\example2.lua:3: C stack overflow
stack traceback:
        [C]: in ?
        .\example2.lua:3: in local 'func'
        .\example2.lua:5: in main chunk
        [C]: in ?

---------------------------------------------------------------------------------------------------
  As you can see from the examples, the Lua interpreter handles stack exhaustion caused by recursive coroutines well. The detailed logic is in ldo.c, mainly in lua_resume, resume and luaD_rawrunprotected. Let me explain it briefly. When a coroutine is resumed, the current value of nCcalls is saved around the LUAI_TRY block (in luaD_rawrunprotected) so that it can be restored if an error occurs inside. Also, when we resume a coroutine, the caller's nCcalls is carried over into the state of the coroutine being resumed. Note that the resume function itself increases the value of nCcalls. As a result, if we resume coroutines recursively, the value of nCcalls gets higher and higher until it triggers an error, which is then propagated through LUAI_THROW and LUAI_TRY at each level. (The luaE_checkcstack function in lstate.c raises this error.)
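
  The counting described above can be sketched as a toy model. This is not code from ldo.c: the function name toy_depth_until_overflow is hypothetical, and TOY_MAX_CCALLS is only a stand-in for Lua's LUAI_MAXCCALLS (its value here is an assumption for illustration). It only shows why a counter that is copied from the caller and incremented on every resume must eventually hit the limit.

```c
#include <assert.h>

#define TOY_MAX_CCALLS 200  /* assumed stand-in for LUAI_MAXCCALLS */

/* Toy model: every nested resume copies the caller's counter and adds one.
   Returns the nesting depth at which the "C stack overflow" error would be
   raised, given the counter value of the outermost caller. */
static int toy_depth_until_overflow(int start) {
  int caller = start;
  int depth = 0;
  assert(start >= 0);
  for (;;) {
    int callee = caller;  /* the coroutine inherits the caller's count */
    callee += 1;          /* the resume itself counts as one C call */
    if (callee >= TOY_MAX_CCALLS)
      return depth;       /* limit reached: an error is raised, no crash */
    caller = callee;      /* the body resumes the next coroutine */
    depth++;
  }
}
```

  In this model, a caller that starts with a counter of 1 reaches the limit after 198 nested resumes, so the recursion ends with an error long before the real C stack is exhausted.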

  Okay, that's how recursive coroutines are handled, so the errors above are no surprise. But how about the next example, crash.lua?

---------------------------------------------------------------------------------------------------
[crash.lua] -- case crash
local function func()
  pcall(1)
  coroutine.wrap(func)()
end
func()

[result of crash.lua]
Segmentation fault (core dumped)

---------------------------------------------------------------------------------------------------
  This is the interesting crash I found. The recursive coroutine is not caught by the logic described above, and the crash takes down the whole program (in this case, the Lua interpreter). To get more information, I rebuilt with AddressSanitizer (ASAN).

---------------------------------------------------------------------------------------------------
[Address Sanitizer report of crash.lua]
==31777==ERROR: AddressSanitizer: stack-overflow on address 0xff7fdde4
   (pc 0xf7a18344 bp 0xff7fe248 sp 0xff7fdde8 T0)
#0 0xf7a18343 (/lib32/libasan.so.5+0x74343)
#1 0x56589510 in luaO_pushvfstring
#2 0x56572179 in luaG_runerror
#3 0x5657232b in luaG_typeerror
#4 0x56572588 in luaG_callerror
#5 0x56575835 in luaD_tryfuncTM
#6 0x56576d46 in luaD_precall
#7 0x565777d3 in ccall
#8 0x5656b1a2 in lua_pcallk
#9 0x565d8637 in luaB_pcall
#10 0x56577219 in luaD_precall
#11 0x565aecda in luaV_execute
#12 0x56577814 in ccall
#13 0x56573c83 in luaD_rawrunprotected
#14 0x56577d08 in lua_resume
#15 0x565d87fc in auxresume
#16 0x565d89e9 in luaB_auxwrap
#17 0x56577219 in luaD_precall
#18 0x565aecda in luaV_execute
#19 0x565762d4 in unroll
#20 0x56573c83 in luaD_rawrunprotected
(... frames #13~#19 repeat from here; skipped for brevity.)

---------------------------------------------------------------------------------------------------
  Hmm, that's weird. Before the stack is exhausted, the value of nCcalls should have become high enough to raise an error. HOWEVER, using gdb, I found that the value of nCcalls does not change from one call of the resume function to the next (it stays at 2). I suspected a problem in how pcall is handled, so I analyzed it further.

[Root cause]
  The reason for the crash is quite simple.

  When we resume a coroutine, the state of the caller is saved, including the value of nCcalls. You can find this logic in luaD_rawrunprotected and lua_resume in ldo.c. As the resume function itself also increases the value of nCcalls, this seems fine. However, a problem occurs when the first statement of the coroutine, "pcall(1)", raises an error.

  The pcall function is implemented by luaB_pcall, which calls lua_pcallk. Inside lua_pcallk, luaD_call is called instead of the conventional luaD_pcall (because the call is already protected by resume). luaD_call then leads directly to luaD_precall. At this point, the interpreter finds that the statement is wrong: it tries to call the non-function value '1'. luaD_tryfuncTM looks for a __call metamethod on the constant (which of course does not exist) and raises an error through luaG_callerror > typeerror > luaG_runerror > luaG_errormsg > luaD_throw > LUAI_THROW. Remember that the state was saved around LUAI_TRY, which means that after LUAI_THROW the value of nCcalls is finally reset to the saved value (in my case, 2).
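
  The save-and-restore behavior described above can be sketched with setjmp/longjmp. This is a toy model, not the code in ldo.c: ToyState, toy_throw, toy_runprotected and toy_failing_body are hypothetical names, and ToyState keeps only the two fields the sketch needs. It shows that an increment made inside a protected call is lost when the body throws.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Toy stand-in for lua_State (hypothetical, only what this sketch needs). */
typedef struct ToyState {
  int nCcalls;        /* toy counterpart of L->nCcalls */
  jmp_buf *errorJmp;  /* innermost active error handler, or NULL */
} ToyState;

typedef void (*ToyFunc)(ToyState *S);

/* Toy counterpart of LUAI_THROW: jump to the innermost handler. */
static void toy_throw(ToyState *S) {
  assert(S->errorJmp != NULL);  /* this sketch never throws unprotected */
  longjmp(*S->errorJmp, 1);
}

/* Toy protected call: returns 0 on success, 1 if f threw.
   The counter is saved on entry and restored on exit in both cases. */
static int toy_runprotected(ToyState *S, ToyFunc f) {
  int saved_nCcalls = S->nCcalls;
  jmp_buf *previous = S->errorJmp;
  jmp_buf here;
  int status = 0;
  S->errorJmp = &here;
  if (setjmp(here) == 0)
    f(S);        /* normal path */
  else
    status = 1;  /* reached through toy_throw */
  S->errorJmp = previous;
  S->nCcalls = saved_nCcalls;
  return status;
}

/* Body that counts its own call and then fails, like calling a non-function. */
static void toy_failing_body(ToyState *S) {
  S->nCcalls++;
  toy_throw(S);
}
```

  Starting from a state whose nCcalls is 2, toy_runprotected(&S, toy_failing_body) reports an error and leaves nCcalls at 2 again: the increment made by the body is gone.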

  With nCcalls reset to 2, the finishpcallk function returns the error status, and that status is handled in lua_resume by the precover function. Because the error happened inside pcall, precover lets the coroutine continue and call the next recursive coroutine instead of halting the whole interpreter. This is why the value of nCcalls does not increase across the recursive calls.

  In short, the root cause of the problem is:

1. Before resuming the coroutine, the interpreter saves the value of nCcalls around LUAI_TRY.
2. In the first statement of the coroutine, pcall triggers LUAI_THROW.
3. During step 2, the value of nCcalls is reset to the saved value.
4. As the first statement runs inside pcall, the coroutine continues.
5. The coroutine calls the next recursive coroutine, with nCcalls no higher than before.

  For details, look at the implementation of lua_resume, luaD_rawrunprotected, and precover in ldo.c.

---------------------------------------------------------------------------------------------------
[How to Patch (Suggestion)]
  I think the problem can be solved by adding one line to the lua_resume function: if luaD_rawrunprotected returns an error status, just increment nCcalls. I tested this patch with crash.lua and it works well (the stack exhaustion is reported as an error). However, as I'm not an expert on the implementation of Lua, this patch may have side effects that I can't foresee. I hope someone will solve the problem in the proper way.

---------------------------------------------------------------------------------------------------
[Before]
ldo.c: line 793, (lua_resume)

status = luaD_rawrunprotected(L, resume, &nargs);
status = precover(L, status);

[After]
ldo.c: line 793, (lua_resume)

status = luaD_rawrunprotected(L, resume, &nargs);
if (status != LUA_OK) L->nCcalls++;  /* compensate for the nCcalls reset triggered by the error inside pcall */
status = precover(L, status);
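
  The effect of the root cause steps and of the suggested patch can be compared in a toy simulation. This is only a sketch of the analysis in this post, not Lua code: toy_simulate, TOY_MAX_CCALLS and TOY_MAX_DEPTH are hypothetical names, and the depth cap exists only so that the simulation itself terminates (the real interpreter has no such guard, which is why it crashes).

```c
#include <assert.h>

#define TOY_MAX_CCALLS 200    /* assumed stand-in for LUAI_MAXCCALLS */
#define TOY_MAX_DEPTH 100000  /* simulation cap only, not part of Lua */

/* Simulates nested resumes whose body first fails inside pcall.
   Returns the depth at which the counter check fires, or -1 if it never
   fires within TOY_MAX_DEPTH levels (the real program would crash). */
static int toy_simulate(int start, int patched) {
  int caller = start;
  int depth;
  assert(start >= 0);
  for (depth = 0; depth < TOY_MAX_DEPTH; depth++) {
    int callee = caller;  /* the coroutine inherits the caller's count */
    int saved = callee;   /* value saved before the protected call */
    callee += 1;          /* resume counts the call inside the protection */
    if (callee >= TOY_MAX_CCALLS)
      return depth;       /* "C stack overflow" error is raised here */
    callee = saved;       /* pcall(1) throws: the counter is restored */
    if (patched)
      callee += 1;        /* suggested fix: count the call again */
    caller = callee;      /* recovered body resumes the next coroutine */
  }
  return -1;
}
```

  With a starting counter of 2, the unpatched model never reaches the limit (toy_simulate(2, 0) returns -1), while the patched model raises the error at depth 197 (toy_simulate(2, 1)).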

---------------------------------------------------------------------------------------------------
  Thanks for reading. Any comments are welcome. If you have a problem reproducing the error or following the analysis, feel free to comment. (You can reproduce it simply by running "lua crash.lua" in a bash shell.)

Found by: JIHOI KIM (team Nil Armstrong)