I've been playing around a bit with benchmarking Lua coroutines.

Creating a Lua thread allocates three major data structures: a stack, a callinfo stack, and a lua_State. The memory requested for each of these, on x86 with double as the number type, is as follows (in work6):

stack:   45 slots @ 12 bytes/slot  540
cistack:  8 slots @ 24 bytes/slot  192
state:                             192

for a total of 924 bytes requested.

However, the FreeBSD system malloc() always allocates blocks whose size is a power of two; consequently, these three mallocs actually consume 1.5K (1024 + 256 + 256 bytes).

In particular, the stack allocation is almost pessimal: 540 bytes is only just over 512, so nearly half of its 1K block is wasted.
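The rounding arithmetic above can be sketched as follows. This is only a model of the power-of-two behaviour described, not FreeBSD's actual allocator code, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round a request up to the next power of two (unchanged if it
   already is one). Models the allocator behaviour described in
   the text; an assumption for illustration only. */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    assert(n > 0);
    while (p < n)
        p <<= 1;
    return p;
}

/* Bytes actually consumed by the three allocations of a new thread. */
size_t thread_footprint(size_t stack_bytes, size_t ci_bytes,
                        size_t state_bytes)
{
    return round_up_pow2(stack_bytes)
         + round_up_pow2(ci_bytes)
         + round_up_pow2(state_bytes);
}
```

With the figures quoted, thread_footprint(540, 192, 192) comes to 1536 bytes, against 924 bytes requested.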

A slight adjustment in lua.h makes a notable difference. Changing LUA_MINSTACK from 20 to 18, which has virtually no performance impact as far as I can see, reduces the initial stack from 45 slots (540 bytes) to 41 slots (492 bytes), halving the memory actually consumed by the stack (512 bytes instead of 1K). Furthermore, this continues to be beneficial as the stack grows, since it typically doubles in size on every reallocation:

default (MINSTACK = 20):
  initial stack     45 slots   alloc:  540 bytes   used:  1k
  first increment   90 slots   alloc: 1080 bytes   used:  2k
  second increment 180 slots   alloc: 2160 bytes   used:  4k
  third increment  360 slots   alloc: 4320 bytes   used:  8k
  fourth increment 720 slots   alloc: 8640 bytes   used: 12k*

* allocations of more than one page (4k on x86) are rounded up
  to a whole number of pages

adjusted (MINSTACK = 18):
  initial stack     41 slots   alloc:  492 bytes   used:  512 bytes
  first increment   82 slots   alloc:  984 bytes   used:  1k
  second increment 164 slots   alloc: 1968 bytes   used:  2k
  third increment  328 slots   alloc: 3936 bytes   used:  4k
  fourth increment 656 slots   alloc: 7872 bytes   used:  8k
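
The two tables above can be reproduced with a short calculation. This is a sketch under stated assumptions: the 5 extra slots are inferred from the figures (45 slots minus 2*20), the rounding rule is the one described in this post rather than taken from the allocator source, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_BYTES  ((size_t)12)    /* bytes per stack slot, x86/double */
#define EXTRA_SLOTS ((size_t)5)     /* inferred: 45 slots - 2*20 */
#define PAGE_BYTES  ((size_t)4096)  /* page size on x86 */

/* Stack size in slots after the given number of doublings. */
size_t stack_slots(size_t minstack, unsigned doublings)
{
    size_t slots = 2 * minstack + EXTRA_SLOTS;
    while (doublings-- > 0)
        slots *= 2;
    return slots;
}

/* Bytes consumed by a request: the next power of two up to one
   page, a whole number of pages beyond that. */
size_t consumed_bytes(size_t request)
{
    size_t p = 1;
    assert(request > 0);
    if (request > PAGE_BYTES)
        return (request + PAGE_BYTES - 1) / PAGE_BYTES * PAGE_BYTES;
    while (p < request)
        p <<= 1;
    return p;
}
```

For example, consumed_bytes(stack_slots(20, 4) * SLOT_BYTES) gives 12288 (12k), while the same expression with 18 in place of 20 gives 8192 (8k).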

An alternative is to leave LUA_MINSTACK at 20 and instead change BASIC_STACK_SIZE, which is defined in src/lstate.h as (LUA_MINSTACK*2). It could be set to 36 (or even 37); however, that would be a somewhat more fragile change.

Also in src/lstate.h, BASIC_CI_SIZE is defined as 8. On FreeBSD, changing this to 5 halves the initial allocation for the cistack but, for reasons which are not clear to me, some of my benchmarks slow down by up to 7% with this change (although this is partially compensated for by a 20% improvement in the time to create the threads). My first guess was that the change was leading to repeated reallocations of the cistack in the thread scheduler, but it turned out that the code in lgc.c which might shrink the cistack was never being called.
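
The candidate values 37 and 5 mentioned above follow from asking how many elements fit in a given power-of-two block. A minimal sketch, with hypothetical function names and the 5 extra stack slots inferred from the figures earlier in this post:

```c
#include <assert.h>
#include <stddef.h>

/* Largest number of elements of elem_bytes each that fit in a
   block of block_bytes. */
size_t max_elems(size_t block_bytes, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    return block_bytes / elem_bytes;
}

/* Largest BASIC_STACK_SIZE whose stack, including the extra slots,
   still fits in block_bytes. */
size_t max_basic_stack_size(size_t block_bytes, size_t slot_bytes,
                            size_t extra_slots)
{
    size_t slots = max_elems(block_bytes, slot_bytes);
    assert(slots >= extra_slots);
    return slots - extra_slots;
}
```

Here max_basic_stack_size(512, 12, 5) is 37, and max_elems(128, 24) is 5, the largest BASIC_CI_SIZE that fits in a 128-byte block.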

In any event, this minor exercise in tuning reduced the RSS for 100,000 threads from 150MB to 100MB (with only the change to MINSTACK) or 88MB (with both changes). (These figures also include an allocation for a table containing all the threads, and a closure for each thread, created by coroutine.wrap.) The 50MB saving (a third of the RSS) resulting from changing MINSTACK from 20 to 18 strikes me as worthwhile, particularly as it also ran slightly faster on all benchmarks.
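
As a rough cross-check of these RSS figures, the per-thread footprints under the power-of-two model (1536 bytes by default, 1024 with MINSTACK = 18, and 896 with the cistack change as well) can be multiplied out. The function name is hypothetical, and the model ignores the table and closure overhead mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Expected drop in memory when each of `threads` coroutines
   shrinks from `before` to `after` bytes. */
size_t bytes_saved(size_t threads, size_t before, size_t after)
{
    assert(before >= after);
    return threads * (before - after);
}
```

bytes_saved(100000, 1536, 1024) is 51,200,000 bytes, close to the 50MB measured; bytes_saved(100000, 1536, 896) is 64,000,000 bytes, close to the 62MB measured with both changes.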

The Linux malloc() is quite different from the FreeBSD malloc(), and the Windows and Mac OS X allocators will be tuned differently again. I haven't yet had a chance to play with OSes other than FreeBSD, but I suspect that the difference will be less marked.