- Subject: Tuning for large number of Lua threads on FreeBSD
- From: Rici Lake <lua@...>
- Date: Fri, 5 Aug 2005 12:53:12 -0500
I've been playing around a bit with benchmarking Lua coroutines.
Creating a Lua thread allocates three major data structures: a stack, a
callinfo stack, and a lua_State. The memory used by these, on
x86/double, is as follows (in work6):
  stack:    45 slots @ 12 bytes/slot   540
  cistack:   8 slots @ 24 bytes/slot   192
for a total of 924 bytes in use.
However, the FreeBSD system malloc() always allocates blocks whose size
is a power of two; consequently, these three mallocs actually consume
rather more than that. In particular, the stack allocation is almost
pessimal: a 540-byte request lands in a 1k block.
A slight adjustment in lua.h makes a notable difference; changing
LUA_MINSTACK from 20 to 18, which has virtually no performance impact
as far as I can see, reduces the initial stack from 45 slots (540
bytes) to 41 slots (492 bytes), halving the memory actually used by the
stack (512 bytes instead of 1k).
Furthermore, this continues to be beneficial as the stack grows, since
it typically doubles in size on every reallocation:
default (MINSTACK = 20):
  initial stack       45 slots   alloc:  540 bytes   used: 1k
  first increment     90 slots   alloc: 1080 bytes   used: 2k
  second increment   180 slots   alloc: 2160 bytes   used: 4k
  third increment    360 slots   alloc: 4320 bytes   used: 8k
  fourth increment   720 slots   alloc: 8640 bytes   used: 12k*

  * allocations of more than one page -- 4k on x86 -- are rounded
    up to an integer number of pages

adjusted (MINSTACK = 18):
  initial stack       41 slots   alloc:  492 bytes   used: 512 bytes
  first increment     82 slots   alloc:  984 bytes   used: 1k
  second increment   164 slots   alloc: 1968 bytes   used: 2k
  third increment    328 slots   alloc: 3936 bytes   used: 4k
  fourth increment   656 slots   alloc: 7872 bytes   used: 8k
An alternative is to leave LUA_MINSTACK as 20, but change
BASIC_STACK_SIZE in src/lstate.h, which is (LUA_MINSTACK*2). This could
be changed to 36 (or even 37); however, that would be a somewhat more
Also in src/lstate.h, BASIC_CI_SIZE is defined as 8. On FreeBSD,
changing this to 5 halves the initial allocation for the cistack, but
for reasons which are not clear to me, some of my benchmarks slow down
by up to 7% with this change (although this is partially compensated
for by a 20% improvement in the time to create the threads). My first
guess was that the change was leading to repeated reallocations of the
ci-stack in the thread scheduler, but it turned out that the code in
lgc.c which might shrink the ci-stack was never being called.
In any event, this minor exercise in tuning reduced the RSS for 100,000
threads from 150MB to 100MB (with only the change to MINSTACK) or 88MB
(with both changes). (This also includes an allocation for a table
containing all the threads, and for a closure for each thread created
by coroutine.wrap.) The 50MB saving (one third of the total) resulting
from changing MINSTACK from 20 to 18 strikes me as worthwhile,
particularly as it also ran slightly faster on all benchmarks.
The Linux malloc() is quite different from the FreeBSD malloc().
Windows and Mac OS X will also have different tuning optimizations. I
haven't yet had a chance to play with OSes other than FreeBSD, but I
suspect that the difference will be less marked.