- Subject: Re: Do LuaJit 2 support more than 2GB memory allocation on FreeBSD/x64 now?
- From: Dimiter 'malkia' Stanev <malkia@...>
- Date: Wed, 18 Apr 2012 11:26:25 -0700
This is great information, and I've learned a bit more about GC.
Mike, would it be possible to include some kind of stats (or would that
be wasteful of CPU resources)? For example, a count table showing how
many objects of each type, or of each group, have been found to be still alive:
print( gcstat["table"] ) -> 5000 tables
print( gcstat["thread"] ) -> 100 threads
or as a total:
print( gcstat ) -> 5100 (threads + tables).
> 0: nil, boolean, number, lightuserdata
> 1: string, cdata (LuaJIT only)
> 2: upvalue, userdata, closure, prototype (medium fan-out)
> 3: table (hash, array), thread (stack)
Even some heuristic would be a good enough start?
This would definitely change my way of doing things (I'm dealing with
lots of geometry - meshes, vertices), and I think I'll stick to my guns:
use "C" structures and arrays for data that is cooked, Lua structures
for live editing of it, and functions to transfer from one to the other.
On 4/18/2012 10:12 AM, Mike Pall wrote:
Miles Bader wrote:
> Mike Pall <firstname.lastname@example.org> writes:
>>> I just want to make sure whether LuaJIT 2 now supports more
>>> than 2GB of memory allocation on FreeBSD/x64 or not.
>> Rest assured, you do NOT want to allocate more than that amount of
>> Lua objects. The garbage collector in LuaJIT 2.0 (which is the
>> same as Lua 5.1) is simply not up to the task. Your program would
>> most certainly slow to a crawl with that many live objects.
> Hmm, I've exceeded LuaJIT's memory limits, and haven't noticed any
> _obvious_ slowness as I approached the limit...
The main challenge for a GC on modern CPUs is to avoid cache
thrashing for large datasets. There are several things to consider:
* All objects in a garbage collected VM form a connected graph, so
it makes sense to differentiate them by their graph theoretic
aspects, which has a direct impact on their cache effect:
0. value: no extra node
1. box: zero fan-out
2. link: low fan-out
3. list: high fan-out
All current versions of Lua and LuaJIT use variants of a
tagged-value representation, which means class 0 is non-empty.
Others, e.g. CPython, don't use class 0 (every object reference
is a pointer) or use it to a lesser extent. E.g. most VMs have
to box FP numbers, whereas Lua has them 'inline' as a value.
Lua/LuaJIT object types fall into the following categories:
0: nil, boolean, number, lightuserdata
1: string, cdata (LuaJIT only)
2: upvalue, userdata, closure, prototype (medium fan-out)
3: table (hash, array), thread (stack)
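To make the tagged-value point concrete, here's a minimal sketch loosely in the spirit of Lua's TValue (names and layout are illustrative, not the real struct). The tag tells the collector the category of a reference before any pointer is followed: numbers (category 0) carry no GC node at all, strings (category 1) point to a markable box, tables (category 3) point to a traversable object.

```c
/* Illustrative tagged-value representation. */
typedef enum { TAG_NIL, TAG_BOOL, TAG_NUM,    /* category 0: values */
               TAG_STR,                        /* category 1: box    */
               TAG_TAB } Tag;                  /* category 3: list   */

typedef struct {
    Tag tag;
    union {
        double n;     /* FP numbers live 'inline', no boxing */
        int b;
        void *gc;     /* collectable types: pointer to the GC object */
    } u;
} TValue;

/* Only tags at or above TAG_STR refer to a heap object at all. */
static int is_collectable(const TValue *v) { return v->tag >= TAG_STR; }
```

The `is_collectable` test is the cheap early-out the rest of the discussion relies on: for category 0 values the GC never even dereferences a pointer.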
* The mark phase of a GC has to follow the graph from all GC roots,
traverse all live objects and mark them. With that in mind, you
can estimate the amount of work that needs to be done for each
category. The following discussion assumes you know the category
of each reference before following any pointer, which is true
for a tagged-value representation.
For category 0, you don't have to follow a pointer and there's
nothing to mark. For category 1, you have to mark the object.
You only have to follow the pointer if the mark is part of the
object itself (true for Lua GC). There are other options for
marking an object (separate GC metadata or card marking) which
avoid the overhead of pointer following for category 1 objects.
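The per-reference work described above can be sketched as a small dispatch on the category, assuming (as in a tagged representation) the category is known up front. This models the Lua-style case where the mark bit lives in the object itself, so even category 1 needs one pointer follow; the struct layout is illustrative only.

```c
/* Per-reference mark-phase work, dispatched on category (0..3). */
typedef struct GCobj {
    int marked;
    int category;              /* 1 = box, 2/3 = has outgoing refs */
} GCobj;

/* Simplified gray queue; a real GC traverses queued objects later. */
static GCobj *gray[16];
static int ngray;

static void mark_ref(int category, GCobj *o) {
    if (category == 0)         /* value: no node, nothing to do */
        return;
    if (o->marked)             /* already seen */
        return;
    o->marked = 1;             /* category 1 stops here: mark only */
    if (category >= 2)         /* category 2/3: traverse children later */
        gray[ngray++] = o;
}
```

With separate GC metadata, the `o->marked` read/write would move out of the object, and the category-1 path would not touch the object's memory at all.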
For category 2 and 3 you always have to follow the pointer and
queue the outgoing references for later marking (shortcut
possible for fan-out 1, e.g. upvalues). Queuing objects with a
linked-list GC (Lua GC) is simple, but may be expensive for
large category 3 objects. This may cause evictions and then all
objects need to be loaded into the cache twice. Adding only the
pointers themselves to a growable list is more cache-conscious,
but probably only makes sense for an area-based GC (otherwise
you're wasting the memory for the links).
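A sketch of the two queuing strategies just contrasted: the Lua-style way threads the gray queue through the objects themselves via an intrusive link field (no extra memory, but queuing writes into the object, pulling it into the cache before traversal), while the alternative appends bare pointers to a growable array (more cache-conscious, at the cost of extra memory per queued object). Field names are illustrative.

```c
#include <stddef.h>

/* Intrusive linked-list gray queue, as in Lua's GC. */
typedef struct Obj {
    struct Obj *gclist;        /* next gray object */
} Obj;

static Obj *grayhead;

static void push_gray_list(Obj *o) {
    o->gclist = grayhead;      /* writes into the object itself */
    grayhead = o;
}

static Obj *pop_gray_list(void) {
    Obj *o = grayhead;
    if (o)
        grayhead = o->gclist;
    return o;
}

/* Pointer-array alternative: nothing is written into the object. */
static Obj *graystack[16];
static int ntop;

static void push_gray_stack(Obj *o) { graystack[ntop++] = o; }
static Obj *pop_gray_stack(void)   { return ntop ? graystack[--ntop] : NULL; }
```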
Summary: the overhead for a GC is highly dependent on the
structure of the object graph, which can be estimated by the
categories of its live objects.
* The overhead for all variants of the classic mark-sweep GC
algorithm is dominated by the size of the live set of GC
objects. Either by their number or by their size, depending on
the GC design and the object category.
The overhead of generational GCs is dominated by the size of the
recently modified part of the live set. This only helps if that
is substantially lower than the total size of the live set.
Otherwise the extra overhead for keeping the generations may
not pay off.
The overhead of moving GCs is mostly dominated by the size of
objects, independent of their category. OTOH non-moving GCs can
mostly ignore category 1 objects.
Linked-list GCs (such as Lua and LuaJIT up to 2.0) suffer more
from a large number of small objects. The traversed links are
often quite random wrt. memory addresses, which renders hardware
cache prefetching ineffective. A segregated, area-based GC with
separate GC metadata (plan for LuaJIT 2.1) can mitigate this
effect without penalizing category 1 objects.
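A sketch of the separate-GC-metadata idea: mark bits live in a side bitmap indexed by an object's slot within its area, so marking a category 1 object (e.g. a string) never touches the object's own memory. This is purely illustrative of the technique, not LuaJIT 2.1's actual design.

```c
#include <stdint.h>

/* Side bitmap of mark bits for one area of SLOTS objects. */
#define SLOTS 256

static uint8_t markbits[SLOTS / 8];

static void mark_slot(unsigned slot) {
    markbits[slot >> 3] |= (uint8_t)(1u << (slot & 7));
}

static int is_marked(unsigned slot) {
    return (markbits[slot >> 3] >> (slot & 7)) & 1;
}
```

The whole bitmap for an area stays compact (32 bytes here for 256 objects), so the mark phase's working set for category 1 objects is tiny regardless of how big the objects themselves are.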
> Much of the memory use is in the form of humongous Lua strings (which
> I then pass as input to LPEG), but there's plenty of other stuff as
> well... I don't think the strings account for more than about 1GB or so.
> [I'm guessing strings don't count so much because they're single large
> objects?]
Strings are category 1 objects, which means they only need to be
marked, but not traversed. The current Lua and LuaJIT VMs use a
non-moving, linked-list GC, so the size of strings is mostly
irrelevant as far as the GC is concerned.
Thus if your memory usage is dominated by huge strings, you're
unlikely to see the GC slowdown I described. OTOH if you had tons
of tables, taking up gigabytes of memory, you'd certainly notice.