On 3 May 2012, at 16:00, Mike Pall wrote:

> Arran Cudbard-Bell wrote:
>> The string key and the return value are easy to represent in
>> cdata, but I can't figure out if it's possible to get or store
>> pointers to lua objects. GC isn't a problem because the tables
>> are referenced in other places.
> 
> Well, that's not gonna fly. Either go all the way and use cdata
> everywhere (except for rooting it somewhere in a table) or don't.
> You can't anchor, nor store, nor retrieve GC objects from cdata
> objects.
> 
> [One reason is that the compiler makes use of its knowledge about
> separate object domains for optimizations. E.g. storing something
> to a cdata object cannot possibly affect the contents of a Lua
> table. This in turn means a load from a hash table can be hoisted
> out of a loop across arbitrary cdata accesses and even across
> C calls performed via the FFI. Not so with the classic Lua/C API,
> where all guarantees are off.]

Understood. Thanks for the explanation.

> 
>> I've not tried embedding LUA in another application yet, so
>> maybe what i'm looking for is somewhere in the C API, if this is
>> the case then i'll gladly go RTFM, just want to know if it's
>> possible?
> 
> You'll certainly want to stay away from the classic Lua/C API as
> far as possible, if you want performance from LuaJIT. Try to use
> pure Lua code everywhere as much as possible. Call C functions
> only via the FFI, but never via the classic Lua/C API. Also, avoid
> callbacks from C at all cost. And only resort to C when it's an
> existing API, but don't try to prematurely convert something to
> C code, because you think it might be slow in Lua. And/or undo
> such conversions, in case you made them for the Lua interpreter.

Yes, I made a point of reviewing the archived pearls of LuaJIT wisdom from the Lua list :)

As much as possible, the code avoids nested and type-unstable loops, caches array/hash lookups, and avoids unnecessary calls to C functions.
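To illustrate what caching lookups means here, a minimal sketch in plain Lua. The `store` table and both function names are hypothetical and are not taken from rti.lua; the point is only that the inner table is resolved once and the loop works on locals of a single type.

```lua
-- Hypothetical datastore: a nested table keyed by field name.
local store = { counters = { packets = 0, bytes = 0 } }

-- Uncached: store.counters is looked up again on every access in the loop.
local function count_uncached(sizes)
  for i = 1, #sizes do
    store.counters.packets = store.counters.packets + 1
    store.counters.bytes = store.counters.bytes + sizes[i]
  end
end

-- Cached: resolve the inner table once, accumulate in locals, write back once.
local function count_cached(sizes)
  local counters = store.counters
  local packets, bytes = counters.packets, counters.bytes
  for i = 1, #sizes do
    packets = packets + 1
    bytes = bytes + sizes[i] -- sizes holds only numbers, so the types never change
  end
  counters.packets, counters.bytes = packets, bytes
end

count_uncached({ 60, 1500 })
count_cached({ 64, 128, 256 })
print(store.counters.packets, store.counters.bytes)
```

Both functions compute the same totals; only the number of table accesses inside the loop differs.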

Outside of system calls (which are in a different coroutine) and calls to libpcap via FFI, the only C functions we have are for string manipulation (which causes trace aborts in LuaJIT anyway). The main overhead from this function (we have some benchmarking stuff using nanosecond timers) is actually the call to ffi.string to convert the char array to a Lua string (something like 0.4 microseconds for the call and 0.6 microseconds for the conversion).

> 
>> * Slow is sort of relative here, the system can do about 130K
>> TPS traversing 5 nodes, and inserting 5 fields (yey LuaJIT), but
>> were looking for something in the range of 500K. Almost
>> certainly going to have to move to using CDATA instead of lua
>> tables at some point, but thats more complex, and this is more
>> of a POC currently.
> 
> Judging from past experience with these kind of integration
> projects, I guess your performance is most likely dominated by API
> friction (callbacks from C, marshalling to C calls) and not by the
> data structures themselves. Check the compiler output with -jv or
> -jdump or the equivalent Lua code: require("jit.dump").start()

The base rate of libpcap when called from Lua via FFI on my box is about 1.2M PPS (from capture file).

After performing packet dissection and converting some field values to Lua strings, performance drops to around 700K PPS. Once we introduce the datastore, performance is between 80K and 220K, depending on whether new nodes are created.

The compiler trace for the artificial capture that gives 80K learns per second looks pretty clean:

[TRACE   1 rti.lua:1501 loop]
[TRACE   2 rti.lua:1006 loop]
[TRACE   3 (1/0) rti.lua:1502 -> 1]
[TRACE   4 (1/5) rti.lua:1502 -> 1]
[TRACE   5 (2/2) rti.lua:1007 -> 2]
[TRACE   6 (3/0) rti.lua:1502 loop]
[TRACE   7 rti.lua:1242 return]
[TRACE   8 (5/1) rti.lua:1006 -> 1]
[TRACE   9 (4/0) rti.lua:1502 -> 1]
[TRACE  10 (6/5) rti.lua:1358 -> 2]
[TRACE  11 (6/8) rti.lua:1502 -> 1]
[TRACE --- rti.lua:1276 -- leaving loop in root trace at rti.lua:1280]
[TRACE  12 (6/0) rti.lua:1502 loop]
[TRACE  13 (9/3) rti.lua:1312 -> 1]
[TRACE --- rti.lua:1608 -- inner loop in root trace at rti.lua:1502]
[TRACE  14 (7/0) rti.lua:1244 -> 1]
[TRACE  15 (12/20) rti.lua:1502 -> 1]
[TRACE  16 (12/0) rti.lua:1502 -> 1]

The abort at 1276 is a genuine nested loop that's only sometimes executed, and the other abort is on entering the insertion function itself.

I'm really at a loss as to how to improve performance, or even get stable performance between code modifications.

It seems the areas of the code being traced change wildly depending on whether functions are locally scoped, and on other seemingly random things like re-using variable names, even if the variables are locally scoped.

It could be other factors influencing the performance, but the average of multiple benchmarks taken before and after the changes shows they do have a real and consistent effect.
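For reference, the kind of before/after comparison described above can be sketched roughly as follows. This is not the actual harness: it uses os.clock (CPU time) instead of the nanosecond timers, and `workload` is a hypothetical stand-in for the insertion path. Reporting the spread next to the mean makes it easier to judge whether a difference between two code versions is larger than the run-to-run noise.

```lua
-- Run fn `iterations` times and return the elapsed CPU seconds.
local function time_once(fn, iterations)
  local start = os.clock()
  for _ = 1, iterations do
    fn()
  end
  return os.clock() - start
end

-- Repeat the measurement `runs` times; return mean and sample standard deviation.
local function benchmark(fn, iterations, runs)
  local samples, sum = {}, 0
  for r = 1, runs do
    samples[r] = time_once(fn, iterations)
    sum = sum + samples[r]
  end
  local mean = sum / runs
  local sq = 0
  for r = 1, runs do
    local d = samples[r] - mean
    sq = sq + d * d
  end
  local stddev = runs > 1 and math.sqrt(sq / (runs - 1)) or 0
  return mean, stddev
end

-- Hypothetical workload: append to a table, standing in for a datastore insert.
local t = {}
local function workload()
  t[#t + 1] = #t
end

local mean, stddev = benchmark(workload, 100000, 5)
print(string.format("mean %.6f s, stddev %.6f s", mean, stddev))
```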

Does tighter variable and function scope help the JIT optimize that much? Should I be looking at reducing the number of 'public' (non-local) functions and variables in modules?
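To make the question concrete, these are the two shapes being compared, as a minimal sketch with hypothetical names (`add_global`, `add_local`). In plain Lua semantics, a call to a global function looks the name up in the globals table at each call, while a local function is reached through a local slot or upvalue. Whether that difference survives JIT compilation or changes which traces get recorded is exactly what is being asked, so the sketch makes no claim about it.

```lua
-- Variant A: a 'public' function stored in the globals table.
function add_global(a, b)
  return a + b
end

-- Variant B: the same function, visible only within this file.
local function add_local(a, b)
  return a + b
end

local function run_global(n)
  local acc = 0
  for i = 1, n do
    acc = add_global(acc, i) -- name resolved through the globals table
  end
  return acc
end

local function run_local(n)
  local acc = 0
  for i = 1, n do
    acc = add_local(acc, i) -- resolved as an upvalue, no table lookup
  end
  return acc
end

print(run_global(1000), run_local(1000))
```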

-Arran