A high-precision profiling solution for Lua

I have read some documents about how to profile a Lua program's performance in order to find its bottlenecks.

The most commonly mentioned method is `lua_sethook` with `LUA_MASKCALL | LUA_MASKRET`, together with `lua_getinfo`.

We maintain a call tree built from the `lua_getinfo` results; a tree node looks like the following pseudocode.

    call_record_t = {
        call_record_t *parent;
        subs = {
            [KEY] = call_record_sub,
            [chunk1:line1] = call_record_sub1,
            [chunk2:line2] = call_record_sub2,
        },
        inclusive = 100 ms,
        exclusive = 30 ms,
        hitcount = 12 times,
        error_occur = 0 times,
    }

On LUA_HOOKCALL, we enter the sub call_record identified by `KEY` (creating it if it does not exist), and on LUA_HOOKRET, we exit back to the parent.

The `KEY` above is a signature that uniquely identifies a function; chunkname + linedefined is adequate to serve as the KEY.
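
For illustration, here is a minimal sketch of the enter/exit logic on such a tree. The names `call_record`, `record_enter` and `record_exit` are hypothetical, and the children are kept in a plain linked list instead of a hash table only to keep the example short.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout, modelled on the pseudocode above. */
typedef struct call_record {
    struct call_record *parent;
    struct call_record *first_child;   /* children as a simple linked list */
    struct call_record *next_sibling;
    char key[64];                      /* "chunkname:linedefined" */
    unsigned hitcount;
} call_record;

/* On a call event: find the child with this key, creating it if absent. */
static call_record *record_enter(call_record *cur, const char *key) {
    call_record *c;
    assert(cur != NULL);
    for (c = cur->first_child; c != NULL; c = c->next_sibling)
        if (strcmp(c->key, key) == 0)
            break;
    if (c == NULL) {
        c = calloc(1, sizeof *c);
        if (c == NULL)
            abort();                   /* out of memory: give up in this sketch */
        snprintf(c->key, sizeof c->key, "%s", key);
        c->parent = cur;
        c->next_sibling = cur->first_child;
        cur->first_child = c;
    }
    c->hitcount++;
    return c;
}

/* On a return event: go back to the parent (the root stays where it is). */
static call_record *record_exit(call_record *cur) {
    return cur->parent != NULL ? cur->parent : cur;
}
```

The string comparison in `record_enter` is exactly the kind of per-call cost that becomes a problem in the hook, as described below.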

Since this is a common method, I won't go into more detail. My problems are as follows:

1. In our project (a 3D game application), when I turn on the profiler, the FPS drops from 60 to 10~15; that is, it seriously slows the program down. The profiler isn't precise enough, because the profile hook itself affects the host program so much.

The likely reasons are that the hook runs very frequently and is slow: `lua_getinfo` is time-consuming, and the analyser routine has to compute the hash of the KEY to enter a sub call_record, and so on.

2. `call` and `return` events do not always come in pairs because of `error` and `resume/yield`, so I can't track the call/return flow!

So, for the first issue, I use multithreading. The host program's profile hook runs on the main thread and is responsible only for sampling; it pushes each sample to a queue. The analyser thread pops the samples from the queue and does the computation. Here is a further optimization:

When Lua compiles a function into a `Proto *`, I allocate an ID for the prototype, like this:
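
As a rough sketch of the queue between the hook and the analyser thread, a single-producer/single-consumer ring buffer with C11 atomics could look like the following. All names here (`sample_t`, `sample_queue`, `queue_push`, `queue_pop`, `QUEUE_CAP`) are made up for the example, and it assumes exactly one thread pushes and exactly one thread pops.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 1024u            /* must be a power of two */

typedef struct {
    int event;                     /* call, return, ... */
    int proto_id;                  /* which function the event refers to */
} sample_t;

typedef struct {
    sample_t buf[QUEUE_CAP];
    atomic_size_t head;            /* advanced only by the consumer */
    atomic_size_t tail;            /* advanced only by the producer */
} sample_queue;

/* Producer side (the hook). Returns false if the queue is full. */
static bool queue_push(sample_queue *q, sample_t s) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_CAP)
        return false;
    q->buf[tail & (QUEUE_CAP - 1)] = s;
    /* release: the sample is visible before the new tail is */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side (the analyser thread). Returns false if the queue is empty. */
static bool queue_pop(sample_queue *q, sample_t *out) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = q->buf[head & (QUEUE_CAP - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

A zero-initialized `static sample_queue` is a valid empty queue. What the hook should do when `queue_push` returns false (drop the sample, spin, or grow the buffer) is a design choice this sketch leaves open.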

    static void body (LexState *ls, expdesc *e, int ismethod, int line) {
        /* body -> '(' parlist ')' block END */
        FuncState new_fs;
        BlockCnt bl;
        new_fs.f = addprototype(ls);
        new_fs.f->linedefined = line;
        open_func(ls, &new_fs, &bl);
        /* added: assign an ID to the new prototype */
        new_fs.f->id = luaQ_addhistoryproto(G(ls->L), getstr(new_fs.f->source), line);
        checknext(ls, '(');
        ...
    }

`luaQ_addhistoryproto` keeps the mapping from ID to {chunkname, linedefined}, because the `Proto` itself may be destroyed by the GC.

I push only the ID to the queue, which is **cheap** compared with pushing chunkname and linedefined, and looking up the sub call_record by ID is **cheaper** than hashing chunkname and linedefined. Because the analysis is asynchronous, we have to push the prototype's ID instead of the address of `Proto *p`: the prototype may already have been collected by the time the analyser handles the sample. We keep the mapping from ID to chunkname and linedefined, and the analyser retrieves that information with `arr[call_record->ID]`. Before Lua calls the hook, we execute `ar.proto_id = p->id`.
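
To make the ID-to-source mapping concrete, here is a minimal standalone sketch of such a registry. It is not the actual `luaQ_addhistoryproto`; the names `proto_info`, `proto_registry` and `registry_add` are hypothetical, and the chunkname is copied so that it outlives the `Proto`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *chunkname;               /* owned copy, independent of the Proto */
    int linedefined;
} proto_info;

typedef struct {
    proto_info *arr;               /* indexed by prototype ID */
    int count;
    int cap;
} proto_registry;

/* Registers a prototype and returns its ID, or -1 on allocation failure. */
static int registry_add(proto_registry *r, const char *chunkname, int line) {
    size_t len;
    char *copy;
    if (r->count == r->cap) {
        int ncap = r->cap ? r->cap * 2 : 64;
        proto_info *na = realloc(r->arr, (size_t)ncap * sizeof *na);
        if (na == NULL)
            return -1;
        r->arr = na;
        r->cap = ncap;
    }
    len = strlen(chunkname) + 1;
    copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, chunkname, len);
    r->arr[r->count].chunkname = copy;
    r->arr[r->count].linedefined = line;
    return r->count++;
}
```

Note that `realloc` may move the array, so if the analyser thread reads `arr` while the main thread is still compiling new functions, the lookup needs some synchronization (or a table that never moves its entries).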

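    struct lua_Debug {
        ...
        int ci_depth;          /* added: depth of the current CallInfo */
        ...
        union {                /* added: identifies the called function */
            int proto_id;
            lua_CFunction *f;
        };
        int functype;          /* added: tells which union member is valid */
    };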

If a C function is called, we also do `ar.f = CLcvalue(ci->func);`. `functype` plus the union identify the function being called. `ar.f` is useful for mapping the address to a C function name via debugging info in the final profile report.

The ultimate goal is to reduce the side effects of the analyser routine. I would rather not call `lua_getinfo`, but without it, the caller name (`ar->what`, `ar->name`) of Lua functions will be absent from the final report! A pity, isn't it?

For the second issue, I add a new field `ci_depth` to `CallInfo` (mirrored by `cur_depth` in the analyser) to track the call depth, and I define a new hook event, LUA_HOOKRESETCI:

    CallInfo {
        ...
        int ci_depth;
    }

    CallInfo *luaE_extendCI (lua_State *L) {
        ...
        L->nci++;
        ci->ci_depth = ci->previous->ci_depth + 1;  /* record the depth here */
        return ci;
    }

    int luaD_pcall (lua_State *L, Pfunc func, void *u,
                    ptrdiff_t old_top, ptrdiff_t ef) {
        ...
        status = luaD_rawrunprotected(L, func, u);
        if (status != LUA_OK) {  /* an error occurred? */
            ...
            ar.event = LUA_HOOKRESETCI;  /* new hook event here */
            ar.i_ci = old_ci;
            ar.ci_depth = old_ci->ci_depth;
            ...
            L->hook(L, &ar);
            ...
        }
        ...
    }

The hook then tells the analyser to relocate its position in the call_record hierarchy by walking up the `call_record_t *parent` links:

    void analyser(int event, int ci_depth, int functype, union {int, lua_CFunction *} u, ...)
    {
        switch (event) {
        case LUA_HOOKCALL:
            cur_depth++;
            assert(cur_depth == ci_depth);
            if (functype is lua function)
                cur_record = cur_record->subs[u.proto_id];  /* create if absent */
            if (functype is c function)
                cur_record = ...
            break;
        case LUA_HOOKRET:
            cur_depth--;
            assert(cur_depth == ci_depth);
            cur_record = cur_record->parent;
            break;
        case LUA_HOOKRESETCI:
            /* unwind to the depth of the pcall that caught the error;
               decrement inside the loop so cur_depth ends equal to ci_depth */
            while (cur_depth > ci_depth)
            {
                cur_record->error_occur++;
                cur_record = cur_record->parent;
                cur_depth--;
            }
            break;
        }
    }
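
The unwinding step on LUA_HOOKRESETCI can be tried in isolation with a small standalone sketch. The types `record` and `analyser_state` and the function `analyser_reset` are hypothetical stand-ins for the analyser's real state. The depth is decremented inside the loop body, so `cur_depth` ends exactly at the target depth; with a post-decrement in the loop condition it would end one below it.

```c
#include <assert.h>
#include <stddef.h>

typedef struct record {
    struct record *parent;
    int error_occur;               /* times this frame was cut off by an error */
} record;

typedef struct {
    record *cur;                   /* current position in the call tree */
    int cur_depth;                 /* depth the analyser believes it is at */
} analyser_state;

/* Unwind the analyser to ci_depth after an error was caught by a pcall. */
static void analyser_reset(analyser_state *a, int ci_depth) {
    while (a->cur_depth > ci_depth && a->cur->parent != NULL) {
        a->cur->error_occur++;     /* this frame never saw its return event */
        a->cur = a->cur->parent;
        a->cur_depth--;
    }
}
```

For example, with a chain root <- r1 <- r2 <- r3 and the analyser at r3 (depth 3), `analyser_reset(&a, 1)` leaves it at r1 with `cur_depth == 1` and marks r2 and r3 as interrupted by an error.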

So in total, two extra values are passed to the sampling hook: the current CallInfo depth and the prototype ID / C function address.

Forgive my poor English.

Any suggestions?

Best regards!