On 2018-05-27 21:22, Sven Olsen wrote:
Interesting. After mucking around in the VM for the purposes of applying your patch, I've started daydreaming about writing some instrumentation hooks of my own.

*snip*

Do you have any words of wisdom for someone just starting down this path? (It sounds like, maybe, implementing some sort of sampling-based instrumentation using OP_HALT-like hooks turns out to be faster than hacking a new switch into the core VM?)

A while ago I hacked a stupid-simple profiler into Lua - just constantly
and unconditionally barfs tons of info into fd3 (or /dev/null, if you
don't assign that from the shell (3>somewhere))… It's counting all
calls, instructions (one counter per instruction), allocations
(increases counters for _all_ functions on the call stack), loads
(load,loadstring,require,… incl. dumping the loaded code), … and the
counters get dumped whenever the thing is GCed/freed.  All of that
causes it to run roughly 1.5-1.7x as long.  (Additionally dumping a full
stack trace every Nth call makes it… 1.7x (every 107th), 2.8x (every
11th), 13x (trace _ALL_ the calls!) slower in total for a very
call-heavy program (~70M calls, otherwise just summing values) with
mostly <=5 stack depth.)

So if you have some idea of what info you need, you can probably afford
to have that unconditionally enabled in a profiling build.
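
A rough sketch of the "fd 3 or /dev/null" output idea, assuming a POSIX
system. prof_open_sink and prof_emit are made-up names, and this is only
one way such a sink could be set up, not necessarily how the hack did it:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Return `wanted` if that descriptor is already open (e.g. the shell
   set it up with 3>somewhere), otherwise a fresh descriptor for
   /dev/null.  Returns -1 only if /dev/null cannot be opened. */
int prof_open_sink(int wanted) {
  assert(wanted >= 0);
  if (fcntl(wanted, F_GETFD) != -1)
    return wanted;                      /* descriptor is open, use it */
  return open("/dev/null", O_WRONLY);   /* nobody is listening */
}

/* Write one record.  Errors are ignored on purpose: profiling output
   is best-effort and must not disturb the profiled program. */
void prof_emit(int fd, const char *msg, size_t len) {
  ssize_t r = write(fd, msg, len);
  (void)r;
}
```

With something like this, the instrumentation can write unconditionally
and the shell decides whether the data goes anywhere.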

I don't have the time to clean up the changes & turn them into a patch,
but here's a bunch of notes that may be useful:

 *  lobject.h/ClosureHeader: nice place for counters (ncalls,nbytes,…)
    (initialize in lfunc.c/luaF_new[CL]closure)
 *  ldo.c/luaD_precall: all calls thru here -> ncalls++ / dump stack
 *  lapi.c/lua_pushcclosure: Kill the `if (n == 0)` branch to disable light
    C functions / force C closures, so that you have the counter fields
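
The counter notes above might look roughly like this. It is a sketch
with stand-in types, not the real Lua structs: ProfCounters, MockClosure,
prof_init_closure and prof_on_call are hypothetical names. In an actual
patch the fields would sit in ClosureHeader, the init in
luaF_new[CL]closure and the increment in luaD_precall.

```c
#include <assert.h>
#include <stddef.h>

/* Fields one might append to ClosureHeader (hypothetical). */
typedef struct ProfCounters {
  size_t ncalls;   /* how often this closure was called */
  size_t nbytes;   /* bytes allocated while it was on the call stack */
} ProfCounters;

/* Stand-in closure; real Lua has CClosure and LClosure, which share
   ClosureHeader, so both kinds would get the counters. */
typedef struct MockClosure {
  int is_c;            /* 1 = C closure, 0 = Lua closure */
  ProfCounters prof;
} MockClosure;

/* Zero the counters when the closure is created. */
void prof_init_closure(MockClosure *cl, int is_c) {
  assert(cl != NULL);
  cl->is_c = is_c;
  cl->prof.ncalls = 0;
  cl->prof.nbytes = 0;
}

/* Bump the call counter on entry; returns the new count. */
size_t prof_on_call(MockClosure *cl) {
  assert(cl != NULL);
  return ++cl->prof.ncalls;
}
```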

 *  lobject.h/Proto: add instruction counters?
    (NULL-init in lfunc.c/luaF_newproto, alloc in lparser.c/close_func
    using luaM_newvector(L, fs->pc, size_t) and in lundump.c/LoadCode
    using luaM_newvector(S->L, n, size_t), then zero-init all counts)
    (change size_t to whatever counter size you're using)
 *  lvm.c/vmfetch, lvm.c/donextjump: increase instr.-counter:
    cl->p->prof_icounts[(ci->u.l.savedpc)-(cl->p->code)]++;
    (prof_icounts is whatever you're calling the instr. counter field)
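
A self-contained sketch of the per-instruction counters, again with a
stand-in for Proto. MockProto, prof_alloc_icounts and prof_count_instr
are hypothetical names, and calloc stands in for luaM_newvector plus the
zero-init:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int Instruction;   /* stand-in for Lua's Instruction */

typedef struct MockProto {
  Instruction *code;      /* bytecode array */
  int sizecode;           /* number of instructions in code */
  size_t *prof_icounts;   /* one counter per instruction, or NULL */
} MockProto;

/* Allocate zeroed counters once the final code size is known.
   Returns 0 on success, -1 if the allocation failed. */
int prof_alloc_icounts(MockProto *p) {
  assert(p != NULL && p->sizecode >= 0);
  p->prof_icounts = calloc((size_t)p->sizecode, sizeof *p->prof_icounts);
  return (p->prof_icounts != NULL || p->sizecode == 0) ? 0 : -1;
}

/* Count one execution of the instruction that pc points at.  The index
   is simply the offset of pc from the start of the code array, so pc
   has to be taken before it is advanced past that instruction. */
void prof_count_instr(MockProto *p, const Instruction *pc) {
  ptrdiff_t idx = pc - p->code;
  assert(idx >= 0 && idx < p->sizecode);
  p->prof_icounts[idx]++;
}
```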

 *  lstate.h/global_State: per Lua state, and
    lstate.h/lua_State: per thread within a Lua state
    (init in lstate.c: lua_newstate, preinit_thread (no allocs) or
    f_luaopen, lua_newthread (allocs ok))

 *  lmem.c/luaM_realloc_: all allocations go through here
 *  do whatever you do BEFORE the realloc call, as it might be moving
    the stuff that you wanted to touch
 *  if tracking allocations, blame (nsize-realosize) bytes if that's >0
 *  if block == NULL, osize may be nonzero, in which case it is a type
    hint (LUA_TFOO) rather than a size; may want to count those to see
    who's slowing down the GC by creating lots of objects (tables,
    strings, …)
 *  may also want to walk a few stack levels & track indirect counts:
    just blaming your low-level constructors (Object.new, map, …)
    doesn't tell you what parts of the code are actually causing this
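
The allocation-blame notes above could be sketched like this. MockFrame
and prof_blame_alloc are hypothetical, and the counters live on the
frame here only to keep the example short; in the real thing they would
be the per-closure counters reached through each CallInfo.

```c
#include <assert.h>
#include <stddef.h>

typedef struct MockFrame {
  struct MockFrame *previous;  /* caller's frame, NULL at the bottom */
  size_t nbytes;               /* bytes blamed on this frame */
  size_t nobjects;             /* new objects blamed on this frame */
} MockFrame;

/* Blame a (re)allocation on every frame of the call stack.  Meant to
   run BEFORE the actual realloc.  Only growth is blamed; shrinking and
   freeing count as 0.  When block is NULL, osize is treated as a type
   hint rather than a size, so the old size is taken to be 0.
   Returns the number of bytes blamed on each frame. */
size_t prof_blame_alloc(MockFrame *top, const void *block,
                        size_t osize, size_t nsize) {
  size_t realosize = (block != NULL) ? osize : 0;
  size_t grown = (nsize > realosize) ? nsize - realosize : 0;
  int is_new = (block == NULL && nsize > 0);
  for (MockFrame *f = top; f != NULL; f = f->previous) {
    assert(f != f->previous);   /* a self-link would loop forever */
    f->nbytes += grown;
    if (is_new) f->nobjects++;
  }
  return grown;
}
```

Walking the whole chain is the "all functions on the call stack"
variant; stopping after a few levels would be the cheaper one.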

 *  when you want to touch the stack, guard:
    if (!G(L)->version || !L->ci)  return; /* still building state */
    CallInfo *ci = L->ci;
    if (ci->previous == ci->next)  return; /* setting up first func */
    (this *seems* to take care of every wonky stack state?)
 *  stack traversal: just walk the ci->previous chain until NULL
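
Put together, the guard and the traversal might look like the following
sketch. MockCallInfo, MockState and prof_stack_depth are stand-ins; the
version field plays the role of G(L)->version.

```c
#include <assert.h>
#include <stddef.h>

typedef struct MockCallInfo {
  struct MockCallInfo *previous, *next;
} MockCallInfo;

typedef struct MockState {
  const void *version;   /* NULL while the state is still being built */
  MockCallInfo *ci;      /* current (innermost) call frame */
} MockState;

/* Returns the number of frames on the call stack, or -1 if the stack
   is not yet in a state where it is safe to look at. */
int prof_stack_depth(const MockState *L) {
  if (L == NULL || !L->version || !L->ci)
    return -1;                              /* still building state */
  const MockCallInfo *ci = L->ci;
  if (ci->previous == ci->next)
    return -1;                              /* setting up first func */
  int depth = 0;
  for (; ci != NULL; ci = ci->previous) {   /* walk towards the base */
    assert(ci->previous != ci);             /* guard against a cycle */
    depth++;
  }
  return depth;
}
```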

 *  dumping accumulated info from the lua*_free* functions works well
    if you properly close the state at the end (so no os.exit(foo), but
    os.exit(foo,true) is ok – or patch os.exit)

That's done against 5.3.4, but only used a couple of times so far, so
the above may be incomplete / missing critical things / contain bugs.

(For stack traces, you may want to make lgc.c/freeobj (cases LUA_TLCL,
LUA_TCCL) and lfunc.c/luaF_freeproto report the closure kind (C/Lua) /
closure->cfunc (gco2ccl(o)->f) / closure->proto (gco2lcl(o)->p) /
proto->source (f->source) mapping so your stack traces can simply be a
list of closure pointers, no need to constantly translate those when you
can do that later.  Then just keep a counter or timer in the state &
increment/check in ldo.c/luaD_precall whether you should dump a trace…
should be good enough, and fast.)
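
A sketch of that sampling scheme, assuming a plain countdown kept in the
state. SampFrame, Sampler, sampler_init and sampler_on_call are
hypothetical names; the trace is just a list of raw closure pointers,
to be mapped back to names offline.

```c
#include <assert.h>
#include <stddef.h>

typedef struct SampFrame {
  struct SampFrame *previous;  /* caller's frame, NULL at the bottom */
  const void *closure;         /* closure running in this frame */
} SampFrame;

typedef struct Sampler {
  unsigned long interval;      /* take a trace every interval-th call */
  unsigned long countdown;     /* calls left until the next trace */
} Sampler;

void sampler_init(Sampler *s, unsigned long interval) {
  assert(s != NULL && interval > 0);
  s->interval = interval;
  s->countdown = interval;
}

/* To be called on every function call.  On every interval-th call it
   copies up to cap closure pointers (innermost first) into out and
   returns how many were written; on all other calls it returns 0. */
size_t sampler_on_call(Sampler *s, const SampFrame *top,
                       const void **out, size_t cap) {
  if (--s->countdown != 0)
    return 0;                        /* not a sampled call */
  s->countdown = s->interval;        /* re-arm for the next sample */
  size_t n = 0;
  for (const SampFrame *f = top; f != NULL && n < cap; f = f->previous)
    out[n++] = f->closure;
  return n;
}
```

Prime-ish intervals (like the 107 and 11 used for the timings above)
make it less likely that the sampling lines up with a loop in the
profiled program.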

Have fun!
-- nobody