Sorry. There is a typo in the last sentence below. It should read:

Next step for me: do similar measurements on ARM-based devices, and WITHOUT the debug hook, to check whether I get the same level of improvement in real-world embedded Lua apps.

Jean-Luc

> On 5 Feb 2016, at 15:29, Jean-Luc Jumpertz <jean-luc@celedev.eu> wrote:
> 
> 
>> On 4 Feb 2016, at 19:24, Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
>> 
>> Well, I still hope to get some feedback.
> 
> Well, I compared the "computed goto" patch with the vanilla Lua version this morning and got some interesting results.
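
For readers unfamiliar with the technique being measured, here is a minimal sketch of the two dispatch styles. It is not the actual lvm.c patch: the three-opcode bytecode and the names `run_switch`, `run_goto`, `OP_INC`, `OP_DBL` and `NEXT` are invented for illustration, and the second function relies on the GCC/clang "labels as values" extension.

```c
#include <assert.h>

/* Hypothetical three-opcode bytecode, invented for this illustration. */
enum { OP_INC, OP_DBL, OP_HALT };

/* Switch dispatch: every opcode goes through the same indirect branch. */
int run_switch(const unsigned char *code) {
  int acc = 0;
  for (;;) {
    switch (*code++) {
      case OP_INC:  acc += 1;  break;
      case OP_DBL:  acc *= 2;  break;
      case OP_HALT: return acc;
      default:      return -1;  /* unknown opcode */
    }
  }
}

/* Computed-goto dispatch (GCC/clang "labels as values" extension):
   each handler ends with its own indirect jump to the next handler. */
int run_goto(const unsigned char *code) {
  static void *const labels[] = { &&do_inc, &&do_dbl, &&do_halt };
  int acc = 0;
#define NEXT() do { unsigned op = *code++; assert(op <= OP_HALT); \
                    goto *labels[op]; } while (0)
  NEXT();
do_inc:  acc += 1; NEXT();
do_dbl:  acc *= 2; NEXT();
do_halt: return acc;
#undef NEXT
}
```

For the program INC, INC, DBL, INC, HALT both functions return 5. The only difference is where the indirect jump sits: once in the switch version, and once per handler in the computed-goto version.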
> 
> The context: 
> - Mac OS X, 2012 Intel core i7 processor, Xcode 7.2 / clang (corresponding to clang version 3.7, I guess), -O2 -Os (standard compile options used for ‘release’ build)
> - Lua 5.2.4
> - Benchmark: lua-image-ramp-bench.lua (see https://gist.github.com/jlj/de7d8be6f1160ea2963c), which exercises various Lua opcodes, including lots of table creation, table gets, table sets, and GC.
> - Low-overhead profiling using Xcode Instruments
> 
> Tests were run multiple times inside my CodeFlow IDE, with CALL and RETURN hooks active. The values given here are averages.
> 
> 1) Vanilla Lua (without computed gotos)
> ==============================
> 
> Running Time          Self (ms)   Symbol Name
> ------------------------------------------------------------
> 19558.0ms   99.4%      2239.0     luaV_execute
>  7292.0ms   37.0%       144.0       luaC_forcestep
>  4380.0ms   22.2%       193.0       luaD_precall
>  1788.0ms    9.0%       138.0       luaH_resize
>  1438.0ms    7.3%       426.0       luaV_settable
>  1230.0ms    6.2%        59.0       luaH_new
>  1107.0ms    5.6%       407.0       luaV_gettable
>    43.0ms    0.2%         0.0       <Unknown Address>
>    24.0ms    0.1%        24.0       luaO_fb2int
>    10.0ms    0.0%        10.0       luaC_step
>     5.0ms    0.0%         5.0       luaH_get
>     2.0ms    0.0%         2.0       luaC_barrierback_
> 
> Some highlights on this profiling result:
> - luaV_execute runs for 19558.0ms (99.4% of total profile time), of which 2239.0ms is spent in the function itself.
> - other functions below are called from luaV_execute and are sorted by decreasing running time
> - luaC_forcestep represents most of the cost of the GC, due to the high number of created short-lived tables
> - luaD_precall has a rather high running cost, caused mainly by the activity of the debug hook. In fact, the only function called is math.floor (once per loop iteration).
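
The relation between the two columns can be checked with a few lines of C. This is only a sketch, assuming that every symbol listed under luaV_execute is a direct callee and that self time is the running time minus the callees' running times; the helper name `self_time_ms` and the array name are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Self time (ms) = inclusive running time minus the callees' running times. */
long self_time_ms(long inclusive_ms, const long *callees_ms, size_t n) {
  long self = inclusive_ms;
  for (size_t i = 0; i < n; i++) {
    assert(callees_ms[i] >= 0);  /* running times cannot be negative */
    self -= callees_ms[i];
  }
  return self;
}

/* Callee running times (ms) copied from the vanilla profile, in listed order. */
static const long vanilla_callees_ms[] = {
  7292, 4380, 1788, 1438, 1230, 1107, 43, 24, 10, 5, 2
};
```

Calling `self_time_ms(19558, vanilla_callees_ms, 11)` gives 2239, which matches the Self value reported for luaV_execute, so the listed callees account for all the time spent outside the interpreter loop itself.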
> 
> Inside luaV_execute, we can see what takes significant time:
> ---------------------------------------------------
> 53.38%  vmcase(OP_NEWTABLE,
> 22.78%  vmcase(OP_CALL,
>  7.55%  vmcase(OP_SETTABUP,
>  5.93%  vmcase(OP_GETTABUP,
>  1.60%  ra = RA(i);
>  1.38%  vmcase(OP_GETTABLE,
>  1.37%  vmcase(OP_ADD,
>  1.31%  vmcase(OP_MUL,
>  0.85%  vmcase(OP_SETTABLE,
>  0.69%  vmcase(OP_GETUPVAL,
>  0.52%  Instruction i = *(ci->u.l.savedpc++);
>  0.49%  lua_assert(base <= L->top && L->top < L->stack + L->stacksize);
>  0.46%  vmdispatch (GET_OPCODE(i)) {
>  0.42%  vmcase(OP_FORLOOP,
>  0.33%  int counthook = ((mask & LUA_MASKCOUNT) && L->hookcount == 0);
>  0.29%  base = ci->u.l.base;
>  0.23%  vmcase(OP_LE,
>  0.18%  lua_assert(base == ci->u.l.base);
>  0.17%  if ((L->hookmask & (LUA_MASKLINE | LUA_MASKCOUNT)) &&
>  0.07%  vmcase(OP_LOADNIL,
> 
> 
> 
> 2) lvm.c modified with computed goto
> ============================
> 
> Running Time          Self (ms)   Symbol Name
> ------------------------------------------------------------
> 18589.0ms   99.3%      2024.0     luaV_execute
>  7254.0ms   38.7%       136.0       luaC_forcestep
>  4045.0ms   21.6%       227.0       luaD_precall
>  1553.0ms    8.2%       129.0       luaH_resize
>  1389.0ms    7.4%       455.0       luaV_settable
>  1223.0ms    6.5%        59.0       luaH_new
>  1016.0ms    5.4%       374.0       luaV_gettable
>    40.0ms    0.2%         0.0       <Unknown Address>
>    34.0ms    0.1%        34.0       luaO_fb2int
>     7.0ms    0.0%         7.0       luaH_get
>     4.0ms    0.0%         4.0       luaC_step
> 
> Highlights:
> - the overall running time of luaV_execute is reduced by about 5% (18589.0ms vs. 19558.0ms);
> - the internal running time of luaV_execute is reduced too, by a smaller absolute amount (2024ms vs. 2239ms), but this is still a roughly 10% performance gain in the interpreter loop;
> - where does the remaining gain of roughly 750ms come from? I can’t see any clear reason for this in the profiling info, so I would suspect better caching or branch prediction (to be confirmed by further benchmarks).
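
The arithmetic behind these highlights can be reproduced as follows. This is a sketch only: the function names `reduction_percent` and `saving_outside_self_ms` are invented here, and the inputs are the luaV_execute rows of the two profiles.

```c
#include <assert.h>

/* Relative reduction from `before` to `after`, in percent. */
double reduction_percent(double before, double after) {
  assert(before > 0.0);
  return 100.0 * (before - after) / before;
}

/* Part of the total saving (ms) that is not explained by the
   function's own self time, i.e. the part that shows up in callees. */
double saving_outside_self_ms(double total_before, double total_after,
                              double self_before, double self_after) {
  return (total_before - total_after) - (self_before - self_after);
}
```

With the figures above, `reduction_percent(19558.0, 18589.0)` is just under 5, `reduction_percent(2239.0, 2024.0)` is about 9.6, and `saving_outside_self_ms(19558.0, 18589.0, 2239.0, 2024.0)` is 754. In the two tables, most of that difference sits in the luaD_precall row (4380ms vs. 4045ms) and the luaH_resize row (1788ms vs. 1553ms).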
> 
> And, if you are curious about it, here is how luaV_execute consumes running time in this (computed goto) case:
> ---------------------------------------------------
> 39.25% checkGC(L, ra + 1);
> 21.84% if (luaD_precall(L, ra, nresults)) {  /* C function? */
> 9.05% Protect(luaV_settable(L, ra, RKB(i), RKC(i)));
> 8.66% luaH_resize(L, t, luaO_fb2int(b), luaO_fb2int(c));
> 7.67% Protect(luaV_gettable(L, RB(i), RKC(i), ra));
> 6.58% Table *t = luaH_new(L);
> 1.45% arith_op(luai_numadd, TM_ADD);
> 1.08% arith_op(luai_nummul, TM_MUL);
> 0.78% } vmbreak;  … after vmcase(OP_SETTABLE)
> 0.74% } vmbreak;  … after vmcase(OP_GETTABLE)
> 0.51% } vmbreak; … after vmcase(OP_MUL)
> 0.40% int b = GETARG_B(i);
> 0.33% } vmbreak; … after vmcase(OP_ADD)
> 0.22% } vmbreak; … after vmcase(OP_CALL)
> 0.22% } vmbreak; … after vmcase(OP_NEWTABLE)
> 0.18% sethvalue(L, ra, t);
> 0.14% setobj2s(L, ra, cl->upvals[b]->v);
> 0.13% int nresults = GETARG_C(i) - 1;
> 0.13% } vmbreak;  … after vmcase(OP_GETUPVAL)
> 0.12% lua_Number step = nvalue(ra+2);
> 0.10% } vmbreak;  … after vmcase(OP_FORLOOP)
> 0.06% lua_Number limit = nvalue(ra+1);
> 0.05% if (b != 0) L->top = ra+b;  /* else previous instruction set top */
> 0.05% if (luai_numlt(L, 0, step) ? luai_numle(L, idx, limit)
> 0.04% if (nresults >= 0) L->top = ci->top;  /* adjust results */
> 0.04% ci->u.l.savedpc += GETARG_sBx(i);  /* jump back */
> 0.03% setnvalue(ra+3, idx);  /* ...and external index */
> 0.03% lua_Number idx = luai_numadd(L, nvalue(ra), step); /* increment index */
> 0.03% if (b != 0 || c != 0)
> 0.02% int c = GETARG_C(i);
> 0.02% int b = GETARG_B(i);
> 0.02% base = ci->u.l.base;
> 0.01% int b = GETARG_B(i);
> 0.01% vmdispatch (GET_OPCODE(i)) {
> 0.01% } vmbreak;
> 0.01% arith_op(luai_numsub, TM_SUB);
> 
> 
> (If you are still reading at this point, you are very brave :-)
> 
> 
> Next step for me: do similar measurements on ARM-based devices, and with debug hook, to check if I can get the same level of improvements on real-world embedded Lua apps.
> 
> Note: it would be really interesting if other people could report profiling results on this topic…
> 
> Regards,
> Jean-Luc
> 
>