- Subject: Re: Improving performance of Lua on low-end systems
- From: William Ahern <william@...>
- Date: Tue, 28 Oct 2014 19:48:59 -0700
On Tue, Oct 28, 2014 at 05:31:57PM -0400, Rena wrote:
<snip>
> Someone mentioned that calling across the language boundary (C to Lua
> and vice-versa) is expensive? (I could never remember if that applies
> to Lua, LuaJIT, or both.) So it might be worth trying to reduce C
> calls?
At the end are the little scripts I threw together to benchmark C function
calls. For the built-in os.time, LuaJIT takes ~2/3 the execution time of Lua
5.1/5.2. But I really needed to crank up the loop unrolling. Fortunately
Lua 5.1, 5.2, and LuaJIT all did their best around 2^11 calls per loop,
although the gain from unrolling was much more pronounced for PUC Lua.
For 2^27 calls and 2^11 calls per loop I got
Lua5.1: 3.005s
Lua5.2: 2.908s
LuaJIT: 2.007s
I used os.time() because it's built-in, not a pure function, and figured it
was something LuaJIT couldn't optimize very well. On Linux time(2) does a
read from shared memory with the kernel so it has a negligible cost.
_However_, it turns out that LuaJIT apparently does optimize calls to its
built-ins. If I replace os.time with a C function
	static int inc(lua_State *L) {
		lua_Number i = lua_tonumber(L, 1);
		lua_pushnumber(L, i + 1);
		return 1;
	} /* inc() */
and replace the call-out in the script below with
j = inc(j) --> inc is a local
then I got
Lua5.1: 2.932s
Lua5.2: 2.885s
LuaJIT: 2.232s
which is only ~3/4 the execution time of Lua 5.1/5.2.
Out of curiosity I recompiled Lua 5.2 like so
make MYCFLAGS="-D'luai_apicheck(...)=(void)0' -march=native -O3 -flto \
-fwhole-program -fprofile-use=/tmp/gcov" linux
where I had first compiled with "-fprofile-generate=/tmp/gcov -lgcov" and ran
the script once. For os.time I got
Lua5.2: 2.621s
which is a nearly 10% improvement. (LTO, profiling, and a noop luai_apicheck
all contributed, but LTO the most. -O3 vs -O2 wasn't consistently
discernible, nor was -fomit-frame-pointer.)
I couldn't get the run time down for the inc() external function test, even
when I compiled that module with profiling. But at least for Lua built-ins
you can improve things by enabling compiler optimizations. You can always
build your modules into the interpreter, e.g. by compiling Lua to liblua.a
and then building
NOTE: All benchmarks done on Ubuntu 14.04/x86_64 and, except where noted,
used the Ubuntu packaged compiler. I ran each test twice in quick
succession and took the user time of the second run, as reported by the
time(1) utility. If the second run deviated substantially from the first or
from my practice runs, I ran it twice again.
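The run-twice-and-keep-the-second procedure can also be scripted. The following is only a sketch, not what was used for the numbers above: it assumes a POSIX system, the helper names child_user_seconds and second_run_user_seconds are hypothetical, and it reads user CPU time from getrusage(2) instead of time(1).

```c
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run the NULL-terminated command argv, wait for it,
 * and return the user CPU seconds it consumed, or -1.0 on failure.
 * The child's exit status is not inspected, so a failed exec simply
 * shows up as a near-zero time. */
static double child_user_seconds(char *const argv[]) {
    struct rusage before, after;
    int status;

    if (getrusage(RUSAGE_CHILDREN, &before) != 0)
        return -1.0;

    pid_t pid = fork();
    if (pid < 0)
        return -1.0;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1.0;
    if (getrusage(RUSAGE_CHILDREN, &after) != 0)
        return -1.0;

    /* RUSAGE_CHILDREN is cumulative over waited-for children, so the
     * difference is the user time of the child we just reaped. */
    return (double)(after.ru_utime.tv_sec - before.ru_utime.tv_sec)
         + (double)(after.ru_utime.tv_usec - before.ru_utime.tv_usec) / 1e6;
}

/* Run the command twice in quick succession; report only the second run. */
static double second_run_user_seconds(char *const argv[]) {
    (void)child_user_seconds(argv); /* first run, result discarded */
    return child_user_seconds(argv);
}
```

A caller would pass something like { "lua5.2", "time.lua", "27", "11", NULL }, where the interpreter name is an assumption about how the binary is installed. Comparing the first and second results before accepting the second would mirror the "ran it twice again" check.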
os.time test (time.lua):
local ncalls_n, unroll_n = ...

local function optnatural(n, d)
	n = tonumber(n)
	if n and n >= 0 then
		return math.floor(n)
	else
		return d
	end
end

local ncalls = 2^optnatural(ncalls_n, 27)
local unroll = 2^optnatural(unroll_n, 11)
local nloops = ncalls / unroll

local code = string.format([[
	local n = ...
	local j = 0
	local time = os.time
	for i=1,n do
		%s
	end
	return j
]], string.rep("j = j + time()", unroll, "\n"))

local f = assert((loadstring or load)(code))
local j = f(nloops)
print(string.format("j:%d ncalls:%d unroll:%d", j, ncalls, unroll))

external function test (inc.lua):
local ncalls_n, unroll_n = ...

local function optnatural(n, d)
	n = tonumber(n)
	if n and n >= 0 then
		return math.floor(n)
	else
		return d
	end
end

local ncalls = 2^optnatural(ncalls_n, 27)
local unroll = 2^optnatural(unroll_n, 11)
local nloops = ncalls / unroll

local code = string.format([[
	local n = ...
	local j = 0
	local ver = string.match(_VERSION, "(%%d%%.%%d)"):gsub("%%.", "")
	local inc = require(string.format("inc%%s", ver))
	for i=1,n do
		%s
	end
	return j
]], string.rep("j = inc(j)", unroll, "\n"))

local f = assert((loadstring or load)(code))
local j = f(nloops)
print(string.format("j:%d ncalls:%d unroll:%d nloops:%d", j, ncalls, unroll, nloops))

external function test (inc.c):
// cc -o inc{51,52}.so inc.c -O3 -fPIC -shared -I/usr/include/lua{5.1,5.2}
#include <lua.h>

static int inc(lua_State *L) {
	lua_Number i = lua_tonumber(L, 1);
	// NOTE: calling lua_pop(L, 1) gives worse numbers in all VMs
	lua_pushnumber(L, i + 1);
	return 1;
} /* inc() */

int luaopen_inc(lua_State *L) {
	lua_pushcfunction(L, &inc);
	return 1;
}

int luaopen_inc51(lua_State *L) {
	return luaopen_inc(L);
}

int luaopen_inc52(lua_State *L) {
	return luaopen_inc(L);
}

int luaopen_inc53(lua_State *L) {
	return luaopen_inc(L);
}