- Subject: Performance improvement in luaL_addlstring
- From: Chuck Coffing <clc@...>
- Date: Fri, 12 Mar 2010 07:17:27 -0700
Hi list,
At work we use Lua in an embedded environment. It's a large project that
does a lot of network I/O. After some profiling, I discovered that
luaL_addlstring was consuming a large share of CPU time; much of that
usage originated in LuaSocket.
It turns out that luaL_addlstring calls luaL_addchar for every byte, which
means that multiple dereferences, a test, and a jump occur for every byte
received over the network.
The buffer, however, already knows how much space it has available, so the
repeated tests are unnecessary. I changed luaL_addlstring to memcpy the
largest chunk that is known to fit, and then expand the buffer as needed.
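To make the difference concrete outside of Lua, here is a minimal standalone sketch of the two strategies. Everything in it (DemoBuffer, CHUNK_SIZE, demo_flush, the checks counter) is a hypothetical stand-in invented for illustration and is not the real luaL_Buffer API; demo_flush only plays the role that luaL_prepbuffer plays in the patch.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 8   /* hypothetical; tiny on purpose, stands in for LUAL_BUFFERSIZE */

/* Hypothetical stand-in for luaL_Buffer: a fixed chunk plus a sink
   that collects flushed chunks. Not the real Lua structure. */
typedef struct {
  char chunk[CHUNK_SIZE];
  char *p;               /* next free position in chunk */
  char out[256];         /* demo-only sink; callers must stay under 256 bytes */
  size_t outlen;
  unsigned long checks;  /* how many free-space tests were performed */
} DemoBuffer;

static void demo_init(DemoBuffer *B) {
  B->p = B->chunk;
  B->outlen = 0;
  B->checks = 0;
}

/* Bytes still free in the current chunk. */
static size_t demo_free(const DemoBuffer *B) {
  return (size_t)(B->chunk + CHUNK_SIZE - B->p);
}

/* Move the chunk contents to the sink so the whole chunk is free again. */
static void demo_flush(DemoBuffer *B) {
  size_t n = (size_t)(B->p - B->chunk);
  memcpy(B->out + B->outlen, B->chunk, n);
  B->outlen += n;
  B->p = B->chunk;
}

/* Per-byte append: one free-space test for every byte. */
static void add_bytewise(DemoBuffer *B, const char *s, size_t l) {
  while (l--) {
    B->checks++;
    if (demo_free(B) == 0) demo_flush(B);
    *B->p++ = *s++;
  }
}

/* Chunked append: one free-space test per memcpy of the largest piece that fits. */
static void add_chunked(DemoBuffer *B, const char *s, size_t l) {
  while (l) {
    size_t avail, n;
    B->checks++;
    avail = demo_free(B);
    if (avail == 0) {
      demo_flush(B);
      avail = demo_free(B);
    }
    n = avail <= l ? avail : l;
    memcpy(B->p, s, n);
    B->p += n;
    s += n;
    l -= n;
  }
}
```

Appending a 25-byte string with CHUNK_SIZE set to 8 costs 25 free-space tests with add_bytewise and 4 with add_chunked, and both leave the same bytes in the sink after a final demo_flush.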
For a trivial LuaSocket test script that receives data, the change cuts the
number of instructions executed by more than 50%. (I used Valgrind's
Cachegrind tool to count instructions.) I recall (although it's been a while)
that it cut the instruction count of our app by about 13% overall.
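As a quick check of the "more than 50%" figure, the reduction can be computed from the "I refs" counts reported by Cachegrind further down. The helper name percent_reduction is made up for this sketch.

```c
/* Percentage by which `after` is smaller than `before`.
   Hypothetical helper, for illustration only. */
static double percent_reduction(unsigned long before, unsigned long after) {
  return 100.0 * (double)(before - after) / (double)before;
}
```

With the counts from the first run of each binary, percent_reduction(16345890UL, 7179931UL) comes out at roughly 56%.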
The change only minimally increases the code size (16 bytes larger on x86):
chuck@magma:~/lua-perf$ nm -S lua.orig | grep addlstring
08059ec0 00000068 T luaL_addlstring
chuck@magma:~/lua-perf$ nm -S lua.perf | grep addlstring
08059ec0 00000078 T luaL_addlstring
I made the change on 5.1.4, but it looks like 5.2 has the same performance
issue.
The patch is below, followed by the trivial test script and the Valgrind output.
--
Chuck
--- lua-5.1.4.orig/src/lauxlib.c 2008-01-21 06:20:51.000000000 -0700
+++ lua-5.1.4/src/lauxlib.c 2010-03-12 05:48:39.000000000 -0700
@@ -434,8 +434,19 @@
 LUALIB_API void luaL_addlstring (luaL_Buffer *B, const char *s, size_t l) {
-  while (l--)
-    luaL_addchar(B, *s++);
+  while (l) {
+    size_t min;
+    size_t avail = bufffree(B);
+    if (!avail) {
+      luaL_prepbuffer(B);
+      avail = bufffree(B);
+    }
+    min = avail <= l ? avail : l;
+    memcpy(B->p, s, min);
+    B->p += min;
+    s += min;
+    l -= min;
+  }
 }
chuck@magma:~/lua-perf$ cat echo.lua
require "socket"
server, err = socket.bind("*", 0)
assert(server, err)
ip, port = server:getsockname()
print("Please telnet to localhost on port " .. port)
client, err = server:accept()
total = 0
while not err do
  data, err = client:receive(4096)
  if err then break end
  client:send("receive(" .. tostring(#data) .. ")")
  total = total + #data
  if total >= 1024*1024 then break end
end
client:close()
chuck@magma:~/lua-perf$ valgrind --tool=cachegrind ./lua.orig echo.lua
==8565== Cachegrind, a cache and branch-prediction profiler
==8565== Copyright (C) 2002-2009, and GNU GPL'd, by Nicholas Nethercote et al.
==8565== Using Valgrind-3.5.0-Debian and LibVEX; rerun with -h for copyright info
==8565== Command: ./lua.orig echo.lua
==8565==
Please telnet to localhost on port 60342
==8565==
==8565== I refs: 16,345,890
==8565== I1 misses: 28,037
==8565== L2i misses: 2,426
==8565== I1 miss rate: 0.17%
==8565== L2i miss rate: 0.01%
==8565==
==8565== D refs: 8,229,564 (4,619,517 rd + 3,610,047 wr)
==8565== D1 misses: 53,970 ( 32,310 rd + 21,660 wr)
==8565== L2d misses: 5,991 ( 3,442 rd + 2,549 wr)
==8565== D1 miss rate: 0.6% ( 0.6% + 0.5% )
==8565== L2d miss rate: 0.0% ( 0.0% + 0.0% )
==8565==
==8565== L2 refs: 82,007 ( 60,347 rd + 21,660 wr)
==8565== L2 misses: 8,417 ( 5,868 rd + 2,549 wr)
==8565== L2 miss rate: 0.0% ( 0.0% + 0.0% )
chuck@magma:~/lua-perf$ valgrind --tool=cachegrind ./lua.orig echo.lua
[...snip...]
==8570== I refs: 16,345,917
[...snip...]
chuck@magma:~/lua-perf$ valgrind --tool=cachegrind ./lua.orig echo.lua
[...snip...]
==8575== I refs: 16,345,861
[...snip...]
chuck@magma:~/lua-perf$ valgrind --tool=cachegrind ./lua.perf echo.lua
==8580== Cachegrind, a cache and branch-prediction profiler
==8580== Copyright (C) 2002-2009, and GNU GPL'd, by Nicholas Nethercote et al.
==8580== Using Valgrind-3.5.0-Debian and LibVEX; rerun with -h for copyright info
==8580== Command: ./lua.perf echo.lua
==8580==
Please telnet to localhost on port 43713
==8580==
==8580== I refs: 7,179,931
==8580== I1 misses: 26,354
==8580== L2i misses: 2,422
==8580== I1 miss rate: 0.36%
==8580== L2i miss rate: 0.03%
==8580==
==8580== D refs: 4,563,501 (2,786,905 rd + 1,776,596 wr)
==8580== D1 misses: 54,462 ( 32,711 rd + 21,751 wr)
==8580== L2d misses: 5,989 ( 3,440 rd + 2,549 wr)
==8580== D1 miss rate: 1.1% ( 1.1% + 1.2% )
==8580== L2d miss rate: 0.1% ( 0.1% + 0.1% )
==8580==
==8580== L2 refs: 80,816 ( 59,065 rd + 21,751 wr)
==8580== L2 misses: 8,411 ( 5,862 rd + 2,549 wr)
==8580== L2 miss rate: 0.0% ( 0.0% + 0.1% )
chuck@magma:~/lua-perf$ valgrind --tool=cachegrind ./lua.perf echo.lua
[...snip...]
==8586== I refs: 7,179,933
[...snip...]
chuck@magma:~/lua-perf$ valgrind --tool=cachegrind ./lua.perf echo.lua
[...snip...]
==8591== I refs: 7,179,922
[...snip...]
chuck@magma:~/lua-perf$