Simd Experiment |
|
Importantly, the initial implementation here is ANSI C, and it does not use any SSE instructions or multithreading. How can this be? The opcode dispatch in the Lua VM imposes a non-negligible overhead. If, however, we interpret each opcode (instruction) once and execute that opcode on multiple data elements, we could expect to reduce the relative overhead of the opcode dispatch, even if the data is processed serially in each opcode.
The current implementation is an experimental prototype and not fully implemented. The initial goal was merely to estimate conceivable performance gains. The design favored implementation simplicity. In fact, the simplest approach, requiring the least patching, seemed to be to redefine the TValue type to contain an array of values. This obviously imposes overhead on non-SIMD operations, but that is ok for now given the goal above. (Redefining the lua_Number type in luaconf.h may be another somewhat simple approach, but it too is a global change.) A more refined implementation might leave TValue as is and define SIMD operations over a sequence of TValue's, possibly with new SIMD opcodes, thereby allowing both SIMD and non-SIMD operations.
In this implementation, each Lua number contains a certain number of sub-values (this number is defined in the global variable _SIMD_LEN and can be adjusted in luaconf.h). Individual sub-values in each Lua number can be indexed with the "packed" operator. Syntactically, the operator looks like a function call here, but semantically it is more like an indexing operator on register lexicals, and it has it's own op-codes.
-- ex1.lua -- If you don't use the SIMD features, numbers will behave -- as usual. This is because any SIMD unaware operations -- (e.g. print) only use the first sub-value of a number. local x = 10 print(x + 50) --> 60 -- Individual subvalues can be accessed with the packed operator: assert(packed(x,1) == 10) -- success packed(x,2) = 20 assert(packed(x,1) == 10) assert(packed(x,2) == 20) local y; packed(y,1) = 100; packed(y,2) = 200; local z = x + y -- this is a SIMD instruction assert(packed(z,1) == 110) assert(packed(z,2) == 220) print(z) --> 110 (print is not currently SIMD-aware) print(_SIMD_LEN) --> prints some number (e.g. 2)
Here's what the opcodes look like for that. Note the new GETPACKED and SETPACKED instructions. All other opcodes are unchanged.
$ ./src/luac -p -l ex1.lua main <ex.lua:0,0> (51 instructions, 204 bytes at 0x682930) 0+ params, 5 slots, 0 upvalues, 3 locals, 12 constants, 0 functions 1 [4] LOADK 0 -1 ; 10 2 [5] GETGLOBAL 1 -2 ; print 3 [5] ADD 2 0 -3 ; - 50 4 [5] CALL 1 2 1 5 [8] GETGLOBAL 1 -4 ; assert 6 [8] GETPACKED 2 0 -5 ; 1 7 [8] EQ 1 2 -1 ; - 10 8 [8] JMP 1 ; to 10 9 [8] LOADBOOL 2 0 1 10 [8] LOADBOOL 2 1 0 11 [8] CALL 1 2 1 12 [9] SETPACKED 0 -6 -7 ; 2 20 13 [10] GETGLOBAL 1 -4 ; assert 14 [10] GETPACKED 2 0 -5 ; 1 15 [10] EQ 1 2 -1 ; - 10 16 [10] JMP 1 ; to 18 17 [10] LOADBOOL 2 0 1 18 [10] LOADBOOL 2 1 0 19 [10] CALL 1 2 1 20 [11] GETGLOBAL 1 -4 ; assert 21 [11] GETPACKED 2 0 -6 ; 2 22 [11] EQ 1 2 -7 ; - 20 23 [11] JMP 1 ; to 25 24 [11] LOADBOOL 2 0 1 25 [11] LOADBOOL 2 1 0 26 [11] CALL 1 2 1 27 [13] LOADNIL 1 1 28 [13] SETPACKED 1 -5 -8 ; 1 100 29 [13] SETPACKED 1 -6 -9 ; 2 200 30 [14] ADD 2 0 1 31 [15] GETGLOBAL 3 -4 ; assert 32 [15] GETPACKED 4 2 -5 ; 1 33 [15] EQ 1 4 -10 ; - 110 34 [15] JMP 1 ; to 36 35 [15] LOADBOOL 4 0 1 36 [15] LOADBOOL 4 1 0 37 [15] CALL 3 2 1 38 [16] GETGLOBAL 3 -4 ; assert 39 [16] GETPACKED 4 2 -6 ; 2 40 [16] EQ 1 4 -11 ; - 220 41 [16] JMP 1 ; to 43 42 [16] LOADBOOL 4 0 1 43 [16] LOADBOOL 4 1 0 44 [16] CALL 3 2 1 45 [17] GETGLOBAL 3 -2 ; print 46 [17] MOVE 4 2 47 [17] CALL 3 2 1 48 [19] GETGLOBAL 3 -2 ; print 49 [19] GETGLOBAL 4 -12 ; _SIMD_LEN 50 [19] CALL 3 2 1 51 [19] RETURN 0 1
Now, let's do some benchmarks: summing the integers from 1 to 2^28. In standard Lua it would look like this:
-- test1-standard.lua local sum = 0 for i=1,2^28 do sum = sum + i end print(sum)
Using this SIMD patch, we could rewrite it like this:
-- test1-simd.lua local N=_SIMD_LEN local j; for k=1,N do packed(j,k) = k end local psum; for k=1,N do packed(psum,k)= 0 end local fi; for k=1,N do packed(fi,k) = k end local fs; for k=1,N do packed(fs,k) = N end for i=fi,2^28,fs do psum = psum + i end local sum = 0 for i=1,N do -- print('partial sum', i, packed(psum,i)) sum = sum + packed(psum,i) end print('sum:', sum)
The SIMD version computes N partial sums (each computed in parallel) and then sums the partial sums.
Here's the benchmark runtimes as a function of the compiled value of _SIMD_LEN (N).
Lua version run-time ---------------- -------- unpatched 8.095s SIMD patch N=1 9.265s SIMD patch N=2 5.660s SIMD patch N=3 3.566s SIMD patch N=4 3.248s SIMD patch N=8 2.688s SIMD patch N=32 2.406s SIMD patch N=256 2.529s SIMD patch N=1024 2.894s
Using the SIMD patch with N=1 effectively negates the patch and is useful to compare to the unpatched version. With N set to 32, the runtimes are reduced by a significant 70%, at least on this very simple example.
The patch against Lua 5.1.4 is in Appendix A below.
A simple extension to this patch would be to use real SSE operations [2] (which hopefully will improve performance performance further).
diff -ur lua-5.1.4/src/lbaselib.c lua-5.1.4-simd/src/lbaselib.c --- lua-5.1.4/src/lbaselib.c 2008-02-14 11:46:22.000000000 -0500 +++ lua-5.1.4-simd/src/lbaselib.c 2009-01-01 22:43:22.671875000 -0500 @@ -642,6 +642,11 @@ lua_setfield(L, -2, "__mode"); /* metatable(w).__mode = "kv" */ lua_pushcclosure(L, luaB_newproxy, 1); lua_setglobal(L, "newproxy"); /* set global `newproxy' */ + + /* SIMD-PATCH */ + lua_pushnumber(L, LUA_SIMD_LEN); + lua_setglobal(L, "_SIMD_LEN"); + } diff -ur lua-5.1.4/src/lcode.c lua-5.1.4-simd/src/lcode.c --- lua-5.1.4/src/lcode.c 2007-12-28 10:32:23.000000000 -0500 +++ lua-5.1.4-simd/src/lcode.c 2009-01-01 22:20:37.890625000 -0500 @@ -317,6 +317,16 @@ e->k = VRELOCABLE; break; } + + /* SIMD-PATCH */ + case VPACKED: { + freereg(fs, e->u.s.aux); + freereg(fs, e->u.s.info); + e->u.s.info = luaK_codeABC(fs, OP_GETPACKED, 0, e->u.s.info, e->u.s.aux); + e->k = VRELOCABLE; + break; + } + case VINDEXED: { freereg(fs, e->u.s.aux); freereg(fs, e->u.s.info); @@ -486,6 +496,14 @@ luaK_codeABx(fs, OP_SETGLOBAL, e, var->u.s.info); break; } + + /* SIMD-PATCH */ + case VPACKED: { + int e = luaK_exp2RK(fs, ex); + luaK_codeABC(fs, OP_SETPACKED, var->u.s.info, var->u.s.aux, e); + break; + } + case VINDEXED: { int e = luaK_exp2RK(fs, ex); luaK_codeABC(fs, OP_SETTABLE, var->u.s.info, var->u.s.aux, e); diff -ur lua-5.1.4/src/llex.c lua-5.1.4-simd/src/llex.c --- lua-5.1.4/src/llex.c 2007-12-27 08:02:25.000000000 -0500 +++ lua-5.1.4-simd/src/llex.c 2009-01-01 22:20:17.531250000 -0500 @@ -37,7 +37,11 @@ const char *const luaX_tokens [] = { "and", "break", "do", "else", "elseif", "end", "false", "for", "function", "if", - "in", "local", "nil", "not", "or", "repeat", + "in", "local", "nil", "not", "or", + + "packed", /* SIMD-PATCH */ + + "repeat", "return", "then", "true", "until", "while", "..", "...", "==", ">=", "<=", "~=", "<number>", "<name>", "<string>", "<eof>", diff -ur lua-5.1.4/src/llex.h lua-5.1.4-simd/src/llex.h --- lua-5.1.4/src/llex.h 2007-12-27 08:02:25.000000000 -0500 +++ lua-5.1.4-simd/src/llex.h 2009-01-01 22:21:41.859375000 -0500 @@ -25,7 +25,11 @@ /* terminal symbols denoted by reserved words */ TK_AND = FIRST_RESERVED, TK_BREAK, TK_DO, TK_ELSE, TK_ELSEIF, TK_END, TK_FALSE, TK_FOR, TK_FUNCTION, - TK_IF, TK_IN, TK_LOCAL, TK_NIL, TK_NOT, TK_OR, TK_REPEAT, + TK_IF, TK_IN, TK_LOCAL, TK_NIL, TK_NOT, TK_OR, + + TK_PACKED, /* SIMD-PATCH */ + + TK_REPEAT, TK_RETURN, TK_THEN, TK_TRUE, TK_UNTIL, TK_WHILE, /* other terminal symbols */ TK_CONCAT, TK_DOTS, TK_EQ, TK_GE, TK_LE, TK_NE, TK_NUMBER, diff -ur lua-5.1.4/src/lobject.h lua-5.1.4-simd/src/lobject.h --- lua-5.1.4/src/lobject.h 2008-08-06 09:29:48.000000000 -0400 +++ lua-5.1.4-simd/src/lobject.h 2009-01-01 22:42:06.609375000 -0500 @@ -70,8 +70,16 @@ #define TValuefields Value value; int tt +/* SIMD-PATCH */ +typedef struct lua_TValueItem { + TValuefields; +} TValueItem; + typedef struct lua_TValue { TValuefields; + + TValueItem more[LUA_SIMD_LEN - 1]; /* SIMD-PATCH */ + } TValue; @@ -102,6 +110,9 @@ #define l_isfalse(o) (ttisnil(o) || (ttisboolean(o) && bvalue(o) == 0)) +/* SIMD-PATCH */ +#define packeditem(o,i) ( (TValue*)((TValueItem*)(o) + (i)) ) + /* ** for internal debug only */ diff -ur lua-5.1.4/src/lopcodes.c lua-5.1.4-simd/src/lopcodes.c --- lua-5.1.4/src/lopcodes.c 2007-12-27 08:02:25.000000000 -0500 +++ lua-5.1.4-simd/src/lopcodes.c 2009-01-01 22:19:43.640625000 -0500 @@ -52,6 +52,11 @@ "CLOSE", "CLOSURE", "VARARG", + + /* SIMD-PATCH */ + "GETPACKED", + "SETPACKED", + NULL }; @@ -98,5 +103,10 @@ ,opmode(0, 0, OpArgN, OpArgN, iABC) /* OP_CLOSE */ ,opmode(0, 1, OpArgU, OpArgN, iABx) /* OP_CLOSURE */ ,opmode(0, 1, OpArgU, OpArgN, iABC) /* OP_VARARG */ + + /* SIMD-PATCH */ + ,opmode(0, 1, OpArgR, OpArgK, iABC) /* OP_GETPACKED */ + ,opmode(0, 0, OpArgK, OpArgK, iABC) /* OP_SETPACKED */ + }; diff -ur lua-5.1.4/src/lopcodes.h lua-5.1.4-simd/src/lopcodes.h --- lua-5.1.4/src/lopcodes.h 2007-12-27 08:02:25.000000000 -0500 +++ lua-5.1.4-simd/src/lopcodes.h 2009-01-01 22:23:40.531250000 -0500 @@ -205,10 +205,18 @@ OP_CLOSURE,/* A Bx R(A) := closure(KPROTO[Bx], R(A), ... ,R(A+n)) */ OP_VARARG/* A B R(A), R(A+1), ..., R(A+B-1) = vararg */ -} OpCode; + +/* SIMD-PATCH */ +,OP_GETPACKED/* A B C R(A) := packed(R(B), RK(C)) */ +,OP_SETPACKED/* A B C packed(R(A), RK(B)) := RK(C) */ +} OpCode; + +#if 0 /* SIMD-PATCH */ #define NUM_OPCODES (cast(int, OP_VARARG) + 1) +#endif +#define NUM_OPCODES (cast(int, OP_SETPACKED) + 1) diff -ur lua-5.1.4/src/lparser.c lua-5.1.4-simd/src/lparser.c --- lua-5.1.4/src/lparser.c 2007-12-28 10:32:23.000000000 -0500 +++ lua-5.1.4-simd/src/lparser.c 2009-01-01 22:20:07.343750000 -0500 @@ -691,6 +691,22 @@ /* primaryexp -> prefixexp { `.' NAME | `[' exp `]' | `:' NAME funcargs | funcargs } */ FuncState *fs = ls->fs; + + /* SIMD-PATCH */ + if (ls->t.token == TK_PACKED) { + expdesc t; + expdesc k; + luaX_next(ls); + checknext(ls, '('); + expr(ls, &t); + checknext(ls, ','); + expr(ls, &k); + checknext(ls, ')'); + init_exp(v, VPACKED, luaK_exp2anyreg(fs, &t)); + v->u.s.aux = luaK_exp2RK(fs, &k); + return; + } + prefixexp(ls, v); for (;;) { switch (ls->t.token) { diff -ur lua-5.1.4/src/lparser.h lua-5.1.4-simd/src/lparser.h --- lua-5.1.4/src/lparser.h 2007-12-27 08:02:25.000000000 -0500 +++ lua-5.1.4-simd/src/lparser.h 2009-01-01 22:48:25.640625000 -0500 @@ -26,12 +26,17 @@ VLOCAL, /* info = local register */ VUPVAL, /* info = index of upvalue in `upvalues' */ VGLOBAL, /* info = index of table; aux = index of global name in `k' */ + + /* SIMD-PATCH */ + VPACKED, /* info = local register; aux = index register (or `k') */ + VINDEXED, /* info = table register; aux = index register (or `k') */ VJMP, /* info = instruction pc */ VRELOCABLE, /* info = instruction pc */ VNONRELOC, /* info = result register */ VCALL, /* info = instruction pc */ VVARARG /* info = instruction pc */ + } expkind; typedef struct expdesc { diff -ur lua-5.1.4/src/luaconf.h lua-5.1.4-simd/src/luaconf.h --- lua-5.1.4/src/luaconf.h 2008-02-11 11:25:08.000000000 -0500 +++ lua-5.1.4-simd/src/luaconf.h 2009-01-01 23:42:57.625000000 -0500 @@ -510,6 +510,12 @@ */ #define LUAI_UACNUMBER double +/* SIMD-PATCH */ +/* +@@ LUA_SIMD_LEN is the number of values computed in parallel. +** This is readable via the global _SIMD_LEN. +*/ +#define LUA_SIMD_LEN 8 /* @@ LUA_NUMBER_SCAN is the format for reading numbers. diff -ur lua-5.1.4/src/lvm.c lua-5.1.4-simd/src/lvm.c --- lua-5.1.4/src/lvm.c 2007-12-28 10:32:23.000000000 -0500 +++ lua-5.1.4-simd/src/lvm.c 2009-01-01 22:43:45.437500000 -0500 @@ -357,6 +357,7 @@ #define Protect(x) { L->savedpc = pc; {x;}; base = L->base; } +#if 0 /* SIMDPATCH */ #define arith_op(op,tm) { \ TValue *rb = RKB(i); \ TValue *rc = RKC(i); \ @@ -367,6 +368,23 @@ else \ Protect(Arith(L, ra, rb, rc, tm)); \ } +#endif +#define arith_op(op,tm) { \ + TValue *rb = RKB(i); \ + TValue *rc = RKC(i); \ + if (ttisnumber(rb) && ttisnumber(rc)) { \ + int j; \ + for (j=0; j < LUA_SIMD_LEN; j++) { \ + TValue * rai = packeditem(ra,j); \ + lua_Number nb = nvalue(rb), nc = nvalue(rc); \ + setnvalue(rai, op(nb, nc)); \ + (*(TValueItem**)(void*)&rb) ++; \ + (*(TValueItem**)(void*)&rc) ++; \ + } \ + } \ + else \ + Protect(Arith(L, ra, rb, rc, tm)); \ + } @@ -401,7 +419,15 @@ lua_assert(L->top == L->ci->top || luaG_checkopenop(i)); switch (GET_OPCODE(i)) { case OP_MOVE: { + + #if 0 /* SIMD-PATCH */ setobjs2s(L, ra, RB(i)); + #endif + int j; + for (j=0; j < LUA_SIMD_LEN; j++) { + setobjs2s(L, packeditem(ra,j), packeditem(RB(i),j)); + } + continue; } case OP_LOADK: { @@ -654,8 +680,21 @@ if (luai_numlt(0, step) ? luai_numle(idx, limit) : luai_numle(limit, idx)) { dojump(L, pc, GETARG_sBx(i)); /* jump back */ + + #if 0 /* SIMD-PATCH */ setnvalue(ra, idx); /* update internal index... */ setnvalue(ra+3, idx); /* ...and external index */ + #endif + { int j; + for (j=0; j<LUA_SIMD_LEN; j++) { + TValue* ra_ = packeditem(ra, j); + TValue* step_ = packeditem(ra+2, j); + lua_Number idx_ = luai_numadd(nvalue(ra_), nvalue(step_)); + setnvalue(ra_, idx_); + setnvalue(ra_+3, idx_); + } + } + } continue; } @@ -670,7 +709,19 @@ luaG_runerror(L, LUA_QL("for") " limit must be a number"); else if (!tonumber(pstep, ra+2)) luaG_runerror(L, LUA_QL("for") " step must be a number"); + + #if 0 /* SIMD-PATCH */ setnvalue(ra, luai_numsub(nvalue(ra), nvalue(pstep))); + #endif + { int j; + for (j=0; j < LUA_SIMD_LEN; j++) { + setnvalue(packeditem(ra,j), + luai_numsub(nvalue(packeditem(ra,j)), + nvalue(packeditem(pstep,j))) ); + } + } + + dojump(L, pc, GETARG_sBx(i)); continue; } @@ -757,6 +808,26 @@ } continue; } + + /* SIMD-PATCH */ + case OP_GETPACKED: { + size_t off = (size_t)nvalue(RKC(i)); + if (off < 1 || off > LUA_SIMD_LEN) { + luaG_runerror(L, "packed offset out of range"); + } + setobjs2s(L, ra, (TValue*)((TValueItem*)RB(i) + off - 1) ); + continue; + } + case OP_SETPACKED: { + size_t off = (size_t)nvalue(RKB(i)); + if (off < 1 || off > LUA_SIMD_LEN) { + luaG_runerror(L, "packed offset out of range"); + } + setobjs2s(L, (TValue*)((TValueItem*)ra + off - 1), RKC(i) ); + continue; + } + + } } } diff -ur lua-5.1.4/src/print.c lua-5.1.4-simd/src/print.c --- lua-5.1.4/src/print.c 2007-03-25 20:17:38.000000000 -0400 +++ lua-5.1.4-simd/src/print.c 2009-01-01 22:19:20.562500000 -0500 @@ -116,6 +116,9 @@ printf("\t; %s",svalue(&f->k[bx])); break; case OP_GETTABLE: + + case OP_GETPACKED: /* SIMD-PATCH */ + case OP_SELF: if (ISK(c)) { printf("\t; "); PrintConstant(f,INDEXK(c)); } break; @@ -128,6 +131,9 @@ case OP_EQ: case OP_LT: case OP_LE: + + case OP_SETPACKED: /* SIMD-PATCH */ + if (ISK(b) || ISK(c)) { printf("\t; ");