Simd Experiment

lua-users home
wiki

The following patch to Lua provides an experimental implementation of a type of Single Instruction Multiple Data (SIMD) [1] capability in the Lua VM.

Importantly, the initial implementation here is ANSI C, and it does not use any SSE instructions or multithreading. How can this be? The opcode dispatch in the Lua VM imposes a non-negligible overhead. If, however, we interpret each opcode (instruction) once and execute that opcode on multiple data elements, we could expect to reduce the relative overhead of the opcode dispatch, even if the data is processed serially in each opcode.

The current implementation is an experimental prototype and not fully implemented. The initial goal was merely to estimate conceivable performance gains. The design favored implementation simplicity. In fact, the simplest approach, requiring the least patching, seemed to be to redefine the TValue type to contain an array of values. This obviously imposes overhead on non-SIMD operations, but that is ok for now given the goal above. (Redefining the lua_Number type in luaconf.h may be another somewhat simple approach, but it too is a global change.) A more refined implementation might leave TValue as is and define SIMD operations over a sequence of TValue's, possibly with new SIMD opcodes, thereby allowing both SIMD and non-SIMD operations.

In this implementation, each Lua number contains a certain number of sub-values (this number is defined in the global variable _SIMD_LEN and can be adjusted in luaconf.h). Individual sub-values in each Lua number can be indexed with the "packed" operator. Syntactically, the operator looks like a function call here, but semantically it is more like an indexing operator on register lexicals, and it has it's own op-codes.

-- ex1.lua

-- If you don't use the SIMD features, numbers will behave
-- as usual.  This is because any SIMD unaware operations
-- (e.g. print) only use the first sub-value of a number. 
local x = 10
print(x + 50)  --> 60

-- Individual subvalues can be accessed with the packed operator:
assert(packed(x,1) == 10)  -- success
packed(x,2) = 20
assert(packed(x,1) == 10)
assert(packed(x,2) == 20)

local y; packed(y,1) = 100; packed(y,2) = 200;
local z = x + y     -- this is a SIMD instruction
assert(packed(z,1) == 110)
assert(packed(z,2) == 220)
print(z) --> 110   (print is not currently SIMD-aware)

print(_SIMD_LEN) --> prints some number (e.g. 2)

Here's what the opcodes look like for that. Note the new GETPACKED and SETPACKED instructions. All other opcodes are unchanged.

$ ./src/luac -p -l ex1.lua

main <ex.lua:0,0> (51 instructions, 204 bytes at 0x682930)
0+ params, 5 slots, 0 upvalues, 3 locals, 12 constants, 0 functions
        1       [4]     LOADK           0 -1    ; 10
        2       [5]     GETGLOBAL       1 -2    ; print
        3       [5]     ADD             2 0 -3  ; - 50
        4       [5]     CALL            1 2 1
        5       [8]     GETGLOBAL       1 -4    ; assert
        6       [8]     GETPACKED       2 0 -5  ; 1
        7       [8]     EQ              1 2 -1  ; - 10
        8       [8]     JMP             1       ; to 10
        9       [8]     LOADBOOL        2 0 1
        10      [8]     LOADBOOL        2 1 0
        11      [8]     CALL            1 2 1
        12      [9]     SETPACKED       0 -6 -7 ; 2 20
        13      [10]    GETGLOBAL       1 -4    ; assert
        14      [10]    GETPACKED       2 0 -5  ; 1
        15      [10]    EQ              1 2 -1  ; - 10
        16      [10]    JMP             1       ; to 18
        17      [10]    LOADBOOL        2 0 1
        18      [10]    LOADBOOL        2 1 0
        19      [10]    CALL            1 2 1
        20      [11]    GETGLOBAL       1 -4    ; assert
        21      [11]    GETPACKED       2 0 -6  ; 2
        22      [11]    EQ              1 2 -7  ; - 20
        23      [11]    JMP             1       ; to 25
        24      [11]    LOADBOOL        2 0 1
        25      [11]    LOADBOOL        2 1 0
        26      [11]    CALL            1 2 1
        27      [13]    LOADNIL         1 1
        28      [13]    SETPACKED       1 -5 -8 ; 1 100
        29      [13]    SETPACKED       1 -6 -9 ; 2 200
        30      [14]    ADD             2 0 1
        31      [15]    GETGLOBAL       3 -4    ; assert
        32      [15]    GETPACKED       4 2 -5  ; 1
        33      [15]    EQ              1 4 -10 ; - 110
        34      [15]    JMP             1       ; to 36
        35      [15]    LOADBOOL        4 0 1
        36      [15]    LOADBOOL        4 1 0
        37      [15]    CALL            3 2 1
        38      [16]    GETGLOBAL       3 -4    ; assert
        39      [16]    GETPACKED       4 2 -6  ; 2
        40      [16]    EQ              1 4 -11 ; - 220
        41      [16]    JMP             1       ; to 43
        42      [16]    LOADBOOL        4 0 1
        43      [16]    LOADBOOL        4 1 0
        44      [16]    CALL            3 2 1
        45      [17]    GETGLOBAL       3 -2    ; print
        46      [17]    MOVE            4 2
        47      [17]    CALL            3 2 1
        48      [19]    GETGLOBAL       3 -2    ; print
        49      [19]    GETGLOBAL       4 -12   ; _SIMD_LEN
        50      [19]    CALL            3 2 1
        51      [19]    RETURN          0 1

Now, let's do some benchmarks: summing the integers from 1 to 2^28. In standard Lua it would look like this:

-- test1-standard.lua
local sum = 0
for i=1,2^28 do sum = sum + i end
print(sum)

Using this SIMD patch, we could rewrite it like this:

-- test1-simd.lua

local N=_SIMD_LEN

local j;    for k=1,N do packed(j,k)   = k end
local psum; for k=1,N do packed(psum,k)= 0 end
local fi;   for k=1,N do packed(fi,k)  = k end
local fs;   for k=1,N do packed(fs,k)  = N end

for i=fi,2^28,fs do
  psum = psum + i
end
local sum = 0
for i=1,N do
  -- print('partial sum', i, packed(psum,i))
  sum = sum + packed(psum,i)
end
print('sum:', sum)

The SIMD version computes N partial sums (each computed in parallel) and then sums the partial sums.

Here's the benchmark runtimes as a function of the compiled value of _SIMD_LEN (N).

Lua version       run-time
----------------  --------
unpatched         8.095s
SIMD patch N=1    9.265s
SIMD patch N=2    5.660s
SIMD patch N=3    3.566s
SIMD patch N=4    3.248s
SIMD patch N=8    2.688s
SIMD patch N=32   2.406s
SIMD patch N=256  2.529s
SIMD patch N=1024 2.894s

Using the SIMD patch with N=1 effectively negates the patch and is useful to compare to the unpatched version. With N set to 32, the runtimes are reduced by a significant 70%, at least on this very simple example.

The patch against Lua 5.1.4 is in Appendix A below.

A simple extension to this patch would be to use real SSE operations [2] (which hopefully will improve performance performance further).

--DavidManura

Appendix A - Lua Patch

diff -ur lua-5.1.4/src/lbaselib.c lua-5.1.4-simd/src/lbaselib.c
--- lua-5.1.4/src/lbaselib.c	2008-02-14 11:46:22.000000000 -0500
+++ lua-5.1.4-simd/src/lbaselib.c	2009-01-01 22:43:22.671875000 -0500
@@ -642,6 +642,11 @@
   lua_setfield(L, -2, "__mode");  /* metatable(w).__mode = "kv" */
   lua_pushcclosure(L, luaB_newproxy, 1);
   lua_setglobal(L, "newproxy");  /* set global `newproxy' */
+
+  /* SIMD-PATCH */
+  lua_pushnumber(L, LUA_SIMD_LEN);
+  lua_setglobal(L, "_SIMD_LEN");
+
 }
 
 
diff -ur lua-5.1.4/src/lcode.c lua-5.1.4-simd/src/lcode.c
--- lua-5.1.4/src/lcode.c	2007-12-28 10:32:23.000000000 -0500
+++ lua-5.1.4-simd/src/lcode.c	2009-01-01 22:20:37.890625000 -0500
@@ -317,6 +317,16 @@
       e->k = VRELOCABLE;
       break;
     }
+
+    /* SIMD-PATCH */
+    case VPACKED: {
+      freereg(fs, e->u.s.aux);
+      freereg(fs, e->u.s.info);
+      e->u.s.info = luaK_codeABC(fs, OP_GETPACKED, 0, e->u.s.info, e->u.s.aux);
+      e->k = VRELOCABLE;
+      break;
+    }
+
     case VINDEXED: {
       freereg(fs, e->u.s.aux);
       freereg(fs, e->u.s.info);
@@ -486,6 +496,14 @@
       luaK_codeABx(fs, OP_SETGLOBAL, e, var->u.s.info);
       break;
     }
+
+    /* SIMD-PATCH */
+    case VPACKED: {
+      int e = luaK_exp2RK(fs, ex);
+      luaK_codeABC(fs, OP_SETPACKED, var->u.s.info, var->u.s.aux, e);
+      break;
+    }
+
     case VINDEXED: {
       int e = luaK_exp2RK(fs, ex);
       luaK_codeABC(fs, OP_SETTABLE, var->u.s.info, var->u.s.aux, e);
diff -ur lua-5.1.4/src/llex.c lua-5.1.4-simd/src/llex.c
--- lua-5.1.4/src/llex.c	2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/llex.c	2009-01-01 22:20:17.531250000 -0500
@@ -37,7 +37,11 @@
 const char *const luaX_tokens [] = {
     "and", "break", "do", "else", "elseif",
     "end", "false", "for", "function", "if",
-    "in", "local", "nil", "not", "or", "repeat",
+    "in", "local", "nil", "not", "or",
+
+    "packed", /* SIMD-PATCH */
+    
+    "repeat",
     "return", "then", "true", "until", "while",
     "..", "...", "==", ">=", "<=", "~=",
     "<number>", "<name>", "<string>", "<eof>",
diff -ur lua-5.1.4/src/llex.h lua-5.1.4-simd/src/llex.h
--- lua-5.1.4/src/llex.h	2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/llex.h	2009-01-01 22:21:41.859375000 -0500
@@ -25,7 +25,11 @@
   /* terminal symbols denoted by reserved words */
   TK_AND = FIRST_RESERVED, TK_BREAK,
   TK_DO, TK_ELSE, TK_ELSEIF, TK_END, TK_FALSE, TK_FOR, TK_FUNCTION,
-  TK_IF, TK_IN, TK_LOCAL, TK_NIL, TK_NOT, TK_OR, TK_REPEAT,
+  TK_IF, TK_IN, TK_LOCAL, TK_NIL, TK_NOT, TK_OR,
+  
+  TK_PACKED, /* SIMD-PATCH */
+  
+  TK_REPEAT,
   TK_RETURN, TK_THEN, TK_TRUE, TK_UNTIL, TK_WHILE,
   /* other terminal symbols */
   TK_CONCAT, TK_DOTS, TK_EQ, TK_GE, TK_LE, TK_NE, TK_NUMBER,
diff -ur lua-5.1.4/src/lobject.h lua-5.1.4-simd/src/lobject.h
--- lua-5.1.4/src/lobject.h	2008-08-06 09:29:48.000000000 -0400
+++ lua-5.1.4-simd/src/lobject.h	2009-01-01 22:42:06.609375000 -0500
@@ -70,8 +70,16 @@
 
 #define TValuefields	Value value; int tt
 
+/* SIMD-PATCH */
+typedef struct lua_TValueItem {
+  TValuefields;
+} TValueItem;
+
 typedef struct lua_TValue {
   TValuefields;
+
+  TValueItem more[LUA_SIMD_LEN - 1]; /* SIMD-PATCH */
+
 } TValue;
 
 
@@ -102,6 +110,9 @@
 
 #define l_isfalse(o)	(ttisnil(o) || (ttisboolean(o) && bvalue(o) == 0))
 
+/* SIMD-PATCH */
+#define packeditem(o,i) ( (TValue*)((TValueItem*)(o) + (i)) )
+
 /*
 ** for internal debug only
 */
diff -ur lua-5.1.4/src/lopcodes.c lua-5.1.4-simd/src/lopcodes.c
--- lua-5.1.4/src/lopcodes.c	2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/lopcodes.c	2009-01-01 22:19:43.640625000 -0500
@@ -52,6 +52,11 @@
   "CLOSE",
   "CLOSURE",
   "VARARG",
+
+  /* SIMD-PATCH */
+  "GETPACKED",
+  "SETPACKED",
+
   NULL
 };
 
@@ -98,5 +103,10 @@
  ,opmode(0, 0, OpArgN, OpArgN, iABC)		/* OP_CLOSE */
  ,opmode(0, 1, OpArgU, OpArgN, iABx)		/* OP_CLOSURE */
  ,opmode(0, 1, OpArgU, OpArgN, iABC)		/* OP_VARARG */
+
+ /* SIMD-PATCH */
+ ,opmode(0, 1, OpArgR, OpArgK, iABC)		/* OP_GETPACKED */
+ ,opmode(0, 0, OpArgK, OpArgK, iABC)		/* OP_SETPACKED */
+
 };
 
diff -ur lua-5.1.4/src/lopcodes.h lua-5.1.4-simd/src/lopcodes.h
--- lua-5.1.4/src/lopcodes.h	2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/lopcodes.h	2009-01-01 22:23:40.531250000 -0500
@@ -205,10 +205,18 @@
 OP_CLOSURE,/*	A Bx	R(A) := closure(KPROTO[Bx], R(A), ... ,R(A+n))	*/
 
 OP_VARARG/*	A B	R(A), R(A+1), ..., R(A+B-1) = vararg		*/
-} OpCode;
+
+/* SIMD-PATCH */
+,OP_GETPACKED/*	A B C	R(A) := packed(R(B), RK(C))				*/
+,OP_SETPACKED/*	A B C	packed(R(A), RK(B)) := RK(C)				*/
 
 
+} OpCode;
+
+#if 0 /* SIMD-PATCH */
 #define NUM_OPCODES	(cast(int, OP_VARARG) + 1)
+#endif
+#define NUM_OPCODES	(cast(int, OP_SETPACKED) + 1)
 
 
 
diff -ur lua-5.1.4/src/lparser.c lua-5.1.4-simd/src/lparser.c
--- lua-5.1.4/src/lparser.c	2007-12-28 10:32:23.000000000 -0500
+++ lua-5.1.4-simd/src/lparser.c	2009-01-01 22:20:07.343750000 -0500
@@ -691,6 +691,22 @@
   /* primaryexp ->
         prefixexp { `.' NAME | `[' exp `]' | `:' NAME funcargs | funcargs } */
   FuncState *fs = ls->fs;
+
+  /* SIMD-PATCH */
+  if (ls->t.token == TK_PACKED) {
+    expdesc t;
+    expdesc k;
+    luaX_next(ls);
+    checknext(ls, '(');
+    expr(ls, &t);
+    checknext(ls, ',');
+    expr(ls, &k);
+    checknext(ls, ')');
+    init_exp(v, VPACKED, luaK_exp2anyreg(fs, &t));
+    v->u.s.aux = luaK_exp2RK(fs, &k);
+    return;
+  }
+
   prefixexp(ls, v);
   for (;;) {
     switch (ls->t.token) {
diff -ur lua-5.1.4/src/lparser.h lua-5.1.4-simd/src/lparser.h
--- lua-5.1.4/src/lparser.h	2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/lparser.h	2009-01-01 22:48:25.640625000 -0500
@@ -26,12 +26,17 @@
   VLOCAL,	/* info = local register */
   VUPVAL,       /* info = index of upvalue in `upvalues' */
   VGLOBAL,	/* info = index of table; aux = index of global name in `k' */
+
+  /* SIMD-PATCH */
+  VPACKED,	/* info = local register; aux = index register (or `k') */
+
   VINDEXED,	/* info = table register; aux = index register (or `k') */
   VJMP,		/* info = instruction pc */
   VRELOCABLE,	/* info = instruction pc */
   VNONRELOC,	/* info = result register */
   VCALL,	/* info = instruction pc */
   VVARARG	/* info = instruction pc */
+
 } expkind;
 
 typedef struct expdesc {
diff -ur lua-5.1.4/src/luaconf.h lua-5.1.4-simd/src/luaconf.h
--- lua-5.1.4/src/luaconf.h	2008-02-11 11:25:08.000000000 -0500
+++ lua-5.1.4-simd/src/luaconf.h	2009-01-01 23:42:57.625000000 -0500
@@ -510,6 +510,12 @@
 */
 #define LUAI_UACNUMBER	double
 
+/* SIMD-PATCH */
+/*
+@@ LUA_SIMD_LEN is the number of values computed in parallel.
+** This is readable via the global _SIMD_LEN.
+*/
+#define LUA_SIMD_LEN 8
 
 /*
 @@ LUA_NUMBER_SCAN is the format for reading numbers.
diff -ur lua-5.1.4/src/lvm.c lua-5.1.4-simd/src/lvm.c
--- lua-5.1.4/src/lvm.c	2007-12-28 10:32:23.000000000 -0500
+++ lua-5.1.4-simd/src/lvm.c	2009-01-01 22:43:45.437500000 -0500
@@ -357,6 +357,7 @@
 #define Protect(x)	{ L->savedpc = pc; {x;}; base = L->base; }
 
 
+#if 0 /* SIMDPATCH */
 #define arith_op(op,tm) { \
         TValue *rb = RKB(i); \
         TValue *rc = RKC(i); \
@@ -367,6 +368,23 @@
         else \
           Protect(Arith(L, ra, rb, rc, tm)); \
       }
+#endif
+#define arith_op(op,tm) { \
+        TValue *rb = RKB(i); \
+        TValue *rc = RKC(i); \
+        if (ttisnumber(rb) && ttisnumber(rc)) { \
+          int j; \
+          for (j=0; j < LUA_SIMD_LEN; j++) { \
+            TValue * rai = packeditem(ra,j); \
+            lua_Number nb = nvalue(rb), nc = nvalue(rc); \
+            setnvalue(rai, op(nb, nc)); \
+            (*(TValueItem**)(void*)&rb) ++; \
+            (*(TValueItem**)(void*)&rc) ++; \
+          } \
+        } \
+        else \
+          Protect(Arith(L, ra, rb, rc, tm)); \
+      }
 
 
 
@@ -401,7 +419,15 @@
     lua_assert(L->top == L->ci->top || luaG_checkopenop(i));
     switch (GET_OPCODE(i)) {
       case OP_MOVE: {
+
+        #if 0 /* SIMD-PATCH */
         setobjs2s(L, ra, RB(i));
+        #endif
+        int j;
+        for (j=0; j < LUA_SIMD_LEN; j++) {
+          setobjs2s(L, packeditem(ra,j), packeditem(RB(i),j));
+        }
+
         continue;
       }
       case OP_LOADK: {
@@ -654,8 +680,21 @@
         if (luai_numlt(0, step) ? luai_numle(idx, limit)
                                 : luai_numle(limit, idx)) {
           dojump(L, pc, GETARG_sBx(i));  /* jump back */
+
+          #if 0 /* SIMD-PATCH */
           setnvalue(ra, idx);  /* update internal index... */
           setnvalue(ra+3, idx);  /* ...and external index */
+          #endif
+          { int j;
+            for (j=0; j<LUA_SIMD_LEN; j++) {
+              TValue* ra_ = packeditem(ra, j);
+              TValue* step_ = packeditem(ra+2, j);
+              lua_Number idx_ = luai_numadd(nvalue(ra_), nvalue(step_));
+              setnvalue(ra_, idx_);
+              setnvalue(ra_+3, idx_);
+            }
+          }
+
         }
         continue;
       }
@@ -670,7 +709,19 @@
           luaG_runerror(L, LUA_QL("for") " limit must be a number");
         else if (!tonumber(pstep, ra+2))
           luaG_runerror(L, LUA_QL("for") " step must be a number");
+
+        #if 0 /* SIMD-PATCH */
         setnvalue(ra, luai_numsub(nvalue(ra), nvalue(pstep)));
+        #endif
+        { int j;
+          for (j=0; j < LUA_SIMD_LEN; j++) {
+            setnvalue(packeditem(ra,j),
+              luai_numsub(nvalue(packeditem(ra,j)),
+                          nvalue(packeditem(pstep,j))) );
+          }
+        }
+
+
         dojump(L, pc, GETARG_sBx(i));
         continue;
       }
@@ -757,6 +808,26 @@
         }
         continue;
       }
+
+      /* SIMD-PATCH */
+      case OP_GETPACKED: {
+        size_t off = (size_t)nvalue(RKC(i));
+        if (off < 1 || off > LUA_SIMD_LEN) {
+          luaG_runerror(L, "packed offset out of range");
+        }
+        setobjs2s(L, ra, (TValue*)((TValueItem*)RB(i) + off - 1) );
+        continue;
+      }
+      case OP_SETPACKED: {
+        size_t off = (size_t)nvalue(RKB(i));
+        if (off < 1 || off > LUA_SIMD_LEN) {
+          luaG_runerror(L, "packed offset out of range");
+        }
+        setobjs2s(L, (TValue*)((TValueItem*)ra + off - 1), RKC(i) );
+        continue;
+      }
+
+
     }
   }
 }
diff -ur lua-5.1.4/src/print.c lua-5.1.4-simd/src/print.c
--- lua-5.1.4/src/print.c	2007-03-25 20:17:38.000000000 -0400
+++ lua-5.1.4-simd/src/print.c	2009-01-01 22:19:20.562500000 -0500
@@ -116,6 +116,9 @@
     printf("\t; %s",svalue(&f->k[bx]));
     break;
    case OP_GETTABLE:
+
+   case OP_GETPACKED: /* SIMD-PATCH */
+
    case OP_SELF:
     if (ISK(c)) { printf("\t; "); PrintConstant(f,INDEXK(c)); }
     break;
@@ -128,6 +131,9 @@
    case OP_EQ:
    case OP_LT:
    case OP_LE:
+
+   case OP_SETPACKED: /* SIMD-PATCH */
+
     if (ISK(b) || ISK(c))
     {
      printf("\t; ");

RecentChanges · preferences
edit · history
Last edited May 2, 2009 2:32 am GMT (diff)