On Wednesday 27, Mike Pall wrote:
> Robert G. Jakabosky wrote:
> > The first time I ran the
> > meteor.lua script it used more than 30 seconds to codegen all the Lua
> > code, whereas the normal Lua VM took less than 4 seconds to run the
> > whole script.
>
> The generated source from meteor.lua weighs in at 5600 lines with
> plenty of conditionals. Parse time (Lua -> bytecode) is 13ms,
> compile time (bytecode -> machine code) for LuaJIT 1.1 is 25ms
> (compare to the 30s for LLVM). Is this due to suboptimal use of
> LLVM or does it really mean LLVM as a JIT compiler is 1200x
> slower? *ick*
The LLVM JIT has to codegen a lot of LLVM IR for each of the inlined opcode 
functions.  The codegen phase has to convert the generic IR into machine code, 
whereas LuaJIT only has to merge machine code snippets.

This is my first time using LLVM and I don't know all the internals, so 
there might be a simple way to improve the codegen phase (at least I hope 
there is).

> > See the attached benchmark.log comparing the normal Lua vm with llvm-lua,
> > native, and luajit.
>
> I think you'd get a more meaningful comparison if you'd run LuaJIT
> with "luajit -O". Running all of the scripts in sequence within
> the same Lua state will distort the results since some tests turn
> off the GC or the JIT compiler and all subsequent tests will
> suffer.
Attached is a new run of the benchmark results with all the VMs being started 
from a bash script once per script (just like the 'native' method from the 
last benchmark).  Also I have added the "-O" flag to luajit.

> You haven't stated how Lua was compiled (for my comparisons I use
> -O3 -fomit-frame-pointer for all sources). It's usually easier to
> compare performance by taking a baseline (the standard Lua VM) and
> giving the speedup relative to that baseline (e.g. 2.00 = takes
> half the time). And summing up the times of all benchmarks is not
> a useful metric at all.
Sorry about that.  I think Lua from the original benchmark run was compiled 
with just "-O2".  This time they are all compiled 
with "-march=athlon64 -O3 -fomit-frame-pointer".  Only LuaJIT was compiled as 
32-bit code; the others are 64-bit.  That might not be fair to LuaJIT, but the 
reason I started this project was to get JIT support on x86_64.  I might do 
another benchmark run tonight with them all compiled as 32-bit code.

Also attached is a patch to 'lcoco.c' to add x86_64 assembly coroutine support 
to LuaCoco.  It saves nine 64-bit registers (it might be possible to lower that 
count, since the assembly code is inlined and the parent function might not 
use all those registers).

> Some of the benchmarks seem to be not the best/latest version
> (e.g. pidigits) or are missing in the comparison (mandelbrot?).
> Many of the older benchmarks are obsolete and contain rather
> untuned Lua code. This doesn't matter as long as one only compares
> between Lua VMs. But the moment someone gets his hands on these
> and compares them to other languages ...
They are a year or more old, from a CVS snapshot (from 2007-5-14 maybe) of the 
language shootout.  I collected those scripts together as a test suite when 
working on the Emergency GC patch.  It would be nice to collect together a 
set of scripts for benchmarking and testing Lua VM implementations.  Parrot 
VM has a good set of test scripts for testing the Lua implementation that 
they have.

I had looked at using the Parrot VM instead of LLVM since they already had a 
Lua->PIR compiler, but it was very slow and not 100% compatible yet (upvalues 
were not correctly implemented).  Also Parrot doesn't have a working x86_64 
JIT yet (if I remember right it was mostly missing).

> BTW: Latest SciMark for Lua is here:
>   http://luajit.org/download/scimark-2008-01-22.lua
Thanks, I will take a look at that.

-- 
Robert G. Jakabosky
--- LuaJIT-1.1.4/src/lcoco.c	2008-02-05 08:00:00.000000000 -0800
+++ llvm-lua/src/lcoco.c	2008-08-23 12:47:46.000000000 -0700
@@ -134,6 +134,39 @@
   coco->arg0 = (size_t)(a0);
 #define COCO_STATE_HEAD		size_t arg0;
 
+#elif defined(__x86_64__)
+
+typedef void *coco_ctx[9];  /* rip, rsp, rbp, rbx, r12, r13, r14, r15, rdi */
+static inline void coco_switch(coco_ctx from, coco_ctx to)
+{
+  __asm__ __volatile__ (
+    "leaq 1f(%%rip), %%rax\n\t"
+    "movq %%rax, (%0)\n\t" "movq %%rsp, 8(%0)\n\t" "movq %%rbp, 16(%0)\n\t"
+		"movq %%rbx, 24(%0)\n\t" "movq %%r12, 32(%0)\n\t" "movq %%r13, 40(%0)\n\t"
+		"movq %%r14, 48(%0)\n\t" "movq %%r15, 56(%0)\n\t" "movq %%rdi, 64(%0)\n\t"
+		"movq %1, %%rax\n\t"
+    "movq 64(%%rax), %%rdi\n\t" "movq 56(%%rax), %%r15\n\t" "movq 48(%%rax), %%r14\n\t"
+    "movq 40(%%rax), %%r13\n\t" "movq 32(%%rax), %%r12\n\t" "movq 24(%%rax), %%rbx\n\t"
+		"movq 16(%%rax), %%rbp\n\t" "movq 8(%%rax), %%rsp\n\t"
+		"jmp *(%%rax)\n"
+		"1:\n"
+    : "+S" (from), "+D" (to) : : "rax", "rcx", "rdx", "r8", "r9", "r10", "r11", "memory", "cc");
+}
+
+#define COCO_CTX		coco_ctx
+#define COCO_SWITCH(from, to)	coco_switch(from, to);
+#define COCO_MAKECTX(coco, buf, func, stack, a0) \
+  buf[0] = (void *)(func); \
+  buf[1] = (void *)(stack); \
+  buf[2] = (void *)0; \
+  buf[3] = (void *)0; \
+  buf[4] = (void *)0; \
+  buf[5] = (void *)0; \
+  buf[6] = (void *)0; \
+  buf[7] = (void *)0; \
+  buf[8] = (void *)a0; /* rdi == argument 0 */\
+  stack[0] = 0xdeadc0c0deadc0c0;  /* Dummy return address. */ \
+
 #elif __mips && _MIPS_SIM == _MIPS_SIM_ABI32 && !defined(__mips_eabi)
 
 /* No way to avoid the function prologue with inline assembler. So use this: */
script        lua      llvm-lua  native   luajit   
ackermann     2.20     2.31      1.88     0.40     
ary           1.06     0.60      0.54     0.27     
binarytrees   1.62     1.47      1.36     0.97     
chameneos     1.13     0.71      0.63     0.23     
except        1.12     0.97      0.87     0.49     
fannkuch      1.40     0.81      0.69     0.29     
fibo          0.83     0.54      0.50     0.13     
harmonic      1.28     0.42      0.28     0.23     
hash          1.08     1.12      0.91     0.89     
hash2         1.07     0.90      0.86     0.49     
heapsort      1.09     0.61      0.50     0.25     
hello         0.00     0.07      0.00     0.00     
knucleotide   0.66     0.72      0.56     0.44     
lists         1.04     0.83      0.69     0.43     
matrix        1.06     0.64      0.55     0.29     
meteor        3.95     6.03      4.52     1.20     
methcall      1.14     0.85      0.70     0.42     
moments       0.93     0.87      0.74     0.87     
nbody         1.05     0.78      0.54     0.23     
nestedloop    1.06     0.39      0.28     0.16     
nsieve        1.12     0.89      0.79     0.59     
nsievebits    1.39     0.84      0.66     0.23     
objinst       1.02     0.97      0.84     0.77     
partialsums   1.01     0.74      0.63     0.30     
pidigits      1.21     1.72      1.30     0.56     
process       0.00     0.82      0.90     0.82     
prodcons      0.98     0.68      0.58     0.29     
random        1.09     0.54      0.50     0.22     
recursive     1.27     0.80      0.65     0.15     
regexdna      0.97     1.04      1.23     0.96     
regexmatch    0.26     0.34      0.28     0.26     
revcomp       0.36     0.50      0.39     0.24     
reversefile   0.44     0.45      0.45     0.40     
sieve         1.03     0.62      0.55     0.28     
spectralnorm  1.23     0.66      0.54     0.27     
spellcheck    0.74     0.77      0.76     0.72     
strcat        0.87     0.82      0.70     0.63     
sumcol        0.81     0.88      0.81     0.85     
takfp         0.43     0.35      0.27     0.07     
wc            0.92     1.02      1.09     1.40     
wordfreq      0.68     0.73      0.65     0.62     
Total         42.60    37.82     32.17    19.31