- Subject: Re: Lua is faster than Java?
- From: Mike Pall <mikelu-0507@...>
- Date: Mon, 25 Jul 2005 15:19:24 +0200
Hi,
David Given wrote:
> I'm not sure a Lua-to-C compiler would help --- Lua's so dynamic that you
> can't actually convert a Lua expression into a single machine code routine.
> Even something as simple as "a + b" could do *anything*, depending on what
> types a and b were, what their metatables were defined as, the enclosing
> scopes, etc.
But you can make a good guess at the most common types for
a specific opcode. This allows you to inline the common
case, provided you first check for the expected type (and use
a fallback handler if that check fails).
E.g. here is 'x=x+2' (OP_ADD with 1 variable and 1 constant):
080731A0 cmp dword [ebx+0x8],byte +0x3 // LUA_TNUMBER in stack slot #0?
080731A4 jnz 0x80731ba // No, use fallback code
080731A6 fld qword [0x8074450] // Operand #2: FP constant '2'
080731AC fadd qword [ebx] // Operand #1: stack slot #0
080731AE fstp qword [ebx] // Destination: stack slot #0
080731B0 ...
// POST code segment:
080731BA lea ecx,[ebx] // Operand #1
080731BC mov edx,0x8074450 // Operand #2
// Destination: BASE += 0 omitted
080731C1 mov eax,0x5 // TM_ADD
080731C6 mov dword [esi+0x1c],0x806e9c8 // L->ctx = &pt->code[pc+1]
080731CD call 0x806e928 // Call fallback handler
080731D2 jmp short 0x80731b0 // Continue with next opcode
That's only 5 (five!) instructions in the fast path. And this
is even without data flow analysis, which might allow you to
remove the check (2 fewer instructions). But that probably
doesn't pay off here, since the forward branch is statically
predicted as 'not-taken' and the FP pipeline hides some of the
latency.
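The guarded fast path above can be rendered in C roughly as follows. This is a minimal sketch only: the tagged-value layout and the names (TAG_NUMBER, slow_add, compiled_add_const) are hypothetical stand-ins, loosely modeled on Lua's TValue and the TM_ADD fallback, not the actual implementation:

```c
#define TAG_NUMBER 3  /* stand-in for LUA_TNUMBER */

typedef struct {
    int tag;      /* type tag, checked before taking the fast path */
    double n;     /* number payload */
} Value;

/* Hypothetical fallback standing in for the TM_ADD handler; the real
   one would do coercions and the __add metamethod lookup. */
static double slow_add(Value *x, double k) {
    return x->n + k;
}

/* Compiled form of 'x = x + 2': one type check guards the inlined add. */
static void compiled_add_const(Value *x, double k) {
    if (x->tag == TAG_NUMBER) {   /* the single guard from the fast path */
        x->n += k;                /* inlined FP add, stored back in place */
    } else {
        x->n = slow_add(x, k);    /* rare case: call the fallback handler */
        x->tag = TAG_NUMBER;
    }
}
```

The point is that the guard plus the inlined operation is all the fast path costs; everything dynamic is pushed into the out-of-line fallback.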
Similar things can be done for x..y (probably strings) or *x
(probably a table or a string) and so on ...
And (most important) you can inline the whole hashtable lookup
for t.foo (aka t["foo"]) in a few instructions ("foo" is a constant
and you know both the hash value and the TString pointer at
compile time). This makes table-based dispatch competitive with
static method dispatch (like in compiled C++).
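To illustrate, here is a rough C sketch of such a constant-key lookup. The struct layout, the interning scheme, and the precomputed hash are assumptions for illustration, not Lua's actual Table/TString internals:

```c
#include <stddef.h>

#define HASH_SIZE 8  /* power of two, so masking replaces modulo */

typedef struct Node {
    const char *key;      /* interned string: pointer compare suffices */
    double value;
    struct Node *next;    /* collision chain */
} Node;

typedef struct {
    Node nodes[HASH_SIZE];
} Table;

/* For t.foo the compiler knows the interned key pointer and its hash
   at compile time, so the lookup collapses to an index, an identity
   compare, and a load -- no hashing, no strcmp at runtime. */
static double get_const_key(Table *t, const char *key, unsigned hash) {
    Node *n = &t->nodes[hash & (HASH_SIZE - 1)];  /* index is a constant */
    do {
        if (n->key == key)   /* interned: pointer identity check */
            return n->value;
        n = n->next;
    } while (n != NULL);
    return 0.0;              /* absent key (nil in Lua) */
}
```

In the common no-collision case this is just a handful of instructions, which is why it can compete with a vtable-style indirect call.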
> All naive compilation would gain is saving the overhead of doing
> the instruction decode. (Which may still be worthwhile, of course --- how
> much overhead is there?)
Instruction decoding is only one part of the overhead for a
bytecode interpreter. The other major problem is the unpredictable
branch for the opcode dispatch (C switch statement).
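A toy switch-based dispatch loop (hypothetical opcodes, not Lua's) shows where that indirect branch comes from -- every iteration funnels through the one branch generated for the switch, which the predictor keeps missing when opcodes alternate:

```c
/* Toy bytecode: an opcode, optionally followed by a constant index. */
enum { OP_LOADK, OP_ADD, OP_RET };

static double run(const int *code, const double *konst) {
    double acc = 0.0;
    for (;;) {
        switch (*code++) {               /* single indirect branch */
        case OP_LOADK: acc = konst[*code++]; break;
        case OP_ADD:   acc += konst[*code++]; break;
        case OP_RET:   return acc;
        }
    }
}
```

Compiling straight to machine code removes both the decode and this dispatch branch entirely, which is where most of the speedup below comes from.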
Here's the timing on a code snippet which alternates between
two opcodes (LOADBOOL and FORLOOP):
$ TIMEFORMAT="%U"
$ time lua -e 'local x; for i=1,1e8 do x = true end'
6.106
$ time luajit -e 'local x; for i=1,1e8 do x = true end'
0.893
The machine code is almost 7 times faster. So yes, it _does_ pay off.
> However, a dynamically recompiling JIT like Python's Psyco would
> probably make Lua fly (even more). Unfortunately, these require
> deep knowledge of the black arts to make work effectively...
Python has a plethora of different types. And almost all of
them can be used with any opcode. It's a lot more difficult
to statically predict the types for (say) a[b]. A dynamically
specializing compiler (such as Psyco) is the only realistic
approach for Python.
With Lua one can do far better with far less effort. The reduced
type variability pays off here, too.
Ok, you've read past all this, so here is the hidden message:
I'm working on LuaJIT, but didn't want to announce it before
I was ready to release the code. I've gotten pretty far, but
there's still a lot to do. Initial results are very promising
(see above). ETA for the first release is 8/2005. Please be
patient. Thank you.
Bye,
Mike