[Date Prev][Date Next][Thread Prev][Thread Next]
- Subject: Re: Computed goto optimization of vanilla Lua
- From: sur-behoffski <sur_behoffski@...>
- Date: Fri, 5 Feb 2016 08:16:20 +1030
On 02/04/16 22:16, firstname.lastname@example.org wrote:
Thanks for sharing this. It looks like a really interesting and
> easy-to-perform optimization (even if not new to everyone in the list ;).
I will definitely check how it performs in my Lua implementation on iOS. [...]
I was playing with threaded code and computed GOTOs in 8086 assembly back
in the late '90s, using C to handle the higher-level task analysis and
setup of the state tables. See:
The main code fitted into a 256-byte page, hence sat in a 386 L1
instruction cache. Code could jump to locations outside this page
for trickier things.
The trick was that a 2-byte or three-byte jump could be replaced with
(sort-of) a 4-byte tail call:
lodsb ; one byte
xlat ; one byte
jmp ax ; two bytes
In the publication, I presented three applications:
1. A simple word-counting state machine (increment a word counter on
each non-word-to-word state transition);
2. A simple regular expression matcher, although based on
non-deterministic (backtracking) finite-state automata, and
therefore vulnerable to exponential slowdowns in some cases; and
3. A simple static Huffman decoder.
In the regular expression case, I was able to optimise away some of the
backtracking cases in some tables, as an inspection of the next table
showed that any such backtracking would lead to a dead end.
In relation to the present Lua discussion:
- I received feedback that the _64 architectures were moving towards
a RISC-style load/store style, and that slow microcode was used to
emulate older instructions;
- Modern architectures are heavily pipelined, and also use both
speculative execution and jump prediction; the threaded jmp causes
a pipeline stall, which is a major performance hit.
I haven't researched the latest enhancements, but it's possible that
jump prediction could mitigate some of this performance hit... but
this is starting to go very close to the JIT recoding of the VM
There were other very naive reasons why the code ran slowly when an
underlying OS used an MMU was use to present the file to the code to the
program (I was trying to merge CR/LF line termination and LF termination
into a single case, and so incurred heavy copy-on-write page faults).
That was a case of my microprocessor background slipping through the net.
So, when examining code performance, pipeline stalls, jump prediction and
speculative execution are things I look for fairly early on. My gut feel
is that the interaction of these issues and VMs gets fairly hairy fairly
quickly. [The other issue I mentioned, RISC load/store versus microcode,
is something I haven't tracked over time.]
Hope this helps.
sur-behoffski (Brenton Hoff)
Programmer, Grouse Software