[Date Prev][Date Next][Thread Prev][Thread Next]
- Subject: Re: Threaded vm core loop
- From: sur-behoffski <sur_behoffski@...>
- Date: Thu, 6 Apr 2017 22:06:25 +0930
Way back in 1997, I used the x86 instruction sequence "lodsb; xlat; jmp ax"
(all of 4 bytes) to do an unrolled threaded finite-state machine. I
published an article in Dr. Dobb's Journal ("High-Speed Finite-State
Machines"), showing this technique in use for a word-counting utility, a
static Huffman decoder, and a stripped-down non-deterministic (sigh)
regular expression search engine (incorporating a Boyer-Moore string
One of the benefits was that the entire code page for the state machine
fitted into 256 bytes, which gave it high cache coherency (even on a 386,
I think the instruction cache was L1). I could also analyse relationships
between state tables and sometimes substitute cheaper opcodes where I
could see that the next state, if not successful, would not have a
However, the project only met with muted interest. I found that the
Pentium was becoming the dominant PC processor, and it was moving towards
a load-store architecture, with branch prediction, with microcode being
used to support instructions less relevant as compiler technology also
evolved. I suspect that lodsb and xlat were candidates for microcode.
Where I saw potential with the threaded code I was playing with, but
never really got to look at closely, was in places where a programmer
could apply "vertical merging", e.g. combining a UTF-8 decoder with a
word count algorithm. However, this also has been made less relevant
in recent times, with speculative execution and multiple execution
units meaning that sequential operation could be kept front and centre...
... this is because an execution pipeline stall, such as would almost
certainly occur from a jump-table branch, is really expensive, as most of
the tricks noted above become unavailable. This is where "vertical
merging" is my hope: You cram various vertical abstract layers into a
single finite-state machine, such that the state machine does enough work
with each table-lookup dispatch, that you gain enough performance to
afford the significant performance hit of the pipeline stall.
The likely biggest downside of such merging is that state tables may very
quickly become too numerous to provide good performance, first in cache
coherency, and then perhaps in exceeding available memory.
My gut-feel reaction to performance differences on different machines
would be to look at cache coherency, perhaps microcode support, and then
branch prediction efficiency. JIT is very likely to write moderately
longer machine-code sequences than the tight opcode versions of code
segments, and so may suffer consequences in a higher rate of cache misses,
but I suspect this is a fairly minor consideration in modern hardware.
Anyway, hope that you found this threaded code/VM loop message useful.