On Jun 5, 2012, at 6:59 AM, Rob Kendrick wrote:

> On Tue, Jun 05, 2012 at 10:15:00AM +0200, Jo-Philipp Wich wrote:
>> Hi.
>> 
>> The OpenWrt Lua is patched and uses a different bytecode format
>> compared to the official vanilla Lua release.
>> 
>> You can find the list of used patches here:
>> https://dev.openwrt.org/browser/trunk/package/lua/patches
> 
> Some of these patches terrify me.  Specifically, the opcode performance
> patch.  Does that really help that much?  It's a pretty simplistic use
> of computed goto to replace a pretty simple and small switch statement.

Well, it doesn't exactly terrify me, but I'm not sure what good it does in position-independent code. Looking at the -Os output on mips2 (close to the dominant OpenWrt architecture), the switch statement uses two additional instructions over the computed goto (a range check). My intuition is that load scheduling is slightly better for the goto version, but this is far enough down in the weeds that the microarchitecture may be relevant. At this level, the memory system (including the TLB and cache associativity) is in play--benchmarks on live systems seem wise, especially because low-end embedded Linux systems tend to have awful memory hierarchies.[1]

Note that in position-independent code, a table of label addresses ends up requiring relocations at runtime, since the absolute addresses of the instructions are not known until then. The table may be declared const, but the program loader has to map it read/write in order to apply the relocations.
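For anyone who has not seen the idiom, here is a minimal sketch of the two dispatch styles being compared. It is not taken from the OpenWrt patch or from lvm.c: the three opcodes and the function names are made up for illustration, and the goto version relies on the GNU C labels-as-values extension (GCC and Clang), so it is not portable C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy opcodes, for illustration only. */
enum { OP_INC, OP_DBL, OP_HALT };

/* Plain switch dispatch: the compiler is free to emit a range check
 * followed by a jump table, and it chooses the table format itself. */
int run_switch(const unsigned char *code)
{
    int acc = 0;
    assert(code != NULL);
    for (;;) {
        switch (*code++) {
        case OP_INC:  acc += 1; break;
        case OP_DBL:  acc *= 2; break;
        case OP_HALT: return acc;
        default:      return -1;   /* out-of-range opcode is caught here */
        }
    }
}

/* Computed-goto dispatch through a table of absolute label addresses.
 * There is no range check: an opcode outside the table is undefined
 * behaviour. In position-independent code each entry of this table
 * needs a load-time relocation. */
int run_goto_abs(const unsigned char *code)
{
    static void *const table[] = { &&do_inc, &&do_dbl, &&do_halt };
    int acc = 0;
    assert(code != NULL);
    goto *table[*code++];
do_inc:
    acc += 1;
    goto *table[*code++];
do_dbl:
    acc *= 2;
    goto *table[*code++];
do_halt:
    return acc;
}
```

Feeding both functions the same program, say { OP_INC, OP_DBL, OP_INC, OP_HALT }, should give the same accumulator value. Building with -fPIC and inspecting the relocations (for example with readelf -r) is one way to see what the address table costs on a given target.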

In theory GCC could be smart enough to convert a static const table of label addresses interior to a function into offsets from the start of the function, which would make the table values position-independent and truly read-only. But since this is exactly what the switch statement generator already does, there is little point. By using computed goto, you said you were smarter than the compiler, and it believed you....
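The offset form can also be written by hand; the GCC manual's section on labels as values describes storing label differences for exactly this reason. A sketch under the same assumptions as before (toy opcodes and function name are hypothetical, GNU C extensions required, no bounds check on the opcode):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy opcodes, for illustration only. */
enum { OP_INC, OP_DBL, OP_HALT };

/* Computed-goto dispatch through a table of offsets relative to one
 * label. The differences are link-time constants, so the table does
 * not need per-entry load-time relocations the way a table of
 * absolute addresses does. */
int run_goto_rel(const unsigned char *code)
{
    static const int table[] = {
        &&do_inc  - &&do_inc,
        &&do_dbl  - &&do_inc,
        &&do_halt - &&do_inc
    };
    int acc = 0;
    assert(code != NULL);
    /* Arithmetic on void * is itself a GNU extension. */
    goto *(&&do_inc + table[*code++]);
do_inc:
    acc += 1;
    goto *(&&do_inc + table[*code++]);
do_dbl:
    acc *= 2;
    goto *(&&do_inc + table[*code++]);
do_halt:
    return acc;
}
```

Each dispatch now pays for an extra add, which is much the same table-of-offsets shape a compiler may pick for a switch in position-independent code. Whether that beats the plain switch on a given target is something to measure rather than assume.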

College-level architecture courses had a dark age teaching assembly language and the machine model as if all code was statically linked on a machine with direct-mapped memory. H&P should have fixed that, but home computer weenies like me still have bogus intuition about what's cheap and what's expensive--the world isn't running on a 6502 or 68000 these days.

Jay

[1]: I remember telling somebody the NEC Vr4181 was just getting crushed by position-independent code's GOT indirection, since it seemed like the 8k data cache was getting killed by evictions. They said, "hmm, probably depends on the cache associativity", and I got to say, "*what* cache associativity?" In ~2000 a direct-mapped cache maybe made sense if you were writing bare-metal code, but it had pervasive performance effects on code from other models.