[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: bit.lshift and performance - luabitop v.s. lua-5.2.0-work4
- From: Mike Pall <mikelu-1010@...>
- Date: Wed, 13 Oct 2010 16:04:56 +0200
David Kastrup wrote:
> KHMan <keinhong@gmail.com> writes:
> > Anyway most app developers do not need to code DCTs and FFTs
> > nowadays.
>
> I find that C compilers tend to be actually astonishingly bad at that.
> Coding the inner loops in assembly language can often halve the
> execution time. Somehow the optimizers/register allocators don't seem
> to like complex butterflies all too well.
The overlapping lifetimes and the data dependencies in these
calculations represent a well known worst case for register
allocators.
Resource-constrained register allocation is an NP-complete
problem. That's why all practical register allocators defer to
heuristics. It seems the more complexity you throw into these
heuristics, the more brittle are the results. You'll notice when
tiny changes in the source code have huge effects on performance.
[BTW: That's one reason why DSPs and some vector units have an
abundance of registers. One might think that 128 registers is
absolute overkill. But this means the compiler never runs out of
registers in practice. This avoids the 'resource-constrained' part
of the problem and makes register allocation quite easy.]
But to get back on topic: I've spent quite some time to tune the
register allocator in LuaJIT. And it does a pretty good job on
FFT, too. The remaining slowdown compared to C is due to the type
checks on the array loads and some array-bounds checks that cannot
be eliminated easily. The planned low-level data type support
(part of the FFI) will resolve this.
--Mike