Hello Robert Paprocki,

I had a couple of hand-wavy supplemental thoughts on your original
presentation… they may or may not be useful. (They may also be
inaccurate, as I am doing this from memory a day after the talk.) Feel
free to disregard.

Using the distinction that splits optimization into two separate
buckets, efficiency and performance, where efficiency is about
avoiding work (which is what your binary search does), my thoughts are
on the performance side: getting the most out of the hardware for the
work that is actually being done.

In your table of benchmarks in the presentation, you observed that for
the smaller numbers of addresses, a linear algorithm beat binary
search. You suspected this was due to the overhead of the JIT. While
there is probably a JIT overhead component to this, my guess is that
this is also an effect of memory locality in the hardware. If you are
not familiar with the issue, the gist is that the differential between
CPU clock speeds and the memory bus clock speed is now so great that
RAM access is like the new disk I/O. A modern Intel CPU can do a sqrt
in something like 12 cycles, but a memory fetch for just 4 bytes from
RAM can be something like 500 cycles, so your CPU stalls every time
that happens. The way to get more performance out of the hardware is
to keep your data in the CPU cache, where an L2 cache hit is ~12
cycles vs. ~500 for RAM. A linear search through contiguous memory can
be very fast because the CPU prefetches the upcoming values into the
cache. A binary search, on the other hand, can incur a cache miss on
every jump. (For a rough sense of scale: a binary search over a
million entries makes about 20 jumps; if every jump misses cache at
~500 cycles, that is ~10,000 cycles stalled on memory alone.)

So, _IF_ this is a significant component in the benchmarks you are
seeing, one additional optimization would be to try to harness the
performance of memory locality.

So my first thought is that you could implement a hybrid algorithm.
Start with your binary search (for efficiency), but stop dividing when
the remaining block of addresses is smaller than a certain threshold
(this number could be around those sizes in your benchmarks where
linear search beat your binary search). Then do a linear search within
that block. The hardware idea is that the remaining values sit in a
contiguous stretch of memory and you will be fetching them predictably
in order, so the CPU prefetcher will pull the values into your CPU
cache in time for you to access them. The hope is that (based on the
benchmarks you showed) you get the efficiency win of avoiding work by
doing most of the binary search, but in the final stretch, where the
number of addresses is small and linear wins (in your benchmarks), you
get to take advantage of that too. (A large assumption in my thinking
is that your binary search, even over the smaller range, still causes
lots of cache misses because you are jumping around in memory.)
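
To make the shape of that concrete, here is a minimal C sketch of the
hybrid idea. The function name and the threshold are mine and purely
illustrative; the threshold would have to be tuned from the crossover
point in your actual benchmarks:

    #include <stdint.h>
    #include <stddef.h>

    #define THRESHOLD 64  /* placeholder; tune from the benchmark crossover */

    /* Binary-search a sorted array until the window drops below
     * THRESHOLD, then scan the remaining contiguous block linearly
     * so the prefetcher can stay ahead of the loads. */
    ptrdiff_t hybrid_search(const uint32_t *addrs, size_t n, uint32_t target)
    {
        size_t lo = 0, hi = n;  /* candidates live in [lo, hi) */

        /* Binary phase: efficiency, skip most of the work. */
        while (hi - lo > THRESHOLD) {
            size_t mid = lo + (hi - lo) / 2;
            if (addrs[mid] < target)      lo = mid + 1;
            else if (addrs[mid] > target) hi = mid;
            else                          return (ptrdiff_t)mid; /* exact hit */
        }

        /* Linear phase: small, contiguous, prefetch-friendly. */
        for (size_t i = lo; i < hi; i++)
            if (addrs[i] == target)
                return (ptrdiff_t)i;
        return -1;  /* not found */
    }

(The three-way compare in the binary phase is just so an exact hit can
return early; the linear phase is where the prefetcher earns its keep.)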

There are also some variations of this idea…
For example, I'm assuming you are on server-class x86-64 machines with
lots of CPUs/cores, and possibly even running clusters. If you could
break up the list of addresses across all your cores/CPUs, where each
one only needs to focus on a smaller subset of the list (basically a
parallel binary search), then you might be able to crunch the result
faster. I don't know how easy it is to do this, especially with LuaJIT
(separate lua_States, or something like Lanes?).
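
Setting the Lua side aside (since I don't know the LuaJIT story
there), here is a rough C sketch of the sharding idea using plain
pthreads; the thread count and all the names are mine and just
illustrative:

    #include <pthread.h>
    #include <stdint.h>
    #include <stddef.h>

    #define NTHREADS 4  /* illustrative; match to your core count */

    struct shard {
        const uint32_t *base;  /* start of this thread's slice */
        size_t          n;     /* slice length */
        uint32_t        target;
        ptrdiff_t       hit;   /* index within the slice, or -1 */
    };

    /* Each thread binary-searches only its own contiguous slice. */
    static void *search_shard(void *arg)
    {
        struct shard *s = arg;
        size_t lo = 0, hi = s->n;
        s->hit = -1;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (s->base[mid] < s->target)      lo = mid + 1;
            else if (s->base[mid] > s->target) hi = mid;
            else { s->hit = (ptrdiff_t)mid; break; }
        }
        return NULL;
    }

    /* Split a sorted array into NTHREADS contiguous slices, search
     * them concurrently, and map any hit back to a global index. */
    ptrdiff_t parallel_search(const uint32_t *addrs, size_t n, uint32_t target)
    {
        pthread_t tids[NTHREADS];
        struct shard shards[NTHREADS];
        size_t chunk = n / NTHREADS;
        ptrdiff_t result = -1;

        for (int i = 0; i < NTHREADS; i++) {
            size_t off = (size_t)i * chunk;
            shards[i].base   = addrs + off;
            shards[i].n      = (i == NTHREADS - 1) ? n - off : chunk;
            shards[i].target = target;
            pthread_create(&tids[i], NULL, search_shard, &shards[i]);
        }
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(tids[i], NULL);
            if (shards[i].hit >= 0)
                result = (shards[i].base - addrs) + shards[i].hit;
        }
        return result;
    }

Whether the spawn/join overhead is worth it per lookup is a real
question; this probably only pays off with long-lived workers rather
than threads spawned per query.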

Another idea is to exploit SIMD. It seems like what you are doing
involves the same integer comparison over and over again, so there may
be an opportunity to do the comparisons wide, especially in the linear
search phase. You should be able to do four 32-bit comparisons in one
shot, for the same cost, on any x86-64 hardware; with AVX-512 you
could do 16 comparisons in one shot. I don't know what kind of SIMD
support, if any, LuaJIT has.
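
For illustration, here is roughly what the wide comparison looks like
in C with SSE2 intrinsics (SSE2 is baseline on x86-64); the function
name is mine, and __builtin_ctz assumes GCC/Clang:

    #include <stdint.h>
    #include <stddef.h>
    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Linear search that compares four 32-bit addresses against
     * `target` per iteration instead of one. The same pattern widens
     * to 8 lanes with AVX2 or 16 with AVX-512. */
    ptrdiff_t simd_linear_search(const uint32_t *addrs, size_t n, uint32_t target)
    {
        const __m128i t = _mm_set1_epi32((int32_t)target);
        size_t i = 0;

        for (; i + 4 <= n; i += 4) {
            __m128i v  = _mm_loadu_si128((const __m128i *)(addrs + i));
            __m128i eq = _mm_cmpeq_epi32(v, t);  /* 4 compares in one shot */
            int mask   = _mm_movemask_epi8(eq);  /* 4 mask bits per lane */
            if (mask)
                return (ptrdiff_t)(i + (size_t)__builtin_ctz(mask) / 4);
        }

        for (; i < n; i++)  /* scalar tail for the leftover 0-3 values */
            if (addrs[i] == target)
                return (ptrdiff_t)i;
        return -1;  /* not found */
    }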

Thank you and the rest of Kong for hosting the Lua Workshop.
- Eric Wing