[Date Prev][Date Next][Thread Prev][Thread Next]
- Subject: Re: [Lua Workshop 2017] RE: Efficient Layer 7 Search of IP Address Space in LuaJIT/OpenResty
- From: Javier Guerra Giraldez <javier@...>
- Date: Fri, 20 Oct 2017 00:45:16 +0100
On 19 October 2017 at 22:52, Eric Wing <firstname.lastname@example.org> wrote:
> In your table of benchmarks in the presentation, for the smaller
> numbers of addresses, you observed that a linear algorithm beat binary
> search. You suspected this was due to the overhead of the JIT. While
> there is probably a JIT overhead component to this, my guess is that
> this is also an effect of memory locality in the hardware. If you are
I've done some experiments on this.
unfortunately, the obvious way to do binary search is very hostile to
LuaJIT, due to the way it builds individual traces. The condition at
the center of a binary search is (by design) well balanced, which
means the loop doesn't converge to a "main" path with seldom-executed
sidetraces. instead, it creates a knot of short traces tied into each
other, unable to get the biggest optimizations available in LuaJIT.
using bit arithmetic to replace the condition helps a lot, but
introduces other constraints. i wasn't able to keep the central
variables into CPU registers, so the code still wasn't as fast as
possible in C.
directly comparing IPv6 addresses (128 bits) using only 64-bit
quantities reintroduces conditionals within the inner loops. I think
the suggested strategy is to split IP addresses into layers. I've
seen IPv4 (32 bit) done with a 8-16-8 split. for ipv6 i guess it
would have to be something like 48-64-16.
when doing the same in C, i tried a B*tree approach with
cache-line-sized "blocks" and using SSE operations to scan within.
the idea is that the cache-memory split looks a lot like the
memory-disk split that makes B*trees so popular for databases and
filesystems. for IPv4 it makes some sense, since an AVX2 register can
hold 8 addresses, and a cache line holds 16; but for IPv6 it's only 2
and 4, respectively. worse, there's no 128-bit comparison, so it's
not easy to do operations "in parallel", SIMD style.
In the end, i used the simplest binary search, written in C, using the
non-standard-but-widely-supported __uint128_t integer size. the code
machine code generated for 128-bit comparisons avoids branches by
using conditional execution. that's a trick most C compilers can do,
but only for very short snippets. in this case, it's a single
instruction, so it neatly avoids code locality issues.