lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


On 10 December 2017 at 04:21, Daurnimator <quae@daurnimator.com> wrote:
> I'm curious what the performance hit is in the general case for
> swapping to siphash.

I did some hash comparisons, not in Lua but on LuaJIT.

As some might know, LuaJIT uses a different string hash from Lua, it's
even simpler and easier to generate collisions.  The advantage is that
it's constant time; the string's length doesn't affect the time
required to hash it.

First, I took the string interning code (where the string gets hashed
and, if new, inserted into the unique-string table) to easily modify
and benchmark it.

for comparisons, i used SIP hash and Highway Hash (a SIMD-optimized
hash, which i ported from C++ to plain C, gaining a small speed
advantage at the cost of some compiler portability).

first, it was very obvious that the constant time nature of LuaJIT's
hash made it totally unfair to the other algorithms.  But since i was
interested in moderate-length strings (10-50 bytes long), i did a few
more optimizations about short length, with little effect.

then i realized an interesting fact: the hash operation is used to
choose one of two options:  if it's the same as a known string, it
then does a full compare.  if it's new, it's copied.  in both cases
the whole string is fully scanned.   what does that mean for cache
efficiency?

in previous experience, i've noted that a linear-time algorithm ( O(n)
) can beat a more advanced one ( O(log n), or even O(1) ) if the data
is has to scan happens to be already in the CPU cache.

So, I switched from measuring the hash time to measure the whole time
of hash+copy, and compare the difference made by the hash itself.   In
several cases it made a significant difference.  it was also fun to
see a nice 'staircase' on HighwayHash, with a step of exactly 32 bytes
each, fittint with the size of the SIMD chunks it used.

but still, the fastest algorithm, limited to les than 64 chars, was
more than 10 times slower than the original LuaJIT hash.

so.... how many collisions would a potential attacker had to do before
a "better" hash is faster than the "simplistic" one?  did a few tests
and found that it would have to be a few tens of thousands of unique
strings with identical hash.  very easy to generate; but in our code
base, all external inputs that end up as table keys are size limited
to at most a couple hundred.

so, the problem is real, but not that pressing.  and in our specific
use case, it has no effect.


of course, every case is different and it is important to throw big
collision chunks to any external input, as part of your regular tests.


-- 
Javier