• Subject: Re: A good hash algorithm
• From: RLake@...
• Date: Tue, 18 Jun 2002 14:51:16 -0500

```>>let the lua authors evaluate the well presented claims on the page.

> Perhaps a kind soul has the time for this. If someone comes up with a
better
> hash algorithm than the one used in Lua and has established that it is
better
> in Lua, we'd be happy to consider adding it to the code, but please make
sure
> that it's not slower than the current one.

I'm not volunteering to do that :)

One of the interesting things about the Lua hash algorithm (aside from the
fact that it doesn't depend on the 32-bitness of integers) is that it is
time-bounded for long strings -- there is a maximum on the number of
characters hashed. This may strike people like the author of the cited
algorithm as inferior, but one needs to balance costs. The "best" hash
algorithm is not necessarily the fastest in actual practice.

Consider the case of a hash-table with 20 keys, each of which is a string
of a million bytes. A "really good" hash algorithm would probably succeed
in creating a unique hash for each string, but in O(1,000,000) time. The
Lua hash algorithm might hash all of them onto the same value; the cost of
that would be that the table lookup would be essentially linear, or O(20).
In practice, scanning 20 table entries on each lookup is probably faster
than computing 20 hashes of a million bytes, unless there are a lot of
lookups.

Here's my S/0.02 contribution to Lua hashes, though (and I haven't looked
at v5 yet, so ignore me if you've done this):

Once a string is interned into the string hash table, its address becomes a
unique identifier. So there are two alternatives: use the hash as
originally computed as the string's lookup hash, or use the address as with
any other allocated object. If the original hash is sub-ideal, the address
may be a better lookup hash.

The objection might arise that there are not many random bits in a malloc'd
address. This is true: neither the high-order nor the low-order bits are
random. However, there are probably as many random bits as the log of the
number of elements in the largest table, because objects will be randomly
scattered around allocated memory, and allocated memory is larger the the
largest table. Consequently, dropping the last four bits or so of the
address and then using the rest as a hash value is likely to be pretty
good. (If I'm not mistaken, currently three bits are dropped, but four
would be better for malloc's that like to 16-byte align things.)

Rici

```