• Subject: Re: A good hash algorithm
• From: Paul Hsieh <qed@...>
• Date: Tue, 02 Jul 2002 12:08:35 -0700

```Sorry for being a Johnny come lately here, but I have just read this.

I have used the Bob Jenkins hash function (cited in the link:
http://burtleburtle.net/bob/hash/doobs.html) in production code, and read at least
one paper by the author which gives a very detailed explanation of it (I lead to it
because a coworker discovered it was being used and endorsed in another commercial
product).  My assessment of this algorithm is that it is excellent (though not
perfect, as the author indirectly admits -- I don't have the time to go pursue the
mammoth effort that would be required to make it that extra 2% better.)

One must be clear about what it is though.  Here's my synopsis of it:

1) What is described is a hash *MIXING* function.  It does not describe what the
composition of the hashed object must be.  You just have to break down your object
into some collection of byte blocks, be it an address or some length limited string.

2) Its the fastest algorithm I have ever seen that produces this quality of output.
While it is possible to make faster hashing functions, this can only be achieved if
the size of the objects is smaller than 12 bytes, and range of the hash output is
very small (i.e., you're hashing to an 8 bit index or something) or if you are
willing to sacrifice quality (i.e., increase the odds that two inputs map to the
same index.)

Of course you don't have to use the code verbatim.  In fact for small objects, I
would recommend specialized versions that would just inline the main mixing function
and not bother with his 12 x 4 byte blocking mechanism.

3) Yes it does require exact bit length descriptions, just like other high quality
hashing algorithms like CRCs/LFSRs.

4) The algorithm has something the author calls an "avalanching" property, which
means that the algorithm will make best use of the input bits.  Actually he can only
guarantee this for the bottom 31 bits of the 32 bit output, however I have never
seen a practical application for an output of 32 bits for a hash key (i.e., you
would always mask down the output to something significantly less than 32 bits
anyway.)

Now a comparison with existing Lua code.  In lstring.c (Lua v4.x) the hashing
algorithm for strings appears to be:

for (; l>=step; l-=step)
h = h ^ ((h<<5)+(h>>2)+(unsigned char)*(s++));

Hmm ... this function has number "two character poles".  For example, let's say that
the table size is 256, and after some part of the string, the value of h is 1.  Then
inserting any even number of 'a' characters will map the hash value back to 1
(modulo 256.)  There are numerous other scenarios as well:  Let's say that the table
size is 1023, and after some part of the string, the value of h is 509.  Then
inserting any even number of 'd' characters will map the hash value back to 509.
The Bob Jenkins algorithm has no such weakness -- you would have to search for 12
character sequences to create such repeating poles.

So the code above is fast (for small strings) however it has questionable quality.

Using addresses with bits hacked off for hashing is interesting.  However, what it
means is that if you have lots of sequentially allocated objects (which typically
will be of the same size) you'll run into a kind of "all or nothing" risk when
comparing to another collection of sequentially allocated objects.  I.e., either
they totally miss each other (ideal) or they completely overlap (worst case
scenario).  So when hashes for a pair of non-string, non-value object, collections
are compared, say, you'll typically pay either no hash penalty or the maximum
possible penalty.

My recommendation is that the Lua people at least consider using the Bob Jenkins
algorithm.

> >>let the lua authors evaluate the well presented claims on the page.
>
> > Perhaps a kind soul has the time for this. If someone comes up with a better
> > hash algorithm than the one used in Lua and has established that it is better
> > in Lua, we'd be happy to consider adding it to the code, but please make sure
> > that it's not slower than the current one.
>
> I'm not volunteering to do that :)
>
> One of the interesting things about the Lua hash algorithm (aside from the
> fact that it doesn't depend on the 32-bitness of integers) is that it is
> time-bounded for long strings -- there is a maximum on the number of
> characters hashed. This may strike people like the author of the cited
> algorithm as inferior, but one needs to balance costs. The "best" hash
> algorithm is not necessarily the fastest in actual practice.
>
> Consider the case of a hash-table with 20 keys, each of which is a string
> of a million bytes. A "really good" hash algorithm would probably succeed
> in creating a unique hash for each string, but in O(1,000,000) time. The
> Lua hash algorithm might hash all of them onto the same value; the cost of
> that would be that the table lookup would be essentially linear, or O(20).
> In practice, scanning 20 table entries on each lookup is probably faster
> than computing 20 hashes of a million bytes, unless there are a lot of
> lookups.
>
> Here's my S/0.02 contribution to Lua hashes, though (and I haven't looked
> at v5 yet, so ignore me if you've done this):
>
> Once a string is interned into the string hash table, its address becomes a
> unique identifier. So there are two alternatives: use the hash as
> originally computed as the string's lookup hash, or use the address as with
> any other allocated object. If the original hash is sub-ideal, the address
> may be a better lookup hash.
>
> The objection might arise that there are not many random bits in a malloc'd
> address. This is true: neither the high-order nor the low-order bits are
> random. However, there are probably as many random bits as the log of the
> number of elements in the largest table, because objects will be randomly
> scattered around allocated memory, and allocated memory is larger the the
> largest table. Consequently, dropping the last four bits or so of the
> address and then using the rest as a hash value is likely to be pretty
> good. (If I'm not mistaken, currently three bits are dropped, but four
> would be better for malloc's that like to 16-byte align things.)

--
Paul Hsieh
qed@pobox.com

```