lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]

I don't think anyone responded to this:

> I was looking at the lua source in order to understand how it handles
> hashing, specifically collisions.  At the top of ltable.c there is a 
> explaining that the loading does not adversely effect performance. After
> digging around the code for a while, I'm still not sure I understand how 
> implements hash tables and handles collisions :-)  Can anybody help me 
> understand the implementation?

The best way to check is with benchmarking, of course, but I can give you
a couple of hints.

Basically, the hash value for a string is a function of the size of the
string and at most 32 of the bytes from the string. Other than the
size limitation, it is a pretty standard hash function and should do
just fine with strings created sequentially. However, if the strings
are 32 bytes long or more, you might run into problems. In particular,
a 32-byte string will be hashed only on every second character. The last
character, however, is always used, which ameliorates the problem for
sequentially selected strings.

The comments in ltable.c are pretty good, actually; it is an interesting
hash table implementation and works very well for small tables.

Anyway, the idea is that the hash chain starts at the position
indicated by the hash of the key; the rest of the nodes (if any)
with the same hash occupy positions in the table which don't (yet)
correspond to any key. When a new (key, value) pair is inserted,
it may be necessary to "evict" the node which is "borrowing"
the hash-position.

The whole scheme is quite fast and also quite clever; of course, 
like all hash tables, it depends on the distribution of hashes, but
I have never seen any problems with string hashes.

In short, you would have to work at it to do a better hash.

However, if you want truly fast access to values in a Lua table,
you are probably better off using consecutive integers starting at 1.
Since Lua 5, Lua tables have an array part as well as a hash part;
the array part is specifically used to store small integer keys. If
a key falls into the range of the array part, the access is direct.
The big advantage here is that it is not necessary to store keys
(or chain pointers) in the array part of the table, so there is
less bookkeeping and less storage used up. Also, these integer
keys do not collide with keys of any other data type. (Numeric
keys which do not fall within the capacity of the array part
are handled normally, so the whole mechanism is more or less
transparent; when the table is rehashed, the array part is
resized if necessary to guarantee at least 50% occupancy.)