Re: thought on Lua's hash-table invariants

Le sam. 25 avr. 2020 à 19:24, Philippe Verdy <verdyp@gmail.com> a écrit :

Note : when choosing the constant prime K used for advancing to the next candidate positions in collision chains, that constant K MUST be larger than half the maximum hash table size.
- For a maximum allowed maximum hash table size with up to 2^31 nodes (where your node index numbers will be 32-bit integers), just choose a prime that is between 2^30 and 2^31.
- For a maximum allowed maximum hash table size with up to 2^63 nodes (where your node index numbers will be 64-bit integers) , just choose a prime that is between 2^62 and 2^63.
It's trivial to find appropriate large constant prime numbers that fit this goal and will then work for ALL the possible hash table sizes (look at https://primes.utm.edu/largest.html or similar lists, or use a simple prime sieve program to compute one), then use that value as a static constant in the implementation of the hash table lookup.

Le sam. 25 avr. 2020 à 19:07, Philippe Verdy <verdyp@gmail.com> a écrit :

Le sam. 25 avr. 2020 à 15:26, Norman Ramsey <nr@cs.tufts.edu> a écrit :
> > If anyone on the Lua team cares to comment, I would be delighted!
>
> My pleasure! :-)
>
> Before we go on, note that, although the main algorithm has remained the
> same during the years, several details of the kind you are discussing
> have changed across versions. I will use the current version (5.4 :-)
> as the reference here. If you used another version for your analysis
> (2.5?), let me know.

I was studying version 5.3.4. Tolerably up to date!

Thanks for your corrections! I need to go study some more.
Here is the key thing I got wrong:

> > - When a key is removed, if it occupies its main spot, then the node
> > in the spot pointed to by the `next` pointer, if any, is migrated
> > into the main spot, and the key in the `next` spot is set to nil.

And as Philippe Verdy kindly points out, the issue with this plan is
that removal from the linked list can't be done in constant time.

(I am wondering about making the linked list circular, so an element
could be removed in time proportional to the length of a collision
chain, but I'll have to postpone idea that until I can set aside some
time to study it.)

You're right: I suggested using a double-linked list, but using only a circular list is enough to solve that problem without ading a "previous" link (this requires however a change in the can that lookup a key so that it will record the first node number visited and won't continue infinitely visiting the chain to the next node if the next node was the first visited). This is the easiest fix that can be done now.

But there are also other strategies: do not store ANY next pointer, just advance arbitrarily to a computed next node, by adding a constant prime number K, and then round it up modulo the hash table size: either you will necessarily reach a free node (that can be used as the target for storing a new added key), or you'll cycle back to the first node afte having visited ALL nodes in the hashtable (this can only happen if the hash table is completely full, and this can be detected even before performing the full cycle).

In that case you don't even need to store any "next" node pointer in each node, so the result is a hash table that uses less memory.

And you can avoid visiting the full table with a very long chain, only by ensuring that the hash table is not filled at more than 80% of its size. Otherwise you can realloc a new hash table size (which is not necessarily a power of two, you can grow its size by a factor of 150%, and then the fill factor will drop from 80% to 60%. In such event, all keys in the hash table have to be recomputed to round up their value module the new hash table size and relocate the items and that's why reallocating hash tables should not occur too often:

You must to do that when the current fill factor F is somewhere between 50% and 100% and the hash size growth factor G must be such that F times G remains below 75%.

As well you can decide to resize a large hash table to a lower size (when removing node) only if the fill factor falls below 25% (but not just below 50%), but in such event the reduction rate of the size should be 50%. In that case you don't even need to rehash the remaining keys, as you can simply divide the hash table size by 2, and only keys that were stored at odd positions have to be relocated in a chain (use the same constant prime number K modulo the new table size, to locate the "next" candidate location in the chain.

With a maximum fill factor of 80% (meaning that there are at at least 1 on 5 free nodes), and this strategy, you are SURE that the collision chains (formed by adding the constant prime K) will not be overlong, even for the worst case (for the worst case, your collision chain will have visited only these 80% nodes, that are exactly spread by filling ONLY the nodes that are separated exactly by this prime number K) before falling onto a free node.

You can tune up the minimum and maximum fill factors (above I indicated 25% and 80%), as well as the size growth and reduction factors (I suggested 150% and 50%, and when computing a new hash table size you can also round up the effective size to some minimum multiple such as keeping the final table size as an exact multiple of 16 but when doing that rounding up it is possible for the fill factor to fall a bit below the minimum fill factor)