Re: worst case time using hash-tables in Lua

As I replied in another message, it is absolutely NOT necessary to have ANY chaining pointer per node: collision lists can just add (resp. substrict) a constant prime to the current node index (and take the result module the hash table size), to find the next (resp.) previous node in the chain, until you find the node with the correct hash or a null node.

Note only this allows forward and backward navigation in the list, but it solves the problem: there will still be some chains colliding and sharing some node positions: that's why the constant needed to advance to the next or previous position must be a prime.

This strategy works very well provided that hash tables are not too much full (but with a maximum 85% fill level, and a correct hashing function, you warranty that the worst case will be rare and even if it occurs, you can predict the maximum length of the chain: with 85% all chains will be about 8 nodes at most.

And you also remove completely the need for the hash table to have a size set to a power of 2 (this is a problem for large hash tables that grow too much if you need to double its size once you reach the 85% fill level: after filling 85% of a table with 1 million nodes, allocating a node for a single new value would require doubling it to 2 millions nodes, i.e. leaving 1 millions 250 thousands free nodes from where you would take only 1. Growing a table not by a factor 2 but a smaller factor of 1.25 would still keep enough free nodes for lot of additional filled nodes.

You could as well preallocate the table size even if it's initially at much lower size than the minimum fill factor (which should for log term usage be larger than 50% but must still be still lower than or equal to the maximum fill level) to avoid the reallocation for growth: just set the minimum fill factor to 0% so that the hash table will not be reduced at any time and reallocated later. Once you've done that, you can set the final optimal maximum fill factor to 85% and the minimum fill factor to 75% for a table that will contain a relatively stable number filled nodes. And eah time if you need to make lot of adjustments in a table (for example for a fusion or suppression from another set, you can temporarily set the minimum fill factor to 0% (to avoid deallocations) and allow a bit larger fill growth factor (e.g. 2 instead of 1.25) to minimize the number or allocations/unallocations and rehashing operations, then restore the optimum fill factors and growth rate to get a compact table which will be rehashed only once (note that some objects types are not costly to hash: strings alreay have their own precomputed hash in a field, and the integer hash of arbitrary numbers is just a few arithmetic operations: adjusting these hashes to the table size is a simple modulo).

The only limit of this approach is that the constant prime K you use for advancing in the table of size S must be set so that (K mod S) and S have NO common divisors, i.e. GCD(S, K mod S)=1: when S is a power of 2, this is easy to satisfy as any large prime K will fit: the GCD (greatest common divisor) of S=2^n is necessarily a power of 2 and would have to be a divisor of (K mod 2^n), so (K mod 2^n) would need to be a power of 2 to break this condition, and this is impossible for any large prime value K > 2.

As a consequence: when your size table size S is not a power of two, but grows by a variable factor, the growth is an estimate: let's suppose your table has initial size S0 and is too much full for a new node, you want to compute a new size S = S0*G (where G is the growth factor) but need to check that GCD(S0*G, K mod S*G)=1: if this is not the case, you can still retry with S=S0*G+1, S=S0*G+2, or more (you can set S so that it is a multiple of 16 by rounding up this increment, assuming that the minimum size S for any table is 16 and table sizes can only be 16 nodes, or 32, 48, and so on.): once you compute the proper GCD that satisfies this goal, make sure you *reduce* (not keep and never increase) the minimum fill factor as needed, because it may no longer be satisfied after this sudden growth of table size and you don't want to be forced to unallocate instantly the extra space.

The trick here is just to find (only when the maximum fill factor threshold is reached) the next table size S given a specified growth factor G: this requires a small loop to compute GCD: there are good algorithms for computing GCD(S0,S1) values fast for several candidate values of S1=(S0*G - (S0*G mod16) + 16*i) where 16 is the size rounding pr the minimum table size and i is incremented from 1 at each loop), it is trivial when one of the two values S0 and S1 (or even both!) is a power of 2, but its complexity is small, this search is still fast but you won't perform search loop very often if you have a decent growth factor G>=1.25 (i.e. the table size will grow by at least +25%).

So a good hash table has four basic tuning parameters:

- the minimum size (even if it is empty) : 16 is a good value

- the minimum fill level (always satisfied except when the table size its at its minimum): 50% is a good value, you can increase it to 75% for maximum compaction.

- the maximum fill level: 85% is a good value (collision chains will be about 8 nodes)

- the size growth: 1.25 si a good choice for adding at least 25% supplementary free nodes (remember that for the smallest size, the table will actually grow of +100%, i.e. the effective growth factor will be 2, but it will be lower if the table size is larger).

These parameters can be changed during the lifetime of the hash table (when you need to perform many modifications such as: prefilling, merging two tables used as sets, or subtracting two tables used as sets) and restored instantly at end of the operations if you want to compact the hash table only at that time.

With all this, you can now remove all previous/next pointers in nodes of your hash tables (you save memory as well: the hash table only contains the pointer of the key and of its associative value for pairs). You never need to store hashes for keys in the table if the keys are strings or numbers whose hash is implicit or precomputed (or cached for mutable strings), but if you keys can be objects of arbitrary complex types, these types should have a "hash" field that can be used to avoid recomputing their hash values or some method to evaluate the hash of their own intertnal components that define their identity.