

Note also that the article speaks about two distinct things:
- weak hashing functions: if the hashing function provides unique keys but the hash values are only partly used, by truncating unnecessary bits, the hash slots will create many collisions. You need a better hashing function than just packing parts of the object side by side in the same order: you need to combine these parts by adding each new part and multiplying by a prime (the simplest approach that solves the problem in most cases, without necessarily using a strong hashing function, i.e. a secure digest like SHA-1 or even MD5); see the sketch after this list.
- using chained pointers: the chains created are *partial* pointers which use only 1 byte per node to skip a randomized number of nodes, but with a limited range: probing still occurs, but it is divided by a factor of 256 in the worst case (still, there's no upper bound on the length of chains when they are merged together like in Lua; in that case it just accelerates the lookup if chains are not too long, but long chains will still occur, with unbounded lengths, in large tables).
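
As an illustration of the first point, here is a minimal sketch (in C) of the "add the new part, multiply by a prime" combination scheme; the seed 5381 and the multiplier 31 are common illustrative choices, not values mandated by anything above:

#include <stddef.h>
#include <stdint.h>

/* Combine every part of the key into the hash, instead of packing parts
   side by side and truncating bits: each byte influences all later state. */
static uint32_t hash_bytes(const unsigned char *data, size_t len)
{
    uint32_t h = 5381;             /* arbitrary non-zero seed */
    for (size_t i = 0; i < len; i++)
        h = h * 31 + data[i];      /* multiply by a prime, add the new part */
    return h;
}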

But another approach, using a constant large prime for linear probing, will behave better: it won't need any extra storage and will never have any unbounded length for probing chains. The longest chains are guaranteed to be limited (provided the hashing function is not very weak), so linear probing with a constant prime increment (even if it seems slower than linear probing by 1 for individual probes) will globally offer a faster lookup in all cases. In addition, the large prime increment offers a protection against weak hashing functions. And the table size is not limited to a power of 2: the fill level can be set arbitrarily (the article spoke about increasing the fill level up to nearly 98%; I suggested 85% to keep linear probing under 8 probes in the worst cases; with 98% it could grow to about 20 probes for the worst cases of collisions, but there's still a bounded limit on the number of probes when the hashing function is not weak, whereas with chained hashing this is still much worse, because the increments for probing are too limited in the approach used by the author of this paper).
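
A minimal sketch of such a lookup, assuming illustrative names (Node, PRIME_STEP, in_use) and a table size already chosen so that gcd(size, PRIME_STEP) == 1, which makes the probe sequence visit every slot:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PRIME_STEP 2654435761u  /* one well-known large prime; any proven prime works */

typedef struct { const char *key; int value; int in_use; } Node;

static Node *lookup(Node *table, size_t size, const char *key, uint32_t hash)
{
    size_t step = PRIME_STEP % size;   /* same GCD as PRIME_STEP, avoids overflow */
    size_t i = hash % size;
    while (table[i].in_use) {          /* the maximum fill level guarantees a free slot */
        if (strcmp(table[i].key, key) == 0)
            return &table[i];          /* key found */
        i = (i + step) % size;         /* probe by the constant prime increment */
    }
    return NULL;                       /* reached an empty slot: key is absent */
}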

The problem with Lua's approach is that it has ALL the disadvantages; it is in fact absolutely not a hashed table, and it behaves in fact worse than if we just used a full table scan for looking up keys stored in random order (and without even having to hash them if keys are short strings... like what Lua requires by setting a limit to 12 bytes for string keys, after which the hashing function is very weak and most keys will have to be fully compared, ignoring their needless weak hash): a full table scan would be even more compact in memory, as there would also be no need to store chaining pointers, only the key/value pairs in each node.

The four tuning parameters for hash tables (that I described previously) do not take a lot of space: 4 bytes are enough per table, and nothing is needed per node:
* the minimum and maximum fill levels can be represented as a fraction of 1 with 8 bits of precision (this allows setting them with an accuracy of less than 0.5%, more than needed, and their values can cover fill rates from 0% inclusive to 100% exclusive; a 100% fill rate is never needed).
* the growth (or reduction) factor is a rate between 100% (excluded) and 200% (included), also represented as an integer in a single byte. Here also, 8 bits of precision for the growth factor is enough; the reduction factor is the inverse factor, though you may want to store a separate reduction factor as an additional tuning parameter.
* the minimum table size (under which the table size will never be reduced, and which also defines the rounding up for table growth) is reasonably well tuned with a value up to 256; but if you want more, you can instead store the log2 of the minimum table size (0 will represent the minimum size 1, i.e. no rounding up needed for table growth).
These tuning parameters can be changed at any time if needed, to recompact the table or to perform many modifications in it (such as loading a table or merging/differentiating two tables) without dynamic resizing and rehashing occurring in the middle.
* You may want to store two additional integers: the precomputed (and adjusted) minimum and maximum numbers of used nodes for the current table size (they are computed from the minimum and maximum fill levels and the table size, but adjusted according to the GCD rule I already explained, and they are recomputed only when the table size grows or shrinks).
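
As a sketch, those parameters could be packed like this (field names and exact encodings are illustrative, not a fixed proposal):

#include <stdint.h>

typedef struct {
    uint8_t  min_fill;       /* minimum fill level as a fraction: min_fill / 256 */
    uint8_t  max_fill;       /* maximum fill level as a fraction: max_fill / 256 */
    uint8_t  growth;         /* growth factor encoded as (growth + 257) / 256: covers (100%, 200%] */
    uint8_t  log2_min_size;  /* log2 of the minimum table size (0 means size 1) */
    /* optional precomputed bounds, recomputed only when the table is resized: */
    uint32_t min_used;       /* shrink the table when used nodes fall below this */
    uint32_t max_used;       /* grow the table when used nodes rise above this */
} TableTuning;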

When computing a new size with the growth factor, the size must just match a simple condition relative to the large prime constant used for linear probing (it's just a test on the GCD, which is fast to compute on reasonably small integers like table sizes and the chosen "large prime" constant); otherwise retry with a new size (such a search loop is guaranteed to find a solution in very few iterations, and this search for a suitable size will not occur often, only when a table needs to be resized). This done, the minimum or maximum fill levels (or the precomputed minimum and maximum numbers of used nodes) may, rarely, need to be adjusted a bit to take into account the small extra space allocated.
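
A sketch of that search loop, reusing the illustrative PRIME_STEP and TableTuning from the previous snippets:

static size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) { size_t t = a % b; a = b; b = t; }
    return a;
}

static size_t next_table_size(size_t old_size, const TableTuning *t)
{
    /* apply the growth factor (growth + 257) / 256, rounding the product up */
    size_t size = (old_size * (257u + t->growth) + 255) / 256;
    size_t min_size = (size_t)1 << t->log2_min_size;
    if (size < min_size)
        size = min_size;
    /* retry until the size is coprime with the probing increment; since
       PRIME_STEP is prime, this succeeds in very few iterations */
    while (gcd_size(size, PRIME_STEP) != 1)
        size++;
    return size;
}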

The large prime constant (used for linear probing during key lookups, when indexing the table to read or write values, possibly adding a new key associated with a non-nil value, or removing a key associated with a nil value) does not need to be adjusted; it is simple to choose (according to the maximum table size you want to support) from well-known lists of primes published on the web (their primality is already proven, you just need one such constant).
