Linear probing with a step of 1 only works when the hashing function REALLY randomizes the bits of the input; this is not true in general for the "fast hash" functions used to hash numbers and strings. That's where a large prime step K solves the problem: a fast hashing function like Knuth's (multiply by some 17-bit prime and add the next value, modulo 32 bits; or multiply by some 18-bit prime and add the next value, modulo 64 bits) will work, as will the basic hashing function for numbers, where you just combine groups of bits from their internal representation (if the bit size of the number is larger than the bit size of the hash value) with a fast XOR (though I recommend the multiply-and-add function to distribute the bit ranges more randomly).
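For illustration, here is a minimal sketch in C of such a multiply-and-add hash (the 17-bit prime 65599 is just a common example choice, not necessarily the constant Lua uses):

    #include <stdint.h>
    #include <stddef.h>

    /* Knuth-style fast hash: multiply by a 17-bit prime and add the
       next byte; unsigned overflow gives the modulo 2^32 for free. */
    static uint32_t mul_add_hash(const char *s, size_t len) {
        uint32_t h = 0;
        for (size_t i = 0; i < len; i++)
            h = h * 65599u + (uint8_t)s[i];
        return h;
    }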

Linear probing with a step of 1 (possible in both directions) doesn't work well in compact tables (with a fill factor larger than 50%). And I see no advantage in terms of cache if the hash values are strongly randomized as you suggest: the nodes in the table will still be accessed randomly because of the hashing function. You can't predict at all in which node any key will be accessed for reading, writing or clearing a value, given any reasonable sequence of keys, because the hash function transforms that sequence into a randomized order of hash values throughout the table.

But of course with a linear probe of K=1, the loop to search the GCD when growing the table is trivial: this GCD is 1 in all cases, independently of the table size and of its growth factor, so any table size will fit (and you just have to manage the minimum and maximum fill factors): the table size is still not limited to a power of 2.

If you want a table with a low level of collisions (i.e. most collision chains having length 0 or 1, for very fast lookup), just choose a maximum fill factor below 50% (this is true independently of the choice of K=1 or a large prime). If you want to avoid deallocations when reducing the table size (and later reallocations for larger sizes), just set the minimum fill factor to 0%.

In all cases (even with K=1 for the linear probing), we never need the chaining pointers in each node. Lua's current implementation of hash tables using them was a bad design from the beginning, based on false assumptions and false promises of acceleration from avoiding linear probing: those promises fail whenever the hash table is more than 50% full and starts being updated constantly, as multiple chains merge permanently until the full table is rehashed. Lua offers no way to prefill a table very fast without creating these long chains, which have no length limit at all and could cover most nodes. With a linear probe of K=1 or K=large prime, you have a guarantee that modifications will never create overlong chains, purely because of the constraint of the maximum fill factor. And with K=large prime you will preserve or even improve the initial fast hashing function, especially if K is a prime different from the prime used by Knuth's fast hash: Knuth's function most often multiplies by a 17-bit prime to return a 32-bit or 64-bit value, possibly weak, while the K for the table's linear lookup will be a 24- to 31-bit integer. With the effect of these two primes combined, you avoid most collisions in practical cases, and you get a guarantee that chains will be at most a few nodes long in the worst case, depending only on the maximum fill factor. None of this is true of Lua's current broken approach with its statically chained pointers.
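As an illustration only, here is a minimal sketch in C of a chain-free lookup with a fixed prime step K (hypothetical types and names, not Lua's actual ltable.c; it assumes string keys and that gcd(S, K mod S) = 1 so the probe sequence visits every slot):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        const char *key;   /* NULL marks an empty slot */
        int value;
    } Node;

    /* Probe forward with step K until the key or an empty slot is found:
       no per-node chaining pointers are needed. */
    static Node *probe_lookup(Node *nodes, size_t S, size_t K,
                              const char *key, uint32_t hash) {
        size_t i = hash % S;
        while (nodes[i].key != NULL) {
            if (strcmp(nodes[i].key, key) == 0)
                return &nodes[i];      /* found the key */
            i = (i + K) % S;           /* next node in the collision chain */
        }
        return NULL;                   /* empty slot reached: key is absent */
    }

(A real implementation would also need a deletion strategy such as tombstones, since clearing a slot in the middle of a chain would otherwise break later lookups.)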

On Tue, May 12, 2020 at 12:16, Robert Burke <sharpobject@gmail.com> wrote:
> As I replied in another message, it is absolutely NOT necessary to have ANY chaining pointer per node: collision lists can just add (resp. subtract) a constant prime to the current node index (and take the result modulo the hash table size) to find the next (resp. previous) node in the chain, until you find the node with the correct hash or a null node.

If we are assuming a good hash function, using linear probing with a
prime step is equivalent to using linear probing with a step of 1,
but using a step of 1 is somewhat nicer to one's cache.

On Tue, May 12, 2020 at 6:58 PM Philippe Verdy <verdyp@gmail.com> wrote:
>
> As I replied in another message, it is absolutely NOT necessary to have ANY chaining pointer per node: collision lists can just add (resp. subtract) a constant prime to the current node index (and take the result modulo the hash table size) to find the next (resp. previous) node in the chain, until you find the node with the correct hash or a null node.
>
> Not only does this allow forward and backward navigation in the list, it also solves the problem: there will still be some chains colliding and sharing some node positions, and that's why the constant used to advance to the next or previous position must be a prime.
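> For illustration only, the two directions are simple modular steps (a sketch; i is the current slot index, S the table size, K the prime step):
>
>     #include <stddef.h>
>
>     /* Forward and backward steps of the probe sequence. */
>     static size_t next_slot(size_t i, size_t K, size_t S) {
>         return (i + K) % S;
>     }
>     static size_t prev_slot(size_t i, size_t K, size_t S) {
>         return (i + S - (K % S)) % S;   /* subtract K mod S, staying non-negative */
>     }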
>
> This strategy works very well provided that hash tables are not too full: with a maximum 85% fill level and a correct hashing function, you guarantee that the worst case will be rare, and even if it occurs you can predict the maximum length of the chain (at 85%, all chains will be about 8 nodes at most).
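> (For reference: under the standard uniform-hashing model, an unsuccessful lookup in an open-addressing table with load factor a takes about 1/(1-a) probes on average, so a = 0.85 gives about 1/0.15, i.e. roughly 6.7 probes, consistent with the figure of about 8 nodes above.)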
>
> And you also completely remove the need for the hash table size to be a power of 2. That requirement is a problem for large hash tables that grow too much when you must double the size each time you reach the 85% fill level: after filling 85% of a table with 1 million nodes, allocating a node for a single new value would require doubling it to 2 million nodes, i.e. leaving about 1.15 million free nodes of which you would take only 1. Growing the table not by a factor of 2 but by a smaller factor of 1.25 would still keep enough free nodes for a lot of additional filled nodes.
>
> You could as well preallocate the table size even if it is initially well below the minimum fill factor (which for long-term usage should be larger than 50%, but must still be lower than or equal to the maximum fill level) to avoid reallocation on growth: just set the minimum fill factor to 0% so that the hash table will never be reduced and then reallocated later. Once you've done that, you can set the final optimal maximum fill factor to 85% and the minimum fill factor to 75% for a table whose number of filled nodes is relatively stable. And each time you need to make a lot of adjustments in a table (for example a fusion with or suppression of another set), you can temporarily set the minimum fill factor to 0% (to avoid deallocations) and allow a larger growth factor (e.g. 2 instead of 1.25) to minimize the number of allocations/deallocations and rehashing operations, then restore the optimal fill factors and growth rate to get a compact table which will be rehashed only once (note that some object types are not costly to hash: strings already have their own precomputed hash in a field, and the integer hash of arbitrary numbers is just a few arithmetic operations; adjusting these hashes to the table size is a simple modulo).
>
> The only limit of this approach is that the constant prime K you use for advancing in a table of size S must be such that (K mod S) and S have NO common divisor, i.e. GCD(S, K mod S) = 1. When S is a power of 2 this is easy to satisfy, as any large prime K will fit: any common divisor of S = 2^n greater than 1 is necessarily a power of 2, hence even, and it would have to divide (K mod 2^n); but (K mod 2^n) is odd for any odd prime K (and every prime K > 2 is odd), so the GCD is always 1.
>
> As a consequence, when your table size S is not a power of two but grows by a variable factor, the growth is an estimate: suppose your table has initial size S0 and is too full for a new node; you want to compute a new size S = S0*G (where G is the growth factor), but you need to check that GCD(S, K mod S) = 1. If this is not the case, you can retry with S = S0*G + 1, S = S0*G + 2, or more (you can make S a multiple of 16 by rounding up this increment, assuming that the minimum size for any table is 16 and table sizes can only be 16 nodes, or 32, 48, and so on). Once you find a size that satisfies this GCD condition, make sure you *reduce* (not keep, and never increase) the minimum fill factor as needed, because it may no longer be satisfied after this sudden growth of the table size, and you don't want to be forced to instantly deallocate the extra space.
>
> The trick here is just to find (only when the maximum fill factor threshold is reached) the next table size S given a specified growth factor G. This requires a small loop computing GCDs: there are good algorithms for computing GCD(S0, S1) quickly for successive candidate values S1 = (S0*G - (S0*G mod 16) + 16*i), where 16 is the size rounding and the minimum table size, and i is incremented from 1 at each loop iteration. It is trivial when one of the two values (or even both!) is a power of 2, but the complexity is small in any case; the search is still fast, and you won't run this search loop very often if you have a decent growth factor G >= 1.25 (i.e. the table size grows by at least +25%).
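> For illustration, a minimal sketch in C of such a search (hypothetical names, not actual Lua code; it assumes a minimum size of 16 and rounding to multiples of 16):
>
>     #include <stddef.h>
>
>     /* Euclid's algorithm. */
>     static size_t gcd(size_t a, size_t b) {
>         while (b != 0) { size_t t = a % b; a = b; b = t; }
>         return a;
>     }
>
>     /* Next table size S >= S0*G, rounded up to a multiple of 16,
>        with gcd(S, K mod S) == 1 so the prime step K visits all slots. */
>     static size_t next_table_size(size_t S0, double G, size_t K) {
>         size_t S = (size_t)((double)S0 * G);
>         S = (S + 15) & ~(size_t)15;    /* round up to a multiple of 16 */
>         if (S < 16) S = 16;            /* assumed minimum table size */
>         while (gcd(S, K % S) != 1)
>             S += 16;                   /* try the next candidate size */
>         return S;
>     }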
>
> So a good hash table has four basic tuning parameters (grouped in the sketch after this list):
> - the minimum size (even if the table is empty): 16 is a good value;
> - the minimum fill level (always satisfied except when the table size is at its minimum): 50% is a good value; you can increase it to 75% for maximum compaction;
> - the maximum fill level: 85% is a good value (collision chains will be about 8 nodes at most);
> - the size growth factor: 1.25 is a good choice, adding at least 25% supplementary free nodes (remember that at the smallest size the table will actually grow by +100%, i.e. the effective growth factor will be 2, but it will be lower when the table size is larger).
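> As a minimal illustration (hypothetical naming, not Lua's API), these four parameters could be grouped as:
>
>     #include <stddef.h>
>
>     typedef struct {
>         size_t min_size;   /* e.g. 16 */
>         double min_fill;   /* e.g. 0.50 (0.75 for compaction; 0 disables shrinking) */
>         double max_fill;   /* e.g. 0.85 */
>         double growth;     /* e.g. 1.25 */
>     } TableTuning;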
>
> These parameters can be changed during the lifetime of the hash table (when you need to perform many modifications, such as prefilling, merging two tables used as sets, or subtracting one table used as a set from another) and restored instantly at the end of those operations if you want to compact the hash table only at that time.
>
> With all this, you can now remove all previous/next pointers in the nodes of your hash tables (you save memory as well: the hash table only contains the pointer to the key and to its associated value for each pair). You never need to store hashes for keys in the table if the keys are strings or numbers whose hash is implicit or precomputed (or cached for mutable strings); but if your keys can be objects of arbitrarily complex types, those types should have a "hash" field that can be used to avoid recomputing their hash values, or some method to evaluate the hash of the internal components that define their identity.
_______________________________________________
lua-l mailing list -- lua-l@lists.lua.org
To unsubscribe send an email to lua-l-leave@lists.lua.org