On Tue, May 12, 2020 at 8:52 PM Philippe Verdy <verdyp@gmail.com> wrote:
> Linear probing with a step of 1 only works when the hashing function is REALLY randomizing bits of the input; this is not true in general with the "fast hash" function used to hash numbers and strings. That's where the large prime value K solves the problem: a fast hashing function like Knuth's algorithm (multiply by some 17-bit prime and add the next value modulo 32 bits, or multiply by some 18-bit prime and add the next value modulo 64 bits) will work, as well as the basic hashing function for numbers, where you just combine groups of bits from their internal representation (if the bitsize of the number is larger than the bitsize of the hash value) with some fast "XOR" operation (though I recommend the multiply-and-add function to distribute their bit ranges more randomly).
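
For concreteness, here is a minimal sketch in C of the kind of
multiply-and-add hash described above; the prime and the function name
are illustrative choices of mine, not the actual code in Lua's
lstring.c:

    #include <stddef.h>
    #include <stdint.h>

    /* multiply by a 17-bit prime and add the next byte; wraps modulo 2^32 */
    static uint32_t mul_add_hash(const char *s, size_t len, uint32_t seed) {
      uint32_t h = seed;
      for (size_t i = 0; i < len; i++)
        h = h * 65599u + (uint8_t)s[i];
      return h;
    }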

> Linear probing with a step of 1 (possible in the two directions) doesn't work well in compact tables (with a fill factor larger than 50%). And I see no advantage in terms of the cache if the hash values are strongly randomized as you suggest: the nodes in the table will still be randomly accessed because of the hashing function; you can't predict at all in which node any key will be accessed for reading, writing or nulling a value given any reasonable sequence of keys, as the hash function will transform it into a randomized order of hash values throughout the table.

I mean that when you are performing the probing, it is nice for the
entries you read to be adjacent in memory rather than scattered around
the table. You do raise a legitimate issue about the hash functions
Lua actually uses, though.
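
To illustrate the locality point, here is a rough sketch of a step-1
probe over an open-addressed node array; the Node layout and names are
made up for illustration and are not Lua's actual structures:

    #include <stdint.h>

    typedef struct {
      int in_use;        /* 0 = slot never used */
      uint64_t key;
      int value;
    } Node;

    /* Step-1 probe: successive slots are adjacent in memory, so after
       the first access the following reads tend to stay within cache
       lines already loaded.  'size' is assumed to be a power of two. */
    static Node *probe(Node *nodes, uint32_t size, uint64_t key, uint32_t hash) {
      uint32_t i = hash & (size - 1);
      while (nodes[i].in_use) {
        if (nodes[i].key == key)
          return &nodes[i];
        i = (i + 1) & (size - 1);
      }
      return &nodes[i];   /* first free slot; an insert could go here */
    }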

Can you explain why some steps are bad for compact tables and others
are not? Does this also depend on assuming that our hash function is
not successfully randomizing the slots of the values? If it is doing a
good job, it seems like it should not matter what order we traverse
the entries in.
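
(For what it's worth, with a power-of-two table any odd step, whether 1
or a large prime, visits every slot exactly once before repeating, so
the two choices differ only in traversal order, not in coverage. A
quick self-contained check:)

    #include <stdio.h>

    int main(void) {
      unsigned size = 16, step = 7;   /* arbitrary power-of-two size and odd step */
      unsigned start = 3, i = start, visited = 0;
      do {
        visited++;
        i = (i + step) & (size - 1);
      } while (i != start);
      printf("step %u visits %u of %u slots\n", step, visited, size);  /* 16 of 16 */
      return 0;
    }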

> In all cases (even with K=1 for the linear probing), we never need the chaining pointers in each node. Lua's current implementation of hash tables using them was a bad design from the beginning, based on false assumptions and false promises of acceleration to avoid the linear probing (which is false each time the hash table is more than 50% full and starts being updated constantly, as multiple chains will merge definitively until the full table is rehashed; Lua offers no way to prefill a table very fast without creating these long chains, which have no limit at all and could cover most nodes; with a linear probe with K=1 or K=large prime, you have a guarantee that modifications will never create overlong chains, purely because of the constraint of the maximum fill factor; and with K=large prime you will preserve or even improve the initial fast-hashing function, especially if K is a prime different from the prime used for Knuth's fast hashing function: Knuth's fast hashing function most often uses a multiplication by a 17-bit prime to return a 32-bit or 64-bit value, possibly weak, while K for the hash table linear lookup will be a 24- to 31-bit integer; with the effect of these two primes combined, you avoid most collisions in practical cases and get a guarantee that chains will be at most a few nodes long in the worst case, depending only on the maximum fill factor: this is false for Lua's current broken approach with its statically chained pointers).

If you simply create a table and add a bunch of entries to it, Lua's
implementation guarantees that your chains will be at most as long as
the chains you would get with linear probing, except that a
linear-probing table may choose to grow at a different time than Lua's
implementation does. Linear probing "merges" chains any time elements
are inserted adjacent to each other (or adjacent in the permutation
induced by your step size). Lua's implementation only merges chains
when you delete an element k1 with h(k1) ~= h(k2) that happens to
occupy slot h(k2), and then subsequently insert k2. Any insertion
other than an insertion of a key into its empty main slot moves the
table one step closer to a rehash, so it seems quite difficult to get
a chain containing most nodes, and of course one can only do this
with a workload of mixed deletes and inserts. If you have a workload
in mind, we could work out the distribution of chain lengths for that
workload under various schemes. I think this would be quite
informative.
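
For reference, here is a simplified, self-contained model of the
insertion strategy I am describing: integer keys only, fixed size, no
rehash, and hypothetical names throughout. The real logic lives in
luaH_newkey in ltable.c, which also handles values, dead keys, and
growing the table.

    #include <stdint.h>

    #define SIZE 8                 /* power of two; this model never rehashes */

    typedef struct {
      int used;                    /* 0 = free slot */
      uint32_t key;
      int next;                    /* index of next node in the chain, -1 = end */
    } Node;

    static Node nodes[SIZE];

    static int mainpos(uint32_t key) { return (int)(key & (SIZE - 1)); }

    static int getfreepos(void) {  /* Lua scans for free slots from the top down */
      for (int i = SIZE - 1; i >= 0; i--)
        if (!nodes[i].used) return i;
      return -1;                   /* table full: the real code rehashes here */
    }

    /* Insert 'key', assumed not already present (the same precondition
       as luaH_newkey).  Returns the slot the key ends up in, or -1. */
    static int insert(uint32_t key) {
      int mp = mainpos(key);
      if (nodes[mp].used) {
        int f = getfreepos();
        if (f < 0) return -1;
        int other = mainpos(nodes[mp].key);
        if (other != mp) {
          /* the occupant of mp is not in its own main position: evict it
             to the free slot, repair the link that pointed at mp, and
             give mp to the new key, which starts a fresh chain */
          int prev = other;
          while (nodes[prev].next != mp) prev = nodes[prev].next;
          nodes[prev].next = f;
          nodes[f] = nodes[mp];    /* its .next (an absolute index) moves with it */
          nodes[mp].next = -1;
        } else {
          /* the occupant owns mp: the new key goes to the free slot and
             is linked right after mp in mp's chain */
          nodes[f].next = nodes[mp].next;
          nodes[mp].next = f;
          mp = f;
        }
      } else {
        nodes[mp].next = -1;       /* empty main position starts a new chain */
      }
      nodes[mp].used = 1;
      nodes[mp].key = key;
      return mp;
    }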

Because of the built-in next function, it seems one would not be
permitted to un-merge chains on deletion when using linear probing.
That is, when the user runs t[k1] = nil, k1 must be left where it is,
and our probing code must treat k1's entry as nonempty for the purpose
of reads. So to prevent every node from merging into the same chain
under a workload of balanced deletes and insertions, it seems one
would need to periodically traverse the entire array of entries when
performing an insertion, removing entries whose value is empty but
whose key is not, just as Lua currently does. Is there some way to
avoid this while preserving the contract of next?
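
Concretely, the best one seems able to do under open addressing is
something like the following sketch (names invented for illustration):
deletion marks the slot dead rather than free, the key stays put so
that next can resume from it, probes step over dead slots, and only a
later full rehash reclaims them.

    #include <stdint.h>

    typedef enum { FREE, LIVE, DEAD } SlotState;  /* DEAD: key kept, value removed */

    typedef struct {
      SlotState state;
      uint64_t key;
      int value;
    } Slot;

    /* t[key] = nil under open addressing: mark the slot dead instead of
       freeing it, so next() can still locate the key and later probes do
       not stop short of entries further along the chain.  'size' is a
       power of two. */
    static void erase(Slot *slots, uint32_t size, uint64_t key, uint32_t hash) {
      uint32_t i = hash & (size - 1);
      while (slots[i].state != FREE) {        /* dead slots must not stop the probe */
        if (slots[i].key == key) {
          if (slots[i].state == LIVE)
            slots[i].state = DEAD;            /* key stays; the value is gone */
          return;
        }
        i = (i + 1) & (size - 1);
      }
    }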

A much smaller concern is that Lua does not currently track the number
of nonempty entries in the table at all, so it would be necessary to
begin doing that. That seems likely to be fairly cheap.

I would sort of like to implement the proposals from the previous
discussion, so that they can be compared.

Aside: Apologies for quoting Philippe's entire email earlier.