lua-l archive



On Sat, May 16, 2020 at 11:27 AM, Robert Burke <sharpobject@gmail.com> wrote:
On Fri, May 15, 2020 at 7:48 PM Philippe Verdy <verdyp@gmail.com> wrote:
>
> I spoke about null nodes, meaning nodes whose key part is not set (so the associated-value part is also not set; both will be null/nil pointers or values): if you don't like that term, I can speak of free nodes, unused nodes, or empty nodes.
Do you have a preferred scheme for cleaning up nodes whose key part is
set and whose value part is not set when performing insertions of new
keys?

> This is independent of the fact that these nodes (also called buckets) may also contain an internal chaining pointer (I think there is no need for them at all, and their usage in Lua is based on the broken assumption that it generates shorter chains, while in fact it creates much longer chains, and the worst cases are actually easier to reach).
>
> Using linear probes with a constant step (1 or a large prime) *guarantees* there is a limit on the worst case for long chains.
Well, it does this only probabilistically, which the current code does as well.

Have you had a look at Malte Skarupke's blog posts about this topic?
He created a hash table that uses linear probing and has an actual
upper bound on chain length, but he found that nobody would use it
because it replaced DOS attacks that degrade performance with DOS
attacks that cause your program to crash by running out of memory.

How can a program crash by running out of memory just because you perform linear probing instead of probing via a chain of "next" pointers stored in each node, which permanently requires MORE memory?

Later he created another hash table that is fairly similar to Lua's
implementation, and one of his claimed benefits of the new one over
the old one is that it saves memory because it has short chains at
very high load factors. Here's the post about the new thing, which
links to the old thing:
https://probablydance.com/2018/05/28/a-new-fast-hash-table-in-response-to-googles-new-fast-hash-table/

These results are very doubtful. The tests performed must have used a bogus implementation that is not comparable to real linear probing (independently of the constant linear increment used, which is always evaluated modulo the table size). I have consistently seen that using a large prime increment avoids many occurrences of long chains caused by collisions, notably when the collisions come from a very weak hashing function (and the hashing function used in Lua is VERY weak for string keys).

This produces symptoms that are now easy to see in Wikimedia Commons, whose Lua scripts require considerable memory and CPU time (notably for processing large volumes of JSON data from Wikidata requests, which are deserialized into Lua tables with many collisions on their string keys). This has created many problems: time or memory limits exhausted, red errors displayed on pages, false autocategorization, and very slow response times for each edit when previewing, or when visiting any page not yet rendered for a given user language (internally the translations require translated labels from Wikidata, with large volumes of JSON data loaded in memory, consuming a lot of space, and then very slow lookups for string keys, **even** when most of these string keys are short). The CPU time used is largely excessive; the hash tables are not really fast but worse than using a simple ipairs() full-table scan to locate matching keys.

Hash tables that store a next pointer in each node are not necessary at all: they have *worse* worst cases, they are very slow to scan for key lookups, and they use too much memory for ALL tables.

> And finally, Lua's hash tables are in fact even slower than using a linear global scan of the full table!
This claim seems like it should be fairly easy to prove. What
operation is slower than a linear scan of an array under what
circumstances?

Probing by a linear scan (whether the increment is 1 or a prime) is still faster than probing by following a chain of stored pointers. And it will always be faster if the chain lengths are guaranteed to be bounded, rather than following an arbitrarily long chain of pointers.

_______________________________________________
lua-l mailing list -- lua-l@lists.lua.org
To unsubscribe send an email to lua-l-leave@lists.lua.org