lua-users home
lua-l archive



This occurs because "main positions", once they have been used by any key (even one later removed by assigning nil to it), remain permanently inserted in a chain (whether they are still in use or not), and once two chains collide, they stay merged. This can create a graph where multiple chains with no common predecessor point to the same target node and end up sharing a common subchain.
Tables currently have no backward links that would allow safely removing nodes and isolating each chain from the other chains starting at the same "main position". So once two hash values have collided, they will keep colliding forever (until the table is resized and entirely rehashed with only the keys actually present, without taking into account former keys that no longer exist).
The order of insertion into the table also matters in how chains can collide and merge: which of the two (or more) colliding chains gets extended depends on this order of insertion.
Note that when you load a table only in sequential order, with integer keys going from 1 to N and without setting any value to nil, there is no collision, because these keys are not stored in the hashed part of the table.
But if you load the table in backward order, with exactly the same set of keys and values from N down to 1, the keys do not initially go to the sequential part of the table but to the hashed part. So integer keys may or may not be hashed depending on the insertion order.
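A minimal sketch of the two insertion orders (the value N and the contents are arbitrary illustrations): the observable results are identical, but internally the forward-filled table keeps its keys in the sequential (array) part, while the backward-filled table initially places them in the hashed part until a later rehash migrates them.

```lua
local N = 5

-- Forward order: keys 1..N go straight into the sequential part.
local forward = {}
for i = 1, N do forward[i] = i * 10 end

-- Backward order: the same keys first land in the hashed part
-- (an internal detail; nothing in the language exposes it directly).
local backward = {}
for i = N, 1, -1 do backward[i] = i * 10 end

-- Both end up as pure sequences with no holes, so #t is well defined:
print(#forward, #backward)  --> 5  5
```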
Tables have only been correctly optimized for pure sequences of keys inserted in forward order from 1 to N with NO nil values: for this case only, "#t" is accurate and ipairs() will enumerate all these keys in order. For all other cases, ipairs() stops at the first hole, and pairs() enumerates the keys in an apparently random order determined by hash values and the current state of the collision lists. If your tables are not pure sequences of non-nil values, you MUST NOT use #t and must not use ipairs(); you should just use pairs() to count integer keys and enumerate them all in unspecified order. If you need them in order, you must first collect all integer keys into a new pure sequence, sort it, and then enumerate that sequence.
As sorting is costly on large sets of keys, Lua does not do it by default. It is up to you to collect the keys that interest you into a new sequence and sort them; the default ipairs() iterator does not do that. The pairs() iterator simply enumerates all keys in the sequential part of the table, silently skipping those with a nil value, then enumerates all keys found in the hashed part, in the order in which they were placed in the nodes (not necessarily the order of their hash values): pairs() just performs two successive table scans and does not follow any collision list, so it can enumerate all keys, including integer keys that are still in the hashed part because their value was never rewritten in a way that would move them to fill holes left in the sequential part where they should be.
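The difference, and the collect-then-sort workaround described above, can be sketched as follows (the table contents are illustrative):

```lua
local t = {10, 20, 30}
t[5] = 50  -- leaves a hole at t[4]; key 5 likely sits in the hashed part

-- ipairs() stops at the first nil, so it never reaches t[5]:
local seen_by_ipairs = 0
for _, _ in ipairs(t) do seen_by_ipairs = seen_by_ipairs + 1 end
-- seen_by_ipairs is 3, not 4

-- pairs() sees every non-nil key, but in unspecified order.
-- To enumerate integer keys in order, collect and sort them first:
local keys = {}
for k in pairs(t) do
  if type(k) == "number" then keys[#keys + 1] = k end
end
table.sort(keys)
for _, k in ipairs(keys) do
  print(k, t[k])  -- keys 1, 2, 3, 5 in ascending order
end
```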

The sequential part is also rarely resized: it may contain lots of holes left by previous keys, or slots preallocated in large chunks. Both parts of the table can grow by more nodes than what is needed for the current keys: the hashed part grows exponentially, while the sequential part starts by growing exponentially at small sizes but then grows by chunks of constant size. There is still no way in Lua to control how each part of a table grows or shrinks: there is no tuning parameter, and no support for regenerating an optimal compact store for both parts.

[Lua tables should have an API like "table.compact(t)" and "table.setsizes()" where we could submit tuning parameters: minimum size, maximum size, growth/reduction factor, minimum and maximum fill percentages, for each part (sequential and hashed); when using setsizes(), an internal compact() could occur according to the given constraints and thresholds. But the implementation details of tables are left to the implementation, and their tuning parameters are hidden, not exposed to applications; tables could as well use B-trees with other tuning parameters: page size, sort order, minimum and maximum fill level for each page. Unfortunately, Lua also contains no API to select the implementation: alternative hashed implementations, sorted or unsorted vectors (useful for stacks), sorted B-trees, singly or doubly linked lists.]
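No table.compact() exists in Lua; the name above is a proposal. The closest pure-Lua approximation, sketched below as an assumption rather than an equivalent, is to copy the live keys into a fresh table, which forces the implementation to rebuild both parts sized for the keys actually present:

```lua
-- Hypothetical helper (not a standard function): rebuild a table so its
-- internal storage fits the keys that currently exist, discarding the
-- tombstones and merged collision chains left by removed keys.
local function compact(t)
  local fresh = {}
  for k, v in pairs(t) do
    fresh[k] = v   -- the fresh table is rehashed from scratch as it fills
  end
  return fresh     -- the caller must switch to the returned table
end

local t = {}
for i = 1, 1000 do t[i * 7] = i end   -- scatter keys into the hashed part
for i = 1, 999 do t[i * 7] = nil end  -- remove most of them; nodes stay chained
t = compact(t)                        -- now a small table holding one key
```

This only shrinks the visible table; unlike a real compact() API, it costs a full copy and gives no control over growth factors or fill thresholds.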

But the default implementation is definitely not a correct hashed table: full scans of long chains can occur at any time in the hashed part. And there is no way for applications to use alternatives; they have to resort to using sequences that store [index]={[key]=value}, making sure that new indexes are inserted in strict forward order only, and, for deleted keys, setting [index]={} containing no key and no value. With that you can build a B-tree, but at the cost of embedding tables in the main table. Lua (unlike PHP, which has a better mechanism) offers no other hint. JavaScript, on the contrary, has very strict mechanisms that preserve more invariants, even if this means that some operations on individual keys may be costly:

* some JavaScript implementations use B-trees with small page sizes (degrees around 4, 8 or 16) to solve that problem and minimize all worst cases, with guaranteed maximum access time in O(log(N)) and maximum storage in O(N) (the proportionality factor is very near 1), where N is the number of keys with non-nil values,
* whereas Lua currently only offers a maximum access time in O(N) and a very costly maximum storage in O(N*log(N))!
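The workaround mentioned above (a sequence of [index]={[key]=value} entries, appended in strict forward order, with deletions replaced by empty tables) can be sketched as follows; the store and its helper functions are hypothetical names, not anything Lua provides:

```lua
-- Hypothetical ordered store built only from append-in-order sequence writes,
-- so every entry stays in the well-behaved sequential part of the table.
local store = { entries = {} }

function store:insert(key, value)
  -- strict forward order: only ever append at index #entries + 1
  self.entries[#self.entries + 1] = { [key] = value }
end

function store:delete(key)
  for i, e in ipairs(self.entries) do
    if e[key] ~= nil then
      self.entries[i] = {}  -- tombstone: an empty table, never nil, so no hole
      return true
    end
  end
  return false
end

function store:each(fn)
  for _, e in ipairs(self.entries) do
    local k, v = next(e)    -- returns nil for tombstones, which are skipped
    if k ~= nil then fn(k, v) end
  end
end

store:insert("a", 1)
store:insert("b", 2)
store:insert("c", 3)
store:delete("b")           -- entry 2 becomes {}; #entries stays 3
```

Because no entry is ever set to nil, #store.entries and ipairs() remain reliable; the price is the linear scan in delete() and one nested table per entry.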

For large tables (with many keys, or where keys are added/removed often and randomly), Lua is extremely poor compared to JavaScript, PHP, PostScript, and other scripting languages, or to most C/C++ libraries for managing hashed tables: Lua is very costly in terms of performance and resource usage, and is a very easy target for DoS attacks.

See the current situation on Wikimedia Commons, whose servers are now extremely slow and permanently overloaded by very slow or failing Lua scripts (there are also other major problems besides Lua tables, related to an inefficient and costly client interface with Wikidata, and an inefficient JSON data model for loading Wikidata entities)!

On Tue, May 26, 2020 at 04:07, Robert Burke <sharpobject@gmail.com> wrote:
On Tue, May 26, 2020, 05:10 Philippe Verdy <verdyp@gmail.com> wrote:
Later, when the hash value is used by truncating some high bits (according to the table size), the positions remain consecutive: the hashed "main" positions in the table are sequential when short keys of equal size are sequential. This makes it extremely easy to rapidly create long collision lists linking all non-nil keys together, so the worst case is trivial to reproduce (all table lookups will then use slow full table scans).

Why will we need a full table scan for keys that do not share main positions?