• Subject: Re: thought on Lua's hash-table invariants
• From: Philippe Verdy <verdyp@...>
• Date: Tue, 28 Apr 2020 05:13:33 +0200

Also the solution for large hash tables could be to use subtables, by splitting the keys into subkeys.

But in that case, for large hashes, why not using simply a B-tree ?

A B-tree avoids most reallocations, because pages (of key/value pairs) in a B-tree have a constant size, and can be allocated/freed from a large pool containing ALL free or used pages, that can also be allocated or reallocated using the minimum/maximum thresholds for used pages in this pool (the maximum threshold may be 100%, the minimum can be between 0% if you want no deallocation or up to 90% if you want the most compact representation and you don't expect a rapid growth of the size).

Free pages in a B-tree can be chained (each page of N nodes also  contains pointers to N+1 other pages for subtrees, you can use one of these pointers to form a forward-only chain: allocation of a new page is immediate)

A "B-tree" is not necessarily fully ordered: for using it as a hash store, you just need to compare if keys are equal or not, so each small page can be a hash table: collisions will use subtrees (if needed, i.e. if there's no more space in the current page by following its local implicit chain for lookups, or simply by sequential loookup). As the page in a B-tree is small (e.g. up to 16 nodes), lookup of hash chains within the page is warrantied to be fast. And the B-tree can be built to equilibrate the depth of its branches so that all leaf pages are at the same level (there can exist nodes however in the parent levels); and so that only the root page of the B-tree may have a low fill level (all other pages in cluding leaf pages having a high fill level, that you can warranty by a higher value of the minimum fill threshold than for the root page which could remain constantly allcoated e.g. with up to 16 nodes for key/value pairs, and 17 pointers for pointing other parent pages for subtrees, keeping only the last pointer of the root page for the chain of free pages).

What will be the page size (managed as a local hash table only for the key/value nodes contained in that subset) is left to be implementation dependant, but it is also tunable (this minimum should be about 16, but it can be also a bit larger if you want less frequent operations in the B-tree for managing reorganisations of nearby branches to maintain the fill-level invariants in eah page and the constant depth of all leap pages in the B-tree. The reorganisation operations in a B-tree are not very costly, they are in O(depth) for a B-tree storing O(pagesize^depth) nodes, ie. for a B-tree storing N nodes, its cost is O(log(N)/N) and given reasonnable values for the minimum fill level of each page, it will not occur very often.

Le mar. 28 avr. 2020 à 04:47, Philippe Verdy <verdyp@gmail.com> a écrit :
Changing keys (replacing non-nil values by non-nil values) should not affect the has table at all (so it cannot cause a "stop the world" issue, which is a critical security issue if the keys are private data).

But if you change them first to nil, then back to a non-nil value, this should not cause "stop the world" issue caused by reallocation (because the has table will remain almost constant and reallocation should occur only if
* the table fill level would exceed some threshold (e.g. over 85%) for adding new keys
* if you hit a collision chain exhausting a maximum number of steps to find a new free position for a new key, or
* the table fill level would fall below some threshold (e.g. below 25%) for removing some keys, and the reducing the size would make it smaller than a minimum size (e.g. 16 nodes)

The fact that the "lastfree" node moves in one way is clearly a bug (a bad choice in the initial design of the implementation) that can be corrected. I indicated the solution:
* stop doing that, you don't need such "last free node" pointer (but as long as you have nodes chained by "next" pointers, you can keep them unchanged so that setting the key again to a non-nil value will restore it in its chain)
* remove the unnecessary "next" pointer from nodes, instead use static advances by a prime contant modulo the hash table size (to be used both when adding new keys by following the implicit chain with this constant step to match the key (if matched, replace the value of the node if the value is not nil, or set both the key and value of the node to nil) or until you hit a free node (with nil key) where you can place the key/value pair to add.

Bonus:
* you get smaller size in memory for nodes, that only store key/value pairs (both non-nil or both nil) and no extra pointers.
* you have a wider control on the dynamic size of the hashtables (with minimum/maximum fill levels like 25%/85%, plus the "minimum table size" like 16) to limit the frequency of rehashings needed
* invariants are simpler to maintain
* worst case ("stop the world") are avoided alsmot always when the hash table is stable or does not change significantly very often.
* you can instantiate a marge table with a "minimum table size" at start that can be used for fast filling (without needing any reallocation with rehashings in the middle of the fill)
* you may reduce the minimum size later back to a small value when the table has been fed with all the initial data you want and that should be quite stable for long period of time even if the contents evolves slightly (in which case only the minimum/maximum values will be considered: rehashing of all keys may occur only at this time but only if the initial "minimum size was not correctly estimated and was much too large for the final fill level).

Le mar. 28 avr. 2020 à 04:21, 云风 Cloud Wu <cloudwu@gmail.com> a écrit :
Philippe Verdy <verdyp@gmail.com> 于2020年4月28日周二 上午6:09写道：

> The basic function used by Lua for its hash table hwoever is not very strong (it is made to be fast): one way to still use Lua tables for critical data is to first create your own strong digest for any private key that you want to secure against malicious collisions, and then append this digest to the unique key to generate the actual key to be used in table, instead of indexing the table directly by this private key (such appending will still preserve the uniqueness). I wonder however if Lua allows an application to specify its own secure hashing function for storing/reading (key=value} pairs in its tables: is there a metamethod we can use in tables? (setting this methamethod to a different hashing function should instantly cause the reashing of all keys already present in that table).

My experience is that don't use a huge table if you changes it's keys
frequently. It will cause stop-the-world problem.

I found this issue years ago. There is a huge table (nearly 1M nodes)
in our online game server. We add some keys and remove keys
frequently, the total number of the nodes not changed but `lastfree`
moves in one way. So rehasing occured periodic , it stopped the world
for a long time.

My solution is spliting this huge table into multiple smaller one.

I think an user-defined hashing function may not be helpful in small
tables, and metamethod could be much slower than native one.

--
http://blog.codingnow.com
_______________________________________________
lua-l mailing list -- lua-l@lists.lua.org
To unsubscribe send an email to lua-l-leave@lists.lua.org
```_______________________________________________
lua-l mailing list -- lua-l@lists.lua.org
To unsubscribe send an email to lua-l-leave@lists.lua.org
```