Re: So I had this weird idea to solve the indexing problem

As well, you can easily create a generic "sortedpairs" iterator: it just has to use pairs() to collect keys (in random order) and store them in a sequence [the table.keys() function already do that for you], then sort that sequence [using table.sort()], and then iterate with ipairs() on that sorted sequence to get ordered keys that can be directly used to index the given table.

Of course, there's a cost: each time you use such iterator, it must create (=possible a large allocation, i.e. temporary storage cost) and sort (=time cost) the sequence of keys.

But an application can also be built and maintain an index or ordered table keys "on the flow" and maintain the sorted sequence each time a table element is added or removed with a modest cost per key added/removed (it will generally be a bit slower to do that incrementally, than just building and sorting all keys at once after filling from scratch a new large table with many keys).

Le lun. 10 juin 2019 à 12:58, Philippe Verdy <verdy_p@wanadoo.fr> a écrit :

However the nasty unbalanced hash trees only occur if you have not used a good enough hashing function; the implementation, above some table size, could arrange to use a stronger cryptographic function (e.g. SHA512, but still SHA1 is still very resistant) which will ensure a really flattened distribution, instead of a simple modular arithmetic function with the multiplication by known static prime (like the simple Knut's hashing function).

Cryptographic hashes are now very fast, and very resistant to DDOS attacks (it is really extremely difficult to "unflatten" the distribution over any hash table sise of reasonnable size). Even with SHA1 or MD5, it would be very computing intensive to create overlong collision lists desequilibrating the distribution of hash keys. The hashing function does not need to create unique keys for unique values, and the cost of of collision lists is proportional to their maximum length (which is inversely proporitional to the size of the hash table, which is turn should grow in size according to the number of keys/size in the table to index).

Let's suppose that the first level of hash table is sqrt(N) where N is the maximum number of elements in the table, then the average length of collision lists (with a flattened distribution) is sqrt(N)/2, and a good hashing function would make the worst collision list length no logner than sqrt(N). So you can easily use a second layer of hashing and then reduce the worst collision list length to O(sqrt(sqrt(N))). The storage cost for the two hashing tables (using the two hasing functions) will still be 2*O(sqrt(N)), and it is still modest compared to the cost of a B-tree (with moderate fill factor) because of the free space left in its pages.

All these designs are just based on statistics metrics that the implementation can maintain. It has many choices ! Lua however does not mandate any hashing function for its tables, so the ordering of keys wahen traversing tables with pairs() is already not predictable allowing freedom of choice. If you needed a table with sorted keys, you can already do that in Lua using a separate table of keys, ordered in a sequence (from 1 to N) independantly of the implementation of tables.

Because Lua is already optimizing the storage of table keys that form a sequence (they are not stored in the hash, but their in values are directly stored in a separate vector indexed by integer, and theses integer keys are then very fast and don't cost any additional storage, that's why ipairs() is very fast).