lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


Nothing prevents an implementation to use a balanced tree approach instead of a hash.
Both approaches can be used simultaneously, depending on statistics factors, notably:
- the fill factor density for the hash, which is supposed to have a flattened distribution (if it uses a good enough hashing function avoiding the nasty effect of very long collision lists on very few hash slots)
- the estimated needed size for the static hash table.
A balanced tree (B-tree variants) can be very compact even for storage. This also basically depends on the size of the B-tree page and the maximum "degree" per node. Its cost however is proportional to the time to sort each page, and also depends on the balancing factors (what is the minimum fill factor to keep, and how many pages must be rearranged). But when filling and emptying large tables, this would require lot of page reorganizations (merges/splits) and yes it will be much slower than what a hashtable needs to do (rearrangements in a collision list are trivial).
Another approach would be to  combine a second independant hashing functions to manage the collision list.

Le dim. 9 juin 2019 à 16:03, Egor Skriptunoff <egor.skriptunoff@gmail.com> a écrit :
On Sun, Jun 9, 2019 at 2:06 PM Soni "They/Them" L. wrote:
hi, so, a lot of ppl won't use Lua because Lua uses 1-based indexing.
well, other languages use 0-based indexing. but, what if we had neither?

the trivial solution for "neither" is just to use linked lists. and that
would, indeed, work, quite badly as well.but what if we come up with
something completely different instead?

rather than limiting ourselves to numeric representations of arrays, I
suggest a much more tangible (and yes tangible is the right term)
approach: filters.

consider the following table:

local t = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

(also consider as if t[1] etc don't work to access array indexes but
only hashtable indexes)

how do you iterate it? well you'd still use ipairs ofc and it'd be just
as fast, it just wouldn't use numbers. so you'd do "for value in
ipairs(table) do". additionally we can also remove array indexes from
pairs() because it doesn't make any sense to keep them there.

how would you do indexing? well, this is where things get interesting.
let's say you want to get that 5 there. you could do it like this:

local i = {false, false, false, false, true}
local v = table.get(t, i) -- v = 5

great! we no longer need numeric indexing! oh, wait...

uhh... how would we dynamically index things...

well, one thing is consistent between 0-based indexing and 1-based
indexing: in 0-based indexing, the length is always just beyond the last
valid index. in 1-based indexing, the length is the last valid index.
this happens to match up with the number of values in the array. as such
we can keep #t.

one option is to have "left, mid, right = table.bisect(t)". this is
quite efficient as well.

local left, mid, right = table.bisect(t) -- mid is nil
local left, mid, right = table.bisect(right) -- mid is 8
local left, mid, right = table.bisect(left) -- left is 6, right is 7,
mid is nil

(having a separate mid lets we ignore the problem of bisecting an odd
number and having to decide which side it'll go to.)

combine it with #t and you can trivially do 0-based or 1-based indexing
in your own code, and it just works. you can even mix them in the same
code, which is useful in some contexts.

okay so we have bisect and all but this isn't complete yet. to really
finish it off we need one more thing. well, four, actually:
table.pop_left, table.pop_right, table.push_left, table.push_right.
additionally, table.bisect should return views, not full arrays, both so
it's faster and so we can pop_left on a view and have it modify the
original table.

but anyway, this way, everyone's happy, because numeric indexing is no
more. or, well, actually, it's meant to make everyone unhappy I guess,
but you get the point xd.



(This is off-topic)
Your proposal reminds me another idea about storing table's keys ordered by value instead of ordered by hash:

Usually Lua table consists of "array" part and "hash" part.
The non-"array" part should store its data in a different way: it should be reorganized from "hash" part into "tree" part.
All keys in "tree" part of each Lua table should be arranged in some fixed order (imagine that we could compare two arbitrary Lua values of arbitrary datatypes).
Some kind of balanced-tree data structure should be implemented to make all table operations having O(log(N)) time complexity, where N is the number of keys in a "tree" part of a table.

This approach has the following benefits:

#(1) Operations "get next/prev key" will be O(1), as always.
#(2) Operations "get leftmost/rightmost key" will be O(log(N)).
These two items mean that we could implement next() as fast as in Lua 5.3.
Please note that "get leftmost key" operation was not O(1) in Lua 5.3, see the following example:
   while next(t) do t[next(t)] = nil end

#(3) Operations "move M elements to the left/right" will be O(log(N)) for large M.
Hence, we could access a table in a specific way which is independent on the base index:
old style: tbl[1]   new style: get_leftmost(tbl)
old style: tbl[5]   new style: move_right(get_leftmost(tbl), 4)
Of course, this syntax is not neat, it just gives the idea of operation.

#(4) Operation "split diapason key1..key2 into two equal parts" will be O(log(N)).
The primitive "dichotomy the range of keys" might be useful for many cases, including parallelization.

#(5) Balanced trees are immune to DDOS attacks (unlike hash-tables).
Lua would not have to pseudo-randomly shuffle the order of keys used by next() function on each VM initialization. 
Sequence of keys generated by pairs() will be deterministic (unless tables are used as keys).
For example, it would be convenient if all numerical keys go in their natural order when traversed with pairs():
-1.5, 0, 1, 2, 3, math.pi, 4, ..., math.huge

#(6) Memory block size for storing a balanced tree may be not equal to a power of 2 (as for hash tables).
For example, when extending a "tree" part of a growing table, any memory block 2x...4x of the previous size would be OK.
Memory for storing a balanced tree also can be fragmented into several continuous chunks without making the implementation more slower or more complex (unlike hash-tables).
These would make memory manager's life a bit easier.


Drawbacks:
#(-1) Table indexing will be slightly slower (hash tables, despite of collisions when indexing, are faster than balanced trees).
#(-2) Tables would take more memory (trees consume more space than hash-tables).
#(-3) Didn't actually solve the "problem of accustomed 0-based arrays".