lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


* Also for very small table sizes (4 elements or lower) you don't need to perform a hashindex lookup through the chain: you can just store the items directly in random order and perform a full scan (at most 4 loops). The array just has to contain the 4 pointers of keys and 4 pointers of values (no chaining pointer needed, there's no "collision" but on average you'll scan 2 items, and never more than 4)
* For small table sizes (more than 4 elements and 32 elements or lower), you could just keep the items stored in hash value order so that lookups can use a binary search (at most 5 loops); this requires moving items when inserting/deleting keys but the moves in the array won't be long. The array also just needs the 32 pointers of keys and 32 pointers of values (no chaining pointer needed, there's no "colision" at all, but on average you'll scan 5 items at most and 4 items on average)
Collisions in really hashed tables start being a problem for large enough tables (128 elements or higher: at 128 there's still no problem, but at 256 you have collisions every 5 elements added if element sizes is a multiple of 5 or more rarely a multiple of 51)

Large hashtables waste a lot of memory but also their Mersenne (size-1) frequently has small multiples that dramatically severe the efficiency of hashes (notably for storing many small objects whose allocated size would be a multiple of one of the divisors of the Mersenne (size-1). The more you add keys to tables, the more you waste memory with lot of unused nodes but also a lot more collisions caused by small divisors of (size-1), and with larger Mersenne (size-1), you get new small divisors, sometimes several ones: 3, 5, 7, so the tables become sensitive to more diverse kind of objects if the objects have varaible sizes: the frequency collisions will only grow even if the table is far from being full (because there are many free nodes in a table size limited to a power of 2), and the effect of these collisions on global performance become dramatic: hashing no longer optimize the access time. 

That's where a sorted table with binary lookups (including B-trees) can help!


Le mar. 26 mai 2020 à 03:40, Philippe Verdy <verdyp@gmail.com> a écrit :
A table size for 16 elements: 2^4-1=15 is Mersenne, but is a multiple of 3 and 5, so collisions occur for objects whose differences of addresses is frequently a multiple of 3 or 5 (e.g. allocating many objects with size 10, or 12 or 20).
A table size for 32 elements: 2^5-1=31 is Mersenne, but is prime (no problem here).
A table size for 64 elements: 2^6-1=63 is Mersenne, but is a multiple of 3, so collisions occur for objects whose differences of addresses is frequently a multiple of 3 (e.g. allocating many objects with size 12)
A table size for 128 elements: 2^7-1=127 is Mersenne, but is prime (no problem here).
A table size for 512 elements: 2^9-1=511 is Mersenne, but is a multiple of 3.
A table size with 256 elements: 2^8-1=255 is Mersennen but is a multiple of 5

The current limitation of table sizes to a power of 2 offers NO protection at all against collision of pointers to large sets of objects allocated with the same size, Mersenne numbers (2^n-1) frequently have small divisors (3, 5, 7 beging the most common).


Le mar. 26 mai 2020 à 03:17, Philippe Verdy <verdyp@gmail.com> a écrit :
Note that not all Mersenne numbers 2^p-1 are primes (even if they are not powers of 2, nor multiples of power of 2, like aligned addresses)

But it's true that the lower bits of aligned adresses have no use in the final hash and could be stripped before using the modulo. But then the stripped address may be a multiple of the Mersenne if the Mersenne is not prime.

For example 2^16-1 is a multiple of 5, so addresses that are 5 times the alignment size for addresses will collide frequently in a table with 2^16 nodes.

The only way to avoid this is to multiply the stripped-down address (converted to integer without the low bits always equal to 0) by a large enough **non-Mersenne** prime like 0xAAAB to "randomize its bits", and then use that result as the hash value of the address.

You DON'T need the (modulo tablesize-1), just the (modulo tablesize) for converting the hash value to a main slot number in the table, like for all other hashes, i.e. just need to mask the high bits if the table size is still limited to a power of 2: that is something that should change for large tablesizes as this is a severe limitation that wastes memory but which is still applicable, given your comment saying that tablesize-1 is a Mersenne (this should not even be true if this limitation of tablesizes=2^p is removed above some value of p, where tablesizes could just be a multiple of 2^p).

This multiplication by a large non-Mersenne prime will also work even if addresses of objects (tables, functions, or userdata whose mutable content is not hashed at all) are not aligned at all (on 8-bit systems that have no alignment constraint at all, but follow some incremental patterns with many multiples of the same base and a constant offset (this occurs when allocating many objects of the same size): the differences of addresses, ie. the base could easily be a small multiple or submultiple of the (tablesize-1) which is most often NOT a prime and may have small divisors.





Le lun. 18 mai 2020 à 18:53, Roberto Ierusalimschy <roberto@inf.puc-rio.br> a écrit :
> the hash for objects such as tables and functions seems to be the memory
> address at which they are stored, this is indeed unique but due to
> alignment when it is reduced modulo 2^n (the size of the table) it may
> cause many collisions
>
> does Lua do something to randomize the memory address or the hash derived
> from the memory address?

For these objects, the final hash is done modulo table size - 1, which
is a Mersene number.

-- Roberto
_______________________________________________
lua-l mailing list -- lua-l@lists.lua.org
To unsubscribe send an email to lua-l-leave@lists.lua.org