Re: question on the patch FastString

Also someone in this list suggested a possibly better prime (not Mersenne) than 0xAAAB, which was just selected by a small manual incremental search with some properties.

As the list of all 16-bit primes is small and known it is easy to assert and finally check with a long enough lists of words of the same length that are full hashed (not sampled), generated from a sequence: this can be used to evaluate the various primes (or we could just use the primes listed by Knuth, none of them are Mersenne as they have randomized distribution of bits 1 and 0, without any overlong sequences of 1s or overlong sequences of 0s).

Common usage patterns for Lua tables will use for example timestamps, GUID (limited to 16 bytes), or SHA1-computed strong hashes (e.g. identification of arbitrary files by their content), or IPv6 addresses (128-bit, i.e. 16 bytes if indexed as binary-packed strings), or IPv4 addresses (4 bytes when packed in binary strings).

Your test is suitable for indexing filenames, or common field names/variable names in objects (including for example ISO 639 or BCP-47 language codes in tables of translations with many codes of the same length 2 or 3)

For small objects with few keys like {x=2, y=3}, hashing is not significant as possible collisions have no significant effect:

Using a hash table for these small tables is even unnecessary, up to keys could be kept in a single array (e.g. less than about 16 keys), and even kept sorted for lookups by binary search of their hash values (the cost is when adding/removing keys in random hashvalue order, but on average it will not "slide" a lot of keys in the array by memcpy() for insertion or deletion: less than 16 keys to move on average, and rarely, 0 keys to move at best and 31 keys at most): the binary search however would have to take into account possible duplicate hash values (so when dividing the range in two "equal" parts with a central position, you have to scan some positions forwards and backwards as long as they have the same hash value before subdividing again):

This would be the first step before extending the sorted scheme, to create a B-tree (with pages of 16 keys for example) for larger tables, with a warrantied O(log(tablesize/pagesize)*log(pagesize)) access time and a warrantied O(tablesize) space used (so never with any critical worst case): correctly balanced B-trees have a minimum and maximum fill level for all pages, except the root page which may have less keys, and all leaf pages are at the same depth (these levels are tunable even at runtime for specific tables: if you want fewer reorganizations during insertion/deletion, you can use a low minimum at 50% instead of the common 85% level; the maximum does not be set to something else than 100%, unless you want to prepare the table at a reasonnable depth for some later insertions when splitting/merging the nearest sibling pages at the same depth, preferably those with the same parent page as it avoids the recursive splitting/merging of the parent page with its own nearest siblings).

Le mer. 27 mai 2020 à 04:39, 云风 Cloud Wu <cloudwu@gmail.com> a écrit :

Philippe Verdy <verdyp@gmail.com> 于2020年5月26日周二上午1:18写道：
>
> No even with step=1 the transform of h is wrong. As I said (h<<5)+(h>>2) is equivalent to (h<<7 + h) (with preservation of the two highest bits) followed by a rotation of 2 bits to the right. and (h<<7+h) is equivalent to (h*129) so the effect is that the transform multiplies by a non-prime (129 which is divisible by 3), followed by a rotation of 2 bits (which is also wrong because the word size is a multiple of 2, so you have divided the hash collision search space by half the wordsize).
>
> This should immediately be changed to (h<<4)-(h>>3): this is equivalent to (h*127)>>>3, where 127 is prime (but not optimal as it is a simple Mersenne that still does correctly distributes all bits of the input)
>
> As well you use a multiplication before just adding a new byte, so hashes of strings with equal lenght 'a','b','c' are always successive integers.
>
> You should really first add the new byte to h, then perform the multiplication (by 127, not 129 if you still want to use a Mersenne number where the multiplication is just an addition or difference of the input and the shifted input!) and then the rotation (by 3 bits, not 2!)

I wrote a script to check the collision of those hash functions by
simulating lua hash tables [1].

I use https://github.com/dwyl/english-words/blob/master/words_alpha.txt
that contains 370104 English words, putting every N random words into
M slots hash table,
and check how many times did the collision happen.

N/M 3/4 5/8
7/8 9/16 11/16
13/16 15/32 17/32 19/32
21/32
(h<<5) + (h>>2) + x (3)84936 (4)81778 (2)113039
(3)79994 (4)96529 (3)111433 (3)125257
(4)79748 (3)87786 (3)95753
(h<<4) + (h>>3) + x (1)84398 (3)81747 (4)113248
(1)79916 (1)96154 (1)111169 (1)125113
(2)79542 (1)87535 (2)957332
(h<<4) - (h>>3) + x (2)84800 (1)81129 (1)112800
(2)79919 (2)96178 (2)111417 (2)125194
(1)79511 (2)87643 (1)95399
((h + x) * 0xAAAB) >> 3 (4)85250 (2)81712 (3)113099
(4)80467 (3)96463 (4)111597 (4)125487
(3)79684 (4)87912 (4)95902

The number means collision times, smaller is better.

I don't see the obvious difference, but it seems that `(h<<4) -
(h>>3) + x` is better than orignal `(h<<5) + (h>>2) + x`, but `(h<<4)
+ (h>>3) + x` is better , too.

It seems that `(h<<4) - (h>>3) + x` is the best choise in this test
case, and `(h<<5) + (h>>2) + x` is the worst.
but I think the point is changing >>2 to >>3 , the multiplication
number is not very important here (IMHO) .

[1] : https://gist.github.com/cloudwu/61befc67d5ef6eee7ef8fda251612439

--
http://blog.codingnow.com