Re: question on the patch FastString

But you've just proven that (h<<4 - h>>3) is the best (using prime 127, which is a strict Mersenne 2^p-1), as I indicated and that (h<<5 + h>>2) is the worst (current implementation using non-prime 129 but still Mersenne-like in the alternate series 2^p+1). I was right since the start !

Did you experiment:

(h+x)<<4 - (h+x)>> 3

(i.e. adding x before multiplying and then shuffling by rotation) ? it makes a difference for short words.

Note that your test with English words is biased in favor of short words.

A test including "words" like random hex GUID, or common prefixes followed by an integer (from a sequence 1..N) (such that the generated "word" still fits in the limit where the string is not partially sampled but fully hashed, i.e. below 40 bytes with the current implemetnation), or dates and timestamps in ISO8601 formats like "yyyy-mm-ddTHH:ii:ssZ", is something that should be included in the same list or separate list of terms.

But with a lost of English terms with variable length, whose length is included in the hash initialisation step, you don't see that if the sample is short as subsets of words of the same length are not distributed equally, some subsets for very short terms line "'in", "to", or long terms over 16 characters are too small.

Le mer. 27 mai 2020 à 04:39, 云风 Cloud Wu <cloudwu@gmail.com> a écrit :

Philippe Verdy <verdyp@gmail.com> 于2020年5月26日周二上午1:18写道：
>
> No even with step=1 the transform of h is wrong. As I said (h<<5)+(h>>2) is equivalent to (h<<7 + h) (with preservation of the two highest bits) followed by a rotation of 2 bits to the right. and (h<<7+h) is equivalent to (h*129) so the effect is that the transform multiplies by a non-prime (129 which is divisible by 3), followed by a rotation of 2 bits (which is also wrong because the word size is a multiple of 2, so you have divided the hash collision search space by half the wordsize).
>
> This should immediately be changed to (h<<4)-(h>>3): this is equivalent to (h*127)>>>3, where 127 is prime (but not optimal as it is a simple Mersenne that still does correctly distributes all bits of the input)
>
> As well you use a multiplication before just adding a new byte, so hashes of strings with equal lenght 'a','b','c' are always successive integers.
>
> You should really first add the new byte to h, then perform the multiplication (by 127, not 129 if you still want to use a Mersenne number where the multiplication is just an addition or difference of the input and the shifted input!) and then the rotation (by 3 bits, not 2!)

I wrote a script to check the collision of those hash functions by
simulating lua hash tables [1].

I use https://github.com/dwyl/english-words/blob/master/words_alpha.txt
that contains 370104 English words, putting every N random words into
M slots hash table,
and check how many times did the collision happen.

N/M 3/4 5/8
7/8 9/16 11/16
13/16 15/32 17/32 19/32
21/32
(h<<5) + (h>>2) + x (3)84936 (4)81778 (2)113039
(3)79994 (4)96529 (3)111433 (3)125257
(4)79748 (3)87786 (3)95753
(h<<4) + (h>>3) + x (1)84398 (3)81747 (4)113248
(1)79916 (1)96154 (1)111169 (1)125113
(2)79542 (1)87535 (2)957332
(h<<4) - (h>>3) + x (2)84800 (1)81129 (1)112800
(2)79919 (2)96178 (2)111417 (2)125194
(1)79511 (2)87643 (1)95399
((h + x) * 0xAAAB) >> 3 (4)85250 (2)81712 (3)113099
(4)80467 (3)96463 (4)111597 (4)125487
(3)79684 (4)87912 (4)95902

The number means collision times, smaller is better.

I don't see the obvious difference, but it seems that `(h<<4) -
(h>>3) + x` is better than orignal `(h<<5) + (h>>2) + x`, but `(h<<4)
+ (h>>3) + x` is better , too.

It seems that `(h<<4) - (h>>3) + x` is the best choise in this test
case, and `(h<<5) + (h>>2) + x` is the worst.
but I think the point is changing >>2 to >>3 , the multiplication
number is not very important here (IMHO) .

[1] : https://gist.github.com/cloudwu/61befc67d5ef6eee7ef8fda251612439

--
http://blog.codingnow.com