[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: question on the patch FastString
- From: 云风 Cloud Wu <cloudwu@...>
- Date: Wed, 27 May 2020 10:32:56 +0800
Philippe Verdy <verdyp@gmail.com> 于2020年5月26日周二 上午1:18写道:
>
> No even with step=1 the transform of h is wrong. As I said (h<<5)+(h>>2) is equivalent to (h<<7 + h) (with preservation of the two highest bits) followed by a rotation of 2 bits to the right. and (h<<7+h) is equivalent to (h*129) so the effect is that the transform multiplies by a non-prime (129 which is divisible by 3), followed by a rotation of 2 bits (which is also wrong because the word size is a multiple of 2, so you have divided the hash collision search space by half the wordsize).
>
> This should immediately be changed to (h<<4)-(h>>3): this is equivalent to (h*127)>>>3, where 127 is prime (but not optimal as it is a simple Mersenne that still does correctly distributes all bits of the input)
>
> As well you use a multiplication before just adding a new byte, so hashes of strings with equal lenght 'a','b','c' are always successive integers.
>
> You should really first add the new byte to h, then perform the multiplication (by 127, not 129 if you still want to use a Mersenne number where the multiplication is just an addition or difference of the input and the shifted input!) and then the rotation (by 3 bits, not 2!)
I wrote a script to check the collision of those hash functions by
simulating lua hash tables [1].
I use https://github.com/dwyl/english-words/blob/master/words_alpha.txt
that contains 370104 English words, putting every N random words into
M slots hash table,
and check how many times did the collision happen.
N/M 3/4 5/8
7/8 9/16 11/16
13/16 15/32 17/32 19/32
21/32
(h<<5) + (h>>2) + x (3)84936 (4)81778 (2)113039
(3)79994 (4)96529 (3)111433 (3)125257
(4)79748 (3)87786 (3)95753
(h<<4) + (h>>3) + x (1)84398 (3)81747 (4)113248
(1)79916 (1)96154 (1)111169 (1)125113
(2)79542 (1)87535 (2)957332
(h<<4) - (h>>3) + x (2)84800 (1)81129 (1)112800
(2)79919 (2)96178 (2)111417 (2)125194
(1)79511 (2)87643 (1)95399
((h + x) * 0xAAAB) >> 3 (4)85250 (2)81712 (3)113099
(4)80467 (3)96463 (4)111597 (4)125487
(3)79684 (4)87912 (4)95902
The number means collision times, smaller is better.
I don't see the obvious difference, but it seems that `(h<<4) -
(h>>3) + x` is better than orignal `(h<<5) + (h>>2) + x`, but `(h<<4)
+ (h>>3) + x` is better , too.
It seems that `(h<<4) - (h>>3) + x` is the best choise in this test
case, and `(h<<5) + (h>>2) + x` is the worst.
but I think the point is changing >>2 to >>3 , the multiplication
number is not very important here (IMHO) .
[1] : https://gist.github.com/cloudwu/61befc67d5ef6eee7ef8fda251612439
--
http://blog.codingnow.com