lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


Philippe Verdy <verdyp@gmail.com> 于2020年5月26日周二 上午1:18写道:
>
> No even with step=1 the transform of h is wrong. As I said (h<<5)+(h>>2) is equivalent to (h<<7 + h) (with preservation of the two highest bits) followed by a rotation of 2 bits to the right. and (h<<7+h) is equivalent to (h*129) so the effect is that the transform multiplies by a non-prime (129 which is divisible by 3), followed by a rotation of 2 bits (which is also wrong because the word size is a multiple of 2, so you have divided the hash collision search space by half the wordsize).
>
> This should immediately be changed to (h<<4)-(h>>3): this is equivalent to (h*127)>>>3, where 127 is prime (but not optimal as it is a simple Mersenne that still does correctly distributes all bits of the input)
>
> As well you use a multiplication before just adding a new byte, so hashes of strings with equal lenght 'a','b','c' are always successive integers.
>
> You should  really first add the new byte to h, then perform the multiplication (by 127, not 129 if you still want to use a Mersenne number where the multiplication is just an addition or difference of the input and the shifted input!) and then the rotation (by 3 bits, not 2!)

I wrote a script to check the collision of those hash functions by
simulating lua hash tables [1].

I use https://github.com/dwyl/english-words/blob/master/words_alpha.txt
that contains 370104 English words,  putting every N random words into
M slots hash table,
and check how many times did the collision happen.

      N/M                                 3/4                 5/8
           7/8                 9/16               11/16
13/16                15/32           17/32               19/32
    21/32
(h<<5) + (h>>2) + x         (3)84936        (4)81778        (2)113039
     (3)79994        (4)96529        (3)111433       (3)125257
(4)79748        (3)87786        (3)95753
(h<<4) + (h>>3) + x         (1)84398        (3)81747        (4)113248
     (1)79916        (1)96154        (1)111169       (1)125113
(2)79542        (1)87535        (2)957332
(h<<4) - (h>>3) + x          (2)84800        (1)81129        (1)112800
      (2)79919        (2)96178        (2)111417       (2)125194
(1)79511        (2)87643        (1)95399
((h + x) * 0xAAAB) >> 3   (4)85250        (2)81712        (3)113099
   (4)80467        (3)96463        (4)111597       (4)125487
(3)79684        (4)87912        (4)95902


The number means collision times, smaller is better.

I don't see the obvious difference, but it seems that  `(h<<4) -
(h>>3) + x` is better than orignal `(h<<5) + (h>>2) + x`, but  `(h<<4)
+ (h>>3) + x` is better , too.

It seems that `(h<<4) - (h>>3) + x`  is the best choise in this test
case, and `(h<<5) + (h>>2) + x` is the worst.
but I think the point is changing >>2 to >>3 , the multiplication
number is not very important here (IMHO) .

[1] : https://gist.github.com/cloudwu/61befc67d5ef6eee7ef8fda251612439

-- 
http://blog.codingnow.com