lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


So now you can confirm that the factor 129 is not suitable as is is not prime (divisible by 3) even if it's a Mersenne number (=2^7+1).

But the current implementation still uses it as the factor: (h<<5)+(h>>2) is the same as (h<<7 + h)>>>2 = (h*128+h)>>>2 = (h*129)>>>2; this way of writing it with the sum of shifts in two directions avoids discarding 2 high bits with (h<<7) and avoids using a rotation (only left or right shifts are needed, the rotation operator >>> is not standard in C even if it is common in native instructions of CPUs; the compiler will detect that and will still generate a rotation if the two parts of the _expression_ do not depend on other variables, but here "h" is the same local variable that cannot be muted externally in the middle of the _expression_; some compilers may not optimize the _expression_ if you don't use a "bitrotl" intrinsic with an explicit wordsize for which the rotation is defined).

You see however the Mersenne factor 129, which is not prime. Using 127 avoids this problem of non-primatlity: (h*127)>>>2 =  (h*128 - h)>>>2 = (h<<7 - h)>>>2 = (h<<5) - (h>>2) : you've just replaced the "+" sign by a "-" sign.

The final ">>2 " is meant to "shuffle" bit positions between round, but the shuffling is partial by arranging one half of the bits of the first unit to combine only with one half of the other units (remember that integers have an even number of bits). Shuffling by rotations of 3 bits instead of 2 avoids this problem:  (h*127)>>>3 =  (h*128 - h)>>>3 = (h<<7 - h)>>>3 = (h<<4) - (h>>3)

This is not changing the code significantly, it will compile as efficiently as the original.

But now the factor 129 is still not good enough as it is only Mersenne, so, even with the >>>3 shuffling (which works to better randomize the dependency between successive units at short distances onto all bits of the output hash, and not just half of them). bits of the first unit are slow to influence and propagate changes to other bits. You need better primes that Mersenne primes so that shuffling occurs faster with less dependency between bits of successive units of input data.

Knuth designated good multipliers for several wordsizes. You deliberately ignored it just like you deliberately chose to ignore the primality requiement of the factor: this is a proof that you've not understood the Knuth algorithm and how and why it really works as a good (but still fast) hash.

So in this piece of code, you deliberately made three choices (including the most important one: primality of the multilpier) to ignore the constraints (deliberate because you say you read that book!), to reduce the basic security of hashes and deliberately allow easy DoS attacks on Lua tables: are you serious? What other developers are will say if you refuse the proofs? May be you did not write this initial line but borrowed it from preexisting code, but when you see it now, you should admit that this was an error, in fact a severe bug that was not properly reviewed.

What is complicate for the immediate fix? i.e. replacing: (h<<5)+(h>>2) by: (h<<4)-(h>>3)
Do you seriously think this tiny change (two tiny digits, and one operator) would impact the local performance or global performance of tables?
Or that it would impact the compatibility, given that you've already changed the indexing of strings elsewhere, by changing the prefix and suffix length from 6 to 8 and modified the step for sampling the middle of "long strings"?








Le mar. 26 mai 2020 à 22:22, Francisco Olarte <folarte@peoplecall.com> a écrit :
Philippe:

On Tue, May 26, 2020 at 9:20 PM Philippe Verdy <verdyp@gmail.com> wrote:
> Some readings:
> https://www.cs.hmc.edu/~geoff/classes/hmc.cs070.200101/homework10/hashfuncs.html

You must be kidding. Something 20 years old with homework in the url?

> https://lowrey.me/exploring-knuths-multiplicative-hash-2/

I don't even know what this is cited for.

> https://gist.github.com/badboy/6267743

This is mildly interesting, but not a lot of information.

> https://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a

stackexchange? really?

> There are tons of documents online in addition to the famous Knuth's book "The Art of Programming" that you should have read or seen referenced in many places!

FYI, TAOCP are several books. I assume you refer to vol3, sorting and searching.

As I told you, I own it. Several copies, starting from the hard cover
edition from 1973, second printing, ISBN 0-201-03803-X, with the
folded tape merging insert between pages 340 and 341, and had read 6.4
several times, and uderstood the math ( haven't done it on  decade or
so, and given the amount of math I've forgotten I may not be able to
do it now without refreshing ).

Francisco Olarte.