Re: question on hash for long strings

Le lun. 25 mai 2020 à 12:36, Philippe Verdy <verdyp@gmail.com> a écrit :

Also if you use a plain multiplication (instead of a couple of shifts and a single addition/substraction), you are not required to use primes that are Mersenne numbers; there are better primes that better distribute the bits: good primes to use for the constant multiplication factor should have n/2 bits set to 1 and n/2 bits set to 0, where n is the bitsize of words.

- for wordsize=2, n=16, so you just need to select a 15-bit prime with 8 bits set to 1 (including the lowest bit because such primes are odd) : many such primes exists (they are not Mersenne numbers), somewhere in the range from 2^14+2^8-1 to 2^15-1 (a simple sieve will detect them, select the smallest you find in that range)
- for wordsize=4, n=32, so you just need to select a 31-bit prime with 16 bits set to 1: many more such primes exist, somewhere in the range from 2^30+2^16-1 to 2^31-1
- for wordsize=8, n=32, so you just need to select a 63-bit prime with 32 bits set to 1 : many more such primes exist, somewhere in the range from 2^62+2^32-1 to 2^31-1

Le lun. 25 mai 2020 à 12:14, Philippe Verdy <verdyp@gmail.com> a écrit :
What do you think of : unsigned int h = seed ^ cast_uint(l);
The seed is used and combined with the string length to initialize the first hash before hashing a part of the string contents (every step characters in backward direction, starting from the last character)
This means that the start of the string is not always hashed and is sensitive to collisions. There's no safeguard to protect at least the start. The end of the string is also not completely hashed.
Ideally the 6 first characters and 6 last characters should always be hashed. Short strings (12 bytes or lower should be fully hashed, along with their string length), long strings (longer than 12 chars) should hash the length, the 6 first chars, intermediate characters backward starting from the l-6 down to 7 (by adjustable step, the step being dependant on l-12), and the 6 last characters.

For performance reasons, it should probably hash not bytes but aligned words: the intermediate hash should have the size of a word. This means that indexing the last bytes of the string would need to be aligned or special cased if the string length is not a multiple of the word-size (some extra null padding bytes may be implied for missing bytes, this just requires a small switch to handle the end of string.

If we assume that wordsize is 2, 4 or 8 bytes (is for 8 for 64-bit compilers), then instead of hashing the 6 first bytes and the 6 last bytes, we would just hash the 8 first bytes and the 8 last bytes (aligned with padding, using a small switch after the loop with 8 cases. And in that case the loop for the middle would proceed on sets of 8 bytes every step and the step would be a multiple of 8. And the middle of the loop would hash one, two or four words depending on wordsize

Note: "wordsize" means here "sizeof(unsigned int)".

Not only it would be faster, it would also avoid more collisions without degrading performance.

Also I don't see the interest of passing the step value in parameter to this function, which could as well compute its value from the given string length. And that function would be then the same for all strings (short or long). The only interest of this "step" parameter is that it indicates if we want partial hashing (for long strings) or full hashing: if its value is 8 or lower, we want the full hash. We can give it a larger value to indicate the *maximum* allowed step that the function will use or 0 if there's no maximum, and for normal strings that parameter could be 0 (so at start of the function it will just reduce this maximum if it's non-zero, according to the string length, or will just use the "optimal" value which warranties a reasonnable maximum number of loops for very long strings.

The "optimal" step for the loop is just based on the number of bits in ((l+7)&~7)-16 (only when l>(16+wordsize), because for l<(16+wordsize), the string is still fully hashed along with its length).

Also I'm not sure that a backwards loop is faster than a forwards loop (that would just increment the "str" pointer by "step", whereas the "l" counter used in the loop would just use a simple decrement by 1 and a lower bound set to 0 (for which the C compiler can generate more efficient native code, without any immediate value in a single decrement and test instruction, e.g. on x64, x86, x80, z80, PPC, ARM32, ARM64, 68k, 65xx, and Sparc). This tight loop would just require 3 registers: the str pointer in a dedicated address register, the loop counter in a dedicated fast data register, and the intermediate hash in a second data register.

Also I'm not convinced that the current hashing method using XOR, and a multiplication by 32.25 (implemented by shifts which may be slow) before adding a new word from the string is the best one. The multiplication factor (32.25 here, equivalent to the factor 129 with a final division by 4) may as well be faster on some platforms using an integer multiplication

And then there's no need to divide by 4 if you use a better factor depending on wordsize; the factor 129 is only good if wordsize is 2, but **IT IS NOT A PRIME** as it is divisible by 3; a better factor would be 31.75 for wordsize=2, i.e. "+(h>>2)" should have been "-(h>>2)". Good Mersene primes are:

* 2^7-1 = 127 (for wordsize=2): h^= ((h<<5) - (h>>2) + unit; equivalent to: h^= h*127/4 + byte (if you hash by successive 8-bit units)
* 2^17-1 = 131071 (for wordsize=4): h^= ((h<<15) - (h>>2) + unit ; equivalent to: h^= h*131071/4 + word(if you hash by 16-bit units)
* 2^31-1 = 2147483647 (for wordsize=8): h^= ((h<<29) - (h>>2) + unit; equivalent to: h^= h* 2147483647/4 + word(if you hash by 32-bit units)

Using Mersene primes allows optimization of multiplications using a pair of shifts and a single addition or substraction. Using an integer multiplication may be faster on some platforms than using shifts (if the CPU has no barrel shifter in its ALU and the number of cycles depends on the number bits to shifts)

But the code using ((h<<5) + (h>>2) + cast_byte(str[l - 1])) is definitely wrong, it uses a Mersenne number but it is definitely not a prime : it gives poor hashes because it has a very small factorization factor 3: collisions occur too often (so it is extrememly trivial to be targetted by DoS attacks).

I would recommand using an optimization with wordsize=8, and the Mersenne prime 2^31-1, and then hash by units of 64-bits (with alignment): not only it will be faster, more bytes will be hashed (at least the first 8 bytes and the 1 to 7 last bytes, hashing the rest of long strings by group of 8 bytes starting from the last 8 bytes (aligned down so that the last 8 to 15 bytes will also always be hashed)

Le ven. 22 mai 2020 à 02:21, Andrea <andrea.l.vitali@gmail.com> a écrit :
I just want to confirm my finding

This is the code I see here https://github.com/lua/lua/blob/master/lstring.c

unsigned int luaS_hash (const char *str, size_t l, unsigned int seed,
size_t step) {
unsigned int h = seed ^ cast_uint(l);
for (; l >= step; l -= step)
h ^= ((h<<5) + (h>>2) + cast_byte(str[l - 1]));
return h;
}

From this code it is clear that the subset of characters sampled does not depend on the seed but only from the length of the string and the step.

Therefore if you build strings that only differs in characters that are not sampled, you are going to get same number of collisions. Only the buckets were the collisions happen will change, depending on the seed.

Does this mean that Lua may be susceptible to hash DoS attacks?

What is the purpose of the seed? it seems that the only effect is to randomize performance from one run to another...

Andrea

--
Andrea Vitali