lua-l archive


Now for portability: how do you detect the actual wordsize? The function uses an "unsigned int" and can use "cast_uint()", but we need a portable macro for selecting the conditional code that will be compiled depending on its bit length. You may use cast_uint(-1) to check whether its value is 65535, 2^32-1 or 2^64-1, but in conditional macros (#if) this is error-prone and not portable across all implementations of C99.
You should use UINT_MAX from &lt;limits.h&gt; (whose value can be tested in #if macros) to select the correct implementation, which will allow processing aligned "uint" values read from the string using type casts on the string pointer.
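
For example, a minimal sketch of such a compile-time selection (the LUAI_HASH_WORDBITS name is only illustrative, not an existing Lua macro):
  #include <limits.h>
  /* select the unit width from UINT_MAX, which the preprocessor can
     evaluate inside #if directives */
  #if UINT_MAX == 0xFFFFu
  #define LUAI_HASH_WORDBITS 16
  #elif UINT_MAX == 0xFFFFFFFFu
  #define LUAI_HASH_WORDBITS 32
  #elif UINT_MAX == 0xFFFFFFFFFFFFFFFFu
  #define LUAI_HASH_WORDBITS 64
  #else
  #error "unsupported width of unsigned int"
  #endif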

Only the end of the string requires special processing, as it does not necessarily fill a full unsigned int (and the byte order in a uint is platform dependent: you want to hash the first bytes seen at the end of the string, which may be the least significant or the most significant bytes). If you dereference the uint pointer directly, you would hash bytes located after the end of the string, with unknown values, or possibly in unallocated memory; so a simple masking would not necessarily work (this depends on how string buffers are allocated in memory and whether they contain enough padding bytes whose content is forced to zeroes).

Memory allocators (such as C malloc()) are not guaranteed to prefill the memory with zeroes, and string buffers may have been allocated from a preexisting pool of memory which is recycled inside Lua without being freed or cleared. Maybe a solution to this problem would be for all new strings created in Lua to pad the size of their allocated buffer to a multiple of the native wordsize. But I won't bet this is the case if strings created by substring extraction from another string just reuse the same buffer instead of copying the substring into a new buffer: if this occurs, bytes after the end of the string are not padded and may expose other data that must not be part of the hash, whose value is unpredictable and not comparable with other strings of equal length and content (strings which are not necessarily interned so that a single copy exists in memory at any time).
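
As an illustration only (this is not what Lua currently does), padding the allocation up to a word multiple could look like:
  /* hypothetical: round the buffer size up to a multiple of the native
     word, so that whole-word reads at the tail stay inside the allocation;
     the padding bytes would still have to be zeroed for equal strings to
     produce comparable hashes */
  size_t padded = (len + sizeof(unsigned int) - 1) & ~(sizeof(unsigned int) - 1);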

For this reason, the end of the string is not easy to mask with a cast_uint(), and can only be processed by first building the last unit to hash from byte-per-byte reads using cast_byte(). This code would be small, limited to 1 to 8 bytes at most for wordsize=8, and can be inlined with a simple switch on the number of remaining bytes (for wordsize=8 the bytes must be widened to the unit type before shifting):
  unit = cast_byte(ptr[0]);
  switch (l & 7) {  /* remaining bytes: 1..7, or 0 meaning a full final unit of 8 */
  case 0: unit += cast_uint(cast_byte(ptr[7])) << 56;  /* fall through */
  case 7: unit += cast_uint(cast_byte(ptr[6])) << 48;  /* fall through */
  case 6: unit += cast_uint(cast_byte(ptr[5])) << 40;  /* fall through */
  case 5: unit += cast_uint(cast_byte(ptr[4])) << 32;  /* fall through */
  case 4: unit += cast_uint(cast_byte(ptr[3])) << 24;  /* fall through */
  case 3: unit += cast_uint(cast_byte(ptr[2])) << 16;  /* fall through */
  case 2: unit += cast_uint(cast_byte(ptr[1])) << 8;   /* fall through */
  case 1: break;  /* only ptr[0] is available */
  }
After this switch completes, you have a safe unit containing only the values of the last bytes, and you can safely apply the last transform: h ^= ((h*factor)>>>3) + unit (where ">>>" denotes the 3-bit rotation to the right discussed below).
All other units in the string should be aligned (including the start of the string, processed separately, and the last complete unit processed in a loop before this switch).
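
As a minimal sketch of that per-unit transform (the rotr64 helper and FACTOR constant are my own illustrative names, assuming a 64-bit unit):
  /* rotate a 64-bit value right by r bits; this is the ">>>" used in the
     formulas of this thread */
  #define rotr64(x, r) (((x) >> (r)) | ((x) << (64 - (r))))

  /* one hashing step: mix the previous hash and add the new unit;
     FACTOR would be one of the prime factors discussed below */
  h ^= rotr64(h * FACTOR, 3) + unit;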

But a good question remains: do all string values have their first byte aligned on a word boundary? If this is not the case (because string buffers are reused for substrings), you need a test at the top of the function to check the start address given, and then you'll need to do the composition byte per byte with the same switch as above to process each unit, including the first unit at the start of the string.
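
A sketch of such a test (my own code; uintptr_t comes from <stdint.h>):
  #include <stdint.h>
  /* if the start address is not a multiple of the unit size, fall back to
     the byte-per-byte composition for every unit */
  if (((uintptr_t)str & (sizeof(unsigned int) - 1)) != 0) {
    /* unaligned: build each unit with cast_byte() reads, as in the switch above */
  }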

There are also obvious ways to optimize the switch above further, to minimize the number of typecasts (the alignment constraint is known; however, the byte order is not, and would require a conditional test for whether it is MSB-first as on 68k, or LSB-first as on x86/x64/ARM). But I give only an example for clarity. Such tests can be performed by macros, and won't have any performance cost at runtime.
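
For instance, with compilers that predefine byte-order macros (GCC and Clang do; standard C99 offers no such macro), the compile-time test could look like:
  /* compile-time byte-order selection; LUAI_HASH_BIG_ENDIAN is an
     illustrative name, and compilers without __BYTE_ORDER__ would need an
     explicit configuration macro instead */
  #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  #define LUAI_HASH_BIG_ENDIAN 1
  #else
  #define LUAI_HASH_BIG_ENDIAN 0
  #endif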

On Mon, May 25, 2020 at 12:49, Philippe Verdy <verdyp@gmail.com> wrote:
Now if you use multiplications (using the primes I described), you also need to keep the rotation by 3 bits to the right after the multiplication; only the multiplication prime factor is changed to have a suitable number of bits set to 1.


On Mon, May 25, 2020 at 12:45, Philippe Verdy <verdyp@gmail.com> wrote:
Finally, the final rotation of 2 bits to the right (what I incorrectly called a "division", but where the 2 fractional bits are reinjected into the 2 high bits of the result) after the multiplication is definitely not the best: it should use a rotation of 3 bits, so that rotations do not create simple short patterns of units at constant offsets and all bits of the input are distributed equally over all bits of the resulting hash; with wordsize=2 (the current implementation), there's a predictable repetition every 4 bytes, which allows weak hashes!

With Mersenne primes:
* 2^7-1 = 127 (for wordsize=2): h^= ((h<<4) - (h>>3)) + unit; equivalent to: h^= ((h*127)>>>3) + byte; (if you hash by successive 8-bit units)
* 2^17-1 = 131071 (for wordsize=4): h^= ((h<<14) - (h>>3)) + unit; equivalent to: h^= ((h*131071)>>>3) + word; (if you hash by 16-bit units)
* 2^31-1 = 2147483647 (for wordsize=8): h^= ((h<<28) - (h>>3)) + unit; equivalent to: h^= ((h*2147483647)>>>3) + word; (if you hash by 32-bit units)


On Mon, May 25, 2020 at 12:36, Philippe Verdy <verdyp@gmail.com> wrote:
Also, if you use a plain multiplication (instead of a couple of shifts and a single addition/subtraction), you are not required to use primes that are Mersenne numbers; there are primes that distribute the bits better: good primes to use for the constant multiplication factor should have n/2 bits set to 1 and n/2 bits set to 0, where n is the bit size of words.

- for wordsize=2, n=16, so you just need to select a 15-bit prime with 8 bits set to 1 (including the lowest bit, because such primes are odd): many such primes exist (they are not Mersenne numbers), somewhere in the range from 2^14+2^8-1 to 2^15-1 (a simple sieve will detect them; select the smallest you find in that range, see the sketch after this list)
- for wordsize=4, n=32, so you just need to select a 31-bit prime with 16 bits set to 1: many more such primes exist, somewhere in the range from 2^30+2^16-1 to 2^31-1
- for wordsize=8, n=64, so you just need to select a 63-bit prime with 32 bits set to 1: many more such primes exist, somewhere in the range from 2^62+2^32-1 to 2^63-1
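
Here is a minimal sketch of that search for the wordsize=2 case (plain trial division rather than a true sieve; the names are mine):
  #include <stdio.h>

  /* trial-division primality test, enough for 15-bit candidates */
  static int is_prime (unsigned int n) {
    unsigned int d;
    if (n < 2 || n % 2 == 0) return n == 2;
    for (d = 3; d * d <= n; d += 2)
      if (n % d == 0) return 0;
    return 1;
  }

  /* count the bits set to 1 */
  static int popcount (unsigned int n) {
    int c = 0;
    for (; n != 0; n >>= 1) c += (int)(n & 1u);
    return c;
  }

  int main (void) {
    unsigned int p;
    /* scan the odd 15-bit numbers for the smallest prime with 8 bits set */
    for (p = (1u << 14) + 1; p < (1u << 15); p += 2) {
      if (popcount(p) == 8 && is_prime(p)) {
        printf("%u\n", p);
        break;
      }
    }
    return 0;
  }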




On Mon, May 25, 2020 at 12:14, Philippe Verdy <verdyp@gmail.com> wrote:
What do you think of:   unsigned int h = seed ^ cast_uint(l);
The seed is used, combined with the string length, to initialize the hash before hashing a part of the string contents (every "step" characters in the backward direction, starting from the last character).
This means that the start of the string is not always hashed and is sensitive to collisions. There's no safeguard to protect at least the start. The end of the string is also not completely hashed.
Ideally the first 6 characters and the last 6 characters should always be hashed. Short strings (12 bytes or shorter) should be fully hashed, along with their string length; long strings (longer than 12 chars) should hash the length, the first 6 chars, intermediate characters backward from position l-6 down to 7 (by an adjustable step, the step depending on l-12), and the last 6 characters.
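
A rough sketch of that scheme (my own code, with an arbitrary step formula, not anything taken from Lua):
  #include <stddef.h>

  /* always hash the length, the first 6 and last 6 bytes, and sample the
     middle bytes backward with a length-dependent step */
  static unsigned int hash_sketch (const char *str, size_t l, unsigned int seed) {
    unsigned int h = seed ^ (unsigned int)l;
    size_t i;
    if (l <= 12) {  /* short strings: hash every byte */
      for (i = 0; i < l; i++)
        h ^= ((h<<5) - (h>>2)) + (unsigned char)str[i];
    }
    else {
      size_t step = 1 + (l - 12) / 32;   /* arbitrary sampling step */
      for (i = 0; i < 6; i++)            /* first 6 bytes */
        h ^= ((h<<5) - (h>>2)) + (unsigned char)str[i];
      i = l - 7;                         /* middle bytes, sampled backward */
      while (i >= 6) {
        h ^= ((h<<5) - (h>>2)) + (unsigned char)str[i];
        if (i < 6 + step) break;         /* avoid unsigned wrap-around */
        i -= step;
      }
      for (i = l - 6; i < l; i++)        /* last 6 bytes */
        h ^= ((h<<5) - (h>>2)) + (unsigned char)str[i];
    }
    return h;
  }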

For performance reasons, it should probably hash not bytes but aligned words: the intermediate hash should have the size of a word. This means that indexing the last bytes of the string would need to be aligned or special-cased if the string length is not a multiple of the word size (some extra null padding bytes may be implied for the missing bytes; this just requires a small switch to handle the end of the string).

If we assume that wordsize is 2, 4 or 8 bytes (8 for 64-bit compilers), then instead of hashing the first 6 bytes and the last 6 bytes, we would just hash the first 8 bytes and the last 8 bytes (aligned with padding, using a small switch with 8 cases after the loop). In that case the loop for the middle would proceed on sets of 8 bytes every step, the step would be a multiple of 8, and the body of the loop would hash one, two or four words depending on wordsize.

Note: "wordsize" means here "sizeof(unsigned int)".

Not only would it be faster, it would also avoid more collisions without degrading performance.

Also, I don't see the point of passing the step value as a parameter to this function, which could just as well compute it from the given string length. The function would then be the same for all strings (short or long). The only interest of this "step" parameter is to indicate whether we want partial hashing (for long strings) or full hashing: if its value is 8 or lower, we want the full hash. We could give it a larger value to indicate the *maximum* allowed step that the function will use, or 0 if there's no maximum, and for normal strings that parameter could be 0 (at the start of the function it would just reduce this maximum if it's non-zero, according to the string length, or otherwise use the "optimal" value, which guarantees a reasonable maximum number of loop iterations for very long strings).

The "optimal" step for the loop is just based on the number of bits in ((l+7)&~7)-16 (only when l>(16+wordsize), because for l<(16+wordsize), the string is still fully hashed along with its length).

Also, I'm not sure that a backwards loop is faster than a forwards loop that would just increment the "str" pointer by "step", while the "l" counter used in the loop would use a simple decrement by 1 with a lower bound of 0 (for which the C compiler can generate more efficient native code, without any immediate value, as a single decrement-and-test instruction, e.g. on x64, x86, x80, z80, PPC, ARM32, ARM64, 68k, 65xx, and Sparc). This tight loop would just require 3 registers: the str pointer in a dedicated address register, the loop counter in a dedicated fast data register, and the intermediate hash in a second data register.

Also, I'm not convinced that the current hashing method, using XOR and a multiplication by 32.25 (implemented by shifts, which may be slow) before adding a new word from the string, is the best one. The multiplication (by 32.25 here, equivalent to the factor 129 with a final division by 4) might well be faster on some platforms as an integer multiplication.

And then there's no need to divide by 4 if you use a better factor depending on wordsize; the factor 129 is only good if wordsize is 2, but **IT IS NOT A PRIME** as it is divisible by 3; a better factor would be 31.75 for wordsize=2, i.e. "+(h>>2)" should have been "-(h>>2)". Good Mersenne primes are:

* 2^7-1 = 127 (for wordsize=2): h^= ((h<<5) - (h>>2)) + unit; equivalent to: h^= (h*127)/4 + byte (if you hash by successive 8-bit units)
* 2^17-1 = 131071 (for wordsize=4): h^= ((h<<15) - (h>>2)) + unit; equivalent to: h^= (h*131071)/4 + word (if you hash by 16-bit units)
* 2^31-1 = 2147483647 (for wordsize=8): h^= ((h<<29) - (h>>2)) + unit; equivalent to: h^= (h*2147483647)/4 + word (if you hash by 32-bit units)

Using Mersenne primes allows optimization of the multiplication into a pair of shifts and a single addition or subtraction. Using an integer multiplication may be faster on some platforms than using shifts (if the CPU has no barrel shifter in its ALU and the number of cycles depends on the number of bits to shift).

But the code using ((h<<5) + (h>>2) + cast_byte(str[l - 1])) is definitely wrong: its factor is not a prime, and it gives poor hashes because it has a very small prime factor, 3; collisions occur too often (so it is extremely trivial to target with DoS attacks).

I would recommend an optimization using wordsize=8 and the Mersenne prime 2^31-1, hashing by units of 64 bits (with alignment): not only will it be faster, more bytes will be hashed (at least the first 8 bytes and the last 1 to 7 bytes), hashing the rest of long strings by groups of 8 bytes starting from the last 8 bytes (aligned down, so that the last 8 to 15 bytes will also always be hashed).


On Fri, May 22, 2020 at 02:21, Andrea <andrea.l.vitali@gmail.com> wrote:
I just want to confirm my finding


unsigned int luaS_hash (const char *str, size_t l, unsigned int seed,
                        size_t step) {
  unsigned int h = seed ^ cast_uint(l);
  for (; l >= step; l -= step)
    h ^= ((h<<5) + (h>>2) + cast_byte(str[l - 1]));
  return h;
}


From this code it is clear that the subset of characters sampled does not depend on the seed but only on the length of the string and the step.

Therefore, if you build strings that only differ in characters that are not sampled, you are going to get the same number of collisions. Only the buckets where the collisions happen will change, depending on the seed.
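
For instance, a small standalone check (not part of the original mail) that reuses the loop above: with step 2, only the odd positions are sampled, so strings that differ only at even positions hash identically for every seed:

  #include <stdio.h>
  #include <string.h>

  /* same loop as luaS_hash above, with plain casts instead of Lua's macros */
  static unsigned int hash (const char *str, size_t l, unsigned int seed,
                            size_t step) {
    unsigned int h = seed ^ (unsigned int)l;
    for (; l >= step; l -= step)
      h ^= ((h<<5) + (h>>2) + (unsigned char)str[l - 1]);
    return h;
  }

  int main (void) {
    char a[64], b[64];
    memset(a, 'x', sizeof a);
    memset(b, 'x', sizeof b);
    b[0] = 'A'; b[2] = 'B'; b[4] = 'C';  /* change only unsampled (even) positions */
    /* prints the same hash twice, whatever the seed */
    printf("%u %u\n", hash(a, sizeof a, 12345u, 2), hash(b, sizeof b, 12345u, 2));
    return 0;
  }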

Does this mean that Lua may be susceptible to hash DoS attacks?

What is the purpose of the seed? It seems that its only effect is to randomize performance from one run to another...

    Andrea

--
Andrea Vitali