lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


On 13/12/2012 14:34, Richter, Jörg wrote:
>PS: Do you know of some scheme (or reflexion on the topic) to define a
>proportion of big strings to hash?
I am not sure I understand you. Do you mean a scheme like Lua is doing to hash
not all bytes of the string? If so, no not really.

Yes, that's what I meant. (And yes, my question was really obscure.)
I came up with this: first I always hash 8 bytes at start and 7 at end, mainly of id strings that often share common prefixes & suffixes. Thus the issue is only for len>15, in bytes (and below n=len-15). The problem is to have a formula that keeps enough of small strings (for which each byte counts more) while not giving unreasonable amounts for big ones. So, my take is something like
    k * log2(n) + n * 2^p , with now k=1 and p=10
("integer log2" is just counting rshifts until n=0)
And bytes to hash are just taken somewhere in the middle, because I cannot find a reason why anywhere else, or spread along the string, would be any better, _in general_ (all depends on actual input data).

Interesting side story: Recently I instrumented most of our hash-tables to dump
a histogram of bucket lengths. Including the Lua string table. The first output was
somewhat shocking because some buckets of the string table had 200 and more
entries.
Turns out nearly every string in this bucket was a backtrace only differing in one line
number. Hashing the whole string, or doing lazy interning, would have helped.
But we never noticed any slowdown because of that.
(This was with Lua 5.1)

Maybe because you don't spent all your free time processing backtraces? But in case of series of not-so-small strings only differing in 1 byte, there is no solution, is there? (Unless they are pre-analysed, or the "string generator" informs the string pool somehow, as in "Hash the byte #333!").

Denis

PS: About the last point, seems it would be quite simple just to add a single param to the hash func indicating a kind of most-relevant-byte (or a pair of params for a range).