[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: Hash Table Collisions (n.runs-SA-2011.004)
- From: Jay Carlson <nop@...>
- Date: Fri, 6 Jan 2012 20:40:31 -0500
I wonder if combining a secret seed with the contents of the
accumulated hash to choose how far to jump ahead each time is
sufficiently annoying. Instead of jumping ahead stepsize, jump ahead
stepsize+(h%stepsize) and mix a few bytes in. We don't need to always
use the same number of samples for a given length either, right?
On Fri, Jan 6, 2012 at 10:41 AM, Roberto Ierusalimschy
<roberto@inf.puc-rio.br> wrote:
>> > Yes, it would be great to hear "official" opinion from both Lua team
>> > and Mike.
>>
>> Well, this is not an official opinion, but one possible approach would
>> be to break strings into two variants: short strings and long strings,
>> where long strings would not be interned any more. More details later
>> (too late to write it now).
>
> If I understood the problem correctly, a variable hash seed would
> solve it, except for the skipping of characters. So, one approach
> is to implement strings using two variants: short strings (up to
> 32 bytes) and long strings.
Consider /usr/bin/md5sum and sha1sum. Their hex output is 32 and 40
characters. I've indexed tables that way, and would have been
surprised if that length crossed a line. But I'm not quite sure what
the consequences would be for boxed strings.
I would guess the next step up is lengths is indexing based on file
paths. Looking at /usr I see file paths falling off after about 64-80,
but there are longer ones too.
I notice Lua's hash reads memory backwards, presumably because common
prefixes make bit diffusion from the tail more valuable. It would be
more friendly to the memory hierarchy on some machines to read it
forward; does it really collide more? Speaking of memory hierarchy,
many uses of mid-length strings are likely to be hot, and the cost of
reading them small. My gut feeling is that about the time you'd see
word or SSE vectorization really pay off is when boxed strings would
make sense anyway. Dunno what the state of the art is for SIMD
vectorizable string-like hashes, especially for arbitrary alignment.
Perhaps there's a toy in
http://graphics.stanford.edu/~seander/bithacks.html somewhere....
[You can stop reading now, since I got distracted writing the tools to
look at the file system. Should have used Excel....]
find /home/nop -type f -print | lua -e 'max=0 hist={} for l in
io.lines() do max=math.max(max, #l) b=math.floor(#l/8)
hist[b]=(hist[b] or 0) + 1 end --[[END]] for i = 1,(max+7)/8 do x =
hist[i] or 0 print(i*8, x, string.rep("*", x/256)) end'
Yes, my lua golf handicap is terrible. Oh, can we have file:read("*z")
to read a \0-terminated line to go with find -print0 and all of its
friends? In this case I know there are no file names with newlines in
them but this burns people memorably. Just for fun:
find /home/nop -type f -print | gawk '{l=length(); if (l>m) {m=l}
b=int(l/8); h[b]++} END {print m; for(i=0;i<int(m/8);i++) {printf "%3d
%5d %.*d\n", i*8, h[i], h[i]/256, 0}}'
I'm pretty proud of myself for using printf("%.*d", len, 0) to fake
bar graphs using zero-padding. Mostly because I couldn't find
string.rep....
Jay
- References:
- Re: Hash Table Collisions (n.runs-SA-2011.004), Vladimir Protasov
- Re: Hash Table Collisions (n.runs-SA-2011.004), Leo Razoumov
- Re: Hash Table Collisions (n.runs-SA-2011.004), Vladimir Protasov
- Re: Hash Table Collisions (n.runs-SA-2011.004), Miles Bader
- Re: Hash Table Collisions (n.runs-SA-2011.004), Ashwin Hirschi
- Re: Hash Table Collisions (n.runs-SA-2011.004), Miles Bader
- Re: Hash Table Collisions (n.runs-SA-2011.004), David Kolf
- Re: Hash Table Collisions (n.runs-SA-2011.004), Miles Bader
- Re: Hash Table Collisions (n.runs-SA-2011.004), Alexander Gladysh
- Re: Hash Table Collisions (n.runs-SA-2011.004), Roberto Ierusalimschy
- Re: Hash Table Collisions (n.runs-SA-2011.004), Roberto Ierusalimschy