On Tue, Jan 31, 2012 at 6:47 PM, Doug Currie <doug.currie@gmail.com> wrote:
> On Jan 31, 2012, at 7:05 PM, Josh Haberman wrote:
> > This is the key result that I realized but couldn't find
> > published anywhere, which made me wonder whether this was previously
> > known.  All algorithms I could find published used the (much) more
> > complicated and expensive removal algorithm.
>
> … which saves a next pointer at every bucket.

The published algorithms I am referring to *also* have a "next"
pointer at every bucket (ie. they do not "save" a next pointer).
Everything I have mentioned in this thread concerns the same hashing
structure Lua uses: ie. *with* a "next" pointer.  For example, the
link in my original message:
  http://www.brpreiss.com/books/opus4/html/page234.html#SECTION009514000000000000000

This slow and complicated algorithm is for removing items from
a hash table that uses chained scatter (ie. *with* next pointers).
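
For concreteness, here is roughly the structure in question, as a
simplified C sketch (the field and type names are mine, not Lua's
actual ltable.c definitions): one flat array of buckets, where each
bucket carries its own "next" index, so collision chains thread
through the array itself instead of through separately allocated
nodes.

  typedef struct Entry {
    int used;    /* 0 = free bucket, 1 = occupied */
    int key;     /* stand-in for a real key type */
    int value;
    int next;    /* index of the next entry in this chain, or -1 */
  } Entry;

  typedef struct Table {
    Entry *entries;  /* 'size' buckets, all starting as {0, 0, 0, -1} */
    int size;
  } Table;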

Here is another example of a page that says that deletion from
a chained scatter table is O(m^2), and that chains can coalesce:
  http://tanksoftware.com/tutes/uni/hashing.html#ChainedScatter
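
To see where the coalescing comes from, here is a textbook-style
insert on the sketch above (the helper names are made up).  If the
home bucket is already occupied, the new key is placed in some free
bucket and appended to whatever chain currently runs through its home
bucket, even when the occupant of that bucket belongs to a different
chain.  That is how two chains end up sharing a tail, and why removal
then has to repair several logical chains at once.

  static int hash(int key, int size) { return (unsigned)key % size; }

  static int find_free(Table *t) {          /* any unused bucket, or -1 */
    for (int i = 0; i < t->size; i++)
      if (!t->entries[i].used) return i;
    return -1;
  }

  static int naive_insert(Table *t, int key, int value) {
    int h = hash(key, t->size);
    Entry *e = &t->entries[h];
    if (!e->used) {                         /* home bucket is free */
      e->used = 1; e->key = key; e->value = value; e->next = -1;
      return 1;
    }
    int f = find_free(t);
    if (f < 0) return 0;                    /* table is full */
    while (e->next != -1)                   /* walk to the end of the chain */
      e = &t->entries[e->next];
    t->entries[f].used = 1;
    t->entries[f].key = key;
    t->entries[f].value = value;
    t->entries[f].next = -1;
    e->next = f;                            /* append, possibly onto a foreign chain */
    return 1;
  }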

Another example:

  "The previous section has shown that the worst-case running time to
  insert or to find an object into a chained scatter table is O(M).
  The average case analysis of chained scatter tables is complicated
  by the fact that lists coalesce. However, if we assume that chains
  never coalesce, then the chains which appear in a chained scatter
  table for a given set of items are identical to those which appear
  in a separately chained hash table for the same set of items.

  Unfortunately we cannot assume that lists do not coalesce--they do!
  We therefore expect that the average list will be longer than   and
  that the running times are correspondingly slower. Knuth has shown
  that the average number of probes in an unsuccessful search is
  <math follows>"

  -- http://www.brpreiss.com/books/opus5/html/page238.html

So it appears that Knuth has analyzed this also, though I don't own
TAOCP volume 3 so can't look it up.

In fact, it seems to be so widely accepted that chained scatter *must*
lead to coalescing that many authors call it "coalesced hashing"
instead of "chained scatter".  For example, see this Wikipedia page:
  http://en.wikipedia.org/wiki/Coalesced_hashing

It seems hard to believe that all of these people (Knuth in
particular) would have missed this simple technique for preventing
chains from coalescing in a chained scatter table.  That's why I'm
wondering if I'm missing something.  But the more I think about it
the more I'm convinced that this technique works.  If I'm wrong,
though, I'm hoping someone will find the flaw in my argument before
I get too attached to the idea.  :)

The amazing thing about chained scatter with Brent's variation (to
avoid coalescing) is that it appears to be optimal in almost every
way.  As with separate chaining (ie. where the table is just pointers
to separately-allocated linked list nodes), each chain is exactly as
long as the number of keys that hash to that bucket, so your lookup
performance is limited only by the non-uniformity of your hash
function.  This is true even at 100% load.  Unlike separate chaining,
you don't have to malloc() or free() anything on inserts or
deletions, except when a resize is required.  The cache behavior is
better than separate chaining because all the data lives in a single
contiguous table.  The memory usage will be better than separate
chaining because every hash bucket can be used (to hold an
overflowing entry) whether anything hashes there or not.  And lookup
should be faster than with open addressing because the chain lets you
jump straight to the next colliding key instead of probing through
unrelated slots.
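
To make that concrete, here is a minimal sketch of insert and lookup
with Brent's variation, reusing the Entry/Table layout and the
hash()/find_free() helpers from the sketches above (again, the names
are mine; this is not the code in ltable.c).  The invariant is that
every entry is either in its main position or on the chain rooted
there; when the main position of a new key is occupied by an entry
that does not itself hash there, that intruder is the one that gets
moved, so chains never coalesce.

  #include <stddef.h>   /* NULL */

  static int sc_insert(Table *t, int key, int value) {
    int mp = hash(key, t->size);            /* main position of the new key */
    Entry *e = &t->entries[mp];
    if (!e->used) {                         /* main position is free */
      e->used = 1; e->key = key; e->value = value; e->next = -1;
      return 1;
    }
    int f = find_free(t);
    if (f < 0) return 0;                    /* a real table would resize here */
    int omp = hash(e->key, t->size);        /* occupant's own main position */
    if (omp == mp) {
      /* Occupant belongs here: put the new key in the free bucket and
         splice it into this chain right after the head. */
      t->entries[f].used = 1;
      t->entries[f].key = key;
      t->entries[f].value = value;
      t->entries[f].next = e->next;
      e->next = f;
    } else {
      /* Occupant is an intruder from another chain: move it to the free
         bucket, fix its predecessor's link, and take over mp. */
      Entry *prev = &t->entries[omp];
      while (prev->next != mp)              /* find the intruder's predecessor */
        prev = &t->entries[prev->next];
      prev->next = f;
      t->entries[f] = *e;                   /* relocate intruder (keeps its 'next') */
      e->key = key; e->value = value; e->next = -1;
    }
    return 1;
  }

  /* Lookup only ever walks the chain rooted at the key's main position. */
  static Entry *sc_find(Table *t, int key) {
    int i = hash(key, t->size);
    if (!t->entries[i].used) return NULL;   /* empty main position: absent */
    while (i != -1) {
      Entry *e = &t->entries[i];
      if (e->key == key) return e;
      i = e->next;
    }
    return NULL;
  }

The only extra cost relative to the naive insert is the eviction in
the else branch; in exchange, every chain contains only keys that
share a main position, which (as far as I can tell) is exactly what
makes the simpler removal possible.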

Josh