Re: ltable.c table deletion

It seems hard to believe that all of these people (Knuth in
particular) would have missed this simple technique for preventing
chains from coalescing in a chained scatter table. That's why I'm
wondering if I'm missing something.

In case anyone is interested, I was able to get my hands on Knuth v.3,

where he describes "coalesced chaining." He acknowledges that chain

coalescing can be avoided at insertion time, and attributes this

technique as being suggested by "D.E. Ferguson," who I believe is

David E. Ferguson, but Knuth does not give a specific citation. Given

that coalescing can be avoided, the name "coalesced chaining" seems an

unfortunate one for this hash table layout scheme -- "chained scatter"

seems much better (with coalescing and non-coalescing variants).

Knuth analyzes the expected cost of lookup and finds that lookup

performance improves when you prevent coalescing (because chains are

shorter), but that "there is not enough of an improvement [...] to

warrant changing the algorithm." Here are graphs of load (0=empty,

1=full) vs. expected number of entries probed for each lookup,

comparing coalesing vs. non-coalescing implementations (ie. Brent's

variation vs. just naive chained scatter). These approximate formulas

come from Knuth v.3 2nd ed. page 524 and 525:

Successful lookup (difference is subtle; ~1.5 vs. ~1.8 at 100% load):

http://www.wolframalpha.com/input/?i=1+%2B+%28e%5E%282x%29+-+1+-+2x%29+%2F+8x+%2B+x+%2F+4+vs+1+%2B+x+%2F+2+from+0+to+1

Unsccessful lookup (a bit larger; ~1.5 vs. ~2.1 at 100% load):

http://www.wolframalpha.com/input/?i=1+%2B+%28e%5E%282x%29+-+1+-+2x%29+%2F+4+and+1+%2B+%28x%5E2%29%2F2+from+0+to+1

However, Knuth doesn't mention the cost of deletion, which I believe

goes from O(n^2) to O(n) when you avoid coalescing -- he doesn't cover

the deletion algorithm in the book.

My takeaways from this are:

- this technique of avoiding coalescing is known, but not well-known.

Knuth attributes it to Ferguson and doesn't mention Brent, and to

my eye Brent's paper [0] is talking about an improvement to open

addressing, not chained scatter.

- avoiding coalescing reduces deletion time from O(n^2) to O(n) (and I

think the average case improvement would be even more drastic), but

I can't find a published citation for this anywhere. This part

doesn't matter for Lua since it doesn't perform actual deletion.

- avoiding coalescing does make lookup faster, but not drastically

so (see graphs above), and it makes insertion more expensive (since

you sometimes have to compute an extra hash and follow that hash's

chain). So if you're inserting a lot more than you're doing

lookups, avoiding coalescing ("Brent's variation") could actually be

a net loss.

It would be interesting to test whether removing Brent's variation

from the code would make a noticeable difference in benchmarks. I

think all you would have to do is remove the "is colliding node out of

its main position?" branch in ltable.c and always take the "else"

block of that conditional instead. It could theoretically make things

either faster or slower, depending on the program.

Josh

[0] http://maths.anu.edu.au/~brent/pub/pub013.html