lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


On Mon, Jul 6, 2015 at 2:14 PM, Cosmin Apreutesei
<cosmin.apreutesei@gmail.com> wrote:
>> The point is that you don't currently have a non-leaky implementation.
>
> I'm not sure I follow. If it's leaky then that's a bug. Can you show me
> please?
>

Create a tuple. Destroy the tuple. The index tree keeps all of the
intermediate nodes that it had to add in order to pick up that tuple,
and only the leaf node is actually removed.

If you have tuples (1, 2, 3) and (2, 2, 3) and destroy both then
you've leaked four tables that have no useful children:

{
  [1] = {
    [2] = {}
  },
  [2] = {
    [2] = {}
  }
}

The resolution, if minimizing memory usage is your goal, is to
implement a __gc method on your tuple objects that walks up the tree
and removes empty branches. This does, of course, make GC sweeps more
expensive, and it means that a subsequent insertion is more expensive,
so it's again a time/space tradeoff.

>> The standard isn't "never make tradeoffs." The standard is "make sure
>> the tradeoffs are worth it" with a side of "avoid breaking/penalizing
>> existing code if you can". And one tradeoff that's ALMOST NEVER worth
>> it is making an extremely common operation more expensive. It's ALMOST
>> ALWAYS preferable to add a little bit more per-call cost to a
>> less-frequent scenario.
>
> What specifically is the overhead and what is the operation?

This is a general statement in response to your general statement of
all engineering being tradeoffs.

Applied to the discussion at hand, the relevant operation would be
table indexing. The overhead is hypothetical, something to think about
to inform whatever decisions would be made, not a specific critique of
any proposed implementation.

>> > Why would they get more expensive? It's the same hashmap algorithm
>> > except
>> > now the hash function is a sum of the hashes of the elements and I don't
>> > think anyone would object that that should be O(#t) as it is for
>> > strings.
>> > What am I missing?
>>
>> It's not O(#t) for short strings (that is, the kind you're most likely
>> to use as table keys), it's O(1) thanks to string interning.
>
>
> Creating a new string implies hashing it (which implies traversing at least
> part of it). It would be the same with tuples. Hashing is O(min(#s,
> some_threshold)) in both cases. The rest of the lookup algorithm would be
> exactly the same. What am I missing?

What you're missing is that there are two different operations being
discussed here: constructing a value, and indexing into a table.

You don't hash a string every time you want to look up a value in a
table. You intern it once, and then the interned value can be used to
index into a table in O(1) time.

However, if you DON'T intern tuples in advance or memoize their hash
values, then you have to deal with an O(n) hashing operation every
time you index into a table. (The decision between interning, advance
memoization, and lazy memoization is a bit deep, since the tradeoffs
there can be subtle, and there's a lot of them.)

The difference between these methods gets increasingly significant
when you cache tuples in variables.

>> A built-in tuple would either have more garbage from having multiple
>> deep copies of non-unique tuples,
>
> No deep copies are necessary. Traversing the elements of the tuple doesn't
> have to be a recursive operation.

Sorry, I didn't mean to imply a copy operation when talking about deep
copies. Rather, it would come up innocuously:

local x = tuple(1, 2, 3)
local y = tuple(1, 2, 3)

x and y COULD be references to the same value (interning) or they
could be element-wise deep copies of each other (hashing).

>> or more long-term memory usage from
>> having a canonical copy of the tuple interned in a pool.
>
> I'm not sure how interning increases memory usage. Can you explain that
> point?

Interning requires you to maintain a pool of values. If you don't have
a mechanism in place to reap gc'ed values (and AFAIK Lua's string pool
doesn't reap gc'ed strings) then the pool's memory usage is
monotonically increasing.

If you have a bunch of unique tuples that are only used for a short
period at a time (for example if you use tuples to represent
outstanding RPC calls or something) then interning without reaping
starts using up a lot of memory, and interning with reaping is more
expensive on the GC side than using element-wise hashing.

If you have a selection of tuples that you use a lot (compound lookup
keys, perhaps) then interning turns into a huge savings since you
don't have lots of copies of identical data lying around.

>> All in all: To me it strongly feels like tuples would be better served
>> by library code (or maybe a power patch) than core code. They aren't
>> broadly used enough to warrant the corresponding increase in Lua's
>> footprint, and the people who would benefit from its existence can use
>> a library or patch.
>
>
> Now we're back to what I was saying about lib coders vs. app coders. How did
> you form the opinion that they aren't broadly used in apps? Multi-value
> indexing is fundamental in any data-rich app (games, web crawling, web apps
> -- I can't remember an app where I haven't used them).

This isn't really a lib-vs-app distinction. Tuples are one of those
things that you use a lot when you're accustomed to using them, and if
you haven't used them much in the past you won't jump to adopt them if
you're introduced to them. (A large proportion of the tuple users I've
met are Python converts.)

I've only used compound data types as keys in a very small number of
cases. Most of the time that tuples would be useful I tend to just use
nested data structures instead. (Unless the low-order keys in your
data set are particularly sparse, this doesn't introduce much waste.)
I'm having trouble even imagining where I would use a tuple while
implementing a game, outside of the networking code. An application
that's HEAVILY data-rich is going to be sending accesses to a database
anyway.

/s/ Adam