lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


David Kastrup wrote:
Mike Pall <mikelu-0703@mike.de> writes:

David Kastrup wrote:
Such strings would not get hashed unless one used them as a table
index, and they would not get unified or copied either unless
necessary.
The problematic words are "unless necessary". So how/where do you
decide when the memory holding the string data is no longer valid
and needs to be copied? (Well, a pooling approach might work.)

Stop.  I was talking about hashing, you are talking about copying.
Two different things.  While I admit that avoiding copying might at
some point also be interesting, this is a problem that affects
performance orders of magnitude less.

Your original message says "unified or copied"; I thought that
referred to copy avoidance, too.

I'm not 100% convinced by the argument that hash avoidance is
less of an issue than copy avoidance. In the case of long strings
which are not currently in the Lua string universe, copying
probably dominates hashing in the interning process; as you
say, most string comparisons terminate after a few characters
so if the string is not currently present, the cost of hashing
is the computation of the hash code (a maximum of 32 loops)
and a few probes into the intern table all of which will
quickly return "unequal". However, copying a million bytes
(say) is time consuming.

If, on the other hand, the string is already in the Lua
string table (in which case it probably isn't very long),
then it is fairly likely that it will at some point be
used for equality comparison or as a table key. In the
former case, the equality comparison will effectively
have been done in advance when the string was interned;
you could look at it as optimistic memoization of equality.
If the string is ever going to be used in an equality
comparison which returns true, then the memoization pays
off.

As you mention below, copy avoidance is complicated because
it requires a delicate dance between Lua and the host
application to allow for joint memory management. I think
the callback solution is too much of a price to pay: it
would mean adding more overhead to every string header,
and it would complicate both the Lua internals and (more
importantly) the API.

A core string rope type is, in my opinion, a more promising
approach; the interface would be somewhat similar to the
existing luaL_buffer implementation -- which is delightfully
simple to use -- except that it wouldn't take over the
stack. It might also be useful to have a non-interned mutable
fixed-length string (NIMFLS) type (for use by the API only)
which could effectively take the place of malloc() and
would allow smallish strings to be built up prior to
interning; the key would be making it possible to convert
a NIMFLS into a string without copying, which would mean
they'd have to have compatible headers. The advantage of
NIMFLSs is that they would be memory managed by Lua.
The combination of ropes and NIMFLSs would probably allow
for efficient string handling, without creating too much
API complication. (Few string libraries allow the use
of ropes, which is a pity; however, i/o operations at
least typically allow the data to be provided in the
form of an iovec; being able to create an iovec from a
rope would be a nice plus.)



If one fears that memory will fill up with identical strings
otherwise, one can hash and fold strings as part of garbage
collection, but that's a different story.

But what about dangling pointers on the C side from previous calls
to lua_tostring()?

If we are talking about copy avoidance, it might be possible to have
the C routine pass a callback to lua_tostring_inplace for when the
string is no longer needed (if allocated via malloc, this could be
free when C itself does not need to retain the string separately from
Lua).  And have a routine that C can call in order to tell Lua that
the memory containing string is going out of scope _now_ (like when it
is allocated with alloca and C is about to exit the block where this
has been done), in which case Lua will create its own copy of the
contents if it still needs the string and has not copied it yet.

If the string is a string constant, the callback would be a routine
doing nothing.

And you still need to allocate a fixed size object on the heap for
every lua_pushlightstring(). And then you have the added complexity
of dealing with two different string types everywhere in the Lua
core. I'm not sure this pays off, especially for small strings,
which make up the majority in most scenarios.

Lazy strings could be the default, then no separate type would be
required.

For huge strings (like whole files) a string buffer approach (using
userdata objects) might be more appropriate. But then you'll face
the problem that most library functions expect strings and won't be
able to use this object type. Some ideas about how to rectify this
have been floating around (but one really needs to avoid copying the
data).

Being able to avoid _hashing_ the data until needed would already make
quite a difference.  Avoiding copying appears possible with some
effort and additional string creation functions, at the cost of laying
some more management tasks at the feet of the programmer wanting to
make use of this.

OTOH you could try to speed up string hashing (luaS_newlstr).  I've
personally given up on this. Either the hash quality deteriorates too
much or the compiler produces slower code.

One should open-code the special case of stride 1, and one should do
the hashing from the start rather than from the end of the string.
That would already speed up the most common cases.

But of course being able to avoid calculation of the hash code and
looking it up in the global string table (with the occasional
resulting need to grow this table) would save much much more time in
the case of throwaway strings not used for indexing.