lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


Rici Lake <lua@ricilake.net> writes:

> David Kastrup wrote:
>> Mike Pall <mikelu-0703@mike.de> writes:
>>
>>> David Kastrup wrote:
>>>> Such strings would not get hashed unless one used them as a table
>>>> index, and they would not get unified or copied either unless
>>>> necessary.
>>> The problematic words are "unless necessary". So how/where do you
>>> decide when the memory holding the string data is no longer valid
>>> and needs to be copied? (Well, a pooling approach might work.)
>>
>> Stop.  I was talking about hashing, you are talking about copying.
>> Two different things.  While I admit that avoiding copying might at
>> some point also be interesting, this is a problem that affects
>> performance orders of magnitude less.
>
> Your original message says "unified or copied"; I thought that
> referred to copy avoidance, too.

I have to agree that in my original message I conflated two
optimizations about which I had thought about the same time.  But I
later realized that they can be arranged such that you'll need two
bullets for shooting the whole proposal down.

> I'm not 100% convinced by the argument that hash avoidance is less
> of an issue than copy avoidance.

You probably meant this the other way round?

> In the case of long strings which are not currently in the Lua
> string universe, copying probably dominates hashing in the interning
> process;

I very much doubt it.

> as you say, most string comparisons terminate after a few characters
> so if the string is not currently present, the cost of hashing is
> the computation of the hash code (a maximum of 32 loops)

Those are expensive loops.

> and a few probes into the intern table all of which will quickly
> return "unequal". However, copying a million bytes (say) is time
> consuming.

I doubt that million byte strings are the norm.  And I doubt that
you'll _ever_ be in the situation that million-byte strings will
be used for comparison or indexing, and thus the whole hashing
operation is a waste of time.

> If, on the other hand, the string is already in the Lua string table
> (in which case it probably isn't very long), then it is fairly
> likely that it will at some point be used for equality comparison or
> as a table key.

Not at all.  The main application of strings is passing strings on,
string manipulation a la printf, maybe sorting them (in which case the
hashing does not buy a thing since it is only useful for checking
equality, not order).  Table lookup and exact equality comparison are
rather rare operations with regard to what people tend do with
strings.  The dominance of tables in Lua shifts the balance (after
all, one accesses structure records in that manner), but for strings
for which the source is not Lua program code, I doubt that hashing
them unconditionally makes much sense.

> In the former case, the equality comparison will effectively have
> been done in advance when the string was interned; you could look at
> it as optimistic memoization of equality.

Given that the hashing is likely to be more than an order of magnitude
slower than just doing a copy, we are not talking about a single
comparison here.  And given that the processing of an input string
usually is over once it has compared as equal, the _only_ case where
the hashing ever has a chance to amortize is when effectively a large
number of _unequal_ comparisons are expected otherwise.  And those
will in Lua pretty much exclusively happen in the form of a table
lookup.  So that is when one should do the hashing.

> If the string is ever going to be used in an equality comparison
> which returns true, then the memoization pays off.

No.  It pays off when the string is ever going to be checked to be
_unequal_ to a large number of other strings.  A single equality
comparison will be done about 10 times faster with a direct comparison
rather than with hashing.

> As you mention below, copy avoidance is complicated because it
> requires a delicate dance between Lua and the host application to
> allow for joint memory management.

Yes.

> I think the callback solution is too much of a price to pay: it
> would mean adding more overhead to every string header, and it would
> complicate both the Lua internals and (more importantly) the API.

I think the price would be moderate, but it certainly is not
negligible.  It would help to lower the performance barrier of passing
material in and out of Lua that is not per se structured according to
the needs of Lua.  For an application like LuaTeX which has its
primary central data structures historically determined without any a
priori consideration of Lua, this would help to create more efficient
and/or transparent ways of calling Lua hooks at various
performance-sensitive points.

> A core string rope type is, in my opinion, a more promising
> approach; the interface would be somewhat similar to the existing
> luaL_buffer implementation -- which is delightfully simple to use --
> except that it wouldn't take over the stack. It might also be useful
> to have a non-interned mutable fixed-length string (NIMFLS) type
> (for use by the API only) which could effectively take the place of
> malloc() and would allow smallish strings to be built up prior to
> interning; the key would be making it possible to convert a NIMFLS
> into a string without copying, which would mean they'd have to have
> compatible headers. The advantage of NIMFLSs is that they would be
> memory managed by Lua.  The combination of ropes and NIMFLSs would
> probably allow for efficient string handling, without creating too
> much API complication. (Few string libraries allow the use of ropes,
> which is a pity; however, i/o operations at least typically allow
> the data to be provided in the form of an iovec; being able to
> create an iovec from a rope would be a nice plus.)

I have to invest some time before commenting on this proposal.  I have
difficulties seeing how it would address the same sort of problem I
was thinking about.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum