[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: Ideas about implementing a string type?
- From: David Kastrup <dak@...>
- Date: Sun, 01 Apr 2007 21:46:07 +0200
Rici Lake <lua@ricilake.net> writes:
> That sort of application is pretty common, at least in my coding.
> Such an application might read a file a line at a time, and try to
> parse it into tokens. The tokens would likely be present in the
> intern table (and would likely be used as keys in a table), but the
> input lines themselves are likely to be unique.
>
> So, yes, it's a waste of time to hash the input lines. But the
> question is, how much time is being wasted, compared with any API
> complications which result from not doing it? (See below for some
> more thoughts on this.)
I think that the penalty for this approach don't really show much as
long as one stays entirely within Lua.
It becomes more of a problem when Lua is an extension language and the
data structures of the surrounding application are not those of Lua.
Strings are one of the most common data types, and so it would be nice
if there was a lightweight manner of passing those into Lua, if all
one is going to do with them is string manipulation.
I have now realized that Lua strings are allocated right behind the
data structure describing their Lua variable.
Now let us assume a "lightweight" string type intended to be used with
lazy hashing and avoiding copying where possible.
Instead of the string following the string record, we will have the
following:
a) a pointer to the memory containing the string
b) a pointer to a function to call with the address in a) when Lua no
longer needs the string
Of course, I am somewhat guilty of having TeX in mind again: in TeX,
strings in TeX's "string pool" are allocated on a stack-like
structure. Except for the top of stack, no string from the string
pool ever gets deallocated again, so "b)" would be a noop. However,
other applications might often have constant strings, too, so the
usefulness of the trivial case is not completely restricted to TeX.
Now what semantics seem useful for such lightweight strings? If
hashing is supposed to be lazy, it means that identity of strings
can't be established when the variable is created (and we probably
should not move it afterwards, I guess judging from what I understand
from the design documents and discussions here). When using strings
as indexes, we should for that reason use their string hash for
establishing the proper hash chain, not the variable address itself.
Now if we walk through some hash chain and find a value comparing
equal, we can replace the pointer of this lightweight string to the
address where the other string is. In that way, future comparisons
for equality can be resolved quickly, as strings sharing the same data
address and length can't be different.
More problematic is the case where we have an equal hash value but a
different string. Non-uniqueness of the variable itself will mean
that we will actually need to compare those strings every time we
encounter them again. While in the common use case, hash chains will
be short enough not to make this too much of a concern, it is somewhat
dissatisfactory.
One solution would be to then properly create a normal Lua string
hashed and interned in the string pool and then make the lightweight
record point to this Lua string (which will then be uniquely
representing the given string like a proper Lua string does).
In that manner, the hash disambiguation will again work in future
without having to look at string content itself.
If lightweight strings are optional, this indirection idea might not
be worth the trouble, though. One could instead just accept that
native strings might be faster in the case of hash collisions so that
one should not use lightweight strings if that use case was really
important.
[benchmarks]
> len hashonly copyonly
> 1 0m0.160s 0m0.368s
> 2 0m0.184s 0m0.368s
> 3 0m0.204s 0m0.396s
> 4 0m0.232s 0m0.344s
> 5 0m0.260s 0m0.368s
> 6 0m0.300s 0m0.368s
> 7 0m0.344s 0m0.396s
> 8 0m0.368s 0m0.344s
> 9 0m0.412s 0m0.364s
> 10 0m0.440s 0m0.364s
> 11 0m0.468s 0m0.396s
> 12 0m0.524s 0m0.344s
> 13 0m0.556s 0m0.368s
> 14 0m0.624s 0m0.364s
> 15 0m0.656s 0m0.664s
> 16 0m0.680s 0m0.340s
>
> I don't pretend to understand the discrepancy at length = 15; I did
> the test eight times and it's consistent. It also comes up high for
> length = 17 and 18, but not for 31. But that might be just this
> particular CPU. Anyway, the code is available for anyone who wants
> to give it a try.
One possibility would be that at len=15 (add one byte for
0-termination), the length of string variable and content together
matches a size that makes malloc and/or free do extra work for some
reason.
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum