David Kastrup wrote:

> Uhm, where hashing is of any use, the string is used for equality
> comparison and/or table lookup.  I'd be surprised if 124 was not
> already about an order of magnitude above the typical size for those
> applications.  One exception I could think of: an application that
> outputs only input lines that have not occurred previously.  In that
> case, long keys for a table make sense.

That sort of application is pretty common, at least in my coding.
Such an application might read a file a line at a time and parse
each line into tokens. The tokens would probably already be present
in the intern table (and would typically be used as keys in a table),
but the input lines themselves are likely to be unique.

So, yes, it's a waste of time to hash the input lines. But the
question is, how much time is being wasted, compared with any
API complications which result from not doing it? (See below
for some more thoughts on this.)
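
To make that concrete, here is roughly what such a filter looks like
through the C API (a sketch, not code from any real program): every
line goes through lua_pushlstring, so every unique line is hashed and
interned even though it will never be looked up again.

#include <stdio.h>
#include <string.h>
#include <lua.h>

/* Print only lines not seen before, using a Lua table as the 'seen'
   set.  Each line is pushed with lua_pushlstring, which hashes and
   interns it -- exactly the cost under discussion. */
static void filter_unique (lua_State *L, FILE *f) {
  char buf[4096];
  lua_newtable(L);                         /* the 'seen' set */
  while (fgets(buf, sizeof(buf), f) != NULL) {
    lua_pushlstring(L, buf, strlen(buf));  /* hash + intern the line */
    lua_pushvalue(L, -1);
    lua_rawget(L, -3);                     /* fetch seen[line] */
    if (lua_isnil(L, -1)) {                /* first occurrence */
      fputs(buf, stdout);
      lua_pop(L, 1);                       /* drop the nil */
      lua_pushboolean(L, 1);
      lua_rawset(L, -3);                   /* seen[line] = true */
    }
    else
      lua_pop(L, 2);                       /* drop value and line */
  }
  lua_pop(L, 1);                           /* drop the set */
}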

I was encouraged by the fact that the hash function could
hash about 8 million strings per second, which seems
adequate, although I take your point about the cache.
(See below, too.)
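
For reference, here is a per-byte hash in the style of Lua's own
luaS_hash; the timings below clearly reflect a full per-byte pass,
whereas stock Lua 5.1 skips characters on long strings and hashes at
most 32 or so of them, so a pass like this one is the worst case:

#include <stddef.h>

/* Per-byte variant of Lua 5.1's string hash (the stock version uses
   a step to skip characters on long strings). */
unsigned int hash_string (const char *str, size_t l) {
  unsigned int h = (unsigned int)l;        /* seed with the length */
  size_t i;
  for (i = l; i > 0; i--)
    h = h ^ ((h << 5) + (h >> 2) + (unsigned char)str[i - 1]);
  return h;
}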


> If we take, for example, a look at the string pool of TeX, the
> following string lengths appear here with the following frequencies:

<snip interesting data>


> The median of the string size is 13.  This is likely somewhat shorter
> than with other applications since TeX has neither multi-line strings,
> nor does it use format control strings for output.

Yes, that's certainly shorter than in most of my applications, but not
drastically so. It's a bit hard to measure, though, because a lot of
the short strings I use are keys which are used repeatedly. I do go to
some trouble to ensure that C code doesn't unnecessarily hash constant
strings (one simple way of doing this is to recycle the code which Lua
uses to ensure that metamethod names don't need to be hashed every
time a metamethod lookup is done).
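
For C code that can't reach Lua's internals, a similar effect is
available through the public API: intern the constant once and keep a
registry reference to it. A sketch, with names of my own invention:

#include <lua.h>
#include <lauxlib.h>

/* Intern a constant key once and remember a registry reference to it,
   so later uses fetch the already-interned string instead of rehashing
   the C literal every time.  'key_ref' and the key itself are my
   names, not anything in the Lua API. */
static int key_ref = LUA_NOREF;

static void push_cached_key (lua_State *L) {
  if (key_ref == LUA_NOREF) {
    lua_pushliteral(L, "my_constant_key");     /* hashed+interned once */
    key_ref = luaL_ref(L, LUA_REGISTRYINDEX);  /* also anchors it from GC */
  }
  lua_rawgeti(L, LUA_REGISTRYINDEX, key_ref);  /* no rehash: just a fetch */
}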

> Still, at 13 the cost for copying from your benchmark would be about
> 6ms, that for hashing about 40ms.  A speedup of 5 for the median case
> is certainly worth something.

Well, sure. I never claimed that the cost of copying was large for
short strings, only for largish strings. For short strings, it appears,
the dominant cost is neither hashing nor copying, but rather malloc()
and free(). Hashing and interning avoids this cost as well in the
case of repeated short strings: even in TeX, I would guess, that's
not an uncommon case.
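
That saving comes from the intern-table lookup: a repeated string is
found by hash and memcmp, and the existing object is simply returned,
so no allocation happens at all. Schematically (simplified from
luaS_newlstr; the structure is mine, not Lua's, and error checking is
omitted):

#include <stdlib.h>
#include <string.h>

typedef struct StringObj {
  struct StringObj *next;     /* hash-bucket chain */
  size_t len;
  unsigned int hash;
  char data[1];               /* the bytes follow the header */
} StringObj;

unsigned int hash_string (const char *str, size_t l);  /* as above */

StringObj *intern (StringObj **buckets, size_t nbuckets,
                   const char *str, size_t l) {
  unsigned int h = hash_string(str, l);
  StringObj *o;
  for (o = buckets[h % nbuckets]; o != NULL; o = o->next)
    if (o->len == l && memcmp(o->data, str, l) == 0)
      return o;               /* repeated string: no malloc, no free */
  o = malloc(sizeof(StringObj) + l);   /* only unique strings allocate */
  o->next = buckets[h % nbuckets];
  o->len = l;  o->hash = h;
  memcpy(o->data, str, l);
  o->data[l] = '\0';
  buckets[h % nbuckets] = o;
  return o;
}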

> Yes.  For long strings, hashing is comparatively cheap.  But I think
> that it will also be comparatively useless for that case (since those
> long strings are unlikely to be used as keys), so while it may be
> cheap, it will still not give a good return on investment.

Probably true, but again, the savings from eliminating it are
negligible, so it would only be worth doing if it doesn't bring
more complications with it.

> If one reserves the value "0" for an uncomputed hash, the necessary
> check for 0 will be comparatively cheap (worst case is a missed
> branch prediction).
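
Concretely, that scheme would look something like this (schematic, not
Lua's actual TString; hash_string is the per-byte hash above):

#include <stddef.h>

/* Lazily-computed hash, with 0 reserved to mean "not computed yet".
   A real hash that happens to come out 0 is nudged to 1, at a tiny
   cost in distribution. */
typedef struct LazyString {
  size_t len;
  unsigned int hash;          /* 0 means: not yet computed */
  const char *data;
} LazyString;

unsigned int hash_string (const char *str, size_t l);  /* as above */

static unsigned int lazy_hash (LazyString *s) {
  if (s->hash == 0) {                  /* the cheap check in question */
    unsigned int h = hash_string(s->data, s->len);
    s->hash = (h == 0) ? 1 : h;        /* keep 0 free as the sentinel */
  }
  return s->hash;
}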

The problem is, what do you do if you find that a string which
hasn't previously been cached is a duplicate? You can't free
the string without violating the contract between the API and
the client that it is possible to keep the address of a string
as long as that string remains on the stack. (And, in fact,
the Lua core and parser, and many of my programs, depend on
a stronger contract, which is that the address of a string
is usable as long as the string is known to be referenced
from somewhere in the Lua space, such as a table.)
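
In API terms, that stronger contract is what makes code like this safe
(a sketch; 'callback' is just a stand-in global, not part of any real
API):

#include <stdio.h>
#include <lua.h>
#include <lauxlib.h>

/* The address-stability contract in action: 'msg' points into Lua's
   interned copy of the string and stays valid across the callback,
   which may run arbitrary Lua code and trigger garbage collection,
   because the string is still anchored on the stack. */
static int report (lua_State *L) {
  const char *msg = luaL_checkstring(L, 1);
  lua_getglobal(L, "callback");
  lua_pushvalue(L, 1);
  lua_call(L, 1, 0);          /* arbitrary Lua code runs here */
  printf("%s\n", msg);        /* 'msg' is still valid */
  return 0;
}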

You could weaken the contract to exclude the use of the
string as a key or in an equality test, but that would
mean that you also could not do anything which might
invoke a Lua function (or even metamethod), which would
be pretty limiting. string.gsub, for example, relies on
both the pattern and the target string to stay where
they are, but it (usefully) allows a callback into
a Lua function to compute the replacement.

Undoubtedly there are ways of dealing with all of
that; I just don't see what they are. I've used
scripting language interfaces which do not provide
the address-stability guarantee that Lua provides,
and I have to say that a good part of Lua's charm for
me is the simplicity of the API in comparison; it
pretty well frees me from thinking about memory
management issues.

> Of course, LuaTeX is an extreme case, but I would not go so far as to
> call it pathological.

I wouldn't call it pathological, either, but it may not be
typical enough to be used to reason about common cases.

--- Last-minute addition.

I tried the benchmark with some very long strings, to get some idea of
the cache effects. This is with ten million repetitions; even so, the
last line should get *less* benefit from caching, and that's consistent
with its slightly higher difference. If you have a better idea, please
feel free to use the code.

length     donothing     hashonly     difference
      10    0m0.084s     0m0.432s       0.348s
     100    0m0.084s     0m0.988s       0.904s
    1000    0m0.080s     0m1.108s       1.028s
   10000    0m0.084s     0m1.104s       1.020s
 1000000    0m0.156s     0m1.264s       1.108s
10000000    0m0.832s     0m2.116s       1.284s

That's *just* hashing, not inserting into the intern table (as the
previous results did). Note that hashing short strings does take less
time than hashing long ones. For copying, there is apparently a large
constant cost. Here's hash and copy for short string lengths (10
million repetitions; lots of cache effects, no doubt):

len   hashonly    copyonly
 1    0m0.160s    0m0.368s
 2    0m0.184s    0m0.368s
 3    0m0.204s    0m0.396s
 4    0m0.232s    0m0.344s
 5    0m0.260s    0m0.368s
 6    0m0.300s    0m0.368s
 7    0m0.344s    0m0.396s
 8    0m0.368s    0m0.344s
 9    0m0.412s    0m0.364s
10    0m0.440s    0m0.364s
11    0m0.468s    0m0.396s
12    0m0.524s    0m0.344s
13    0m0.556s    0m0.368s
14    0m0.624s    0m0.364s
15    0m0.656s    0m0.664s
16    0m0.680s    0m0.340s

I don't pretend to understand the discrepancy at length = 15; I ran the
test eight times and it's consistent. It also comes up high for lengths
17 and 18, but not for 31. That might just be this particular CPU,
though. Anyway, the code is available for anyone who wants to give it a
try.
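
For the impatient, the harness is essentially a loop of the following
shape, run under the shell's time; this is a reconstruction of its
shape, not the code itself:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build with -DHASHONLY or -DCOPYONLY (or neither, for 'donothing')
   and run under 'time'.  Compile without heavy optimization, or the
   empty 'donothing' loop disappears entirely. */
#define REPS 10000000L

static unsigned int hash_string (const char *str, size_t l) {
  unsigned int h = (unsigned int)l;
  size_t i;
  for (i = l; i > 0; i--)
    h = h ^ ((h << 5) + (h >> 2) + (unsigned char)str[i - 1]);
  return h;
}

int main (int argc, char **argv) {
  size_t len = (argc > 1) ? (size_t)atoi(argv[1]) : 13;
  char *buf = malloc(len + 1);
  char *copy = calloc(len + 1, 1);
  unsigned int sink = 0;      /* defeat dead-code elimination */
  long i;
  memset(buf, 'x', len);
  buf[len] = '\0';
  for (i = 0; i < REPS; i++) {
#ifdef HASHONLY
    sink += hash_string(buf, len);
#endif
#ifdef COPYONLY
    memcpy(copy, buf, len + 1);
#endif
  }
  printf("%u %d\n", sink, (int)copy[0]);
  return 0;
}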