[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: Interning strings considered harmful (somewhat)
- From: Florian Weimer <fw@...>
- Date: Tue, 03 Nov 2009 20:25:20 +0100
* Roberto Ierusalimschy:
>> The second part of the following benchmark shows distinctly non-linear
>> performance: Each pass of the first part runs under one second on my
>> machine, but the passes in the second part take 4, 14, and 32 seconds,
>> respectively.
>>
>> I'm a bit worried by this, and the impact on programs which process
>> data from untrusted inputs. I haven't got a really good idea what can
>> be done about this. In similar cases, people have used Jenkins'
>> lookup3.c hash function with a random seed. Perhaps it is sufficient
>> to drop the skipping from Lua's hash function and use a random seed
>> stored in the Lua state, but the internal mixing of the current hash
>> function seems to be rather weak.
>
> What exactly is worrying you?
A network service that uses externally supplied strings and does not
immediately release them (either due to explicit retention, or due to
GC delays) can experience a significant slowdown.
The effect is also visible without without an string retention, that
is, with:
local function f(s, count)
for i=10000,10000 + count - 1 do
local s1 = magic(s, i)
end
end
You won't see quadratic behavior in the count because it's now
dependent on the length of a GC cycle (I think).
> Excluding malware, I do not think this situation happens enough to
> justify any worry.
I know the argument: anybody who wants to take out your web server can
just flood it with 5 Gbps of traffic (or more if necessary).
> Considering malware, I guess there are plenty of ways for untrusted
> inputs to waste CPU time, this is only one more.
It's a generic one, that can't be avoided as long as you keep using
strings. Other things can be avoided by applying conservative bounds
before starting computations etc.
> (BTW, simply dropping the skip in the hash function does solve the
> "problem"; just add a "step = 1" in lstring.c to check. But it creates a
> new one: the hash function becomes too expensive for long strings.)
You need to add a random seed, too. Is the result really too
expensive?
Doug Baskins (the Judy array inventor) suggests to use a relatively
weak hash to split strings into buckets and then store the strings in
each bucket in a trie, using Judy arrays. I don't know if Judy arrays
actually pay off their complexity (the hashing destroys locality, so
it's hard to imagine how the'd be a win).
Would an uninterned string type introduce too many additional code
paths in the VM?