* Roberto Ierusalimschy:

>> The second part of the following benchmark shows distinctly non-linear
>> performance: Each pass of the first part runs under one second on my
>> machine, but the passes in the second part take 4, 14, and 32 seconds,
>> respectively.
>> 
>> I'm a bit worried by this, and the impact on programs which process
>> data from untrusted inputs.  I haven't got a really good idea what can
>> be done about this.  In similar cases, people have used Jenkins'
>> lookup3.c hash function with a random seed.  Perhaps it is sufficient
>> to drop the skipping from Lua's hash function and use a random seed
>> stored in the Lua state, but the internal mixing of the current hash
>> function seems to be rather weak.
>
> What exactly is worrying you?

A network service that uses externally supplied strings and does not
immediately release them (either due to explicit retention, or due to
GC delays) can experience a significant slowdown: colliding strings
accumulate in a single chain of the global string table, so each new
intern has to rescan the entire chain, which is quadratic overall.

The effect is also visible without any string retention, that is,
with:

local function f(s, count)
   -- magic is the string constructor from the original benchmark
   -- (not shown here); each call creates and drops a fresh string
   for i = 10000, 10000 + count - 1 do
      local s1 = magic(s, i)
   end
end

You won't see behavior that is quadratic in the count because the
chain length is now bounded by the length of a GC cycle (I think).
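
One way to check that (still assuming the magic helper from the
original benchmark, with made-up arguments): stop the collector, so
that dead strings are never removed from the string table, and the
quadratic behavior should come back:

collectgarbage("stop")    -- nothing removes dead strings now
local t0 = os.clock()
f(("x"):rep(64), 100000)  -- illustrative arguments
print(os.clock() - t0)
collectgarbage("restart")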

> Excluding malware, I do not think this situation happens enough to
> justify any worry.

I know the argument: anybody who wants to take out your web server can
just flood it with 5 Gbps of traffic (or more if necessary).

> Considering malware, I guess there are plenty of ways for untrusted
> inputs to waste CPU time, this is only one more.

It's a generic one that can't be avoided as long as you keep using
strings.  Other attacks can be avoided by applying conservative
bounds before starting computations, etc.
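
For comparison, such a bound could look like this (the limit and the
names are made up; this is just to show where the check sits):

local MAX_INPUT = 64 * 1024   -- illustrative limit

local function handle(input)
   -- the bound runs before any expensive work...
   if #input > MAX_INPUT then
      return nil, "input too large"
   end
   -- ...but 'input' was already hashed and interned the moment it
   -- became a Lua string, so no such check protects the string table
   return #input   -- stand-in for the real computation
end

print(handle(("x"):rep(100)))          --> 100
print(handle(("x"):rep(1024 * 1024)))  --> nil   input too large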

> (BTW, simply dropping the skip in the hash function does solve the
> "problem"; just add a "step = 1" in lstring.c to check. But it creates a
> new one: the hash function becomes too expensive for long strings.)

You need to add a random seed, too.  Is the result really too
expensive?
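
For reference, this is roughly what lstring.c does today, written as
Lua 5.3-style code (the real thing is C; lua_hash, MASK and the seed
parameter are just names for this sketch, the seed being the proposed
addition):

local MASK = 0xFFFFFFFF    -- emulate 32-bit unsigned arithmetic

local function lua_hash(s, seed)
   local h = (#s ~ (seed or 0)) & MASK  -- stock Lua seeds with the
                                        -- length only
   local step = (#s >> 5) + 1           -- the skip: at most ~32
                                        -- characters are sampled
   for i = #s, step, -step do
      h = (h ~ (((h << 5) + (h >> 2) + s:byte(i)) & MASK)) & MASK
   end
   return h
end

With "step = 1" every byte gets mixed in.  With the skip in place,
strings that differ only in positions the step never samples collide
no matter what the seed is, which is why I don't think a seed alone
is enough:

-- for a 64-byte string, step is (64 >> 5) + 1 = 3, so only the
-- positions 64, 61, ..., 7, 4 are sampled; byte 2 never is
local base = ("x"):rep(64)
local all_collide = true
for c = 65, 90 do   -- 26 variants differing only in byte 2
   local s = "x" .. string.char(c) .. base:sub(3)
   all_collide = all_collide and lua_hash(s, 42) == lua_hash(base, 42)
end
print(all_collide)  --> true: one hash value for all of them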

Doug Baskins (the Judy array inventor) suggests using a relatively
weak hash to split strings into buckets and then storing the strings
in each bucket in a trie, using Judy arrays.  I don't know if Judy
arrays actually pay off their complexity (the hashing destroys
locality, so it's hard to imagine how they'd be a win).
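
The shape of that scheme, with plain Lua tables standing in for the
Judy arrays (weak_hash and NBUCKETS are my own strawman choices):

local NBUCKETS = 256
local buckets = {}

local function weak_hash(s)   -- deliberately cheap and weak
   return (#s + (s:byte(1) or 0) + (s:byte(-1) or 0)) % NBUCKETS
end

local function intern(s)
   local h = weak_hash(s)
   local node = buckets[h]
   if not node then
      node = {}
      buckets[h] = node
   end
   for i = 1, #s do   -- walk (and extend) the bucket's trie
      local c = s:byte(i)
      node[c] = node[c] or {}
      node = node[c]
   end
   -- a weak-hash collision now costs a trie walk proportional to the
   -- string length, instead of a rescan of a chained bucket
   node.interned = node.interned or s
   return node.interned
end

assert(intern("foo") == intern("foo"))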

Would an uninterned string type introduce too many additional code
paths in the VM?