[Date Prev][Date Next][Thread Prev][Thread Next]
- Subject: string optimization ideas [Re: precompiled regex's to lua]
- From: Karel Tuma <ktuma@...>
- Date: Fri, 10 Nov 2006 14:54:22 +0100
as roberto have mentioned, lua patterns are not optimized
for redundancies by nature. performance may heavily depend
on how you write the pattern.
by the way, string patterns may not be to blame at all. working
with strings is generally slower (except comparing and hash keying)
in lua because the way theyre implented - substrings are copied all over
and over again thrashing the heap, better to not mention cpu cache.
(low data locality). you may want to try to do string.sub() few times
for every line and ditto in perl. if perl still outperforms, something
is wrong with your I/O handling, or you're just creating too many strings
for every iteration. trying gprof might prove more useful too
(though no sure about the wrestling with surrounding perl bloat).
interesting approach is to avoid string creation completely.
for large input bulks of data, it is enough to masquerade lua_tolstring
and fool the relevant library routines. for returning strings without
creating actual copy, it's a bit more tricky. the only easily hackable
way i came up with was to store "actual" length and offset with every
string reference, so creating a substring with string.sub/match etc.
would be matter of just creating appropiate reference with offset and length
to the string table. string comparison would need to compare these two,
along with stringtable key.
there are several issues though:
a) isn't overhead of additional off/len processing higher
than actual re-hashing?
-> needs benchmark
b) what about if someone always creates "small" reference and the
original 2kb string isnt ever again used as a whole, just those few
-> implement "reduction" in garbage collector (i'e find
shortest "common" string, rehash, and update all references)?
i've this on my todo list already...
On Fri, Nov 10, 2006 at 05:24:20AM +0100, email@example.com wrote:
> Thanks for the background, so I'm kind of comparing apples
> and oranges?
> Anyways, I had done the benchmarking, and up until the
> last yards Lua was indeed always faster. With use of qr//
> in Perl, it wasn't.
> Which lead me to think, is there anything that could be
> done. In general terms, I too would prefer Lua over
> anything, but it also needs to be "faster or the same
> speed" as competitor. Otherwise, a transition would
> certainly -and justifyably- be questioned.
> The actual issue is CSV parsing (comma separated values).
> Just simple, "%d+,%d+,[^,]," kind. Ideas for a better
> approach in Lua then string.match? (making a C module
> just for cutting out the , fields would be Fast but is not
> really an alternative, at least not in comparison to other
> On Thu, 9 Nov 2006 23:23:12 +0100
> Karel Tuma <firstname.lastname@example.org> wrote:
> >just benchmark it :)
> >lua of course wins, but not always.
> >the thing with compiling regexs, they're FSA (finite
> >state automata)
> >basically tree of states, where nodes can point meshed
> >to other nodes.
> >FSA compiler is considered non-trivial in terms of
> >implementation size.
> >but lua patterns are sort of NFA (nonfinite), they're
> >already compiled as they
> >are. due to that referring back to some state or
> >is very limited making lua patterns "less powerful" than
> >pcre's, but
> >enough for the usual daywork and when you need something
> >you've to code the logic all by yourself.
> >to be exact FSA is slower for "simple" expressions where
> >NFA on the
> >complex ones, some pcre implementations even use both
> >and choose
> >between them depending on the expression (!).
> >hey, this is lua, we want the simple and fast, right?
> >add to this perl's suckyness on much everything else and
> >lua is the winner
> >(using lua for parsing 1Gb+ logs using mmap(), look at
> >http://blua.leet.cz/sep/STRHOOK_PATCH.patch to get the
> >On Thu, Nov 09, 2006 at 09:58:22PM +0100, Asko Kauppi
> >>I didn't find any reference to discussing precompiled
> >>expressions, and Lua.
> >>Some background:
> >>In huge log file handling, Lua loses to Perl (not by
> >>much!) seemingly
> >>because it has no concept of precompiling, and then
> >>applying the
> >>regular expression patterns.
> >>In Perl, one can:
> >> my $re= qr/^\d+\s/;
> >> $var =~ $re; # $re is a precompiled, optimized regex,
> >>applied to
> >> $var
> >> or:
> >> $var =~ /^\d+\s/o; # 'o' for once, compile once, cache,
> >> string.match( var, "%d+%s" ) -- is "%d+%s" analyzed
> >>anew each
> >> time?
> >>Is Lua losing in speed, since it has not this concept,
> >>or have the
> >>authors been so clever, it's already "there", just
> >>invisible? :)
> >>We're talking BIG files, say 1 million lines, or more.