Re: Spelling corrector

Thanks Enrico, I tried some of your comments about string concatenation. I will reply inline:

>From a superficial glance at spell.lua:

> yield( splits[i-1][1]..word[i+1]..word[i]..splits[i+2][2] )

Do you really need to concatenate here, or could you instead pass a table with separate entries and concatenate only when needed?

It is really necessary to concatenate here. The distortion procedure of a word may produce in different ways the same distorted word result. So, in this case the yield function is a wrapper over the coroutine.yield function, which checks that a word has not been generated twice, avoiding multiple yields with the same word. Not doing this check will be even worst :(

Also, you could try using table.concat() instead of .. (it could be faster... or not).

I tried table.concat and string.format, but both perform worst. This was counter-intuitive to me, because Lua string concat generates copies of intermediate strings. However, seems that for short strings and small number of concatenated strings, string __concat performs better than string.format or table.concat. Does anyone know if my observation is true?

You could also keep an eye on memory usage to see if this is really an allocation problem

collectgarbage("count") is around 20MB/40MB. By using table.concat this numbers are similar, but the performance decreases a lot, close to two times slower.

Lots of temporary strings seem to be created (e.g. by word_str:sub(...)); perhaps the code could be restructured to reduce their number as much as possible?

I'm wondering if it is possible. It will be possible to use indices instead of concatenated strings, but in this way the code will be forced to check repetitions working over the indices, and at the ends, the code will be more complicated and slower :'(

Last but not least:

> for i=1,#word_str do for j=1,#alphabet do

I did not try to follow the code, so I may be dead wrong, but nested loops like this usually raise a O(n^2) efficiency red flag, especially because you perform string concatenation inside them.

Certainly, the bottleneck is the large number of words generated by both nested loops (replaces and inserts). It is needed to loop over N*M being N the user word length and M=26 the length of English alphabet. As said by Peter Norvig, talking about 'edits1' function number of generated words:

This can be a big set. For a word of length n, there will be n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates). For example, len(edits1('something')) -- that is, the number of elements in the result of edits1('something') -- is 494.

--
Enrico