lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


You might look into using a DAWG or Trie to achieve the search on the backend.  I'm currently using this approach to perform matching on incoming text against a Trie of about 200,000 phrases.

Basics (psudo pascal):
PTreeEntry=^TreeEntry;
TreeEntry=packed record
  c : Char; // Character this item represents
  eow: boolean; // Is this the end of a word or phrase
  children: array of PTreeEntry; // Children for this "letter"
  data: pointer; // Data associated with eow
end;

Insert a phrase by creating it within the root, for each letter add a new child node with the next character and set it to the working node.  When you reach the end of your word/phrase to match set eow to true and set the data pointer if you need it.  Rinse back to the root node and repeat for each phrase you want to match against.

Then to match against the block of text.  Set the working node to root, read a character.  If it exists in the children set the working node and move to the next character, if you have multiple entries you can create mutliple working nodes.  When you reach an end of word marker you have a match, if you don't find the next character reset the working node and keep going.  Make sure you set a max working depth (length of longest phrase is typically a good number) so you don't have too many working nodes for the text.

I have working pascal code that shows the aproach as well is tied into Lua if it would help, or there are lots of documents on the web on the topic (here is a link to an article I started with the basics http://www.eonclash.com/OutsideTheBox/Outside%20the%20Box%20(Article%201%20-%20Up%20a%20Trie).zip ).

 - Jeremy

On Fri, Feb 13, 2009 at 11:42 AM, Nathaniel Trevivian <nathanielt@gatewest.co.uk> wrote:

Basically, the script does a string.find on a large piece of text (500  
words-ish - sometimes more) for occurrences of any word from a list of  
over 40,000.
At the moment, I'm going through a table which contains the 40,000  
"check words" and performing this string.find on each one.

How about extracting the words in the text and then see if they're one of
the "check words"? You'll need to invert the table you have, but that's
easy:
function invert(t)
local z={}
for k,v in pairs(t) do z[v]=true end
return z
end

I see where you're going with that - and it's a great idea. However, the difficulty comes when the "check word" is actually a "check phrase", eg: two or three words.

I would have to split the document text up and compare each word against the inverted table, but how would I split it up? If I split it into single words, it wouldn't work.

Thanks anyway!