Hello people,

I am writing a tokenizer for natural-language texts and have run into a small problem.
As is usual with natural-language text, I have many different patterns/rules, each of
which can designate a token of the text to be captured, such as:

([%a-']+)
([.,;!?:"])
(!?)
([+-]?[%d.,]+)

and so on, each used in code like the classic idiom:

-- 'text' is the input string; 'rule' is one of the patterns above
local words = {}
string.gsub(text, rule, function (w) table.insert(words, w) end)

Lua's patterns don't support alternation (OR), and I haven't been able to
figure out a way of doing this that isn't dramatically slow. Of course it would be
possible to apply the patterns one by one to progressively longer substrings of 'text', but
that certainly isn't the best solution in terms of either performance or elegance. Does anyone
have an idea of how to do this without linking to external libraries such as regex?
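
For illustration, the one-by-one approach could be sketched roughly as below: each
rule is anchored with '^' and tried at the current position via string.find, and the
first rule that matches supplies the next token. This is only a sketch under assumptions:
the function name 'tokenize', the order of the rules, and the choice to skip characters
that no rule matches are all hypothetical. The rule (!?) is left out because it can match
the empty string, which would stall the scan.

```lua
-- Hypothetical rule list: the patterns from the message, each anchored with '^'.
-- The order is an assumption; the first rule that matches wins.
local rules = {
  "^([%a-']+)",        -- words, possibly with hyphens and apostrophes
  "^([.,;!?:\"])",     -- a single punctuation mark
  "^([+-]?[%d.,]+)",   -- numbers with optional sign and separators
}

-- Scan 'text' from left to right, emulating alternation by trying every rule
-- at the current position. 'tokenize' is an illustrative name, not a library call.
local function tokenize(text)
  local words = {}
  local pos = 1
  local len = string.len(text)
  while pos <= len do
    local matched = false
    for _, rule in ipairs(rules) do
      -- With an init argument, '^' anchors the match at 'pos'.
      local s, e, w = string.find(text, rule, pos)
      if s and e >= s then          -- ignore empty matches
        table.insert(words, w)
        pos = e + 1
        matched = true
        break
      end
    end
    if not matched then
      pos = pos + 1                 -- no rule applies here (e.g. whitespace): skip
    end
  end
  return words
end
```

With these rules, tokenize("It's 3.14 today!") yields the tokens It's, 3.14, today and !.
Each character is visited at most once per rule, so the cost grows with the length of
the text times the number of rules rather than with repeated passes over substrings.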
Thank you for your attention.

Tiago Tresoldi