- Subject: Multiple rules for gsub
- From: tiago.tresoldi@...
- Date: Wed, 04 Aug 2004 19:58:58 -0300
Hello people,
I am writing a tokenizer for natural-language texts and have run into a little problem.
As is usual with natural-language text, I have many different patterns/rules that can
designate a token to be captured, such as:
([%a-']+)
([.,;!?:"])
(!?)
([+-]?[%d.,]+)
and so on, used in code like the classical:
words = { }
string.gsub(text, rule, function (w) table.insert(words, w) end)
Lua patterns don't support alternation (OR), and I haven't been able to figure out a
way of doing this that isn't dramatically time-consuming. Of course, I could apply the
patterns one by one to size-increasing substrings of 'text', but that is certainly not
the best solution in terms of either performance or elegance. Does anyone have an idea
of how to do this without linking to an external library such as a regex binding?
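For concreteness, the crude "try each pattern in turn" approach I have in mind looks
roughly like the sketch below; the rules table, the exact character classes, and the
tokenize name are just placeholders, not something I am settled on:

-- illustrative rule list; each rule is anchored with '^' so it is only
-- tried at the current position instead of scanning the whole text
local rules = {
  "^([%a']+)",          -- words
  "^([%+%-]?[%d%.,]+)", -- numbers
  "^([%.,;!?:\"])",     -- punctuation
}

local function tokenize(text)
  local words = {}
  local pos = 1
  while pos <= string.len(text) do
    local matched = false
    for _, rule in ipairs(rules) do
      -- string.find with an init position plus '^' anchors the match at 'pos'
      local s, e, w = string.find(text, rule, pos)
      if s then
        table.insert(words, w)
        pos = e + 1
        matched = true
        break
      end
    end
    if not matched then
      pos = pos + 1   -- skip characters no rule accounts for (e.g. spaces)
    end
  end
  return words
end

Note that with this scheme the first rule that matches wins, so rule order matters,
and every position may have to try all the rules, which is what worries me
performance-wise on large texts.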
Thank you for your attention.
Tiago Tresoldi