- Subject: Re: Capture patterns
- From: Eike Decker <eike@...>
- Date: Tue, 8 Jan 2008 16:55:49 +0100
Hi
Looking at your given samples:
1. $100.00
2. .00
3. 100,000.00
4. -1,234.56
5. (1,234.56)
6. 1,234.56CR
7. 1,234.56-
8. +100.00
I made this pattern, which matches all of the examples above and
returns only the digits plus the characters needed to tell whether the value
is negative:
%$?([%d%.,%-%(%)CR]+)
You could replace every match with a "normalized" number using gsub. Some
locales swap the . and , (as in 1.000,00), so this function accepts either
character as the decimal separator:
function normalize(str)
  str = str:gsub("%$?([%d%.,%-%(%)CR]+)",
    function(num)
      num = num:gsub("^(.*)%-$","-%1") -- let minus be in front
      num = num:gsub("^%((.-)%)$","-%1") -- if in brackets, negate it
      num = num:gsub("^(.*)CR$","-%1") -- I assume CR means negative?
      num = num:gsub("^(.-)[%.%,](..)$","%1.%2") -- decimal separator becomes .
      num = num:gsub("[%,%.](%d%d%d)","%1") -- remove ,/. in front of each group of three digits
      return num
    end)
  return str
end
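To see how the decimal-separator step copes with the flipped notation, it can be tried on its own. This is only a sketch: the helper name unify_decimal_separator is made up for illustration, and it assumes every amount ends in exactly two decimal digits, as in the samples.

```lua
-- Hypothetical helper: only the decimal-separator step of normalize.
-- Assumes the amount ends in a separator followed by exactly two characters.
function unify_decimal_separator(num)
  -- (.-) is lazy, but the trailing $ forces the match onto the LAST separator.
  -- The parentheses drop gsub's second return value (the match count).
  return (num:gsub("^(.-)[%.%,](..)$", "%1.%2"))
end

print(unify_decimal_separator("1.000,00")) -- 1.000.00
print(unify_decimal_separator("1,000.00")) -- 1,000.00
print(unify_decimal_separator(".00"))      -- .00
```

After this step the decimal separator is always a period, so any separator still followed by three digits can only be a thousands separator.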
The given samples above are transformed to
1. 100.00
2. .00
3. 100000.00
4. -1234.56
5. -1234.56
6. -1234.56
7. -1234.56
8. +100.00
All of them can be converted to a real number with tonumber, and each of them
matches the pattern [%-+]?%d*%.?%d* (the sign is optional).
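A quick way to check this is to run the normalized strings through tonumber and an anchored version of the pattern. The sketch below is an illustration, not part of the original function: is_plain_number is a made-up name, and the extra %d test is my own addition, since the pattern alone would also accept an empty string or a lone sign.

```lua
-- The normalized strings from the list above.
local normalized = { "100.00", ".00", "100000.00", "-1234.56",
                     "-1234.56", "-1234.56", "-1234.56", "+100.00" }

-- Hypothetical helper: anchored check with an optional sign.
-- The second test rejects strings such as "-" or "." that contain no digit.
function is_plain_number(str)
  return str:find("^[%-+]?%d*%.?%d*$") ~= nil and str:find("%d") ~= nil
end

for i, text in ipairs(normalized) do
  -- tonumber returns nil when the string is not a valid number
  print(i, text, is_plain_number(text), tonumber(text))
end
```

Strings that still contain a thousands separator, such as 1,234.56, fail both the check and tonumber, which makes this a cheap test that normalization did its job.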
Maybe there are simpler (or more efficient) ways to do that... that's just how I
would approach this problem ;)
One problem remains: the pattern might also match strings that don't contain
any digits (a lone "-" or "CR", for example). This might be fixed by changing
the matching pattern:
%$?([%d%.,%-%(%)CR]+)
to
%$?([%d%.,%-%(%)CR]*%d+[%d%.,%-%(%)CR]*)
which requires each match to contain at least one digit among the special
characters. But I haven't tested that.
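The difference between the two patterns can be checked with gmatch on a line of text. The report line and the helper name find_amounts below are invented for illustration; real report lines may well behave differently.

```lua
-- Hypothetical helper: collect every capture of `pattern` found in `text`.
function find_amounts(text, pattern)
  local found = {}
  for capture in text:gmatch(pattern) do
    found[#found + 1] = capture
  end
  return found
end

local loose  = "%$?([%d%.,%-%(%)CR]+)"
local strict = "%$?([%d%.,%-%(%)CR]*%d+[%d%.,%-%(%)CR]*)"

-- Made-up report line: a text description followed by three amounts.
local line = "Rent - Suite C  $1,200.00  (50.00)  75.25CR"

print(table.concat(find_amounts(line, loose), " | "))
print(table.concat(find_amounts(line, strict), " | "))
```

On this line the first pattern also picks up R, - and C from the description, while the second returns only the three amounts. A description that itself contains digits (a suite number, say) would still be matched by both.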
Lua's pattern-matching functions are quite simple, but also faster than the
full regular expressions that Perl provides. With full regular expressions it
might be possible to replace the function I wrote with a single expression
that does all of this in one stroke, but I doubt you could do it with a single
Lua pattern.
Eike
> >What the pattern actually captures is columns separated by whitespace.
> >The pattern just forces that the columns corresponding to numbers are the
> >last three ones, and forces that the first field contains the rest of the
> >line.
>
> Thanks again for the explanation. After some consideration, I see how it
> works.
>
> I did a very poor job of transferring my problem to a suitable example. When
> I tried to adapt the pattern to my 13-number real-world situation, I had
> difficulty. I have no doubt this is my bug, not an error in the
> provided solution. The problem with my original pattern was that it captured
> dashes in the text and periods in abbreviations. The dollar
> amounts are always preceded by a text description.
>
> If I were doing this on my native platform, I would search for periods. At
> each period, I would look for two trailing digits (and perhaps one preceding
> digit; consider that some use .00 and some 0.00). Then I would capture
> everything in the string from space to space. Look at all these possible
> formats:
>
> 1. $100.00
> 2. .00
> 3. 100,000.00
> 4. -1,234.56
> 5. (1,234.56)
> 6. 1,234.56CR
> 7. 1,234.56-
> 8. +100.00
>
> My original problem statement was to get a pattern that would do something
> similar to the above strategy. My perception is that the code would be more
> utilitarian if I did not look for a fixed number of fields, and that I allow
> all the common financial notations used in reports.
>
> I read the section in PiL multiple times on patterns (20.3, p. 180ff), and I
> think that the section could be better organized by covering the "magic
> characters" in order instead of randomly. This would make the text more
> useful as a reference (at least to someone scatter-brained like me).
>
> I think that Lua is an excellent language for this type of problem. I am
> awed by its power and flexibility.