• Subject: Re: Capture patterns
• From: Eike Decker <eike@...>
• Date: Tue, 8 Jan 2008 16:55:49 +0100

```
Hi

1. $100.00
2. .00
3. 100,000.00
4. -1,234.56
5. (1,234.56)
6. 1,234.56CR
7. 1,234.56-
8. +100.00

I made this pattern, which matches all of the examples above and returns only
the digits and the characters needed to tell whether the value is negative:

%$?([%d%.,%-%(%)CR]+)

You could replace every match with a "normalized" number using gsub. Since
some locales swap the . and , (as in 1.000,00), this function treats either
character as a separator and does not depend on which one is used:

function normalize(str)
  str = str:gsub("%$?([%d%.,%-%(%)CR]+)",
    function(num)
      num = num:gsub("^(.*)%-$","-%1") -- let minus be in front
      num = num:gsub("^%((.-)%)$","-%1") -- if in brackets, negate it
      num = num:gsub("^(.*)CR$","-%1") -- I assume CR means negative?
      num = num:gsub("^(.-)[%.%,](..)$","%1.%2") -- last ,/. becomes .
      num = num:gsub("[%,%.](%d%d%d)","%1") -- remove ,/. except for last one
      return num
    end)
  return str
end

The given samples above are transformed to

1. 100.00
2. .00
3. 100000.00
4. -1234.56
5. -1234.56
6. -1234.56
7. -1234.56
8. +100.00

Each of these can be converted to a number with tonumber and is matched by
the pattern [%-+]?%d*%.?%d* (the sign is optional).

Maybe there are simpler (or more efficient) ways to do that... that's just how I
would approach this problem ;)

One problem remains: the pattern might also match strings that contain no
digits at all. This might be fixed by changing the matching pattern from

%$?([%d%.,%-%(%)CR]+)

to

%$?([%d%.,%-%(%)CR]*%d+[%d%.,%-%(%)CR]*)

which requires the match to contain at least one digit among the special
characters. But I haven't tested that.

Lua's pattern functions are quite simple, but also faster than the full
regular expressions that Perl provides. With full regular expressions it might
be possible to replace the function I wrote with a single expression that does
all of this in one stroke, but I doubt you could do the same with a single Lua
pattern.

Eike

> >What the pattern actually captures is columns separated by whitespace.
> >The pattern just forces that the columns corresponding to numbers are the
> >last three ones, and forces that the first field contains the rest of the
> >line.
>
> Thanks again for the explanation. After some consideration, I see how it
> works.
>
> I did a very poor job of transferring my problem to a suitable example. When
> I tried to adapt the pattern to my 13-number real-world situation, I had
> difficulty.  I have no doubt this is my bug- not an error in the
> provided solution.  The problem with my original pattern was it capturing
> dashes in the text or capturing periods as in an abbreviation.  The dollar
> amounts are always preceded by a text description.
>
> If I were doing this on my native platform, I would search for periods.  At
> each period, I would look for two trailing digits (and perhaps one preceding
> digit; consider that some use .00 and some 0.00).  Then I would capture
> everything in the string from space to space.  Look at all these possible
> formats:
>
> 1. $100.00
> 2. .00
> 3. 100,000.00
> 4. -1,234.56
> 5. (1,234.56)
> 6. 1,234.56CR
> 7. 1,234.56-
> 8. +100.00
>
> My original problem statement was to get a pattern that would do something
> similar to the above strategy. My perception is that the code would be more
> utilitarian if I did not look for a fixed number of fields, and that I allow
> all the common financial notations used in reports.
>
> I read the section in PiL multiple times on patterns (20.3, p. 180ff), and I
> think that the section could be better organized by covering the "magic
> characters" in order instead of randomly.  This would make the text more
> useful as a reference (at least to someone scatter-brained like me).
>
> I think that Lua is an excellent language for this type of problem.  I am
> awed by its power and flexibility.

```
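The message leaves the stricter pattern untested, so here is a minimal sketch that pairs it with the normalization steps and tonumber, and pulls every amount out of a line of text. The names AMOUNT_PATTERN, normalize_amount and extract_amounts are hypothetical, and reading CR as "negative" follows the assumption made in the message. Writing the grouping-separator step with `%d%d%d` is a further assumption, made so that amounts with several thousands separators (such as 1,234,567.89) are handled. It is one possible way to wire the pieces together, not a definitive implementation.

```lua
-- Stricter capture pattern from the message: a run of amount characters
-- that contains at least one digit, optionally preceded by "$".
local AMOUNT_PATTERN = "%$?([%d%.,%-%(%)CR]*%d+[%d%.,%-%(%)CR]*)"

-- Rewrite one captured amount into a form that tonumber accepts.
function normalize_amount(num)
  num = num:gsub("^(.*)%-$", "-%1")            -- trailing minus moves to the front
  num = num:gsub("^%((.-)%)$", "-%1")          -- (amount) becomes -amount
  num = num:gsub("^(.*)CR$", "-%1")            -- assumption: CR marks a negative value
  num = num:gsub("^(.-)[%.,](%d%d)$", "%1.%2") -- final separator becomes "."
  num = num:gsub("[%.,](%d%d%d)", "%1")        -- assumption: other separators group three digits
  return num
end

-- Collect every amount found in a line as a Lua number.
-- Matches that still fail tonumber (for example "R2") are skipped.
function extract_amounts(line)
  local values = {}
  for raw in line:gmatch(AMOUNT_PATTERN) do
    local value = tonumber(normalize_amount(raw))
    if value then
      values[#values + 1] = value
    end
  end
  return values
end

-- Usage: a text description followed by amounts in mixed notations.
for _, amount in ipairs(extract_amounts("Rent, office 1,234.56 (200.00) 15.00CR")) do
  print(amount)
end
```

Because the pattern needs a digit, stray dashes, periods, or a bare "CR" in the description are not captured. A token such as "-5" inside a word would still match, so the description text may need its own filtering.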