Hello. Sorry about the length, but I think completeness is better than lack of thought.

I'm looking for post headers in a forum archive, to put posts on single lines for sorting.
Here's an example of one post:

[code]
Author: name1
Email: mail@test.com
Host: ---.dialup.optusnet.com.au
Created: 8:42:41 PM 02-14-2001 (PST)
Subject: RE: Widgetty
blah guff and such and so

[/code]

I'm loading the entire archive as a binary string for the first parts of the procedure, and I want to mark the start of each post. If the posts are simple in form, I can use this:
[code]
DATA=string.gsub(DATA,"\r\n\r\n(Author:.-\r\nEmail:)","\r\n\r\n××%1")
[/code]

That looks like it will work, as it anchors the pattern to the double newline, then "Author:", then non-greedy match to end of line, then "Email:".

It should capture exactly this:

[code]
Author: name1
Email:
[/code]

and replace it with this:
[code]
××Author: name1
Email:
[/code]
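
To make the simple case concrete, here is a minimal sketch run on a made-up two-post string. The names `MARK`, `data`, `marked` and `count` are only for illustration and are not part of my real script. The sketch assumes every post, including the first, is preceded by a blank line (\r\n\r\n).

```lua
-- Hypothetical two-post sample, CR LF line endings throughout.
local MARK = "××"
local data = "\r\n\r\nAuthor: name1\r\nEmail: mail@test.com\r\nHost: a\r\nbody\r\n"
          .. "\r\nAuthor: name 2\r\nEmail:\r\nHost: b\r\nbody\r\n"

-- Same pattern as above; gsub also returns the number of substitutions made.
local marked, count = string.gsub(
  data,
  "\r\n\r\n(Author:.-\r\nEmail:)",
  "\r\n\r\n" .. MARK .. "%1"
)

print(count)   -- number of headers that were marked
print(marked)
```

Checking the count against the known number of posts is a cheap way to spot headers that were missed.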

It works for all but 20 of 80555 posts. :) It fails, though, if a post takes this form:

[code]
Author: name3
Email: name3@hotmail.com
Host: ---.chartermi.net
Created: 8:47:55 PM 02-14-2001 (PST)
Subject: THIS IS THE ONE TO WATCH
With the color change...

Author: <b>name3 <Title Tag></b> (---.chartermi.net)<br>
Date:    02-14-01 19:57<br><br>
color change!!! hahahahaha</font><p>

Without...

Author: <b>name3<font color="darkgreen"> <Title Tag></font></b> (---.chartermi.net)<br>
Date:    02-14-01 19:58<br><br>
I guess I'll keep the green one. LOL</font><p>

[/code]

In this case it takes the first two lines of the header, and more. It looks like a bug in the non-greedy matching operator. Basically, it seems to be greedy when it shouldn't be! The first capture covers only the real header of the post. The next capture starts at the quoted "Author:" line, which is followed by "Date:" rather than "Email:", so it finishes that post and runs on into the subsequent post!
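
Here is a minimal sketch that reproduces this on a cut-down version of the sample. The shortened post bodies and the variable names are assumptions for illustration only. It uses string.find twice to mimic how gsub resumes after each match, and it also checks whether "." matches \r and \n. As far as I can tell from the manual, "." matches any character, in which case ".-" is free to run across line breaks until the first "\r\nEmail:" that follows.

```lua
-- Cut-down, hypothetical version of the failing sample; lines joined with CR LF.
local data = table.concat({
  "", "",
  "Author: name3",
  "Email: name3@hotmail.com",
  "Subject: THIS IS THE ONE TO WATCH",
  "With the color change...",
  "",
  "Author: <b>name3 <Title Tag></b> (---.chartermi.net)<br>",
  "Date:    02-14-01 19:57<br><br>",
  "color change!!! hahahahaha</font><p>",
  "",
  "Author: Another Thinger",
  "Email: xyz@excite.com",
  "Subject: RE: more stuff",
  "",
}, "\r\n")

local pattern = "\r\n\r\n(Author:.-\r\nEmail:)"

-- First match: the real header of the name3 post.
local s1, e1, cap1 = string.find(data, pattern)
-- Second match: search resumes right after the first match, as gsub does.
local s2, e2, cap2 = string.find(data, pattern, e1 + 1)

print(cap1)
print("----")
print(cap2)   -- starts at the quoted "Author: <b>..." line

-- Does "." match CR and LF? A non-nil result from find means it does.
local dot_crosses_lines = string.find("a\r\nb", "^a.-b$") ~= nil
print(dot_crosses_lines)
```

If the second capture contains the header of the following post, that header has been consumed and never gets its own marker.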

I've found a cure, but that doesn't make the issue go away, it just evades it.

[code]
DATA=string.gsub(DATA,"\r\n\r\n(Author:[^\r]-\r\nEmail:)","\r\n\r\n××%1")
DATA=string.gsub(DATA,"\r\n\r\n(Author:[^\r]*\r\nEmail:)","\r\n\r\n××%1")
[/code]

Both of these work by forcing the repetition to stop at the \r, so it can't match past the end of the line unless further explicit expressions match, whether the operator is greedy or not. This seems to suggest that there is a problem with the - operator when used in gsub.
NOTE! All newlines are CR LF (\r\n, 0D 0A). While they were mixed types in the original archive source, this is something I've been careful to fix, and verify, before posting this stuff.
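
As a sanity check on the workaround, here is a minimal sketch applying the [^\r]- version to a cut-down form of the problem post. The shortened data and the names `MARK`, `data`, `marked` and `count` are assumptions for illustration, not my actual script.

```lua
-- Cut-down, hypothetical version of the failing sample; lines joined with CR LF.
local MARK = "××"
local data = table.concat({
  "", "",
  "Author: name3",
  "Email: name3@hotmail.com",
  "Subject: THIS IS THE ONE TO WATCH",
  "With the color change...",
  "",
  "Author: <b>name3 <Title Tag></b> (---.chartermi.net)<br>",
  "Date:    02-14-01 19:57<br><br>",
  "color change!!! hahahahaha</font><p>",
  "",
  "Author: Another Thinger",
  "Email: xyz@excite.com",
  "Subject: RE: more stuff",
  "",
}, "\r\n")

-- [^\r]- cannot step over a CR, so the match has to stay on the Author: line.
local marked, count = string.gsub(
  data,
  "\r\n\r\n(Author:[^\r]-\r\nEmail:)",
  "\r\n\r\n" .. MARK .. "%1"
)

print(count)
print(marked)
```

The quoted "Author:" line is followed by "Date:", not "Email:", so it should be left unmarked while both real headers get the marker.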


Finally, in case you want to test this stuff in full, here's a sample that is verified to behave as described:
[code]
Author: name1
Email: mail@test.com
Host: ---.dialup.optusnet.com.au
Created: 8:42:41 PM 02-14-2001 (PST)
Subject: RE: Widgetty
blah guff and such and so

Author: name 2
Email:
Host: ---.sympatico.ca
Created: 8:42:45 PM 02-14-2001 (PST)
Subject: RE: Zoodlewurdle
Wed. 01/02/14 23:47 EST

name1:

I wonder, though, if there IS a way to change...

The thing that confuses me is:  I looked at the source of this page...

Author:
Email: item@yahoo.co.uk
Host: ---.92.107.2
Created: 8:45:37 PM 02-14-2001 (PST)
Subject: RE: Stuff.
 There's nothing wrong with...

;-)

Author: Doohickey
Email: anywhere@hotmail.com
Host: --.eznet.net
Created: 8:46:46 PM 02-14-2001 (PST)
Subject: RE: Various Things
I think you'll likely see what I mean right away...

Author: name3
Email: name3@hotmail.com
Host: ---.chartermi.net
Created: 8:47:55 PM 02-14-2001 (PST)
Subject: THIS IS THE ONE TO WATCH
With the color change...

Author: <b>name3 <Title Tag></b> (---.chartermi.net)<br>
Date:    02-14-01 19:57<br><br>
color change!!! hahahahaha</font><p>

Without...

Author: <b>name3<font color="darkgreen"> <Title Tag></font></b> (---.chartermi.net)<br>
Date:    02-14-01 19:58<br><br>
I guess I'll keep the green one. LOL</font><p>



You got it right name5



Author: Another Thinger
Email: xyz@excite.com
Host: ---.oak.dial.netzero.com
Created: 8:48:04 PM 02-14-2001 (PST)
Subject: RE: more stuff
name5 .. i took a brief look .. and here the HTML i got

dwardmoses<font color="darkgreen"> <Title Tag></font

Author: name5
Email:
Host: ---.sympatico.ca
Created: 8:49:12 PM 02-14-2001 (PST)
Subject: RE: extras
Wed. 01/02/14 23:53 EST

name1:

Thanks very much!  I'm a bit intimidated by the installation process, and especially the consequences of a bungled install...but perhaps I'll try it tomorrow.

[/code]