Thank you very much for all the hints!

On 20/07/2020 22:26, Sean Conner wrote:
It was thus said that the Great Lorenzo Donati once stated:
Hi list!

I need to extract some information from some mail messages. Is there
some pure Lua library that can help me in the process?

  That is a tall order, and I doubt you'll get all what you want in a "pure"
Lua library (more about this below).


I had a bad hunch about this. That's why I asked on the list, hoping to get better insight. *sigh*

* Pure Lua. Possibly simple and lightweight. Maybe short enough to be
embedded in a Lua script or anyway to reside in a single file side to
side to my script.

  I have code to parse email headers [1], but

	1. it's nearly 700 lines of code;
	2. it's GPL, so it fails your "no copy-left hassle" test;
	3. it's mostly LPEG, so it fails your "pure Lua" test.
	4. it doesn't handle quoted-printable [2][3].

* Reliable, well-tested and foolproof. I don't know much about all the
RFCs that comprise the mail message format, but the library API should
be easy enough to let me extract the content of any header field and any
text part of the message. I have little time and expertise to cope with
corner cases where the library could fail because of bugs.

  There are quite a number of RFCs actually---I reference 14 different RFCs
in my code, and there might be new ones since I wrote the code.


14!? Ouch!


* It should handle quoted-printable encoding. In particular, it should
be able to convert from quoted-printable to UTF-8 automatically. I don't
strictly need other encodings, but also converting to Windows CP-1252
would be a bonus.

  This is the biggest issue you'll have.  Handling quoted-printable isn't
that bad in and of itself, but converting everything to UTF-8 will be a
monumental task in pure Lua.  Personally, for a task like this, I would use
iconv (I know it as a GNU library to do character set conversions and I am
unaware of any non-GNU library that does the same).

I think I could implement what I want to do directly easily without a
library except the quoted-printable decoding part. But I know little
about the mail format, except a quick glimpse on the related Wikipedia
articles, so I fear I could botch something obvious by simply creating
an ad-hoc "parser", and I don't have much time for this little project.

[snip]

  Also, each header has a specific format, which goes to explain why my code
is nearly 700 lines long (email addresses are particularly hairy to parse).

TIA for any useful advice and hint.

  Parsing email with pure Lua---possible, but I wouldn't want to do it.
Converting character sets in pure Lua---theoretically possible but good luck
in finding pure Lua code to do that.

  -spc (There's a reason I used LPEG for this ... )

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/email.lua

[2]	There is a form of quoted-printable for use in headers (which is
	what I'm thinking of as I write this)---handling quoted-printable
	in the body is *not* a concern of my code, which mostly deals with
	headers.  And it doesn't support the header form of
	quoted-printable.

[3]	I suppose I could, but *I* would require the use of iconv in
	addition to LPEG.  Also, not everyone follows the letter of the RFCs
	(headers are *supposed* to be ASCII-only).



Very useful insight. I now understand that looking for a ready-made pure Lua library is probably not really an option.

Fortunately for my use case I don't need to handle any possible field and any possible format, since I would be parsing mails from a very specific sender, whose mails are automatically generated.

To be more explicit, the sender is Amazon's order system. I want to write a simple script that automates what I now do manually, i.e. extracting order information and putting it in a text file for easy tracking and reference.

As I said in my previous post, the easiest path seems to be parsing the message source, looking for the text/plain part, and extracting what I need from there.

I have already examined some message sources, and the template they follow seems quite parsable with plain Lua code once the message part is extracted.

The biggest hurdles for me are:

(1) the automatic identification and extraction of the right message part, since I know almost nothing about mail format quirks (and there are many, as you confirmed). That's why I hoped there would be a library that did that for me. As you confirmed, this is probably not an option.

(2) decode from quoted-printable to UTF-8 or CP-1252 (I'm on Windows).


For (1) I think I could try and look for the `Content-Type:` field, which is always `multipart/alternative;` and contains a `boundary` parameter whose value separates the message parts.

Given that the message is automatically generated and doesn't seem to sport a lot of variation in its template, I guess a simple pattern search should be ok.

Once the boundary marker has been inferred, I'd scan the rest of the message for the first part that has a text/plain content type.

That seems reasonable (I hope).
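
The approach outlined for (1) could be sketched roughly as follows. This is only a minimal sketch under strong assumptions: the message is a flat `multipart/alternative` with a single `boundary` parameter, there are no nested multiparts, and each part has at least one header line. The function name `first_text_plain` is hypothetical, not taken from any library.

```lua
-- Hypothetical helper: returns the raw body and the headers of the first
-- text/plain part of a multipart message, or nil if none is found.
local function first_text_plain(source)
  -- Normalise line endings so the patterns below only deal with "\n".
  source = source:gsub("\r\n", "\n")

  -- The boundary parameter may or may not be quoted.
  local boundary = source:match('[Bb]oundary="([^"]+)"')
                or source:match('[Bb]oundary=([^%s;]+)')
  if not boundary then return nil end

  local delim = "--" .. boundary
  local pos = 1
  while true do
    -- Plain find (4th argument true): the boundary is not a Lua pattern.
    local s, e = source:find(delim, pos, true)
    if not s then return nil end
    local nexts = source:find(delim, e + 1, true)
    if not nexts then return nil end
    local part = source:sub(e + 1, nexts - 1)
    -- Headers and body of a part are separated by an empty line.
    local headers, body = part:match("^\n(.-)\n\n(.*)$")
    -- Note the escaped "-": an unescaped one is a pattern quantifier.
    if headers and headers:lower():find("content%-type:%s*text/plain") then
      return body, headers
    end
    pos = nexts
  end
end
```

The returned body is still in its transfer encoding (e.g. quoted-printable), and the returned headers can be inspected for the `charset` parameter before decoding.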


(2) is a biggie, though. Bear in mind that I don't need full UTF-8 support, because the message part I'm looking for seems to contain only Latin-1 characters, so they are all in the Unicode Basic Multilingual Plane (that's why I would be content with a CP-1252 encoding as well; I would prefer UTF-8 because it's nicer and interoperable, though :-)
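
For the restricted case described in (2), a pure Lua sketch might look like the one below. Quoted-printable only encodes bytes, so if the part declares `charset=UTF-8`, the decoded bytes should already be UTF-8 and no conversion is needed. The second function covers the case where the part is declared as ISO-8859-1. Both function names are hypothetical. Note also that CP-1252 differs from ISO-8859-1 in the 0x80-0x9F range (the euro sign, for instance), which this sketch does not handle; that would need a small lookup table.

```lua
-- Decode a quoted-printable body into raw bytes (charset-agnostic).
local function qp_decode(s)
  s = s:gsub("\r\n", "\n")
  -- A "=" at the end of a line is a soft line break: join the lines.
  s = s:gsub("=[ \t]*\n", "")
  -- "=XX" encodes the byte with hexadecimal value XX.
  s = s:gsub("=(%x%x)", function(hex)
    return string.char(tonumber(hex, 16))
  end)
  return s
end

-- Convert ISO-8859-1 (Latin-1) bytes to UTF-8.
-- Every byte >= 128 becomes a two-byte UTF-8 sequence.
local function latin1_to_utf8(s)
  return (s:gsub("[\128-\255]", function(c)
    local b = c:byte()
    return string.char(192 + math.floor(b / 64), 128 + b % 64)
  end))
end
```

Usage would be something like `latin1_to_utf8(qp_decode(body))` for a Latin-1 part, or just `qp_decode(body)` for a part that is already declared as UTF-8.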

Anyway, since the data I'm looking for is mostly numerical, I could also live with some data loss in the little textual data I need (if a product description contained, say, a Chinese character, I would happily skip it). So maybe there is hope for some simpler library (or algorithm) that covers that.
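
If some data loss is acceptable, a lossy conversion in the other direction (UTF-8 to Latin-1) is also short. The following is a sketch only, with a hypothetical function name; it assumes well-formed UTF-8 input, leaves stray continuation bytes untouched, and produces ISO-8859-1, which matches CP-1252 only for ASCII and for the 0xA0-0xFF range.

```lua
-- Convert UTF-8 to Latin-1, replacing every character outside
-- U+0000..U+00FF with `replacement` (default "?").
local function utf8_to_latin1_lossy(s, replacement)
  replacement = replacement or "?"
  -- Match one multi-byte sequence: a lead byte plus its continuation bytes.
  -- ASCII bytes are not matched and pass through unchanged.
  return (s:gsub("[\192-\255][\128-\191]*", function(seq)
    local lead, cont = seq:byte(1, 2)
    if #seq == 2 and (lead == 194 or lead == 195) then
      -- U+0080..U+00FF: rebuild the single Latin-1 byte.
      return string.char((lead - 192) * 64 + (cont - 128))
    end
    return replacement
  end))
end
```

Passing an empty string as `replacement` drops the unrepresentable characters instead of marking them.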

Otherwise I think I'll have to dumb down the process: instead of handling the message source, I would copy-paste the text directly from my mail client, which is a slower and more error-prone process, though.

Thanks again!

Cheers!

-- Lorenzo