- Subject: Re: Request for advice: pure Lua Library to parse mail messages.
- From: Sean Conner <sean@...>
- Date: Mon, 20 Jul 2020 16:26:57 -0400
It was thus said that the Great Lorenzo Donati once stated:
> Hi list!
>
> I need to extract some information from some mail messages. Is there
> some pure Lua library that can help me in the process?
That is a tall order, and I doubt you'll get everything you want in a "pure"
Lua library (more about this below).
> * Pure Lua. Possibly simple and lightweight. Maybe short enough to be
> embedded in a Lua script or anyway to reside in a single file side to
> side to my script.
I have code to parse email headers [1], but
1. it's nearly 700 lines of code;
2. it's GPL, so it fails your "no copy-left hassle" test;
3. it's mostly LPEG, so it fails your "pure Lua" test;
4. it doesn't handle quoted-printable [2][3].
> * Reliable, well-tested and foolproof. I don't know much about all the
> RFCs that comprise the mail message format, but the library API should
> be easy enough to let me extract the content of any header field and any
> text part of the message. I have little time and expertise to cope with
> corner cases where the library could fail because of bugs.
There are quite a number of RFCs actually---I reference 14 different RFCs
in my code, and there might be new ones since I wrote the code.
> * It should handle quoted-printable encoding. In particular, it should
> be able to convert from quoted-printable to UTF-8 automatically. I don't
> strictly need other encodings, but also converting to Windows CP-1252
> would be a bonus.
This is the biggest issue you'll have. Handling quoted-printable isn't
that bad in and of itself, but converting everything to UTF-8 will be a
monumental task in pure Lua. Personally, for a task like this, I would use
iconv (I know it as a GNU library to do character set conversions and I am
unaware of any non-GNU library that does the same).
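To illustrate why the quoted-printable half is the easy part, here is a minimal sketch of a body decoder in pure Lua. The function name `qp_decode` is made up for this example and is not part of the code in [1]. It only undoes the `=XX` escapes and soft line breaks; the bytes it returns are still in whatever charset the message declared, so the conversion to UTF-8 remains a separate problem.

```lua
-- Hypothetical helper: decode a quoted-printable body into raw bytes.
-- The result is NOT converted to UTF-8; it stays in the original charset.
local function qp_decode(s)
  -- Drop trailing spaces and tabs at the end of each line.
  s = s:gsub("[ \t]+(\r?\n)", "%1")
  -- Remove soft line breaks: an "=" immediately before the line ending.
  s = s:gsub("=\r?\n", "")
  -- Replace each "=XX" (two hex digits) with the byte it encodes.
  s = s:gsub("=(%x%x)", function(hex)
    return string.char(tonumber(hex, 16))
  end)
  return s
end

-- The input here happens to be UTF-8 already, so the output is too.
print(qp_decode("caf=C3=A9 =\r\nau lait"))
```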
> I think I could implement what I want to do directly easily without a
> library except the quoted-printable decoding part. But I know little
> about the mail format, except a quick glimpse on the related Wikipedia
> articles, so I fear I could botch something obvious by simply creating
> an ad-hoc "parser", and I don't have much time for this little project.
Parsing email is trickier than expected. First, the header names have a
canonical form, but ideally you need to compare them case-insensitively, so
the following are all the same:
  From: sean@conman.org
  FROM: sean@conman.org
  fRoM: sean@conman.org
  froM: sean@conman.org
(NOTE: each line is *supposed* to end with a CR and LF; I ended up having
to scan for an optional CR and a mandatory LF.)
Second, a header line can span multiple lines---subsequent lines start with
whitespace (spaces or tabs):
  Comment: This is a comment
  COMMENT: So
   is this.
  cOmMeNt: And
      this is
    a
        comment
  commenT: And so am I.
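The two rules above (case-insensitive names, continuation lines that start with whitespace) can be sketched in pure Lua as follows. This is only an illustration under simplifying assumptions, not the LPEG code in [1]: `parse_headers` is a made-up name, folded whitespace is collapsed to a single space, and the field values are left as uninterpreted strings.

```lua
-- Hypothetical helper: turn a raw header block into a table keyed by the
-- lowercased field name. Each entry is a list, since a field may repeat.
local function parse_headers(block)
  local headers = {}
  local name, value              -- the field currently being assembled

  local function flush()
    if name then
      local key = name:lower()   -- names compare case-insensitively
      headers[key] = headers[key] or {}
      -- Trim surrounding whitespace before storing the value.
      table.insert(headers[key], (value:match("^%s*(.-)%s*$")))
    end
  end

  for line in block:gmatch("[^\n]+") do
    line = line:gsub("\r$", "")  -- optional CR, mandatory LF
    if line:match("^[ \t]") then
      -- Continuation line: append it to the field being assembled.
      if name then
        value = value .. " " .. line:match("^[ \t]+(.*)$")
      end
    else
      flush()
      name, value = line:match("^([^:%s]+)[ \t]*:[ \t]*(.*)$")
    end
  end
  flush()
  return headers
end

local h = parse_headers("COMMENT: So\r\n is this.\r\ncommenT: And so am I.\n")
for i, v in ipairs(h.comment) do print(i, v) end
```

Lines without a colon are silently skipped here; a stricter parser would report them.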
The headers are separated from the body by a blank line, so the worst case
(for multi-line headers) is something like:
  FROM:
   sean@conman.org
  to:
   fred@example.com
  sUbJeCt:
   This
   is
   a subject line

  The body of the message goes here.
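Splitting at that blank line is the one step that really is short in pure Lua. A minimal sketch, with `split_message` as a made-up name, accepting either CRLF or bare LF line endings:

```lua
-- Hypothetical helper: split a raw message into header block and body
-- at the first blank line.
local function split_message(raw)
  -- Lazy match up to the first line ending that is followed by an
  -- empty line; the CR is optional in both places.
  local head, body = raw:match("^(.-\r?\n)\r?\n(.*)$")
  if not head then
    return raw, ""   -- no blank line found: treat everything as headers
  end
  return head, body
end

local head, body = split_message(
  "FROM:\r\n sean@conman.org\r\n\r\nThe body of the message goes here.\r\n")
print(body)
```

The header block it returns still has to be unfolded and parsed field by field, which is where most of the work lies.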
Also, each header has a specific format, which goes to explain why my code
is nearly 700 lines long (email addresses are particularly hairy to parse).
> TIA for any useful advice and hint.
Parsing email with pure Lua---possible, but I wouldn't want to do it.
Converting character sets in pure Lua---theoretically possible, but good luck
finding pure Lua code to do that.
-spc (There's a reason I used LPEG for this ... )
[1] https://github.com/spc476/LPeg-Parsers/blob/master/email.lua
[2] There is a form of quoted-printable for use in headers (which is
what I'm thinking of as I write this)---handling quoted-printable
in the body is *not* a concern of my code, which mostly deals with
headers. And it doesn't support the header form of
quoted-printable either.
[3] I suppose I could, but *I* would require the use of iconv in
addition to LPEG. Also, not everyone follows the letter of the RFCs
(headers are *supposed* to be ASCII-only).
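For reference, the header form of quoted-printable mentioned in [2] is the "Q" encoding of RFC 2047, where a header value carries words of the form `=?charset?Q?text?=`. Below is a rough pure-Lua sketch of undoing just the Q escapes; `decode_q_words` is a made-up name, the "B" (base64) variant is ignored, and the charset is deliberately not acted on, so the output bytes are still in that charset and would need something like iconv to become UTF-8.

```lua
-- Hypothetical helper: undo RFC 2047 "Q" encoded-words in a header value.
-- The declared charset is captured but NOT converted.
local function decode_q_words(value)
  return (value:gsub("=%?([^?]+)%?[Qq]%?([^?]*)%?=", function(charset, text)
    text = text:gsub("_", " ")                 -- "_" stands for a space
    text = text:gsub("=(%x%x)", function(hex)  -- "=XX" is a hex-encoded byte
      return string.char(tonumber(hex, 16))
    end)
    return text
  end))
end

-- The decoded bytes below are ISO-8859-1, not UTF-8.
print(decode_q_words("Subject: =?ISO-8859-1?Q?Caf=E9_au_lait?="))
```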