It was thus said that the Great Lorenzo Donati once stated:
> Hi list!
> 
> I need to extract some information from some mail messages. Is there 
> some pure Lua library that can help me in the process?

  That is a tall order, and I doubt you'll get everything you want in a "pure"
Lua library (more about this below).

> * Pure Lua. Possibly simple and lightweight. Maybe short enough to be 
> embedded in a Lua script or anyway to reside in a single file side to 
> side to my script.

  I have code to parse email headers [1], but

	1. it's nearly 700 lines of code;
	2. it's GPL, so it fails your "no copy-left hassle" test;
	3. it's mostly LPEG, so it fails your "pure Lua" test;
	4. it doesn't handle quoted-printable [2][3].

> * Reliable, well-tested and foolproof. I don't know much about all the 
> RFCs that comprise the mail message format, but the library API should 
> be easy enough to let me extract the content of any header field and any 
> text part of the message. I have little time and expertise to cope with 
> corner cases where the library could fail because of bugs.

  There are quite a number of RFCs actually---I reference 14 different RFCs
in my code, and there might be new ones since I wrote the code.

> * It should handle quoted-printable encoding. In particular, it should 
> be able to convert from quoted-printable to UTF-8 automatically. I don't 
> strictly need other encodings, but also converting to Windows CP-1252 
> would be a bonus.

  This is the biggest issue you'll have.  Handling quoted-printable isn't
that bad in and of itself, but converting everything to UTF-8 will be a
monumental task in pure Lua.  Personally, for a task like this, I would use
iconv (I know it as a GNU library to do character set conversions and I am
unaware of any non-GNU library that does the same).
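
  To show why the quoted-printable part alone "isn't that bad", here is a
minimal pure Lua sketch of decoding the body form (RFC 2045).  The name
qp_decode is made up for illustration.  It returns raw bytes in whatever
character set the message declared and does no conversion to UTF-8, which
is the hard part described above.

```lua
-- Sketch only: decode a quoted-printable body (RFC 2045).
-- Returns raw bytes; no character set conversion is attempted.
local function qp_decode(s)
  -- Soft line breaks: "=" at the end of a line (trailing blanks tolerated)
  -- means the line continues, so the "=" and the line ending are dropped.
  s = s:gsub("=[ \t]*\r?\n", "")
  -- "=XX" (two hex digits) stands for the byte with that value.
  s = s:gsub("=(%x%x)", function(hex)
    return string.char(tonumber(hex, 16))
  end)
  return s
end
```

  If the message was already UTF-8, the decoded bytes can be used as they
are; anything else still needs a character set conversion afterwards.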

> I think I could implement what I want to do directly easily without a 
> library except the quoted-printable decoding part. But I know little 
> about the mail format, except a quick glimpse on the related Wikipedia 
> articles, so I fear I could botch something obvious by simply creating 
> an ad-hoc "parser", and I don't have much time for this little project.

  Parsing email is trickier than expected.  First, the header names have a
canonical form, but ideally you need to compare them case-insensitively, so
the following are all the same:

	From: sean@conman.org
	FROM: sean@conman.org
	fRoM: sean@conman.org
	froM: sean@conman.org

  (NOTE:  each line is *supposed* to end with a CR and LF; I ended up having
to scan for an optional CR and a mandatory LF.)  Second, a header can
span multiple lines---subsequent lines start with whitespace (spaces or
tabs):

	Comment: This is a comment
	COMMENT: So
		is this.
	cOmMeNt: And
	 this is
		a 
	 comment
	commenT: And so am I.

  The headers are separated from the body by a blank line, so the worst case
(for multi-line headers) is something like:

	FROM:
	 sean@conman.org
	to:
	 fred@example.com
	sUbJeCt:
	 This
		is
	 a subject line

	The body of the message goes here.
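
  The rules above (case-insensitive names, optional CR before the LF,
continuation lines, blank line before the body) can be sketched in pure Lua
roughly as follows.  This is not the code from [1]; parse_message and trim
are hypothetical names.  It only separates header names from their raw
values and joins continuation lines with a single space, which is a
simplification.  It does not parse the format of any individual header.

```lua
-- Strip leading and trailing whitespace.
local function trim(s)
  return (s:gsub("^%s+", ""):gsub("%s+$", ""))
end

-- Sketch only: split a raw message into headers and body.
-- headers[name] is a list of values, with name lowercased so that
-- "From", "FROM" and "fRoM" all end up under headers["from"].
local function parse_message(raw)
  local headers = {}
  local name, value            -- header currently being accumulated
  local pos = 1

  local function flush()
    if name then
      local list = headers[name] or {}
      list[#list + 1] = trim(value)
      headers[name] = list
    end
    name, value = nil, nil
  end

  while pos <= #raw do
    -- One line: everything up to a mandatory LF.
    local s, e, line = raw:find("^([^\n]*)\n", pos)
    if not s then              -- final line without a terminator
      line, e = raw:sub(pos), #raw
    end
    line = line:gsub("\r$", "")  -- drop the optional CR
    pos = e + 1

    if line == "" then         -- blank line: the body starts next
      break
    elseif name and line:find("^[ \t]") then
      -- Continuation line: append to the current header value.
      value = value .. " " .. trim(line)
    else
      flush()
      local n, v = line:match("^([^:%s]+)[ \t]*:[ \t]*(.*)$")
      if n then                -- lines that do not look like headers are skipped
        name, value = n:lower(), v
      end
    end
  end
  flush()

  return headers, raw:sub(pos)
end
```

  With the "worst case" message above, headers["subject"][1] comes out as
"This is a subject line" and the second return value is the body.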

  Also, each header has a specific format, which goes to explain why my code
is nearly 700 lines long (email addresses are particularly hairy to parse).

> TIA for any useful advice and hint.

  Parsing email with pure Lua---possible, but I wouldn't want to do it.
Converting character sets in pure Lua---theoretically possible, but good luck
finding pure Lua code to do that.
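
  For a sense of scale, about the only conversion that needs no lookup table
is ISO-8859-1 to UTF-8, because ISO-8859-1 byte values are the same as the
Unicode code points.  A sketch, with latin1_to_utf8 as a made-up name:

```lua
-- Sketch only: convert ISO-8859-1 (Latin-1) bytes to UTF-8.
-- Bytes below 128 are ASCII and pass through unchanged; each byte from
-- 128 to 255 becomes a two-byte UTF-8 sequence for the same code point.
local function latin1_to_utf8(s)
  return (s:gsub("[\128-\255]", function(c)
    local b = c:byte()
    return string.char(0xC0 + math.floor(b / 64), 0x80 + b % 64)
  end))
end
```

  This is not correct for Windows CP-1252, which assigns different
characters to the bytes 0x80 through 0x9F, and every other character set
needs its own mapping table.  That is where the work piles up.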

  -spc (There's a reason I used LPEG for this ... )

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/email.lua

[2]	There is a form of quoted-printable for use in headers (which is
	what I'm thinking of as I write this)---handling quoted-printable
	in the body is *not* a concern of my code, which mostly deals with
	headers.  Nor does it support the header form of
	quoted-printable.

[3]	I suppose I could, but *I* would require the use of iconv in
	addition to LPEG.  Also, not everyone follows the letter of the RFCs
	(headers are *supposed* to be ASCII-only).