- Subject: Re: Request for advice: pure Lua Library to parse mail messages.
- From: Sean Conner <sean@...>
- Date: Tue, 21 Jul 2020 04:51:03 -0400
It was thus said that the Great Lorenzo Donati once stated:
> Thank you very much for all the hints!
You're welcome.
> On 20/07/2020 22:26, Sean Conner wrote:
> > There are quite a number of RFCs actually---I reference 14 different RFCs
> >in my code, and there might be new ones since I wrote the code.
>
> 14!? Ouch!
14. But given what you are trying to parse (order information from
Amazon), that number goes down quite a bit. Let's see ... at the most basic
level you need RFC-5322 for the general format for email headers, and
RFC-2045 to RFC-2049 for the MIME stuff, so half a dozen. Yes, it's a bit
of a slog, but it does explain the format.
Of the 14, some are older versions that RFC-5322 supersedes; one deals
with Usenet (which used a lot of email headers in addition to its own), a
few cover mailing-list headers (yes, those got their own RFCs), and the
rest define additional email headers added over the years.
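To give a feel for what RFC-5322 asks of a parser, here is a minimal
unfolding-and-splitting sketch (in Python for illustration, since none of
the Lua code under discussion is shown here; the sample headers are
invented, not from a real Amazon message):

```python
import re

# Invented sample: a folded Content-Type header, continued with a tab.
raw_headers = (
    "Subject: Your Amazon.com order\r\n"
    "Content-Type: multipart/alternative;\r\n"
    '\tboundary="----=_Part_example"\r\n'
    "MIME-Version: 1.0\r\n"
)

# RFC-5322 folding: a CRLF followed by whitespace continues the previous
# header line, so unfold by collapsing that sequence to a single space.
unfolded = re.sub(r"\r\n[ \t]+", " ", raw_headers)

headers = {}
for line in unfolded.splitlines():
    name, _, value = line.partition(":")
    headers[name.strip().lower()] = value.strip()

print(headers["content-type"])
# multipart/alternative; boundary="----=_Part_example"
```

Real messages add plenty of wrinkles (comments, encoded words, duplicate
headers), which is exactly why the RFCs run to so many pages.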
> Fortunately for my use case I don't need to handle any possible field
> and any possible format, since I would be parsing mails from a very
> specific sender, whose mails are automatically generated.
Then they'll stand a very good chance of being well formed. Thank God for
small favors.
> For (1) I think I could try and look for the `Content-Type:` field,
> which is always `multipart/alternative;` and contains a `boundary`
> placeholder which separates messages parts.
>
> Given that the message is automatically generated and doesn't seem to
> sport a lot of variation in its template, I guess a simple pattern
> search should be ok.
>
> Once the boundary marker has been inferred, I'd scan the rest of the
> message for the first part that has a text/plain content type.
>
> That seems reasonable (I hope).
At the very least I would scan through the RFCs I listed above. I have
some email from Amazon about some orders I placed earlier this year, and in
the header section I find the following three headers:
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_Part_16209100_796398164.1588360430199"
Content-Length: 1690
(not necessarily in that order, mind you!). You can see the Content-Type:
header contains the message boundary (and it doesn't always have to be
quoted---fun times, yo). Each section is then separated by the boundary
string prefixed with two hyphens ('--'), so the above boundary will
actually appear as:
------=_Part_16209100_796398164.1588360430199
and the final delimiter has '--' appended at the end as well, like this:
------=_Part_16209100_796398164.1588360430199--
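The boundary extraction and part splitting described above can be sketched
roughly like this (Python for illustration; the toy message and its
"simple-boundary" value are invented):

```python
import re

# Toy multipart message modeled on the headers quoted above.
message = (
    "MIME-Version: 1.0\r\n"
    'Content-Type: multipart/alternative; boundary="simple-boundary"\r\n'
    "\r\n"
    "preamble, ignored by MIME readers\r\n"
    "--simple-boundary\r\n"
    "Content-Type: text/plain; charset=utf-8\r\n"
    "\r\n"
    "Plain-text part.\r\n"
    "--simple-boundary\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<p>HTML part.</p>\r\n"
    "--simple-boundary--\r\n"
)

headers, _, body = message.partition("\r\n\r\n")

# The boundary parameter may be quoted or bare, so accept both forms.
m = re.search(r'boundary=(?:"([^"]+)"|([^;\s]+))', headers)
boundary = m.group(1) or m.group(2)

# Parts sit between "--boundary" delimiters; "--boundary--" ends the body.
chunks = body.split("--" + boundary)
# chunks[0] is the preamble; the trailing "--\r\n" follows the last split.
parts = [c.strip("\r\n") for c in chunks[1:-1]]

print(len(parts))                 # 2
print(parts[0].splitlines()[0])   # Content-Type: text/plain; charset=utf-8
```

A real parser would also have to worry about boundary strings that happen
to appear inside a part, CRLF-vs-LF line endings, and nested multiparts.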
That said, the message I have from Amazon only has one section, and thus
no boundary actually appears. Instead, the main body of the email contains
the following two headers for the one section:
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Somehow, I didn't get the quoted-printable formatting. Go figure (I guess
because it's all English text, which fits in the 7-bit ASCII range, so it's
not needed). It's valid variations like these that "quick-and-dirty"
parsing is prone to break on [1] (and yes, I can sympathize with you wanting
a library to handle all this for you).
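Handling the Content-Transfer-Encoding variations is at least mechanical;
a minimal sketch (Python's stdlib quopri and base64 modules, sample inputs
invented) looks something like:

```python
import base64
import quopri

def decode_body(body_bytes, encoding):
    """Decode a MIME part body per its Content-Transfer-Encoding.
    For 7bit, 8bit, and binary the bytes are already literal."""
    enc = encoding.strip().lower()
    if enc == "quoted-printable":
        return quopri.decodestring(body_bytes)
    if enc == "base64":
        return base64.b64decode(body_bytes)
    return body_bytes  # 7bit, 8bit, binary: no transformation needed

# "Caf=C3=A9" is the quoted-printable form of the UTF-8 bytes of "Café".
print(decode_body(b"Caf=C3=A9", "quoted-printable").decode("utf-8"))  # Café
print(decode_body(b"plain ASCII", "7bit").decode("ascii"))            # plain ASCII
```

The point is that a parser cannot just hand the raw bytes to the caller:
the same sender can switch encodings from message to message, as the
7bit-vs-quoted-printable surprise above shows.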
> (2) Is a biggie, though. Bear in mind that I don't need full UTF-8
> support, because the message part I'm looking for seems to contain only
> latin-1 characters, so they are all in the Unicode Basic Multilingual
> Plane (that's why I would be content with a CP-1252 encoding as well; I
> would prefer UTF-8 because it's nicer and interoperable, though :-)
It may very well be in UTF-8. The character-set encoding should be
listed in the headers somewhere.
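Pulling that character set out of the Content-Type value is a small job;
a sketch (Python, with invented sample header values) might look like:

```python
import re

def charset_of(content_type, default="us-ascii"):
    """Extract the charset parameter from a Content-Type header value.
    The parameter may be quoted or bare; text/* parts without an
    explicit charset default to us-ascii."""
    m = re.search(r'charset=(?:"([^"]+)"|([^;\s]+))',
                  content_type, re.IGNORECASE)
    return (m.group(1) or m.group(2)).lower() if m else default

print(charset_of("text/plain; charset=utf-8"))         # utf-8
print(charset_of('text/plain; charset="ISO-8859-1"'))  # iso-8859-1
print(charset_of("text/plain"))                        # us-ascii
```

With the charset in hand, the decoded body bytes can be turned into text,
which is where the UTF-8-vs-CP-1252 question would actually be settled.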
> Anyway, since the data I'm looking for is mostly numerical, I could also
> live with some data loss in the few textual data I need (if a product
> description contained, say, a Chinese character, I would happily skip
> it). So maybe I have hope there is some simpler library (or algorithm)
> that covers that.
I wish you well.
-spc
[1] I recently wrote an HTML parser using LPEG. I started out with a
"quick-n-dirty" one but quickly realized I was going to be worse off
than with a proper parser. So I broke out the DTD [2] for the
version of HTML I had to parse, and wrote one [3]. Works perfectly,
handles the optional closing tags (and the one opening tag). It
helped that all the HTML I need to parse is well formed and
validated.
[2] Document Type Definition
[3] Two actually---I started out using the re module from LPEG and that
hit some limitations, so I switched to actual LPEG.