lua-users home
lua-l archive


It was thus said that the Great Lorenzo Donati once stated:
> Thank you very much for all the hints!

  You're welcome.

> On 20/07/2020 22:26, Sean Conner wrote:
> >  There are quite a number of RFCs actually---I reference 14 different RFCs
> >in my code, and there might be new ones since I wrote the code.
> 14!? Ouch!

  14.  But given what you are trying to parse (order information from
Amazon) that number goes down quite a bit.  Let's see ... at the most basic
level you need RFC-5322 for the general format for email headers, and
RFC-2045 through RFC-2049 for the MIME stuff, so half a dozen.  Yes, it's a
bit of a slog, but it does explain the format.

  Of the 14, some are older versions that RFC-5322 obsoletes, one deals
with Usenet (which used a lot of email headers in addition to its own), a
few cover mailing list headers (yes, they got their own RFCs), and the rest
define additional email headers added over the years.
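
To make the header format concrete, here's a rough Lua sketch (the function
name is mine, and it ignores most of the RFC-5322 grammar) of unfolding and
splitting a header block into name/value pairs:

```lua
-- Minimal sketch: split an RFC-5322 header block into name/value pairs.
-- Assumes `raw` holds just the header section (everything before the
-- first blank line); a real parser needs the full RFC-5322 grammar.
local function parse_headers(raw)
  local headers = {}
  -- unfold continuation lines: a line break followed by whitespace is
  -- folding whitespace and belongs to the previous header
  raw = raw:gsub("\r?\n([ \t])", "%1")
  for line in raw:gmatch("[^\r\n]+") do
    local name, value = line:match("^([%w%-]+):%s*(.*)$")
    if name then
      headers[name:lower()] = value   -- header names are case-insensitive
    end
  end
  return headers
end

local h = parse_headers("MIME-Version: 1.0\r\nContent-Type: multipart/alternative;\r\n boundary=XYZZY\r\n")
print(h["content-type"])  --> multipart/alternative; boundary=XYZZY
```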

> Fortunately for my use case I don't need to handle any possible field 
> and any possible format, since I would be parsing mails from a very 
> specific sender, whose mails are automatically generated.

  Then they'll stand a very good chance of being well formed.  Thank God for
small favors.

> For (1) I think I could try and look for the `Content-Type:` field, 
> which is always `multipart/alternative;` and contains a `boundary` 
> placeholder which separates messages parts.
> Given that the message is automatically generated and doesn't seem to 
> sport a lot of variation in its template, I guess a simple pattern 
> search should be ok.
> Once the boundary marker has been inferred, I'd scan the rest of the 
> message for the first part that has a text/plain content type.
> That seems reasonable (I hope).
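
  For the boundary-extraction step, a hedged Lua sketch (the parameter
*name* is case-insensitive per RFC 2045 but the *value* is not, so it
matches the name with character classes instead of lowercasing the whole
string):

```lua
-- Sketch: pull the boundary parameter out of a Content-Type value.
-- Tries the quoted form first, then the unquoted one.
local function get_boundary(content_type)
  return content_type:match('[Bb][Oo][Uu][Nn][Dd][Aa][Rr][Yy]%s*=%s*"([^"]*)"')
      or content_type:match('[Bb][Oo][Uu][Nn][Dd][Aa][Rr][Yy]%s*=%s*([^%s;"]+)')
end

print(get_boundary('multipart/alternative; boundary="XYZZY"'))  --> XYZZY
print(get_boundary('multipart/mixed; BOUNDARY=abc123'))         --> abc123
```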

  At the very least I would scan through the RFCs I listed above.  I have
some email from Amazon about some orders I placed earlier this year, and in
the header section I find the following three headers:

MIME-Version: 1.0
Content-Type: multipart/alternative; 
Content-Length: 1690

(not necessarily in that order mind you!).  You can see the Content-Type:
carries the message boundary parameter (and it doesn't always have to be
quoted---fun times, yo).  Each section will then be separated by the
boundary string, prefixed with two '--'.  So a boundary of, say, XYZZY will
actually appear as:

--XYZZY

and at the end, it will appear with two '--' at the end, like this:

--XYZZY--
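
A naive Lua sketch of splitting on those delimiter lines (hedged: it
assumes the preamble and epilogue can be discarded, and it uses a plain
find so pattern-magic characters in the boundary are inert):

```lua
-- Sketch: split a multipart body into its parts, given the boundary.
-- Delimiter lines are "--" .. boundary; the closing delimiter has a
-- trailing "--" but starts the same way, so it terminates the last
-- part for free.  Each part keeps its trailing newline.
local function split_parts(body, boundary)
  local delim = "--" .. boundary
  local marks, pos = {}, 1
  while true do
    local s, e = body:find(delim, pos, true)  -- plain find, no patterns
    if not s then break end
    marks[#marks + 1] = { s = s, e = e }
    pos = e + 1
  end
  local parts = {}
  for i = 1, #marks - 1 do
    -- a part runs from just past this delimiter's line break up to the
    -- character before the next delimiter
    local nl = body:find("\n", marks[i].e, true)
    if not nl then break end
    parts[#parts + 1] = body:sub(nl + 1, marks[i + 1].s - 1)
  end
  return parts
end

local body = "preamble\n--XYZZY\npart one\n--XYZZY\npart two\n--XYZZY--\n"
print(#split_parts(body, "XYZZY"))  --> 2
```
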
  That said, the message I have from Amazon only has one section and thus,
no boundary actually appears.  Instead, the main body of the email contains
the following two headers for the one section:

Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

  Somehow, I didn't get the quoted-printable formatting.  Go figure (I
guess because it's all English text, which fits in the 7-bit ASCII range,
so it's not needed).  It's valid variations like these that
"quick-and-dirty" parsing is prone to break on [1] (and yes, I can
sympathize with you wanting a library to handle all this for you).
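
In case a future message does use quoted-printable, a minimal Lua decoder
sketch (soft line breaks and =XX hex escapes only, per RFC 2045):

```lua
-- Sketch of a quoted-printable decoder: "=XX" is a hex-encoded byte,
-- and "=" at the end of a line is a soft line break to be removed.
local function decode_qp(s)
  s = s:gsub("=\r?\n", "")                      -- soft line breaks
  s = s:gsub("=(%x%x)", function(hex)
    return string.char(tonumber(hex, 16))
  end)
  return s
end

print(decode_qp("caf=C3=A9 =\r\nau lait"))  --> café au lait (UTF-8 bytes)
```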

> (2) Is a biggie, though. Bear in mind that I don't need full UTF-8 
> support, because the message part I'm looking for seems to contain only 
> latin-1 characters, so they are all in the Unicode Basic Multilingual 
> Plane (that's why I would be content with a CP-1252 encoding as well; I 
> would prefer UTF-8 because it's nicer and interoperable, though :-)

  It may very well be in UTF-8.  It should have the character set encoding
listed in the headers somewhere.

> Anyway, since the data I'm looking for is mostly numerical, I could also 
> live with some data loss in the few textual data I need (if a product 
> description contained, say, a chinese character, I would happily skip 
> it). So maybe I have hope there is some simpler library (or algorithm) 
> that covers that.
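
  If it does turn out to be UTF-8 and you only care about the Latin-1
range, the lossy down-conversion you describe is short in Lua 5.3+ (a
sketch, assuming valid UTF-8 input; anything above U+00FF is silently
dropped):

```lua
-- Sketch: walk the code points of a UTF-8 string and keep only those
-- in the Latin-1 range (U+0000..U+00FF), dropping everything else
-- (e.g. a Chinese character).  Requires the Lua 5.3 utf8 library.
local function to_latin1_lossy(s)
  local out = {}
  for _, cp in utf8.codes(s) do
    if cp <= 0xFF then
      out[#out + 1] = string.char(cp)
    end
  end
  return table.concat(out)
end

print(to_latin1_lossy("price: 42\u{20AC}"))  --> price: 42 (euro sign dropped)
```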

  I wish you well.


[1]	I recently wrote an HTML parser using LPEG.  I started out with a
	"quick-n-dirty" one but quickly realized I was going to be worse off
	than with a proper parser.  So I broke out the DTD [2] for the
	version of HTML I had to parse, and wrote one [3].  Works perfectly,
	handles the optional closing tags (and the one opening tag).  It
	helped that all the HTML I need to parse is well formed.

[2]	Document Type Definition

[3]	Two actually---I started out using the re module from LPEG, but that
	hit some limitations, so I switched to actual LPEG.