On 21/07/2020 10:51, Sean Conner wrote:
It was thus said that the Great Lorenzo Donati once stated:
Thank you very much for all the hints!

  You're welcome.

On 20/07/2020 22:26, Sean Conner wrote:
 There are quite a number of RFCs actually---I reference 14 different RFCs
in my code, and there might be new ones since I wrote the code.

14!? Ouch!

  14.  But given what you are trying to parse (order information from
Amazon) that number goes down quite a bit.  Let's see ... at the most basic
level you need RFC-5322 for the general format for email headers, and
RFC-2045 to RFC-2049 for the MIME stuff, so half a dozen.  Yes, it's a bit
of a slog, but it does explain the format.

  Of the 14, some are older versions that RFC-5322 updates; there's one
dealing with Usenet (which used a lot of email headers in addition to its
own), some for mailing list headers (yes, they got their own RFCs), and
some additional email headers added over the years.

Fortunately, for my use case I don't need to handle every possible field
and format, since I would be parsing mails from a very specific sender,
whose mails are automatically generated.

  Then they'll stand a very good chance of being well formed.  Thank God for
small favors.

For (1) I think I could try and look for the `Content-Type:` field,
which is always `multipart/alternative;` and contains a `boundary`
parameter whose value separates the message parts.

Given that the message is automatically generated and doesn't seem to
sport a lot of variation in its template, I guess a simple pattern
search should be ok.

Once the boundary marker has been inferred, I'd scan the rest of the
message for the first part that has a text/plain content type.

That seems reasonable (I hope).
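
Something like this minimal Lua sketch is what I have in mind (the
patterns here are guesses based on the few samples I've looked at,
not anything RFC-compliant):

-- Grab the boundary parameter (quoted or not) from the raw message,
-- then walk the "--boundary" delimiter lines and return the body of
-- the first part whose headers declare text/plain.
local function first_plain_part(msg)
  local boundary = msg:match('boundary="([^"]+)"')
                or msg:match('boundary=([^%s;]+)')
  if not boundary then return nil, "no boundary found" end

  local delim = "--" .. boundary
  local _, e  = msg:find(delim, 1, true)    -- plain find, no patterns
  while e do
    local s2 = msg:find(delim, e + 1, true)
    if not s2 then break end                -- ran past the final delimiter
    local part = msg:sub(e + 1, s2 - 1)
    if part:lower():find("content%-type:%s*text/plain") then
      return part:match("\r?\n\r?\n(.*)")   -- body after the part headers
    end
    _, e = msg:find(delim, s2, true)
  end
  return nil, "no text/plain part found"
end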

  At the very least I would scan through the RFCs I listed above.

I'll try when I have time. Too optimistically (*grin*) I had hoped the mail
format was simple enough not to require so much specification reading in
advance. :-)

I have
some email from Amazon about some orders I placed earlier this year, and in
the header section I find the following three headers:

MIME-Version: 1.0
Content-Type: multipart/alternative;
        boundary="----=_Part_16209100_796398164.1588360430199"
Content-Length: 1690


The first two match exactly what I've got; Content-Length is missing.

(not necessarily in that order mind you!).  You can see the Content-Type:
contains the message boundary (and it doesn't always have to be quoted---fun
times, yo).  Each section will then be separated by the boundary string,
prefixed with two dashes ('--').  So the above boundary will actually
appear as:

------=_Part_16209100_796398164.1588360430199

and the final delimiter will have an additional '--' appended, like this:

------=_Part_16209100_796398164.1588360430199--


That was something I already inferred by visually scanning some of those messages. Thanks for confirming my guess.
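
For my own sanity, a tiny sketch of that check in Lua (using the boundary
from your example above):

local boundary = "----=_Part_16209100_796398164.1588360430199"
local delim    = "--" .. boundary

local function classify(line)
  if     line == delim .. "--" then return "final"
  elseif line == delim         then return "separator"
  else                              return "body"
  end
end

assert(classify("------=_Part_16209100_796398164.1588360430199")   == "separator")
assert(classify("------=_Part_16209100_796398164.1588360430199--") == "final")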

  That said, the message I have from Amazon only has one section and thus
no boundary actually appears.  Instead, the main body of the email contains
the following two headers for the one section:

Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

  Somehow, I didn't get the quoted-printable formatting.  Go figure (I
guess because it's all English text, which fits in the 7-bit ASCII range,
so the encoding isn't needed).

Yes, most probably it's because my messages are in Italian. I have:

Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
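
If I end up decoding that myself, quoted-printable at least looks
tractable; a minimal decoder sketch covering just the "=XX" hex escapes
and the soft line breaks of RFC 2045, nothing more:

local function decode_qp(s)
  s = s:gsub("=\r?\n", "")                 -- drop soft line breaks
  return (s:gsub("=(%x%x)", function(hex)  -- expand =XX hex escapes
    return string.char(tonumber(hex, 16))
  end))
end

print(decode_qp("perch=C3=A9"))  --> "perché" (as UTF-8 bytes)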


It's valid variations like these that "quick-and-dirty"
parsing is prone to break on [1] (and yes, I can sympathize with you wanting
a library to handle all this for you).

(2) is a biggie, though. Bear in mind that I don't need full UTF-8
support, because the message part I'm looking for seems to contain only
Latin-1 characters, so they are all in the Unicode Basic Multilingual
Plane (that's why I would be content with a CP-1252 encoding as well; I
would prefer UTF-8 because it's nicer and more interoperable, though :-)

  It may very well be in UTF-8.  It should have the character set encoding
listed in the headers somewhere.

Anyway, since the data I'm looking for is mostly numerical, I could also
live with some data loss in the little textual data I need (if a product
description contained, say, a Chinese character, I would happily skip
it). So maybe there is hope that some simpler library (or algorithm)
covers that.
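
Something along these lines would be enough for that lossy fallback
(Lua 5.3+, and it assumes the body has already been decoded to plain
UTF-8):

local function to_latin1_lossy(s)
  local out = {}
  for _, cp in utf8.codes(s) do           -- cp is the Unicode code point
    out[#out + 1] = cp < 256 and string.char(cp) or "?"
  end
  return table.concat(out)
end

print(to_latin1_lossy("perché 10€"))  --> Latin-1 bytes for "perché 10?"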

  I wish you well.

Thanks!


  -spc

[1]	I recently wrote an HTML parser using LPEG.  I started out with a
	"quick-n-dirty" one but quickly realized I was going to be worse off
	than with a proper parser.  So I broke out the DTD [2] for the
	version of HTML I had to parse, and wrote one [3].  Works perfectly,
	handles the optional closing tags (and the one opening tag).  It
	helped that all the HTML I need to parse is well formed and
	validated.

I wish I had time to learn to use LPEG. I gave it a go a couple of times in
the past decade, but its theoretical background is way over my head to be
"grokked" in a couple of days. I have little formal education in compiler
and grammar theory, and I realize that a firm understanding of how a formal
grammar "behaves" would really help in understanding LPEG and how to use it
for practical tasks.

I /can/ read the EBNF form of a grammar and reason about it in a practical
way, but I really can't /design/ a grammar to do what I want, and that would
help a lot in using LPEG effectively, I guess.
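
For instance, I can see how a simple EBNF rule translates almost
mechanically into LPEG (a quick sketch):

local lpeg = require "lpeg"
local P, R = lpeg.P, lpeg.R

-- number = digit , { digit } ;
local number = R"09"^1
-- list = "(" , number , { "," , number } , ")" ;
local list   = P"(" * number * (P"," * number)^0 * P")"

assert(list:match("(1,22,333)"))
assert(not list:match("(1,)"))

It's going from a blank page to a grammar of that shape (and getting the
choices and predicates right) that stumps me.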

So every time I gave up for lack of time and forgot almost everything I had
learned. It has quite a steep learning curve, alas. I also tried a small
tutorial written by Gavin Wright (IIRC), but it wasn't enough to bring me to
that "AHA!" moment where you really grasp how to use the tool effectively.



[2]	Document Type Definition

[3]	Two actually---I started out using the re module from LPEG and that
	hit some limitations, so I switched to actual LPEG.


-- Lorenzo