  So I have an LPEG module that parses email [1], but it's mainly written
using the 're' module and not directly in LPEG.  This makes it pretty easy
to parse emails:

	local dump  = require "org.conman.table".dump
	local parse = require "org.conman.parsers.email"

	local emailtext = ... -- whatever to get the raw email headers
	local msg       = parse:match(emailtext)
	dump("msg",msg)	-- function to dump a Lua table

  What I'm trying to do is generalize this way more.  I don't always need to
parse every defined header, and if I need to parse non-email headers, like
HTTP or SIP, I have to pretty much start over.  So I'm rethinking the
structure and I have something that works, even if it looks more
intimidating.  Each header is now its own Lua module, and classified per
RFC (so for example, the From: header is defined in RFC-05322 [2], but
there's an update in RFC-06854).  This will now look like:

        local dump    = require "org.conman.table".dump
        local headers = require "org.conman.parsers.headers.RFC-05322.Return-Path"
                      + require "org.conman.parsers.headers.RFC-05322.Received"
                      + require "org.conman.parsers.headers.RFC-05322.Date"
                      + require "org.conman.parsers.headers.RFC-05322.From"
                      + require "org.conman.parsers.headers.RFC-05322.Sender"
                      + require "org.conman.parsers.headers.RFC-05322.Reply-To"
                      + require "org.conman.parsers.headers.RFC-05322.To"
                      + require "org.conman.parsers.headers.RFC-05322.Cc"
                      + require "org.conman.parsers.headers.RFC-05322.Bcc"
                      + require "org.conman.parsers.headers.RFC-05322.Message-ID"
                      + require "org.conman.parsers.headers.RFC-05322.In-Reply-To"
                      + require "org.conman.parsers.headers.RFC-05322.References"
                      + require "org.conman.parsers.headers.RFC-05322.Subject"
                      + require "org.conman.parsers.headers.RFC-05322.Comments"
                      + require "org.conman.parsers.headers.RFC-05322.Keywords"
                      + require "org.conman.parsers.headers.RFC-05322.Resent-Date"
                      + require "org.conman.parsers.headers.RFC-05322.Resent-From"
                      + require "org.conman.parsers.headers.RFC-05322.Resent-Sender"
                      + require "org.conman.parsers.headers.RFC-05322.Resent-To"
                      + require "org.conman.parsers.headers.RFC-05322.Resent-Cc"
                      + require "org.conman.parsers.headers.RFC-05322.Resent-Bcc"
                      + require "org.conman.parsers.headers.RFC-05322.Resent-Message-ID"
                      + require "org.conman.parsers.headers.RFC-02369.List-Archive"
                      + require "org.conman.parsers.headers.RFC-02369.List-Help"
                      + require "org.conman.parsers.headers.RFC-02369.List-Unsubscribe"
                      + require "org.conman.parsers.headers.RFC-02369.List-Post"
                      + require "org.conman.parsers.headers.RFC-02369.List-Owner"
                      + require "org.conman.parsers.headers.RFC-02369.List-Subscribe"
                      + require "org.conman.parsers.headers.RFC-02919.List-Id"
                      + require "org.conman.parsers.headers.RFC-08058.List-Unsubscribe-Post"
                      + require "org.conman.parsers.headers.generic"
        local parse   = require "org.conman.parsers.headers.parse"(headers)
	
	local emailtext = ...
	local msg       = parse:match(emailtext)
	dump("msg",msg)

"org.conman.parsers.headers.generic" is the module to parse a non-specified header,
and the module "org.conman.parsers.headers.parse" returns a pattern to
return the headers in a table.  So if you don't need to parse the various
List-*: headers, you can exclude them and not waste time parsing data you
don't care about.  These modules are all written in plain LPEG [3].  This also
means if you want to parse older versions of the headers, you can:

	local headers = require "org.conman.parsers.headers.RFC-05322.Date"
		      + require "org.conman.parsers.headers.RFC-00724.Date"
		      + require "org.conman.parsers.headers.RFC-00680.Date"


will allow you to parse the Date headers according to said RFCs:

	RFC-05322	Date: Thu, 20 Apr 2023 16:45:48 -0400
	RFC-00724	Date: Thursday, 20 April 2023 1645-EDT
	RFC-00680	Date: 20 APR 2023 at 1645-EDT
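
  To make the composition concrete, here is a minimal sketch of the idea in
plain LPEG.  It is not the actual module code: the helper header(), the
make_parser() wrapper and the date patterns are hypothetical stand-ins, and
the date grammars are heavily simplified compared to the RFCs.  Each header
is a named group capture, the headers are combined with ordered choice (+),
and a table capture collects whatever matched:

```lua
local lpeg = require "lpeg"
local P, R, S, C, Cg, Ct = lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.Cg, lpeg.Ct

local WSP   = S" \t"^0
local EOL   = P"\r"^-1 * P"\n"
local digit = R"09"
local alpha = R("AZ","az")

-- Simplified date bodies (assumptions, NOT the full RFC grammars).
-- "Thu, 20 Apr 2023 16:45:48 -0400"
local date_new = alpha^3 * "," * " " * digit^1 * " " * alpha^3 * " " * digit^4
               * " " * digit^2 * ":" * digit^2 * ":" * digit^2
               * " " * S"+-" * digit^4
-- "20 APR 2023 at 1645-EDT"
local date_old = digit^1 * " " * alpha^3 * " " * digit^4
               * " at " * digit^4 * "-" * alpha^1

-- Hypothetical helper: one header line becomes the field t[name] = body.
local function header(name, body)
  return Cg(P(name) * ":" * WSP * C(body) * EOL, name)
end

local Date_new = header("Date", date_new)
local Date_old = header("Date", date_old)
local Subject  = header("Subject", (P(1) - EOL)^0)

-- Stand-in for the generic module: consume an unrecognized line, capture nothing.
local generic  = (P(1) - EOL)^1 * EOL

-- Stand-in for the parse module: collect all matched headers into one table.
local function make_parser(headers)
  return Ct(headers^1)
end

local text = "Subject: test\n"
          .. "X-Unknown: whatever\n"
          .. "Date: 20 APR 2023 at 1645-EDT\n"

-- Both Date formats enabled: the old-style date is recognized.
local msg  = make_parser(Date_new + Date_old + Subject + generic):match(text)
-- Old format left out: the line falls through to the generic pattern.
local slim = make_parser(Date_new + Subject + generic):match(text)

print(msg.Subject, msg.Date, slim.Date)
```

  Leaving a pattern out of the sum means that header is never tried, so its
line is simply swallowed by the generic fallback.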

  So far, so good.  The new method works, but there's one issue---it's about
twice as slow as the one I wrote using the 're' module.  It's not the
loading of a bazillion modules, since when testing my 're' module, I still
"require" all the above modules, in addition to "org.conman.parsers.email". 
But I'm at a loss as to how to go about profiling the code.  There are three
things that confound the issue---the Lua VM, the Lua code itself, and the
LPEG VM. Does anyone have any ideas how I would go about profiling the code?
I can provide the code upon request (warning:  it's over 60 files).
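
  For reference, a coarse comparison like the "twice as slow" figure can be
made with a plain os.clock() harness along these lines.  This is only a
sketch: bench() is a hypothetical helper, and the two functions being timed
are stand-ins using Lua string patterns so the example runs on its own.  In
practice each would wrap a call such as parse:match(emailtext).  It measures
matching only, after everything has been loaded, and it cannot tell time
spent in Lua from time spent in the LPEG VM:

```lua
-- Call fn(...) `iterations` times and return the average CPU seconds per call.
local function bench(fn, iterations, ...)
  collectgarbage("collect")        -- start each run from a similar heap state
  local start = os.clock()
  for _ = 1, iterations do
    fn(...)
  end
  return (os.clock() - start) / iterations
end

-- Stand-ins for the two parsers under test (assumptions for illustration).
local text = "Subject: test\n"
local function parse_a(t) return t:match("^Subject:%s*(.-)\n") end
local function parse_b(t) return t:find("Subject:", 1, true) end

local ta = bench(parse_a, 100000, text)
local tb = bench(parse_b, 100000, text)
print(string.format("a: %.3g s/call   b: %.3g s/call", ta, tb))
```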

  Any help or ideas would be appreciated.  Thanks.

  -spc

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/email.lua

[2]	I'm now switching to five-digit RFC numbers, given that any day now
	RFC-10000 will be released.

[3]	And are mostly a few lines long---three or four lines for loading
	various parsing modules, then pretty much one line for the header. 
	I'm still not sure how I feel about such small modules and possibly
	flooding LuaRocks with tons of these.  I'm still experimenting.