- Subject: Profiling LPEG
- From: Sean Conner <sean@...>
- Date: Tue, 25 Apr 2023 20:08:25 -0400
So I have this LPEG module that parses email [1], but it's mainly written
using the 're' module and not directly in LPEG. This makes it pretty easy
to parse emails:
  local dump      = require "org.conman.table".dump
  local parse     = require "org.conman.parsers.email"
  local emailtext = ... -- whatever to get the raw email headers
  local msg       = parse:match(emailtext)
  dump("msg",msg) -- function to dump a Lua table
What I'm trying to do is generalize this way more. I don't always need to
parse every defined header, and if I need to parse non-email headers, like
HTTP or SIP, I have to pretty much start over. So I'm rethinking the
structure and I have something that works, even if it looks more
intimidating. Each header is now its own Lua module, and classified per
RFC (so for example, the From: header is defined in RFC-05322 [2], but
there's an update in RFC-06854). This will now look like:
  local dump    = require "org.conman.table".dump
  local headers = require "org.conman.parsers.headers.RFC-05322.Return-Path"
                + require "org.conman.parsers.headers.RFC-05322.Received"
                + require "org.conman.parsers.headers.RFC-05322.Date"
                + require "org.conman.parsers.headers.RFC-05322.From"
                + require "org.conman.parsers.headers.RFC-05322.Sender"
                + require "org.conman.parsers.headers.RFC-05322.Reply-To"
                + require "org.conman.parsers.headers.RFC-05322.To"
                + require "org.conman.parsers.headers.RFC-05322.Cc"
                + require "org.conman.parsers.headers.RFC-05322.Bcc"
                + require "org.conman.parsers.headers.RFC-05322.Message-ID"
                + require "org.conman.parsers.headers.RFC-05322.In-Reply-To"
                + require "org.conman.parsers.headers.RFC-05322.References"
                + require "org.conman.parsers.headers.RFC-05322.Subject"
                + require "org.conman.parsers.headers.RFC-05322.Comments"
                + require "org.conman.parsers.headers.RFC-05322.Keywords"
                + require "org.conman.parsers.headers.RFC-05322.Resent-Date"
                + require "org.conman.parsers.headers.RFC-05322.Resent-From"
                + require "org.conman.parsers.headers.RFC-05322.Resent-Sender"
                + require "org.conman.parsers.headers.RFC-05322.Resent-To"
                + require "org.conman.parsers.headers.RFC-05322.Resent-Cc"
                + require "org.conman.parsers.headers.RFC-05322.Resent-Bcc"
                + require "org.conman.parsers.headers.RFC-05322.Resent-Message-ID"
                + require "org.conman.parsers.headers.RFC-02369.List-Archive"
                + require "org.conman.parsers.headers.RFC-02369.List-Help"
                + require "org.conman.parsers.headers.RFC-02369.List-Unsubscribe"
                + require "org.conman.parsers.headers.RFC-02369.List-Post"
                + require "org.conman.parsers.headers.RFC-02369.List-Owner"
                + require "org.conman.parsers.headers.RFC-02369.List-Subscribe"
                + require "org.conman.parsers.headers.RFC-02919.List-Id"
                + require "org.conman.parsers.headers.RFC-08058.List-Unsubscribe-Post"
                + require "org.conman.parsers.headers.generic"
  local parse     = require "org.conman.parsers.headers.parse"(headers)
  local emailtext = ...
  local msg       = parse:match(emailtext)
  dump("msg",msg)
"org.conman.parsers.headers.generic" is the module to parse a non-specified header,
and the module "org.conman.parsers.headers.parse" returns a pattern to
return the headers in a table. So if you don't need to parse the various
List-*: headers, you can exclude them and not waste time parsing data you
don't care to. These modules are all written in plain LPEG [3]. This also
means if you want to parse older versions of the headers, you can:
  local headers = require "org.conman.parsers.headers.RFC-05322.Date"
                + require "org.conman.parsers.headers.RFC-00724.Date"
                + require "org.conman.parsers.headers.RFC-00680.Date"
will allow you to parse the Date headers according to said RFCs:
RFC-05322 Date: Thu, 20 Apr 2023 16:45:48 -0400
RFC-00724 Date: Thursday, 20 April 2023 1645-EDT
RFC-00680 Date: 20 APR 2023 at 1645-EDT
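Joining header modules with `+` works when each module returns an LPEG pattern, since `+` is ordered choice. Here is a minimal sketch of that idea. It is not the code of the modules named above: the `subject`, `from`, and `generic` patterns and the `make_parser` function are hypothetical stand-ins, and header names are matched case-sensitively to keep it short.

```lua
local lpeg = require "lpeg"
local P, R, C, Cc, Cg, Cf, Ct =
      lpeg.P, lpeg.R, lpeg.C, lpeg.Cc, lpeg.Cg, lpeg.Cf, lpeg.Ct

-- End of line (LF or CRLF) and the rest of a header line as a capture.
local eol   = P"\r"^-1 * P"\n"
local value = P" "^0 * C((P(1) - eol)^0) * eol

-- Hypothetical per-header "modules": each yields a name/value group.
local subject = P"Subject:" * Cg(Cc"Subject" * value)
local from    = P"From:"    * Cg(Cc"From"    * value)

-- Hypothetical fallback for any header not listed explicitly.
local generic = Cg(C((R("AZ","az","09") + P"-")^1) * P":" * value)

-- Ordered choice: specific headers are tried first, the fallback last.
local headers = subject + from + generic

-- Hypothetical builder: fold every name/value group into one table.
local function make_parser(hdrs)
  return Cf(Ct"" * hdrs^1, rawset)
end

local parse = make_parser(headers)
local msg   = parse:match("From: alice@example.com\n"
                       .. "Subject: Profiling LPEG\n"
                       .. "X-Note: hi\n")
```

Dropping a term from `headers` means lines for that header fall through to `generic`, so they are still collected but not parsed in detail.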
So far, so good. The new method works, but there's one issue---it's about
twice as slow as the one I wrote using the 're' module. It's not the
loading of a bazillion modules, since when testing my 're' module, I still
"require" all the above modules, in addition to "org.conman.parsers.email".
But I'm at a loss as to how to go about profiling the code. There are three
things that confound the issue---the Lua VM, the Lua code itself, and the
LPEG VM. Does anyone have any ideas how I would go about profiling the code?
I can provide the code upon request (warning: it's over 60 files).
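One possible starting point for separating the Lua side from the LPEG side, offered as a sketch under assumptions rather than a tested recipe for this code: time each parser with `os.clock`, then use a count hook to see where Lua VM instructions are spent. The helper names `time_it` and `sample` are hypothetical. The count hook fires only on Lua VM instructions, so time spent inside a C function such as an LPEG match is not sampled, apart from Lua functions called back during the match.

```lua
-- Return the CPU seconds used by calling `fn` `iterations` times.
local function time_it(fn, iterations)
  collectgarbage()               -- start each run from a similar heap state
  local start = os.clock()
  for _ = 1, iterations do fn() end
  return os.clock() - start
end

-- Tally the source lines that are executing each time the count hook
-- fires (every `step` Lua VM instructions) while `fn` runs.
local function sample(fn, step)
  local counts = {}
  debug.sethook(function()
    -- Level 2 is the function that was running when the hook fired.
    local info = debug.getinfo(2, "Sl")
    if info then
      local key = info.short_src .. ":" .. tostring(info.currentline)
      counts[key] = (counts[key] or 0) + 1
    end
  end, "", step or 1000)
  local ok, err = pcall(fn)
  debug.sethook()                -- always remove the hook again
  if not ok then error(err, 0) end
  return counts
end
```

Running `time_it` on both parsers with the same input gives the overall ratio. If `sample` then records few hits for the slower parser, most of the time is probably being spent inside the match itself rather than in Lua code.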
Any help or ideas would be appreciated. Thanks.
-spc
[1] https://github.com/spc476/LPeg-Parsers/blob/master/email.lua
[2] I'm now switching to five-digit RFC numbers, given that any day now
RFC-10000 will be released.
[3] And are mostly a few lines long---three or four lines for loading
various parsing modules, then pretty much one line for the header.
I'm still not sure how I feel about such small modules and possibly
flooding LuaRocks with tons of these. I'm still experimenting.