- Subject: Re: Homemade email system using LuaSocket and LuaPOP3
- From: clemens fischer <ino-news@...>
- Date: Wed, 21 Sep 2011 00:38:29 +0200
Sean Conner wrote:
> Now, with that out of the way, a decent method of storing emails is one
> per file, and there's even a semi-standard for that. My preference is
> to take the Message-ID (if it doesn't exist, generate one), take a hash
> (SHA1, MD5, pick your favorite) and use that result as the basis for the
> directory/filename. I also store two versions of the headers and the body
> as separate files. For example:
> Message-ID: <email@example.com>
> This (I include the brackets since it's part of the message id) hashes to
> (I use MD5 since it was handy):
According to RFC 5322, a "Message-ID" contains the string including the
angle brackets. I say that because here (in bash):
$ md5sum <<< '<firstname.lastname@example.org>'
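For comparison outside the shell, here is a minimal Python sketch of
hashing only the bracketed ID. The header values and the regular
expression are illustrative assumptions (the pattern is much looser than
the RFC 5322 msg-id grammar). Note that a bash here-string appends a
newline, so the md5sum digest covers that extra byte, while this sketch
hashes the ID alone:

```python
import hashlib
import re

# Simplified pattern: anything between angle brackets without whitespace.
# The real msg-id grammar in RFC 5322 is stricter; this is an assumption.
MSG_ID = re.compile(r"<[^<>\s]+>")

def message_ids(header_value):
    """Return every bracketed message ID found in a header value."""
    return MSG_ID.findall(header_value)

def id_key(msg_id):
    """MD5 hex digest of the message ID, angle brackets included."""
    return hashlib.md5(msg_id.encode("utf-8")).hexdigest()

# Hypothetical headers, folded the way they might arrive on the wire.
message_id_header = "Message-ID:\r\n <abc123@mail.example>"
references_header = "References: <first@mail.example>\r\n <abc123@mail.example>"

# Key used when the message is stored ...
stored = id_key(message_ids(message_id_header)[0])
# ... and the keys computed later from a reply's References: header.
looked_up = [id_key(m) for m in message_ids(references_header)]
```

Because neither the header name nor the folding white space goes into
the digest, the key computed at storage time is the same one a later
lookup from "References:" or "In-Reply-To:" produces.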
Including the header name "Message-ID:" and any possible folding white
space will pose a problem when looking up IDs mentioned in
"References:" or "In-Reply-To:" headers.
> I then break the hash up into three components:
> fff6 c8c5 b7ae790d732d6cf50b8a5ff6
> The first two components become directories (I've found that too many files
> in a single directory has performance issues) and the third the basis for
> the filename. The base filename becomes the third portion of the hash plus
> the message ID (sans the brackets):
> I do this in case two email message IDs hash to the same value. With that,
> I create three files per email, the body, and two for headers. The first
> one for headers only contains the From:, To:, Date: and Subject: headers,
> which for me, are typically the only ones I'm interested in (say, for
> displaying purposes). The other headers file contains the full set of
> headers. So, this method creates:
> ,B = body of email message
> ,HF = full headers
> ,HS = From:, To:, Subject:, Date: headers only
> For "folders" of email, I use a text file that contains message IDs of
> emails for that "folder". The upside---an email message can be in multiple
> "folders" while maintaining a single copy of the email. The downside---I
> need to track the "folders" an email is in (probably with the use of another
> header, but I haven't gotten that far yet).
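The layout quoted above could be sketched roughly as follows in Python.
This is not Sean's code: the "-" between the hash remainder and the
message ID, the replacement of "/" in IDs, the "localhost" domain for
generated IDs, and the function names are all assumptions made for
illustration, and the input is assumed to use bare LF line endings:

```python
import hashlib
import os
from email import message_from_string
from email.utils import make_msgid

SHORT_HEADERS = ("From", "To", "Date", "Subject")

def base_path(root, msg_id):
    """Split the MD5 of the bracketed ID into dir/dir/filename parts."""
    digest = hashlib.md5(msg_id.encode("utf-8")).hexdigest()
    # Assumption: "/" is replaced so the ID is safe as a filename part.
    bare = msg_id.strip("<>").replace("/", "_")
    # Assumption: "-" separates the hash remainder from the ID.
    return os.path.join(root, digest[:4], digest[4:8], digest[8:] + "-" + bare)

def store(root, raw):
    """Write the ,B / ,HF / ,HS files for one message; return the base path."""
    head, _, body = raw.partition("\n\n")
    msg = message_from_string(head + "\n\n")
    # Generate an ID when the message has none, as the quoted text suggests.
    msg_id = (msg["Message-ID"] or make_msgid(domain="localhost")).strip()
    base = base_path(root, msg_id)
    os.makedirs(os.path.dirname(base), exist_ok=True)
    short = "".join(
        f"{name}: {msg[name]}\n" for name in SHORT_HEADERS if msg[name] is not None
    )
    for suffix, text in ((",B", body), (",HF", head + "\n"), (",HS", short)):
        with open(base + suffix, "w", encoding="utf-8") as f:
            f.write(text)
    return base
```

Calling store("/some/spool", raw_message) would create the two hash
directories as needed and leave three files next to each other that
differ only in their suffix.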
In leafnode (the small usenet server) this is solved using separate
directories handled in a similar way. The Message-IDs are hashed and
there's a directory spool/news/message.id/XXX/ containing entire
articles. "XXX" is a hash bucket named with three decimal digits.
The "folders" (usenet newsgroups) contain hard-links providing the
mapping between articles and possibly several newsgroups an article may
have been crossposted to.
This lets an email message be in multiple "folders" while maintaining
a single copy, and no separate database (your text file index) is
needed. It is much simpler to write tools that handle links than to
maintain a database.
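As a rough illustration of the hard-link idea (not leafnode's actual
code; the function names and directory layout are made up for this
sketch), filing a stored message into a "folder" is one link call, and
the link count tells you how many folders hold it. Hard links only work
within a single filesystem:

```python
import os

def file_into(folder_dir, stored_path):
    """Hard-link a stored message into a folder directory."""
    os.makedirs(folder_dir, exist_ok=True)
    link = os.path.join(folder_dir, os.path.basename(stored_path))
    if not os.path.exists(link):
        # Both names now refer to the same file; no data is copied.
        os.link(stored_path, link)
    return link

def folder_count(stored_path):
    """Number of folder links, i.e. the link count minus the stored copy."""
    return os.stat(stored_path).st_nlink - 1
```

Removing a message from a folder is then just unlinking that one name;
the data stays on disk while any other name still refers to it.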
>  Except for the header parsing---for that I use C code, and I'm still
> working on that.
Good luck with that. I have frequently tried to catch up with all the
variations of headers in "conforming" emails/articles and, of course,
the spammy ones. In addition to the ones you mention, here are some
tools doing MIME parsing:
They are possibly more "real world" than the strict ones you refer to,
especially DJB's mess822.