- Subject: Re: Homemade email system using LuaSocket and LuaPOP3
- From: clemens fischer <ino-news@...>
- Date: Wed, 21 Sep 2011 00:38:29 +0200
Sean Conner wrote:
> ...
> Now, with that out of the way, a decent method of storing emails is one
> per file, and there's even a semi-standard for that [3]. My preference is
> to take the Message-ID (if it doesn't exist, generate one), take a hash
> (SHA1, MD5, pick your favorite) and use that result as the basis for the
> directory/filename. I also store two versions of the headers and the body
> as separate files. For example:
>
> Message-ID: <d1b.3e6bbea3.37310e87@aol.com>
>
> This (I include the brackets since it's part of the message id) hashes to
> (I use MD5 since it was handy):
>
> fff6c8c5b7ae790d732d6cf50b8a5ff6
According to RFC 5322, a message ID is the string including the angle
brackets. I mention this because I get a different hash here (in bash;
note that the here-string also appends a trailing newline to the input):
$ md5sum <<< '<d1b.3e6bbea3.37310e87@aol.com>'
96a8888ea961b869c85919526e0ac48b -
Including the header name "Message-ID:" or any folding white space in
the hashed string would pose a problem when looking up IDs mentioned in
"References:" or "In-Reply-To:" headers, because those headers carry
only the bare <...> form.
> ...
> I then break the hash up into three components:
>
> fff6 c8c5 b7ae790d732d6cf50b8a5ff6
>
> The first two components become directories (I've found that too many files
> in a single directory causes performance issues) and the third the basis for
> the filename. The base filename becomes the third portion of the hash plus
> the message ID (sans the brackets):
>
> b7ae790d732d6cf50b8a5ff6,d1b.3e6bbea3.37310e87@aol.com
>
> I do this in case two email message IDs hash to the same value. With that,
> I create three files per email, the body, and two for headers. The first
> one for headers only contains the From:, To:, Date: and Subject: headers,
> which for me, are typically the only ones I'm interested in (say, for
> displaying purposes). The other headers file contains the full set of
> headers. So, this method creates:
>
> fff6/c8c5/b7ae790d732d6cf50b8a5ff6,d1b.3e6bbea3.37310e87@aol.com,B
> fff6/c8c5/b7ae790d732d6cf50b8a5ff6,d1b.3e6bbea3.37310e87@aol.com,HF
> fff6/c8c5/b7ae790d732d6cf50b8a5ff6,d1b.3e6bbea3.37310e87@aol.com,HS
>
> ,B = body of email message
> ,HF = full headers
> ,HS = From:, To:, Subject:, Date: headers only
>
> For "folders" of email, I use a text file that contains message IDs of
> emails for that "folder". The upside---an email message can be in multiple
> "folders" while maintaining a single copy of the email. The downside---I
> need to track the "folders" an email is in (probably with the use of another
> header, but I haven't gotten that far yet).
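
The file layout quoted above could be sketched in Python roughly as
follows. This is an illustration under assumptions, not Sean's actual
implementation: the function name message_paths and the root parameter
are invented here, and MD5 is used only because the quoted text does.
Message IDs containing "/" would need escaping before being used in a
filename, which this sketch does not attempt.

```python
import hashlib
import os

def message_paths(message_id, root="."):
    """Map a Message-ID (with brackets) to its three file paths.

    The 32-digit MD5 hex digest is split 4 + 4 + 24: the first two
    pieces become directories, the third starts the filename.
    """
    digest = hashlib.md5(message_id.encode("utf-8")).hexdigest()
    dir1, dir2, rest = digest[:4], digest[4:8], digest[8:]
    # The ID without brackets is appended so that two IDs with the
    # same hash still get distinct filenames.
    # Caveat: an ID containing "/" is NOT escaped here.
    base = os.path.join(root, dir1, dir2, rest + "," + message_id.strip("<>"))
    # B = body, HF = full headers, HS = short headers
    return {suffix: base + "," + suffix for suffix in ("B", "HF", "HS")}

paths = message_paths("<d1b.3e6bbea3.37310e87@aol.com>", root="mail")
```

The brackets are part of the hashed string but not of the filename,
matching the description in the quoted text.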
In leafnode (the small usenet server) this is solved using separate
directories handled in a similar way. The Message-IDs are hashed and
there's a directory spool/news/message.id/XXX/ containing entire
articles. "XXX" is a hash bucket named with three decimal digits.
The "folders" (usenet newsgroups) contain hard links providing the
mapping between articles and the possibly several newsgroups an article
may have been crossposted to.
This solves the problem of an email message being in multiple "folders"
while maintaining a single copy, and no separate database (your text
file index) is needed. It is much simpler to write tools that handle
links than to maintain a database.
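
As a rough illustration of the hard-link idea (a sketch only, not
leafnode's code), the following Python stores one copy of a message and
links it into two "folders". The function name file_into_folder, the
directory names, and the choice to reuse the stored filename inside the
folder are all assumptions made for this example. Hard links require
the store and the folders to live on the same filesystem.

```python
import os
import tempfile

def file_into_folder(article_path, folder_dir):
    """Hard-link an already stored article into a folder directory."""
    os.makedirs(folder_dir, exist_ok=True)
    link_path = os.path.join(folder_dir, os.path.basename(article_path))
    if not os.path.exists(link_path):
        os.link(article_path, link_path)  # new name, same file on disk
    return link_path

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    article = os.path.join(root, "message.id", "example-article")
    os.makedirs(os.path.dirname(article))
    with open(article, "w") as f:
        f.write("Subject: test\n\nbody\n")

    inbox = file_into_folder(article, os.path.join(root, "inbox"))
    lists = file_into_folder(article, os.path.join(root, "lists"))

    # Both folder entries refer to the very same file as the store.
    same_file = (os.path.samefile(article, inbox)
                 and os.path.samefile(article, lists))
    # One name in the store plus one per folder.
    link_count = os.stat(article).st_nlink
```

Removing a message from a folder is then just an unlink of that folder's
name; the data stays on disk as long as any name still refers to it.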
> ...
> [4] Except for the header parsing---for that I use C code, and I'm still
> working on that.
Good luck with [4]. I have frequently tried to catch up with all the
variations of headers in "conforming" emails/articles and, of course,
the spammy ones. In addition to the ones you mention, here are some
tools doing MIME parsing:
http://www.ivarch.com/programs/qsf/
http://bogofilter.sourceforge.net/
They are possibly more "real world" than the strict ones you refer to,
especially DJB's mess822.
clemens