lua-users home
lua-l archive

Sean Conner wrote:

> ...
>  Now, with that out of the way, a decent method of storing emails is one
> per file, and there's even a semi-standard for that [3].  My preference is
> to take the Message-ID (if it doesn't exist, generate one), take a hash
> (SHA1, MD5, pick your favorite) and use that result as the basis for the
> directory/filename.  I also store two versions of the headers and the body
> as separate files.  For example:
>
>        Message-ID: <d1b.3e6bbea3.37310e87@aol.com>
>
>  This (I include the brackets since it's part of the message id) hashes to
> (I use MD5 since it was handy):
>
>        fff6c8c5b7ae790d732d6cf50b8a5ff6

According to RFC 5322, the syntax of a "Message-ID" includes the angle
brackets.  I mention that because here (in bash) I get a different hash:

  $ md5sum <<< '<d1b.3e6bbea3.37310e87@aol.com>'
  96a8888ea961b869c85919526e0ac48b  -

Including the header name "Message-ID:" and any folding white space in
the hashed string will pose a problem when looking up IDs mentioned in
"References:" or "In-Reply-To:" headers.
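
To make the point concrete, here is a small Python sketch (my own
illustration, not Sean's code; the labels and the helper name md5_hex
are made up).  It hashes three plausible inputs: the bare ID with
brackets, the ID plus the trailing newline that a bash here-string
appends (which is what the md5sum call above actually hashes), and the
whole header line.  Each input gives a different digest, so writer and
reader must agree on exactly which bytes are hashed.

```python
import hashlib

msg_id = "<d1b.3e6bbea3.37310e87@aol.com>"


def md5_hex(text):
    """Return the MD5 hex digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Three candidate inputs; only the first is the msg-id by itself.
candidates = {
    "bare msg-id": msg_id,
    "msg-id + newline": msg_id + "\n",          # what `md5sum <<< '...'` sees
    "whole header line": "Message-ID: " + msg_id,
}

digests = {label: md5_hex(text) for label, text in candidates.items()}
for label, digest in digests.items():
    print(f"{digest}  {label}")
```

Hashing only the bare ID keeps lookups from "References:" and
"In-Reply-To:" simple, since those headers carry the ID without the
header name.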

> ...
>  I then break the hash up into three components:
>
>        fff6 c8c5 b7ae790d732d6cf50b8a5ff6
>
>  The first two components become directories (I've found that too many files
> in a single directory has performance issues) and the third the basis for
> the filename.  The base filename becomes the third portion of the hash plus
> the message ID (sans the brackets):
>
>        b7ae790d732d6cf50b8a5ff6,d1b.3e6bbea3.37310e87@aol.com
>
>  I do this in case two email message IDs hash to the same value.  With that,
> I create three files per email, the body, and two for headers.  The first
> one for headers only contains the From:, To:, Date: and Subject: headers,
> which for me, are typically the only ones I'm interested in (say, for
> displaying purposes).  The other headers file contains the full set of
> headers.  So, this method creates:
>
>        fff6/c8c5/b7ae790d732d6cf50b8a5ff6,d1b.3e6bbea3.37310e87@aol.com,B
>        fff6/c8c5/b7ae790d732d6cf50b8a5ff6,d1b.3e6bbea3.37310e87@aol.com,HF
>        fff6/c8c5/b7ae790d732d6cf50b8a5ff6,d1b.3e6bbea3.37310e87@aol.com,HS
>
>        ,B = body of email message
>        ,HF = full headers
>        ,HS = From:, To:, Subject:, Date: headers only
>
>  For "folders" of email, I use a text file that contains message IDs of
> emails for that "folder".  The upside---an email message can be in multiple
> "folders" while maintaining a single copy of the email.  The downside---I
> need to track the "folders" an email is in (probably with the use of another
> header, but I haven't gotten that far yet).

In leafnode (the small usenet server) this is solved using separate
directories handled in a similar way.  The Message-IDs are hashed and
there's a directory spool/news/message.id/XXX/ containing entire
articles.  "XXX" is a hash bucket named with three decimal digits.

The "folders" (usenet newsgroups) contain hard-links providing the
mapping between articles and possibly several newsgroups an article may
have been crossposted to.

This solves the problem of an email message being in multiple "folders"
while maintaining a single copy, and no separate database (your text
file index) is needed.  It is much simpler to write tools that handle
links than to maintain a database.
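
A rough Python sketch of the hard-link idea, assuming a POSIX
filesystem.  The bucket function (MD5 modulo 1000) is only a stand-in
and is not leafnode's real hash; the folder names, the link name "1"
and the helpers store() and file_into() are likewise invented for the
example.

```python
import hashlib
import os
import tempfile


def bucket_for(msg_id):
    # Hypothetical bucket function (NOT leafnode's): MD5 modulo 1000,
    # formatted as three decimal digits.
    value = int(hashlib.md5(msg_id.encode("ascii")).hexdigest(), 16)
    return f"{value % 1000:03d}"


def store(spool, msg_id, article):
    """Write the single real copy under message.id/XXX/."""
    bucket_dir = os.path.join(spool, "message.id", bucket_for(msg_id))
    os.makedirs(bucket_dir, exist_ok=True)
    # Brackets are dropped from the filename, as in Sean's scheme.
    path = os.path.join(bucket_dir, msg_id.strip("<>"))
    with open(path, "w") as f:
        f.write(article)
    return path


def file_into(spool, folder, name, stored_path):
    """Make the message appear in a folder by hard-linking it."""
    folder_dir = os.path.join(spool, folder)
    os.makedirs(folder_dir, exist_ok=True)
    link = os.path.join(folder_dir, name)
    os.link(stored_path, link)
    return link


spool = tempfile.mkdtemp()
msg_id = "<d1b.3e6bbea3.37310e87@aol.com>"
stored = store(spool, msg_id, "Subject: test\n\nbody\n")
links = [file_into(spool, folder, "1", stored) for folder in ("inbox", "lua-l")]

# One inode, three names: the stored file plus one link per folder.
print(os.stat(stored).st_nlink)
print(all(os.path.samefile(stored, link) for link in links))
```

Removing a message from a folder is then just an unlink of that
folder's name; the data goes away once the last link is removed.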

> ...
> [4]     Except for the header parsing---for that I use C code, and I'm still
>        working on that.

Good luck with [4].  I have frequently tried to catch up with all the
variations of headers in "conforming" emails/articles and, of course,
the spammy ones.  In addition to the ones you mention, here are some
tools doing MIME parsing:

http://www.ivarch.com/programs/qsf/
http://bogofilter.sourceforge.net/

They are possibly more "real world" than the strict ones you refer to
(DJB's mess822 in particular).


clemens