lua-users home
lua-l archive


It was thus said that the Great clemens fischer once stated:
> Sean Conner wrote:
> 
> > ...
> >  Now, with that out of the way, a decent method of storing emails is one
> > per file, and there's even a semi-standard for that [3].  My preference is
> > to take the Message-ID (if it doesn't exist, generate one), take a hash
> > (SHA1, MD5, pick your favorite) and use that result as the basis for the
> > directory/filename.  I also store two versions of the headers and the body
> > as separate files.  For example:
> >
> >        Message-ID: <d1b.3e6bbea3.37310e87@aol.com>
> >
> >  This (I include the brackets since it's part of the message id) hashes to
> > (I use MD5 since it was handy):
> >
> >        fff6c8c5b7ae790d732d6cf50b8a5ff6
> 
> According to RFC-5322 a "Message-ID" contains the string including the
> angle brackets.  I say that because here (in bash):
> 
>   $ md5sum <<< '<d1b.3e6bbea3.37310e87@aol.com>'
>   96a8888ea961b869c85919526e0ac48b  -

  Actually, looking over the code (and it's a mess, what with a dozen
different half-finished versions) I'm not sure exactly what I was using as
input to the hash, but I do know it was consistent at least.
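
  One thing to watch when comparing digests like the two quoted above: a
bash here-string (`<<<`) appends a trailing newline, so md5sum sees the ID
plus "\n" rather than the bare ID.  A minimal sketch (the variable names are
mine, and this is not a claim about which input produced either digest
above) that hashes both forms so they can be compared:

```python
import hashlib

message_id = "<d1b.3e6bbea3.37310e87@aol.com>"

# Hash of the ID exactly as written, angle brackets included.
bare = hashlib.md5(message_id.encode("ascii")).hexdigest()

# Hash of the ID plus a trailing newline, which is what a bash
# here-string hands to md5sum.
with_newline = hashlib.md5((message_id + "\n").encode("ascii")).hexdigest()

print("bare:        ", bare)
print("with newline:", with_newline)
```

  Whichever form is chosen, it has to be the same one every time, otherwise
lookups by Message-ID will miss.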

> Including the header name "Message-ID:" and any possibly folding white
> space will pose a problem when looking up ID's mentioned in
> "References:" or "In-Reply-To:" headers.

  The actual message ID appears in angle brackets---anything else is *not*
the message ID (but the header can contain other stuff, not usually, but the
older the email, the more likely it'll be ... um ... interesting).
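
  As a rough illustration of "only the bracketed part counts", here is a
sketch that pulls the IDs out of a "References:" or "In-Reply-To:" value.
The regular expression is a deliberate simplification and an assumption on
my part, not a full RFC 5322 parser; sufficiently odd old headers will
defeat it.

```python
import re

# Anything between '<' and '>' with no whitespace or nested brackets.
# This is a simplification, not the RFC 5322 msg-id grammar.
MSGID_RE = re.compile(r"<[^<>\s]+>")

def message_ids(header_value):
    """Return every bracketed ID in a header value, brackets included."""
    return MSGID_RE.findall(header_value)

refs = "<a@example.com> (some comment)\n\t<b@example.org>"
print(message_ids(refs))
```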

> >  I then break the hash up into three components:
> >
> >        fff6 c8c5 b7ae790d732d6cf50b8a5ff6

  I found a later version that broke the hash up thusly:

	fff 6c8 c5b7ae790d732d6cf50b8a5ff6

  I did that because the former (with four hex-digits) could lead to
directories with up to 65,536 entries, whereas with the latter (three
hex-digits) you would end up with directories with only (only!) 4,096
entries, a figure I find much more manageable.
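
  The split itself is just string slicing.  A small sketch (the function
name and the `width` parameter are made up for illustration) that produces
either layout, along with the arithmetic for the directory fan-out:

```python
def split_hash(hex_digest, width=3):
    """Split a hex digest into (dir1, dir2, filename) components."""
    return (
        hex_digest[:width],
        hex_digest[width:2 * width],
        hex_digest[2 * width:],
    )

digest = "fff6c8c5b7ae790d732d6cf50b8a5ff6"

three = split_hash(digest)           # later layout
four = split_hash(digest, width=4)   # earlier layout

print("/".join(three), "fan-out per level:", 16 ** 3)
print("/".join(four), "fan-out per level:", 16 ** 4)
```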

> The "folders" (usenet newsgroups) contain hard-links providing the
> mapping between articles and possibly several newsgroups an article may
> have been crossposted to.
> 
> This solves the "an email message can be in multiple "folders" while
> maintaining a single copy" problem while no separate database (your text
> file index) is needed.  It is way simpler to make tools handling links
> than to keep a database.

  But without a separate database (my text file), there is no way of knowing
which "folders" an individual message resides in.  For instance, one can
delete a message from a folder, or one can delete a message from all
folders (and thus, remove it entirely).
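
  To make the two kinds of delete concrete, here is an in-memory sketch of
such an index.  The class and method names are hypothetical, and it says
nothing about the on-disk format of my text file; it only shows why the
message-to-folders mapping is needed.

```python
from collections import defaultdict

class FolderIndex:
    """Maps a message key (e.g. its hash) to the folders that hold it."""

    def __init__(self):
        self._folders = defaultdict(set)

    def add(self, key, folder):
        self._folders[key].add(folder)

    def folders_of(self, key):
        return sorted(self._folders.get(key, ()))

    def remove_from_folder(self, key, folder):
        """Drop one folder entry; return True if no folder holds it now."""
        remaining = self._folders.get(key)
        if remaining is None:
            return True
        remaining.discard(folder)
        if not remaining:
            del self._folders[key]
            return True
        return False

    def remove_everywhere(self, key):
        """Forget the message; return the folders it used to be in."""
        return sorted(self._folders.pop(key, ()))

idx = FolderIndex()
idx.add("fff6c8c5", "inbox")
idx.add("fff6c8c5", "lua-l")
print(idx.folders_of("fff6c8c5"))
```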

> > ...
> > [4]     Except for the header parsing---for that I use C code, and I'm still
> >        working on that.
> 
> Good luck with [4].  I have frequently tried to catch up with all the
> variations of headers in "conforming" emails/articles and, of course,
> the spammy ones.  In addition to the ones you mention, here are some
> tools doing MIME parsing:

  I'm close.  I just need to finish some rewriting, as I changed directions
in the actual parsing.  In the first draft, everything had to be in memory.
That doesn't work well for handling email as it comes in over the network,
so I needed a stream-based interface, but I didn't want to lose the ability
to handle a memory-mapped email, and much rewriting ensued.  I also need
further clarification of the various headers (and just how messed up they
can be).
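
  The general idea of one parser serving both a stream and an in-memory
message can be sketched by writing it against "any iterable of lines".
This is not my C code, just an illustration in Python; the function names
are invented, and unfolding is simplified to joining continuation lines
with a single space.

```python
import io

def _split(line):
    name, _, value = line.partition(":")
    return name.strip(), value.strip()

def iter_headers(lines):
    """Yield (name, value) pairs from an iterable of header lines.

    Stops at the blank line that separates headers from the body.
    """
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            break
        if line[0] in " \t" and current is not None:
            # Folded continuation of the previous header.
            current += " " + line.strip()
            continue
        if current is not None:
            yield _split(current)
        current = line
    if current is not None:
        yield _split(current)

text = "Message-ID: <a@x>\r\nReferences: <b@y>\r\n\t<c@z>\r\n\r\nbody\r\n"

from_stream = list(iter_headers(io.StringIO(text)))      # file-like source
from_memory = list(iter_headers(text.splitlines(True)))  # whole message in memory
print(from_stream)
```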

> http://www.ivarch.com/programs/qsf/
> http://bogofilter.sourceforge.net/
> 
> They are possibly more "real world" than the strict ones you refer to,
> especially DJB's mess822.

  I have personal email going back to 1993; I have even older emails going
back to the mid-80s (from archives), so I have plenty of "real world"
examples to go by.

  -spc (also, I found this bug in Lua http://www.lua.org/bugs.html#5.1.4-6
	due to my email project).