lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


On Sep 28, 2011, at 3:36 AM, Tim Channon wrote:

> Not mad, bad, as in why not.
> 
> A suggestion which has a sound basis even if it is not obvious why.
> 
> If you intend to store the emails on disk consider storing in sqlite, in
> other words a database and single disk file.

What would you know... this is exactly the road (rabbit hole?) I'm going down...

> The downside revolves around speed issues, such as search.  Whether this
> actually matters is an opinion and on circumstances.

I found SQLite to be rather responsive. The full text search module is also very handy.

> There are almost no email clients or servers which store in database, an
> exception perhaps is Outlook Express (dbx)

Apple's Mail.app stores some sort of mail digest in -yes you guessed it- an embedded SQLite database. In fact, Cocoa provides direct support for SQLite as a persistence mechanism (see CoreData [1]).

> and a 'nix server the name of
> which escapes me, specific feature.

There are a couple such as dbmail [2], archiveopteryx [3], decimail [4], etc, etc. They tend to be on the, hmm, heavy duty side though. AT least for my taste :)

> A major why is about hammering general purpose computer resources, not good.
> 
> Choice is usually either: store in a zillion individual files in the
> general purpose disk filing system, try a million messages; use some
> kind of single file where it is necessary to update small sections of
> the single file.
> 
> Both have problems. The former crucifies file systems which are not
> designed for extreme numbers of small files, also leading to
> considerable allocation inefficiency. The latter, good example is mbox
> as used by Thunderbird where file locking problems have been endemic and
> corruption is known. (as mature these are fairly rare today)
> Any single file system is going to have compaction problems, in the case
> of t/b this hits realtime issues.

At the moment, I have a dual structure, using both the file system (to store the raw messages) and SQLite (to store the parsed structures  and assist with queries).

For each mailbox:

(1) A mail directory

Stores the raw messages in a folder structure reflecting their 'Date:' header.

For example, a message with the header "Date: Sun, 25 Sep 2011 09:58:35 +0800", will be stored under '../1ur/9/p/177srlb.1.eml' (i.e. [YYYY]/[MM]/[DD]/[UnixEpoch].[Sequence].eml).

(2) An UID database file

The UID [5]  database bridges the file system to the database representation. It records incoming messages and number them sequentially.

(3) A Mail database file

This stores the parsed out messages. It has three main entities:

(3.1) 'mail' to uniquely identify messages and hold their different artifacts together. This is keyed by UID.

(3.2) 'mail_part' holds the various MIME section information (curtesy of reformime [6])

(3.3) 'mail_header' holds the various header information related to each MIME parts.

All textual values are decoded to UTF-8 and stored in FTS virtual tables, to facilitate search.

Here is an extract of sqlite3_analyzer for a mail.db which contains all of lua-l:

Page size in bytes.................... 1024      
Pages in the whole file (measured).... 140028    
Pages in the whole file (calculated).. 140028    
Pages that store data................. 140028     100.0% 
Pages on the freelist (per header).... 0            0.0% 
Pages on the freelist (calculated).... 0            0.0% 
Pages of auto-vacuum overhead......... 0            0.0% 
Number of tables in the database...... 28        
Number of indices..................... 21        
Number of named indices............... 4         
Automatically generated indices....... 17        
Size of the file in bytes............. 143388672 
Bytes of user payload stored.......... 78487090    54.7% 

MAIL_HEADER_TEXT_CONTENT.............. 38759       27.7% 
MAIL_REFERENCE........................ 36770       26.3% 
MAIL_HEADER........................... 19269       13.8% 
MAIL_HEADER_TEXT_SEGMENTS............. 16710       11.9% 
MAIL.................................. 10596        7.6% 
MAIL_ADDRESS.......................... 7124         5.1% 
MAIL_HEADER_TEXT_DOCSIZE.............. 6166         4.4% 
MAIL_PART............................. 4169         3.0% 
ADDRESS............................... 238          0.17% 
ADDRESS_DOMAIN........................ 101          0.072% 
TIME.................................. 69           0.049% 
MAIL_HEADER_TEXT_SEGDIR............... 19           0.014% 
SQLITE_MASTER......................... 13           0.009% 
CONTENT_TYPE.......................... 6            0.004% 
ADDRESS_TLD........................... 2            0.001% 
CHARSET............................... 2            0.001% 
CONTENT_TRANSFER...................... 2            0.001% 
DATE.................................. 2            0.001% 
HEADER................................ 2            0.001% 
HEADER_TYPE........................... 2            0.001% 
MAIL_PART_TEXT_SEGDIR................. 2            0.001% 
MAIL_HEADER_TEXT_STAT................. 1            0.0% 
MAIL_PART_TEXT_CONTENT................ 1            0.0% 
MAIL_PART_TEXT_DOCSIZE................ 1            0.0% 
MAIL_PART_TEXT_SEGMENTS............... 1            0.0% 
MAIL_PART_TEXT_STAT................... 1            0.0% 

 In any case, this is very much a work in progress, so we will see how it goes in practice :)

> sqlite is stable and suitably permissively licenced. It uses a single
> file. The database files are OS agnostic therefore portable. Downsides
> are single task and a need for compaction.

Regarding concurrency, the new WAL journal mode (Write-Ahead Logging) is a real bone [7] .

> Compression can be used but
> this is possibly large and with licence issues.
> If you do a core, a gui could turn it into a mail client.

In this particular exercise, I'm planning a personal, self-organizing IMAP server of sort.

> I can think of some curiosities. Sqlite can actually be a bundle of
> database, is limited in number allowed to be joined as one.

Yes, one can attach up to 62 databases at the moment [8][9]. This is a very handy feature :)

> Possibility
> of database per email account or subject.

I have one database per, hmmm, "mailbox". In this context, a mailbox can be any grouping of messages that make some kind of logical sense. 

> The header parsing problem has been mentioned. Too right, is quite a
> problem.

Yes, that's a bit tedious. For the time being, I have simply outsourced that to third party tools such as maildrop's reformail, reformime and D. J. Bernstein's mess822. Works rather well so far.

> In reality broken message formats are common, things go wrong
> or are that way. This is about mature software which can workaround reality.

Cheers,


[1] http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/CoreData/Articles/cdPersistentStores.html
[2] http://dbmail.org/
[3] http://archiveopteryx.org/
[4] http://decimail.org/server/
[5] Unique, sequential, identifier, in the IMAP sense of the term: http://tools.ietf.org/html/rfc3501#section-2.3.1.1
[6] http://www.courier-mta.org/reformime.html
[7] http://www.sqlite.org/wal.html
[8] http://www.sqlite.org/lang_attach.html
[9] See Maximum Number Of Attached Databases: http://www.sqlite.org/limits.html
[10] http://cr.yp.to/mess822.html