[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: Homemade email system using LuaSocket and LuaPOP3
- From: Petite Abeille <petite.abeille@...>
- Date: Wed, 28 Sep 2011 09:04:07 +0200
On Sep 28, 2011, at 3:36 AM, Tim Channon wrote:
> Not mad, bad, as in why not.
>
> A suggestion which has a sound basis even if it is not obvious why.
>
> If you intend to store the emails on disk consider storing in sqlite, in
> other words a database and single disk file.
What would you know... this is exactly the road (rabbit hole?) I'm going down...
> The downside revolves around speed issues, such as search. Whether this
> actually matters is an opinion and on circumstances.
I found SQLite to be rather responsive. The full text search module is also very handy.
> There are almost no email clients or servers which store in database, an
> exception perhaps is Outlook Express (dbx)
Apple's Mail.app stores some sort of mail digest in -yes you guessed it- an embedded SQLite database. In fact, Cocoa provides direct support for SQLite as a persistence mechanism (see CoreData [1]).
> and a 'nix server the name of
> which escapes me, specific feature.
There are a couple such as dbmail [2], archiveopteryx [3], decimail [4], etc, etc. They tend to be on the, hmm, heavy duty side though. AT least for my taste :)
> A major why is about hammering general purpose computer resources, not good.
>
> Choice is usually either: store in a zillion individual files in the
> general purpose disk filing system, try a million messages; use some
> kind of single file where it is necessary to update small sections of
> the single file.
>
> Both have problems. The former crucifies file systems which are not
> designed for extreme numbers of small files, also leading to
> considerable allocation inefficiency. The latter, good example is mbox
> as used by Thunderbird where file locking problems have been endemic and
> corruption is known. (as mature these are fairly rare today)
> Any single file system is going to have compaction problems, in the case
> of t/b this hits realtime issues.
At the moment, I have a dual structure, using both the file system (to store the raw messages) and SQLite (to store the parsed structures and assist with queries).
For each mailbox:
(1) A mail directory
Stores the raw messages in a folder structure reflecting their 'Date:' header.
For example, a message with the header "Date: Sun, 25 Sep 2011 09:58:35 +0800", will be stored under '../1ur/9/p/177srlb.1.eml' (i.e. [YYYY]/[MM]/[DD]/[UnixEpoch].[Sequence].eml).
(2) An UID database file
The UID [5] database bridges the file system to the database representation. It records incoming messages and number them sequentially.
(3) A Mail database file
This stores the parsed out messages. It has three main entities:
(3.1) 'mail' to uniquely identify messages and hold their different artifacts together. This is keyed by UID.
(3.2) 'mail_part' holds the various MIME section information (curtesy of reformime [6])
(3.3) 'mail_header' holds the various header information related to each MIME parts.
All textual values are decoded to UTF-8 and stored in FTS virtual tables, to facilitate search.
Here is an extract of sqlite3_analyzer for a mail.db which contains all of lua-l:
Page size in bytes.................... 1024
Pages in the whole file (measured).... 140028
Pages in the whole file (calculated).. 140028
Pages that store data................. 140028 100.0%
Pages on the freelist (per header).... 0 0.0%
Pages on the freelist (calculated).... 0 0.0%
Pages of auto-vacuum overhead......... 0 0.0%
Number of tables in the database...... 28
Number of indices..................... 21
Number of named indices............... 4
Automatically generated indices....... 17
Size of the file in bytes............. 143388672
Bytes of user payload stored.......... 78487090 54.7%
MAIL_HEADER_TEXT_CONTENT.............. 38759 27.7%
MAIL_REFERENCE........................ 36770 26.3%
MAIL_HEADER........................... 19269 13.8%
MAIL_HEADER_TEXT_SEGMENTS............. 16710 11.9%
MAIL.................................. 10596 7.6%
MAIL_ADDRESS.......................... 7124 5.1%
MAIL_HEADER_TEXT_DOCSIZE.............. 6166 4.4%
MAIL_PART............................. 4169 3.0%
ADDRESS............................... 238 0.17%
ADDRESS_DOMAIN........................ 101 0.072%
TIME.................................. 69 0.049%
MAIL_HEADER_TEXT_SEGDIR............... 19 0.014%
SQLITE_MASTER......................... 13 0.009%
CONTENT_TYPE.......................... 6 0.004%
ADDRESS_TLD........................... 2 0.001%
CHARSET............................... 2 0.001%
CONTENT_TRANSFER...................... 2 0.001%
DATE.................................. 2 0.001%
HEADER................................ 2 0.001%
HEADER_TYPE........................... 2 0.001%
MAIL_PART_TEXT_SEGDIR................. 2 0.001%
MAIL_HEADER_TEXT_STAT................. 1 0.0%
MAIL_PART_TEXT_CONTENT................ 1 0.0%
MAIL_PART_TEXT_DOCSIZE................ 1 0.0%
MAIL_PART_TEXT_SEGMENTS............... 1 0.0%
MAIL_PART_TEXT_STAT................... 1 0.0%
In any case, this is very much a work in progress, so we will see how it goes in practice :)
> sqlite is stable and suitably permissively licenced. It uses a single
> file. The database files are OS agnostic therefore portable. Downsides
> are single task and a need for compaction.
Regarding concurrency, the new WAL journal mode (Write-Ahead Logging) is a real bone [7] .
> Compression can be used but
> this is possibly large and with licence issues.
> If you do a core, a gui could turn it into a mail client.
In this particular exercise, I'm planning a personal, self-organizing IMAP server of sort.
> I can think of some curiosities. Sqlite can actually be a bundle of
> database, is limited in number allowed to be joined as one.
Yes, one can attach up to 62 databases at the moment [8][9]. This is a very handy feature :)
> Possibility
> of database per email account or subject.
I have one database per, hmmm, "mailbox". In this context, a mailbox can be any grouping of messages that make some kind of logical sense.
> The header parsing problem has been mentioned. Too right, is quite a
> problem.
Yes, that's a bit tedious. For the time being, I have simply outsourced that to third party tools such as maildrop's reformail, reformime and D. J. Bernstein's mess822. Works rather well so far.
> In reality broken message formats are common, things go wrong
> or are that way. This is about mature software which can workaround reality.
Cheers,
[1] http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/CoreData/Articles/cdPersistentStores.html
[2] http://dbmail.org/
[3] http://archiveopteryx.org/
[4] http://decimail.org/server/
[5] Unique, sequential, identifier, in the IMAP sense of the term: http://tools.ietf.org/html/rfc3501#section-2.3.1.1
[6] http://www.courier-mta.org/reformime.html
[7] http://www.sqlite.org/wal.html
[8] http://www.sqlite.org/lang_attach.html
[9] See Maximum Number Of Attached Databases: http://www.sqlite.org/limits.html
[10] http://cr.yp.to/mess822.html