From 339167a0c9b34b27a97e6e76b88adfc8d0f9232a Mon Sep 17 00:00:00 2001 From: Dan Winship Date: Wed, 1 Mar 2000 19:37:47 +0000 Subject: add an Ibex whitepaper svn path=/trunk/; revision=1999 --- doc/white-papers/mail/ChangeLog | 4 + doc/white-papers/mail/ibex.sgml | 158 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 162 insertions(+) create mode 100644 doc/white-papers/mail/ibex.sgml (limited to 'doc') diff --git a/doc/white-papers/mail/ChangeLog b/doc/white-papers/mail/ChangeLog index 6d4e8b7f8a..5933582d40 100644 --- a/doc/white-papers/mail/ChangeLog +++ b/doc/white-papers/mail/ChangeLog @@ -1,3 +1,7 @@ +2000-03-01 Dan Winship + + * ibex.sgml: Ibex white paper + 2000-02-29 Dan Winship * camel.sgml: Reorg a bit more, make the
 section narrower,
diff --git a/doc/white-papers/mail/ibex.sgml b/doc/white-papers/mail/ibex.sgml
new file mode 100644
index 0000000000..dcb8f5ca4b
--- /dev/null
+++ b/doc/white-papers/mail/ibex.sgml
@@ -0,0 +1,158 @@
+Evolution">
+
+
+]>
+
+
+ + + Ibex: an Indexing System + + + + Dan + Winship + +
+ danw@helixcode.com +
+
+
+
+ + + 2000 + Helix Code, Inc. + + +
+ + + Introduction + + + &Ibex; is a library for text indexing. It is being used by + &Camel; to allow it to quickly search locally-stored messages, + either because the user is looking for a specific piece of text, + or because the application is contructing a vFolder or filtering + incoming mail. + + + + + Design Goals and Requirements for Ibex + + + The design of &Ibex; is based on a number of requirements. + + + + + First, obviously, it must be fast. In particular, searching + the index must be appreciably faster than searching through + the messages themselves, and constructing and maintaining + the index must not take a noticeable amount of time. + + + + + + The indexes must not take up too much space. Many users have + limited filesystem quotas on the systems where they read + their mail, and even users who read mail on private machines + have to worry about running out of space on their disks. The + indexes should be able to do their job without taking up so + much space that the user decides he would be better off + without them. + + + + Another aspect of this problem is that the system as a whole + must be clever about what it does and does not index: + accidentally indexing a "text" mail message containing + uuencoded, BinHexed, or PGP-encrypted data will drastically + affect the size of the index file. Either the caller or the + indexer itself has to avoid trying to index these sorts of + things. + + + + + + The indexing system must allow data to be added to the index + incrementally, so that new messages can be added to the + index (and deleted messages can be removed from it) without + having to re-scan all existing messages. + + + + + + It must allow the calling application to explain the + structure of the data however it wants to, rather than + requiring that the unit of indexing be individual files. + This way, &Camel; can index a single mbox-format file and + treat it as multiple messages. + + + + + + It must support non-ASCII text, given that many people send + and receive non-English email, and even people who only + speak English may receive email from people whose names + cannot be written in the US-ASCII character set. + + + + + + While there are a number of existing indexing systems, none of + them met all (or even most) of our requirements. + + + + + The Implementation + + + &Ibex; is still young, and many of the details of the current + implementation are not yet finalized. + + + + With the current index file format, 13 megabytes of Info files + can be indexed into a 371 kilobyte index file—a bit under + 3% of the original size. This is reasonable, but making it + smaller would be nice. (The file format includes some simple + compression, but gzip can compress an + index file to about half its size, so we can clearly do better.) + + + + The implementation has been profiled and optimized for speed to + some degree. But, it has so far only been run on a 500MHz + Pentium III system with very fast disks, so we have no solid + benchmarks. + + + + Further optimization (of both the file format and the in-memory + data structures) awaits seeing how the library is most easily + used by &Evolution;: if the indexes are likely to be kept in + memory for long periods of time, the in-memory data structures + need to be kept small, but the reading and writing operations + can be slow. On the other hand, if the indexes will only be + opened when they are needed, reading and writing must be fast, + and memory usage is less critical. + + + + Of course, to be useful for other applications that have + indexing needs, the library should provide several options, so + that each application can use the library in the way that is + most suited for its needs. + + +
-- cgit