From 339167a0c9b34b27a97e6e76b88adfc8d0f9232a Mon Sep 17 00:00:00 2001
From: Dan Winship <danw@src.gnome.org>
Date: Wed, 1 Mar 2000 19:37:47 +0000
Subject: add an Ibex whitepaper

svn path=/trunk/; revision=1999
---
 doc/white-papers/mail/ChangeLog |   4 +
 doc/white-papers/mail/ibex.sgml | 158 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 162 insertions(+)
 create mode 100644 doc/white-papers/mail/ibex.sgml

(limited to 'doc')
diff --git a/doc/white-papers/mail/ChangeLog b/doc/white-papers/mail/ChangeLog
index 6d4e8b7f8a..5933582d40 100644
--- a/doc/white-papers/mail/ChangeLog
+++ b/doc/white-papers/mail/ChangeLog
@@ -1,3 +1,7 @@
+2000-03-01  Dan Winship  <danw@helixcode.com>
+
+	* ibex.sgml: Ibex white paper
+
 2000-02-29  Dan Winship  <danw@helixcode.com>
 
 	* camel.sgml: Reorg a bit more, make the <PRE> section narrower,
diff --git a/doc/white-papers/mail/ibex.sgml b/doc/white-papers/mail/ibex.sgml
new file mode 100644
index 0000000000..dcb8f5ca4b
--- /dev/null
+++ b/doc/white-papers/mail/ibex.sgml
@@ -0,0 +1,158 @@
+<!doctype article PUBLIC "-//Davenport//DTD DocBook V3.0//EN" [
+<!entity Evolution "<application>Evolution</application>">
+<!entity Camel "Camel">
+<!entity Ibex "Ibex">
+]>
+
+<article class="whitepaper" id="ibex">
+
+  <artheader>
+    <title>Ibex: an Indexing System</title>
+
+    <authorgroup>
+      <author>
+	<firstname>Dan</firstname>
+	<surname>Winship</surname>
+	<affiliation>
+	  <address>
+	    <email>danw@helixcode.com</email>
+	  </address>
+	</affiliation>
+      </author>
+    </authorgroup>
+
+    <copyright>
+      <year>2000</year>
+      <holder>Helix Code, Inc.</holder>
+    </copyright>
+
+  </artheader>
+
+  <sect1 id="introduction">
+    <title>Introduction</title>
+
+    <para>
+      &Ibex; is a library for text indexing. It is being used by
+      &Camel; to allow it to quickly search locally-stored messages,
+      either because the user is looking for a specific piece of text,
+      or because the application is constructing a vFolder or filtering
+      incoming mail.
+    </para>
+  </sect1>
+
+  <sect1 id="goals">
+    <title>Design Goals and Requirements for Ibex</title>
+
+    <para>
+      The design of &Ibex; is based on a number of requirements.
+    </para>
+    <itemizedlist>
+      <listitem>
+        <para>
+	  First, obviously, it must be fast. In particular, searching
+	  the index must be appreciably faster than searching through
+	  the messages themselves, and constructing and maintaining
+	  the index must not take a noticeable amount of time.
+	</para>
+      </listitem>
+
+      <listitem>
+        <para>
+	  The indexes must not take up too much space. Many users have
+	  limited filesystem quotas on the systems where they read
+	  their mail, and even users who read mail on private machines
+	  have to worry about running out of space on their disks. The
+	  indexes should be able to do their job without taking up so
+	  much space that the user decides they would be better off
+	  without them.
+	</para>
+
+	<para>
+	  Another aspect of this problem is that the system as a whole
+	  must be clever about what it does and does not index:
+	  accidentally indexing a "text" mail message containing
+	  uuencoded, BinHexed, or PGP-encrypted data will drastically
+	  affect the size of the index file. Either the caller or the
+	  indexer itself has to avoid trying to index these sorts of
+	  things.
+	</para>
+      </listitem>
+
+      <listitem>
+        <para>
+	  The indexing system must allow data to be added to the index
+	  incrementally, so that new messages can be added to the
+	  index (and deleted messages can be removed from it) without
+	  having to re-scan all existing messages.
+	</para>
+      </listitem>
+
+      <listitem>
+        <para>
+	  It must allow the calling application to describe the
+	  structure of the data however it wants to, rather than
+	  requiring that the unit of indexing be individual files.
+	  This way, &Camel; can index a single mbox-format file and
+	  treat it as multiple messages.
+	</para>
+      </listitem>
+
+      <listitem>
+        <para>
+	  It must support non-ASCII text, given that many people send
+	  and receive non-English email, and even people who only
+	  speak English may receive email from people whose names
+	  cannot be written in the US-ASCII character set.
+	</para>
+      </listitem>
+    </itemizedlist>
+
+    <para>
+      While there are a number of existing indexing systems, none of
+      them met all (or even most) of our requirements.
+    </para>
+  </sect1>
+
+  <sect1 id="implementation">
+    <title>The Implementation</title>
+
+    <para>
+      &Ibex; is still young, and many of the details of the current
+      implementation are not yet finalized.
+    </para>
+
+    <para>
+      With the current index file format, 13 megabytes of Info files
+      can be indexed into a 371 kilobyte index file&mdash;a bit under
+      3% of the original size. This is reasonable, but making it
+      smaller would be nice. (The file format includes some simple
+      compression, but <application>gzip</application> can compress an
+      index file to about half its size, so we can clearly do better.)
+    </para>
+
+    <para>
+      The implementation has been profiled and optimized for speed to
+      some degree. However, it has so far only been run on a 500 MHz
+      Pentium III system with very fast disks, so we have no solid
+      benchmarks.
+    </para>
+
+    <para>
+      Further optimization (of both the file format and the in-memory
+      data structures) will wait until we see how &Evolution; can most
+      easily use the library: if the indexes are likely to be kept in
+      memory for long periods of time, the in-memory data structures
+      need to be kept small, but the reading and writing operations
+      can be slow. On the other hand, if the indexes will only be
+      opened when they are needed, reading and writing must be fast,
+      and memory usage is less critical.
+    </para>
+
+    <para>
+      Of course, to be useful for other applications that have
+      indexing needs, the library should provide several options, so
+      that each application can use the library in the way that is
+      most suited for its needs.
+    </para>
+  </sect1>
+</article>
-- 
cgit