-rw-r--r--  camel/ChangeLog                     5
-rw-r--r--  camel/devel-docs/camel-index.txt  407
2 files changed, 412 insertions, 0 deletions
diff --git a/camel/ChangeLog b/camel/ChangeLog
index c2b7d9d9c3..fbfe9c60ce 100644
--- a/camel/ChangeLog
+++ b/camel/ChangeLog
@@ -1,3 +1,8 @@
+2002-04-19 Not Zed <NotZed@Ximian.com>
+
+ * devel-docs/camel-index.txt: Start of a white-paperish document
+ describing camel-index and older libibex.
+
2002-04-18 Not Zed <NotZed@Ximian.com>
* providers/local/camel-local-store.c (rename_folder): If we get a
diff --git a/camel/devel-docs/camel-index.txt b/camel/devel-docs/camel-index.txt
new file mode 100644
index 0000000000..0ebc250db0
--- /dev/null
+++ b/camel/devel-docs/camel-index.txt
@@ -0,0 +1,407 @@
+
+The history of text indexing in Evolution.
+
+CamelTextIndex was written to address several shortcomings in the
+existing libibex (referred to as libibex2), which had been written to
+address shortcomings in the original libibex.
+
+Mail indexing characteristics
+
+First, I'll cover some of the scenarios that a mail indexing system
+must handle. They are slightly different from those of other indexing
+systems, or at least we wanted them to be.
+
+1. Indexing a few new messages, which may potentially reference most
+   of the alphabet in the index.
+2. Indexing a whole mailbox for the first time.
+3. Unindexing anywhere from a few to all existing messages during expunge.
+4. Searching.
+
+Cases 1, 3, and 4 occur most often; however, case 2 is the most
+noticeable at first use, or if the index must be reset. So the code
+needs to
+work fast in all cases, which generally leads to trade-offs being
+made. Each implementation aimed to address or ignore these
+requirements in different ways, with the final implementation probably
+having the best balance so far.
+
+The main issue is that indexing is performed in real time. We index
+as we add the messages. We index before we open the mailbox. We
+index as we remove messages. Because of this we need to approach
+things differently from many other indexing systems, most of which
+work with static data in an off-line mode. This allows them to index
+the whole body of content and use as much memory and CPU time as
+required.
+
+We probably need to look at doing offline, or at least delayed
+indexing in the future - but this introduces some coherency problems
+with vFolders and any body searches. However, having the indexing
+library a base part of Camel helps in implementing a mechanism to
+achieve this.
+
+Ibex the first
+
+The original ibex used a memory-based hash table to store words. This made
+the index very fast for both lookups and modifications. However, any
+query required a full load of the index into memory, and any update
+required a full write of the index to disk. After about 5,000-10,000
+messages, even occasional additions to the index became appreciably
+slower, as the whole index needed to be loaded into memory first.
+This obviously
+took a toll on memory as well.
+
+I won't cover the algorithms used; they were all pretty basic. The
+only real smarts were that index deletes were only flagged, and that
+the flagged data was simply not written out when the index was saved.
+
+Evolution 1.x, ibex 2.
+
+In an attempt to rectify the incremental update performance of
+libibex, it was completely rewritten to use an on-disk block-based
+filesystem.
+
+Note: the first attempt used libdb - however performance was so slow
+and the indices were so large it was dropped in favour of a custom
+filesystem-like data file.
+
+The motivation was that a few extra disk lookups during
+retrieval wouldn't be noticeably slower, while the index should be
+able to scale up to many more messages with lower memory overhead and
+faster startup time.
+
+The block filesystem contains 3 major components:
+
+1. A hash table that maps messages to a word id sequence list.
+2. A hash table that maps words to a message id sequence list.
+3. A sequence filesystem that stores sequences of ids.
+
+The ids are 32-bit identifiers that are unique for each word or
+message. They are also designed to be reversible and static.
+That is, given the id, you can map it to the string identifier that it
+represents directly, without having to look it up in the hash table.
+
+Another feature of this design is that the database file should be
+kept in sync at all times with the state of the index. The message to
+wordid table is used to remove the messageids from the words a message
+contains when the message is expunged, and so on.
+
+Indexing operation
+
+The indexing operation consists of these basic steps (sketched in C
+below):
+
+1. Look up the messageid from the message name, using the messageid table.
+2. Generate a list of words in the message.
+3. For each word:
+4. Look up the wordid and sequence information.
+5. If the word doesn't exist, create a new word/wordid.
+6. Add the messageid to the word sequence.
+7. Add the wordid to the message sequence.
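+
+The following is a minimal, self-contained C sketch of this flow. It
+models both tables in memory with plain arrays and linear search; the
+names (struct seq, table_id, index_message) are purely illustrative
+and are not the real libibex2 structures or API.
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+/* A growable sequence of ids (a stand-in for the sequence filesystem). */
+struct seq { unsigned *ids; size_t len, cap; };
+
+static void seq_add(struct seq *s, unsigned id)
+{
+	if (s->len == s->cap) {
+		s->cap = s->cap ? s->cap * 2 : 8;
+		s->ids = realloc(s->ids, s->cap * sizeof *s->ids);
+	}
+	s->ids[s->len++] = id;
+}
+
+/* A table maps names (messages or words) to an id and a sequence. */
+struct entry { char *name; struct seq seq; };
+struct table { struct entry *e; size_t len, cap; };
+
+/* Return the id for a name, creating it if it does not yet exist
+ * (steps 1 and 5); here the id is simply the entry's position. */
+static unsigned table_id(struct table *t, const char *name)
+{
+	for (size_t i = 0; i < t->len; i++)
+		if (strcmp(t->e[i].name, name) == 0)
+			return (unsigned)i;
+	if (t->len == t->cap) {
+		t->cap = t->cap ? t->cap * 2 : 8;
+		t->e = realloc(t->e, t->cap * sizeof *t->e);
+	}
+	t->e[t->len].name = strdup(name);
+	t->e[t->len].seq = (struct seq){ 0 };
+	return (unsigned)t->len++;
+}
+
+static struct table messages, words;
+
+/* Steps 1-7: map the message name to an id, then cross-link each
+ * word's sequence and the message's own (reverse) sequence. */
+static void index_message(const char *name, const char *const *wordv,
+			  size_t n)
+{
+	unsigned messageid = table_id(&messages, name);       /* step 1 */
+
+	for (size_t i = 0; i < n; i++) {                      /* steps 2-3 */
+		unsigned wordid = table_id(&words, wordv[i]); /* steps 4-5 */
+
+		seq_add(&words.e[wordid].seq, messageid);     /* step 6 */
+		seq_add(&messages.e[messageid].seq, wordid);  /* step 7 */
+	}
+}
+
+int main(void)
+{
+	const char *w1[] = { "hello", "world" };
+	const char *w2[] = { "hello", "again" };
+
+	index_message("msg-1", w1, 2);
+	index_message("msg-2", w2, 2);
+	printf("\"hello\" is in %zu messages\n",
+	       words.e[table_id(&words, "hello")].seq.len);
+	return 0;
+}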
+
+The initial implementation only used caching at the disk-block level.
+Unfortunately, the simple hash table design chosen (a fixed-size base
+table with chained buckets) scaled very poorly above about 10,000
+messages. So this approach proved to be too i/o intensive for
+practical use, and several other caches were added to improve
+performance:
+
+1. Stage (1) above is done entirely in memory. At initial startup
+ the whole list of potential names is read into an in-memory hash
+ table.
+2. Stage (4) above is also done entirely in memory. Even a large
+   cache provided little benefit due to the wide distribution of
+   potential words. This cache is only created when adding to the index.
+3. Stage (6) uses the table from stage (4) and concatenates up to
+   approximately one disk block's worth of messageids before writing
+   them out to the word sequence.
+4. Stage (7) concatenates all wordids for a given message before
+   writing them out at once.
+
+As you can see, the added complexity meant we ended up caching nearly
+as much as the original version! It also removed almost all of the
+startup-time benefit for incremental updates of the index, as the
+table was not stored as compactly on disk as in the original version.
+
+However, we only ever stored a subset of the index in memory, and only
+during updates, with some tricks to reduce memory usage for very rare
+words, so the overall memory use was still much lower.
+
+Removing a message
+
+Removing a message is fairly involved (step 4 is sketched in C below):
+
+1. Look up the messageid and word sequence list from the messageid table.
+2. For each wordid in the sequence list:
+3. Look up the message sequence list directly from the wordid table.
+4. Scan each block in the sequence, and remove any instances of the
+   messageid.
+5. Remove the message to messageid mapping in the messageid table.
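+
+A minimal sketch of step 4, the block compaction. libibex2 has to
+repeat this for every block of every word the message contained, which
+is what makes removal unbounded. The names are illustrative only.
+
+#include <assert.h>
+#include <stddef.h>
+
+/* Remove every occurrence of messageid from one block of ids,
+ * compacting the remaining ids; returns the new block length. */
+static size_t block_remove(unsigned *ids, size_t len, unsigned messageid)
+{
+	size_t out = 0;
+
+	for (size_t i = 0; i < len; i++)
+		if (ids[i] != messageid)
+			ids[out++] = ids[i];
+	return out;
+}
+
+int main(void)
+{
+	unsigned block[] = { 3, 7, 3, 9 };
+	size_t len = block_remove(block, 4, 3);
+
+	assert(len == 2 && block[0] == 7 && block[1] == 9);
+	return 0;
+}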
+
+Unfortunately caching helped very little here, particularly if many
+messages were removed. Also note that the file could never shrink as
+the data could be spread randomly over it. Removal is an extremely
+expensive and unbounded process. Deleting all of the messages in a
+mailbox is extremely i/o intensive, with blocks potentially being
+accessed dozens of times.
+
+Performing a query
+
+Performing a query is fast (sketched below):
+
+1. Look up the messageid sequence list from the wordid table.
+2. For each messageid:
+3. Look up the message name directly from the messageid table.
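+
+A small sketch of the query path. Because ids are reversible, a
+messageid maps straight back to its entry, modelled here as a plain
+array index; the names are illustrative, not the libibex2 API.
+
+#include <stdio.h>
+#include <stddef.h>
+
+static const char *message_names[] = { "msg-1", "msg-2", "msg-3" };
+
+/* Steps 1-3: print each message containing the word, given the word's
+ * messageid sequence list. */
+static void query_word(const char *word, const unsigned *seq, size_t len)
+{
+	printf("%s:", word);
+	for (size_t i = 0; i < len; i++)
+		printf(" %s", message_names[seq[i]]);
+	printf("\n");
+}
+
+int main(void)
+{
+	unsigned hello_seq[] = { 0, 2 };  /* "hello" is in msg-1 and msg-3 */
+
+	query_word("hello", hello_seq, 2);
+	return 0;
+}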
+
+Even without caching this performs at a very acceptable level.
+
+Summary
+
+This index performs reasonably well up to about 10,000 messages for a
+complete re-index. However, with incremental updates it degrades much
+faster; after only a few thousand messages have been added it becomes
+tiresomely slow and i/o bound. The index becomes more fragmented with
+random updates and removals, and it heavily bogs down the system as
+you go much beyond those few thousand messages.
+
+The code is also very complicated and hard to follow. There are too
+many special cases, and it is buggy. Detected on-disk structure
+errors result in the index being reset, which, although it shrinks the
+index, is very slow.
+
+The indices created are bulky, and never shrink. Because of the
+reverse index used for message removal, there is 50% redundant data at
+all times. Some overly tricky techniques (very much like ReiserFS's
+tail packing) are used to waste as little space as possible, at a
+significant cost in performance.
+
+One other problem is that because the index is disk based, we
+use a file descriptor continuously. Some users have more than 100
+folders, so they quickly run out of process file descriptors and
+Evolution fails. To get around this, a cache of least-recently-used
+index files is used to flush away and free file descriptors so they
+can be re-used. This makes it hard to lock the files; this problem
+still exists with the next implementation.
+
+Anyway, a better solution is required.
+
+CamelIndex
+
+The first problem to address was the API, which was starting to age.
+Although adequate, the API wasn't terribly clean, reusable, or
+scalable. The first step was to objectise the library, and since we
+needed to use it in Camel, the best way was to create a CamelObject.
+
+CamelIndex was born: a mostly abstract class that provides a simple
+common interface for accessing indices, including cursors as well as
+utility and maintenance functions.
+
+In addition, a number of the features in libibex2 were simplified or
+rewritten and abstracted into the re-usable classes that follow.
+
+By providing simple cursors, more complex queries are easier to write
+and can execute more efficiently; Camel searching now does sub-string
+searches for all body queries, and still runs at a very healthy speed
+while using less memory than before.
+
+CamelBlockFile
+
+This is basically the same block filesystem used in libibex2. It
+handles disk i/o based on blocks (CamelBlock), flushing modified
+blocks to disk, and caching of recently accessed blocks. It was
+enhanced slightly to allow blocks to be locked in memory.
+
+CamelKeyFile
+
+This is a simple reverse-linked list of sequences of keyids.
+
+The main property of this file is that updates are only ever appended
+to the end of the file, which improves i/o characteristics markedly.
+
+When an existing keyid sequence is updated, the new entry is appended
+to the end of the file with a pointer back to the start of the
+previous entry, i.e. a simple linked list.
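+
+A minimal sketch of this append-only scheme using stdio, with a
+hypothetical record layout of (previous offset, count, ids): appending
+returns the new tail offset, and reading walks the chain backwards
+from the tail, exactly as queries do. This illustrates the idea only;
+it is not the real CamelKeyFile format.
+
+#include <stdio.h>
+
+typedef unsigned int u32;
+
+/* Append one record { prev offset, count, ids[] } at the end of the
+ * file and return its offset, which becomes the new chain tail. */
+static long key_file_append(FILE *f, long prev, const u32 *ids, u32 count)
+{
+	long offset;
+
+	fseek(f, 0, SEEK_END);
+	offset = ftell(f);
+	fwrite(&prev, sizeof prev, 1, f);
+	fwrite(&count, sizeof count, 1, f);
+	fwrite(ids, sizeof *ids, count, f);
+	return offset;
+}
+
+/* Walk the chain backwards from the tail, printing each record. */
+static void key_file_dump(FILE *f, long tail)
+{
+	while (tail != -1) {
+		long prev;
+		u32 count, id;
+
+		fseek(f, tail, SEEK_SET);
+		fread(&prev, sizeof prev, 1, f);
+		fread(&count, sizeof count, 1, f);
+		while (count--) {
+			fread(&id, sizeof id, 1, f);
+			printf(" %u", id);
+		}
+		tail = prev;
+	}
+	printf("\n");
+}
+
+int main(void)
+{
+	FILE *f = tmpfile();
+	u32 first[] = { 1, 2, 3 }, second[] = { 7, 9 };
+	long tail = -1;
+
+	if (!f)
+		return 1;
+	tail = key_file_append(f, tail, first, 3);
+	tail = key_file_append(f, tail, second, 2);
+	key_file_dump(f, tail);   /* prints " 7 9 1 2 3" */
+	fclose(f);
+	return 0;
+}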
+
+CamelKeyTable
+
+This is taken from the libibex2 code for mapping keys, with few
+changes. It uses a CamelBlockFile for its i/o.
+
+The key table is a linked list of blocks (CamelKeyBlock) which contain
+key strings plus a data pointer and flags for each key. Each block
+is a packed array of string descriptors (CamelKeyKeys).
+
+A keyid (camel_key_t) is a 32-bit descriptor which identifies this key
+in a reversible way. In this case the bottom 10 bits are used to
+identify the index of the key within the key block, and the top 22
+bits are used to identify the key block itself. In this way, given
+the 32-bit keyid, we can reference the block containing the key
+directly (with at most 1 access), and access the flags and key string
+using the key index.
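+
+A small sketch of that packing, assuming exactly the layout described
+above (the helper names are illustrative; camel_key_t is the real
+type name):
+
+#include <stdint.h>
+#include <stdio.h>
+
+typedef uint32_t camel_key_t;
+
+/* Pack a key block number (top 22 bits) and a key index within that
+ * block (bottom 10 bits) into one reversible keyid. */
+static camel_key_t keyid_make(uint32_t block, uint32_t index)
+{
+	return (block << 10) | (index & 0x3ff);
+}
+
+static uint32_t keyid_block(camel_key_t id) { return id >> 10; }
+static uint32_t keyid_index(camel_key_t id) { return id & 0x3ff; }
+
+int main(void)
+{
+	camel_key_t id = keyid_make(1234, 56);
+
+	printf("keyid %u -> block %u, index %u\n",
+	       id, keyid_block(id), keyid_index(id));
+	return 0;
+}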
+
+Keys can potentially be removed and their keyids reused by simply
+re-packing the key block. This was used in libibex2, but not in
+CamelIndex.
+
+[diagram - camelkeyblock]
+
+CamelPartitionTable
+
+An implementation of a scalable, on-disk 'perfect' hash table. It
+uses the CamelBlockFile to handle its i/o. This is a completely new
+hash table implementation which was not present in libibex2.
+
+[FIXME: Reference the original paper the algorithm is based on.]
+
+A partition table consists of a list of mapping blocks
+(CamelPartitionMapBlock); each is a compact table that maps a range
+of hashids to a partition block (CamelPartitionKeyBlock), which
+contains the hashids of that range.
+
+[diagram - camelpartitiontable]
+
+The partition block only maps the hashid to a keyid (see CamelKeyTable),
+which means a lot of keys can be stored in each block.
+
+To add a new value to the partition table:
+
+1. Calculate the hash value of the key
+2. Find out which partition block the key will fit into, using the
+ partition table.
+3. If the partition block is full:
+4. If there is room in the next or previous block:
+5. Merge the 2 blocks together, and split at the half-way point
+6. Update the partition table hash indices to match the blocks
+7. Else
+8. Create a new block, and split the existing block across it
+9. Insert the new block into the partition table
+10. Else
+11. Just add the key to the end of the block.
+
+Steps 5 and 8 perform a sort of the partition key entries by hashid
+to find the midpoint. It may be beneficial to always store the hashids
+sorted; that would avoid the sort when splitting blocks, and would
+also benefit key lookups by allowing a binary search. However, the
+incremental sort may be more expensive.
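+
+A sketch of the split performed in steps 5 and 8: gather the entries
+of the full block (plus any merge partner), sort them by hashid, and
+split at the midpoint so that each block covers a contiguous hashid
+range. The structures are simplified stand-ins, not the real
+CamelPartitionKeyBlock layout.
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+/* One entry of a partition block: a hashid and the keyid it maps to. */
+struct pkey { uint32_t hashid; uint32_t keyid; };
+
+static int cmp_hashid(const void *a, const void *b)
+{
+	uint32_t ha = ((const struct pkey *)a)->hashid;
+	uint32_t hb = ((const struct pkey *)b)->hashid;
+
+	return ha < hb ? -1 : ha > hb ? 1 : 0;
+}
+
+/* Sort all n entries (n >= 2) by hashid and split them evenly across
+ * two blocks; returns the highest hashid left in the lower block,
+ * which becomes its upper bound in the partition map. */
+static uint32_t split_block(struct pkey *all, size_t n,
+			    struct pkey *lo, size_t *nlo,
+			    struct pkey *hi, size_t *nhi)
+{
+	qsort(all, n, sizeof *all, cmp_hashid);
+	*nlo = n / 2;
+	*nhi = n - *nlo;
+	for (size_t i = 0; i < *nlo; i++)
+		lo[i] = all[i];
+	for (size_t i = 0; i < *nhi; i++)
+		hi[i] = all[*nlo + i];
+	return lo[*nlo - 1].hashid;
+}
+
+int main(void)
+{
+	struct pkey all[] = { {90,1}, {10,2}, {50,3}, {30,4}, {70,5}, {20,6} };
+	struct pkey lo[6], hi[6];
+	size_t nlo, nhi;
+	uint32_t bound = split_block(all, 6, lo, &nlo, hi, &nhi);
+
+	printf("lower block holds %zu keys up to hashid %u\n", nlo, bound);
+	return 0;
+}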
+
+If the partition table itself fills up, we perform a similar
+splitting function on its blocks, and store it over multiple blocks.
+With a block size of 1024 bytes, we can fit 127 block pointers, each
+referring to a block with 127 keys in it - around 16,000 keys. So we
+only need 1024 bytes of memory for each 16,000 on-disk keys (assuming
+full tables).
+
+Removal is basically the same, but if we end up with an empty block we
+just remove it from the partition table. CamelTextIndex doesn't
+actually use removal although it is implemented in
+CamelPartitionTable.
+
+Lookup is very simple. We basically follow steps 1 and 2, and then
+perform a linear search through the block to find a matching hashid.
+That is our key. This assumes a perfect hash; additionally, the code
+could use the keyid to look up the key table and verify the key is
+indeed the right one. This would require supporting duplicate hashids
+and would make block splitting slightly more complex, but only by a
+couple of lines of code. This is something that will probably have to
+be addressed in the future.
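+
+A sketch of that lookup path: pick the partition block whose hashid
+range covers the hash, then scan it linearly for a matching hashid.
+The map below is a simplified in-memory stand-in for the real
+CamelPartitionMapBlock/CamelPartitionKeyBlock structures.
+
+#include <stdint.h>
+#include <stdio.h>
+
+struct pkey { uint32_t hashid; uint32_t keyid; };
+
+/* A map entry: the highest hashid a block holds, plus that block's
+ * entries (in the real code this is a block id, not a pointer). */
+struct pmap {
+	uint32_t max_hashid;
+	const struct pkey *block;
+	size_t nkeys;
+};
+
+/* Find the block covering hashid, then scan it; returns the keyid,
+ * or 0 if the key is absent. */
+static uint32_t partition_lookup(const struct pmap *map, size_t nmap,
+				 uint32_t hashid)
+{
+	for (size_t i = 0; i < nmap; i++) {
+		if (hashid > map[i].max_hashid)
+			continue;          /* not this block's range */
+		for (size_t j = 0; j < map[i].nkeys; j++)
+			if (map[i].block[j].hashid == hashid)
+				return map[i].block[j].keyid;
+		return 0;
+	}
+	return 0;
+}
+
+int main(void)
+{
+	const struct pkey lo[] = { {10,2}, {30,4}, {20,6} };
+	const struct pkey hi[] = { {90,1}, {50,3}, {70,5} };
+	const struct pmap map[] = { {30, lo, 3}, {UINT32_MAX, hi, 3} };
+
+	printf("hashid 50 -> keyid %u\n", partition_lookup(map, 2, 50));
+	return 0;
+}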
+
+Using a partition table means that we can tell with 1 disk access
+whether or not a key exists (assuming a perfect hash function), and 1
+more access to look up all of the details of the key since the keyid
+is reversible. Another feature is that the partition table is always
+self-balancing for any data processed in any order.
+
+Yet one more feature is that it is quite easy to order the writes to
+the partition table so that its structure is always consistent, even
+in the event of program failure, although this has been disabled in
+the current code to take maximal advantage of the block cache.
+
+CamelTextIndex
+
+CamelTextIndex is the implementation of CamelIndex now used by camel
+for indexing mail. It shares some features with the second
+incarnation of libibex, but is generally simpler. It uses the
+previously described classes to implement the CamelIndex interface.
+
+Indexing operation
+
+The indexing operation is similar to libibex2's, but without the
+requirement to maintain the reverse index.
+
+1. Look up the messageid from the message name, using the messageid
+   partition table.
+2. Generate a list of words in the message.
+3. For each word:
+4. Look up the wordid and sequence information.
+5. Append the messageid to the word sequence.
+
+In practice we also have a word cache, which holds up to 32 messageids
+for each word before they are written to the key file.
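+
+A sketch of that cache: messageids accumulate per word and are only
+pushed out once 32 are queued, or at an explicit flush. The key-file
+append is stubbed out with a print, and all names are illustrative.
+
+#include <stdio.h>
+
+#define WORD_CACHE_LIMIT 32
+
+/* Per-word cache of messageids not yet written to the key file. */
+struct word_cache {
+	const char *word;
+	unsigned pending[WORD_CACHE_LIMIT];
+	unsigned npending;
+};
+
+/* Stand-in for appending a batch of ids to the word's key-file chain. */
+static void key_file_append_stub(struct word_cache *w)
+{
+	printf("flush %u ids for '%s'\n", w->npending, w->word);
+	w->npending = 0;
+}
+
+/* Queue one messageid for the word, flushing when the cache fills. */
+static void word_add(struct word_cache *w, unsigned messageid)
+{
+	w->pending[w->npending++] = messageid;
+	if (w->npending == WORD_CACHE_LIMIT)
+		key_file_append_stub(w);
+}
+
+int main(void)
+{
+	struct word_cache w = { "hello", { 0 }, 0 };
+
+	for (unsigned id = 0; id < 70; id++)
+		word_add(&w, id);
+	key_file_append_stub(&w);   /* final flush of the remainder */
+	return 0;
+}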
+
+Removing a message
+
+Removal is not immediate. This is one of the major performance
+improvements in CamelIndex.
+
+1. Look up the messageid from the message name partition table.
+2. Use the messageid to set a flag in the message key table to
+ indicate the message has been deleted.
+3. Remove the key hash from the partition table.
+
+This comes down to a maximum of 2 disk reads and 2 disk writes.
+libibex2 had unbounded maximums, depending on the number of words in a
+given message. The key file is not changed.
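+
+A tiny sketch of this deferred removal: it only sets a flag bit on the
+message's key-table entry (the word sequences keep mentioning the
+stale messageid until the next compression). The struct and flag
+value are illustrative, not the real key table layout.
+
+#include <stdint.h>
+#include <stdio.h>
+
+#define KEY_DELETED 0x01   /* illustrative flag bit */
+
+/* Simplified stand-in for a message's entry in the key table. */
+struct key_entry {
+	const char *name;
+	uint32_t flags;
+};
+
+static void message_remove(struct key_entry *e)
+{
+	e->flags |= KEY_DELETED;
+	/* ...the name's hash is also removed from the partition table
+	 * here; the key file itself is left untouched. */
+}
+
+int main(void)
+{
+	struct key_entry e = { "msg-1", 0 };
+
+	message_remove(&e);
+	printf("%s deleted: %s\n", e.name,
+	       (e.flags & KEY_DELETED) ? "yes" : "no");
+	return 0;
+}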
+
+Because data is never removed from the data files themselves, an
+additional housekeeping step is needed: compressing the indices.
+
+Performing a query
+
+Performing a query is much the same as with libibex2. We usually have
+slightly less disk i/o because of a more efficient and scalable hash
+table implementation, and improved locality of reference of the key
+table data.
+
+1. Look up the wordid from the word partition table.
+2. Use the wordid to get the data pointer directly from the word key
+   table.
+3. Iterate through the key file, reading the word's blocks backwards
+   through the file.
+
+Compressing
+
+Although it could have benefited from it, libibex2 did not ever
+compress indices - the only way to compress an index was to remove it
+and have it be rebuilt.
+
+CamelIndex requires a compression stage as data is never removed from
+it otherwise. Because of the much greater locality of reference, the
+compression stage is actually much faster than an incremental removal
+of data inside the data files.
+
+Compressing comprises the following steps:
+
+1. Open a new temporary index, an index block file and an index key
+ file.
+2. For each message in the message partition table
+3. If the message is not marked deleted, add it to the new message
+   partition table, and record the old messageid to new messageid
+   mapping.
+4. For each word in the word partition table:
+5. For each messageid in the word sequence list:
+6. If the messageid maps to a new messageid, remap the messageid,
+   else discard it.
+7. Concatenate up to 256 messageids in a row before writing to the
+   key file, to improve lookups.
+8. Create a new word in the new word key table
+9. Add the wordid and new sequence id to the word partition table.
+
+Note that at step 8 we could (should?) also check whether the word has
+any messages associated with it, and if it has none, discard the word
+from the new index.
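+
+A sketch of the remapping done in steps 5 and 6: given a table from
+old messageid to new messageid, with a sentinel for expunged messages,
+a word's old sequence is filtered and rewritten in one pass. The
+names and sentinel value are illustrative.
+
+#include <stdio.h>
+#include <stddef.h>
+
+#define REMOVED 0xffffffffu   /* sentinel: the message was expunged */
+
+/* Rewrite a word's sequence in place, remapping surviving messageids
+ * and discarding removed ones; returns the new sequence length. */
+static size_t remap_sequence(unsigned *seq, size_t len,
+			     const unsigned *old_to_new)
+{
+	size_t out = 0;
+
+	for (size_t i = 0; i < len; i++) {
+		unsigned mapped = old_to_new[seq[i]];
+
+		if (mapped != REMOVED)
+			seq[out++] = mapped;
+	}
+	return out;
+}
+
+int main(void)
+{
+	/* Old ids 0..3; id 1 was expunged, the others get new ids 0..2. */
+	const unsigned old_to_new[] = { 0, REMOVED, 1, 2 };
+	unsigned seq[] = { 0, 1, 3, 2 };
+	size_t len = remap_sequence(seq, 4, old_to_new);
+
+	printf("remapped sequence has %zu ids:", len);
+	for (size_t i = 0; i < len; i++)
+		printf(" %u", seq[i]);
+	printf("\n");
+	return 0;
+}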
+
+After compression, the name partition index only contains names which
+are not deleted, and the key file is compressed into larger blocks,
+which take up less space and are faster to retrieve.
+
+During index operations a number of statistics are kept which trigger
+an automatic compression when the file fragmentation or the number of
+deleted messages exceeds a threshold. So the index maintains itself,
+and does not need manual compression.
+
+
+
+