-rw-r--r--  camel/ChangeLog                  |   5
-rw-r--r--  camel/devel-docs/camel-index.txt | 407
2 files changed, 412 insertions, 0 deletions
diff --git a/camel/ChangeLog b/camel/ChangeLog
index c2b7d9d9c3..fbfe9c60ce 100644
--- a/camel/ChangeLog
+++ b/camel/ChangeLog
@@ -1,3 +1,8 @@
+2002-04-19 Not Zed <NotZed@Ximian.com>
+
+	* devel-docs/camel-index.txt: Start of a white-paperish document
+	describing camel-index and older libibex.
+
 2002-04-18 Not Zed <NotZed@Ximian.com>
 
 	* providers/local/camel-local-store.c (rename_folder): If we get a
diff --git a/camel/devel-docs/camel-index.txt b/camel/devel-docs/camel-index.txt
new file mode 100644
index 0000000000..0ebc250db0
--- /dev/null
+++ b/camel/devel-docs/camel-index.txt
@@ -0,0 +1,407 @@

The history of text-indexing in Evolution.

CamelTextIndex was written to address several shortcomings in the
existing libibex (referred to here as libibex2), which had itself been
written to address shortcomings in the original libibex.

Mail indexing characteristics

First, I'll cover some of the scenarios that a mail indexing system
must handle. They are slightly different from those of other indexing
systems, or at least we wanted them to be.

1. Indexing a few new messages, which may potentially reference most of
   the alphabet in the index.
2. Indexing a whole mailbox for the first time.
3. Unindexing anywhere from a few to all existing messages during expunge.
4. Searching.

Cases 1, 3, and 4 occur most often; however, 2 is the most noticeable
at first use, or if the index must be reset. So the code needs to
work fast in all cases, which generally leads to trade-offs. Each
implementation aimed to address or ignore these requirements in
different ways, with the final implementation probably having the best
balance so far.

The main issue is that the indexing is performed in real time. We
index as we add the messages. We index before we open the mailbox.
We index as we remove messages. Because of this we need to approach
things differently from many other indexing systems, most of which
work with static data in an off-line mode.
This allows them to index the
whole body of content and use as much memory and CPU time as required.

We probably need to look at doing offline, or at least delayed,
indexing in the future - but this introduces some coherency problems
with vFolders and any body searches. However, having the indexing
library as a base part of Camel helps in implementing a mechanism to
achieve this.

Ibex the first

The original ibex used a memory-based hash table to store words. This
made the index very fast for both lookups and modifications. However,
any query required a full load of the index into memory, and any
update required a full write of the index to disk. After about
5-10,000 messages, occasionally adding to the index became appreciably
slower, as the whole index needed to be loaded into memory first.
This obviously took a toll on memory as well.

I won't cover the algorithms used; they were all pretty basic. The
only real smarts were that index deletes were only flagged, and that
the deleted data was simply not written out when the index was saved.

Evolution 1.x, ibex 2

In an attempt to rectify the incremental-update performance of
libibex, it was completely rewritten to use an on-disk, block-based
filesystem.

Note: the first attempt used libdb - however, performance was so slow
and the indices were so large that it was dropped in favour of a
custom filesystem-like data file.

The motivation was that a few extra disk lookups during retrieval
wouldn't be noticeably slower, but the index should scale up to many
more messages with lower memory overhead and faster startup time.

The block filesystem contains 3 major components:

1. A hash table that maps messages to a word-id sequence list.
2. A hash table that maps words to a message-id sequence list.
3. A sequence filesystem that stores sequences of ids.

The ids are 32-bit identifiers that are unique for each word or
message. They are also designed to be reversible and static.
That is, given the id, you can map it to the string identifier that it
represents directly, without having to look it up in the hash table.

Another feature of this design is that the database file should be
kept in sync at all times with the state of the index. The message-to-
wordid tables are used to remove the messageids from the words a
message contains when it is expunged, and so on.

Indexing operation

The indexing operation consists of these basic steps:

1. Look up the messageid from the message name, using the messageid table.
2. Generate a list of words in the message.
3. For each word:
4.   Look up the wordid and sequence information.
5.   If the word doesn't exist, create a new word/wordid.
6.   Add the messageid to the word sequence.
7.   Add the wordid to the message sequence.

The initial implementation only used caching at the disk-block level.
Unfortunately, the simple hash table design chosen (a fixed-size base
table with chained buckets) scaled very poorly above about 10,000
messages. This approach proved too i/o intensive for practical use,
and several other caches were added to improve performance:

1. Stage (1) above is done entirely in memory. At initial startup
   the whole list of potential names is read into an in-memory hash
   table.
2. Stage (4) above is also done entirely in memory. Even a large
   cache provided little benefit due to the wide distribution of
   potential words. This cache is only created when adding to the index.
3. Stage (6) uses the table from stage (4) and concatenates up to
   approximately one disk block's worth of messageids before writing
   them out to the word sequence.
4. Stage (7) concatenates all wordids for a given message before
   writing them out at once.

As you can see, the added complexity meant we ended up caching nearly
as much as the original version!
This also removed almost all of the
startup-time benefit for incremental updates of the index, as the
table was not stored as compactly on disk as in the original version.

However, we only ever stored a subset of the index in memory, and only
during updates, with some tricks to reduce memory usage for very rare
words, so the overall memory use was still much lower.

Removing a message

Removing a message is fairly involved:

1. Look up the messageid and word sequence list from the messageid table.
2. For each wordid in the sequence list:
3.   Look up the message sequence list directly from the wordid table.
4.   Scan each block in the sequence, and remove any instances of the
     messageid.
5. Remove the message-to-messageid mapping in the messageid table.

Unfortunately, caching helped very little here, particularly if many
messages were removed. Also note that the file could never shrink, as
the data could be spread randomly over it. Removal is an extremely
expensive and unbounded process. Deleting all of the messages in a
mailbox is extremely i/o intensive, with blocks potentially being
accessed dozens of times.

Performing a query

Performing a query is fast:

1. Look up the messageid sequence list from the wordid table.
2. For each messageid:
3.   Look up the message name directly from the messageid table.

Even without caching this performs at a very acceptable level.

Summary

This index performs reasonably well up to about 10,000 messages for a
complete re-index. However, with incremental updates it degrades much
faster; after only a few thousand messages are added it becomes
tiresomely slow and i/o bound. The index becomes more fragmented with
random updates and removals, and heavily bogs down the system as you
go much beyond those few thousand messages.

The code is also very complicated and hard to follow. There are too
many special cases, and it is buggy.
Detected on-disk structure
errors result in the index being reset, which, although it shrinks the
index, is very slow.

The indices created are bulky and never shrink. Because of the
reverse index used for message removal, there is 50% redundant data at
all times. Some overly tricky techniques (very much like ReiserFS's
tail packing) are used to waste as little space as possible, at a
great cost in performance.

One other problem is that because the index is disk based, we hold a
file descriptor open continuously. With some users having >100
folders, they quickly run out of process file descriptors and
Evolution fails. To get around this, a cache of least-recently-used
index files is used to flush away and free file descriptors so they
can be re-used. This makes it hard to lock the files; this problem
still exists with the next implementation.

Anyway, a better solution was required.

CamelIndex

The first problem to address was the API, which was starting to age.
Although adequate, it wasn't terribly clean, reusable, or scalable.
The first step was to objectise the library, and since we needed to
use it in Camel, the best way was to create a CamelObject.

CamelIndex was born: a mostly abstract class that provides a simple
common interface for accessing indices, including cursors plus utility
and maintenance functions.

In addition, a number of the features in libibex2 were simplified or
rewritten and abstracted into the re-usable classes that follow.

By providing simple cursors, more complex queries become easier to
write and can execute more efficiently; Camel searching now does
sub-string searches for all body queries, and still runs at a very
healthy speed while using less memory than before.

CamelBlockFile

This is basically the same block filesystem used in libibex2. It
handles disk i/o based on blocks (CamelBlock), flushing modified
blocks to disk, and caching recently accessed blocks.
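The caching and flushing behaviour can be sketched with a toy block cache. Everything here is invented for illustration (the tiny in-memory `disk` array stands in for the file, and the eviction policy is a plain least-recently-used scan); it is not the CamelBlockFile API.

```c
/* Toy sketch of a CamelBlockFile-style block cache: fixed-size blocks,
 * a small cache of recently used blocks, dirty blocks written back on
 * eviction or on an explicit sync.  All names are invented here. */
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 64
#define CACHE_SIZE 2
#define NBLOCKS    8

static unsigned char disk[NBLOCKS][BLOCK_SIZE];	/* stands in for the file */

struct block {
	unsigned int id;
	int used, dirty;
	unsigned char data[BLOCK_SIZE];
};

static struct block cache[CACHE_SIZE];
static unsigned long tick;
static unsigned long stamp[CACHE_SIZE];

static void write_back(struct block *b)
{
	if (b->used && b->dirty) {
		memcpy(disk[b->id], b->data, BLOCK_SIZE);
		b->dirty = 0;
	}
}

/* Fetch a block through the cache, evicting the least recently used
 * entry (flushing it first if dirty) when the cache is full. */
static struct block *get_block(unsigned int id)
{
	int victim = 0;

	for (int i = 0; i < CACHE_SIZE; i++) {
		if (cache[i].used && cache[i].id == id) {
			stamp[i] = ++tick;
			return &cache[i];
		}
		if (stamp[i] < stamp[victim])
			victim = i;
	}
	write_back(&cache[victim]);
	cache[victim].id = id;
	cache[victim].used = 1;
	cache[victim].dirty = 0;
	memcpy(cache[victim].data, disk[id], BLOCK_SIZE);
	stamp[victim] = ++tick;
	return &cache[victim];
}

/* Flush all dirty blocks, keeping the file in sync with the index */
static void sync_blocks(void)
{
	for (int i = 0; i < CACHE_SIZE; i++)
		write_back(&cache[i]);
}
```

The real CamelBlockFile additionally hashes block addresses for fast cache lookup.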
It was
enhanced slightly to allow blocks to be locked in memory.

CamelKeyFile

This is a simple reverse-linked list of sequences of keyids.

The main property of this file is that updates are only ever appended
to the end of the file, which improves i/o characteristics markedly.

When an existing keyid sequence is updated, the new entry simply
points back to the start of the previous one - a simple linked list.

CamelKeyTable

This is taken from the libibex2 code for mapping keys, with few
changes. It uses a CamelBlockFile for its i/o.

The key table is a linked list of blocks (CamelKeyBlock) which contain
key strings plus a data pointer and flags for each key. Each block is
a packed array of string descriptors (CamelKeyKeys).

A keyid (camel_key_t) is a 32-bit descriptor which identifies a key in
a reversible way. The bottom 10 bits identify the index of the key
within its key block, and the top 22 bits identify the key block
itself. In this way, given the 32-bit keyid, we can reference the
block containing the key directly (with at most 1 access), and access
the flags and key string using the key index.

Keys can potentially be removed and their keyids reused by simply
re-packing the key block. This was used in libibex2, but is not in
CamelIndex.

[diagram - camelkeyblock]

CamelPartitionTable

An implementation of a scalable, on-disk 'perfect' hash table. It
uses the CamelBlockFile to handle its i/o. This is a completely new
hash table implementation which was not present in libibex2.

[FIXME: Reference the original paper the algorithm is based on.]

A partition table consists of a list of mapping blocks
(CamelPartitionMapBlock); each is a compact table that maps a range of
hashids to a partition block (CamelPartitionKeyBlock), which contains
the hashids of that range.
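The reversible keyid layout described for CamelKeyTable above (10 bits of key index, 22 bits of block id) can be sketched as a pair of helpers. The helper names are invented for illustration; only the bit layout comes from the text.

```c
/* Sketch of the reversible 10/22-bit keyid layout used by
 * CamelKeyTable.  Helper names are illustrative, not the real API. */
#include <assert.h>
#include <stdint.h>

typedef uint32_t camel_key_t;

/* Pack a block id (22 bits) and in-block key index (10 bits) */
static camel_key_t key_make(uint32_t blockid, uint32_t index)
{
	return (blockid << 10) | (index & 0x3ff);
}

/* Reverse the mapping: no hash lookup needed to find the key's block */
static uint32_t key_block(camel_key_t keyid) { return keyid >> 10; }
static uint32_t key_index(camel_key_t keyid) { return keyid & 0x3ff; }
```

Because the block id is recoverable directly from the keyid, resolving a keyid to its flags and string costs at most one block access.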
[diagram - camelpartitiontable]

The partition block only maps the hashid to a keyid (see
CamelKeyTable), which means it can store a lot of keys in each block.

To add a new value to the partition table:

1. Calculate the hash value of the key.
2. Find out which partition block the key will fit into, using the
   partition table.
3. If the partition block is full:
4.   If there is room in the next or previous block:
5.     Merge the 2 blocks together, and split at the half-way point.
6.     Update the partition table hash indices to match the blocks.
7.   Else:
8.     Create a new block, and split the existing block across it.
9.     Insert the new block into the partition table.
10. Else:
11.   Just add the key to the end of the block.

Steps 5 and 8 sort the partition key entries by hashid to find the
midpoint. It may be beneficial to always store the hashids sorted;
splitting blocks would then not require a sort, and key lookups could
use a binary search. However, the incremental sort may be more
expensive.

If the partition table itself fills up, a similar splitting function
is performed on its blocks, and it is stored over multiple blocks.
With a block size of 1024 bytes, we can fit 127 block pointers, each
pointing to a block with 127 keys in it - around 16,000 keys. So we
only need 1024 bytes of memory for each 16,000 on-disk keys (assuming
full tables).

Removal is basically the same, but if we end up with an empty block we
just remove it from the partition table. CamelTextIndex doesn't
actually use removal, although it is implemented in
CamelPartitionTable.

Lookup is very simple: we follow steps 1 and 2, then perform a linear
search through the block to find a matching hashid. That is our key.
This assumes a perfect hash; additionally, the code could use the
keyid to look up the key table and verify the key is indeed the right
one.
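The lookup path just described can be sketched as follows. The structure names echo the text but the layout, sizes, and hash function (djb2 here) are invented for the sketch; real blocks are 1024 bytes.

```c
/* Sketch of partition-table lookup: hash the key, find the partition
 * block covering that hash range, then linear-scan the block.  Keyid
 * 0 is used as "not found" in this toy version. */
#include <assert.h>
#include <stdint.h>

#define KEYS_PER_BLOCK 8

struct partition_key { uint32_t hashid, keyid; };

struct partition_block {
	int nkeys;
	struct partition_key keys[KEYS_PER_BLOCK];
};

/* One mapping entry: all hashids <= hashend live in blk */
struct partition_map {
	uint32_t hashend;
	struct partition_block *blk;
};

static uint32_t hash_key(const char *key)
{
	uint32_t h = 5381;			/* djb2, for illustration */
	while (*key)
		h = h * 33 + (unsigned char)*key++;
	return h;
}

static uint32_t partition_lookup(const struct partition_map *map,
				 int nmap, const char *key)
{
	uint32_t hashid = hash_key(key);	/* step 1 */
	for (int i = 0; i < nmap; i++) {	/* step 2: find the block */
		if (hashid <= map[i].hashend) {
			const struct partition_block *b = map[i].blk;
			for (int j = 0; j < b->nkeys; j++)  /* linear scan */
				if (b->keys[j].hashid == hashid)
					return b->keys[j].keyid;
			return 0;		/* not present */
		}
	}
	return 0;
}
```

Note the sketch trusts the hash completely; verifying the match through the key table would catch hash collisions.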
This would require
supporting duplicate hashids and would make block splitting slightly
more complex, but only by a couple of lines of code. This is
something that will probably have to be addressed in the future.

Using a partition table means that we can tell with 1 disk access
whether or not a key exists (assuming a perfect hash function), and 1
more access looks up all of the details of the key, since the keyid is
reversible. Another feature is that the partition table is always
self-balancing for any data processed in any order.

Yet another feature is that it is quite easy to order the writes to
the partition table so that its structure is always consistent, even
in the event of program failure - although this has been disabled in
the current code to take maximal advantage of the block cache.

CamelTextIndex

CamelTextIndex is the implementation of CamelIndex now used by Camel
for indexing mail. It shares some features with the second incarnation
of libibex, but is generally simpler. It uses the previously described
classes to implement the CamelIndex interface.

Indexing operation

The indexing operation is similar to libibex2's, but without the
requirement to maintain the reverse index:

1. Look up the messageid from the message name, using the messageid
   partition table.
2. Generate a list of words in the message.
3. For each word:
4.   Look up the wordid and sequence information.
5.   Append the messageid to the word sequence.

In practice we also have a word cache which caches up to 32 messageids
for each word before they are written to the key file.

Removing a message

Removal is not immediate. This is one of the major performance
improvements in CamelIndex.

1. Look up the messageid from the message name partition table.
2. Use the messageid to set a flag in the message key table indicating
   the message has been deleted.
3. Remove the key hash from the partition table.
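The three removal steps above can be sketched with simplified stand-ins for the real classes. The flat arrays and all names here are invented; the point is only that removal touches two small structures and never rewrites the word sequences.

```c
/* Sketch of lazy removal: find the messageid via the name's hash,
 * flag it deleted in the key table, drop the hash from the partition
 * table.  The word sequences in the key file are left untouched. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS 16
#define KEY_DELETED 1

struct key_entry { char name[32]; unsigned int flags; };

static struct key_entry key_table[MAX_KEYS];

/* toy partition table: hash -> messageid, hash 0 = empty slot */
static uint32_t part_hash[MAX_KEYS];
static uint32_t part_msgid[MAX_KEYS];

static uint32_t hash_name(const char *name)
{
	uint32_t h = 5381;
	while (*name)
		h = h * 33 + (unsigned char)*name++;
	return h ? h : 1;	/* never return the empty-slot marker */
}

static void index_remove_name(const char *name)
{
	uint32_t h = hash_name(name);

	for (int i = 0; i < MAX_KEYS; i++) {
		if (part_hash[i] == h) {		/* step 1 */
			uint32_t msgid = part_msgid[i];
			key_table[msgid].flags |= KEY_DELETED;	/* step 2 */
			part_hash[i] = 0;		/* step 3 */
			return;
		}
	}
}
```

Because the flagged entries and their word sequences stay on disk, the space is only reclaimed by the compression pass described later.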

This comes down to a maximum of 2 disk reads and 2 disk writes.
libibex2 had unbounded maximums, depending on the number of words in a
given message. The key file is not changed.

Because data is never actually removed from the files, an additional
optional step is required: compressing the indices.

Performing a query

Performing a query is much the same as with libibex2. We usually have
slightly less disk i/o because of a more efficient and scalable hash
table implementation, and improved locality of reference in the key
table data:

1. Look up the messageid from the message name partition table.
2. Use the messageid to get the data pointer directly from the key
   table.
3. Iterate through the key file, reading blocks backwards through the
   file.

Compressing

Although it could have benefited from it, libibex2 never compressed
indices - the only way to compress an index was to remove it and have
it rebuilt.

CamelIndex requires a compression stage, as data is otherwise never
removed from it. Because of the much greater locality of reference,
the compression stage is actually much faster than an incremental
removal of data inside the data files.

Compressing comprises the following steps:

1. Open a new temporary index, an index block file and an index key
   file.
2. For each message in the message partition table:
3.   If the message is not marked deleted, add it to the new message
     partition table, and record the old-messageid to new-messageid
     mapping.
4. For each word in the word partition table:
5.   For each messageid in the word sequence list:
6.     If the messageid maps to a new messageid, remap it, else
       discard it.
7.   Concatenate up to 256 messageids in a row before writing to the
     key file, to improve lookups.
8.   Create a new word in the new word key table.
9.   Add the wordid and new sequence id to the word partition table.

Note that at step 8 we could (should?)
also check whether the word has
any messages associated with it at all, and if not, discard the word
from the new index.

After compression, the name partition index only contains names which
are not deleted, and the key file is compressed into larger blocks,
which take up less space and are faster to retrieve.

During index operations a number of statistics are collected which
trigger an automatic compress when the file fragmentation or the
number of deleted messages exceeds a threshold. So the index
maintains itself, and does not need manual compression.
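The remapping at the heart of the compression pass (steps 2-6 above) can be sketched as follows. Both helpers are invented for illustration and work purely in memory; the real code streams the results into fresh block and key files.

```c
/* Sketch of compression remapping: live messages get new ids, and
 * each word's sequence is rewritten through the old->new map, with
 * the ids of deleted messages dropped.  remap[old] == 0 means the
 * message was deleted (id 0 is reserved in this toy version). */
#include <assert.h>
#include <stdint.h>

/* Steps 2-3: assign new ids (starting at 1) to undeleted messages */
static void build_remap(const int *deleted, int nmsgs, uint32_t *remap)
{
	uint32_t next = 1;

	for (int i = 1; i < nmsgs; i++)
		remap[i] = deleted[i] ? 0 : next++;
}

/* Steps 5-6: remap a word's messageid sequence in place, discarding
 * deleted messages; returns the new sequence length */
static int compress_sequence(uint32_t *seq, int n, const uint32_t *remap)
{
	int out = 0;

	for (int i = 0; i < n; i++)
		if (remap[seq[i]] != 0)
			seq[out++] = remap[seq[i]];
	return out;
}
```

Words whose remapped sequence comes out empty are the ones that could be dropped from the new index at step 8.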