TITLE: An in-depth look at the virtual folder mechanism AUTHOR: Giao Nguyen <grail@cafebabe.org> * introduction This document describes a different way of approaching mail organization and how all things are possible in this brave new world. This document does not describe physical storage issues nor interface issues. Historically mail has been organized into folders. These folders usually mapped to a single storage medium. The relationship between mail organization and storage medium was one to one. There was one mail organization for every storage medium. This scheme had its limitations. Efforts at categorizations are only meaningful at the instance that one categorized. To find any piece of data, regardless of how well it was categorized, required some amount of searching. Therefore, any attempts to nullify searching is doomed to fail. It's time to embrace searching as a way of life. These are the terms and their definitions. The example rules used are based on the syntax for VM (http://www.wonderworks.com/vm/) by Kyle Jones whose ideas form the basis for this. I'm only adding the existence of summary files to aid in scaling. I currently use VM and it's virtual-folder rules for my daily mail purposes. To date, my only complaints are speed (it has no caches) and for the unitiated, it's not very user-friendly. Comments, questions, rants, etc. should be directed at Giao Nguyen <grail@cafebabe.org> who will try to address issues in a timely manner. * Definitions ** store A location where mail can be found. This may be a file (Berkeley mbox), directory (MH), IMAP server, POP3 server, Exchange server, Lotus Notes server, a stack of Post-Its by your monitor fed through some OCR system. ** message An individual mail message. ** vfolder A group of messages sharing some commonality. This is the result of a query. The vfolder maybe contained in a store, but it is not necessary that a store holds only one vfolder. There is always an implicit vfolder rule which matches all messages. A store contains the vfolder which is the result of the query (any). It's short for virtual folder or maybe view folder. I dunno. ** default-vfolder The vfolder defined by (any) applied to the store. This is not the inbox. The inbox could easily be defined by a query. A default rule for the inbox could be (new) but it doesn't have to be. Mine happens to be (or (unread) (new)). ** folder The classical mail folder approach: one message organization per store. ** query A search for messages. The result of this is a vfolder. There are two kinds of queries: named queries and lambda queries. More on this later. ** summary file An external file that contains pointers to messages which are matches for a named query. In addition to pointers, the summary file should also contain signatures of the store for sanity checks. When the term "index" is used as a verb, it means to build a summary file for a given name-value pair. * Queries Named queries are analogous to classical mail folders. Because named queries maybe reused, summary files are kept as caches to reduce the overall cost of viewing a vfolder. Summary files are superior to folders in that they allow for the same messages to appear in multiple vfolders without message duplications. Duplications of messages defeats attempts at tagging a message with additional user information like annotations. Named queries will define folders. Lambda queries are similar to named queries except that they have no name. These are created on the fly by the user to filter out or include certain messages. All queries can be layered on top of each other. A lambda query can be layered on a named query and a named query can be layered on a lambda query. The possibilities are endless. The layerings can be done as boolean operations (and, or, not). Short circuiting should be used. Examples: (and (author "Giao") (unread)) The (unread) query should only be evaluated on the results of (author "Giao"). (or (author "Giao") (unread)) Both of these queries should be evaluated. Any matches are added to the resulting vfolder. * Summary files Summary files are only meaningful when applied to the context of the default-vfolder of a store. Summary files should be generated for queries of the form: (function "constant value") Summary files should never be generated for queries of the form: (function (function1)) (and (function "value") (another-function "another value")) Given a query of the form: (and (function "value") (another-function "another value")) The system should use one summary file for (function "value") and another summary file for (another-function "another value"). I will call the prior form the "plain form". It should be noted that the signature of the store should be based on the assumption that new data may have been added to the store since the application generated the summary file. Signatures generated on the entirety of the store will most likely be meaningless for things like POP/IMAP servers. * Incremental indexing When new messages are detected, all known queries should be evaluated on the new messages. vfolders should be notified of new messages that are positive matches for their queries. The indexes generated by this process should be merged into the current indexes for the vfolder. * Can I have multiple stores? I don't see why not. Again, the inbox is a vfolder so you can get a unified inbox consisting of all new mail sent to all your stores or your can get inboxes for each store or any combination your heart desire. You get your cake, eat it, and someone else cleans the dishes! * Why all this? Consider the dynamic nature of the following query: (and (author "Giao") (sent-after (today-midnight))) today-midnight would be a function that is evaluated at run-time to calculate the appropriate object. * Scenarios of usage and their solutions ** Mesage alterations This is a fuzzy area that should be left to the UI to handle. Messages are altered. Read status are altered when a new message is read for example. How do we handle this if our query is for unread messages? Upon viewing the state would change. One idea is to not evaluate the queries unless we're changing between vfolder views. This assumes that one can only view a particular vfolder at a time. For multi-vfolder viewing, a message change should propagate through the vfolder system. Certain effects (as in our example) would not be intuitive. It would not be a clean solution to make special cases but they may be necessary where certain defined fields are ignored when they are changed. Some combination of the above rules can be used. I don't think it's an easy solution. ** Message inclusion and exclusion Messages are included and excluded also with queries. The final query will have the form of: (and (author "Giao") (criteria value) (not (criteria other-value))) Userland criterias may be a label of some sort. These may be userland labels or Message-IDs. What are the performance issues involved in this? With short circuiting, it's not a major problem. The criterias and values are determined by the UI. The vfolder mechanism isn't concerned with such issues. Messages can be included and excluded at will. The idea is often called "arbitrary inclusion/exclusion". This can be done by Message-IDs or other fields. It's been noted that Message-IDs are not unique. I propose that any given vfolder is allocated an inclusion label and an exclusion label. These should be randomly generated. This should be part of the vfolder description. It should be noted that the vfolder description has not been drafted yet. The result is such that the rules for a given named query is: (and (user-query) (label inclusion-label) (not exclusion-label)) ** Query scheduling Consider the following extremely dynamic queries: A: (and (author "Giao") (sent-after (today-midnight))) B: (and (sent-after (today-midnight)) (author "Giao")) C: (or (author "Giao") (sent-after (today-midnight))) Query A would be significantly faster because (author "Giao") is not dynamic. A summary file could be generated for this query. Query B is slow and can be optimized if there was a query compiler of some sort. Query C demonstrates a query in which there is no good optimization which can be applied. These come with a certain amount of baggage. It seems then that for boolean 'and' operations, plain forms should be moved forward and other queries should be moved such that they are evaluated later. I would expect that the majority of queries would be of the plain form. First is that the summary file is tied to the query and the store where the query originates from. Second, a hashing function for strings needs to be calculated for the query so that the query and the summary file can be associated. This hashing function could be similar to the hashing function described in Rob Pike's "The Practice of Programming". (FIXME: Stick page number here) ** Archives Many people are concerned that archives won't be preserved, archives aren't supported, and many other archive related issues. This is the short version. Archives are just that, archives. Archives are stores. Take your vfolder, export it to a store. You are done. If you load up the store again, then the default-vfolder of that store is the view of the vfolder, except the query is different. The point to vfolder is not to do away with classical folder representation but to move the queries to the front where it would make data management easier for people who don't think in terms of files but in terms of queries because ordinary people don't think in terms of files. * Miscellany ** Annotations There should be a scheme to add annotations to messages. Common mail user agents have used a tag in the message header to mark messages as read/unread for example. Extending on this we have the ability to add our own data to a message to add meaning to it. If we have a good scheme for doing this, new possibilities are opened. *** Keywords When sending a message, a message could have certain keywords attached to it. While this can be done with the subject line, the subject line has a tendency to be munged by other mail applications. One popular example is the "[rR]e:" prefix. Using the subject line also breaks the "contract" with other mail user agents. Using keywords in another field in the message header allows the sender to assist the recipient in organizing data automatically. Note that the sender can only provide hints as the sender is unlikely to know the organization schemes of the recipient. ** Scope Let us assume that we have multiple stores. Does a query work on a given store? Or does it work on all stores? Or is it configurable such that a query can work on a user-selected list of stores? * Alternatives to the above Jim Meyer <purp@selequa.com> is putting some notes on where annotations needs to be located. They'll be located here as well as any contributions I may have to them.