diff options
Diffstat (limited to 'devel-docs')
-rw-r--r-- | devel-docs/query/virtual-folder-in-depth.txt | 306 |
1 files changed, 306 insertions, 0 deletions
diff --git a/devel-docs/query/virtual-folder-in-depth.txt b/devel-docs/query/virtual-folder-in-depth.txt new file mode 100644 index 0000000000..fc7b80ca07 --- /dev/null +++ b/devel-docs/query/virtual-folder-in-depth.txt @@ -0,0 +1,306 @@ +* introduction + +This document describes a different way of approaching mail +organization and how all things are possible in this brave new +world. This document does not describe physical storage issues nor +interface issues. + +Historically mail has been organized into folders. These folders +usually mapped to a single storage medium. The relationship between +mail organization and storage medium was one to one. There was one +mail organization for every storage medium. This scheme had its +limitations. + +Efforts at categorizations are only meaningful at the instance that +one categorized. To find any piece of data, regardless of how well +it was categorized, required some amount of searching. Therefore, any +attempts to nullify searching is doomed to fail. It's time to embrace +searching as a way of life. + +These are the terms and their definitions. The example rules used are +based on the syntax for VM (http://www.wonderworks.com/vm/) by Kyle +Jones whose ideas form the basis for this. I'm only adding the +existence of summary files to aid in scaling. I currently use VM and +it's virtual-folder rules for my daily mail purposes. To date, my only +complaints are speed (it has no caches) and for the unitiated, it's +not very user-friendly. + +Comments, questions, rants, etc. should be directed at Giao Nguyen +<grail@cafebabe.org> who will try to address issues in a timely +manner. + +* Definitions + +** store + +A location where mail can be found. This may be a file (Berkeley +mbox), directory (MH), IMAP server, POP3 server, Exchange server, +Lotus Notes server, a stack of Post-Its by your monitor fed through +some OCR system. + +** message + +An individual mail message. + +** vfolder + +A group of messages sharing some commonality. This is the result of a +query. The vfolder maybe contained in a store, but it is not necessary +that a store holds only one vfolder. There is always an implicit +vfolder rule which matches all messages. A store contains the vfolder +which is the result of the query (any). It's short for virtual folder +or maybe view folder. I dunno. + +** default-vfolder + +The vfolder defined by (any) applied to the store. This is not the +inbox. The inbox could easily be defined by a query. A default rule +for the inbox could be (new) but it doesn't have to be. Mine happens +to be (or (unread) (new)). + +** folder + +The classical mail folder approach: one message organization per +store. + +** query + +A search for messages. The result of this is a vfolder. There are two +kinds of queries: named queries and lambda queries. More on this +later. + +** summary file + +An external file that contains pointers to messages which are matches +for a named query. In addition to pointers, the summary file should +also contain signatures of the store for sanity checks. When the term +"index" is used as a verb, it means to build a summary file for a +given name-value pair. + +* Queries + +Named queries are analogous to classical mail folders. Because named +queries maybe reused, summary files are kept as caches to reduce +the overall cost of viewing a vfolder. Summary files are superior to +folders in that they allow for the same messages to appear in multiple +vfolders without message duplications. Duplications of messages +defeats attempts at tagging a message with additional user information +like annotations. Named queries will define folders. + +Lambda queries are similar to named queries except that they have no +name. These are created on the fly by the user to filter out or +include certain messages. + +All queries can be layered on top of each other. A lambda query can be +layered on a named query and a named query can be layered on a lambda +query. The possibilities are endless. + +The layerings can be done as boolean operations (and, or, not). Short +circuiting should be used. + +Examples: + +(and (author "Giao") + (unread)) + +The (unread) query should only be evaluated on the results of (author +"Giao"). + +(or (author "Giao") + (unread)) + +Both of these queries should be evaluated. Any matches are added to the +resulting vfolder. + +* Summary files + +Summary files are only meaningful when applied to the context of the +default-vfolder of a store. + +Summary files should be generated for queries of the form: + +(function "constant value") + +Summary files should never be generated for queries of the form: + +(function (function1)) + +(and (function "value") + (another-function "another value")) + +Given a query of the form: + +(and (function "value") + (another-function "another value")) + +The system should use one summary file for (function "value") and +another summary file for (another-function "another value"). I will +call the prior form the "plain form". + +It should be noted that the signature of the store should be based on +the assumption that new data may have been added to the store since +the application generated the summary file. Signatures generated on +the entirety of the store will most likely be meaningless for things +like POP/IMAP servers. + +* Incremental indexing + +When new messages are detected, all known queries should be evaluated +on the new messages. vfolders should be notified of new messages that +are positive matches for their queries. The indexes generated by this +process should be merged into the current indexes for the vfolder. + +* Can I have multiple stores? + +I don't see why not. Again, the inbox is a vfolder so you can get a +unified inbox consisting of all new mail sent to all your stores or +your can get inboxes for each store or any combination your heart +desire. You get your cake, eat it, and someone else cleans the dishes! + +* Why all this? + +Consider the dynamic nature of the following query: + +(and (author "Giao") + (sent-after (today-midnight))) + +today-midnight would be a function that is evaluated at run-time to +calculate the appropriate object. + +* Scenarios of usage and their solutions + +** Mesage alterations + +This is a fuzzy area that should be left to the UI to handle. Messages +are altered. Read status are altered when a new message is read for +example. How do we handle this if our query is for unread messages? +Upon viewing the state would change. + +One idea is to not evaluate the queries unless we're changing between +vfolder views. This assumes that one can only view a particular +vfolder at a time. For multi-vfolder viewing, a message change should +propagate through the vfolder system. Certain effects (as in our +example) would not be intuitive. + +It would not be a clean solution to make special cases but they may be +necessary where certain defined fields are ignored when they are +changed. Some combination of the above rules can be used. I don't +think it's an easy solution. + +** Message inclusion and exclusion + +Messages are included and excluded also with queries. The final query +will have the form of: + +(and (author "Giao") + (criteria value) + (not (criteria other-value))) + +Userland criterias may be a label of some sort. These may be userland +labels or Message-IDs. What are the performance issues involved in +this? With short circuiting, it's not a major problem. + +The criterias and values are determined by the UI. The vfolder +mechanism isn't concerned with such issues. + +Messages can be included and excluded at will. The idea is often +called "arbitrary inclusion/exclusion". This can be done by +Message-IDs or other fields. It's been noted that Message-IDs are not +unique. + +I propose that any given vfolder is allocated an inclusion label and an +exclusion label. These should be randomly generated. This should be +part of the vfolder description. It should be noted that the vfolder +description has not been drafted yet. + +The result is such that the rules for a given named query is: + +(and (user-query) + (label inclusion-label) + (not exclusion-label)) + +** Query scheduling + +Consider the following extremely dynamic queries: + +A: +(and (author "Giao") + (sent-after (today-midnight))) + +B: +(and (sent-after (today-midnight)) + (author "Giao")) + +C: +(or (author "Giao") + (sent-after (today-midnight))) + +Query A would be significantly faster because (author "Giao") is not +dynamic. A summary file could be generated for this query. Query B is +slow and can be optimized if there was a query compiler of some +sort. Query C demonstrates a query in which there is no good +optimization which can be applied. These come with a certain amount of +baggage. + +It seems then that for boolean 'and' operations, plain forms should be +moved forward and other queries should be moved such that they are +evaluated later. I would expect that the majority of queries would be +of the plain form. + +First is that the summary file is tied to the query and the store +where the query originates from. Second, a hashing function for +strings needs to be calculated for the query so that the query and the +summary file can be associated. This hashing function could be similar +to the hashing function described in Rob Pike's "The Practice of +Programming". (FIXME: Stick page number here) + +** Archives + +Many people are concerned that archives won't be preserved, archives +aren't supported, and many other archive related issues. This is the +short version. + +Archives are just that, archives. Archives are stores. Take your +vfolder, export it to a store. You are done. If you load up the store +again, then the default-vfolder of that store is the view of the +vfolder, except the query is different. + +The point to vfolder is not to do away with classical folder +representation but to move the queries to the front where it would +make data management easier for people who don't think in terms of +files but in terms of queries because ordinary people don't think in +terms of files. + +* Miscellany + +** Annotations + +There should be a scheme to add annotations to messages. Common mail +user agents have used a tag in the message header to mark messages as +read/unread for example. Extending on this we have the ability to add +our own data to a message to add meaning to it. If we have a good +scheme for doing this, new possibilities are opened. + +*** Keywords + +When sending a message, a message could have certain keywords attached +to it. While this can be done with the subject line, the subject line +has a tendency to be munged by other mail applications. One popular +example is the "[rR]e:" prefix. Using the subject line also breaks the +"contract" with other mail user agents. Using keywords in another +field in the message header allows the sender to assist the recipient +in organizing data automatically. Note that the sender can only +provide hints as the sender is unlikely to know the organization +schemes of the recipient. + +** Scope + +Let us assume that we have multiple stores. Does a query work on a +given store? Or does it work on all stores? Or is it configurable such +that a query can work on a user-selected list of stores? + +* Alternatives to the above + +Jim Meyer <purp@selequa.com> is putting some notes on where +annotations needs to be located. They'll be located here as well as +any contributions I may have to them.
\ No newline at end of file |