TITLE: An in-depth look at the virtual folder mechanism
AUTHOR: Giao Nguyen <grail@cafebabe.org>

* Introduction

This document describes a different way of approaching mail
organization and how all things are possible in this brave new
world. It covers neither physical storage issues nor interface
issues.

Historically mail has been organized into folders. These folders
usually mapped to a single storage medium. The relationship between
mail organization and storage medium was one to one. There was one
mail organization for every storage medium. This scheme had its
limitations.

Efforts at categorization are only meaningful at the instant one
categorizes. Finding any piece of data, regardless of how well it was
categorized, requires some amount of searching. Therefore, any
attempt to eliminate searching is doomed to fail. It's time to embrace
searching as a way of life.

The terms used here are defined in the next section. The example
rules are based on the syntax of VM (http://www.wonderworks.com/vm/)
by Kyle Jones, whose ideas form the basis for this document. I'm only
adding summary files to aid in scaling. I currently use VM and its
virtual-folder rules for my daily mail. To date, my only complaints
are speed (it has no caches) and that, for the uninitiated, it's not
very user-friendly.

Comments, questions, rants, etc. should be directed at Giao Nguyen
<grail@cafebabe.org> who will try to address issues in a timely
manner.

* Definitions

** store 

A location where mail can be found. This may be a file (Berkeley
mbox), directory (MH), IMAP server, POP3 server, Exchange server,
Lotus Notes server, or even a stack of Post-Its by your monitor fed
through some OCR system.

** message 

An individual mail message.

** vfolder 

A group of messages sharing some commonality. This is the result of a
query. A vfolder may be contained in a store, but a store need not
hold only one vfolder. There is always an implicit vfolder rule which
matches all messages: a store contains the vfolder which is the result
of the query (any). The name is short for virtual folder, or maybe
view folder. I dunno.

** default-vfolder 

The vfolder defined by (any) applied to the store. This is not the
inbox. The inbox could easily be defined by a query. A default rule
for the inbox could be (new) but it doesn't have to be. Mine happens
to be (or (unread) (new)).
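
As a rough illustration of the definitions so far, the following
Python sketch treats a store as a plain list of messages and a
vfolder as the result of applying a query to it. The Message fields
and the helper names are hypothetical; nothing here is taken from VM
or any existing mailer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical minimal message; real messages carry far more state.
    author: str
    unread: bool = False
    new: bool = False


def any_(msg):
    """The implicit (any) rule: matches every message."""
    return True


def inbox_rule(msg):
    """The inbox rule quoted above: (or (unread) (new))."""
    return msg.unread or msg.new


def vfolder(store, query):
    """A vfolder is the result of running a query over a store."""
    return [m for m in store if query(m)]


store = [
    Message("Giao", unread=True),
    Message("Kyle", new=True),
    Message("Jim"),
]
default_vfolder = vfolder(store, any_)   # everything in the store
inbox = vfolder(store, inbox_rule)       # just another query
```

The inbox is not special here: it is one more query applied to the
same store as the default-vfolder.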

** folder 

The classical mail folder approach: one message organization per
store.

** query 

A search for messages. The result of this is a vfolder. There are two
kinds of queries: named queries and lambda queries. More on this
later.

** summary file 

An external file that contains pointers to messages which are matches
for a named query. In addition to pointers, the summary file should
also contain signatures of the store for sanity checks. When the term
"index" is used as a verb, it means to build a summary file for a
given name-value pair.

* Queries

Named queries are analogous to classical mail folders. Because named
queries may be reused, summary files are kept as caches to reduce
the overall cost of viewing a vfolder. Summary files are superior to
folders in that they allow the same message to appear in multiple
vfolders without being duplicated. Duplicating messages defeats
attempts at tagging a message with additional user information such
as annotations, since a tag would reach only one of the copies. Named
queries will define folders.

Lambda queries are similar to named queries except that they have no
name. These are created on the fly by the user to filter out or
include certain messages.

All queries can be layered on top of each other. A lambda query can be 
layered on a named query and a named query can be layered on a lambda
query. The possibilities are endless.

The layerings can be done as boolean operations (and, or, not). Short
circuiting should be used. 

Examples:

(and (author "Giao")
     (unread))

The (unread) query should only be evaluated on the results of (author
"Giao").

(or (author "Giao")
    (unread))

Both of these queries should be evaluated. Any matches are added to the
resulting vfolder.
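
The layering can be sketched in Python as functions that combine
other functions. This is only a sketch under assumed names (author,
unread, and_, or_, not_, traced); messages are plain dictionaries.
Note that this version short-circuits per message, so in the "or"
case the second query is skipped for a message the first one already
matched.

```python
def author(name):
    return lambda msg: name in msg["author"]


def unread():
    return lambda msg: msg["unread"]


def and_(*queries):
    # all() stops at the first query that fails for a message.
    return lambda msg: all(q(msg) for q in queries)


def or_(*queries):
    # any() stops at the first query that matches a message.
    return lambda msg: any(q(msg) for q in queries)


def not_(query):
    return lambda msg: not query(msg)


def traced(name, query, log):
    """Wrap a query so that each evaluation is recorded in log."""
    def wrapper(msg):
        log.append(name)
        return query(msg)
    return wrapper


messages = [
    {"author": "Giao Nguyen", "unread": True},
    {"author": "Kyle Jones", "unread": True},
    {"author": "Giao Nguyen", "unread": False},
    {"author": "Kyle Jones", "unread": False},
]

log = []
both = and_(traced("author", author("Giao"), log),
            traced("unread", unread(), log))
and_result = [m for m in messages if both(m)]

either = or_(author("Giao"), unread())
or_result = [m for m in messages if either(m)]
```

In the "and" case the log shows (unread) being evaluated only for
the two messages that passed (author "Giao").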

* Summary files

Summary files are only meaningful when applied to the context of the
default-vfolder of a store.

Summary files should be generated for queries of the form:

(function "constant value")

Summary files should never be generated for queries of the form:

(function (function1))

(and (function "value")
     (another-function "another value"))

Given a query of the form:

(and (function "value")
     (another-function "another value"))

The system should use one summary file for (function "value") and
another summary file for (another-function "another value"). I will
call the form (function "constant value") the "plain form".
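
One way to picture this, assuming a summary file can be modelled as
an in-memory set of message positions: each plain form gets its own
summary, and a compound "and" query is answered by intersecting
them. The SummaryCache class and its methods are hypothetical names
for illustration.

```python
def build_summary(store, field, value):
    """Index one plain form: positions of messages whose field
    contains value."""
    return {i for i, msg in enumerate(store) if value in msg[field]}


class SummaryCache:
    def __init__(self, store):
        self.store = store
        self.summaries = {}   # (field, value) -> set of positions

    def lookup(self, field, value):
        key = (field, value)
        if key not in self.summaries:
            self.summaries[key] = build_summary(self.store, field, value)
        return self.summaries[key]

    def and_query(self, *plain_forms):
        # No summary is stored for the compound query itself; the
        # per-form summaries are intersected instead. Expects at
        # least one plain form.
        sets = [self.lookup(field, value) for field, value in plain_forms]
        return sorted(set.intersection(*sets))


store = [
    {"author": "Giao", "subject": "vfolders"},
    {"author": "Giao", "subject": "lunch"},
    {"author": "Kyle", "subject": "vfolders"},
]
cache = SummaryCache(store)
hits = cache.and_query(("author", "Giao"), ("subject", "vfolders"))
```

After the call, the cache holds exactly two summaries, one per plain
form, and either can be reused by other queries.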

It should be noted that the signature of the store should be based on
the assumption that new data may have been added to the store since
the application generated the summary file. Signatures generated on
the entirety of the store will most likely be meaningless for things
like POP/IMAP servers. 
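
For an append-only store such as an mbox file, one possible reading
of this is to sign only the part of the store that was indexed, so
that appended mail does not invalidate the summary file. The sketch
below assumes the store is available as bytes; the function names
and the choice of SHA-256 are illustrative assumptions, and a remote
store would need a different scheme.

```python
import hashlib


def sign(store_bytes):
    """Record how much of the store was indexed, plus a digest of
    exactly that part."""
    return len(store_bytes), hashlib.sha256(store_bytes).hexdigest()


def still_valid(signature, store_bytes):
    """True if the indexed prefix is unchanged, even when new data
    has been appended after it."""
    length, digest = signature
    if len(store_bytes) < length:
        return False
    return hashlib.sha256(store_bytes[:length]).hexdigest() == digest


indexed = b"From a\n\nhello\n"
signature = sign(indexed)
grown = indexed + b"From b\n\nworld\n"   # new mail was appended
rewritten = b"From x\n\nhello\n"         # old mail was modified
```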

* Incremental indexing

When new messages are detected, all known queries should be evaluated
on the new messages. vfolders should be notified of new messages that
are positive matches for their queries. The indexes generated by this
process should be merged into the current indexes for the vfolder.
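
A minimal sketch of this step, assuming the store is a list and an
index is a list of positions in that list. The VFolder class and the
deliver function are hypothetical names.

```python
class VFolder:
    def __init__(self, name, query):
        self.name = name
        self.query = query
        self.index = []   # positions of matching messages in the store

    def notify(self, pointers):
        # Merge the new matches into the existing index.
        self.index.extend(pointers)


def deliver(store, vfolders, new_messages):
    """Append new messages and evaluate every known query on the new
    messages only, never on the whole store."""
    start = len(store)
    store.extend(new_messages)
    for vf in vfolders:
        matches = [start + i
                   for i, msg in enumerate(new_messages)
                   if vf.query(msg)]
        if matches:
            vf.notify(matches)


store = [{"author": "Kyle", "unread": False}]
from_giao = VFolder("from-giao", lambda m: m["author"] == "Giao")
unread = VFolder("unread", lambda m: m["unread"])

deliver(store, [from_giao, unread], [
    {"author": "Giao", "unread": True},
    {"author": "Jim", "unread": True},
])
```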

* Can I have multiple stores?

I don't see why not. Again, the inbox is a vfolder, so you can get a
unified inbox consisting of all new mail sent to all your stores, or
you can get inboxes for each store, or any combination your heart
desires. You get your cake, eat it, and someone else cleans the dishes!
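
In code this amounts to letting a query run over several stores at
once. The sketch below assumes stores are lists of dictionaries; the
store names and the "id" field are made up for the example.

```python
from itertools import chain


def inbox_rule(msg):
    """(or (unread) (new))"""
    return msg["unread"] or msg["new"]


def vfolder(query, *stores):
    """Apply one query across any number of stores."""
    return [m for m in chain(*stores) if query(m)]


work = [
    {"id": "w1", "unread": True, "new": False},
    {"id": "w2", "unread": False, "new": False},
]
home = [
    {"id": "h1", "unread": False, "new": True},
]

unified_inbox = vfolder(inbox_rule, work, home)
work_inbox = vfolder(inbox_rule, work)
```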

* Why all this?

Consider the dynamic nature of the following query:

(and (author "Giao")
     (sent-after (today-midnight)))

today-midnight would be a function that is evaluated at run-time to
calculate the appropriate object.
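
The point can be sketched by making the threshold a function that is
called each time the query runs. Everything below is illustrative:
the clock dictionary stands in for the wall clock so the example is
repeatable, and the helper names are assumptions.

```python
from datetime import datetime, time


def today_midnight(now=None):
    """Midnight at the start of the current day, computed when called."""
    now = now or datetime.now()
    return datetime.combine(now.date(), time.min)


def author(name):
    return lambda msg: name in msg["author"]


def sent_after(threshold):
    # threshold is a callable, so it is re-evaluated on every match
    # instead of being frozen when the query is defined.
    return lambda msg: msg["sent"] > threshold()


def and_(*queries):
    return lambda msg: all(q(msg) for q in queries)


clock = {"now": datetime(2000, 1, 1, 9, 0)}   # stand-in for the wall clock
query = and_(author("Giao"),
             sent_after(lambda: today_midnight(clock["now"])))
msg = {"author": "Giao", "sent": datetime(2000, 1, 1, 8, 0)}

matched_same_day = query(msg)
clock["now"] = datetime(2000, 1, 2, 9, 0)     # a day passes
matched_next_day = query(msg)
```

The same query object gives a different answer a day later, which is
something a classical folder cannot express.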

* Scenarios of usage and their solutions

** Message alterations

This is a fuzzy area that should be left to the UI to handle. Messages
are altered. Read status is altered when a new message is read, for
example. How do we handle this if our query is for unread messages?
Upon viewing, the state would change and the message would no longer
match.

One idea is to not evaluate the queries unless we're changing between
vfolder views. This assumes that one can only view a particular
vfolder at a time. For multi-vfolder viewing, a message change should
propagate through the vfolder system. Certain effects (as in our
example) would not be intuitive.

Making special cases would not be a clean solution, but they may be
necessary: changes to certain defined fields would simply be ignored.
Some combination of the above rules can be used. I don't think there
is an easy solution.

** Message inclusion and exclusion

Messages are included and excluded also with queries. The final query
will have the form of:

(and (author "Giao")
     (criteria value)
     (not (criteria other-value)))

The criteria may be userland ones of some sort, such as userland
labels or Message-IDs. What are the performance issues involved in
this? With short circuiting, it's not a major problem.

The criteria and values are determined by the UI. The vfolder
mechanism isn't concerned with such issues.

Messages can be included and excluded at will. The idea is often
called "arbitrary inclusion/exclusion". This can be done by
Message-IDs or other fields. It's been noted that Message-IDs are not
unique. 

I propose that any given vfolder is allocated an inclusion label and an 
exclusion label. These should be randomly generated. This should be
part of the vfolder description. It should be noted that the vfolder
description has not been drafted yet.

The resulting rule for a given named query is:

(and (user-query)
     (label inclusion-label)
     (not (label exclusion-label)))
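
Since the vfolder description has not been drafted, the following is
only a guess at its shape: a dictionary holding the user query and
two randomly generated labels, plus a function that renders the full
rule as text. The label format, the quoting of labels as strings,
and writing the exclusion clause as (not (label ...)) by analogy with
the inclusion clause are all assumptions.

```python
import secrets


def new_vfolder_description(user_query):
    """Allocate random inclusion and exclusion labels for a vfolder."""
    return {
        "query": user_query,
        "include": "incl-" + secrets.token_hex(8),
        "exclude": "excl-" + secrets.token_hex(8),
    }


def full_rule(desc):
    """Render the complete rule for a named query as an s-expression."""
    return ('(and {query}\n'
            '     (label "{include}")\n'
            '     (not (label "{exclude}")))').format(**desc)


desc = new_vfolder_description('(author "Giao")')
rule = full_rule(desc)
```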

** Query scheduling

Consider the following extremely dynamic queries:

A:
(and (author "Giao")
     (sent-after (today-midnight)))

B:
(and (sent-after (today-midnight))
     (author "Giao"))

C:
(or (author "Giao")
    (sent-after (today-midnight)))

Query A would be significantly faster because (author "Giao") is not
dynamic. A summary file could be generated for this query. Query B is
slow and could be optimized if there were a query compiler of some
sort. Query C demonstrates a query for which there is no good
optimization. Such optimizations come with a certain amount of
baggage.

It seems then that for boolean 'and' operations, plain forms should be 
moved forward and other queries should be moved such that they are
evaluated later. I would expect that the majority of queries would be
of the plain form.
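
A toy version of such a reordering pass, assuming queries are
represented as nested Python tuples with the function name first. The
is_plain test (one function applied to one string constant) and the
schedule function are illustrative, not a full query compiler.

```python
def is_plain(query):
    """Plain form: one function applied to one string constant."""
    return (isinstance(query, tuple) and len(query) == 2
            and isinstance(query[0], str) and isinstance(query[1], str))


def schedule(query):
    """Move plain forms to the front of 'and' clauses; leave 'or'
    clauses in their original order."""
    if not isinstance(query, tuple) or not query:
        return query
    head, *args = query
    args = [schedule(a) for a in args]
    if head == "and":
        # list.sort() is stable, so the relative order is otherwise kept.
        args.sort(key=lambda a: not is_plain(a))
    return (head, *args)


query_a = ("and", ("author", "Giao"), ("sent-after", ("today-midnight",)))
query_b = ("and", ("sent-after", ("today-midnight",)), ("author", "Giao"))
query_c = ("or", ("author", "Giao"), ("sent-after", ("today-midnight",)))
```

Running schedule on query B yields query A, while query C comes back
unchanged.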

Summary files carry two pieces of baggage. First, a summary file is
tied to both the query and the store the query is applied to. Second,
a hash of the query string needs to be calculated so that the query
and its summary file can be associated. This hashing function could be
similar to the one described in Kernighan and Pike's "The Practice of
Programming". (FIXME: Stick page number here)
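
A simple multiplicative string hash of the kind meant here might
look like the sketch below. The multiplier 31, the 32-bit table
size, and the file naming scheme are assumptions for illustration;
the store identifier is whatever the application uses to name a
store.

```python
MULTIPLIER = 31   # assumed; a small odd multiplier is typical


def query_hash(text, nhash=2**32):
    """Multiplicative hash over the bytes of the query string."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (MULTIPLIER * h + byte) % nhash
    return h


def summary_filename(store_id, query_text):
    """Tie a summary file to both the store and the query."""
    return "{}-{:08x}.summary".format(store_id, query_hash(query_text))


giao = summary_filename("inbox", '(author "Giao")')
kyle = summary_filename("inbox", '(author "Kyle")')
```

Because different queries can hash to the same value, the summary
file would presumably also need to record the query text itself.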

** Archives

Many people are concerned that archives won't be preserved, archives
aren't supported, and many other archive related issues. This is the
short version.

Archives are just that, archives. Archives are stores. Take your
vfolder, export it to a store. You are done. If you load up the store
again, then the default-vfolder of that store is the same view as the
vfolder, except that the query is now (any).
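
As a sketch of the export step, the following uses Python's standard
mailbox module to write a vfolder out as an mbox file and read it
back. The helper names and the example.org address are made up;
reading the archive back with no filter plays the role of the (any)
query.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def make_message(sender, subject):
    msg = EmailMessage()
    msg["From"] = sender
    msg["Subject"] = subject
    msg.set_content("body")
    return msg


def export_vfolder(messages, path):
    """Write the messages of a vfolder into a new mbox store."""
    box = mailbox.mbox(path)
    try:
        for msg in messages:
            box.add(msg)
        box.flush()
    finally:
        box.close()


def default_vfolder(path):
    """Subjects of every message in the store, in order."""
    box = mailbox.mbox(path)
    try:
        return [m["Subject"] for m in box]
    finally:
        box.close()


messages = [
    make_message("Giao Nguyen <grail@cafebabe.org>", "vfolders"),
    make_message("Kyle Jones <kyle@example.org>", "VM release"),
    make_message("Giao Nguyen <grail@cafebabe.org>", "summary files"),
]
vfolder = [m for m in messages if "Giao" in m["From"]]

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "giao.mbox")
    export_vfolder(vfolder, archive)
    reloaded = default_vfolder(archive)
```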

The point of vfolders is not to do away with the classical folder
representation but to move queries to the front, where they make data
management easier for ordinary people, who think in terms of queries
rather than files.

* Miscellany

** Annotations

There should be a scheme to add annotations to messages. Common mail
user agents have used a tag in the message header to mark messages as
read/unread, for example. Extending this, we have the ability to add
our own data to a message to give it meaning. If we have a good
scheme for doing this, new possibilities open up.

*** Keywords

When sending a message, a message could have certain keywords attached 
to it. While this can be done with the subject line, the subject line
has a tendency to be munged by other mail applications. One popular
example is the "[rR]e:" prefix. Using the subject line also breaks the 
"contract" with other mail user agents. Using keywords in another
field in the message header allows the sender to assist the recipient
in organizing data automatically. Note that the sender can only
provide hints as the sender is unlikely to know the organization
schemes of the recipient.
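
A small sketch of the idea using Python's standard email package.
The text above does not name a header field; this example assumes
the standard Keywords field with comma-separated values, and the
helper names are hypothetical.

```python
from email.message import EmailMessage


def compose(subject, body, keywords):
    """Build a message carrying sender-supplied keyword hints."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["Keywords"] = ", ".join(keywords)
    msg.set_content(body)
    return msg


def keywords_of(msg):
    """Keywords found in the header, or an empty list."""
    raw = msg.get("Keywords", "")
    return [k.strip() for k in raw.split(",") if k.strip()]


def keyword(word):
    """Query matching messages that carry the given keyword."""
    return lambda msg: word in keywords_of(msg)


msg = compose("Re: vfolder draft", "Comments inline.",
              ["vfolder", "design"])
plain = EmailMessage()   # a message with no hints at all
```

The subject can pick up prefixes such as "Re:" without affecting the
keywords, and a recipient is free to ignore the hints entirely.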

** Scope

Let us assume that we have multiple stores. Does a query work on a
given store? Or does it work on all stores? Or is it configurable such 
that a query can work on a user-selected list of stores?

* Alternatives to the above

Jim Meyer <purp@selequa.com> is putting together some notes on where
annotations need to be located. They'll be located here, along with
any contributions I may have to them.