Stability
---------
* ibex_open should never crash, and should never return NULL without
  errno being set. It should also check for errors when reading.


Performance
-----------
* Profiling, keep thinking about data structures, etc.

* Check memory usage

* See if writing the "inverse image" of long ref streams helps
  compression without hurting performance now. (i.e., if a word
  appears in more than half of the files, write out the list of files
  it _doesn't_ appear in). (I tried this before, and it wasn't working
  well, but the file format and data structures have changed a lot.)

* We could save a noticeable chunk of time if normalize_word computed
  the hash of the word and then we could pass that into
  g_hash_table_insert somehow.

  * Make a copy of the buffer to be indexed (or provide an interface
    for the caller to say ibex can munge the provided data) and then
    use that rather than constantly copying things. ?


Functionality
-------------
* ibex file locking

* specify file mode in ibex_open

* ibex_find* need to normalize the search words... should this be done
  by the caller or by ibex_find?

* There needs to be some way to do a secondary search after getting
  results back from ibex_find* (i.e., for "foo near bar"). This either
  has to be done by ibex, or requires us to export the normalize
  interface.

* Does there need to be an ibex_find_any, or is that easy enough for
  the caller to do?

* utf8_trans needs to cover at least two more code pages. This is
  tricky because it's not clear whether some of the letters there
  should be translated to ASCII or left as UTF8. This requires some
  investigation.

* ibex_index_* need to ignore HTML tags.
        NAME = [A-Za-z][A-Za-z0-9.-]*
        </?{NAME}(\s*{NAME}(\s*=\s*({NAME}|"[^"]*"|'[^']*')))*>
        <!(--([^-]|-[^-])*--\s*)*>

  ugh. ok, simplifying, we get:
        <[^!]([^"'>]*("[^"]*"|'[^']*'))*[^"'>]*>
  or
        <!(--([^-]|-[^-])*--\s*)*>

  which is still not simple. sigh.

* ibex_index_* need to recognize and ignore "non-text". Particularly
  BinHex and uuencoding.
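
Sketch for the "inverse image" item under Performance. This only shows
the decision of which list to write for one word; choose_ref_list,
ref_list_t and the per-file "present" array are made-up names for
illustration and say nothing about the real ibex file format.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical result type: the file ids to write out for one word. */
typedef struct {
    int inverted;   /* 1 if ids[] lists the files the word is NOT in */
    size_t count;   /* number of entries in ids[] */
    size_t *ids;    /* malloc'd, caller frees; NULL when count == 0 */
} ref_list_t;

/*
 * present[i] is nonzero when the word occurs in file i (0 <= i < nfiles).
 * Fills *out with whichever list is shorter: the files containing the
 * word, or (when the word is in more than half of them) the files that
 * do not. Returns 0 on success, -1 if malloc fails.
 */
int choose_ref_list(const unsigned char *present, size_t nfiles,
                    ref_list_t *out)
{
    size_t hits = 0, i, n = 0;

    for (i = 0; i < nfiles; i++)
        if (present[i])
            hits++;

    /* Invert only when the word is in more than half of the files. */
    out->inverted = hits > nfiles - hits;
    out->count = out->inverted ? nfiles - hits : hits;
    out->ids = NULL;
    if (out->count == 0)
        return 0;

    out->ids = malloc(out->count * sizeof *out->ids);
    if (out->ids == NULL)
        return -1;

    /* Keep file i when its presence differs from the inverted flag. */
    for (i = 0; i < nfiles; i++)
        if ((present[i] != 0) != (out->inverted != 0))
            out->ids[n++] = i;
    return 0;
}
```

The reader of the index would need the inverted flag stored next to the
list, which is one place the earlier attempt could have cost more than
it saved.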
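
Sketch for the "ignore HTML tags" item. Rather than running the regexes
above, this is a small hand-written scanner that follows the same
shape: quoted attribute values may contain '>', and "<!" is only
accepted when it holds "--" comments. strip_tags and skip_tag are
hypothetical helpers, not existing ibex functions, and requiring a
letter after '<' (or after "</") is an extra assumption borrowed from
the NAME rule so that text like "1 < 2" is left alone.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * p points at '<'. Returns a pointer just past the tag or comment
 * declaration, or NULL if this does not look like a complete one.
 */
static const char *skip_tag(const char *p)
{
    const char *q = p + 1;

    if (*q == '!') {
        /* <!(-- comment --\s*)*> */
        q++;
        while (q[0] == '-' && q[1] == '-') {
            const char *end = strstr(q + 2, "--");
            if (end == NULL)
                return NULL;
            q = end + 2;
            while (*q == ' ' || *q == '\t' || *q == '\n' || *q == '\r')
                q++;
        }
        return *q == '>' ? q + 1 : NULL;
    }

    if (*q == '/')
        q++;
    if (!isalpha((unsigned char)*q))
        return NULL;

    /* Scan to the closing '>', skipping over quoted attribute values. */
    while (*q != '\0') {
        if (*q == '>')
            return q + 1;
        if (*q == '"' || *q == '\'') {
            const char *close = strchr(q + 1, *q);
            if (close == NULL)
                return NULL;
            q = close + 1;
        } else {
            q++;
        }
    }
    return NULL;
}

/*
 * Copy src to dst with tags removed. dst needs room for
 * strlen(src) + 1 bytes. Returns the number of bytes written,
 * not counting the terminating NUL.
 */
size_t strip_tags(const char *src, char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        const char *end = (*src == '<') ? skip_tag(src) : NULL;
        if (end != NULL)
            src = end;
        else
            dst[n++] = *src++;
    }
    dst[n] = '\0';
    return n;
}
```

Tags are dropped outright here, so "foo<br>bar" comes out as "foobar";
an indexer would probably want to emit a space instead so the two words
stay separate.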
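
Sketch for the "non-text" item, uuencoding only. It checks whether a
line has the shape of a uuencode header or data line; uu_begin_line and
uu_data_line are hypothetical names. A single matching data line can be
ordinary text by coincidence, so a real detector would presumably want
a "begin" line followed by a run of matching lines before skipping
anything.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if line looks like "begin <octal mode> <name>". */
int uu_begin_line(const char *line)
{
    const char *p;

    if (strncmp(line, "begin ", 6) != 0)
        return 0;
    p = line + 6;
    if (*p < '0' || *p > '7')
        return 0;
    while (*p >= '0' && *p <= '7')
        p++;
    /* A space and a non-empty file name must follow the mode. */
    return p[0] == ' ' && p[1] != '\0';
}

/*
 * Nonzero if line (len bytes, no trailing newline) has the shape of a
 * uuencoded data line: the first character encodes the number of
 * decoded bytes (at most 45), and exactly 4 characters in the range
 * 32..96 follow for every 3 decoded bytes, rounded up.
 */
int uu_data_line(const char *line, size_t len)
{
    size_t i, nbytes, expect;

    if (len == 0 || line[0] < 32 || line[0] > 96)
        return 0;
    nbytes = (size_t)(line[0] - 32) & 63;
    if (nbytes > 45)
        return 0;
    expect = (nbytes + 2) / 3 * 4;
    if (len - 1 != expect)
        return 0;
    for (i = 1; i < len; i++)
        if (line[i] < 32 || line[i] > 96)
            return 0;
    return 1;
}
```

Some encoders strip trailing spaces from data lines, which this strict
length check would reject; loosening it trades missed blocks for more
false positives on plain text.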