This question about Missing functionality: Asked
possible improvements to support many pages & attachments
I am working with KinoSearch (KS) and TWiki for an industrial client, and I'm looking for feedback and suggestions before working on some enhancements. (I'm posting versions of this message to the KinoSearch, TWiki, and FosWiki sites.)
We have ~50 MB of wiki pages and ~5000 MB of attachments (e.g., Adobe PDF and Microsoft Word, PowerPoint, and Excel files) hosted on a virtual server. Currently it takes ~30 minutes to build an index (150 MB) of just the wiki pages, and we would also like to stringify and index the attachments.
We've had problems where the indexing would hang or crash, leaving an incomplete index file and incorrect search results. I think these problems have been fixed (thanks to Marvin Humphrey). However, I still worry about creating a single index for the entire site. The longer it takes to build, the greater the risk that something will go wrong.
Would it make sense to use separate indices for each web or for each file type? From skimming the KS forums, it looks like KS doesn't (yet) provide much support for this, though it's being considered:
http://www.rectangular.com/pipermail/kinosearch/2006-August/006513.html
Or should multiple indexes be combined into one large index?
http://www.rectangular.com/pipermail/kinosearch/2006-July/004847.html
To use multiple indexes, I think I would need to:
- make the indexer and updater accept parameters for web(s), file type(s), and index location(s)
- figure out which index(es) to update when a file changes
- make the searcher accept parameters for web(s), file type(s), and index location(s)
- make a parent searcher that forwards queries to appropriate children based on web, file type, etc, and then combines the results (modeled on KS's MultiSearcher)
- maybe make a parent indexer (and updater) that forwards files to appropriate children based on web, file type, etc
Should this be managed within KS or in the wiki plugin/addon code?
I'm more familiar with the latter, but the former might be more widely useful.
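A minimal sketch of the parent searcher idea from the list above, written in Python for illustration only. It does not use the KinoSearch API: `ChildSearcher`, `ParentSearcher`, and `Hit` are hypothetical names, and each child is a toy in-memory index keyed by (web, file type). One caveat the sketch glosses over: in a real engine, scores from separate indexes are computed from different term statistics, so merging them by raw score may need normalization.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Hit:
    doc: str      # document name, e.g. "Main.WebHome"
    score: float  # relevance score reported by the child index


class ChildSearcher:
    """Toy stand-in for one per-web, per-file-type index (hypothetical)."""

    def __init__(self, docs: Dict[str, str]):
        self.docs = docs  # document name -> stringified text

    def search(self, query: str) -> List[Hit]:
        term = query.lower()
        hits = []
        for name, text in self.docs.items():
            # Score is simply the number of times the term occurs.
            count = text.lower().split().count(term)
            if count:
                hits.append(Hit(name, float(count)))
        return hits


class ParentSearcher:
    """Forwards a query to the matching children and merges their hits."""

    def __init__(self) -> None:
        self.children: Dict[Tuple[str, str], ChildSearcher] = {}

    def add_child(self, web: str, file_type: str, child: ChildSearcher) -> None:
        self.children[(web, file_type)] = child

    def search(
        self,
        query: str,
        webs: Optional[Set[str]] = None,
        file_types: Optional[Set[str]] = None,
        limit: int = 10,
    ) -> List[Hit]:
        hits: List[Hit] = []
        for (web, file_type), child in self.children.items():
            # None means "no restriction" for that dimension.
            if webs is not None and web not in webs:
                continue
            if file_types is not None and file_type not in file_types:
                continue
            hits.extend(child.search(query))
        # Highest score first; ties broken by name for a stable order.
        hits.sort(key=lambda h: (-h.score, h.doc))
        return hits[:limit]
```

The same (web, file type) key could also be used by a parent indexer to decide which child index receives a changed file.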
Separate but related issues:
For TWiki/FosWiki, would it make sense to expose the index path
so there is a reasonable default that can be overridden?
Among other things, this would make testing easier.
For Excel files, we're using the default stringifier in the wiki addon,
which indexes the contents of all spreadsheet cells.
I think of cells as containing either numbers, formulas, or text (anything else).
Would it make sense for the stringifier to skip numbers & formulas,
or to make this a configuration option?
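A rough sketch of what a configurable cell filter might look like, in Python for illustration. It is not the addon's actual stringifier: the function names and the `skip_numbers` / `skip_formulas` options are hypothetical, cells are assumed to arrive as plain values, and a formula is assumed to be any string starting with `=`.

```python
import re
from typing import Iterable, List, Union

Cell = Union[str, int, float, None]

# Plain decimal numbers, optionally signed, with an optional exponent.
_NUMBER_RE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def is_number(cell: Cell) -> bool:
    """True for numeric values and for strings that look like numbers."""
    if isinstance(cell, bool):
        return False
    if isinstance(cell, (int, float)):
        return True
    if isinstance(cell, str):
        return _NUMBER_RE.fullmatch(cell.strip()) is not None
    return False


def is_formula(cell: Cell) -> bool:
    """Assumption: a formula is a string whose first character is '='."""
    return isinstance(cell, str) and cell.lstrip().startswith("=")


def stringify_cells(
    rows: Iterable[Iterable[Cell]],
    skip_numbers: bool = True,
    skip_formulas: bool = True,
) -> str:
    """Join the indexable cell contents of a sheet into one string."""
    words: List[str] = []
    for row in rows:
        for cell in row:
            if cell is None:
                continue
            if skip_formulas and is_formula(cell):
                continue
            if skip_numbers and is_number(cell):
                continue
            text = str(cell).strip()
            if text:
                words.append(text)
    return " ".join(words)
```

With both options off, the function falls back to indexing every non-empty cell, which matches the current behaviour described above.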
Any suggestions for related things to do while I'm inside the code?
Thank you for your time and consideration.
-- r1 - 10 May 2010 - 16:19:55 - ClifKussmaul
Suggestion: When re-indexing (as opposed to updating the index), it would be good not to "clobber" the current index files. Maybe the new index files could be built in a temporary directory and then moved or copied over to the active production directory when done.
-- KiltBear - 15 Sep 2010
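The build-then-swap suggestion above could be sketched as follows, in Python for illustration. The function name `rebuild_index` and the `build` callback are hypothetical; the callback stands in for whatever actually writes the index files. The temporary directory is created next to the live one so the final rename stays on the same filesystem. The swap is two renames, so there is a very brief window with no live directory; this is a sketch, not a fully atomic scheme.

```python
import os
import shutil
import tempfile
from typing import Callable


def rebuild_index(live_dir: str, build: Callable[[str], None]) -> None:
    """Build a fresh index beside live_dir and swap it in only on success."""
    live_dir = os.path.abspath(live_dir)
    parent = os.path.dirname(live_dir)
    os.makedirs(parent, exist_ok=True)

    # Note: mkdtemp creates an owner-only directory; adjust permissions
    # afterwards if the web server runs as a different user.
    tmp_dir = tempfile.mkdtemp(prefix="index-build-", dir=parent)
    try:
        build(tmp_dir)
    except BaseException:
        # A failed or interrupted build never touches the live index.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise

    old_dir = live_dir + ".old"
    shutil.rmtree(old_dir, ignore_errors=True)  # clear leftovers from earlier runs
    if os.path.exists(live_dir):
        os.rename(live_dir, old_dir)
    os.rename(tmp_dir, live_dir)
    shutil.rmtree(old_dir, ignore_errors=True)
```

If the build hangs or crashes, searches keep using the previous index, which addresses the incomplete-index problem described in the original question.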