Foswiki

%META:TOPICINFO{author="ProjectContributor" comment="reprev" date="1393633825" format="1.1" reprev="7" version="8"}%
%META:TOPICPARENT{name="Plugins"}%
---+ Solr Plugin

%SHORTDESCRIPTION%

%TOC%
---++ About Solr

Solr is an open source enterprise search server based on the Lucene Java search
library, with XML/HTTP and JSON APIs, hit highlighting, faceted search,
caching, replication, and a web administration interface. It runs in a Java
servlet container such as Tomcat or Jetty.

This extension comes with a pre-configured Jetty configured to run a Solr webapp
right away.
---++ Installation
---+++ Install the Plugin

First follow the normal plugin instructions as follows.
You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.

Open configure, and open the "Extensions" section. Use "Find More Extensions" to get a list of available extensions. Select "Install".

If you have any problems, or if the extension isn't available in =configure=, then you can still install manually from the command-line. See http://foswiki.org/Support/ManuallyInstallingExtensions for more help.
---+++ Download and Install Solr

The current plugin is based on *Solr 4.6.1*.  Download Solr 4.6.1 from the [[http://archive.apache.org/dist/lucene/solr/][Apache Archive]] and unpack it to the directory where you would like solr and jetty to run.  For example, if you want to install solr in &lt;solr-dir&gt; = /opt/solr and you've downloaded the solr 4.6.1 tar file to your ~/Downloads directory:

<verbatim>cd /opt
sudo tar xvfz ~/Downloads/solr-4.6.1.tgz
sudo mv solr-4.6.1 solr</verbatim>

Verify your install by running the default example:

<verbatim>cd <solr-dir>/example
sudo java -jar start.jar</verbatim>

Go to http://localhost:8983/solr/ in a browser running on your Foswiki server and verify that you can see the solr admin web page. If not, please consult the [[https://lucene.apache.org/solr/4_6_1/][Solr 4.6.1 documentation]].
---+++ Modify your Solr installation for use with Foswiki

Once you have the example up and running successfully, you need to set it up for use with Foswiki. The following instructions assume you want to run solr using [[http://jetdfty.codehaus.org/jetty/][Jetty]]. Consult the Solr documentation if you would like to run Solr under another servlet container like [[http://tomcat.apache.org/][Tomcat]].

First rename the example directory under your solr directory to solr-jetty. This isn't strictly necessary but seems more professional for a production installation. The rest of the instructions assume that you've done this, but if you prefer not to, just remember to substitute =example= for =solr-jetty= when following the instructions below.

<verbatim>sudo mv <solr-dir>/example <solr-dir>/solr-jetty</verbatim>

Create a foswiki core based on the example collection1 core, and disable the collection1 core:

<verbatim>cd <solr-dir>/solr-jetty/solr
sudo cp -a collection1/ foswiki/
sudo mv collection1/core.properties collection1/core.properties.orig</verbatim>

Next, unpack the foswiki-specific configuration files from the &lt;foswiki-dir&gt;/solr directory:

<verbatim>cd <solr-dir>/solr-jetty/solr/foswiki
sudo tar xvfz <foswiki-dir>/solr/solr-config.tgz</verbatim>
---+++ Configure the plugin and test Solr with Foswiki

Start the Solr webservice, just as you did for the example. The admin page in the browser should now show the foswiki core in the Core Selector dropdown.

<verbatim>cd <solr-dir>/solr-jetty
sudo java -jar start.jar</verbatim>

In [[%SCRIPTURL{"configure"}%][configure]] modify the !SolrPlugin foswiki configuration:
   * Configure ={SolrPlugin}{Url}= as =http://localhost:8983/solr/foswiki=
   * Configure ={SolrPlugin}{SolrHome}= as =&lt;solr-dir&gt;/solr-jetty= (for example =/opt/solr/solr-jetty= - the directory that contains start.jar)
   * Configure other options as desired.
---+++ Indexing your content for the first time

<span style="font-size: 1em; line-height: 1.385em;">Go to </span> SolrSearch <span style="font-size: 1em; line-height: 1.385em;">.</span>  <span style="font-size: 1em; line-height: 1.385em;">The page should show a search box but no results since we haven't indexed anything yet.</span>

If everything is working, index a single page:
<verbatim>cd <foswiki-dir>/tools
sudo ./solrindex topic=Main.WebHome</verbatim>

Go to SolrSearch and check that you now see one result - the page you just indexed.

If that worked fine, index everything:
<verbatim>cd <foswiki-dir>/tools
sudo ./solrindex mode=full optimize=on &> ~/solrindex.log</verbatim>

This will crawl all webs, topics and attachment and submit them to the Solr server which in turn will build up the search index. This can take a while depending on the amount of content and number of users registered to your site already. During this process attachments are "stringified" using the StringifierContrib so be sure that you've properly configured that plugin first.  StringifierContrib converts attachments into a plain text format that Solr can read. !SolrPlugin will cache the stringified version of all attachments and only process them again if the corresponding binary version has changed. Thus a next full index run will process significantly faster.

Also note that if you have not specified the CONTENT_LANGUAGE in your %USERSWEB%.SitePreferences, Solr will attempt to detect the content language for each item that it indexes.  This works pretty well, but not perfectly.  If you have a single language site, it's recommended to set the CONTENT_LANGUAGE preference accordingly.
 
!SolrPlugin does read the access rights of all users to a document while indexing it. This access control information is indexed together with the document to secure entries in the database. Any request will take these under consideration so that only users with view rights to a document can retrieve it using !SolrSearch.

---+++ Configure solr logging

By default, Solr is configured to log to both the console and a log file in =/solr-jetty/logs=.  Logging is at INFO level, which results in a lot of output.  This is good for getting started and debugging, but once things are up and running well, you may want to adjust the log level to something less verbose like WARN.  You can do this by editing =/solr-jetty/resources/log4j.properties=.  You can also comment out the parts of the file that result in logging to the console.
---+++ Make solr start automatically as a service

The steps here vary depending on your distribution and which service manager you are using.  The following example uses Upstart:
   1 Create a new file in =/etc/init= called =solr.conf=
   1 Type in the following lines (modify the directory path if necessary) and save the file:
<verbatim>description "Service script for Solr search engine"
start on runlevel [2345]
stop on runlevel [!2345]

kill timeout 30
respawn

script
   chdir /opt/solr/solr-jetty
   exec /usr/bin/java -jar start.jar
end script</verbatim>

Start the service and check that it has started successfully:
<verbatim>sudo service solr start
sudo initctl list | grep solr</verbatim>

---+++ Setting up separate cores for virtual hosting or multiple foswiki instances

If you have multiple foswiki instances running on your server, or a single foswiki instance hosting multiple virtual hosts using Foswiki:Extensions/VirtualHostingContrib, then assign a separate "core" to each
virtual host. This ensures that indexes are created for each virtual host separately and content from one host does not leak to another while still using a single solr service answering queries for all foswiki instances or virtual hosts installed.

Multiple cores are created the same way that you created the single foswiki core described above - just in a directory with a different &lt;core-name&gt;. For virtual hosts, be sure to set the core name to match the name of your virtual host directory:
<verbatim>cd <solr-dir>/solr-jetty/solr
sudo cp -a collection1/ <core-name>/</verbatim>

Next, unpack the foswiki-specific configuration files from the &lt;foswiki-dir&gt;/solr directory:

<verbatim>cd <solr-dir>/solr-jetty/solr/<core-name>
sudo tar xvfz <foswiki-dir>/solr/solr-config.tgz</verbatim>

Restart the solr service. For example, if using Upstart:

<verbatim>sudo service solr restart</verbatim>

Verify that the new core appears in the cores dropdown list of the admin webpage on your local host.

If you are configuring the core to support a separate foswiki instance, be sure to set ={SolrPlugin}{Url}= to =http://localhost:8983/solr/=.  When you run the solrindex script from the tools directory of that foswiki instance, it will use that core.

If you are configuring the core to support virtual foswiki hosts (a single instance of foswiki with multiple virual sites), be sure to set ={SolrPlugin}{Url}= as =http://localhost:8983/solr/= without the core name.  In this case, you need to use the virtualhost versions of the helper tools and specify the host:
   * virtualhost-solrsearch
   * virtualhost-solrindex
   * virtualhost-solrdelete

For example, to index a single topic for a specific host:

<verbatim>cd ..<foswiki-dir>/tools
./virtualhost-solrindex host=<core-name> topic=Main.WebHome</verbatim>

Or to index all topics for all virtual hosts and redirect output to ~/solrindex.out:

<verbatim>cd ..<foswiki-dir>/tools
./virtualhost-solrindex host=all mode=full optimize=on &> ~/solrindex.log</verbatim>

The =solrjob= tool itself is a wrapper around the =solrindex= and will either use =solrindex= or =virtualhost-solrindex=
depending on the =host= commandline parameter.

---+++ Setting up immediate delta indexing

Whenever a topic or attachment in Foswiki changes, solr has to read the changed documents and update its index. This can either be done immediately when a topic is changed or can be done periodically using a cron job . This can be configured to your needs:

Enable/disable updates on save:
<verbatim>$Foswiki::cfg{SolrPlugin}{EnableOnSaveUpdates} = 0;</verbatim>

Enable/disable updates when a new attachment has been uploaded:
<verbatim>$Foswiki::cfg{SolrPlugin}{EnableOnUploadUpdates} = 0;</verbatim>

Enable/disable updates when a topic or attachment has been moved or deleted:
<verbatim>$Foswiki::cfg{SolrPlugin}{EnableOnRenameUpdates} = 0;</verbatim>

All of these settings are disabled by default. That's because updating solr's index might take a noticeable amount of time when clicking on "save" in the wiki editor, even more when the saved topic has got a lot of attachments since when the topic is indexed all of its attachments are re-indexed as well.

---+++ Setting up asynchronous delta indexing

If you decide not to use immediate indexing, you'll need to set up a cron job to do a delta-mode re-index on a regular basis.  Delta mode will re-index all documents that have changed since the last delta-mode indexing was performed.  This can be set up by using a cron job like this:

<verbatim>0-59/5 * * * * <httpd-user> FOSWIKI_ROOT=<foswiki-dir> <foswiki-dir>/tools/solrjob --mode delta --host <core-name></verbatim>

Where:
   * =&lt;httpd-user&gt;= is the user name that your web server is running under - for example www-data
   * =&lt;foswiki-dir&gt;= is the path to your foswiki root directory - for example /var/www/foswiki
   * =&lt;core-name&gt;= is your virtual host core name or =all= for all virtual hosts. If you are not using virtual hosts, leave out the =--host &lt;core-name&gt;= argument
This will start =solrindex= in delta mode every 5 minutes. You could shorten this interval even more. However 5 minutes seems to be a good trade-off of wasting resources vs. having all content updated in a timely manner.

Instead of waiting for cron to trigger the delta indexing job, [[http://iwatch.sourceforge.net/][iwatch]] is a much better alternative giving you near-realtime indexing. Iwatch is available on linux systems that implement the inotify kernel service. That way the underlying operating system will trigger the =solrjob= script as soon as a file has changed. An example of an =iwatch.xml= file triggering a delta index job looks like this:

<verbatim><?xml version="1.0" ?>
<!DOCTYPE config SYSTEM "/etc/iwatch/iwatch.dtd" >

<config>
  <guard email="root@localhost" name="IWatch"/>
  <watchlist>
    <title>Foswiki</title>
    <contactpoint email="root@localhost" name="Administrator"/>
    <path type="recursive" alert="off" syslog="on" filter=".*\.txt$" exec="su <httpd-user> -c 'FOSWIKI_ROOT=<foswiki-dir> <foswiki-dir>/tools/solrjob --host <core-name>'"><data-dir></path>
    <!-- add other virtual hosts here -->
  </watchlist>
</config></verbatim>

Where:
   * =&lt;httpd-user&gt;= is the user name that your web server is running under - for example =www-data=
   * =&lt;foswiki-dir&gt;= is the path to your foswiki root directory - for example /var/www/foswiki
   * =&lt;core-name&gt;= is your core name is the core name for the virtual host of interest or =all= for all virtual hosts. If you are not using virtual hosts, leave out the =--host &lt;core-name&gt;= argument.
   * =&lt;data-dir&gt;= is the path to the data directory for your foswiki installation or virtual host. For example =/var/www/foswiki/data= or =/var/www/foswiki/host2/data=.
You can set iwatch up to run as a service, for example, under Upstart:

Setting up iwatch as a service under Upstart is similar to setting up solr as a service:
   1 Create a new file in =/etc/init= called =iwatch.conf=
   1 Type in the following lines (modify the path to the iwatch.xml file if necessary) and save the file:
<verbatim>description "Service script for iwatch"

start on runlevel [2345]
stop on runlevel [!2345]

expect fork
respawn

exec /usr/bin/iwatch -d -f /etc/iwatch/iwatch.xml</verbatim>

Note that delta indexing is strongly recommended when one of the settings =EnableOnSaveUpdates=, =EnableOnUploadUpdates= or =EnableOnRenameUpdates= is disabled.

---+++ Setting up asynchronous full re-indexing

It is strongly recommended to *fully reindex* all of your documents regularly, presumably every 24h at night.
To achieve this install a cron-job like this one:

<verbatim>0 0 * * * <httpd-user> <foswiki-dir>/tools/solrjob --mode full --host <core-name></verbatim>

Where:
   * =&lt;httpd-user&gt;= is the user name that your web server is running under - for example =www-data=
   * =&lt;foswiki-dir&gt;= is the path to your foswiki root directory - for example =/var/www/foswiki=
   * =&lt;core-name&gt;= is the name of your virtual host core if you are using virtual hosts. Leave out the =--host &lt;core-name&gt;= argument if you are not using virtual hosts.
---++ Usage

In general it is recommended to replace Foswiki's default !AutoViewTemplatePlugin with Foswiki:Extensions/AutoTemplatePlugin. This will allow you to replace the default WebSearch, WebSearchAdvanced, WebChanges and %SYSTEMWEB%.SiteChanges with a solr-driven interface for better usability and performance.

Configure !AutoTemplatePlugin by adding the following ={ViewTemplateRules}=

<verbatim>$Foswiki::cfg{Plugins}{AutoTemplatePlugin}{ViewTemplateRules} = {
...
  'WebSearchAdvanced' => 'SolrSearchView',
  'WebSearch' => 'SolrSearchView',
  'WebChanges' => 'WebChangesView',
  'SiteChanges' => 'SiteChangesView',
...
};</verbatim>

You might also override the !WebSearch of an individual web using a rule along the following lines:

<verbatim>   'MyWeb.WebSearch' => 'SolrSearchView'</verbatim>

---+++ Faceted search interface

<a href="%ATTACHURLPATH%/SolrPluginSnap1.png" class="foswikiImage">
  <img alt="" src="%ATTACHURLPATH%/SolrPluginSnap1.png" width="300" />
</a>

<a href="%ATTACHURLPATH%/SolrPluginSnap2.png" class="foswikiImage">
  <img alt="" src="%ATTACHURLPATH%/SolrPluginSnap2.png" width="300" />
</a>

%RED%Todo: explain usage%ENDCOLOR%

---+++ Macros

SolrPlugin comes with a set of search macros tailored to the extensive capabilities of solr's responses to search queries.
All of them make use of the same set of options to render a response as listed in [[#SOLRSEARCH][SOLRSEARCH]].

<a name="SOLRSEARCH"></a>
---++++ SOLRSEARCH

This is the most central macro of SolrPlugin. It allows to interact with the solr search server and display results within wiki applications.
An example search looks like
<verbatim class="tml">
%SOLRSEARCH{"test"
  format="   1 $web.$topic$n"
  sort="date desc"
}%</verbatim>

This will list the 10 most recently changed topics that match the string "test".
To list the 20 most recently changed topics topics that have the string "test" in their name use
<verbatim class="tml">
%SOLRSEARCH{"topic_search:test"
  format="   1 $web.$topic$n"
  sort="date desc"
  rows="20"
}%</verbatim>

SOLRSEARCH allows you to use the full power of the lucene query language.
Consult the [[http://lucene.apache.org/java/3_1_0/queryparsersyntax.html][Lucene Query Syntax]]
guide to learn more about how to form more complicated queries.

In addition to the full featured lucene query language SOLRSEARCH allows you to
run a query in "dismax" mode. The dismax query handler is a maximum tolerant
query processor that is robust against any sorts of user input to be fed to the
search engine. While the default lucene query handler comes will a quite
complex query language of itself, the dismax handler sacrifices this by aiming
for more robustness.

| *Parameter* | *Description* | *Default* |
| id | a search can be cached optionally for the time of the current request, for example using id="solr1". further calls to %SOLRFORMAT can make use of the cached solr response to render it independent from the location of the %SOLRSEARCH call on the wiki page | |
| search | query string: depending on the search type this can either be a free-form text (type=dismax), a valid lucene query (type=standard) or a combination of both (edismax) | <nop>*:* |
| type | dismax/edismax/standard: query type | standard |
| fields | list of fields to be returned in the result; by default all fields in solr documents are returned; communication between Foswiki and the solr search can be optimized by specifying only those fields that you are interested in while rendering the response | *, score |
| *Flags:* |||
| jump | on/off: jump to the topic specified explicitly in the seach string  | on |
| lucky | on/off: jump to the first result found | off |
| highlight | switch on/off highlighting of found terms | off |
| spellcheck | switch on/off spellchecking to propose alternative spellings in case no search result was found | off |
| *Pagination:* |||
| start | integer index within the result from where to start listing results | 0 |
| rows | maximum number of documents to return | 10 |
| *Filter parameters:* |||
| web | filter by web: this can be any webname | all |
| contributor | filter by contributor to a topic | |
| filter | lucene query to filter results | |
| extrafilter | additional lucene filter query (see %SYSTEMWEB%.SolrSearchBaseTemplate for the difference in =filter= and =extrafilter= | |
| reverse | on/off - reverts sorting if switched on; note: this overrides sorting order specified in =sort= | off |
| sort | sorting expression; examples: =score desc=, =date desc=, =createdate=, =topic_sort= | |
| *Dismax Parameter:* |||
| boostquery | a raw query string (in the solr query syntax) that will be included with the user's query to influence the score. example: =type:topic^1000= will boost results of type topic | see solrconfig.xml and %SYSTEMWEB%.SolrSearchBaseTemplate |
| queryfields | list of fields and their boosts giving each field a significance when a term was found in them. the format supported is fieldOne^2.3 fieldTwo fieldThree^0.4, which indicates that fieldOne has a boost of 2.3, fieldTwo has the default boost, and fieldThree has a boost of 0.4 ... this indicates that matches in fieldOne are much more significant than matches in fieldTwo, which are more significant than matches in fieldThree | see solrconfig.xml and %SYSTEMWEB%.SolrSearchBaseTemplate |
| phrasefields | list of fields and their boosts similar to =queryfields=. this parameter may contain fields and boosts that _pharses_ (specified in quotes) are matched against. boosting those fields higher than their counterpart in =queryfields= allows you to prefer phrase matches over separate word matches | see solrconfig.xml and %SYSTEMWEB%.SolrSearchBaseTemplate |
| *Faceting:* |||
| facets | list of facets to be rendered during search; each facet can be a =title=name= pair specifying the facet name and the title label used to display it in the result; example: %BR% =%MAKETEXT<nop>{"Webs"}%=web, %MAKETEXT<nop>{"Topic type"}%=field_TopicType_lst= | |
| facetquery | query to be used for a facet query | |
| facetoffset | used to page through a list of facets being returned by a search | |
| facetlimit | maximum number of values to be displayed per facet; this is a list of pairs =name=integer= specifying a per-facet limit; example: =50, tag=100, contributor=10, category=10= will constraint the global limit of facet values to be returned to 50, tags to 100, list the top 10 contributors in the hit set as well as the 10 most used categories | 100 |
| facetmincount | minimum frequency of a facet to be included in the result | 1 |
| facetprefix | prefix string of a facet to be included | |
| facetdatestart | part of a date facet describing the start of a time interval | NOW/DAY-7DAYS |
| facetdateend | part of a date facet describing the end of a time interval | NOW/DAY+1DAYS |
| facetdateother | part of a date facet describing the time intervals excluding the one specified with facetdatestart and facetdateend | before |
| hidesingle | comma separated list of facets to be hidden if there's only one choice left | |
| disjunctivefacets | list of facets that are queried using OR; so searching within one facet will _expand_ the search instead of drilling down | facet values are combined using AND |
| combinedfacets | list of facets where values are queried in each of them using OR; for example listing =field_ProjectMembers_lst= and =field_ProjectManager_s= will result in a lucne filter of the form =field_ProjectMembers_lst:%WIKINAME% OR field_ProjectManager_s:%WIKINAME%= | |
| *Formating results:* |||
| correction | format string for corrections proposed by the spellchecker | Did you mean &lt;a href='$url'&gt;$correction&lt;/a&gt; |
| header | format string prepended to the result | |
| format | format string used to render each hit in the result set | |
| separator | format string used to separate hit results rendered using =format= | |
| footer | format string appended to the result | |
| header_interesting | format string prepended to more-like-this queries (see =%SOLRSIMILAR=) | |
| format_interesting | format string used to render more-like-this results | |
| separator_interesting | format string used to separate hit results in more-like-this queries | |
| footer_interesting | format string appended to more-like-this queries | |
| include_interesting | regular expression terms must match in a more-lile-this result | |
| exclude_interesting | regular expression terms must _not_ match in a more-lile-this result | |
| header_&lt;facet&gt; | format string prepended to a facet result | |
| format_&lt;facet&gt; | format string used to render a facet value | |
| separator_&lt;facet&gt; | format string used to separate facet values | |
| footer_&lt;facet&gt; | format string appended to facet results | |
| include_&lt;facet&gt; | regular expression facet values must match to be displayed | |
| exclude_&lt;facet&gt; | regular expression facet values must _not_ match to be displayed | |
<!--
| morelikethis | switch on/off more-like-this results per document found | off |
-->

---++++ SOLRFORMAT

When a solr response has been cached using the =id= parameter to [[#SOLRSEARCH][SOLRSEARCH]], it can be reused by subsequent calls to %SOLRFORMAT.
<verbatim class="tml">
%SOLRSEARCH{"test" 
  id="solr1"
  facets="web,contributor"
  facetlimit="web=10, contributor=10"
}%

<noautolink>
*Results:*
%SOLRFORMAT{"solr1"
  format="   1 [[$web.$topic][$topic]]$n"
}%

*Webs:*
%SOLRFORMAT{"solr1"
  format_web="   * $key ($count)$n"
}%

*Contributors:*
%SOLRFORMAT{"solr1"
  format_contributor="   * $key ($count)$n"
  exclude_contributor="UnknownUser|AdminGroup|AdminUser|RegistrationAgent|TestUser"
}%
</noautolink></verbatim>

---++++ SOLRSIMILAR

SOLRSIMILAR allows to return a list of similar topics given the current one.

| *Parameter* | *Description* | *Default* |
| "..." | query string referencing the document(s) to return similar ones for | =id:%BASEWEB%.%BASETOPIC% |
| like | list of fields used to compute similarity | category, tag |
| fields | list of fields and their boost value to be included in result items | web, topic, title, score |
| filter | restricts results to those matching this filter | type:topic |
| include | switches on/off inclusion of the matched document found in the query parameter | off |
| limit | maximum number of results to return | 100 |
| boost | | |
| mintermfrequency | | |
| mindocumentfrequency | | |
| mindwordlength | | |
| maxdwordlength | | |
%SOLRSIMILAR{"id:%BASEWEB%.%BASETOPIC%"
   filter="web:%BASEWEB% type:topic"
   fields="web,topic,title,score"
   header="<h2>Similar Topics</h2>$n<ul>" 
   footer="</ul>"
   format="<li>
       <a href='%SCRIPTURLPATH{"view"}%/$web/$topic' title='score: $score'>
         $percntDBCALL{\"Applications.RenderTopicThumbnail\"
           OBJECT=\"$web.$topic\"
           TYPE=\"plain\"
         }$percnt<!-- -->
         <!-- -->$title
         $percntDBQUERY{
           header=\"<br /><span class='foswikiTopicInfo foswikiSmallish TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml TMLhtml'>\"
           topic=\"$web.$topic\" 
           format=\"$formfield(Summary)\"
           footer=\"</span>\"
         }$percnt
         %CLEAR%
       </a>
     </li>"
   separator="$n"
   rows="5"
}%<!-- solrsimilar -->
<verbatim class="foswikiHidden">
---++++ SOLRSCRIPTURL

---+++ Rest inteface

---++++ search

---++++ terms

---++++ similar

---++++ autocomplete

---+++ Commandline tools

---++++ solrstart

---++++ solrindex

---++++ solrdelete

---+++ Perl interface

---++++ registerIndexTopicHandler()

---++++ registerIndexAttachmentHandler()</verbatim>

---++ Solr indexing schema

!SolrPlugin comes with a custom schema to index general Foswiki data as defined
in the =&lt;solr-home-dir&gt;conf/schema.xml= file. It offers support for generic
!DataForm values, so adding any new !DataForm definition will allow to use
those formfields for faceting directly without changing configurations or having to reindex
the content.

The process of indexing content is configured on the Foswiki side which will crawl all webs, topics
and their attachments thus creating lucene documents which are then sent over to the solr server.
A lucene document is made up of fields of a certain type which defines the way the document should be processed
by the solr server. This is configured in the =schema.xml= file.

While the schema is able to cover all Foswiki related data it is still kept generic enough to be used for non-wiki
content as well. Different kinds of content are distinguished using the =collection= field (see below).

---+++ Field types

This is the list of the most common field types used in the default schema.
See the =schema.xml= for more exotic field types like =point= and =location=
useful for spacial search.

| *Type* | *Description* |
| string | not analyzed, but indexed/stored verbatim |
| boolean | boolean values (true, false) |
| binary | the data should be sent/retrieved in as Base64 encoded strings |
| int, float, long, double | default numeric field types. for faster range queries, consider the tint/tfloat/tlong/tdouble types |
| date | the format for this date field is of the form 1995-12-31T23:59:59Z, \
and is a more restricted form of the canonical representation of \
[[http://www.w3.org/TR/xmlschema-2/#dateTime][dateTime]]. The trailing \
"Z" designates UTC time and is mandatory. Optional fractional seconds \
are allowed: 1995-12-31T23:59:59.999Z All other components are \
mandatory. Note: for faster range queries, consider the tdate type |
| text_ws | a text field that only splits on whitespace for exact matching of words |
| text | a general text field that has reasonable, generic cross-language \
defaults: it tokenizes with !StandardTokenizer, removes stop words from \
case-insensitive "stopwords.txt", and down cases. At query time only, it also \
applies synonyms. |
| text_generic | same as =text= but also splits words on case change while generating word parts. \
a general unstemmed text field - good if one does not know the language of the field. \
this field type is usful when searching for parts of a !WikiWord |
| text_substring | same as =text_generic= but with substring decomposition |
| text_spell | generic text analysis for spell checking |
| text_sort | this is a text field suitable for sorting alphabetically |
| text_rev | a general unstemmed text field that indexes tokens normally and also \
reversed, to enable more efficient leading wildcard queries. |
| type | a text field used to analyse different content type; Type mappings are defined in =typemaping.txt=; for example, that's where all image file extension are mapped to "image", same for "video" |

---+++ Fields

| *Name* | *Type* | *Multivalued* | *Stored* | *Description* |
| timestamp | tdate | | stored | time when the document was added to the index |
| spell | text_spell | multivalued | | used for spellchecking |
| id | string | | stored | unique identifier for each document; this is the _external_ id usable in applications; there's an internal solr document id not related to this field |
| collection | string | | stored | identifies a set of documents comming from the same content collection; by default all content stored in Foswiki (topics and attachments) is gathered in the =wiki= collection set in =Foswiki::cfg{SolrPlugin}{DefaultCollection}= |
| language | string | | stored | language of the current document; this may be specified explicitly using the =CONTENT_LANGUAGE= preference, or set to "detect" to let the solr update chain detect the language automatically |
| url | string | | stored | url used to access the document being indexed |
| type | type | | stored | holds the type facet of the document; this is "image" for all kinds of images, "video" for all kinds of videos, "topic" for Foswiki topics and the verbatim file extension for everything else; note: plugins like Foswiki:Extensions/MetaCommentPlugin might use specific types as well (like "comment" in this case) |
| web | string | | stored | name of the web this document is located in |
| topic | string | | stored | name of the topic |
| webtopic | string | | stored | concatenation of the web and topic part |
| access_granted | string | multivalued | | this field controls access of users to this topic or attachment in the search index; every query is augmented with an ACL check against this field; only users listed in this field are allowed view rights; special value is "all" when there are no view restrictions |
| title | string | | stored | title of a document; a topic title is read from a =TopicTitle= formfield, a =TOPICTITLE= preference variable or defaults to the =topic= name itself; for attachments this is the filename with the extension stripped off |
| summary | text_generic | | stored | this is a plainified summary of the topic text |
| author | string | | stored | the name of the user that changed the document most recently |
| contributor | string | multivalued | stored | list of users that contributed to this topic at some point in time |
| date | tdate | | stored | time the the document was changed last |
| version | float | | | current version of the topic |
| text | text_generic | | | document text |
| createauthor | string | | stored | author of the initial version of this document |
| createdate | tdate | | stored | date when the initial version of this document was created |
| catchall | text_generic | multivalued | stored | copy-field that gathers content from (allmost) all fields; this is the default search field for the "standard" query parser; note that fields to be queried can be configured per request using the "dismax" handler |
| substrings | text_substring | multivalued | | holds substring analysis of the most important search fields |
| phonetic | phonetic | multivalued | | holds the phonetic analysis of the most important search fields |
| state | string | | | used by comments or any other application that tracks specific states of a document, such as "new", "unapproved", "approved", "draft", "unpublished", "published", ... |
| parent | string | | stored | parent topic of the current topic |
| form | string | | stored | name of the form attached to the current topic |
| preference | string | multivalued | stored | this field catches all topic preferences. each preference is captured in a dynamic field as well (see dynamic fields below) |
| attachment | string | multivalued | stored | list of all attachment names of this topic |
| outgoing | string | multivalued | stored | list of all outgoing links; this information is used to detect backlinks |
| category | string | multivalued | stored | list of categories this document is in; note: this field will only be used if Foswiki:Extensions/ClassificationPlugin is installed; it will populate it with the list of all categories up to !TopCategory; content of this field is copied to =category_search= as well (see generic fields below) |
| tag | string | multivalued | stored | list of tags assigned to this document; note: this field will only be used if Foswiki:Extensions/ClassificationPlugin is installed; content of this field is copied to =category_search= as well (see generic fields below) |
| name | string | | stored | filename of an attachment |
| comment | text_generic | | stored | comment field of an attachment |
| size | tint | | stored | size of an attachment in bytes |

---+++ Dynamic fields

Dynamic fields are generated based on the content properties of the document to
be indexed. Fields are specified using some kind of wildcard in =schema.xml=.
When a document is indexed, the wildcard will be expanded to create a proper
field name. Dynamic fields allow to apply specific ways of analyzing fields
based on their name, as well as cover fields that aren't known in advance, like
the name of all formfields of a !DataForm that ever could be invented.

When !SolrPlugin is about to index a !DataForm attached to a topic, it tries to
guess the data type of each formfield. Normally, Foswiki does not specify any
type information within a !DataForm definition. Exceptions are (1) date: these
are mapped to a *_dt field and (2) checkbox, select, radio, textboxlist: these
are potentially multi-value fields and are thus indexed in a *_lst field.

Every other formfield is stored into an *_s field as well as into a *_search field.
The former captures the exact content while the latter analyses the text more thoroughly
optimized for fuzzy searching.

!DataForm formfields are mapped to lucene document fields by prepending the =field_*=
prefix to prevent name clashes with other dynamic fields generated on the fly.
So for example a formfield =ProjectManager= will be stored in =field_ProjectManager_s=
and =field_ProjectManager_search=. Likewise a select+multi formfield =ProjectMembers=
will be stored in =field_ProjectMembers_lst= as it is a multivalued field.

If a formfield name already comes with one of the below suffixes (_i, _l, _f, _dt, etc)
then this suffix will be used instead of any heuristics trying to derive the best
field type for the lucene field. That way !DataForm fields although untyped by Foswiki
can be indexed type-specific nevertheless.

Similarly topic preferences are indexed using a =preference_*= prefix.

| *Name* | *Type* | *Multivalued* | *Stored* | *Description* |
| *_i | tint | | stored | fields with a _i suffix are indexed as an integer number |
| *_l | tlong | | stored | fields with a _l suffix are indexed as a long integer |
| *_f | tfloat | | stored | fields with a _f suffix are indexed as a float |
| *_d | tdouble | | stored | fields with a _f suffix are indexed as a double precision float |
| *_b | boolean | | stored | true, false |
| *_s | string | | stored | dynamic field for unanalyzed text |
| *_t | text_generic | | stored | generic text |
| *_dt | tdate | | stored | a dateTime value |
| *_lst | string | multivalued | stored | this field is used for any multi-valued formfield in !DataForms like, select, radio, checkbox, textboxlist |
| preference_* | string | | stored | preference values such as =preference_NAMEOFPREFERENCE_t= |
| *_search | text_generic | | stored | generic text, optimized for searching |
| *_sort | text_sort | | stored | text optimized for sorting alphabetically |

---+++ Copy fields

Finally, after having defined all field type there are some fields that are created by copying some
source field to a destination field using the =copyField= feature of solr. So while most of a lucene document
to be indexed is created by the crawler and indexer explicitly, some more are created automatically to facilitate
specific search applications. The destination fields are then analysed using the dynamic field definitions as given above.

| *Source* | *Destination* |
| web | web_sort |
| topic | topic_sort |
| title | title_sort |
| category | category_search |
| tag | tag_search |
| title | title_search |
| topic | topic_search |
| web | web_search |
| webtopic | webtopic_search |
| attachment | catchall |
| category | catchall |
| comment | catchall |
| field_* | catchall |
| form | catchall |
| name | catchall |
| tag | catchall |
| text | catchall |
| title | catchall |
| topic | catchall |
| type | catchall |
| state | catchall |
| attachment | substrings |
| category | substrings |
| comment | substrings |
| contributor | substrings |
| field_* | substrings |
| form | substrings |
| name | substrings |
| tag | substrings |
| text | substrings |
| title | substrings |
| topic | substrings |
| type | substrings |
| attachment | phonetic |
| category | phonetic |
| comment | phonetic |
| contributor | phonetic |
| field_* | phonetic |
| form | phonetic |
| name | phonetic |
| tag | phonetic |
| text | phonetic |
| title | phonetic |
| topic | phonetic |
| type | phonetic |
| attachment | spell |
| comment | spell |
| field_* | spell |
| form | spell |
| name | spell |
| text | spell |
| title | spell |
| topic | spell |
| web | spell |
<verbatim class="foswikiHidden">
---++ Templates

---+++ Structure of !SolrSearchBaseTemplate

---+++ Replacing !WebSearch and !WebChanges

---+++ Creating custom search interfaces</verbatim>

---++ Plugin Info

<!--
   * Set SHORTDESCRIPTION = Enterprise Search Engine for Foswiki based on [[http://lucene.apache.org/solr/][Solr]]
-->
|  Author: | Foswiki:Main.MichaelDaum |
|  Copyright: | � 2009-2013, Michael Daum http://michaeldaumconsulting.com |
|  License: | GPL ( [[http://www.gnu.org/copyleft/gpl.html][GNU General Public License]]) |
|  Release: | 14 Jul 2013 |
|  Version: | v2.0.0 |
|  Home: | Foswiki:Extensions/%TOPIC% |
|  Support: | Foswiki:Support/%TOPIC% |
|  Change History: | <!-- versions below in reverse order --> |
|  14 Jul 2013: | added support for [[Foswiki:Extensions/PiwikPlugin][PiwikPlugin]] |
|  14 Mar 2013: | improved indexing performance; \
added configurable http timeouts takling to the solr backend; \
fixed language mappings for multilingual content; \
fixes due to latest changes in jquery.moment |
|  17 Oct 2011: | fixed WebServices::Solr to only encode to utf8 if needed; \
fixed handling character encoding on a pure utf8 foswiki; \
fixed schema for spell correction |
|  29 Sep 2011: | improved schema.xml: \
replaced !StandardTokenizer with !WhitespaceTokenizer, \
using new !ClassicTokenizer and !ClassicFilter to feed the spellchecker, \
switched spellchecker to !JaroWinklerDistance and lowered the \
frequency threshold for a term to be added to the \
spellchecker; \
building the spellchecker when optimizing the index now; \
fixed detecting the content language |
|  28 Sep 2011: | added multilanguage support per document; \
fixed default values in %SOLRSIMILAR; \
speeding up indexing by better caching ACLs; \
implemented mapping facet values to any other label; \
during query time; \
added Language facet to default search interface |
|  26 Sep 2011: | improved default boosting in dismax to prefer topic hits a lot stronger than attachments; \
improved default cache settings for better default performace; \
added support to distribute updates and search in a master-slave setup; \
added =boostquery=, =queryfields=, =phrasefields= parameter to customize boosting and sorting; \
improved default schema while documenting it |
|  21 Sep 2011: | upgrading to solr-3.4.0; \
fixed utf8 handling; \
added jump and i-feel-lucky options; \
made hidesingle configurable per facet; \
added disjunctivefacets and combinedfacets; \
fixed handling of date fields; \
support new ui::autocomplete in !JQueryPlugin; \
using type-specific icons in Foswiki:Extensions/MimeIconPlugin if installed; \
fixed quoting lucene queries; \
indexing outgoing links to support fast backlinks; \
adding fields createauthor, language and collection to schema; \
disabling phonetic boost in schema by default; \
be more robust in case of mallformed !DataForm definitions; \
copying every string field into a search field also to allow exact as well as fuzzy search; \
enhancing normalizeWebTopicName to create uniform web names using dots, not slashes everywhere; \
fixed parsing inline topic permissions; \
externalized sidebar pager into a new plugin of its own: Foswiki:Extensions/JQSerialPagerContrib; \
upgrading to <nop>WebService::Solr-0.14 ... which now requires CPAN:XML::Easy instead of CPAN:XML::Generator; \
lots of improvements to !SolrSearchBaseTemplate; \
now supporting Foswiki:Extensions/InfiniteScrollContrib in !SolrSearch; \
documentation improvements |
|  19 Apr 2011: | shipping a multicore setup by default; \
added support for Foswiki:Extensions/VirtualHostingContrib; \
fixed utf8 recoding; \
some usability improvements to faceted search interface; \
fixing illegal control characters in output (Oliver Schaub) |
|  16 Dec 2010: | added =state= field to schema used for approval workflows; \
added solrjob to ease cronjobbing indexing; \
added docu how to use iwatch for almost-realtime indexing; \
fixed dependencies to include Foswiki:Extensions/FilterPlugin as well; \
fixed mapping facet values to their display title in search interface; \
fixed delta updates not properly removing outdated attachment entries when these where moved/renamed; \
and some minor html improvements |
|  03 Dec 2010: | fixed solr-based !WebChanges and !SiteChanges using !PatternSkin |
|  01 Dec 2010: | adjustments due to changes in stringifier api; \
fixed removal of deleted webs from search index |
|  22 Nov 2010: | fixes integration with pattern skin |
|  18 Nov 2010: | initial public release |

%META:FILEATTACHMENT{name="SolrPluginSnap1.png" attachment="SolrPluginSnap1.png" attr="" comment="" date="1290091281" size="93552" user="ProjectContributor" version="1"}%
%META:FILEATTACHMENT{name="SolrPluginSnap2.png" attachment="SolrPluginSnap2.png" attr="" comment="" date="1290091283" size="158013" user="ProjectContributor" version="1"}%