As of Solr 4.0, SolrCloud is the preferred way to distribute indexes for redundancy, failover, and improved performance. Index Replication and Distributed Search are considered obsolete technologies; while still supported, they are not in active development. See the section on Using SolrCloud in LucidWorks for more information on using SolrCloud with LucidWorks Search.
Consider using distributed search when an index becomes too large to fit on a single system, or when a single query takes too long to execute. Distributed search can reduce query latency by splitting the index into multiple shards, querying all shards in parallel, and merging the results.
Distributed search should not be used if queries to a single index are fast enough but one simply wishes to expand the capacity (queries per second) of the system. In this case, standard Index Replication should be used.
To utilize distributed search, the index must be split into shards across multiple servers. Each shard is a LucidWorks Search server with a fully functional index that can be queried independently but contains only a fraction of the complete search collection.
If using distributed indexing with a Solr XML data source type, you may encounter a situation where the crawl never ends unless LucidWorks is restarted. This is due to a problem in the distributed index processor and the way Solr XML files are crawled by LucidWorks.
There are two possible solutions to this problem:
One method of splitting the search collection into multiple shards is to index some documents to each shard instead of sending all documents to a single shard. Updates to a document should always be sent to the same shard, and documents should not be duplicated on different shards.
A Distributed Update Processor can be enabled to automatically support distributed indexing by sending update requests to multiple servers (shards).
Distributed indexing is enabled via the solrconfig.xml file, found in $LWE_HOME/solr/cores/collection/conf (replace collection with the name of the collection being configured for distributed indexing). It is not enabled by default. The solrconfig.xml file must be installed on each shard, and the shards should be listed in the same order in each file.
The distributed update processor is controlled by two parameters, shards and self, which may either be specified in solrconfig.xml, or supplied with a specific update request to Solr.
- shards lists the servers in the cluster. The list should be exactly the same (that is, in the same order) in the configuration file for every server in the cluster.
- self should be different for each server in the cluster and should match that server's entry in shards. It allows updates destined for the local server to be added directly rather than routed through the HTTP interface. If it is missing, distributed update will still work, but less efficiently.
To start using distributed indexing, find the following section in solrconfig.xml and uncomment the shard location definitions. Below is an example of a shard definition that is not commented out.
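The exact section varies by release; the sketch below is illustrative, with hypothetical chain, class, and host names, and shows only the general shape of an uncommented shard definition. Match the names to the section actually present in your solrconfig.xml.

```xml
<!-- Illustrative sketch only; the chain and processor class names here
     are placeholders for whatever your solrconfig.xml actually ships with. -->
<updateRequestProcessorChain name="distrib">
  <processor class="com.lucid.update.DistributedUpdateProcessorFactory">
    <!-- Same list, in the same order, on every server in the cluster. -->
    <arr name="shards">
      <str>host1:8888/solr</str>
      <str>host2:8888/solr</str>
    </arr>
    <!-- Unique per server; must match this server's entry in shards. -->
    <str name="self">host1:8888/solr</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```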
If distributed indexing has been configured as above, then any indexing initiated from the LucidWorks Search administration user interface, such as crawling directories, will be handled appropriately by sending some documents to each server. The distributed update processor can also be used in conjunction with any update handler when updating Solr directly. The /update/xml and /update/csv update handlers are already configured to use distrib, the distributed update processor, by default.
If an update handler has not been configured to use the distributed update processor, it may be specified in the URL via the update.processor parameter:
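For example (the hostname and port are placeholders, and distrib is the chain name from the configuration above):

```
http://host1:8888/solr/update?update.processor=distrib
```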
If the self and shards parameters are not configured in solrconfig.xml, they may be specified as arguments on the update URL.
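For example, again with placeholder hosts:

```
http://host1:8888/solr/update?update.processor=distrib&shards=host1:8888/solr,host2:8888/solr&self=host1:8888/solr
```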
Update commands may be sent to any server on which distributed indexing is configured correctly. Document adds and deletes are forwarded to the appropriate server/shard based on a hash of the unique document ID. commit and deleteByQuery commands are sent to every server listed in shards.
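As an illustration (hypothetical hosts), a commit posted to any one correctly configured server is forwarded to every server listed in shards:

```
curl 'http://host1:8888/solr/update/xml?update.processor=distrib' \
     -H 'Content-Type: text/xml' --data-binary '<commit/>'
```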
After a logical index is split across multiple shards, distributed search is used to make requests to all shards and merge the results, so that they appear to come from a single server.
Distributed search can be used with Solr request handlers such as standard, dismax, or lucid (the handler used by LucidWorks Search), or with any other search handler based on org.apache.solr.handler.component.SearchHandler.
The following Solr components currently support distributed searching:
- The Query component that returns documents matching a query
- The Facet component, for facet.query and facet.field requests where facet.sorted=true (the default: return the constraints with the highest counts)
- The Highlighting component, which highlights results
- The Debug component
The presence of the shards parameter in a request will cause that request to be distributed across all shards in the list. The syntax of shards is host1:port1/base_url1,host2:port2/base_url2,...
The example below would query across 3 different shards, combining the results:
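A sketch of such a request, with hypothetical hosts and ports following the syntax above:

```
http://localhost:8983/solr/select?q=ipod&shards=host1:8983/solr,host2:8983/solr,host3:8983/solr
```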
As a convenience to clients, a new request handler can be created with shards set as a default, like any other parameter (see the sketch below).
The shards parameter should not be set as a default in the standard request handler, as this could cause infinite recursion.
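A minimal sketch of such a dedicated handler (the handler name and hosts are hypothetical):

```xml
<!-- A separate handler avoids putting shards on the standard handler. -->
<requestHandler name="distrib-search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">host1:8983/solr,host2:8983/solr,host3:8983/solr</str>
  </lst>
</requestHandler>
```

Clients would then select this handler on an ordinary query, for example with qt=distrib-search, without supplying the shards list themselves.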
To provide fault tolerance and increased scalability, standard replication can be used to provide multiple identical copies of each index shard. Each shard would have a master and multiple slaves.
Only the master for each shard should be configured in distributed indexing or specified to the distributed update processor. There is no fault tolerance while indexing: if the master for a shard goes down, indexing should be suspended.
Each shard will have multiple replicas. A Virtual IP (VIP) consisting of all of a shard's replicas should be configured in the load balancer for each shard. The LucidWorks Search distributed search configuration and the shards parameter of distributed search requests should use these VIPs.
A single VIP consisting of all the shard VIPs should be configured for all external systems to use the search service.
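As a sketch of how the pieces fit together (all VIP names hypothetical): external clients address the top-level search VIP, while the distributed search configuration references the per-shard VIPs, each of which fronts a master and its replicas:

```
http://search-vip:80/solr/select?q=solr&shards=shard1-vip:80/solr,shard2-vip:80/solr
```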