Consider using distributed search when an index becomes too large to fit on a single system, or when a single query takes too long to execute. Distributed search can reduce query latency by splitting the index into multiple shards, querying all shards in parallel, and merging the results.
This functionality is available in LucidWorks Enterprise but not LucidWorks Cloud.
Distributed Indexing
To utilize distributed search, the index must be split into shards across multiple servers. Each shard is a LucidWorks Enterprise server containing a complete index that can be queried independently, but which only contains a fraction of the complete search collection.
Manual Distributed Indexing
One method of splitting the search collection into multiple shards is to index some documents to each shard instead of sending all documents to a single shard. Updates to a document should always be sent to the same shard, and documents should not be duplicated on different shards.
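One way to meet both constraints when indexing manually is to derive the target shard from a stable hash of each document's unique id. The following is a minimal Python sketch under stated assumptions, not LucidWorks code: the `shard_index` and `partition` helpers are hypothetical, documents are assumed to be dictionaries with an `id` field, and MD5 is used only because it gives the same value on every machine and in every process.

```python
import hashlib
from collections import defaultdict

def shard_index(doc_id, num_shards):
    """Map a document id to a shard number using a stable hash (assumption: MD5)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce it
    # to the range [0, num_shards).
    return int.from_bytes(digest[:8], "big") % num_shards

def partition(docs, shards):
    """Group documents into one batch per shard, keyed by shard address."""
    batches = defaultdict(list)
    for doc in docs:
        shard = shards[shard_index(doc["id"], len(shards))]
        batches[shard].append(doc)
    return dict(batches)
```

Each batch can then be sent to its own shard. Because the mapping depends on the number and order of the shards, changing the shard list reassigns documents and generally requires reindexing.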
Manual Configuration
A Distributed Update Processor can be enabled to automatically support distributed indexing by sending update requests to multiple servers (shards).
Enabling distributed indexing is done via the solrconfig.xml file, found in $LWE_HOME/solr/cores/collection/conf (replace collection with the name of the collection that is being configured for distributed indexing). By default it is not enabled. The solrconfig.xml file needs to be installed on each shard, and the shards should be listed in the same order in each file.
The distributed update processor is controlled by two parameters, shards and self, which may either be specified in solrconfig.xml, or supplied with a specific update request to Solr.
- shards lists the servers in the cluster. The list should be exactly the same (that is, in the same order) in the configuration file for every server in the cluster.
- self should be different for each server in the cluster and should match the entry in shards for the particular server. It is used to allow updates for the particular server to be directly added rather than going through the HTTP interface. If it is missing, distributed update will still work, but will be less efficient.
To start using distributed indexing, find the following section in solrconfig.xml and uncomment the shard location definitions. Below is an example of a shard definition that is not commented out.
<updateRequestProcessorChain name="distrib">
  <processor class="com.lucid.update.DistributedUpdateProcessorFactory">
    <!-- example configuration...
         "shards" should be in the *same* order for every server
         in a cluster. Only "self" should change to represent
         what server *this* is. -->
    <str name="self">localhost:8983/solr</str>
    <arr name="shards">
      <str>localhost:8983/solr</str>
      <str>localhost:7574/solr</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory">
    <int name="maxNumToLog">10</int>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
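Since every server needs the same shards list and its own self value, it can help to check the configuration files before starting the cluster. The sketch below is one possible check in Python using only the standard library. The helper names `read_distrib_settings` and `check_cluster` are hypothetical, and the code assumes the chain is laid out as in the example above (a `self` string and a `shards` array inside one processor element).

```python
import xml.etree.ElementTree as ET

def read_distrib_settings(solrconfig_xml, chain_name="distrib"):
    """Return (self_value, shards) from the named update processor chain."""
    root = ET.fromstring(solrconfig_xml)
    for chain in root.iter("updateRequestProcessorChain"):
        if chain.get("name") != chain_name:
            continue
        for processor in chain.findall("processor"):
            shards_el = processor.find("arr[@name='shards']")
            if shards_el is None:
                continue  # not the distributed update processor
            self_el = processor.find("str[@name='self']")
            self_value = self_el.text.strip() if self_el is not None else None
            shards = [s.text.strip() for s in shards_el.findall("str")]
            return self_value, shards
    return None, []

def check_cluster(configs):
    """Return a list of problems found across the configs of all servers."""
    settings = [read_distrib_settings(xml_text) for xml_text in configs]
    problems = []
    seen_self = set()
    for i, (self_value, shards) in enumerate(settings):
        if shards != settings[0][1]:
            problems.append(f"server {i}: shards list differs from server 0")
        if self_value is None:
            continue  # allowed, but updates are less efficient
        if self_value not in shards:
            problems.append(f"server {i}: self {self_value!r} is not in shards")
        if self_value in seen_self:
            problems.append(f"server {i}: self {self_value!r} is reused")
        seen_self.add(self_value)
    return problems
```

Passing the text of each server's solrconfig.xml to `check_cluster` returns an empty list when the shards lists match in order and every self value is distinct and present in shards.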
Indexing Documents
If distributed indexing has been configured as above, then any indexing initiated from the LucidWorks Enterprise administration user interface, such as crawling directories, will be appropriately handled by sending some documents to each server. One can use the distributed update processor in conjunction with any update handler while directly updating Solr. The /update/xml and /update/csv update handlers are already configured to use distrib, the distributed update processor, by default.
If an update handler has not been configured to use the distributed update processor, it may be specified in the URL via the update.processor parameter:
http://localhost:8888/solr/collection1/update?update.processor=distrib
If the self and shards parameters are not configured in solrconfig.xml, they may be specified as arguments on the update URL:
http://localhost:8888/solr/collection1/update?update.processor=distrib&self=localhost:8888/solr&shards=localhost:8983/solr,localhost:7574/solr,localhost:8888/solr
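A client that sends updates programmatically could assemble such a URL from its parts. The following Python sketch shows one way to do so with the standard library. The `update_url` helper and its argument names are hypothetical; only the parameter names update.processor, self, and shards come from the text above.

```python
from urllib.parse import urlencode

def update_url(base, processor="distrib", self_shard=None, shards=None):
    """Build an update URL that selects the distributed update processor."""
    params = [("update.processor", processor)]
    if self_shard:
        params.append(("self", self_shard))
    if shards:
        # The shards value is a single comma-separated list.
        params.append(("shards", ",".join(shards)))
    # Leave ':', '/' and ',' unescaped so the result matches the URL style
    # used in this document.
    return base + "?" + urlencode(params, safe=":/,")
```

For example, `update_url("http://localhost:8888/solr/collection1/update", self_shard="localhost:8888/solr", shards=["localhost:8983/solr", "localhost:7574/solr", "localhost:8888/solr"])` produces the URL shown above.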
Update commands may be sent to any server with distributed indexing configured correctly. Document adds and deletes are forwarded to the appropriate server/shard based on a hash of the unique document id. commit commands and deleteByQuery commands are sent to every server in shards.
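The forwarding rule can be pictured as a small dispatch function. The sketch below is an illustration in Python, not the implementation of the distributed update processor: the `targets` helper and the command names `add` and `delete` are hypothetical, and the hash function (MD5 here) is an assumption, since the text does not say which hash is used.

```python
import hashlib

def targets(command, shards, doc_id=None):
    """Return the shards that should receive the given update command."""
    if command in ("commit", "deleteByQuery"):
        # These commands apply to the whole logical index, so every
        # server in the shards list receives them.
        return list(shards)
    if command in ("add", "delete"):
        if doc_id is None:
            raise ValueError("add and delete require a document id")
        # Assumption: any stable hash of the unique id works for illustration.
        digest = hashlib.md5(doc_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(shards)
        return [shards[index]]
    raise ValueError(f"unknown command: {command!r}")
```

An add and a later delete for the same id resolve to the same single shard, while a commit reaches all of them.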
Distributed Search
After a logical index is split across multiple shards, distributed search is used to send requests to all shards and merge the results so that they appear to come from a single server.
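The merge step can be illustrated with a simplified Python sketch. It is not how Solr implements merging: it assumes each shard returns a list of documents already sorted by descending score, and that each document is a dictionary with hypothetical `id` and `score` fields.

```python
import heapq
from itertools import islice

def merge_results(per_shard_results, rows=10):
    """Merge per-shard result lists into the overall top `rows` documents.

    Each input list must already be sorted by descending score.
    """
    merged = heapq.merge(
        *per_shard_results,
        key=lambda doc: doc["score"],
        reverse=True,
    )
    # Only the first `rows` documents of the merged stream are needed.
    return list(islice(merged, rows))
```

Because every shard holds different documents, the merged list contains no duplicates as long as documents were never indexed to more than one shard.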
Programmatic Distributed Search
One can use distributed search with Solr request handlers such as standard, dismax, or lucid (the handler used by LucidWorks Enterprise), or with any other search handler based on org.apache.solr.handler.component.SearchHandler.
Supported Components
The following Solr components currently support distributed searching:
- The Query component that returns documents matching a query
- The Facet component, for facet.query and facet.field requests where facet.sort=true (the default: return the constraints with the highest counts)
- The Highlighting component, which highlights results
- The Debug component
The presence of the shards parameter in a request will cause that request to be distributed across all shards in the list. The syntax of shards is host1:port1/base_url1,host2:port2/base_url2,...
The example below would query across 3 different shards, combining the results:
http://localhost:8888/solr/collection1/select?shards=localhost:8983/solr,localhost:7574/solr,localhost:8888/solr&q=super
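A client may want to validate a shards value before sending it. The Python sketch below parses the host:port/base_url syntax described above. The `parse_shards` helper is hypothetical, and it is stricter than the syntax strictly requires in that it rejects entries without an explicit numeric port or base URL.

```python
def parse_shards(value):
    """Split a shards parameter into (host, port, base_url) tuples."""
    shards = []
    for entry in value.split(","):
        entry = entry.strip()
        # Everything before the first '/' is host:port; the rest is the base URL.
        host_port, slash, base_url = entry.partition("/")
        host, colon, port = host_port.partition(":")
        if not (slash and colon and host and port.isdigit() and base_url):
            raise ValueError(f"malformed shard entry: {entry!r}")
        shards.append((host, int(port), base_url))
    return shards
```

For the example request above, the shards value parses into three entries, one per server.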
As a convenience to clients, a new request handler could be created with shards set as a default like any other ordinary parameter.
The shards parameter should not be set as a default in the standard request handler, as this could cause infinite recursion: each shard that receives a sub-request would apply the same default and distribute the request again.
Scalability and Fault Tolerance
To provide fault tolerance and increased scalability, standard replication can be used to provide multiple identical copies of each index shard. Each shard would have a master and multiple slaves.
Indexing in a Fault Tolerant Distributed Configuration
Only the master for each shard should be configured in distributed indexing or specified to the distributed update processor. There is no fault tolerance while indexing: if the master for a shard goes down, indexing should be suspended.
Searching in a Fault Tolerant Distributed Configuration
Each shard will have multiple replicas. A Virtual IP (VIP) should be configured in the load balancer for each shard, consisting of all replicas of that shard. Both the LucidWorks Enterprise distributed search configuration and the shards parameter for distributed search requests should use these VIPs.
A single VIP consisting of all the shard VIPs should be configured for all external systems to use the search service.
