By default, LucidWorks Search will crawl as much content as it can (within limits set on the data source), parse the documents to extract fields, and finally index the documents in one seamless step. However, there may be times when you would like to do some processing on the documents before indexing them, perhaps to add metadata or to modify data in specific fields. In that case, it is possible to only crawl the content and save it in a batch for later parsing and/or indexing. This is called Batch Processing, and it allows you to separate the process of fetching data from the process of parsing rich formats (such as PDFs, Microsoft Office documents, and so on), and from the process of indexing the parsed content in Solr.
Batches consist of the following two parts:
- a container with raw documents, and the protocol-level metadata per document
- a container with parsed documents, ready to be indexed.
The exact format of this storage is specific to each crawler controller implementation. Currently a simple file-based store is used, with a binary format for the raw content part and a JSON format for the parsed documents. The first container is created during the fetching phase, and the second container is created during the parsing phase. A new round of fetching creates a new batch if one or more of the parameters described below requires it.
It's not possible to configure Batch Crawling with the LucidWorks Search Admin UI. To work with batches and batch jobs, use the Batch Operations API. The basic workflow is as follows:
- Create a data source using the Admin UI or Data Sources API. Don't start crawling yet.
- Configure the data source to be saved as a batch by setting the indexing parameter to false using the Data Sources API. You can also set the caching and parsing parameters as described below.
- Start the crawl and let it finish.
- Get the batch_id for the data source using the Batch Operations API call: `GET http://localhost:8888/api/collections/collection1/batches`.
- Using the Batch Operations API, start the batch job for your data source using the batch_id obtained in the previous step: `PUT http://localhost:8888/api/collections/collection1/batches/crawler/job/batch_id`.
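The workflow above can be sketched as a few small helpers that build the request URLs and bodies. The base URL and the batches endpoints follow the examples above; the `/datasources/{id}` path and the helper names are assumptions for illustration, not part of the documented API.

```python
import json

# Base URL as used in the API examples above.
BASE = "http://localhost:8888/api/collections/collection1"

def disable_indexing_request(ds_id):
    """Step 2: PUT this body to the data source to enable batch crawling.
    The /datasources/{id} path is an assumption for illustration."""
    url = f"{BASE}/datasources/{ds_id}"
    body = json.dumps({"indexing": False})
    return url, body

def list_batches_url():
    """Step 4: GET this URL to find the batch_id for the data source."""
    return f"{BASE}/batches"

def start_batch_job_url(crawler, batch_id):
    """Step 5: PUT to this URL to start the batch job for that batch."""
    return f"{BASE}/batches/{crawler}/job/{batch_id}"
```

Any HTTP client can then issue the actual requests; the sketch only shows how the pieces of the workflow fit together.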
To instruct LucidWorks Search not to parse or index the crawled documents, set the indexing parameter of a data source to false using the Data Sources API. You can also set the parsing and caching parameters to true or false, depending on your needs. Batch crawling attributes for data sources are as follows:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| parsing | boolean | true | If true, the raw content fetched from remote repositories is immediately parsed in order to extract the plain text and metadata. If false, the content is not parsed: it is stored in a new batch with its protocol-level metadata. New batches are created during each crawl run as needed. |
| caching | boolean | false | If true, the raw content is stored in a batch even if immediate parsing and/or indexing is requested. You can use this to preserve the intermediate data in case of crawling or indexing failure, or in cases where full re-indexing is needed and you would like to avoid fetching the raw content again. |
| indexing | boolean | true | If true, the parsed content is sent to Solr for indexing. If false, the parsed document is not indexed: it is stored in a batch (either a newly created one, or the one where the corresponding raw content was stored). Set this attribute to false to enable batch crawling. |
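To illustrate how the three attributes combine, the dictionaries below mirror the JSON you might send via the Data Sources API; the field names come from the table above, while the variable names are only labels for the scenarios:

```python
# Typical combinations of the batch-crawling attributes.
# Setting "indexing" to false is what switches a data source to batch crawling.

# Fetch only: store raw content in a batch, parse and index later.
fetch_only = {"parsing": False, "caching": False, "indexing": False}

# Parse during the crawl, but hold the parsed documents in a batch
# instead of sending them to Solr.
parse_no_index = {"parsing": True, "indexing": False}

# Normal crawl and index, but keep a raw copy so a later full
# re-index can skip re-fetching the content.
index_with_cache = {"parsing": True, "caching": True, "indexing": True}
```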
When you configure a data source to process documents as a batch, information about crawl attempts is displayed in the Admin UI for that data source (even though you cannot configure the batch parameters via the UI). So, you can use the Data Sources API to enable caching and/or disable indexing, and then initiate the crawl through the Admin UI. The UI will show the number of documents found, updated, deleted, and so on.
Not all crawler controllers support all batch processing operations. For example, the Aperture crawler (lucid.aperture) does not support raw content storage: it behaves as if the "parsing" parameter is always true and caching is always false.
You can also use the Batch Operations API to get the status of running batch jobs, stop them, delete batches and batch jobs, and so on.