The High Volume HDFS (HV-HDFS) Crawler is a Map-Reduce-enabled crawler designed to leverage the scaling qualities of Apache Hadoop while indexing content into LucidWorks Search. In conjunction with LucidWork's usage of SolrCloud, applications should be able to meet their large scale indexing and search requirements.
To achieve this, HV-HDFS consists of a series of Map-Reduce-enabled Jobs to convert raw content into documents that can be indexed into LucidWorks which in turn relies on the Behemoth project (specifically, the LWE fork/branch of Behemoth hosted on Github) for Map-Reduce-ready document conversion via Apache Tika and writing of documents to LucidWorks.
The HV-HDFS crawler is currently marked as "Early Access" and is thus subject to changes in how it works in future releases.
System Requirements
- Apache Hadoop. We've tested with Hadoop 0.20.2 and 1.0. Other versions will likely work as well, but we recommend thorough testing for compatibility.
- LucidWorks running in SolrCloud mode.
Please note, explanation of setting up Hadoop is beyond the scope of this document. We recommend reading one of the many tutorials found online or one of the books on Hadoop.
Using HV-HDFS in LucidWorks
Once Hadoop and SolrCloud are ready, configure a data source within LucidWorks Search, either with the High Volume HDFS data source type in the Admin UI or using the Data Sources API.
| Unlike other crawlers in LucidWorks Search, this crawler currently has no way of tracking which content is new, updated, or deleted. Thus, all content found is reported as "new" with each crawl. |
How it Works
The HV-HDFS crawler consists of three stages designed to take in raw content and output results to LucidWorks. These stages are:
- Create one or more Hadoop SequenceFiles from the raw content, where the key in the SequenceFile is the name of the file and the value is the contents of the file stored as a BehemothDocument. This process is currently done sequentially. A BehemothDocument is an intermediate form designed to allow for reuse across several systems without having to reprocess the raw content. The results are written to the Work Path area with a path name with a pattern of inputToBehemoth_<Collection Name><Data Source Id><System Time in Milliseconds>.
- Run a Map-Reduce job to extract text and metadata from the raw content using Apache Tika. This is similar to the LucidWorks approach of extracting content from crawled documents, except it is done with Map-Reduce.
- Run a Map-Reduce job to send the extracted content from HDFS to LucidWorks using the SolrJ client. This implementation works with SolrJ's CloudServer Java client which is aware of where LucidWorks is running via Zookeeper.
| The processing approach is currently all or nothing when it comes to ingesting the raw content and all 3 steps must be completed each time, regardless of whether the raw content hasn't changed. Future versions may allow the crawler to restart from the SequenceFile conversion process. |
Differences from Other Hadoop Crawlers in LucidWorks
While the HV-HDFS, Hadoop File System (HDFS) and Hadoop File System over S3 (S3H) crawlers all use Hadoop to access Hadoop's distributed file system, there is a big difference in how they utilize those resources. The HDFS and S3H data sources are designed to be polite and crawl through the content stored in HDFS just as if they were crawling a web site or any other file system.
The HV-HDFS crawler, on the other hand, is designed to take full advantage of the scaling abilities of Hadoop's Map-Reduce architecture. Thus, it runs jobs using all of the nodes available in the Hadoop cluster just like any other Hadoop Map-Reduce job. This has significant ramifications for performance since it is designed to move a lot of content, in parallel, as fast as possible (depending on Hadoop's capabilities), from its raw state to the LucidWorks index. Thus, you will need to design your LucidWorks SolrCloud implementation accordingly and make sure to provision the appropriate number of nodes.
Conversion to SequenceFiles
The first step of the crawl process converts the input content into a SequenceFile. In order to do this, the entire contents of that file must be read into memory so that it can be written out as a BehemothDocument in the SequenceFile. Thus, you should be careful to ensure that the system does not load into memory a file that is larger than the Java Heap size of the process. In certain cases, Behemoth can work with existing files such as SequenceFiles to convert them to Behemoth SequenceFiles. Contact Lucid Imagination for possible alternative approaches.
Example: Indexing Shakespeare with Map-Reduce
The following steps demonstrate indexing the complete works of Shakespeare using the LucidWorks Search HV-HDFS crawler.
Prepare the Content
- Download the data into a temporary directory.
- cd /tmp
- mkdir shakespeare
- wget http://www.it.usyd.edu.au/~matty/Shakespeare/shakespeare.tar.gz
- Unpack the archive with tar -xf shakespeare.tar.gz -C shakespeare
- Load the data to Hadoop with <PATH TO HADOOP>/bin/hadoop fs -put shakespeare ./shakespeare
Setup LucidWorks
Start LucidWorks in SolrCloud mode and make note of where Zookeeper is running. See the section on Using SolrCloud in LucidWorks for more information on how to start in SolrCloud mode.
Setup the Data Source and Run
Create a new data source using either the Admin UI or the Data Source API, as described above. The "path" would be the location of the Hadoop NameNode, such as, hdfs://lucidserver:54310/user/lucidworks/shakespeare.
Once the data source is created, you can use the Hadoop UI to track the progress of the various Map-Reduce jobs. You can also inspect your specified work path to see the intermediate files using the Hadoop filesystem commands (e.g., hadoop fs -ls).
