LucidWorks can crawl a Hadoop Distributed File System (HDFS). The following parameters are available to configure Hadoop as a data source for LucidWorks.

LucidWorks has been tested with Hadoop v0.20.2 only at this time. Other releases labeled 0.20.2 are expected to use the same protocols and may work with LucidWorks, but have not been tested. Any release prior to 0.20.x or after 0.20.xxx is not expected to work, because the protocols have changed.

| Field | Description |
|---|---|
| Name | A name you want to give this data source. Data source names may contain any combination of letters, digits, spaces and other characters, and are case-sensitive. |
| Commit within (minutes) | Defines the maximum interval between commits. A commit adds the crawled documents to the index, so that they become searchable while crawling continues. The default is 15 minutes. Commits can be resource-heavy, so if your index is large, you may want to set this higher to avoid system slowdowns in either crawling or searching. |
| Commit when crawl finishes | If this box is checked, the commit will not start until the end of the crawl. This may be preferable if documents do not need to be available to users immediately, or if your index is large and you want to avoid system resource conflicts. |
| URL | The path to the top node of the file system. This path is entered as the complete path to the directory containing the documents to be indexed. Various types of relative paths, such as ../ or ~/, are not supported. |
| Crawl Depth | The number of levels of directories that should be crawled. Enter 1 to crawl only the directory entered, or more than 1 to index files found in subdirectories. Leaving it blank will crawl all subdirectories. |
| Constrain to | If tree is chosen, the crawler will only access files under the URL, which is used as the base path. For example, if crawling /Path/to/Files, the tree option is the equivalent of /Path/to/Files/*. Choosing none allows the crawl to follow symbolic links, if any are present, to directories outside the base directory path entered. |
| Skip Files Larger Than (bytes) | The crawler can skip files larger than a specified size. This may be useful if index size or the length of time it takes to crawl a data source is a concern. The default of 10,485,760 bytes (10 MB) is automatically entered. |
| Include Paths | Defines the directories in the file system that should be crawled for indexing. If left blank, all subdirectory paths will be followed (limited by the crawl depth). This feature can be used to limit a file system crawl to specific subdirectories of a base directory path. For example, if the base directory path is /Path/to/Files, Include Paths could be used to limit the crawl to subdirectories /Path/to/Files/Archive/2010/* and /Path/to/Files/Archive/2011/*. Paths are specified as regular expressions; see Using Regular Expressions for more information. |
| Exclude Paths | Directories in the file system that should not be crawled and that should be excluded from the index. This can also be used to exclude certain document types from being indexed. Exclude Paths uses the same regular expression syntax as Include Paths. |
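
The way the Crawl Depth, Skip Files Larger Than, Include Paths and Exclude Paths settings narrow a crawl can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LucidWorks crawler itself: the function names are hypothetical, the patterns are assumed to be matched against the whole path, and the order in which the checks run is a guess.

```python
import re
from typing import Optional, Sequence

# Default from the table above: 10,485,760 bytes.
DEFAULT_MAX_BYTES = 10 * 1024 * 1024


def depth_of(base: str, path: str) -> int:
    """Levels below the base directory; a file directly in base counts as 1."""
    relative = path[len(base.rstrip("/")):].strip("/")
    return len(relative.split("/"))


def should_crawl(
    path: str,
    size: int,
    base: str,
    crawl_depth: Optional[int] = None,   # None models a blank Crawl Depth field
    max_bytes: int = DEFAULT_MAX_BYTES,
    include: Sequence[str] = (),         # empty models a blank Include Paths field
    exclude: Sequence[str] = (),
) -> bool:
    """Hypothetical helper: would a file at `path` be crawled with these settings?"""
    root = base.rstrip("/")
    if not path.startswith(root + "/"):
        return False  # models "Constrain to: tree"
    if crawl_depth is not None and depth_of(root, path) > crawl_depth:
        return False
    if size > max_bytes:
        return False
    # Assumption: a pattern must match the entire path.
    if include and not any(re.fullmatch(p, path) for p in include):
        return False
    if any(re.fullmatch(p, path) for p in exclude):
        return False
    return True


base = "/Path/to/Files"
include = [r"/Path/to/Files/Archive/2010/.*", r"/Path/to/Files/Archive/2011/.*"]
exclude = [r".*\.tmp"]

print(should_crawl("/Path/to/Files/Archive/2010/report.pdf", 2048, base,
                   include=include, exclude=exclude))  # True
print(should_crawl("/Path/to/Files/Archive/2009/report.pdf", 2048, base,
                   include=include, exclude=exclude))  # False
print(should_crawl("/Path/to/Files/Archive/2010/scratch.tmp", 2048, base,
                   include=include, exclude=exclude))  # False
```

The sketch writes the patterns with `.*` so that they are valid regular expressions under the whole-path assumption; the actual matching rules are described in Using Regular Expressions.
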
When you have filled in the form, click Create Data source.
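
Before submitting the form, it can help to sanity-check the values you plan to enter. The following Python sketch is one possible way to do that. The `validate_settings` function and the dictionary keys are hypothetical and are not part of LucidWorks, and Python's `re` module only stands in for whatever regular expression engine LucidWorks uses, so a pattern that compiles here is not guaranteed to behave identically there.

```python
import re
from typing import Any, Dict, List


def validate_settings(settings: Dict[str, Any]) -> List[str]:
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems: List[str] = []

    # The URL must be a complete path: ../ and ~/ are not supported.
    url = settings.get("url", "")
    segments = url.split("/")
    if url.startswith("~") or ".." in segments or "~" in segments:
        problems.append("url: relative paths such as ../ or ~/ are not supported")

    # Crawl Depth is 1 or more, or blank (None here) to crawl all subdirectories.
    depth = settings.get("crawl_depth")
    if depth is not None and depth < 1:
        problems.append("crawl_depth: enter 1 or more, or leave blank")

    # Include Paths and Exclude Paths are regular expressions, so they must compile.
    for key in ("include_paths", "exclude_paths"):
        for pattern in settings.get(key, []):
            try:
                re.compile(pattern)
            except re.error as exc:
                problems.append(f"{key}: invalid pattern {pattern!r}: {exc}")

    return problems


settings = {
    "url": "/Path/to/Files",
    "crawl_depth": 2,
    "include_paths": [r"/Path/to/Files/Archive/2010/.*"],
    "exclude_paths": [r".*\.tmp"],
}
print(validate_settings(settings))  # [] when no problems are found
```
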