LucidWorks Search data sources are one of the ways in which content can be loaded into the Big Data system. Documents indexed in this way do not go through the extract-transform-load workflow described in Workflows, which means the incoming documents cannot be analyzed for clustering or statistically interesting phrases. Documents can also be sent directly to the system using the Big Data Document Indexing API. More background information on this topic is available in the section on Loading Data.
Because LucidWorks Big Data includes LucidWorks Search "under the hood", all of the data source types available with LucidWorks Search are available to the Big Data system. Below we give a couple of examples of using data sources in the Big Data context; full details on each type are available in the LucidWorks Search documentation.
Background Information
Data sources describe the target repository of documents and the method used to access it. This description is then used to create a crawl job that is executed by a specific crawler implementation (called a crawler controller).
A data source is defined by selecting a crawler controller, then specifying a valid type for that crawler. Some crawlers work with several types; others support only one. It is important to match the correct crawler with the type of data source to be indexed.
Each crawler and specified type has different supported attributes. A few attributes are common across all crawlers and types, and some types share attributes with other types using the same crawler. Review the supported attributes carefully when creating data sources with this API.
At present, LucidWorks Search includes the following built-in crawler controllers that support the following kinds of data sources:
| Crawler Controller | Symbolic Name | Data Source Types Supported |
|---|---|---|
| Aperture-based crawlers | lucid.aperture | |
| DataImportHandler-based JDBC crawler | lucid.jdbc | |
| SolrXML crawler | lucid.solrxml | |
| Google Connector Manager-based crawler | lucid.gcm | |
| Remote file system and pseudo-file system crawler | lucid.fs | |
| External data | lucid.external | external |
| Twitter stream | lucid.twitter.stream | twitter_stream |
| High-Volume HDFS | lucid.map.reduce.hdfs | high_volume_hdfs |
We'll only cover the Twitter and High-Volume HDFS connectors in this guide. Please see the LucidWorks Search documentation for full details of each type of data source supported.
API Entry Points
/sda/v1/client/collections/collection/datasources: list or create data sources in a particular collection
/sda/v1/client/collections/collection/datasources/id: start, remove, or get details for a particular data source
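As a minimal sketch, the two entry points can be built by a small shell helper. The function name datasources_url and the SDA_BASE variable are illustrative and not part of the API; the default host and port are taken from the curl examples in this section.

```shell
# Hypothetical helper: builds a data source URL from the entry points above.
# SDA_BASE defaults to the host and port used in this section's curl examples.
SDA_BASE="${SDA_BASE:-http://localhost:8341}"

datasources_url() {
    # $1 = collection name, $2 = optional data source ID
    if [ -n "${2:-}" ]; then
        printf '%s/sda/v1/client/collections/%s/datasources/%s\n' "$SDA_BASE" "$1" "$2"
    else
        printf '%s/sda/v1/client/collections/%s/datasources\n' "$SDA_BASE" "$1"
    fi
}
```

For example, `curl -u administrator:foo "$(datasources_url documentation 49)"` would request the details of data source 49 in the "documentation" collection.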
Get a List of Data Sources
GET /sda/v1/client/collections/collection/datasources
Input
Path Parameters
| Key | Description |
|---|---|
| collection | The collection name. |
Query Parameters
None.
Output
Output Content
A JSON list with one entry per data source. Within each entry, the properties element is a JSON map of attributes to values. The exact set of attributes for a particular data source depends on its type, although some attributes are common to all data source types. Type-specific attributes are discussed in the sections for those types below.
Common Attributes
These attributes are used for all data source types (except where specifically noted).
General Attributes
| Key | Type | Description |
|---|---|---|
| id | 32-bit integer | The numeric ID for this data source. |
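| type | string | The type of this data source. Valid types depend on the crawler; this guide covers twitter_stream and high_volume_hdfs. |
| crawler | string | Crawler implementation that handles this type of data source. The crawler must be able to support the specified type. Valid crawlers are the symbolic names listed in the table of crawler controllers above, such as lucid.twitter.stream. |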
| collection | string | The name of the document collection that documents will be indexed into. |
| name | string | A human-readable name for this data source. Names may consist of any combination of letters, digits, spaces and other characters. Names are case-insensitive, and do not need to be unique: several data sources can share the same name. |
| category | string | The category of this data source: Web, FileSystem, Jdbc, SolrXml, Sharepoint, External, or Other. For informational purposes only. |
| mapping | JSON map | Attributes that define how incoming fields in the content will be handled. They can be mapped to other fields in the content, fields can be explicitly inserted into the content, or defaults for missing content can be supplied. If mapping is something you'd like to work with, please see the LucidWorks Search documentation section on Data Sources. |
| output_type | string | The fully qualified class name that determines the output format of the crawl. For LucidWorks Big Data, this must always be com.lucid.sda.hbase.lws.HBaseUpdateController. The default is intended for direct indexing by Solr, but LucidWorks Big Data first stores all documents in HBase and then synchronizes with Solr. If output_type is not specified and the default is used, documents will not be available for Analysis. |
| output_args | string | A ZooKeeper connect string that the HBase library can understand, such as localhost:2181. |
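Because a data source created without output_type falls back to the default, it can be worth checking a payload before sending it. The sketch below assumes the payload is held in a shell string. has_hbase_output_type is a hypothetical helper, and the whitespace-stripping comparison is a simplification rather than real JSON parsing.

```shell
HBASE_OUTPUT_TYPE='com.lucid.sda.hbase.lws.HBaseUpdateController'

has_hbase_output_type() {
    # $1 = JSON payload as a string.
    # Strips all whitespace, then looks for the exact key/value pair.
    printf '%s' "$1" | tr -d '[:space:]' \
        | grep -Fq "\"output_type\":\"$HBASE_OUTPUT_TYPE\""
}
```

The function returns 0 when the pair is present, so it can guard a request: `has_hbase_output_type "$payload" || echo "output_type is missing or wrong" >&2`.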
Field Mapping
The output also includes the field mapping for the data source, which is modifiable as part of the regular data source update API.
Each of the attributes will be shown under the main attribute mapping, which contains a JSON map with several keys and values. For more information, please see the LucidWorks Search documentation section on Data Sources.
Optional Commit Rules
The following attributes are optional and relate to when new documents will be added to the index:
| Key | Type | Description |
|---|---|---|
| commit_within | integer | Number of milliseconds that defines the maximum interval between commits while indexing documents. The default is 900,000 milliseconds (15 minutes). |
| commit_on_finish | boolean | When true (the default), a commit is invoked at the end of the crawl. |
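The commit_within value is expressed in milliseconds, so an interval given in minutes has to be converted first. A small sketch of the arithmetic (the helper name is illustrative):

```shell
minutes_to_commit_within() {
    # $1 = interval in whole minutes; prints the equivalent number of milliseconds
    echo $(( $1 * 60 * 1000 ))
}

minutes_to_commit_within 15   # prints 900000, the documented default
```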
Batch Processing
Batch processing allows crawling a repository of content, but not indexing the content until a later time (perhaps after some additional processing). A few attributes control batch processing and are also optional. They are not covered in this section, but are covered in the section of LucidWorks Search documentation on Processing Documents in Batches.
Twitter Stream Attributes
The Twitter Stream data source type uses Twitter's streaming API to index tweets on a continuous basis.
This data source uses the lucid.twitter.stream crawler. Unlike other crawlers, which generally have some kind of defined end point (even if that end point is after hundreds of thousands or millions of documents), this crawler opens a stream and will not stop until Twitter stops. The Data Source Jobs API in LucidWorks Search will allow you to stop the stream if necessary.
This data source is in the early stages of development and does not yet process deletes. Deletes are shown in Data Source History statistics, but these only count the deletion notices received from the streaming API. The deleted tweets themselves may or may not be in the index, and if they are, the data source does not yet process them.
In order to successfully configure a Twitter stream, you must first register your application with Twitter and accept their terms of service. The registration process will provide you with the required OAuth tokens you need to access the stream. To get the tokens, follow these steps:
- Make sure you have a Twitter account, and go to http://dev.twitter.com/ and sign in.
- Choose the "Create an App" link and fill out the required details. The callback field can be skipped. Hit "Create Application" to register your application.
- The next page will contain the Consumer Key and Consumer Secret, which you will need to configure the data source in LucidWorks.
- At the bottom of the same page, choose "Create My Access Token".
- The next page will contain the Access Token and Token Secret, which you will also need to configure the data source in LucidWorks.
While you need a Twitter account to register an application, you do not use your Twitter credentials to configure this data source. Take the Consumer Key, Consumer Secret, Access Token, and Token Secret information and store it where you can access it while configuring the data source.
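One way to keep the four values at hand is to export them as environment variables and check that none is missing before building the data source. This is only a sketch: the variable names and the missing_twitter_credentials helper are assumptions, not names used by LucidWorks or Twitter.

```shell
# Hypothetical variable names; set them to the values shown on the Twitter pages:
#   TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET,
#   TWITTER_ACCESS_TOKEN, TWITTER_TOKEN_SECRET

missing_twitter_credentials() {
    # Prints the name of each credential variable that is unset or empty.
    for name in TWITTER_CONSUMER_KEY TWITTER_CONSUMER_SECRET \
                TWITTER_ACCESS_TOKEN TWITTER_TOKEN_SECRET; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}
```

If the function prints nothing, all four values are available to the shell that will send the request.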
When creating a data source of type twitter_stream, the value lucid.twitter.stream must be supplied for the crawler attribute, described in the section on common attributes above. The common attributes are available for configuration in addition to those listed below.
| Key | Type | Required | Default | Description |
|---|---|---|---|---|
| access_token | string | Yes | Null | The access token is provided after registering with Twitter and requesting an access token (see above). |
| consumer_key | string | Yes | Null | The consumer key is provided after registering with Twitter (see above). |
| consumer_secret | string | Yes | Null | The consumer secret is provided after registering with Twitter (see above). It should be treated as a password for your registered application. |
| filter_follow | list | No | Null | A set of specific Twitter user IDs to filter the stream. If combined with another filter, they act as OR statements on the stream (i.e., the tweet must match the user ID or the keyword or the location). Note that this is not the user handle or screen name, but a numeric ID assigned by Twitter. To find the ID, you could do an API request like https://api.twitter.com/1/users/show.xml?screen_name=usaa, replacing "usaa" with the user handle as necessary. The ID is found in the "id" field of the XML output. |
| filter_locations | list | No | Null | A set of bounding boxes (latitude/longitude, up, down, right, left, etc.) to filter the stream for geographic location. If combined with another filter, they act as OR statements on the stream (i.e., the tweet must match the user ID or the keyword or the location). |
| filter_track | list | No | Null | A set of keywords to filter the stream. If combined with another filter, they act as OR statements on the stream (i.e., the tweet must match the user ID or the keyword or the location). |
| max_docs | long | No | -1 | While testing the feed, it may be desirable to limit initial streams to a specific number of tweets. The default is -1, which keeps the connection open until it is closed manually. |
| sleep | integer | Yes | 10000 | Twitter will occasionally throttle streaming, in which case you can configure the data source to wait the requisite amount of time before trying again. The default is 10,000 milliseconds, which should be sufficient for most scenarios. |
| token_secret | string | Yes | Null | The token secret is provided after registering with Twitter and requesting an access token (see above). It should be treated as a password for your API access. |
{
"access_token": "G9IdEbD0bK7F8DFuo20srKKboGud9ecrx8i0MdG-RxFUFfbKx0",
"caching": false,
"category": "Other",
"collection": "collection1",
"commit_on_finish": true,
"commit_within": 900000,
"consumer_key": "dQF16QSFRnRFYwyJunRjdj1",
"consumer_secret": "vVx4OWuxvvXSVQ0BUYOkSu8DCNvkvvUaVl0vuDWvU4",
"crawler": "lucid.twitter.stream",
"id": 4,
"indexing": true,
"mapping": {
...
},
"max_docs": -1,
"name": "Twitter Stream",
"output_type":"com.lucid.sda.hbase.lws.HBaseUpdateController",
"output_args":"localhost:2181",
"parsing": true,
"sleep": 10000,
"token_secret": "9hE4rcCWVwCDL6C1CCzWwcWtr1c14iw3M4xiD3CDV9rS",
"type": "twitter_stream"
}
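The filter attributes are typed as lists, but the example above does not set any of them. The sketch below assumes that a list is sent as a JSON array of strings, which this section does not show explicitly. json_string_array is a hypothetical helper, and the keywords are placeholders.

```shell
json_string_array() {
    # Prints the arguments as a JSON array of strings.
    # Sketch only: values must not contain double quotes or backslashes.
    out=''
    for item in "$@"; do
        out="${out:+$out,}\"$item\""
    done
    printf '[%s]\n' "$out"
}

# Placeholder keywords for the filter_track attribute.
payload=$(printf '{"filter_track":%s}' "$(json_string_array solr hadoop)")
echo "$payload"
```

The resulting string could be merged into the JSON body used to create the data source, or sent on its own when updating an existing one.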
High-Volume HDFS Attributes
The High-Volume HDFS (HV-HDFS) data source uses a MapReduce-enabled crawler designed to leverage the scaling qualities of Apache Hadoop while indexing content.
To achieve this, HV-HDFS consists of a series of MapReduce-enabled jobs that convert raw content into documents that can be indexed. These jobs rely on the Behemoth project (specifically, the LWE fork of this project) for MapReduce-ready document conversion via Apache Tika and for writing documents to LucidWorks.
The HV-HDFS data source is currently marked as "Early Access" and is thus subject to changes in how it works in future releases.
Before using the HV-HDFS Data Source type, please review the section on Using the High Volume HDFS Crawler in LucidWorks Search documentation.
When creating a data source of type high_volume_hdfs, the value lucid.map.reduce.hdfs must be supplied for the crawler attribute, described in the section on common attributes above. The common attributes are available for configuration in addition to those listed below.
| Key | Type | Required | Default | Description |
|---|---|---|---|---|
| MIME_type | string | No | Null | Allows limiting the crawl to content of a specific MIME type. If a value is entered, Behemoth will skip Tika's MIME detection and process the content with the appropriate parser. The MIME type should be entered in full, such as "application/pdf" for PDF documents. |
| hadoop_conf | string | Yes | Null | The location of the Hadoop configuration directory that contains the Hadoop core-site.xml, mapred-site.xml and other Hadoop configuration files. This path must reside on the same machine as the LucidWorks server. Hadoop does not need to be running on the same server as LucidWorks, but the configuration directory must be available from the LucidWorks server. |
| path | string | Yes | Null | The input path where your data resides. It is not required that this be in HDFS to begin with, since the first step of the process converts content to one or more SequenceFiles in HDFS. For example, hdfs://bob:54310/path/to/content and file:///path/to/local/content would both be valid inputs. |
| recurse | boolean | No | True | If true, the default, the crawler will crawl all subdirectories of the input path. Set to false if the crawler should stay within the top directory specified with the path attribute. |
| tika_content_handler | string | No | com.digitalpebble.behemoth.tika.TikaProcessor | In most cases, the default, com.digitalpebble.behemoth.tika.TikaProcessor, does not need to be changed. If you need to change it, enter the fully qualified class name of a Behemoth Tika Processor that is capable of extracting content from documents. The class must be available on the classpath in the Job jar used by LucidWorks. |
| work_path | string | Yes | /tmp | A path to use for intermediate storage. Note the connector does not clean up temporary content so it can be used in debugging, if necessary. Once the job is complete, content stored in this location temporarily can be safely deleted. For example, hdfs://bob:54310/tmp/hv_hdfs would be a valid path. |
| zookeeper_host | string | Yes | Null | The host and port where ZooKeeper is running and coordinates SolrCloud activity, entered as hostname:port. |
{
"MIME_type": "",
"commit_on_finish": true,
"mapping": {
...
},
"collection": "collection1",
"work_path": "hdfs://example:54310/tmp",
"type": "high_volume_hdfs",
"recurse": true,
"crawler": "lucid.map.reduce.hdfs",
"id": 3,
"category": "External",
"tika_content_handler": "com.digitalpebble.behemoth.tika.TikaProcessor",
"zookeeper_host": "allie:9888",
"name": "Example",
"path": "/Users/projects/content/citeseer/1",
"commit_within": 900000,
"hadoop_conf": "/Users/projects/hadoop/hadoop-0.20.2/conf",
"output_type":"com.lucid.sda.hbase.lws.HBaseUpdateController",
"output_args":"localhost:2181"
}
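This section does not show a request that creates an HV-HDFS data source. As a hedged sketch, the JSON body could be assembled from the required attributes in the table above. hv_hdfs_payload is a hypothetical helper; the host names and paths in the usage lines are copied from the example above and must be replaced with your own.

```shell
hv_hdfs_payload() {
    # $1 = name, $2 = path, $3 = hadoop_conf, $4 = work_path, $5 = zookeeper_host
    # Sketch only: arguments must not contain double quotes or backslashes.
    printf '{"name":"%s","type":"high_volume_hdfs","crawler":"lucid.map.reduce.hdfs",' "$1"
    printf '"path":"%s","hadoop_conf":"%s","work_path":"%s","zookeeper_host":"%s",' "$2" "$3" "$4" "$5"
    printf '"output_type":"com.lucid.sda.hbase.lws.HBaseUpdateController",'
    printf '"output_args":"localhost:2181"}\n'
}

# Example values taken from the sample data source above.
hv_hdfs_payload "Example" "/Users/projects/content/citeseer/1" \
    "/Users/projects/hadoop/hadoop-0.20.2/conf" "hdfs://example:54310/tmp" "allie:9888"
```

The printed JSON can be passed to curl with -d and posted to the datasources entry point of the target collection, in the same way as the Twitter example under Create a Data Source.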
Examples
Input
Get all data sources for the "documentation" collection.
curl -u administrator:foo -X GET http://localhost:8341/sda/v1/client/collections/documentation/datasources
Output
The output below omits the mapping sub-attributes, which define how incoming content is handled. This example data source was created automatically by requesting documents to be added to the index with the Document Indexing API.
[
{
"children": [],
"collection": "documentation",
"createTime": 1337800027787,
"id": "14",
"properties": {
"callback": null,
"category": "External",
"collection": "documentation",
"commit_on_finish": true,
"commit_within": 900000,
"crawler": "lucid.external",
"id": 14,
"mapping": {
...
},
"name": "documentation_SDA_DS",
"output_args": "localhost:2181",
"output_type": "com.lucid.sda.hbase.lws.HBaseUpdateController",
"source": "SDA",
"source_type": "sda_document_service",
"type": "external",
"url": "external:SDA"
},
"status": "EXISTS",
"throwable": null
}
]
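The numeric ID needed by the other calls is the id inside properties, not the quoted top-level id. A rough sketch for pulling those IDs out of the listing with standard tools follows. datasource_ids is a hypothetical helper, and the text-splitting approach is an assumption that works for responses shaped like the one above; a real JSON parser would be more robust.

```shell
datasource_ids() {
    # Reads the JSON list of data sources on stdin.
    # Splits the text at commas and braces, then prints every unquoted numeric "id".
    tr ',{}' '\n\n\n' \
        | sed -n 's/^[[:space:]]*"id"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\)[[:space:]]*$/\1/p'
}
```

It would typically be used at the end of a pipeline, for example `curl -s -u administrator:foo http://localhost:8341/sda/v1/client/collections/documentation/datasources | datasource_ids`.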
Create a Data Source
POST /sda/v1/client/collections/collection/datasources
Input
Path Parameters
| Key | Description |
|---|---|
| collection | The collection name |
Query Parameters
None
Input content
JSON block with all attributes. The ID field, if present, will be ignored. See attributes in section on getting a list of data sources.
Output
Output Content
JSON representation of new data source. Attributes returned are listed in the section on getting a list of data sources.
Examples
Input
Create a new data source to consume Twitter in the "documentation" collection. Note that the access_token, consumer_key, consumer_secret and token_secret are only examples and should be replaced with your own values.
curl -u administrator:foo -X POST -H 'Content-type: application/json' \
-d '{
"access_token":"G9IdEbD0bK7F8DFuo20srKKboGud9ecrx8i0MdG-RxFUFfbKx0",
"consumer_key":"dQF16QSFRnRFYwyJunRjdj1",
"consumer_secret":"vVx4OWuxvvXSVQ0BUYOkSu8DCNvkvvUaVl0vuDWvU4",
"crawler":"lucid.twitter.stream",
"name":"My Twitter Stream",
"token_secret":"9hE4rcCWVwCDL6C1CCzWwcWtr1c14iw3M4xiD3CDV9rS",
"type":"twitter_stream",
"sleep":"10000",
"max_docs":"100",
"output_type":"com.lucid.sda.hbase.lws.HBaseUpdateController",
"output_args":"localhost:2181"
}' http://localhost:8341/sda/v1/client/collections/documentation/datasources
Output
{
"children": [],
"collection": "documentation",
"createTime": 1337801651887,
"id": "documentation",
"properties": {
"access_token": "G9IdEbD0bK7F8DFuo20srKKboGud9ecrx8i0MdG-RxFUFfbKx0",
"caching": false,
"category": "Other",
"collection": "documentation",
"commit_on_finish": true,
"commit_within": 900000,
"consumer_key": "dQF16QSFRnRFYwyJunRjdj1",
"consumer_secret": "vVx4OWuxvvXSVQ0BUYOkSu8DCNvkvvUaVl0vuDWvU4",
"crawler": "lucid.twitter.stream",
"id": 49,
"indexing": true,
"mapping": {
"datasource_field": "data_source",
"default_field": null,
"dynamic_field": "attr",
"literals": {},
"lucidworks_fields": true,
"mappings": {},
"multi_val": {},
"original_content": false,
"types": {},
"unique_key": "id",
"verify_schema": true
},
"max_docs": 100,
"name": "My Twitter Stream",
"output_type":"com.lucid.sda.hbase.lws.HBaseUpdateController",
"output_args":"localhost:2181",
"parsing": true,
"sleep": 10000,
"token_secret": "9hE4rcCWVwCDL6C1CCzWwcWtr1c14iw3M4xiD3CDV9rS",
"type": "twitter_stream",
"url": "https://stream.twitter.com"
},
"status": "CREATED",
"throwable": null
}
Get Data Source Details
This API provides the settings information for a specific data source.
GET /sda/v1/client/collections/collection/datasources/id
Note that the only way to find the id of a data source is to either store it on creation, or use the API call referenced above to get a list of data sources.
Input
Path Parameters
| Key | Description |
|---|---|
| collection | The collection name |
| id | The data source ID |
Query Parameters
None.
Input content
None.
Output
Output Content
| Key | Type | Description |
|---|---|---|
| collection | string | The collection the data source belongs to. |
| id | integer | The ID of the data source. |
| properties | JSON map | This includes all the attributes for the data source. |
| status | string | The state of the data source. In most cases, this will be simply EXISTS. |
Examples
Get all of the parameters for data source 49, created in the previous step.
Input
curl -u administrator:foo -X GET http://localhost:8341/sda/v1/client/collections/documentation/datasources/49
Output
The mapping attributes have been omitted from this example but will be returned in a successful response.
{
"children": [],
"collection": "documentation",
"createTime": 1337873186893,
"id": "documentation",
"properties": {
"access_token": "G9IdEbD0bK7F8DFuo20srKKboGud9ecrx8i0MdG-RxFUFfbKx0",
"caching": false,
"category": "Other",
"collection": "documentation",
"commit_on_finish": true,
"commit_within": 900000,
"consumer_key": "dQF16QSFRnRFYwyJunRjdj1",
"consumer_secret": "vVx4OWuxvvXSVQ0BUYOkSu8DCNvkvvUaVl0vuDWvU4",
"crawler": "lucid.twitter.stream",
"id": 49,
"indexing": true,
"mapping": {
...
},
"max_docs": 100,
"name": "My Twitter Stream",
"output_args": "localhost:2181",
"output_type": "com.lucid.sda.hbase.lws.HBaseUpdateController",
"parsing": true,
"sleep": 10000,
"token_secret": "9hE4rcCWVwCDL6C1CCzWwcWtr1c14iw3M4xiD3CDV9rS",
"type": "twitter_stream",
"url": "https://stream.twitter.com"
},
"status": "EXISTS",
"throwable": null
}
Update Data Source Details
This API allows updating the settings information for a specific data source.
PUT /sda/v1/client/collections/collection/datasources/id
Note that the only way to find the id of a data source is to either store it on creation, or use the API call referenced above to get a list of data sources.
Input
Path Parameters
| Key | Description |
|---|---|
| collection | The collection name |
| id | The data source ID |
Query Parameters
None.
Input content
JSON block with either all attributes or just those that need updating. Data source type, crawler type, and ID cannot be updated. Possible attributes are listed in the section above on getting a list of data sources.
Output
Output Content
The output is essentially a status report. If successful, the output will contain "status":"SUCCEEDED". It may also report FAILED if the submitted content was invalid for some reason. Only the status is essential at this point; the other parts of the response can be ignored. A children element is also returned. In the case of adding documents, a SUCCEEDED status there indicates that the documents were successfully added to the Solr/LucidWorks index.
Examples
Input
Update the "max_docs" attribute for data source 49, created in the previous step.
curl -u administrator:foo --verbose -X PUT -H 'Content-type: application/json' -d '{"max_docs":"150"}' \
http://localhost:8341/sda/v1/client/collections/documentation/datasources/49
Output
{
"children": [],
"collection": "documentation",
"createTime": 1337972829851,
"id": "documentation",
"status": "SUCCEEDED",
"throwable": null
}
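Since only the status matters in this response, a script can extract it and stop when the update did not succeed. The following is a sketch under the assumption that the response is shaped like the example above. response_status is a hypothetical helper that relies on text splitting rather than a JSON parser.

```shell
response_status() {
    # Reads a JSON response on stdin and prints each "status" value found in it.
    tr ',{}' '\n\n\n' \
        | sed -n 's/^[[:space:]]*"status"[[:space:]]*:[[:space:]]*"\([A-Z]*\)"[[:space:]]*$/\1/p'
}

# Example with the response shown above.
printf '%s' '{"children": [], "collection": "documentation", "status": "SUCCEEDED", "throwable": null}' \
    | response_status
```

In a script, the check might read `[ "$(curl -s ... | response_status)" = "SUCCEEDED" ] || exit 1`, with the curl arguments from the example above in place of the ellipsis.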
Start a Data Source
POST /sda/v1/client/collections/collection/datasources/id
Input
Path Parameters
| Key | Description |
|---|---|
| collection | The collection name |
| id | The data source ID |
Query Parameters
None
Input content
JSON block with either all attributes or just those that need updating. Data source type, crawler type, and ID cannot be updated. Other attributes are listed in the section on getting a list of data sources.
Output
Output Content
The output is essentially a status report. If successful, the output will contain a line "status":"RUNNING". The other parts of the response are not essential at this point, only the status.
To check the ongoing status, you can use the LucidWorks Search Data Source Status API.
Examples
Input
Start data source 49.
curl -u administrator:foo -X POST http://localhost:8341/sda/v1/client/collections/documentation/datasources/49
Output
{
"children": [],
"collection": "documentation",
"createTime": 1337809434187,
"id": "49",
"status": "RUNNING",
"throwable": null
}
Delete a Data Source
DELETE /sda/v1/client/collections/collection/datasources/id
Input
Path Parameters
| Key | Description |
|---|---|
| collection | The collection name |
| id | The data source ID |
Query Parameters
None.
Input content
None.
Output
Output Content
None.
Examples
Input
Delete data source 48.
curl -u administrator:foo -X DELETE http://localhost:8341/sda/v1/client/collections/documentation/datasources/48
Output
None. Check the listing of data sources to confirm deletion.
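Because the delete call returns nothing, confirmation means looking for the ID in a fresh listing. A minimal sketch, assuming the listing has the shape shown under Get a List of Data Sources; datasource_listed is a hypothetical helper that matches text rather than parsing JSON.

```shell
datasource_listed() {
    # $1 = numeric data source ID; reads the JSON list of data sources on stdin.
    # Returns 0 if an unquoted "id" equal to $1 appears in the list, 1 otherwise.
    tr ',{}' '\n\n\n' \
        | grep -Eq "^[[:space:]]*\"id\"[[:space:]]*:[[:space:]]*$1[[:space:]]*\$"
}
```

After deleting data source 48, piping the listing request into `datasource_listed 48` should return a non-zero exit status.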