This is the documentation for LucidWorks Big Data v1.1.

LucidWorks Search Data sources are one of the ways in which content can be loaded into the Big Data system. Documents indexed in this way do not go through the extract-transform-load workflow, as described in Workflows, which means the incoming documents cannot be analyzed for clustering or statistically interesting phrases. Documents can also be sent directly to the system, using the Big Data Document Indexing API. More background information on this topic is available in the section on Loading Data.

Because LucidWorks Big Data includes LucidWorks Search "under the hood", all of the data source types available with LucidWorks Search are available to the Big Data system. Below we give a couple of examples of using the data sources in the Big Data context, but full details on each type are available in the LucidWorks Search documentation.

Background Information

Data sources describe the target repository of documents and the access method. This description is then used to create a crawl job that is executed by a specific crawler implementation (called a crawler controller).

A data source is defined by selecting a crawler controller, then specifying a valid type for that crawler. Some crawlers work with several types; others support only one. It is important to match the correct crawler with the type of data source to be indexed.

Each crawler and specified type has different supported attributes. A few attributes are common across all crawlers and types, and some types share attributes with other types using the same crawler. Review the supported attributes carefully when creating data sources with this API.

At present, LucidWorks Search includes the following built-in crawler controllers that support the following kinds of data sources:

Crawler Controller Symbolic Name Data Source Types Supported
Aperture-based crawlers lucid.aperture
  • Local file system
  • Web
DataImportHandler-based JDBC crawler lucid.jdbc
  • JDBC database
SolrXML crawler lucid.solrxml
  • Solr XML files
Google Connector Manager-based crawler lucid.gcm
  • Microsoft SharePoint (Microsoft Office SharePoint Server 2007, Microsoft Windows SharePoint Services 3.0, SharePoint 2010)
Remote file system and pseudo-file system crawler lucid.fs
  • SMB / CIFS (Windows sharing) filesystem
  • Hadoop Distributed File System (HDFS)
  • Amazon S3 file system (also known as "S3 native")
  • HDFS over Amazon S3
  • FTP
External data lucid.external
  • Externally generated data pushed to LucidWorks via Solr
Twitter stream lucid.twitter.stream
  • Twitter Stream using Twitter's stream API
High-Volume HDFS lucid.map.reduce.hdfs
  • High Volume crawling of a Hadoop File System
We'll only cover the Twitter and High-Volume HDFS connectors in this guide. Please see the LucidWorks Search documentation for full details of each type of data source supported.

API Entry Points

/sda/v1/client/collections/collection/datasources: list or create data sources in a particular collection

/sda/v1/client/collections/collection/datasources/id: start, remove, or get details for a particular data source
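
As a rough illustration of how the two entry points are formed, the sketch below builds both URLs from a collection name and an optional data source ID. The helper name datasources_url is hypothetical, and the host and port are assumptions taken from the curl examples in this guide.

```python
from urllib.parse import quote

# Assumption: host and port are copied from the curl examples in this guide.
BASE_URL = "http://localhost:8341"


def datasources_url(collection, datasource_id=None, base_url=BASE_URL):
    """Return the endpoint for a collection's data sources, or for one data source."""
    # Percent-encode the collection name so it is safe as a single path segment.
    url = "{}/sda/v1/client/collections/{}/datasources".format(
        base_url.rstrip("/"), quote(collection, safe="")
    )
    if datasource_id is not None:
        url += "/{}".format(datasource_id)
    return url


print(datasources_url("documentation"))
print(datasources_url("documentation", 49))
```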

Get a List of Data Sources

GET /sda/v1/client/collections/collection/datasources

Input

Path Parameters

Key Description
collection The collection name.

Query Parameters

None.

Output

Output Content

A JSON list with one entry per data source. Within each entry, the properties map holds the attributes of the data source and their values. The exact set of attributes for a particular data source depends on the type. There is, however, a set of attributes common to all data source types. Type-specific attributes are discussed in the sections for those types below.

Common Attributes

These attributes are used for all data source types (except where specifically noted).

General Attributes

Key Type Description
id 32-bit integer The numeric ID for this data source.
type string The type of this data source. Valid types are:
  • file for a filesystem (remote or local, but must be paired with the correct crawler, as below)
  • web for HTTP or HTTPS web sites
  • jdbc for a JDBC database
  • solrxml for files in Solr XML format
  • sharepoint for a SharePoint repository
  • smb for a Windows file share (CIFS)
  • hdfs for a Hadoop filesystem
  • s3 for a native S3 filesystem
  • s3h for a Hadoop-over-S3 filesystem
  • external for an externally-managed data source
  • twitter_stream for a Twitter stream
  • high_volume_hdfs for high-volume crawling of a Hadoop filesystem
crawler string Crawler implementation that handles this type of data source. The crawler must be able to support the specified type. Valid crawlers are:
  • lucid.aperture for web and file types
  • lucid.fs for file, smb, hdfs, s3h, s3, and ftp types
  • lucid.gcm for sharepoint type
  • lucid.jdbc for jdbc type
  • lucid.solrxml for solrxml type
  • lucid.external for external type
  • lucid.twitter.stream for twitter_stream type
  • lucid.map.reduce.hdfs for high_volume_hdfs type
collection string The name of the document collection that documents will be indexed into.
name string A human-readable name for this data source. Names may consist of any combination of letters, digits, spaces and other characters. Names are case-insensitive, and do not need to be unique: several data sources can share the same name.
category string The category of this data source: Web, FileSystem, Jdbc, SolrXml, Sharepoint, External, or Other. For informational purposes only.
mapping JSON map Attributes that define how incoming fields in the content will be handled. They can be mapped to other fields in the content, fields can be explicitly inserted into the content, or defaults for missing content can be supplied. If mapping is something you'd like to work with, please see the LucidWorks Search documentation section on Data Sources.
output_type string The fully qualified class name that defines the output format of the crawl. For LucidWorks Big Data, this must always be com.lucid.sda.hbase.lws.HBaseUpdateController. The default is intended for direct indexing by Solr, but LucidWorks Big Data first stores all documents in HBase and then synchronizes with Solr. If this is not specified, and the default is used, documents will not be available for Analysis.
output_args string A Zookeeper connect string that the HBase library can understand, such as localhost:2181.
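
Because a mismatched crawler and type is an easy mistake to make, a client-side check before sending a definition may be useful. The following is a minimal sketch, not part of the product: the pairings are copied from the crawler list above, and the function name check_datasource is hypothetical.

```python
# Crawler-to-type pairings copied from the "Valid crawlers" list above.
CRAWLER_TYPES = {
    "lucid.aperture": {"web", "file"},
    "lucid.fs": {"file", "smb", "hdfs", "s3h", "s3", "ftp"},
    "lucid.gcm": {"sharepoint"},
    "lucid.jdbc": {"jdbc"},
    "lucid.solrxml": {"solrxml"},
    "lucid.external": {"external"},
    "lucid.twitter.stream": {"twitter_stream"},
    "lucid.map.reduce.hdfs": {"high_volume_hdfs"},
}
HBASE_OUTPUT_TYPE = "com.lucid.sda.hbase.lws.HBaseUpdateController"


def check_datasource(attrs):
    """Return a list of problems in a data source definition (empty if none found)."""
    problems = []
    crawler = attrs.get("crawler")
    ds_type = attrs.get("type")
    if crawler not in CRAWLER_TYPES:
        problems.append("unknown crawler: {!r}".format(crawler))
    elif ds_type not in CRAWLER_TYPES[crawler]:
        problems.append("crawler {} does not support type {!r}".format(crawler, ds_type))
    # Big Data needs the HBase update controller, not the Solr default.
    if attrs.get("output_type") != HBASE_OUTPUT_TYPE:
        problems.append("output_type should be " + HBASE_OUTPUT_TYPE)
    return problems


print(check_datasource({"crawler": "lucid.aperture", "type": "hdfs"}))
```

The check only mirrors the tables in this section; the server remains the authority on what it accepts.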

Field Mapping
The output also includes the field mapping for the data source, which is modifiable as part of the regular data source update API.

Each of the attributes will be shown under the main attribute mapping, which contains a JSON map with several keys and values. For more information, please see the LucidWorks Search documentation section on Data Sources.

Optional Commit Rules
The following attributes are optional and relate to when new documents will be added to the index:

Key Type Description
commit_within integer Number of milliseconds that defines the maximum interval between commits while indexing documents. The default is 900,000 milliseconds (15 minutes).
commit_on_finish boolean When true (the default), a commit is invoked at the end of the crawl.

Batch Processing
Batch processing allows crawling a repository of content, but not indexing the content until a later time (perhaps after some additional processing). A few attributes control batch processing and are also optional. They are not covered in this section, but are covered in the section of LucidWorks Search documentation on Processing Documents in Batches.

Twitter Stream Attributes

The Twitter Stream data source type uses Twitter's streaming API to index tweets on a continuous basis.

This data source uses the lucid.twitter.stream crawler. Unlike other crawlers, which generally have some kind of defined end point (even if that end point is after hundreds of thousands or millions of documents), this crawler opens a stream and will not stop until Twitter stops. The Data Source Jobs API in LucidWorks Search will allow you to stop the stream if necessary.

This data source is in early stages of development, and does not yet process deletes. Deletes will be shown in Data Source History statistics, but these are deleted tweets marked as such from the streaming API - the actual tweets may or may not be in the index (and if they were, the data source does not yet process them).

In order to successfully configure a Twitter stream, you must first register your application with Twitter and accept their terms of service. The registration process will provide you with the required OAuth tokens you need to access the stream. To get the tokens, follow these steps:

  1. Make sure you have a Twitter account, and go to http://dev.twitter.com/ and sign in.
  2. Choose the "Create an App" link and fill out the required details. The callback field can be skipped. Click "Create Application" to register your application.
  3. The next page will contain the Consumer Key and Consumer Secret, which you will need to configure the data source in LucidWorks.
  4. At the bottom of the same page, choose "Create My Access Token".
  5. The next page will contain the Access Token and Token Secret, which you will also need to configure the data source in LucidWorks.

While you need a Twitter account to register an application, you do not use your Twitter credentials to configure this data source. Take the Consumer Key, Consumer Secret, Access Token, and Token Secret information and store it where you can access it while configuring the data source.

When creating a data source of type twitter_stream, the value lucid.twitter.stream must be supplied for the crawler attribute, described in the section on common attributes above. The common attributes are available for configuration in addition to those listed below.

Key Type Required Default Description
access_token string Yes Null The access token is provided after registering with Twitter and requesting an access token (see above).
consumer_key string Yes Null The consumer key is provided after registering with Twitter (see above).
consumer_secret string Yes Null The consumer secret is provided after registering with Twitter (see above). It should be treated as a password for your registered application.
filter_follow list No Null A set of specific Twitter user IDs to filter the stream. If combined with another filter, they act as OR statements on the stream (i.e., the tweet must match the user ID or the keyword or the location). Note that this is not the user handle or screen name, but a numeric ID assigned by Twitter. To find the ID, you could do an API request like https://api.twitter.com/1/users/show.xml?screen_name=usaa replacing "usaa" with the user handle as necessary. The ID is found in the "id" field of the XML output.
filter_locations list No Null A set of bounding boxes (latitude/longitude, up, down, right, left, etc.) to filter the stream for geographic location. If combined with another filter, they act as OR statements on the stream (i.e., the tweet must match the user ID or the keyword or the location).
filter_track list No Null A set of keywords to filter the stream. If combined with another filter, they act as OR statements on the stream (i.e., the tweet must match the user ID or the keyword or the location).
max_docs long No -1 While testing the feed, it may be desirable to limit initial streams to a specific number of tweets. The default is "-1", which keeps the connection open until it is manually closed.
sleep integer Yes 10000 Twitter will occasionally throttle streaming, in which case you can configure the data source to wait the requisite amount of time before trying again. The default is 10,000 milliseconds, which should be sufficient for most scenarios.
token_secret string Yes Null The token secret is provided after registering with Twitter and requesting an access token (see above). It should be treated as a password for your API access.
Example twitter_stream data source
{
     "access_token": "G9IdEbD0bK7F8DFuo20srKKboGud9ecrx8i0MdG-RxFUFfbKx0",
     "caching": false,
     "category": "Other",
     "collection": "collection1",
     "commit_on_finish": true,
     "commit_within": 900000,
     "consumer_key": "dQF16QSFRnRFYwyJunRjdj1",
     "consumer_secret": "vVx4OWuxvvXSVQ0BUYOkSu8DCNvkvvUaVl0vuDWvU4",
     "crawler": "lucid.twitter.stream",
     "id": 4,
     "indexing": true,
     "mapping": {
        ...
     },
     "max_docs": -1,
     "name": "Twitter Stream",
     "output_type":"com.lucid.sda.hbase.lws.HBaseUpdateController",
     "output_args":"localhost:2181",
     "parsing": true,
     "sleep": 10000,
     "token_secret": "9hE4rcCWVwCDL6C1CCzWwcWtr1c14iw3M4xiD3CDV9rS",
     "type": "twitter_stream"
}
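
A definition like the one above could be assembled programmatically. The sketch below is illustrative only: the helper name twitter_stream_payload is hypothetical, the credential values are placeholders, and representing filter_track as a JSON list of strings is an assumption based on the "list" type in the table.

```python
import json

# Keys marked "Required: Yes" with no default in the attribute table above.
REQUIRED_KEYS = ("access_token", "consumer_key", "consumer_secret", "token_secret")


def twitter_stream_payload(name, credentials, zookeeper="localhost:2181",
                           max_docs=-1, sleep=10000, filter_track=None):
    """Build the attribute map for a twitter_stream data source."""
    missing = [key for key in REQUIRED_KEYS if not credentials.get(key)]
    if missing:
        raise ValueError("missing Twitter credentials: " + ", ".join(missing))
    payload = {
        "name": name,
        "type": "twitter_stream",
        "crawler": "lucid.twitter.stream",
        "output_type": "com.lucid.sda.hbase.lws.HBaseUpdateController",
        "output_args": zookeeper,
        "max_docs": max_docs,
        "sleep": sleep,
    }
    payload.update({key: credentials[key] for key in REQUIRED_KEYS})
    if filter_track:
        # Assumption: keywords are sent as a JSON list of strings.
        payload["filter_track"] = list(filter_track)
    return payload


# Placeholder credentials; substitute the values issued by Twitter.
creds = {
    "access_token": "YOUR-ACCESS-TOKEN",
    "consumer_key": "YOUR-CONSUMER-KEY",
    "consumer_secret": "YOUR-CONSUMER-SECRET",
    "token_secret": "YOUR-TOKEN-SECRET",
}
print(json.dumps(twitter_stream_payload("Twitter Stream", creds, max_docs=100), indent=2, sort_keys=True))
```

Setting max_docs to a small positive number, as in the usage line, follows the table's suggestion for limiting the stream while testing.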

High-Volume HDFS Attributes

The High Volume HDFS (HV-HDFS) data source uses a MapReduce-enabled crawler designed to leverage the scaling qualities of Apache Hadoop while indexing content.

To achieve this, HV-HDFS runs a series of MapReduce-enabled jobs that convert raw content into documents that can be indexed. These jobs rely on the Behemoth project (we specifically leverage the LWE fork of this project) for MapReduce-ready document conversion via Apache Tika and for writing documents to LucidWorks.

The HV-HDFS data source is currently marked as "Early Access" and is thus subject to changes in how it works in future releases.

Before using the HV-HDFS Data Source type, please review the section on Using the High Volume HDFS Crawler in LucidWorks Search documentation.

When creating a data source of type high_volume_hdfs, the value lucid.map.reduce.hdfs must be supplied for the crawler attribute, described in the section on common attributes above. The common attributes are available for configuration in addition to those listed below.

Key Type Required Default Description
MIME_type string No Null Allows limiting the crawl to content of a specific MIME type. If a value is entered, Behemoth will skip Tika's MIME detection and process the content with the appropriate parser. The MIME type should be entered in full, such as "application/pdf" for PDF documents.
hadoop_conf string Yes Null The location of the Hadoop configuration directory that contains the Hadoop core-site.xml, mapred-site.xml and other Hadoop configuration files. This path must reside on the same machine as the LucidWorks server. Hadoop does not need to be running on the same server as LucidWorks, but the configuration directory must be available from the LucidWorks server.
path string Yes Null The input path where your data resides. It is not required that this be in HDFS to begin with, since the first step of the process converts content to one or more SequenceFiles in HDFS. For example, hdfs://bob:54310/path/to/content and file:///path/to/local/content would both be valid inputs.
recurse boolean No True If true, the default, the crawler will crawl all subdirectories of the input path. Set to false if the crawler should stay within the top directory specified with the path attribute.
tika_content_handler string No com.digitalpebble.behemoth.tika.TikaProcessor In most cases, the default, com.digitalpebble.behemoth.tika.TikaProcessor, does not need to be changed. If you need to change it, enter the fully qualified class name of a Behemoth Tika Processor that is capable of extracting content from documents. The class must be available on the classpath in the Job jar used by LucidWorks.
work_path string Yes /tmp A path to use for intermediate storage. Note that the connector does not clean up temporary content, so that it remains available for debugging if necessary. Once the job is complete, content stored temporarily in this location can be safely deleted. For example, hdfs://bob:54310/tmp/hv_hdfs would be a valid path.
zookeeper_host string Yes Null The host and port where ZooKeeper is running and coordinates SolrCloud activity, entered as hostname:port.
Example high_volume_hdfs data source
{
     "MIME_type": "",
     "commit_on_finish": true,
     "mapping": {
        ...
     },
     "collection": "collection1",
     "work_path": "hdfs://example:54310/tmp",
     "type": "high_volume_hdfs",
     "recurse": true,
     "crawler": "lucid.map.reduce.hdfs",
     "id": 3,
     "category": "External",
     "tika_content_handler": "com.digitalpebble.behemoth.tika.TikaProcessor",
     "zookeeper_host": "allie:9888",
     "name": "Example",
     "path": "/Users/projects/content/citeseer/1",
     "commit_within": 900000,
     "hadoop_conf": "/Users/projects/hadoop/hadoop-0.20.2/conf",
     "output_type":"com.lucid.sda.hbase.lws.HBaseUpdateController",
     "output_args":"localhost:2181"
}
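
The required attributes for this type can be checked before a definition is submitted. The sketch below is a hedged example: high_volume_hdfs_payload is a hypothetical helper, the list of required keys is read from the table above, and the hostname:port check is a simple client-side convenience rather than the server's own validation.

```python
# Keys marked "Required: Yes" in the High-Volume HDFS attribute table above.
REQUIRED = ("hadoop_conf", "path", "work_path", "zookeeper_host")


def high_volume_hdfs_payload(name, **attrs):
    """Build the attribute map for a high_volume_hdfs data source."""
    missing = [key for key in REQUIRED if not attrs.get(key)]
    if missing:
        raise ValueError("missing required attributes: " + ", ".join(missing))
    # zookeeper_host is documented as hostname:port.
    host, sep, port = attrs["zookeeper_host"].rpartition(":")
    if not (host and sep and port.isdigit()):
        raise ValueError("zookeeper_host must look like hostname:port")
    payload = {
        "name": name,
        "type": "high_volume_hdfs",
        "crawler": "lucid.map.reduce.hdfs",
        "output_type": "com.lucid.sda.hbase.lws.HBaseUpdateController",
    }
    payload.update(attrs)  # caller-supplied attributes, e.g. recurse or output_args
    return payload


example = high_volume_hdfs_payload(
    "Example",
    hadoop_conf="/Users/projects/hadoop/hadoop-0.20.2/conf",
    path="hdfs://bob:54310/path/to/content",
    work_path="hdfs://bob:54310/tmp/hv_hdfs",
    zookeeper_host="allie:9888",
    output_args="localhost:2181",
)
print(sorted(example))
```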

Examples

Input
Get all data sources for the "documentation" collection.

curl -u administrator:foo -X GET http://localhost:8341/sda/v1/client/collections/documentation/datasources

Output
The output below omits the mapping sub-attributes, which define how incoming content is handled. This example data source was created automatically by requesting documents to be added to the index with the Document Indexing API.

[
    {
        "children": [],
        "collection": "documentation",
        "createTime": 1337800027787,
        "id": "14",
        "properties": {
            "callback": null,
            "category": "External",
            "collection": "documentation",
            "commit_on_finish": true,
            "commit_within": 900000,
            "crawler": "lucid.external",
            "id": 14,
            "mapping": {
                ...
            },
            "name": "documentation_SDA_DS",
            "output_args": "localhost:2181", 
            "output_type": "com.lucid.sda.hbase.lws.HBaseUpdateController", 
            "source": "SDA",
            "source_type": "sda_document_service",
            "type": "external",
            "url": "external:SDA"
        },
        "status": "EXISTS",
        "throwable": null
    }
]
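
Since the ID of a data source is needed for the other calls, it can help to pull the IDs out of this listing. The sketch below parses a response shaped like the example above; the function name summarize_datasources is hypothetical, and the sample is an abbreviated copy of that example with the mapping omitted.

```python
import json

# Abbreviated copy of the example response above (mapping and other keys omitted).
SAMPLE = """[
  {"children": [], "collection": "documentation", "id": "14",
   "properties": {"crawler": "lucid.external", "id": 14,
                  "name": "documentation_SDA_DS", "type": "external"},
   "status": "EXISTS", "throwable": null}
]"""


def summarize_datasources(response_text):
    """Return (id, name, type, status) tuples from a data source listing."""
    rows = []
    for entry in json.loads(response_text):
        props = entry.get("properties", {})
        rows.append((props.get("id"), props.get("name"), props.get("type"), entry.get("status")))
    return rows


for row in summarize_datasources(SAMPLE):
    print(row)
```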

Create a Data Source

POST /sda/v1/client/collections/collection/datasources

Input

Path Parameters

Key Description
collection The collection name

Query Parameters

None

Input content

JSON block with all attributes. The ID field, if present, will be ignored. See attributes in section on getting a list of data sources.

Output

Output Content

JSON representation of new data source. Attributes returned are listed in the section on getting a list of data sources.

Examples

Input

Create a new data source to consume Twitter in the "documentation" collection. Note that the access_token, consumer_key, consumer_secret and token_secret are only examples and should be replaced with your own values.

curl -u administrator:foo -X POST -H 'Content-type: application/json' \
-d '{
   "access_token":"G9IdEbD0bK7F8DFuo20srKKboGud9ecrx8i0MdG-RxFUFfbKx0",
   "consumer_key":"dQF16QSFRnRFYwyJunRjdj1",
   "consumer_secret":"vVx4OWuxvvXSVQ0BUYOkSu8DCNvkvvUaVl0vuDWvU4",
   "crawler":"lucid.twitter.stream",
   "name":"My Twitter Stream",
   "token_secret":"9hE4rcCWVwCDL6C1CCzWwcWtr1c14iw3M4xiD3CDV9rS",
   "type":"twitter_stream",
   "sleep":"10000",
   "max_docs":"100",
   "output_type":"com.lucid.sda.hbase.lws.HBaseUpdateController",
   "output_args":"localhost:2181"
}' http://localhost:8341/sda/v1/client/collections/documentation/datasources

Output

{
    "children": [],
    "collection": "documentation",
    "createTime": 1337801651887,
    "id": "documentation",
    "properties": {
        "access_token": "G9IdEbD0bK7F8DFuo20srKKboGud9ecrx8i0MdG-RxFUFfbKx0",
        "caching": false,
        "category": "Other",
        "collection": "documentation",
        "commit_on_finish": true,
        "commit_within": 900000,
        "consumer_key": "dQF16QSFRnRFYwyJunRjdj1",
        "consumer_secret": "vVx4OWuxvvXSVQ0BUYOkSu8DCNvkvvUaVl0vuDWvU4",
        "crawler": "lucid.twitter.stream",
        "id": 49,
        "indexing": true,
        "mapping": {
            "datasource_field": "data_source",
            "default_field": null,
            "dynamic_field": "attr",
            "literals": {},
            "lucidworks_fields": true,
            "mappings": {},
            "multi_val": {},
            "original_content": false,
            "types": {},
            "unique_key": "id",
            "verify_schema": true
        },
        "max_docs": 100,
        "name": "My Twitter Stream",
        "output_type":"com.lucid.sda.hbase.lws.HBaseUpdateController",
        "output_args":"localhost:2181",
        "parsing": true,
        "sleep": 10000,
        "token_secret": "9hE4rcCWVwCDL6C1CCzWwcWtr1c14iw3M4xiD3CDV9rS",
        "type": "twitter_stream",
        "url": "https://stream.twitter.com"
    },
    "status": "CREATED",
    "throwable": null
}
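
The same create call can be prepared from a script. The sketch below only builds the request object and does not send it. It assumes HTTP Basic authentication, inferred from the -u option in the curl example, and the helper name create_request and the shortened payload are illustrative.

```python
import base64
import json
import urllib.request


def create_request(url, payload, user, password):
    """Build (but do not send) a POST request that creates a data source."""
    # Assumption: the server accepts HTTP Basic credentials, as curl -u sends.
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Basic " + token},
        method="POST",
    )


request = create_request(
    "http://localhost:8341/sda/v1/client/collections/documentation/datasources",
    {"name": "My Twitter Stream", "type": "twitter_stream", "crawler": "lucid.twitter.stream"},
    "administrator",
    "foo",
)
print(request.get_method(), request.full_url)
# To send it against a running server: urllib.request.urlopen(request)
```

The id found under properties in the response is the value to keep for later calls.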

Get Data Source Details

This API provides the settings information for a specific data source.

GET /sda/v1/client/collections/collection/datasources/id

Note that the only way to find the id of a data source is to either store it on creation, or use the API call referenced above to get a list of data sources.

Input

Path Parameters

Key Description
collection The collection name
id The data source ID

Query Parameters

None.

Input content

None.

Output

Output Content

Key Type Description
collection string The collection the data source belongs to.
id integer The ID of the data source.
properties JSON map This includes all the attributes for the data source.
status string The state of the data source. In most cases, this will be simply EXISTS.

Examples

Get all of the parameters for data source 49, created in the previous step.

Input

curl -u administrator:foo -X GET http://localhost:8341/sda/v1/client/collections/documentation/datasources/49

Output
The mapping attributes have been omitted from this example but will be returned in a successful response.

{
    "children": [], 
    "collection": "documentation", 
    "createTime": 1337873186893, 
    "id": "documentation", 
    "properties": {
        "access_token": "G9IdEbD0bK7F8DFuo20srKKboGud9ecrx8i0MdG-RxFUFfbKx0", 
        "caching": false, 
        "category": "Other", 
        "collection": "documentation", 
        "commit_on_finish": true, 
        "commit_within": 900000, 
        "consumer_key": "dQF16QSFRnRFYwyJunRjdj1", 
        "consumer_secret": "vVx4OWuxvvXSVQ0BUYOkSu8DCNvkvvUaVl0vuDWvU4", 
        "crawler": "lucid.twitter.stream", 
        "id": 49, 
        "indexing": true, 
        "mapping": {
          ...
        }, 
        "max_docs": 100, 
        "name": "My Twitter Stream", 
        "output_args": "localhost:2181", 
        "output_type": "com.lucid.sda.hbase.lws.HBaseUpdateController", 
        "parsing": true, 
        "sleep": 10000, 
        "token_secret": "9hE4rcCWVwCDL6C1CCzWwcWtr1c14iw3M4xiD3CDV9rS", 
        "type": "twitter_stream", 
        "url": "https://stream.twitter.com"
    }, 
    "status": "EXISTS", 
    "throwable": null
}
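
Note that the details response includes the Twitter credentials in clear text, and this guide recommends treating consumer_secret and token_secret as passwords. A minimal sketch for masking them before logging a response follows; the function name redact is hypothetical.

```python
import copy

# Attributes this guide says should be treated as passwords.
SECRET_KEYS = ("consumer_secret", "token_secret")


def redact(details):
    """Return a copy of a data source response with secret properties masked."""
    cleaned = copy.deepcopy(details)  # leave the caller's object untouched
    props = cleaned.get("properties", {})
    for key in SECRET_KEYS:
        if key in props:
            props[key] = "***"
    return cleaned


response = {
    "collection": "documentation",
    "properties": {"id": 49, "consumer_secret": "placeholder-secret", "token_secret": "placeholder-token"},
    "status": "EXISTS",
}
print(redact(response))
```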

Update Data Source Details

This API allows updating the settings information for a specific data source.

PUT /sda/v1/client/collections/collection/datasources/id

Note that the only way to find the id of a data source is to either store it on creation, or use the API call referenced above to get a list of data sources.

Input

Path Parameters

Key Description
collection The collection name
id The data source ID

Query Parameters

None.

Input content

JSON block with either all attributes or just those that need updating. Data source type, crawler type, and ID cannot be updated. Possible attributes are listed in the section above on getting a list of data sources.

Output

Output Content

The output is essentially a status report. If successful, the output will contain a line "status":"SUCCEEDED". It may instead report FAILED if the document was invalid for some reason. Only the status is essential at this point; the other parts of the response can be ignored. Children will also be returned; in the case of adding documents, a child status of SUCCEEDED indicates that the documents were successfully added to the Solr/LucidWorks index.

Examples

Input
Update the "max_docs" attribute for data source 49, created in the previous step.

curl -u administrator:foo --verbose -X PUT -H 'Content-type: application/json' -d '{"max_docs":"150"}' \
http://localhost:8341/sda/v1/client/collections/documentation/datasources/49

Output

{
    "children": [], 
    "collection": "documentation", 
    "createTime": 1337972829851, 
    "id": "documentation", 
    "status": "SUCCEEDED", 
    "throwable": null
}
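
A script that issues updates would typically branch on the status value. The sketch below interprets a response body like the one above; the function name update_succeeded is hypothetical, and treating any status other than SUCCEEDED or FAILED as an error is an assumption.

```python
import json


def update_succeeded(response_text):
    """Return True for SUCCEEDED, False for FAILED, and raise for anything else."""
    status = json.loads(response_text).get("status")
    if status == "SUCCEEDED":
        return True
    if status == "FAILED":
        return False
    # Assumption: other status values are unexpected for an update call.
    raise ValueError("unexpected status: {!r}".format(status))


body = '{"children": [], "collection": "documentation", "status": "SUCCEEDED", "throwable": null}'
print(update_succeeded(body))
```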

Start a Data Source

POST /sda/v1/client/collections/collection/datasources/id

Input

Path Parameters

Key Description
collection The collection name
id The data source ID

Query Parameters

None

Input content

JSON block with either all attributes or just those that need updating. Data source type, crawler type, and ID cannot be updated. Other attributes are listed in the section on getting a list of data sources.

Output

Output Content

The output is essentially a status report. If successful, the output will contain a line "status":"RUNNING". The other parts of the response are not essential at this point, only the status.

To check the ongoing status, you can use the LucidWorks Search Data Source Status API.

Examples

Input
Start data source 49.

curl -u administrator:foo -X POST http://localhost:8341/sda/v1/client/collections/documentation/datasources/49

Output

{
    "children": [],
    "collection": "documentation",
    "createTime": 1337809434187,
    "id": "49",
    "status": "RUNNING",
    "throwable": null
}

Delete a Data Source

DELETE /sda/v1/client/collections/collection/datasources/id

Input

Path Parameters

Key Description
collection The collection name
id The data source ID

Query Parameters

None.

Input content

None.

Output

Output Content

None.

Examples

Input
Delete data source 48.

curl -u administrator:foo -X DELETE http://localhost:8341/sda/v1/client/collections/documentation/datasources/48

Output

None. Check the listing of data sources to confirm deletion.
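
Because the delete call returns nothing, confirmation has to come from the listing. The sketch below checks whether an ID is absent from a listing response; the function name is_deleted is hypothetical, and the sample listing is a shortened stand-in for real output.

```python
import json


def is_deleted(list_response_text, datasource_id):
    """Return True if no entry in the listing has the given data source ID."""
    remaining = set()
    for entry in json.loads(list_response_text):
        props = entry.get("properties", {})
        if "id" in props:
            # Compare as strings, since IDs appear both as numbers and as strings.
            remaining.add(str(props["id"]))
    return str(datasource_id) not in remaining


# Shortened stand-in for a listing fetched after the delete call.
listing = '[{"id": "49", "properties": {"id": 49, "name": "My Twitter Stream"}, "status": "EXISTS"}]'
print(is_deleted(listing, 48))
```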
