LucidWorks Enterprise v1.8

This is the documentation for LucidWorks Enterprise v1.8. The most current release is v2.1.

Data sources are the conduits through which LucidWorks Enterprise (LWE) acquires new content. A data source describes the target repository of documents and the access method. This description is then used to create a crawl job, which is executed by a specific crawler implementation (called a "crawler controller").

At present, LWE comes with the following built-in crawler controllers (referred to by their symbolic names given in parentheses) that support the following kinds of data sources:

  • Aperture-based crawlers (lucid.aperture):
    • Local file system
    • Web
  • DataImportHandler-based JDBC crawler (lucid.jdbc):
    • JDBC database
  • SolrXML crawler (lucid.solrxml):
    • Solr XML files
  • Google Connector Manager-based crawler (lucid.gcm):
    • Microsoft SharePoint (Microsoft Office SharePoint Server 2007, Microsoft Windows SharePoint Services 3.0, SharePoint 2010)
  • External crawler for indexing documents crawled by an external process (lucid.external):
    • External data source
  • Remote file system crawler (lucid.fs):
    • SMB (Windows Shares)

API Entry Points

/api/collections/collection/datasources: list or create data sources in a particular collection

/api/collections/collection/datasources/id: update, remove, or get details for a particular data source

Get a List of Data Sources

GET /api/collections/collection/datasources

Input

Path Parameters

Key Description
collection The collection name

Query Parameters

None

Output

Output Content

A JSON map of fields to values. The exact set of fields depends on the kind of data source. Commonly used kinds of data sources use the following symbolic names: file (files on a local file system), web (HTTP/HTTPS web sites), jdbc (JDBC databases), solrxml (files in Solr XML format), and sharepoint (Microsoft SharePoint). All return:

Key Type Description
id 32-bit integer The numeric ID for this data source.
type string The type of this data source: file, web, jdbc, solrxml, sharepoint, and so on.
crawler string The crawler implementation that handles this type of data source: lucid.aperture, lucid.fs, lucid.jdbc, and so on.
collection string The name of the document collection that documents will be indexed into.
name string A human-readable name for this data source.

The output also includes the field mapping for the data source, which is modifiable as part of the regular data source update API. The data source key "mapping" contains a JSON map with the following keys and values:

Key Type Description
mappings JSON string-string A map where keys are case-insensitive names of the original metadata key names, and values are case-sensitive names of fields that make sense in the current schema. These target field names are verified against the current schema and if they are not valid these mappings are removed. Please note that null target names are allowed, which means that source fields with such mappings will be discarded.
multiVal JSON string-boolean A map of target field names that is automatically initialized from the schema based on the target field's multiplicity (multiValued field attribute).
types JSON string-string A map pre-initialized from the current schema. Additional validation can be performed on fields with declared non-string types. Currently supported types are DATE, INT, and STRING. If not specified, fields are assumed to have the type STRING.
defaultField string The field name to use if source name doesn't match any mapping. If null, then dynamicField will be used, and if that is null too then the original name will be returned.
dynamicField string If not null then source names without specific mappings will be mapped to dynamicField_sourceName, after some cleanup of the source name (non-letter characters are replaced with underscore).
uniqueKey string Defines the name of the field in the current schema that is a unique key. Filled in from the current schema.
datasourceField string A prefix for index fields that are needed for LucidWorks faceting and data source management.
literals JSON string-string An optional map that can specify static pairs of keys and values to be added to output documents.
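
The lookup order described for mappings, defaultField, and dynamicField can be sketched as follows. This is an illustration of the rules in the table above, not LWE's actual implementation: the function name resolve_field is hypothetical, and the cleanup rule (every character outside A-Z and a-z becomes an underscore) is an assumption about what "non-letter characters" means.

```python
import re


def resolve_field(source_name, mapping):
    """Return the target field for source_name, or None if it is discarded."""
    # Source metadata keys are matched case-insensitively.
    explicit = {k.lower(): v for k, v in (mapping.get("mappings") or {}).items()}
    key = source_name.lower()
    if key in explicit:
        return explicit[key]  # a null (None) target discards the field
    if mapping.get("defaultField") is not None:
        return mapping["defaultField"]
    if mapping.get("dynamicField") is not None:
        # Assumed cleanup: every character outside A-Z/a-z becomes "_".
        cleaned = re.sub(r"[^A-Za-z]", "_", source_name)
        return mapping["dynamicField"] + "_" + cleaned
    # No mapping, no default, no dynamic prefix: keep the original name.
    return source_name


mapping = {
    "mappings": {"content-type": "mimeType", "date": None},
    "defaultField": None,
    "dynamicField": "attr",
}
print(resolve_field("Content-Type", mapping))  # mimeType
print(resolve_field("date", mapping))          # None
print(resolve_field("x-custom", mapping))      # attr_x_custom
```

Validation of target names against the schema, which the server performs, is not modeled here.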

The following fields are optional and are supported across many data source types:

Key Type Description
commitWithin integer Number of milliseconds that defines the maximum interval between commits.
commitOnFinish boolean When true (default) then commit will be invoked at the end of crawl.

The following fields control the batch processing, and are also optional. Note: some crawler controllers don't support batch processing, or support only a subset of options.

Key Type Description
parsing boolean (default is true) When true then crawlers will parse rich formats immediately. When false then other processing is skipped and raw input documents are stored in a batch.
indexing boolean (default is true) When true then parsed documents will be sent immediately for indexing. When false then parsed documents will be stored in a batch.
caching boolean (default is false) When true then both raw and parsed documents will always be stored in a batch, in addition to any other requested processing. If false then batch is not created and documents are not preserved unless as a result of setting other options above.
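
One way to read the interaction of the three batch flags is sketched below. The function batch_plan and its result keys are hypothetical names used only for illustration, and the sketch assumes that setting parsing to false also skips indexing, which is how "other processing is skipped" is interpreted here.

```python
def batch_plan(parsing=True, indexing=True, caching=False):
    """Describe what happens to each crawled document under the given flags."""
    parse_now = parsing
    # Assumption: a document that is not parsed cannot be indexed either.
    index_now = parsing and indexing
    return {
        "parse": parse_now,
        "index": index_now,
        # Raw documents are kept when caching, or when parsing is deferred.
        "store_raw": caching or not parsing,
        # Parsed documents are kept when caching, or when indexing is deferred.
        "store_parsed": parse_now and (caching or not indexing),
    }


print(batch_plan())                # defaults: parse and index, no batch
print(batch_plan(indexing=False))  # parse now, keep parsed documents in a batch
print(batch_plan(parsing=False))   # keep only raw documents in a batch
```

With the defaults nothing is stored in a batch, which matches the note that documents are not preserved unless one of the options requests it.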

Below are specific fields for each data source type.

lucid.aperture / File system:

Key Type Description
path string The path of the directory to start reading from. Paths should be entered as the complete directory path or they will be interpreted as relative to $LWE_HOME. On Unix systems, this means the path entered should start at root / level; on Windows, the drive letter and full path should be used (such as C:\path). Various types of relative paths, such as ../ or ~/, are not supported.
follow_links boolean Indicates whether to follow symbolic links in the file system.
bounds string Either "tree" to limit the crawl to a strict subtree, or "none" for no limits.
include_paths list of strings Regex patterns to match on full URLs of files. If not empty then at least one pattern must match to include a file.
exclude_paths list of strings Regex patterns that URLs must not match. If not empty then a file is excluded if any pattern matches its URL.
crawl_depth 32-bit integer How many path levels to descend


Example file data source
{
    "crawl_depth": 5,
    "follow_links": true,
    "name": "LucidWorks Documentation",
    "path": "D:\\lwe\\docs\\lucidworks",
    "type": "file",
    "crawler": "lucid.aperture"
}

lucid.aperture / Web:

Key Type Description
url string The URL that serves as the crawl seed
bounds string Either "tree" to limit the crawl to a strict subtree, or "none" for no limits.
include_paths list of strings Regex patterns to match on full URLs of files. If not empty then at least one pattern must match to include a file.
exclude_paths list of strings Regex patterns that URLs must not match. If not empty then a file is excluded if any pattern matches its URL.
crawl_depth 32-bit integer The maximum number of crawl cycles (hops) from the starting URL.
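
The include_paths and exclude_paths rules can be checked offline with a small filter like the one below. It is a sketch only: is_included is a hypothetical helper, Python's re module stands in for the Java regular expressions the crawler uses, and it assumes that a pattern must match the whole URL rather than a substring.

```python
import re


def is_included(url, include_paths, exclude_paths):
    """Apply the include/exclude rules from the table above to one URL."""
    # Assumption: patterns must match the entire URL.
    if any(re.fullmatch(p, url) for p in exclude_paths):
        return False  # any matching exclude pattern rejects the URL
    if include_paths and not any(re.fullmatch(p, url) for p in include_paths):
        return False  # a non-empty include list requires at least one match
    return True


include_paths = [r"http://www\.lucidimagination\.com/.*"]
exclude_paths = [
    r"http://www\.lucidimagination\.com/blog/tag/.*",
    r"http://www\.lucidimagination\.com/search\?.*",
]

print(is_included("http://www.lucidimagination.com/products", include_paths, exclude_paths))
print(is_included("http://www.lucidimagination.com/blog/tag/solr", include_paths, exclude_paths))
```

Note that inside a JSON payload each backslash in these patterns has to be doubled (\\.), whereas the Python raw strings above use a single backslash.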


Robots.txt
This data source obeys most of the robots.txt standard, with the exception of the Crawl-Delay directive.

HTTP Proxy
This data source supports communication via an HTTP proxy, either open or authenticated. The following data source properties determine the use of the proxy:

Key Type Description
proxyHost string host name of the proxy. If null or absent then direct access is assumed, and all other proxy-related parameters are ignored.
proxyPort string port number of the proxy
proxyUsername string optional username credential for the proxy
proxyPassword string optional password credential for the proxy


Authentication
At this time, only the Basic and Digest types of HTTP authentication methods are supported. NTLMv1 and NTLMv2 authentication are not supported in this release.

A single data source may have multiple sets of credentials for different sites, or for different realms within a single site. This authentication data is passed within the auth property of the data source as a list of JSON maps, each map with the following properties:

Key Type Description
host string host name (or host:port) where this authentication should be used; may be null to indicate any host
realm string HTTP Realm where this authentication should be used; may be null to indicate any realm
username string user name; must not be null or empty
password string password; must not be null

The HTTP connector first tries to access resources without any authentication. When it receives an HTTP 401 (Unauthorized) response, the authentication method (Basic or Digest) is selected automatically, and the closest matching authentication tuple is selected from the ones configured for the data source.
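
The documentation does not define "closest matching", so the following is only one plausible reading: a null host or realm acts as a wildcard, an explicit value must match exactly, and the entry with the most exact matches wins. The function select_credentials is a hypothetical name, and comparing hosts by plain string equality (including any port) is an assumption.

```python
def select_credentials(auth, host, realm):
    """Pick the best-matching entry from a data source's auth list, or None."""
    best, best_score = None, -1
    for entry in auth:
        score = 0
        for key, actual in (("host", host), ("realm", realm)):
            wanted = entry.get(key)
            if wanted is None:
                continue      # null acts as a wildcard
            if wanted != actual:
                score = -1    # an explicit mismatch rules this entry out
                break
            score += 1        # an exact match counts toward closeness
        if score > best_score:
            best, best_score = entry, score
    return best


auth = [
    {"host": "www.lucidimagination.com:443", "realm": "Test realm",
     "username": "user1", "password": "test1"},
    {"host": "www.lucidimagination.com", "realm": None,
     "username": "user2", "password": "test2"},
]

chosen = select_credentials(auth, "www.lucidimagination.com", "Any realm")
print(chosen["username"])  # user2
```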

Example web data source
{
    "id": 2,
    "collect_links": true,
    "crawl_depth": 2,
    "exclude_paths": [
        "http://www\\.lucidimagination\\.com/blog/tag/.*",
        "http://www\\.lucidimagination\\.com/search\\?.*"
    ],
    "include_paths": [
        "http://www\\.lucidimagination\\.com/.*"
    ],
    "auth": [
        {
            "host": "www.lucidimagination.com:443",
            "realm": "Test realm",
            "username": "user1",
            "password": "test1"
        },
        {
            "host": "www.lucidimagination.com",
            "realm": null,
            "username": "user2",
            "password": "test2"
        }
    ],
    "name": "Lucid Imagination Website",
    "type": "web",
    "crawler": "lucid.aperture",
    "url": "http://www.lucidimagination.com/"
}

lucid.jdbc / JDBC:

Key Type Description
driver string The class name of the JDBC driver
username string The DB username
password string The DB password
url string The URL of the database instance
sql_select_statement string The select statement to use to generate data
primary_key string The column name of the primary key
delta_sql_query string The query used to obtain updates
nested_queries string The list of additional queries to obtain data from


Example jdbc data source
{
    "id": 2,
    "name": "Test database",
    "type": "jdbc",
    "crawler": "lucid.jdbc",
    "driver": "com.mysql.jdbc.Driver",
    "username": "root",
    "password": "pass",
    "url": "jdbc:mysql://localhost/test",
    "sql_select_statement": "select * from document",
    "primary_key": "id",
    "delta_sql_query": "select id from document where last_modified > $",
    "nested_queries": ["select category from document_category where doc_id=$",
                       "select tag from document_tag where doc_id=$"]
}

lucid.solrxml / Solr XML file

Key Type Description
file string The name of the file to read, or directory containing files to read. Paths should be entered as the complete directory path or they will be interpreted as relative to $LWE_HOME. On Unix systems, this means the path entered should start at root / level; on Windows, the drive letter and full path should be used (such as C:\path). Various types of relative paths, such as ../ or ~/, are not supported.
include_datasource_metadata boolean Option to add data_source and data_source_type fields to each document in addition to fields found in the file
include_paths list of strings An array of URL patterns to include
exclude_paths list of strings An array of URL patterns to exclude


Example solr xml data source
{
    "id": 2,
    "name": "Solr example XML documents",
    "type": "solrxml",
    "crawler": "lucid.solrxml",
    "file": "D:\\lucene_solr\\solr\\example\\exampledocs",
    "include_paths": [".*\\.xml"],
    "include_datasource_metadata": true
}

lucid.gcm / SharePoint:

Key Type Description
sharepointUrl string The fully qualified URL for the SharePoint site
username string Username with authorization to crawl the SharePoint repository
password string Password for the username above
domain string The domain where the user is authenticated
kdcserver string Kerberos KDC Hostname
mySiteBaseURL string Used for MOSS 2007 only. The MySite base URL is used to determine the complete MySite URL, so http://server.domain/personal/administrator/default.aspx would be entered as http://server.domain. The credentials provided will allow LucidWorks Enterprise to complete the MySite URL and crawl the content
includedURls (as in, U-R-lower-case L) string The directories on the server that should be crawled for indexing. If left blank, all paths will be followed, even if they lead away from the original URL entered. To limit crawling to a specific site, repeat the URL in this field with a regular expression that matches all pages from the site. The SharePoint data source uses GNU regular expressions, which may differ from the Java regular expressions used for Web and file system data sources. More information on the syntax can be found in the GNU regular expression documentation.
excludedURls (as in, U-R-lower-case L) string Directories on the server that should not be crawled and that should be excluded from the index. The same regular expression syntax can be used to specify Excluded URLs as is used for Included URLs.
useSPSearchVisibility string Use SharePoint search visibility options
aliases map of string to string Allows mapping of source URL patterns to aliases that are used to rewrite URLs before indexing
authorization string use "content"


Example SharePoint repository data source
{
    "id": 2,
    "name": "Sharepoint crawl",
    "type": "sharepoint",
    "crawler": "lucid.gcm",
    "sharepointUrl": "http://my.sharepoint.host.com/",
    "username": "user",
    "password": "secret",
    "domain": "myDomain",
    "includedURls": ".*",
    "useSPSearchVisibility": "true",
    "authorization": "content"
}

lucid.fs / SMB

The following properties are common to all remote and pseudo file systems (only SMB is supported in v1.8)

Key Type Description
url string For SMB (Windows Shares) filesystems, the root URL includes the protocol (smb), the host address, and the path to crawl: smb://host/path/to/crawl.
type string One of supported data source types, must be consistent with the root URL's protocol. The following value is supported: smb.
max_bytes long Optional, default is -1.
bounds string Either "tree" to limit the crawl to a strict subtree, or "none" for no limits.
includes list of strings Regex patterns to match on full URLs of files. If not empty then at least one pattern must match to include a file.
excludes list of strings Regex patterns that URLs must not match. If not empty then a file is excluded if any pattern matches its URL.
crawl_depth 32-bit integer How many path levels to descend.
username string Optional
password string Optional
windowsdomain string Required

Example remote filesystem (SMB) source
curl -H 'Content-type: application/json' -d '
{
    "name": "<datasource name>",
    "type": "smb",
    "crawler": "lucid.fs",
    "url": "smb://<host>/<path>/",
    "username": "username",
    "password": "password",
    "windowsdomain": "<your windows domain>"
}' 'http://localhost:8888/api/collections/<collection>/datasources'

Return Codes

200: Success OK

Examples

Input

curl 'http://localhost:8888/api/collections/collection1/datasources'

Output

[
    {
        "max_bytes": 10485760,
        "include_paths": [],
        "collect_links": true,
        "exclude_paths": [],
        "mapping": {
            "multiVal": {
                "fileSize": true,
                "author": true,
                "body": true,
                "title": true,
                "keywords": true,
                "description": false,
                "subject": true,
                "fileName": true,
                "dateCreated": false,
                "attr": true,
                "creator": true
            },
            "defaultField": null,
            "mappings": {
                "content-type": "mimeType",
                "slide-count": "pageCount",
                "body": "body",
                "slides": "pageCount",
                "subject": "subject",
                "plaintextmessagecontent": "body",
                "lastmodifiedby": "author",
                "content-encoding": "characterSet",
                "date": null,
                "type": null,
                "creator": "creator",
                "author": "author",
                "mimetype": "mimeType",
                "title": "title",
                "plaintextcontent": "body",
                "created": "dateCreated",
                "contributor": "author",
                "description": "description",
                "contentcreated": "dateCreated",
                "pagecount": "pageCount",
                "name": "title",
                "filelastmodified": "lastModified",
                "fullname": "author",
                "fulltext": "body",
                "last-modified": "lastModified",
                "messagesubject": "title",
                "keyword": "keywords",
                "contentlastmodified": "lastModified",
                "last-printed": null,
                "links": null,
                "batch_id": "batch_id",
                "crawl_uri": "crawl_uri",
                "filesize": "fileSize",
                "page-count": "pageCount",
                "content-length": "fileSize",
                "filename": "fileName"
            },
            "dynamicField": "attr",
            "types": {
                "filesize": "INT",
                "pagecount": "INT",
                "lastmodified": "DATE",
                "datecreated": "DATE",
                "date": "DATE"
            },
            "uniqueKey": "id",
            "datasourceField": "data_source"
        },
        "collection": "collection1",
        "type": "web",
        "url": "http://www.grantingersoll.com/",
        "crawler": "lucid.aperture",
        "id": 1,
        "bounds": "tree",
        "category": "Web",
        "name": "Sample Site",
        "crawl_depth": 2
    },
    {
        "max_bytes": 2147483647,
        "include_paths": [],
        "exclude_paths": [],
        "mapping": {
            "multiVal": {
                "fileSize": true,
                "body": true,
                "author": true,
                "title": true,
                "keywords": true,
                "subject": true,
                "description": false,
                "fileName": true,
                "dateCreated": false,
                "attr": true,
                "creator": true
            },
            "defaultField": null,
            "mappings": {
                "slide-count": "pageCount",
                "content-type": "mimeType",
                "body": "body",
                "slides": "pageCount",
                "subject": "subject",
                "plaintextmessagecontent": "body",
                "lastmodifiedby": "author",
                "content-encoding": "characterSet",
                "type": null,
                "date": null,
                "creator": "creator",
                "author": "author",
                "title": "title",
                "mimetype": "mimeType",
                "created": "dateCreated",
                "plaintextcontent": "body",
                "pagecount": "pageCount",
                "contentcreated": "dateCreated",
                "description": "description",
                "contributor": "author",
                "name": "title",
                "filelastmodified": "lastModified",
                "fullname": "author",
                "fulltext": "body",
                "messagesubject": "title",
                "last-modified": "lastModified",
                "keyword": "keywords",
                "contentlastmodified": "lastModified",
                "last-printed": null,
                "links": null,
                "batch_id": "batch_id",
                "crawl_uri": "crawl_uri",
                "filesize": "fileSize",
                "page-count": "pageCount",
                "content-length": "fileSize",
                "filename": "fileName"
            },
            "dynamicField": "attr",
            "types": {
                "filesize": "INT",
                "pagecount": "INT",
                "lastmodified": "DATE",
                "datecreated": "DATE",
                "date": "DATE"
            },
            "uniqueKey": "id",
            "datasourceField": "data_source"
        },
        "follow_links": true,
        "collection": "collection1",
        "type": "file",
        "crawler": "lucid.aperture",
        "id": 2,
        "bounds": "tree",
        "category": "FileSystem",
        "name": "Small Test Collection",
        "path": "C:\\Users\\Nick\\Documents\\Business",
        "crawl_depth": 2147483647
    }
]

Create a Data Source

POST /api/collections/collection/datasources

Input

Path Parameters

Key Description
collection The collection name

Query Parameters

None

Input content

A JSON block with all fields. The id field, if present, is ignored. The available fields are listed in the section on getting a list of data sources.

Output

Output Content

JSON representation of new data source. Fields returned are listed in the section on getting a list of data sources.

Return Codes

201: created

Examples

Create a data source that includes the content of the Lucid Imagination web site. To keep the size down, only crawl two levels, and do not index the blog tag links or any search links. Also, do not wander off the site and index any external links.

Input

curl -H 'Content-type: application/json' -d '
{
    "crawl_depth": 2,
    "exclude_paths": [
        "http://www\\.lucidimagination\\.com/blog/tag/.*",
        "http://www\\.lucidimagination\\.com/search\\?.*"
    ],
    "include_paths": [
        "http://www\\.lucidimagination\\.com/.*"
    ],
    "name": "Lucid Imagination Website",
    "type": "web",
    "crawler": "lucid.aperture",
    "url": "http://www.lucidimagination.com/"
}' 'http://localhost:8888/api/collections/collection1/datasources'

Output

{
    "id": 6,
    "max_bytes": 10485760,
    "include_paths": [
        "http://www\\.lucidimagination\\.com/.*"
    ],
    "collect_links": true,
    "exclude_paths": [
        "http://www\\.lucidimagination\\.com/blog/tag/.*",
        "http://www\\.lucidimagination\\.com/search\\?.*"
    ],
    "mapping": {
        "multiVal": {
            "fileSize": true,
            "body": true,
            "author": true,
            "title": true,
            "keywords": true,
            "subject": true,
            "description": false,
            "fileName": true,
            "dateCreated": false,
            "attr": true,
            "creator": true
        },
        "defaultField": null,
        "mappings": {
            "slide-count": "pageCount",
            "content-type": "mimeType",
            "body": "body",
            "slides": "pageCount",
            "subject": "subject",
            "plaintextmessagecontent": "body",
            "lastmodifiedby": "author",
            "content-encoding": "characterSet",
            "type": null,
            "date": null,
            "creator": "creator",
            "author": "author",
            "title": "title",
            "mimetype": "mimeType",
            "created": "dateCreated",
            "plaintextcontent": "body",
            "pagecount": "pageCount",
            "contentcreated": "dateCreated",
            "description": "description",
            "contributor": "author",
            "name": "title",
            "filelastmodified": "lastModified",
            "fullname": "author",
            "fulltext": "body",
            "messagesubject": "title",
            "last-modified": "lastModified",
            "keyword": "keywords",
            "contentlastmodified": "lastModified",
            "last-printed": null,
            "links": null,
            "batch_id": "batch_id",
            "crawl_uri": "crawl_uri",
            "filesize": "fileSize",
            "page-count": "pageCount",
            "content-length": "fileSize",
            "filename": "fileName"
        },
        "dynamicField": "attr",
        "types": {
            "filesize": "INT",
            "pagecount": "INT",
            "lastmodified": "DATE",
            "datecreated": "DATE",
            "date": "DATE"
        },
        "uniqueKey": "id",
        "datasourceField": "data_source"
    },
    "collection": "collection1",
    "type": "web",
    "url": "http://www.lucidimagination.com/",
    "crawler": "lucid.aperture",
    "bounds": "tree",
    "category": "Web",
    "name": "Lucid Imagination Website",
    "crawl_depth": 2
}
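
For clients that prefer not to shell out to curl, the same create call can be prepared with Python's standard library. This is a minimal sketch: create_datasource_request is a hypothetical helper, the host and port are taken from the curl example, and dropping a client-supplied id is optional since the server ignores it anyway. The request is only built here, not sent.

```python
import json
import urllib.request


def create_datasource_request(base_url, collection, fields):
    """Build (but do not send) the POST request that creates a data source."""
    # The server assigns the ID, so a client-supplied one is dropped.
    body = {k: v for k, v in fields.items() if k != "id"}
    return urllib.request.Request(
        f"{base_url}/api/collections/{collection}/datasources",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = create_datasource_request(
    "http://localhost:8888",
    "collection1",
    {
        "crawl_depth": 2,
        "exclude_paths": [
            "http://www\\.lucidimagination\\.com/blog/tag/.*",
            "http://www\\.lucidimagination\\.com/search\\?.*",
        ],
        "include_paths": ["http://www\\.lucidimagination\\.com/.*"],
        "name": "Lucid Imagination Website",
        "type": "web",
        "crawler": "lucid.aperture",
        "url": "http://www.lucidimagination.com/",
    },
)

# To send it against a running server (expects HTTP 201 on success):
#   with urllib.request.urlopen(request) as response:
#       created = json.load(response)
#       datasource_id = created["id"]
```

Storing the returned id at this point avoids a later lookup through the list call.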

Get Data Source Details

GET /api/collections/collection/datasources/id

To find the ID of a data source, either store it when the data source is created or look it up in the list of data sources (GET /api/collections/collection/datasources).

Input

Path Parameters

Key Description
collection The collection name
id The data source ID

Query Parameters

None

Input content

None

Output

Output Content

Fields returned are listed in the section on getting a list of data sources.

Return Codes

200: success ok

404: not found

Examples

Get all of the parameters for data source 6, created in the previous step.

Input

curl 'http://localhost:8888/api/collections/collection1/datasources/6'

Output

{
    "id": 6,
    "max_bytes": 10485760,
    "include_paths": [
        "http://www\\.lucidimagination\\.com/.*"
    ],
    "collect_links": true,
    "exclude_paths": [
        "http://www\\.lucidimagination\\.com/blog/tag/.*",
        "http://www\\.lucidimagination\\.com/search\\?.*"
    ],
    "mapping": {
        "multiVal": {
            "fileSize": true,
            "body": true,
            "author": true,
            "title": true,
            "keywords": true,
            "subject": true,
            "description": false,
            "fileName": true,
            "dateCreated": false,
            "attr": true,
            "creator": true
        },
        "defaultField": null,
        "mappings": {
            "slide-count": "pageCount",
            "content-type": "mimeType",
            "body": "body",
            "slides": "pageCount",
            "subject": "subject",
            "plaintextmessagecontent": "body",
            "lastmodifiedby": "author",
            "content-encoding": "characterSet",
            "type": null,
            "date": null,
            "creator": "creator",
            "author": "author",
            "title": "title",
            "mimetype": "mimeType",
            "created": "dateCreated",
            "plaintextcontent": "body",
            "pagecount": "pageCount",
            "contentcreated": "dateCreated",
            "description": "description",
            "contributor": "author",
            "name": "title",
            "filelastmodified": "lastModified",
            "fullname": "author",
            "fulltext": "body",
            "messagesubject": "title",
            "last-modified": "lastModified",
            "keyword": "keywords",
            "contentlastmodified": "lastModified",
            "last-printed": null,
            "links": null,
            "batch_id": "batch_id",
            "crawl_uri": "crawl_uri",
            "filesize": "fileSize",
            "page-count": "pageCount",
            "content-length": "fileSize",
            "filename": "fileName"
        },
        "dynamicField": "attr",
        "types": {
            "filesize": "INT",
            "pagecount": "INT",
            "lastmodified": "DATE",
            "datecreated": "DATE",
            "date": "DATE"
        },
        "uniqueKey": "id",
        "datasourceField": "data_source"
    },
    "collection": "collection1",
    "type": "web",
    "url": "http://www.lucidimagination.com/",
    "crawler": "lucid.aperture",
    "bounds": "tree",
    "category": "Web",
    "name": "Lucid Imagination Website",
    "crawl_depth": 2
}

Update a Data Source

PUT /api/collections/collection/datasources/id

Input

Path Parameters

Key Description
collection The collection name
id The data source ID

Query Parameters

None

Input content

JSON block with either all fields or just those that need updating. Data source type, crawler type, and ID cannot be updated. Other fields are listed in the section on getting a list of data sources.

Output

Output Content

None

Return Codes

204: success no content

Examples

Change the web data source so that it crawls three levels instead of just two:

Input

curl -X PUT -H 'Content-type: application/json' -d '
{
    "crawl_depth": 3
}' 'http://localhost:8888/api/collections/collection1/datasources/6'

Output

None. (Check properties to confirm changes.)
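
Because the data source type, crawler type, and ID cannot be updated, a client may want to reject such changes before sending a partial update. The sketch below shows one way to do that; update_datasource_request and IMMUTABLE_FIELDS are hypothetical names, and how the server itself reacts to these fields in a PUT body is not stated in this documentation.

```python
import json
import urllib.request

# Fields the documentation says cannot be changed after creation.
IMMUTABLE_FIELDS = {"id", "type", "crawler"}


def update_datasource_request(base_url, collection, datasource_id, changes):
    """Build (but do not send) a PUT request carrying only the changed fields."""
    blocked = IMMUTABLE_FIELDS.intersection(changes)
    if blocked:
        raise ValueError(f"cannot update immutable fields: {sorted(blocked)}")
    return urllib.request.Request(
        f"{base_url}/api/collections/{collection}/datasources/{datasource_id}",
        data=json.dumps(changes).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


update = update_datasource_request(
    "http://localhost:8888", "collection1", 6, {"crawl_depth": 3}
)

# To send it against a running server (expects HTTP 204, no body):
#   urllib.request.urlopen(update).close()
```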

Delete a Data Source

DELETE /api/collections/collection/datasources/id

Input

Path Parameters

Key Description
collection The collection name
id The data source ID

Query Parameters

None

Input content

None

Output

Output Content

None

Return Codes

204: success no content

404: not found

Examples

Input

curl -X DELETE -H 'Content-type: application/json' 'http://localhost:8888/api/collections/collection1/datasources/13'

Output

None. Check the listing of data sources to confirm deletion.
