Data sources are the conduits through which LucidWorks Enterprise (LWE) acquires new content. A data source describes the target document repository and the method used to access it. This description is then used to create a crawl job that is executed by a specific crawler implementation (called a "crawler controller").
At present, LWE comes with the following built-in crawler controllers (symbolic names in parentheses), each of which supports the kinds of data sources listed beneath it:
- Aperture-based crawlers (lucid.aperture):
  - Local file system
  - Web
- DataImportHandler-based JDBC crawler (lucid.jdbc):
  - JDBC database
- SolrXML crawler (lucid.solrxml):
  - Solr XML files
- Google Connector Manager-based crawler (lucid.gcm):
  - Microsoft SharePoint (Microsoft Office SharePoint Server 2007, Microsoft Windows SharePoint Services 3.0, SharePoint 2010)
- External crawler for indexing documents crawled by an external process (lucid.external):
  - External data source
API Entry Points
/api/collections/collection/datasources: list or create data sources in a particular collection
/api/collections/collection/datasources/id: update, remove, or get details for a particular data source
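As a minimal sketch of how the two entry points relate, the following Python helper builds either URL from a base address. The function name `datasources_url` and the base URL used in the usage line are illustrative assumptions, not part of the LWE API.

```python
from urllib.parse import quote


def datasources_url(base_url, collection, datasource_id=None):
    """Return the list/create URL, or the per-data-source URL when an ID is given.

    base_url is an assumption (for example, "http://localhost:8888").
    """
    url = f"{base_url.rstrip('/')}/api/collections/{quote(collection, safe='')}/datasources"
    if datasource_id is not None:
        url += f"/{datasource_id}"
    return url


print(datasources_url("http://localhost:8888", "collection1"))
print(datasources_url("http://localhost:8888", "collection1", 6))
```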
Get a List of Data Sources
GET /api/collections/collection/datasources
Input
Path Parameters
| Key | Description |
|---|---|
| collection | The collection name |
Query Parameters
None
Output
Output Content
A JSON list of maps, one per data source, each mapping fields to values. The exact set of fields depends on the kind of data source. Commonly used kinds of data sources have the following symbolic names: file (files on a local file system), web (HTTP/HTTPS web sites), jdbc (JDBC databases), solrxml (files in Solr XML format), and sharepoint (Microsoft SharePoint). All data sources return:
| Key | Type | Description |
|---|---|---|
| id | 32-bit integer | The numeric ID for this data source. |
| type | string | The type of this data source: file, web, jdbc, solrxml, sharepoint, and so on |
| crawler | string | Crawler implementation that handles this type of data source: lucid.aperture, lucid.fs, lucid.jdbc, and so on |
| collection | string | Name of the collection that documents will be indexed into. |
| name | string | A human-readable name for this data source. |
The output also includes the field mapping for the data source, which is modifiable as part of the regular data source update API. The data source key "mapping" contains a JSON map with the following keys and values:
| Key | Type | Description |
|---|---|---|
| mappings | JSON string-string | A map where keys are case-insensitive names of the original metadata key names, and values are case-sensitive names of fields that make sense in the current schema. These target field names are verified against the current schema and if they are not valid these mappings are removed. Please note that null target names are allowed, which means that source fields with such mappings will be discarded. |
| multiVal | JSON string-boolean | A map of target field names that is automatically initialized from the schema based on the target field's multiplicity (multiValued field attribute). |
| types | JSON string-string | A map pre-initialized from the current schema. Additional validation can be performed on fields with declared non-string types. Currently supported types are DATE, INT, and STRING. Fields with no specified type are assumed to have the type STRING. |
| defaultField | string | The field name to use if source name doesn't match any mapping. If null, then dynamicField will be used, and if that is null too then the original name will be returned. |
| dynamicField | string | If not null then source names without specific mappings will be mapped to dynamicField_sourceName, after some cleanup of the source name (non-letter characters are replaced with underscore). |
| uniqueKey | string | Defines the name of the field in the current schema that is a unique key. Filled in from the current schema. |
| datasourceField | string | A prefix for index fields that are needed for LucidWorks faceting and data source management. |
| literals | JSON string-string | An optional map that can specify static pairs of keys and values to be added to output documents. |
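The lookup order described for mappings, defaultField, and dynamicField can be sketched in Python as follows. This is an illustration of the rules as stated in the table, not LWE's actual implementation; the function name `resolve_field` is hypothetical, and the cleanup step assumes that "non-letter" means anything outside A-Z and a-z.

```python
import re


def resolve_field(source_name, mapping):
    """Return the target field for a source metadata name.

    Returns None when the name is explicitly mapped to null (field discarded).
    """
    # Source keys are matched case-insensitively.
    explicit = {key.lower(): target for key, target in mapping.get("mappings", {}).items()}
    key = source_name.lower()
    if key in explicit:
        return explicit[key]
    if mapping.get("defaultField") is not None:
        return mapping["defaultField"]
    if mapping.get("dynamicField") is not None:
        # Assumption: every character that is not an ASCII letter becomes "_".
        cleaned = re.sub(r"[^A-Za-z]", "_", source_name)
        return f"{mapping['dynamicField']}_{cleaned}"
    return source_name


example = {
    "mappings": {"content-type": "mimeType", "date": None},
    "defaultField": None,
    "dynamicField": "attr",
}
print(resolve_field("Content-Type", example))
print(resolve_field("x-custom", example))
```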
The following fields are optional and are supported across many data source types:
| Key | Type | Description |
|---|---|---|
| commitWithin | integer | Number of milliseconds that defines the maximum interval between commits. |
| commitOnFinish | boolean | When true (default) then commit will be invoked at the end of crawl. |
The following fields control the batch processing, and are also optional. Note: some crawler controllers don't support batch processing, or support only a subset of options.
| Key | Type | Description |
|---|---|---|
| parsing | boolean | (default is true) When true then crawlers will parse rich formats immediately. When false then other processing is skipped and raw input documents are stored in a batch. |
| indexing | boolean | (default is true) When true then parsed documents will be sent immediately for indexing. When false then parsed documents will be stored in a batch. |
| caching | boolean | (default is false) When true then both raw and parsed documents will always be stored in a batch, in addition to any other requested processing. If false then batch is not created and documents are not preserved unless as a result of setting other options above. |
Below are specific fields for each data source type.
lucid.aperture / File system:
| Key | Type | Description |
|---|---|---|
| path | string | The path of the directory to start reading from. Paths should be entered as the complete directory path or they will be interpreted as relative to $LWE_HOME. On Unix systems, this means the path entered should start at root / level; on Windows, the drive letter and full path should be used (such as C:\path). Various types of relative paths, such as ../ or ~/, are not supported. |
| follow_links | boolean | Indicates whether to follow symbolic links in the file system. |
| bounds | string | Either "tree" to limit the crawl to a strict subtree, or "none" for no limits. |
| include_paths | list of strings | Regex patterns to match on full URLs of files. If not empty then at least one pattern must match to include a file. |
| exclude_paths | list of strings | Regex patterns that URLs must not match. If not empty then a file is excluded if any pattern matches its URL. |
| crawl_depth | 32-bit integer | How many path levels to descend |
{
"crawl_depth": 5,
"follow_links": true,
"name": "LucidWorks Documentation",
"path": "D:\\lwe\\docs\\lucidworks",
"type": "file",
"crawler": "lucid.aperture"
}
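Because a path that is not complete is interpreted relative to $LWE_HOME, it can help to check a path before submitting it. The sketch below classifies a path according to the rules stated for the path field; the function name and the three labels are hypothetical, and the check is purely textual (it does not touch the file system).

```python
import re


def classify_crawl_path(path):
    """Classify a path as 'absolute', 'relative_to_lwe_home', or 'unsupported'."""
    segments = re.split(r"[\\/]", path)
    # "../" and "~/" style paths are documented as not supported.
    if path.startswith("~") or ".." in segments:
        return "unsupported"
    # Unix root-level path, or Windows drive letter plus full path.
    if path.startswith("/") or re.match(r"[A-Za-z]:[\\/]", path):
        return "absolute"
    return "relative_to_lwe_home"


for candidate in ["/var/data/docs", "D:\\lwe\\docs\\lucidworks", "data/docs", "../docs"]:
    print(candidate, "->", classify_crawl_path(candidate))
```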
lucid.aperture / Web:
| Key | Type | Description |
|---|---|---|
| url | string | The URL that serves as the crawl seed |
| bounds | string | Either "tree" to limit the crawl to a strict subtree, or "none" for no limits. |
| include_paths | list of strings | Regex patterns to match on full URLs of files. If not empty then at least one pattern must match to include a file. |
| exclude_paths | list of strings | Regex patterns that URLs must not match. If not empty then a file is excluded if any pattern matches its URL. |
| crawl_depth | 32-bit integer | The maximum number of crawl cycles (hops) from the starting URL. |
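The interaction between include_paths and exclude_paths can be illustrated with a short Python sketch. It assumes that a pattern must match the whole URL and that exclusions are checked first; the crawler itself uses Java regular expressions, so this Python approximation and the name `is_crawlable` are assumptions for illustration only.

```python
import re


def is_crawlable(url, include_paths, exclude_paths):
    """Apply the include/exclude rules from the table to a single URL."""
    # A URL is excluded if any exclude pattern matches it.
    if any(re.fullmatch(pattern, url) for pattern in exclude_paths):
        return False
    # If include patterns are given, at least one must match.
    if include_paths and not any(re.fullmatch(pattern, url) for pattern in include_paths):
        return False
    return True


includes = [r"http://www\.lucidimagination\.com/.*"]
excludes = [
    r"http://www\.lucidimagination\.com/blog/tag/.*",
    r"http://www\.lucidimagination\.com/search\?.*",
]
print(is_crawlable("http://www.lucidimagination.com/products", includes, excludes))
print(is_crawlable("http://www.lucidimagination.com/blog/tag/solr", includes, excludes))
```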
Robots.txt
This data source obeys most of the robots.txt standard, with the exception of the Crawl-Delay directive.
HTTP Proxy
This data source supports communication via an HTTP proxy, either open or authenticated. The following data source properties determine the use of the proxy:
| Key | Type | Description |
|---|---|---|
| proxyHost | string | host name of the proxy. If null or absent then direct access is assumed, and all other proxy-related parameters are ignored. |
| proxyPort | string | port number of the proxy |
| proxyUsername | string | optional username credential for the proxy |
| proxyPassword | string | optional password credential for the proxy |
Authentication
At this time, only the Basic and Digest types of HTTP authentication methods are supported. NTLMv1 and NTLMv2 authentication are not supported in this release.
A single data source may have multiple sets of credentials for different sites, or for different realms within a single site. This authentication data is passed within the auth property of the data source as a list of JSON maps, each map with the following properties:
| Key | Type | Description |
|---|---|---|
| host | string | host name (or host:port) where this authentication should be used; may be null to indicate any host |
| realm | string | HTTP Realm where this authentication should be used; may be null to indicate any realm |
| username | string | user name; must not be null or empty |
| password | string | password; must not be null |
The HTTP connector first tries to access resources without any authentication. When it receives an HTTP 401 response, the authentication method (Basic or Digest) is selected automatically, and the closest matching authentication tuple is selected from those configured for the data source.
{
"id": 2,
"collect_links": true,
"crawl_depth": 2,
"exclude_paths": [
"http://www\\.lucidimagination\\.com/blog/tag/.*",
"http://www\\.lucidimagination\\.com/search\\?.*"
],
"include_paths": [
"http://www\\.lucidimagination\\.com/.*"
],
"auth": [
{
"host": "www.lucidimagination.com:443",
"realm": "Test realm",
"username": "user1",
"password": "test1"
},
{
"host": "www.lucidimagination.com",
"realm": null,
"username": "user2",
"password": "test2"
}
],
"name": "Lucid Imagination Website",
"type": "web",
"crawler": "lucid.aperture",
"url": "http://www.lucidimagination.com/"
}
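The documentation does not define "closest matching" precisely. One plausible reading, sketched below in Python, is that a null host or realm acts as a wildcard, a non-null value must match exactly, and a host match outweighs a realm match. The scoring weights and the function name `select_credentials` are assumptions, not the connector's documented behavior.

```python
def select_credentials(auth_entries, host, realm):
    """Pick the most specific auth entry for a host/realm pair, or None."""
    best, best_score = None, -1
    for entry in auth_entries:
        score = 0
        for key, actual, weight in (("host", host, 2), ("realm", realm, 1)):
            wanted = entry.get(key)
            if wanted is None:
                continue  # null acts as a wildcard and adds no specificity
            if wanted != actual:
                score = None  # a non-null value that differs disqualifies the entry
                break
            score += weight
        if score is not None and score > best_score:
            best, best_score = entry, score
    return best


auth = [
    {"host": "www.lucidimagination.com:443", "realm": "Test realm", "username": "user1", "password": "test1"},
    {"host": "www.lucidimagination.com", "realm": None, "username": "user2", "password": "test2"},
]
chosen = select_credentials(auth, "www.lucidimagination.com", "Any realm")
print(chosen["username"] if chosen else None)
```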
lucid.jdbc / JDBC:
| Key | Type | Description |
|---|---|---|
| driver | string | The class name of the JDBC driver |
| username | string | The DB username |
| password | string | The DB password |
| url | string | The URL of the database instance |
| sql_select_statement | string | The select statement to use to generate data |
| primary_key | string | The column name of the primary key |
| delta_sql_query | string | The query used to obtain updates |
| nested_queries | list of strings | The list of additional queries to obtain data from |
{
"id": 2,
"name": "Test database",
"type": "jdbc",
"crawler": "lucid.jdbc",
"driver": "com.mysql.jdbc.Driver",
"username": "root",
"password": "pass",
"url": "jdbc:mysql://localhost/test",
"sql_select_statement": "select * from document",
"primary_key": "id",
"delta_sql_query": "select id from document where last_modified > $",
"nested_queries": ["select category from document_category where doc_id=$",
"select tag from document_tag where doc_id=$"]
}
lucid.solrxml / Solr XML file:
| Key | Type | Description |
|---|---|---|
| file | string | The name of the file to read, or directory containing files to read. Paths should be entered as the complete directory path or they will be interpreted as relative to $LWE_HOME. On Unix systems, this means the path entered should start at root / level; on Windows, the drive letter and full path should be used (such as C:\path). Various types of relative paths, such as ../ or ~/, are not supported. |
| include_datasource_metadata | boolean | Option to add data_source and data_source_type fields to each document in addition to fields found in the file |
| include_paths | list of strings | An array of URL patterns to include |
| exclude_paths | list of strings | An array of URL patterns to exclude |
{
"id": 2,
"name": "Solr example XML documents",
"type": "solrxml",
"crawler": "lucid.solrxml",
"file": "D:\\lucene_solr\\solr\\example\\exampledocs",
"include_paths": [ ".*\\.xml" ],
"include_datasource_metadata": true
}
lucid.gcm / SharePoint:
| Key | Type | Description |
|---|---|---|
| sharepointUrl | string | The fully qualified URL for the SharePoint site |
| username | string | Username with authorization to crawl the SharePoint repository |
| password | string | Password for the username above |
| domain | string | The domain where the user is authenticated |
| kdcserver | string | Kerberos KDC Hostname |
| mySiteBaseURL | string | Used for MOSS 2007 only. The MySite base URL is used to determine the complete MySite URL, so http://server.domain/personal/administrator/default.aspx would be entered as http://server.domain. The credentials provided will allow LucidWorks Enterprise to complete the MySite URL and crawl the content |
| includedURls (as in, U-R-lower-case L) | string | The directories on the server that should be crawled for indexing. If left blank, all paths will be followed, even if they lead away from the original URL entered. To limit crawling to a specific site, repeat the site URL in this field with a regular expression that matches all pages from the site. The SharePoint data source uses GNU regular expressions, which may differ from the Java regular expressions used for Web and file system data sources. More information on the syntax can be found in the GNU regular expression documentation. |
| excludedURls (as in, U-R-lower-case L) | string | Directories on the server that should not be crawled and that should be excluded from the index. The same regular expression syntax can be used to specify Excluded URLs as is used for Included URLs. |
| useSPSearchVisibility | string | Use SharePoint search visibility options |
| aliases | map of string to string | Allows mapping of source URL patterns to aliases that are used to rewrite URLs before indexing |
| authorization | string | use "content" |
{
"id": 2,
"name": "Sharepoint crawl",
"type": "sharepoint",
"crawler": "lucid.gcm",
"sharepointUrl": "http://my.sharepoint.host.com/",
"username": "user",
"password": "secret",
"domain": "myDomain",
"includedURls": ".*",
"useSPSearchVisibility": "true",
"authorization": "content"
}
lucid.fs / SMB:
The following properties are common to all remote and pseudo file systems (only SMB is supported in v1.8):
| Key | Type | Description |
|---|---|---|
| url | string | For SMB (Windows Shares) filesystems, the root URL includes the protocol (smb), the host address, and the path to crawl: smb://host/path/to/crawl. |
| type | string | One of supported data source types, must be consistent with the root URL's protocol. The following value is supported: smb. |
| max_bytes | long | Optional, default is -1. |
| bounds | string | Either "tree" to limit the crawl to a strict subtree, or "none" for no limits. |
| includes | list of strings | Regex patterns to match on full URLs of files. If not empty then at least one pattern must match to include a file. |
| excludes | list of strings | Regex patterns that URLs must not match. If not empty then a file is excluded if any pattern matches its URL. |
| crawl_depth | 32-bit integer | How many path levels to descend. |
| username | string | Optional |
| password | string | Optional |
| windowsdomain | string | Required |
curl -H 'Content-type: application/json' -d '
{
"name": "<datasource name>",
"type": "smb",
"crawler": "lucid.fs",
"url": "smb://<host>/<path>/",
"username": "username",
"password": "password",
"windowsdomain": "<your windows domain>"
}' 'http://localhost:8888/api/collections/<collection>/datasources'
Response Codes
200: Success OK
Examples
Input
curl 'http://localhost:8888/api/collections/collection1/datasources'
Output
[
{
"max_bytes": 10485760,
"include_paths": [],
"collect_links": true,
"exclude_paths": [],
"mapping": {
"multiVal": {
"fileSize": true,
"author": true,
"body": true,
"title": true,
"keywords": true,
"description": false,
"subject": true,
"fileName": true,
"dateCreated": false,
"attr": true,
"creator": true
},
"defaultField": null,
"mappings": {
"content-type": "mimeType",
"slide-count": "pageCount",
"body": "body",
"slides": "pageCount",
"subject": "subject",
"plaintextmessagecontent": "body",
"lastmodifiedby": "author",
"content-encoding": "characterSet",
"date": null,
"type": null,
"creator": "creator",
"author": "author",
"mimetype": "mimeType",
"title": "title",
"plaintextcontent": "body",
"created": "dateCreated",
"contributor": "author",
"description": "description",
"contentcreated": "dateCreated",
"pagecount": "pageCount",
"name": "title",
"filelastmodified": "lastModified",
"fullname": "author",
"fulltext": "body",
"last-modified": "lastModified",
"messagesubject": "title",
"keyword": "keywords",
"contentlastmodified": "lastModified",
"last-printed": null,
"links": null,
"batch_id": "batch_id",
"crawl_uri": "crawl_uri",
"filesize": "fileSize",
"page-count": "pageCount",
"content-length": "fileSize",
"filename": "fileName"
},
"dynamicField": "attr",
"types": {
"filesize": "INT",
"pagecount": "INT",
"lastmodified": "DATE",
"datecreated": "DATE",
"date": "DATE"
},
"uniqueKey": "id",
"datasourceField": "data_source"
},
"collection": "collection1",
"type": "web",
"url": "http://www.grantingersoll.com/",
"crawler": "lucid.aperture",
"id": 1,
"bounds": "tree",
"category": "Web",
"name": "Sample Site",
"crawl_depth": 2
},
{
"max_bytes": 2147483647,
"include_paths": [],
"exclude_paths": [],
"mapping": {
"multiVal": {
"fileSize": true,
"body": true,
"author": true,
"title": true,
"keywords": true,
"subject": true,
"description": false,
"fileName": true,
"dateCreated": false,
"attr": true,
"creator": true
},
"defaultField": null,
"mappings": {
"slide-count": "pageCount",
"content-type": "mimeType",
"body": "body",
"slides": "pageCount",
"subject": "subject",
"plaintextmessagecontent": "body",
"lastmodifiedby": "author",
"content-encoding": "characterSet",
"type": null,
"date": null,
"creator": "creator",
"author": "author",
"title": "title",
"mimetype": "mimeType",
"created": "dateCreated",
"plaintextcontent": "body",
"pagecount": "pageCount",
"contentcreated": "dateCreated",
"description": "description",
"contributor": "author",
"name": "title",
"filelastmodified": "lastModified",
"fullname": "author",
"fulltext": "body",
"messagesubject": "title",
"last-modified": "lastModified",
"keyword": "keywords",
"contentlastmodified": "lastModified",
"last-printed": null,
"links": null,
"batch_id": "batch_id",
"crawl_uri": "crawl_uri",
"filesize": "fileSize",
"page-count": "pageCount",
"content-length": "fileSize",
"filename": "fileName"
},
"dynamicField": "attr",
"types": {
"filesize": "INT",
"pagecount": "INT",
"lastmodified": "DATE",
"datecreated": "DATE",
"date": "DATE"
},
"uniqueKey": "id",
"datasourceField": "data_source"
},
"follow_links": true,
"collection": "collection1",
"type": "file",
"crawler": "lucid.aperture",
"id": 2,
"bounds": "tree",
"category": "FileSystem",
"name": "Small Test Collection",
"path": "C:\\Users\\Nick\\Documents\\Business",
"crawl_depth": 2147483647
}
]
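Since a data source ID is needed for the detail, update, and delete calls, a client may want to look IDs up from this listing. The following Python sketch parses a response body of the shape shown above; the function name is hypothetical, and if two data sources share a name, only the last one is kept.

```python
import json


def datasource_ids_by_name(response_body):
    """Map each data source name in a list response to its numeric ID."""
    datasources = json.loads(response_body)
    return {ds["name"]: ds["id"] for ds in datasources}


# Abbreviated sample of the listing output; real responses carry many more fields.
sample = '[{"id": 1, "name": "Sample Site", "type": "web"}, {"id": 2, "name": "Small Test Collection", "type": "file"}]'
print(datasource_ids_by_name(sample))
```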
Create a Data Source
POST /api/collections/collection/datasources
Input
Path Parameters
| Key | Description |
|---|---|
| collection | The collection name |
Query Parameters
None
Input content
JSON block with all fields. The ID field, if present, will be ignored. See fields in section on getting a list of data sources.
Output
Output Content
JSON representation of new data source. Fields returned are listed in the section on getting a list of data sources.
Return Codes
201: created
Examples
Create a data source that includes the content of the Lucid Imagination web site. To keep the size down, only crawl two levels, and do not index the blog tag links or any search links. Also, do not wander off the site and index any external links.
Input
curl -H 'Content-type: application/json' -d '
{
"crawl_depth": 2,
"exclude_paths": [
"http://www\\.lucidimagination\\.com/blog/tag/.*",
"http://www\\.lucidimagination\\.com/search\\?.*"
],
"include_paths": [
"http://www\\.lucidimagination\\.com/.*"
],
"name": "Lucid Imagination Website",
"type": "web",
"crawler": "lucid.aperture",
"url": "http://www.lucidimagination.com/"
}' 'http://localhost:8888/api/collections/collection1/datasources'
Output
{
"id": 6,
"max_bytes": 10485760,
"include_paths": [
"http://www\\.lucidimagination\\.com/.*"
],
"collect_links": true,
"exclude_paths": [
"http://www\\.lucidimagination\\.com/blog/tag/.*",
"http://www\\.lucidimagination\\.com/search\\?.*"
],
"mapping": {
"multiVal": {
"fileSize": true,
"body": true,
"author": true,
"title": true,
"keywords": true,
"subject": true,
"description": false,
"fileName": true,
"dateCreated": false,
"attr": true,
"creator": true
},
"defaultField": null,
"mappings": {
"slide-count": "pageCount",
"content-type": "mimeType",
"body": "body",
"slides": "pageCount",
"subject": "subject",
"plaintextmessagecontent": "body",
"lastmodifiedby": "author",
"content-encoding": "characterSet",
"type": null,
"date": null,
"creator": "creator",
"author": "author",
"title": "title",
"mimetype": "mimeType",
"created": "dateCreated",
"plaintextcontent": "body",
"pagecount": "pageCount",
"contentcreated": "dateCreated",
"description": "description",
"contributor": "author",
"name": "title",
"filelastmodified": "lastModified",
"fullname": "author",
"fulltext": "body",
"messagesubject": "title",
"last-modified": "lastModified",
"keyword": "keywords",
"contentlastmodified": "lastModified",
"last-printed": null,
"links": null,
"batch_id": "batch_id",
"crawl_uri": "crawl_uri",
"filesize": "fileSize",
"page-count": "pageCount",
"content-length": "fileSize",
"filename": "fileName"
},
"dynamicField": "attr",
"types": {
"filesize": "INT",
"pagecount": "INT",
"lastmodified": "DATE",
"datecreated": "DATE",
"date": "DATE"
},
"uniqueKey": "id",
"datasourceField": "data_source"
},
"collection": "collection1",
"type": "web",
"url": "http://www.lucidimagination.com/",
"crawler": "lucid.aperture",
"bounds": "tree",
"category": "Web",
"name": "Lucid Imagination Website",
"crawl_depth": 2
}
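The same create call can be prepared from Python using only the standard library. The sketch below builds the POST request without sending it; the helper name `build_create_request` and the base URL are assumptions, and the client-side removal of "id" simply mirrors the documented behavior that the server ignores it.

```python
import json
import urllib.request


def build_create_request(base_url, collection, fields):
    """Build (but do not send) a POST request that creates a data source."""
    # The ID is assigned by the server, so it is dropped from the payload.
    payload = {key: value for key, value in fields.items() if key != "id"}
    return urllib.request.Request(
        url=f"{base_url}/api/collections/{collection}/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request(
    "http://localhost:8888",
    "collection1",
    {
        "name": "Lucid Imagination Website",
        "type": "web",
        "crawler": "lucid.aperture",
        "url": "http://www.lucidimagination.com/",
        "crawl_depth": 2,
    },
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; a successful creation is expected to return status 201 with the new data source as JSON.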
Get Data Source Details
GET /api/collections/collection/datasources/id
Note: The only way to find the ID of a data source is to store it on creation or to use the API call described above to get a list of data sources.
Input
Path Parameters
| Key | Description |
|---|---|
| collection | The collection name |
| id | The data source ID |
Query Parameters
None
Input content
None
Output
Output Content
Fields returned are listed in the section on getting a list of data sources.
Return Codes
200: success ok
404: not found
Examples
Get all of the parameters for data source 6, created in the previous step.
Input
curl 'http://localhost:8888/api/collections/collection1/datasources/6'
Output
{
"id": 6,
"max_bytes": 10485760,
"include_paths": [
"http://www\\.lucidimagination\\.com/.*"
],
"collect_links": true,
"exclude_paths": [
"http://www\\.lucidimagination\\.com/blog/tag/.*",
"http://www\\.lucidimagination\\.com/search\\?.*"
],
"mapping": {
"multiVal": {
"fileSize": true,
"body": true,
"author": true,
"title": true,
"keywords": true,
"subject": true,
"description": false,
"fileName": true,
"dateCreated": false,
"attr": true,
"creator": true
},
"defaultField": null,
"mappings": {
"slide-count": "pageCount",
"content-type": "mimeType",
"body": "body",
"slides": "pageCount",
"subject": "subject",
"plaintextmessagecontent": "body",
"lastmodifiedby": "author",
"content-encoding": "characterSet",
"type": null,
"date": null,
"creator": "creator",
"author": "author",
"title": "title",
"mimetype": "mimeType",
"created": "dateCreated",
"plaintextcontent": "body",
"pagecount": "pageCount",
"contentcreated": "dateCreated",
"description": "description",
"contributor": "author",
"name": "title",
"filelastmodified": "lastModified",
"fullname": "author",
"fulltext": "body",
"messagesubject": "title",
"last-modified": "lastModified",
"keyword": "keywords",
"contentlastmodified": "lastModified",
"last-printed": null,
"links": null,
"batch_id": "batch_id",
"crawl_uri": "crawl_uri",
"filesize": "fileSize",
"page-count": "pageCount",
"content-length": "fileSize",
"filename": "fileName"
},
"dynamicField": "attr",
"types": {
"filesize": "INT",
"pagecount": "INT",
"lastmodified": "DATE",
"datecreated": "DATE",
"date": "DATE"
},
"uniqueKey": "id",
"datasourceField": "data_source"
},
"collection": "collection1",
"type": "web",
"url": "http://www.lucidimagination.com/",
"crawler": "lucid.aperture",
"bounds": "tree",
"category": "Web",
"name": "Lucid Imagination Website",
"crawl_depth": 2
}
Update a Data Source
PUT /api/collections/collection/datasources/id
Input
Path Parameters
| Key | Description |
|---|---|
| collection | The collection name |
| id | The data source ID |
Query Parameters
None
Input content
JSON block with either all fields or just those that need updating. Data source type, crawler type, and ID cannot be updated. Other fields are listed in the section on getting a list of data sources.
Output
Output Content
None
Return Codes
204: success no content
Examples
Change the web data source so that it crawls three levels instead of just two:
Input
curl -X PUT -H 'Content-type: application/json' -d '
{
"crawl_depth": 3
}' 'http://localhost:8888/api/collections/collection1/datasources/6'
Output
None. (Check properties to confirm changes.)
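A partial update can also be prepared from Python. Because the data source type, crawler type, and ID cannot be updated, the sketch below rejects those fields before building the PUT request. The helper name, the `IMMUTABLE_FIELDS` constant, and the base URL are illustrative assumptions; the request is built but not sent.

```python
import json
import urllib.request

# Fields the documentation says cannot be changed after creation.
IMMUTABLE_FIELDS = {"id", "type", "crawler"}


def build_update_request(base_url, collection, datasource_id, changes):
    """Build (but do not send) a PUT request carrying only the changed fields."""
    blocked = IMMUTABLE_FIELDS.intersection(changes)
    if blocked:
        raise ValueError(f"fields cannot be updated: {sorted(blocked)}")
    return urllib.request.Request(
        url=f"{base_url}/api/collections/{collection}/datasources/{datasource_id}",
        data=json.dumps(changes).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_update_request("http://localhost:8888", "collection1", 6, {"crawl_depth": 3})
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen` is expected to yield status 204 with no body, so a follow-up GET on the same URL is the way to confirm the change.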
Delete a Data Source
DELETE /api/collections/collection/datasources/id
Input
Path Parameters
| Key | Description |
|---|---|
| collection | the collection name |
| id | The data source ID |
Query Parameters
None
Input content
None
Output
Output Content
None
Return Codes
204: success no content
404: not found
Examples
Input
curl -X DELETE -H 'Content-type: application/json' 'http://localhost:8888/api/collections/collection1/datasources/13'
Output
None. Check the listing of data sources to confirm deletion.