This description is then used to create a crawl job to be executed by a specific crawler implementation (these are internally called "crawler controllers").
At present, LWE comes with the following built-in crawler controllers (referred to by their symbolic names given in parentheses) that support the following kinds of data sources:
* Aperture-based crawlers (lucid.aperture):
** Local file system
** Web
* DataImportHandler-based JDBC crawler (lucid.jdbc):
** JDBC database
* SolrXML crawler (lucid.solrxml):
** Solr XML files
* Google Connector Manager-based crawler (lucid.gcm):
** Microsoft SharePoint (Microsoft Office SharePoint Server 2007, Microsoft Windows SharePoint Services 3.0, SharePoint 2010)
* Remote file system and pseudo-file system crawler (lucid.fs):
** SMB / CIFS (Windows sharing) filesystem
** Hadoop Distributed File System (HDFS)
** Amazon S3 file system (also known as "S3 native")
** HDFS over Amazon S3
** FTP pseudo-file system
** Kosmos File System (KFS)
* External crawler for indexing documents crawled by an external process (lucid.external):
** External data source
{toc}
h2. API Entry Points
{{[/api/collections/_collection_/datasources|#api1]}}{{:}} [list|#api1] or [create|#api2] data sources in a particular collection
{{[/api/collections/_collection_/datasources/_id_|#api3]}}{{:}} [update|#api4], [remove|#api5], or [get details|#api3] for a particular data source
{anchor:api1}
h2. Get a List of Data Sources
!Rest API^bullet.jpg! {{GET /api/collections/}}{{{}{_}collection{_}{}}}{{/datasources}}
h4. {bgcolor:#FEECC4}{*}Input{*}{bgcolor}
*Path Parameters*
|| Key || Description ||
| collection | The collection name |
*Query Parameters*
None
h4. {bgcolor:#FEECC4}{*}Output{*}{bgcolor}
*Output Content*
{anchor:fields}
A JSON map of fields to values. The exact set of fields depends on the kind of data source. Commonly used kinds of data sources use the following symbolic names: *file* (files on a local file system), *web* (HTTP/HTTPS web sites), *jdbc* (JDBC databases), *solrxml* (files in Solr XML format), and *sharepoint* (Microsoft SharePoint). All return:
|| Key || Type || Description ||
| id | 32-bit integer | The numeric ID for this data source. |
| type | string | The type of this data source: file, web, jdbc, solrxml, sharepoint, and so on |
| crawler | string | Crawler implementation that handles this type of data source: lucid.aperture, lucid.fs, lucid.jdbc, and so on |
| collection | string | The name of the document collection that documents will be indexed into. |
| name | string | A human-readable name for this data source. |
The output also includes the field mapping for the data source, which is modifiable as part of the regular data source update API. The data source key "mapping" contains a JSON map with the following keys and values:
|| Key || Type || Description||
| mappings| JSON string-string | A map where keys are case-insensitive names of the original metadata key names, and values are case-sensitive names of fields that make sense in the current schema. These target field names are verified against the current schema and if they are not valid these mappings are removed. Please note that null target names are allowed, which means that source fields with such mappings will be discarded.|
| multiVal| JSON string-boolean | A map of target field names that is automatically initialized from the schema based on the target field's multiplicity (multiValued field attribute).|
| types | JSON string-string | A map pre-initialized from the current schema. Additional validation can be performed on fields with declared non-string types. Currently supported types are DATE, INT, STRING. If not specified, fields are assumed to have the type STRING.|
| defaultField | string | The field name to use if source name doesn't match any mapping. If null, then dynamicField will be used, and if that is null too then the original name will be returned.|
| dynamicField | string | If not null then source names without specific mappings will be mapped to dynamicField_sourceName, after some cleanup of the source name (non-letter characters are replaced with underscore).|
| uniqueKey| string | Defines the name of the field in the current schema that is a unique key. Filled in from the current schema.|
|datasourceField| string | A prefix for index fields that are needed for LucidWorks faceting and data source management.|
|literals| JSON string-string | An optional map that can specify static pairs of keys and values to be added to output documents. |
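The resolution order described for {{mappings}}, {{defaultField}}, and {{dynamicField}} can be sketched as follows. This is only an illustration of the rules in the table, not LWE's implementation: the function name {{resolve_field}} is hypothetical, and "non-letter characters" is assumed to mean anything outside ASCII A-Z and a-z.

```python
import re


def resolve_field(source_name, mapping):
    """Return the target field name, or None if the source field is discarded.

    `mapping` is the JSON map stored under the data source key "mapping".
    """
    # Keys in "mappings" are case-insensitive, so compare in lower case.
    mappings = {k.lower(): v for k, v in mapping.get("mappings", {}).items()}
    key = source_name.lower()
    if key in mappings:
        # An explicit null target means the source field is discarded.
        return mappings[key]

    # No specific mapping: fall back to defaultField first.
    if mapping.get("defaultField") is not None:
        return mapping["defaultField"]

    # Then dynamicField_sourceName, after cleaning up the source name.
    # Assumption: "non-letter" means not an ASCII letter.
    dynamic = mapping.get("dynamicField")
    if dynamic is not None:
        cleaned = re.sub(r"[^A-Za-z]", "_", source_name)
        return f"{dynamic}_{cleaned}"

    # Neither fallback is set: the original name is returned.
    return source_name
```

With {{dynamicField}} set to {{attr}} and no {{defaultField}}, an unmapped source field such as {{x-custom 1}} would resolve to {{attr_x_custom__}} under these assumptions.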
The following fields are optional and are supported across many data source types:
|| Key || Type || Description ||
|commitWithin| integer | Number of milliseconds that defines the maximum interval between commits. |
|commitOnFinish| boolean | When true (default) then commit will be invoked at the end of crawl.|
The following fields control the batch processing, and are also optional. Note: some crawler controllers don't support batch processing, or support only a subset of options.
|| Key || Type || Description ||
|parsing| boolean | (default is true) When true then crawlers will parse rich formats immediately. When false then other processing is skipped and raw input documents are stored in a batch.|
|indexing| boolean | (default is true) When true then parsed documents will be sent immediately for indexing. When false then parsed documents will be stored in a batch.|
|caching| boolean | (default is false) When true then both raw and parsed documents will always be stored in a batch, in addition to any other requested processing. If false then batch is not created and documents are not preserved unless as a result of setting other options above.|
Below are specific fields for each data source type.
*lucid.aperture / File system:*
|| Key || Type || Description ||
| path | string | The path of the directory to start reading from. Paths should be entered as the complete directory path or they will be interpreted as relative to {{$LWE_HOME}}. On Unix systems, this means the path entered should start at root / level; on Windows, the drive letter and full path should be used (such as {{C:\path}}). Various types of relative paths, such as ../ or \~/, are not supported. |
| follow_links | boolean | Indicates whether to follow symbolic links in the file system. |
| bounds | string | Either "tree" to limit the crawl to a strict subtree, or "none" for no limits. |
| include_paths | list of strings | Regex patterns to match on full URLs of files. If not empty then at least one pattern must match to include a file. |
| exclude_paths | list of strings | Regex patterns that URLs must not match. If not empty then a file is excluded if any pattern matches its URL. |
| crawl_depth | 32-bit integer | How many path levels to descend |
\\
{code:title=Example file data source|borderStyle=solid|borderColor=#666666}
{
"crawl_depth": 5,
"follow_links": true,
"name": "LucidWorks Documentation",
"path": "D:\\lwe\\docs\\lucidworks",
"type": "file",
"crawler": "lucid.aperture"
}
{code}
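The {{include_paths}} and {{exclude_paths}} rules can be sketched as a small filter. This is a hedged illustration: the function name {{is_included}} is hypothetical, Python's {{re}} module stands in for whatever regex engine the crawler uses, and patterns are assumed to match against the whole URL.

```python
import re


def is_included(url, include_paths, exclude_paths):
    """Apply the include/exclude rules from the table to one URL."""
    # A file is excluded if any exclude pattern matches its URL.
    if any(re.fullmatch(pattern, url) for pattern in exclude_paths):
        return False
    # If include_paths is not empty, at least one pattern must match.
    if include_paths and not any(re.fullmatch(p, url) for p in include_paths):
        return False
    return True
```

Empty lists impose no restriction, so a data source with neither property set accepts every URL the crawl reaches.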
*lucid.aperture / Web:*
|| Key || Type || Description ||
| url | string | The URL that serves as the crawl seed |
| bounds | string | Either "tree" to limit the crawl to a strict subtree, or "none" for no limits. |
| include_paths | list of strings | Regex patterns to match on full URLs of files. If not empty then at least one pattern must match to include a file. |
| exclude_paths | list of strings | Regex patterns that URLs must not match. If not empty then a file is excluded if any pattern matches its URL. |
| crawl_depth | 32-bit integer | The maximum number of crawl cycles (hops) from the starting URL. |
\\
h5. Robots.txt
This data source obeys most of the [robots.txt standard|http://www.robotstxt.org/], with the exception of the Crawl-Delay directive.
h5. HTTP Proxy
This data source supports communication via an HTTP proxy, either open or authenticated. The following data source properties determine the use of the proxy:
|| Key || Type || Description ||
| proxyHost | string | host name of the proxy. If null or absent then direct access is assumed, and all other proxy-related parameters are ignored. |
| proxyPort | string | port number of the proxy|
| proxyUsername | string | optional username credential for the proxy|
| proxyPassword | string | optional password credential for the proxy|
\\
h5. Authentication
Three types of HTTP authentication methods are supported: Basic, Digest and NTLMv1. NTLMv2 authentication is not supported in this release.
A single data source may have multiple sets of credentials for different sites, or for different realms within a single site. This authentication data is passed within the "auth" property of the datasource, as a list of JSON maps, each map with the following properties:
|| Key || Type || Description ||
|host|string| host name (or host:port) where this authentication should be used; may be null to indicate any host|
|realm|string| HTTP Realm where this authentication should be used; may be null to indicate any realm|
|username|string|user name; must not be null or empty|
|password|string|password; must not be null|
The HTTP connector first tries to access resources without any authentication. When it receives an HTTP "401 Unauthorized" response, it selects the authentication method (Basic, Digest or NTLMv1) automatically and chooses the closest matching authentication tuple from the ones configured for the data source.
\\
{code:title=Example web data source|borderStyle=solid|borderColor=#666666}
{
"id": 2,
"collect_links": true,
"crawl_depth": 2,
"exclude_paths": [
"http://www\\.lucidimagination\\.com/blog/tag/.*",
"http://www\\.lucidimagination\\.com/search\\?.*"
],
"include_paths": [
"http://www\\.lucidimagination\\.com/.*"
],
"auth": [
{
"host": "www.lucidimagination.com:443",
"realm": "Test realm",
"username": "user1",
"password": "test1"
},
{
"host": "www.lucidimagination.com",
"realm": null,
"username": "user2",
"password": "test2"
}
],
"name": "Lucid Imagination Website",
"type": "web",
"crawler": "lucid.aperture",
"url": "http://www.lucidimagination.com/"
}
{code}
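The text does not define "closest matching", so the following is only one plausible reading, sketched under stated assumptions: a null {{host}} or {{realm}} matches anything, an exact {{host:port}} match outranks a bare host match, a realm match breaks ties, and the first configured entry wins among equals. The function name {{select_credentials}} and the scoring weights are hypothetical.

```python
def select_credentials(auth_list, host, port, realm):
    """Pick the most specific entry from the data source's "auth" list."""
    best, best_score = None, -1
    for entry in auth_list:
        score = 0
        entry_host = entry.get("host")
        if entry_host is not None:
            if entry_host == f"{host}:{port}":
                score += 4  # exact host:port match (assumed most specific)
            elif entry_host == host:
                score += 2  # host-only match
            else:
                continue  # configured for a different host
        entry_realm = entry.get("realm")
        if entry_realm is not None:
            if entry_realm != realm:
                continue  # configured for a different realm
            score += 1
        if score > best_score:
            best, best_score = entry, score
    return best
```

Applied to the example above, a challenge from {{www.lucidimagination.com:443}} in realm {{Test realm}} would select {{user1}}, while any other realm or port on that host would fall back to {{user2}}.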
*lucid.jdbc / JDBC:*
|| Key || Type || Description ||
| driver | string | The class name of the JDBC driver |
| username | string | The DB username |
| password | string | The DB password |
| url | string | The URL of the database instance |
| sql_select_statement | string | The select statement to use to generate data |
| primary_key | string | The column name of the primary key |
| delta_sql_query | string | The query used to obtain updates |
| nested_queries | list of strings | The list of additional queries to obtain data from |
\\
{code:title=Example jdbc data source|borderStyle=solid|borderColor=#666666}
{
"id": 2,
"name": "Test database",
"type": "jdbc",
"crawler": "lucid.jdbc",
"driver": "com.mysql.jdbc.Driver",
"username": "root",
"password": "pass",
"url": "jdbc:mysql://localhost/test",
"sql_select_statement": "select * from document",
"primary_key": "id",
"delta_sql_query": "select id from document where last_modified > $",
"nested_queries": ["select category from document_category where doc_id=$",
"select tag from document_tag where doc_id=$"]
}
{code}
*lucid.solrxml / Solr XML file*
|| Key || Type || Description ||
| file | string | The name of the file to read, or directory containing files to read. Paths should be entered as the complete directory path or they will be interpreted as relative to {{$LWE_HOME}}. On Unix systems, this means the path entered should start at root / level; on Windows, the drive letter and full path should be used (such as {{C:\path}}). Various types of relative paths, such as ../ or \~/, are not supported. |
| include_datasource_metadata | boolean | Option to add data_source and data_source_type fields to each document in addition to fields found in the file |
| include_paths | list of strings | An array of URL patterns to include |
| exclude_paths | list of strings | An array of URL patterns to exclude |
\\
{code:title=Example solr xml data source|borderStyle=solid|borderColor=#666666}
{
"id": 2,
"name": "Solr example XML documents",
"type": "solrxml",
"crawler": "lucid.solrxml",
"file": "D:\\lucene_solr\\solr\\example\\exampledocs",
"include_paths": [ ".*\\.xml" ],
"include_datasource_metadata": true
}
{code}
*lucid.gcm / SharePoint:*
|| Key || Type || Description ||
| sharepointUrl | string | The fully qualified URL for the SharePoint site |
| username | string | Username with authorization to crawl the SharePoint repository |
| password | string | Password for the username above |
| domain | string | The domain where the user is authenticated |
| kdcserver | string | Kerberos KDC Hostname |
| mySiteBaseURL | string | Used for MOSS 2007 only. The MySite base URL is used to determine the complete MySite URL, so [http://server.domain/personal/administrator/default.aspx] would be entered as [http://server.domain]. The credentials provided will allow LucidWorks Enterprise to complete the MySite URL and crawl the content |
| includedURls (as in, U-R-lower-case L) | string | The directories on the server that should be crawled for indexing. If left blank, all paths will be followed, even if they lead away from the original URL entered. To limit crawling to a specific site, repeat the URL in this site with a regular expression to indicate all pages from the site. The SharePoint data source uses GNU regular expressions, which may be different from the Java regular expressions used for Web and file system data sources. More information on the syntax can be found in the GNU regular expression [documentation|http://www.gnu.org/software/gawk/manual/html_node/Regexp.html]. |
| excludedURls (as in, U-R-lower-case L) | string | Directories on the server that should not be crawled and that should be excluded from the index. The same regular expression syntax can be used to specify Excluded URLs as is used for Included URLs. |
| useSPSearchVisibility | string | Use SharePoint search visibility options |
| aliases | map of string to string | Allows mapping of source URL patterns to aliases that are used to rewrite URLs before indexing |
| authorization | string | use "content" |
\\
{code:title=Example SharePoint repository data source|borderStyle=solid|borderColor=#666666}
{
"id": 2,
"name": "Sharepoint crawl",
"type": "sharepoint",
"crawler": "lucid.gcm",
"sharepointUrl": "http://my.sharepoint.host.com/",
"username": "user",
"password": "secret",
"domain": "myDomain",
"includedURls": ".*",
"useSPSearchVisibility": "true",
"authorization": "content"
}
{code}
*lucid.fs / Remote or Pseudo Filesystems*
|| Key || Type || Description ||
| url | string | Root URL formats vary by filesystem type: \\ \\ For CIFS (Windows Shares) filesystems, the root URL includes the protocol ({{smb}}), the host address, and the path to crawl: {{smb://_host_/path/to/crawl}}. \\ \\ For FTP, the root URL is a fully qualified FTP URL, with optional username and password parameters. Credentials can be passed as a part of the URL, or submitted as {{username}} and {{password}} properties. For example, {{ftp://<username>:<password>@<hostname>:<port>/path/to/crawl}}. \\ \\ For HDFS (Hadoop), the root URL is a fully-qualified Hadoop file system URL, including the protocol ({{hdfs}}), host name and port of the namenode, and path of the target resource to crawl: {{hdfs://namenode:9000/path/to/crawl}}. \\ \\ For S3n (Amazon) and S3 (Hadoop over Amazon), the root URL is a fully-qualified URL that starts with the {{s3n}} protocol, the name of the bucket, and the path inside the bucket. Both {{AccessKeyId}} and {{SecretAccessKey}} are needed: submit {{AccessKeyId}} as the username and {{SecretAccessKey}} as the password. You can also pass these credentials as part of the URL in the following format: {{s3n://<username>:<password>@bucket/path}}. However, Amazon S3 credentials often contain characters that are not allowed in URLs. In that case, you must pass these credentials by setting the "username" and "password" properties explicitly. |
|type|string| One of supported data source types, MUST be consistent with the root URL's protocol. The following values are supported: file, smb, hdfs, s3n, s3, kfs|
| follow_links | boolean | Indicates whether to follow symbolic links in the file system. |
| bounds | string | Either "tree" to limit the crawl to a strict subtree, or "none" for no limits. |
| exclude_paths | list of strings | Regex patterns that URLs must not match. If not empty then a file is excluded if any pattern matches its URL. |
| crawl_depth | 32-bit integer | How many path levels to descend |
{code:title=Example remote filesystem (CIFS) source|borderStyle=solid|borderColor=#666666}
curl -H 'Content-type: application/json' -d '
{
"name":"<datasource name>",
"type":"smb",
"crawler":"lucid.fs",
"url":"smb://<host>/<path>/",
"username":"username",
"password":"password",
"windowsdomain":"<your windows domain>"
}' 'http://localhost:8888/api/collections/<collection>/datasources'
{code}
*Response Codes*
200: Success OK
h4. {bgcolor:#FEECC4}{*}Examples{*}{bgcolor}
*Input*
{code:borderStyle=solid|borderColor=#666666}
curl 'http://localhost:8888/api/collections/collection1/datasources'
{code}
*Output*
{code:borderStyle=solid|borderColor=#666666}
[
{
"max_bytes": 10485760,
"include_paths": [],
"collect_links": true,
"exclude_paths": [],
"mapping": {
"multiVal": {
"fileSize": true,
"author": true,
"body": true,
"title": true,
"keywords": true,
"description": false,
"subject": true,
"fileName": true,
"dateCreated" : false,
"attr": true,
"creator": true
},
"defaultField": null,
"mappings": {
"content-type": "mimeType",
"slide-count": "pageCount",
"body": "body",
"slides": "pageCount",
"subject": "subject",
"plaintextmessagecontent": "body",
"lastmodifiedby": "author",
"content-encoding": "characterSet",
"date": null,
"type": null,
"creator": "creator",
"author": "author",
"mimetype": "mimeType",
"title": "title",
"plaintextcontent": "body",
"created": "dateCreated",
"contributor": "author",
"description": "description",
"contentcreated": "dateCreated",
"pagecount": "pageCount",
"name": "title",
"filelastmodified": "lastModified",
"fullname": "author",
"fulltext": "body",
"last-modified": "lastModified",
"messagesubject": "title",
"keyword": "keywords",
"contentlastmodified": "lastModified",
"last-printed": null,
"links": null,
"batch_id": "batch_id",
"crawl_uri": "crawl_uri",
"filesize": "fileSize",
"page-count": "pageCount",
"content-length": "fileSize",
"filename": "fileName"
},
"dynamicField": "attr",
"types": {
"filesize": "INT",
"pagecount" : "INT",
"lastmodified": "DATE",
"datecreated": "DATE",
"date": "DATE"
},
"uniqueKey": "id",
"datasourceField": "data_source"
},
"collection": "collection1",
"type": "web",
"url" : "http://www.grantingersoll.com/",
"crawler": "lucid.aperture",
"id": 1,
"bounds": "tree",
"category": "Web",
"name": "Sample Site",
"crawl_depth": 2
},
{
"max_bytes": 2147483647,
"include_paths": [],
"exclude_paths": [],
"mapping": {
"multiVal": {
"fileSize": true,
"body": true,
"author": true,
"title": true,
"keywords": true,
"subject": true,
"description": false,
"fileName": true,
"dateCreated": false,
"attr": true,
"creator": true
},
"defaultField": null,
"mappings": {
"slide-count": "pageCount",
"content-type": "mimeType",
"body": "body",
"slides": "pageCount",
"subject": "subject",
"plaintextmessagecontent": "body",
"lastmodifiedby": "author",
"content-encoding": "characterSet",
"type": null,
"date": null,
"creator": "creator",
"author": "author",
"title": "title",
"mimetype": "mimeType",
"created": "dateCreated",
"plaintextcontent": "body",
"pagecount": "pageCount",
"contentcreated": "dateCreated",
"description": "description",
"contributor": "author",
"name": "title",
"filelastmodified": "lastModified",
"fullname": "author",
"fulltext": "body",
"messagesubject": "title",
"last-modified": "lastModified",
"keyword": "keywords",
"contentlastmodified": "lastModified",
"last-printed": null,
"links": null,
"batch_id": "batch_id",
"crawl_uri": "crawl_uri",
"filesize": "fileSize",
"page-count": "pageCount",
"content-length": "fileSize",
"filename": "fileName"
},
"dynamicField": "attr",
"types": {
"filesize": "INT",
"pagecount": "INT",
"lastmodified": "DATE",
"datecreated": "DATE",
"date": "DATE"
},
"uniqueKey": "id",
"datasourceField": "data_source"
},
"follow_links": true,
"collection": "collection1",
"type": "file",
"crawler": "lucid.aperture",
"id": 2,
"bounds": "tree",
"category": "FileSystem",
"name": "Small Test Collection",
"path": "C:\\Users\\Nick\\Documents\\Business",
"crawl_depth": 2147483647
}
]{code}
{anchor:api2}
h2. Create a Data Source
!Rest API^bullet.jpg! {{POST /api/collections/}}{{{}{_}collection{_}{}}}{{/datasources}}
h4. {bgcolor:#FEECC4}{*}Input{*}{bgcolor}
*Path Parameters*
|| Key || Description ||
| collection | The collection name |
*Query Parameters*
None
*Input content*
JSON block with all fields. The ID field, if present, will be ignored. See fields in first [GET: Output Content|#fields].
h4. {bgcolor:#FEECC4}{*}Output{*}{bgcolor}
*Output Content*
JSON representation of new data source. For fields see [GET: Output Content|#fields].
*Return Codes*
201: created
h4. {bgcolor:#FEECC4}{*}Examples{*}{bgcolor}
Create a data source that includes the content of the Lucid Imagination web site. To keep the size down, only crawl two levels, and do not index the blog tag links or any search links. Also, do not wander off the site and index any external links.
*Input*
{code:borderStyle=solid|borderColor=#666666}
curl -H 'Content-type: application/json' -d '
{
"crawl_depth": 2,
"exclude_paths": [
"http://www\\.lucidimagination\\.com/blog/tag/.*",
"http://www\\.lucidimagination\\.com/search\\?.*"
],
"include_paths": [
"http://www\\.lucidimagination\\.com/.*"
],
"name": "Lucid Imagination Website",
"type": "web",
"crawler": "lucid.aperture",
"url": "http://www.lucidimagination.com/"
}' 'http://localhost:8888/api/collections/collection1/datasources'
{code}
*Output*
{code:borderStyle=solid|borderColor=#666666}
{
"id": 6,
"max_bytes": 10485760,
"include_paths": [
"http://www\\.lucidimagination\\.com/.*"
],
"collect_links": true,
"exclude_paths": [
"http://www\\.lucidimagination\\.com/blog/tag/.*",
"http://www\\.lucidimagination\\.com/search\\?.*"
],
"mapping": {
"multiVal": {
"fileSize": true,
"body": true,
"author": true,
"title": true,
"keywords": true,
"subject": true,
"description": false,
"fileName": true,
"dateCreated": false,
"attr": true,
"creator": true
},
"defaultField": null,
"mappings": {
"slide-count": "pageCount",
"content-type": "mimeType",
"body": "body",
"slides": "pageCount",
"subject": "subject",
"plaintextmessagecontent": "body",
"lastmodifiedby": "author",
"content-encoding": "characterSet",
"type": null,
"date": null,
"creator": "creator",
"author": "author",
"title": "title",
"mimetype": "mimeType",
"created": "dateCreated",
"plaintextcontent": "body",
"pagecount": "pageCount",
"contentcreated": "dateCreated",
"description": "description",
"contributor": "author",
"name": "title",
"filelastmodified": "lastModified",
"fullname": "author",
"fulltext": "body",
"messagesubject": "title",
"last-modified": "lastModified",
"keyword": "keywords",
"contentlastmodified": "lastModified",
"last-printed": null,
"links": null,
"batch_id": "batch_id",
"crawl_uri": "crawl_uri",
"filesize": "fileSize",
"page-count": "pageCount",
"content-length": "fileSize",
"filename": "fileName"
},
"dynamicField": "attr",
"types": {
"filesize": "INT",
"pagecount": "INT",
"lastmodified": "DATE",
"datecreated": "DATE",
"date": "DATE"
},
"uniqueKey": "id",
"datasourceField": "data_source"
},
"collection": "collection1",
"type": "web",
"url": "http://www.lucidimagination.com/",
"crawler": "lucid.aperture",
"bounds": "tree",
"category": "Web",
"name": "Lucid Imagination Website",
"crawl_depth": 2
}{code}
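For callers that prefer Python to curl, the same POST can be prepared with the standard library. This is a minimal sketch: the helper name {{build_create_request}} is hypothetical, and the base URL {{http://localhost:8888}} is taken from the curl examples rather than being a fixed default.

```python
import json
import urllib.parse
import urllib.request


def build_create_request(base_url, collection, datasource):
    """Prepare (but do not send) the POST that creates a data source."""
    # The ID field is ignored by the server, so it is dropped here for clarity.
    body = {key: value for key, value in datasource.items() if key != "id"}
    url = "{}/api/collections/{}/datasources".format(
        base_url.rstrip("/"), urllib.parse.quote(collection, safe="")
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )
```

Passing the returned object to {{urllib.request.urlopen}} sends the request; a 201 status indicates that the data source was created, and the response body contains its new {{id}}.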
{anchor:api3}
h2. Get Data Source Details
!Rest API^bullet.jpg! {{GET /api/collections/}}{{{}{_}collection{_}{}}}{{/datasources/}}{{{}{_}id{_}}}
{info}
Note that the only way to find the id of a data source is to either store it on creation, or use the API call referenced above to [get a list of data sources|#api1].
{info}
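If the ID was not stored at creation time, it can be recovered from the list response. The sketch below assumes the response body has already been parsed from JSON into a list of maps; the function name {{find_datasource_ids}} is hypothetical, and it returns a list because nothing in this page states that data source names are unique.

```python
def find_datasource_ids(datasources, name):
    """Return the IDs of all data sources whose "name" equals `name`."""
    return [ds["id"] for ds in datasources if ds.get("name") == name]
```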
h4. {bgcolor:#FEECC4}{*}Input{*}{bgcolor}
*Path Parameters*
|| Key || Description ||
| collection | The collection name |
| id | The data source ID |
*Query Parameters*
None
*Input content*
None
h4. {bgcolor:#FEECC4}{*}Output{*}{bgcolor}
*Output Content*
See the fields listed in the first [GET: Output Content|#fields].
*Return Codes*
200: success ok
404: not found
h4. {bgcolor:#FEECC4}{*}Examples{*}{bgcolor}
Get all of the parameters for data source 6, created in the previous step.
*Input*
{code:borderStyle=solid|borderColor=#666666}
curl 'http://localhost:8888/api/collections/collection1/datasources/6'
{code}
*Output*
{code:borderStyle=solid|borderColor=#666666}
{
"id": 6,
"max_bytes": 10485760,
"include_paths": [
"http://www\\.lucidimagination\\.com/.*"
],
"collect_links": true,
"exclude_paths": [
"http://www\\.lucidimagination\\.com/blog/tag/.*",
"http://www\\.lucidimagination\\.com/search\\?.*"
],
"mapping": {
"multiVal": {
"fileSize": true,
"body": true,
"author": true,
"title": true,
"keywords": true,
"subject": true,
"description": false,
"fileName": true,
"dateCreated": false,
"attr": true,
"creator": true
},
"defaultField": null,
"mappings": {
"slide-count": "pageCount",
"content-type": "mimeType",
"body": "body",
"slides": "pageCount",
"subject": "subject",
"plaintextmessagecontent": "body",
"lastmodifiedby": "author",
"content-encoding": "characterSet",
"type": null,
"date": null,
"creator": "creator",
"author": "author",
"title": "title",
"mimetype": "mimeType",
"created": "dateCreated",
"plaintextcontent": "body",
"pagecount": "pageCount",
"contentcreated": "dateCreated",
"description": "description",
"contributor": "author",
"name": "title",
"filelastmodified": "lastModified",
"fullname": "author",
"fulltext": "body",
"messagesubject": "title",
"last-modified": "lastModified",
"keyword": "keywords",
"contentlastmodified": "lastModified",
"last-printed": null,
"links": null,
"batch_id": "batch_id",
"crawl_uri": "crawl_uri",
"filesize": "fileSize",
"page-count": "pageCount",
"content-length": "fileSize",
"filename": "fileName"
},
"dynamicField": "attr",
"types": {
"filesize": "INT",
"pagecount": "INT",
"lastmodified": "DATE",
"datecreated": "DATE",
"date": "DATE"
},
"uniqueKey": "id",
"datasourceField": "data_source"
},
"collection": "collection1",
"type": "web",
"url": "http://www.lucidimagination.com/",
"crawler": "lucid.aperture",
"bounds": "tree",
"category": "Web",
"name": "Lucid Imagination Website",
"crawl_depth": 2
}{code}
{anchor:api4}
h2. Update a Data Source
!Rest API^bullet.jpg! {{PUT /api/collections/}}{{{}{_}collection{_}{}}}{{/datasources/}}{{{}{_}id{_}}}
h4. {bgcolor:#FEECC4}{*}Input{*}{bgcolor}
*Path Parameters*
|| Key || Description ||
| collection | The collection name |
| id | The data source ID |
*Query Parameters*
None
*Input content*
JSON block with either all fields or just those that need updating. Data source type, crawler type, and ID cannot be updated. See fields at [GET: Output Content|#fields]
h4. {bgcolor:#FEECC4}{*}Output{*}{bgcolor}
*Output Content*
None
*Return Codes*
204: success no content
h4. {bgcolor:#FEECC4}{*}Examples{*}{bgcolor}
Change the web data source so that it crawls three levels instead of just two:
*Input*
{code:borderStyle=solid|borderColor=#666666}
curl -X PUT -H 'Content-type: application/json' -d '
{
"crawl_depth": 3
}' 'http://localhost:8888/api/collections/collection1/datasources/6'
{code}
*Output*
None. (Check properties to confirm changes.)
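Because data source type, crawler type, and ID cannot be updated, a client may want to reject such changes before sending the PUT. The following is a sketch under that assumption; the helper name {{build_update_request}} and the client-side check are illustrative and not part of the API.

```python
import json
import urllib.parse
import urllib.request

# Fields this page says cannot be changed by an update.
IMMUTABLE_FIELDS = {"id", "type", "crawler"}


def build_update_request(base_url, collection, datasource_id, changes):
    """Prepare (but do not send) a PUT containing only the changed fields."""
    blocked = IMMUTABLE_FIELDS.intersection(changes)
    if blocked:
        raise ValueError("cannot update: " + ", ".join(sorted(blocked)))
    url = "{}/api/collections/{}/datasources/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(collection, safe=""),
        datasource_id,
    )
    return urllib.request.Request(
        url,
        data=json.dumps(changes).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="PUT",
    )
```

A successful update returns 204 with no body, so a follow-up GET on the same URL is the way to confirm that the new values were applied.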
{anchor:api5}
h2. Delete a Data Source
!Rest API^bullet.jpg! {{DELETE /api/collections/}}{{{}{_}collection{_}{}}}{{/datasources/}}{{{}{_}id{_}}}
h4. {bgcolor:#FEECC4}{*}Input{*}{bgcolor}
*Path Parameters*
|| Key || Description ||
| collection | The collection name |
| id | The data source ID |
*Query Parameters*
None
*Input content*
None
h4. {bgcolor:#FEECC4}{*}Output{*}{bgcolor}
*Output Content*
None
*Return Codes*
204: success no content
404: not found
h4. {bgcolor:#FEECC4}{*}Examples{*}{bgcolor}
*Input*
{code:borderStyle=solid|borderColor=#666666}
curl -X DELETE -H 'Content-type: application/json' 'http://localhost:8888/api/collections/collection1/datasources/13'
{code}
*Output*
None. Check the listing of data sources to confirm deletion.