The Data Source Status API reports whether a data source is currently being processed. It returns the same information as the Data Source Jobs API, but is intended for checking the progress of a job at intervals while it runs.
API Entry Points
/api/collections/collection/datasources/id/status: Get this data source's status
Get the Status of a Data Source
GET /api/collections/collection/datasources/id/status
Input
Path Parameters
| Key | Description |
|---|---|
| collection | The collection name. |
| id | The data source ID. |
Query Parameters
None
Output
Output Content
| Key | Description |
|---|---|
| batch_job | If false, the content crawled will be added to the index. |
| crawl_started | The date and time the crawl started. |
| crawl_state | The current state of the job. Possible values are FINISHED, STOPPED, or RUNNING. |
| crawl_stopped | The date and time the crawl stopped. |
| id | The unique ID of the data source. |
| job_id | The ID of the job itself. |
| num_access_denied | The number of documents that could not be accessed because of file permissions or wrong authentication. |
| num_deleted | The number of documents that were removed from the index. |
| num_failed | The number of documents that could not be parsed. |
| num_filter_denied | The number of documents that could not be accessed because of inclusion or exclusion rules. |
| num_new | The number of documents considered "new". |
| num_not_found | The number of documents the crawler expected to find (because of a link from a known document, a symlink, or a redirect) but for which the remote server responded with HTTP 404 NOT_FOUND or "file missing". |
| num_robots_denied | The number of documents that could not be crawled because of robots.txt rules. |
| num_total | The total number of documents found during the last crawl. |
| num_unchanged | The number of documents that were not changed. |
| num_updated | The number of documents that were updated. |
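As a rough illustration of how these fields might be consumed, the sketch below takes an already-decoded status response and derives the crawl duration and a few totals. The helper name `summarize_status` and the keys of the summary it returns are hypothetical and not part of the API. The timestamp format is an assumption based on the sample responses in this document.

```python
from datetime import datetime

# Assumption: timestamps look like "2012-02-06T18:40:12+0000",
# as in the sample responses in this document.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def parse_timestamp(value):
    """Parse a status timestamp, passing through None (JSON null)."""
    if value is None:
        return None
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def summarize_status(status):
    """Build a small summary from a decoded status response (hypothetical helper)."""
    started = parse_timestamp(status.get("crawl_started"))
    stopped = parse_timestamp(status.get("crawl_stopped"))

    # crawl_stopped is null while the crawl is still running,
    # so the duration is only known once both timestamps exist.
    duration = None
    if started is not None and stopped is not None:
        duration = (stopped - started).total_seconds()

    return {
        "state": status["crawl_state"],
        "duration_seconds": duration,
        "new_or_updated": status["num_new"] + status["num_updated"],
        "failed": status["num_failed"],
        "total": status["num_total"],
    }
```

For a response whose `crawl_started` is `2012-02-06T18:40:12+0000` and whose `crawl_stopped` is `2012-02-06T18:42:19+0000`, `duration_seconds` would be 127.0. While the crawl is still running it is `None`.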
Examples
Input
curl 'http://localhost:8888/api/collections/collection1/datasources/6/status'
Output
While the data source is being processed:
{
"batch_job": false,
"crawl_started": "2012-02-06T18:40:12+0000",
"crawl_state": "RUNNING",
"crawl_stopped": null,
"id": 6,
"job_id": "6",
"num_access_denied": 0,
"num_deleted": 0,
"num_failed": 2,
"num_filter_denied": 0,
"num_new": 227,
"num_not_found": 0,
"num_robots_denied": 0,
"num_total": 229,
"num_unchanged": 0,
"num_updated": 0
}
After processing is finished, and the data source is idle:
{
"batch_job": false,
"crawl_started": "2012-02-06T18:40:12+0000",
"crawl_state": "FINISHED",
"crawl_stopped": "2012-02-06T18:42:19+0000",
"id": 6,
"job_id": "6",
"num_access_denied": 0,
"num_deleted": 0,
"num_failed": 2,
"num_filter_denied": 0,
"num_new": 1099,
"num_not_found": 0,
"num_robots_denied": 0,
"num_total": 1101,
"num_unchanged": 0,
"num_updated": 0
}
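Because this endpoint is meant to be checked at intervals, a client would typically poll it until `crawl_state` is no longer `RUNNING`. The following is a minimal sketch of such a loop, not an official client. The function names, the polling interval, and the poll limit are all assumptions chosen for illustration. The host and port are taken from the curl example above.

```python
import json
import time
from urllib.request import urlopen


def fetch_status(base_url, collection, datasource_id):
    """GET the status of one data source and decode the JSON body."""
    url = f"{base_url}/api/collections/{collection}/datasources/{datasource_id}/status"
    with urlopen(url) as response:
        return json.load(response)


def wait_for_crawl(fetch, interval_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Call fetch() until crawl_state is no longer RUNNING, then return that status.

    fetch is any zero-argument callable returning a decoded status response.
    interval_seconds and max_polls are illustrative defaults, not API settings.
    """
    for _ in range(max_polls):
        status = fetch()
        if status["crawl_state"] != "RUNNING":
            return status
        sleep(interval_seconds)
    raise TimeoutError("crawl was still RUNNING after the maximum number of polls")
```

A call such as `wait_for_crawl(lambda: fetch_status("http://localhost:8888", "collection1", 6))` would block until the crawl reports FINISHED or STOPPED. Passing the fetch function in as a parameter keeps the loop independent of the HTTP call, so it can be exercised without a running server.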