Errors during crawling are recorded in the core.<date>.log file, which is located in the $LWE_HOME/data/logs directory. Serious exceptions are reported to the LucidWorksLogs collection, which you can search like any other collection. You can also view log events on the Server Log page (Status -> Server Log).
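As a rough illustration of where to look, the following Python sketch lists the core.<date>.log files under $LWE_HOME/data/logs. The helper name core_logs is hypothetical, and because the exact date format in the file name is not specified here, the sketch matches any core.*.log file and sorts by name, which may or may not be chronological.

```python
import os
from pathlib import Path

def core_logs(lwe_home):
    """Return the core.<date>.log files under <lwe_home>/data/logs, sorted by name."""
    log_dir = Path(lwe_home) / "data" / "logs"
    if not log_dir.is_dir():
        # Nothing to list if the logs directory is missing.
        return []
    # The date portion of the file name is matched with a wildcard because
    # its exact format is an unknown here.
    return sorted(log_dir.glob("core.*.log"))

lwe_home = os.environ.get("LWE_HOME")
if lwe_home:
    for path in core_logs(lwe_home):
        print(path)
else:
    print("LWE_HOME is not set")
```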
Documents may be skipped because no extractor is available for that file type, or because the file size exceeds the maximum set during crawl configuration. Skipped documents are not recorded in the LucidWorksLogs collection. They appear in the log file in a format like this:
INFO filesystem.FileSystemCrawler - File <file-URL> exceeds the maximum size specified for this data source. Skipping.
WARN No extractor for <file format>; Skipping: <document-URI>
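The two skip messages above can be picked out of a log with a short script. This is a minimal Python sketch, not part of the product: the function name find_skipped, the reason labels, and the sample lines are hypothetical, and it assumes document URIs contain no spaces. Real log lines may carry a timestamp or other prefix, so the patterns search anywhere in the line.

```python
import re

# Patterns derived from the two message formats shown above.
SIZE_SKIP = re.compile(r"File (?P<uri>\S+) exceeds the maximum size")
NO_EXTRACTOR = re.compile(r"No extractor for (?P<fmt>.+?); Skipping: (?P<uri>\S+)")

def find_skipped(lines):
    """Return (reason, uri) tuples for every skipped document found in lines."""
    skipped = []
    for line in lines:
        m = SIZE_SKIP.search(line)
        if m:
            skipped.append(("too_large", m.group("uri")))
            continue
        m = NO_EXTRACTOR.search(line)
        if m:
            skipped.append(("no_extractor", m.group("uri")))
    return skipped

# Hypothetical sample lines that follow the formats above.
sample = [
    "INFO filesystem.FileSystemCrawler - File file:/docs/big.iso exceeds "
    "the maximum size specified for this data source. Skipping.",
    "WARN No extractor for application/x-foo; Skipping: file:/docs/a.foo",
    "INFO unrelated message",
]
for reason, uri in find_skipped(sample):
    print(reason, uri)
```

To run it against a real log, pass an open file object instead of the sample list, since the function only iterates over lines.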
Possible Errors
For each of the errors below, the exact cause cannot be determined in advance. This information is provided to help you find the errors in the log file; precise troubleshooting requires information about the documents and system environment. If a document causes an error (other than being too large or the system running out of memory), it may help to isolate the document and crawl it again, to confirm that the document itself is causing the problem and not some other system error that occurred at the same time.
Each of the errors below lists the document URI. For files this is the path and filename; for websites it is the URL; for other data sources it is whatever was assigned as the document URI when the data source was configured.
Exception
WARN Exception while crawling: <document-URI> <exception-with-stack-trace>
WARN Doc failed: <exception-with-stack-trace>
WARN Doc failed: <document-URI> - cause: <exception-cause-message>
PDF files are notorious for causing exceptions during processing, primarily in file system crawls.
Out of memory
WARN File caused an Out of Memory Exception, skipping: <document-URI> <exception-with-stack-trace>
WARN Doc failed: <exception-with-stack-trace>
WARN Doc failed: <document-URI> - cause: <OOM-exception-message>
SubCrawlerException
WARN Doc failed: <exception-with-stack-trace>
WARN Doc failed: <document-URI> - cause: <exception-message>
Unknown file type
WARN Doc failed: Could not find extractor: <document-URI>
This warning appears in the logs but is not reported in the LucidWorksLogs collection.
I/O error
WARN IO Exception processing: <document-URI> <exception-with-stack-trace>
WARN Doc failed: <exception-with-stack-trace>
WARN Doc failed: <document-URI> - cause: <exception-message>
HTML/XML/XHTML parsing errors
WARN Doc failed: <exception-with-stack-trace>
WARN Doc failed: <document-URI> - cause: <exception-cause-message>
As with unknown file types, this warning appears in the logs but is not reported in the LucidWorksLogs collection.
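To get a quick overview of which of the errors above occur in a log, the message formats can be grouped by their fixed text. The following Python sketch is one possible approach under stated assumptions: the function name classify, the labels, and the sample lines are hypothetical, and "Doc failed" lines that carry only a stack trace (no "- cause:" part) are not matched.

```python
import re
from collections import Counter

# Fixed message prefixes taken from the formats listed above, mapped to
# labels. The labels are hypothetical names chosen for this sketch.
PREFIXES = [
    ("File caused an Out of Memory Exception, skipping: ", "out_of_memory"),
    ("IO Exception processing: ", "io_error"),
    ("Exception while crawling: ", "exception"),
    ("Doc failed: Could not find extractor: ", "unknown_file_type"),
]
DOC_FAILED_CAUSE = re.compile(r"Doc failed: (?P<uri>.+?) - cause: (?P<cause>.*)")

def classify(line):
    """Return (label, detail) for a recognised WARN line, else None."""
    for prefix, label in PREFIXES:
        idx = line.find(prefix)
        if idx != -1:
            # Everything after the prefix: the URI and any trailing text.
            return label, line[idx + len(prefix):].strip()
    m = DOC_FAILED_CAUSE.search(line)
    if m:
        return "doc_failed", f"{m.group('uri')} ({m.group('cause')})"
    return None

# Hypothetical sample lines that follow the formats above.
sample = [
    "WARN File caused an Out of Memory Exception, skipping: file:/docs/huge.pdf java.lang.OutOfMemoryError",
    "WARN Doc failed: file:/docs/huge.pdf - cause: Java heap space",
    "WARN Doc failed: Could not find extractor: file:/docs/a.foo",
    "WARN IO Exception processing: file:/docs/b.doc java.io.IOException",
    "INFO crawl finished",
]
counts = Counter(result[0] for result in map(classify, sample) if result)
print(counts)
```

Because several error types share the same "Doc failed: <document-URI> - cause:" format, the sketch cannot tell them apart from that line alone; the cause message or the preceding WARN line has to be inspected to distinguish them.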