
The Database crawler in LucidWorks Search does not automatically discover and index binary data you may have stored in your database (such as PDF files). However, you can configure LucidWorks to recognize and extract the binary data correctly by modifying the data source configuration file (which does not exist until you create a JDBC data source).
| For detailed information about working with JDBC data sources, see Create a New JDBC Data Source or the Data Sources API. |
After you have created a Database data source, you can find the configuration file in $LWE_HOME/data/lucid.jdbc/datasources/id/conf/dataconfig.xml, where id is the ID of the data source you created. If you are familiar with Solr, you will recognize this file as a Data Import Handler configuration file.
Follow these steps to modify the configuration file:
- Add a name attribute to the dataSource entry for the database that contains your binary data.
- Set the convertType attribute for the dataSource to false. This prevents LucidWorks from treating binary data as strings.
- Add a FieldStreamDataSource to stream the binary data to the Tika entity processor.
- Specify the dataSource name in the root entity.
- Add an entity for your FieldStreamDataSource using the TikaEntityProcessor to take the binary data from the FieldStreamDataSource, parse it, and specify a field for storing the processed data.
- Reload the Solr core to apply your configuration changes.
| After you have modified the data source configuration file, do not modify the data source from the LucidWorks Admin UI. LucidWorks automatically overwrites the convertType attribute when you do, and indexing for the modified data source will fail. |
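If you suspect the file has been overwritten, a quick check of the convertType attribute can confirm it. The following is a minimal sketch, not a LucidWorks tool: the function name jdbc_sources_with_conversion is hypothetical, and it assumes that JDBC sources are the dataSource entries that carry a driver attribute. It works on the XML text, so reading dataconfig.xml from disk is left to you.

```python
import xml.etree.ElementTree as ET


def jdbc_sources_with_conversion(xml_text):
    """Return the name of every JDBC dataSource whose convertType is not "false"."""
    root = ET.fromstring(xml_text)
    flagged = []
    for source in root.findall("dataSource"):
        # Assumption: JDBC sources are the entries with a driver attribute;
        # the FieldStreamDataSource entry has none and is skipped.
        if source.get("driver") is None:
            continue
        if source.get("convertType", "").strip().lower() != "false":
            flagged.append(source.get("name", "(unnamed)"))
    return flagged


# Illustrative input: a config whose convertType has been reset to "true".
sample = """<dataConfig>
  <dataSource name="test" convertType="true" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/test"/>
  <dataSource name="fieldReader" type="FieldStreamDataSource"/>
</dataConfig>"""

print(jdbc_sources_with_conversion(sample))  # prints ['test']
```

An empty list means every JDBC dataSource still has convertType set to false.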
Example
In this example there is a MySQL database called test containing a table called documents that contains PDF data in a column called binary_content. When the data source is first created, the data source configuration file (in $LWE_HOME/data/lucid.jdbc/datasources/id/conf/dataconfig.xml) looks like this:
<dataConfig>
  <dataSource autoCommit="true" batchSize="-1" convertType="true" driver="com.mysql.jdbc.Driver" password="admin" url="jdbc:mysql://localhost/test" user="root"/>
  <document name="items">
    <entity name="root" preImportDeleteQuery="data_source:9" query="SELECT * FROM documents" transformer="TemplateTransformer">
      <field column="data_source" template="9"/>
      <field column="data_source_type" template="Jdbc"/>
      <field column="data_source_name" template="MySQL"/>
    </entity>
  </document>
</dataConfig>
To modify this data configuration file, follow these steps:
- Add the name attribute to the dataSource and set convertType to false:
<dataSource autoCommit="true" batchSize="-1" convertType="false" driver="com.mysql.jdbc.Driver" password="admin" url="jdbc:mysql://localhost/test" user="root" name="test"/>
- Specify another dataSource called fieldReader to handle the binary data:
<dataSource name="fieldReader" type="FieldStreamDataSource" />
- Specify the data source for the root entity:
<entity name="root" preImportDeleteQuery="data_source:9" query="SELECT * FROM documents" transformer="TemplateTransformer" dataSource="test">
- Add an entity for the fieldReader data source that specifies the TikaEntityProcessor, the dataField that contains the binary data (here, the binary_content column of the root entity), and a field (body) for storing the extracted text:
<entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="root.binary_content" format="text">
  <field column="text" name="body" />
</entity>
- Reload the Solr core (restarting LucidWorks Search also does this) to apply your configuration changes.
For this example, the final configuration file looks like this:
<dataConfig>
  <dataSource autoCommit="true" batchSize="-1" convertType="false" driver="com.mysql.jdbc.Driver" password="admin" url="jdbc:mysql://localhost/test" user="root" name="test"/>
  <dataSource name="fieldReader" type="FieldStreamDataSource" />
  <document name="items">
    <entity name="root" preImportDeleteQuery="data_source:9" query="SELECT * FROM documents" transformer="TemplateTransformer" dataSource="test">
      <field column="data_source" template="9"/>
      <field column="data_source_type" template="Jdbc"/>
      <field column="data_source_name" template="MySQL"/>
      <entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="root.binary_content" format="text">
        <field column="text" name="body" />
      </entity>
    </entity>
  </document>
</dataConfig>
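If you maintain several data sources, you may prefer to script these edits rather than make them by hand. The following is a rough sketch using Python's standard xml.etree.ElementTree module. It is not part of LucidWorks Search: the function name enable_binary_extraction and its parameters are hypothetical, and it assumes the file contains a single JDBC dataSource and a single root entity, as in this example.

```python
import xml.etree.ElementTree as ET

# The configuration file as first generated for the example data source.
ORIGINAL = """<dataConfig>
  <dataSource autoCommit="true" batchSize="-1" convertType="true" driver="com.mysql.jdbc.Driver" password="admin" url="jdbc:mysql://localhost/test" user="root"/>
  <document name="items">
    <entity name="root" preImportDeleteQuery="data_source:9" query="SELECT * FROM documents" transformer="TemplateTransformer">
      <field column="data_source" template="9"/>
      <field column="data_source_type" template="Jdbc"/>
      <field column="data_source_name" template="MySQL"/>
    </entity>
  </document>
</dataConfig>"""


def enable_binary_extraction(xml_text, source_name, binary_column, target_field="body"):
    """Apply the manual steps described above to a dataconfig.xml string."""
    config = ET.fromstring(xml_text)

    # Name the JDBC dataSource and stop binary data being treated as strings.
    jdbc = config.find("dataSource")
    jdbc.set("name", source_name)
    jdbc.set("convertType", "false")

    # Add the FieldStreamDataSource directly after the JDBC dataSource.
    reader = ET.Element("dataSource", {"name": "fieldReader", "type": "FieldStreamDataSource"})
    config.insert(list(config).index(jdbc) + 1, reader)

    # Point the root entity at the named JDBC dataSource.
    root_entity = config.find("document/entity")
    root_entity.set("dataSource", source_name)

    # Nest a Tika entity that reads the binary column and stores the extracted text.
    tika = ET.SubElement(root_entity, "entity", {
        "dataSource": "fieldReader",
        "processor": "TikaEntityProcessor",
        "dataField": f"{root_entity.get('name')}.{binary_column}",
        "format": "text",
    })
    ET.SubElement(tika, "field", {"column": "text", "name": target_field})

    return ET.tostring(config, encoding="unicode")


print(enable_binary_extraction(ORIGINAL, "test", "binary_content"))
```

The output is equivalent to the final configuration file shown above, although whitespace and attribute order may differ. Write the result back to dataconfig.xml only after taking a backup, and then reload the configuration as described in the steps.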
