The LucidWorks default document highlighting service (implemented by the DocumentHighlighterRequestHandler) performs a function similar to the built-in Solr highlighter. The key difference is that the Solr highlighter operates on a list of results returned from the index in response to a query, while the Document Highlighter described here can highlight arbitrary content in any of several supported formats. The content can either be submitted as part of the request or be retrieved from the current index.
Depending on the request parameters, the Document Highlighter can return the original text, the highlighted text, and the offsets of each matching span of terms.
Request Parameters
Below is a list of parameters accepted by this RequestHandler. Parameters are mandatory unless marked as optional, in which case a default value will be used.
- dochl.q: the query to use for finding highlights. Note: the query parser can be defined using the regular qt= parameter, for example: qt=lucene.
- dochl.mode={highlight | offsets | both}: (optional, default is highlight). When mode is set to highlight, only the text with highlighted sections will be returned. If mode is set to offsets, only the matching term spans and their offsets will be returned. When mode is set to both, both highlights and offsets will be returned.
- dochl.source={solrcell | stored | xml | text}: the source of the text to be highlighted.
  - solrcell: retrieve text from a specified content stream (required) using ExtractingRequestHandler, commonly referred to as SolrCell, and optionally use a subset of extracted data specified in an XPath expression (see below for more details). All parameters specific to SolrCell, such as field mapping, can be used in combination with this option.
  - stored: obtain text from stored fields of a specified document in the current index (see below for additional parameters).
  - xml: use the text content of the provided XML content stream, and optionally use a subset of the XML selected by an XPath expression.
  - text: use the provided plain text content stream.
LucidWorks performs text analysis based on the field names in the supplied content. If no field names are present, the default field name is body. The default field type, which determines the analysis chain, is looked up in the current schema using this field name. It can also be overridden using dochl.ft.NAME parameters, for example dochl.ft.body=text_en. The special name dochl.ft.* sets an override for any field that has no explicit override.
- dochl.xpath: (optional, default is none, which selects the whole document). This XPath expression is used to select matching elements in the input XML document (either provided directly as XML content stream, or extracted using SolrCell). Only text content from the selected elements will be processed. In the case of multiple matches, LWE processes each match separately, and returns text, highlight, and offset nodes arranged in the order of matching.
- dochl.filename: (optional, default is none) can be used to help with content extraction.
- dochl.includeOrigText: (optional, default is true when dochl.mode=offsets and false otherwise). If true, the original source text is also returned.
- dochl.beginMarker: (optional, default is <b>) characters to use as the start of the highlighted span.
- dochl.endMarker: (optional, default is </b>) characters to use as the end of the highlighted span.
- dochl.docId: (required when dochl.source=stored) unique key (without field name) that identifies the document to obtain the stored fields from.
- dochl.fl: (optional, considered when dochl.source=stored) if present, specifies a comma-separated list of fields to highlight. If absent, the regular fl parameter is considered. If both are absent, LWE uses all stored fields.
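The parameters above can be assembled into a request URL programmatically. The following is a minimal sketch in Python, not part of LucidWorks itself: the helper name build_dochl_url is hypothetical, and the base URL is assumed from the examples in this section. It only checks the constraints stated in the parameter list (allowed values for dochl.mode and dochl.source, and dochl.docId being needed for stored content).

```python
from urllib.parse import urlencode

# Assumed endpoint, copied from the examples in this section.
BASE_URL = "http://localhost:8888/solr/collection1/dochl"

VALID_MODES = {"highlight", "offsets", "both"}
VALID_SOURCES = {"solrcell", "stored", "xml", "text"}


def build_dochl_url(query, source, mode="highlight", doc_id=None, extra=None):
    """Build a Document Highlighter request URL (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported dochl.mode: {mode}")
    if source not in VALID_SOURCES:
        raise ValueError(f"unsupported dochl.source: {source}")
    if source == "stored" and doc_id is None:
        raise ValueError("dochl.docId is needed when the source is 'stored'")

    params = {"dochl.q": query, "dochl.source": source, "dochl.mode": mode}
    if doc_id is not None:
        params["dochl.docId"] = doc_id
    # Any other parameter (dochl.xpath, stream.body, ...) is passed through as given.
    params.update(extra or {})
    # urlencode percent-encodes values and turns spaces into '+'.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_dochl_url(
        "test", "text", mode="both",
        extra={"stream.body": "This is a test example"},
    ))
```

Letting urlencode handle escaping avoids hand-encoding markers such as <b> or XML content passed in stream.body.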
Examples
curl "http://localhost:8888/solr/collection1/dochl?dochl.q=test&dochl.source=text&stream.body=This+is+a+test+example&dochl.mode=both&dochl.includeOrigText=true&dochl.beginMarker=%3Cspan%3E&dochl.endMarker=%3C%2Fspan%3E"
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">91</int>
  </lst>
  <lst name="results">
    <lst name="0">
      <lst name="offsets">
        <str name="test">10-14</str>
      </lst>
      <str name="highlighted">This is a <span>test</span> example</str>
      <str name="text">This is a test example</str>
    </lst>
  </lst>
</response>
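A client will usually want the offsets as numbers, not as strings. The sketch below uses Python's standard xml.etree.ElementTree to read the offsets out of a response shaped like the one above. The function name parse_offsets is hypothetical, and the sketch assumes each offset is formatted as start-end, as in the sample; it has not been checked against other response shapes.

```python
import xml.etree.ElementTree as ET

# Sample response, copied from the example above.
SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">91</int>
  </lst>
  <lst name="results">
    <lst name="0">
      <lst name="offsets">
        <str name="test">10-14</str>
      </lst>
      <str name="text">This is a test example</str>
    </lst>
  </lst>
</response>
"""


def parse_offsets(response_xml):
    """Return {section_name: [(term, start, end), ...]} (hypothetical helper)."""
    root = ET.fromstring(response_xml)
    results = root.find("lst[@name='results']")
    parsed = {}
    if results is None:
        return parsed
    for section in results.findall("lst"):
        spans = []
        for entry in section.findall("lst[@name='offsets']/str"):
            # Assumption: offsets are written as "start-end".
            start, end = entry.text.split("-")
            spans.append((entry.get("name"), int(start), int(end)))
        parsed[section.get("name")] = spans
    return parsed


if __name__ == "__main__":
    print(parse_offsets(SAMPLE_RESPONSE))
```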
curl "http://localhost:8888/solr/collection1/dochl?dochl.q=test&dochl.source=solrcell&dochl.mode=both" -F "myfile=@test.pdf"
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">61</int>
  </lst>
  <lst name="results">
    <lst name="body">
      <lst name="offsets">
        <str name="test">10-14</str>
        <str name="test">40-44</str>
      </lst>
      <str name="highlighted">This is a <b>test</b> example. This is a second <b>test</b> example.</str>
    </lst>
  </lst>
</response>
curl "http://localhost:8888/solr/collection1/dochl?dochl.q=test&dochl.source=xml&dochl.mode=both&stream.body=<?xml+version='1.0'+encoding='UTF-8'?><top><p>first+test</p><p>second+test+example</p></top>" -H "Content-Type: text/xml"
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">76</int>
  </lst>
  <lst name="results">
    <lst name="body">
      <lst name="offsets">
        <str name="test">17-21</str>
      </lst>
      <str name="highlighted">first testsecond <b>test</b> example</str>
    </lst>
  </lst>
</response>
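When dochl.mode=offsets is used, the client applies the highlighting itself. The sketch below shows one way to do that in Python. It assumes, based on the sample responses in this section, that an offset such as 17-21 is a character range whose start is inclusive and whose end is exclusive. The function name highlight_spans is hypothetical.

```python
def highlight_spans(text, spans, begin="<b>", end="</b>"):
    """Wrap each (start, stop) character span of text in begin/end markers.

    Assumption: stop is exclusive, as the sample offsets suggest.
    """
    pieces = []
    cursor = 0
    for start, stop in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not handled in this sketch")
        pieces.append(text[cursor:start])                # text before the match
        pieces.append(begin + text[start:stop] + end)    # the marked match
        cursor = stop
    pieces.append(text[cursor:])                         # text after the last match
    return "".join(pieces)


if __name__ == "__main__":
    print(highlight_spans("first testsecond test example", [(17, 21)]))
```

Applied to the text and offsets of the last example, this reproduces the highlighted string shown in that response.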