Hyphenated terms, such as plug-in or CD-ROM, are indexed without their hyphens, both as a sequence of sub-words and as a single combined term that is the concatenation of the sub-words. The combined term is stored at the position of the final sub-word. Document authors are not always consistent about whether they hyphenate such terms, so the goal of the Lucid query parser is to match either form in a document given either form in a query. To do this as well as possible, the Lucid query parser expands any hyphenated query term into a Boolean OR of two alternatives: the sub-words as a phrase, and the combined term.
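The indexing rule above can be illustrated with a small sketch. This is not the Lucid analyzer itself: the function name index_tokens, the lowercasing, and the regular expression used to find words are assumptions made for illustration only.

```python
import re

def index_tokens(text):
    """Return (position, term) pairs for a toy analyzer (illustrative only).

    Each sub-word of a hyphenated word gets its own position, and the
    concatenated form is added at the position of the final sub-word.
    """
    tokens = []
    position = 0
    # Assumption: a word is a run of alphanumerics, optionally joined by single hyphens.
    for word in re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()):
        sub_words = word.split("-")
        for sub_word in sub_words:
            tokens.append((position, sub_word))
            position += 1
        if len(sub_words) > 1:
            # The combined term shares the position of the last sub-word.
            tokens.append((position - 1, "".join(sub_words)))
    return tokens

print(index_tokens("This is the plug-in."))
```

In this sketch, plug and in occupy consecutive positions and plugin shares the position of in, which is what lets a later query match either the phrase or the combined term.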
h3. Simple Hyphenated Terms
A query of plug-in will automatically be interpreted as ("plug in" OR plugin). If we have these mini-documents:
* Doc #1: This is a plugin.
* Doc #2: This is the plug-in.
* Doc #3: Where is my plug in?
The query will match all three documents.
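A query of plugin will match only the first two documents, because Doc #3 contains the separate words plug and in and therefore has no combined term in the index. This is a limitation of this heuristic feature: the query results are better than they would be without it, even if they are still not ideal.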
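The behavior for the three mini-documents can be reproduced with a toy model. It is a sketch under stated assumptions, not the Lucid implementation: index_terms, expand, has_phrase and matches are hypothetical helper names, and the index is simply a list holding the set of terms stored at each position.

```python
import re

WORD = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def index_terms(text):
    """Toy index: one set of terms per position (illustrative only)."""
    positions = []
    for word in WORD.findall(text.lower()):
        parts = word.split("-")
        positions.extend({part} for part in parts)
        if len(parts) > 1:
            # Combined term is stored at the position of the final sub-word.
            positions[-1].add("".join(parts))
    return positions

def expand(term):
    """Alternatives for a query term; each alternative is a phrase (list of words)."""
    parts = term.lower().split("-")
    if len(parts) == 1:
        return [parts]
    return [parts, ["".join(parts)]]  # ("plug in" OR plugin)

def has_phrase(positions, phrase):
    """True if the words of phrase occur at consecutive positions."""
    n = len(phrase)
    return any(
        all(phrase[i] in positions[start + i] for i in range(n))
        for start in range(len(positions) - n + 1)
    )

def matches(text, term):
    positions = index_terms(text)
    return any(has_phrase(positions, alt) for alt in expand(term))

docs = ["This is a plugin.", "This is the plug-in.", "Where is my plug in?"]
print([matches(doc, "plug-in") for doc in docs])  # [True, True, True]
print([matches(doc, "plugin") for doc in docs])   # [True, True, False]
```

The second line of output shows the limitation: an unhyphenated query has nothing to expand, so it can only find documents where the combined term was indexed.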
h3. Hyphenated Terms within Quoted Phrases
Quoted phrases may contain any number of hyphenated terms. In that case the Lucene "span query" feature is used for the entire phrase, and each hyphenated term within the phrase is expanded as described above.
A query of:
* "buy a cd-rom with plug-in software"
would match any of the following mini-documents:
* Doc #1: I want to buy a cdrom with plugin software
* Doc #2: I want to buy a cdrom with plug-in software
* Doc #3: I want to buy a cd-rom with plugin software
* Doc #4: I want to buy a cd-rom with plug-in software
In terms of the new proximity operators, this query is equivalent to:
* buy a before:0 cd-rom before:0 with before:0 plug-in before:0 software
which is equivalent to:
- buy a before:0 ("cd rom" OR cdrom) before:0 with before:0 ("plug in" OR plugin) before:0 software
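A minimal sketch of how such a phrase could be matched follows. It does not use Lucene span queries; it models the same idea on a toy positional index, where each phrase element is a set of alternatives that may span a different number of positions. All function names here (index_terms, alternatives, match_from, phrase_matches) are hypothetical.

```python
import re

WORD = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def index_terms(text):
    """Toy index: one set of terms per position (illustrative only)."""
    positions = []
    for word in WORD.findall(text.lower()):
        parts = word.split("-")
        positions.extend({part} for part in parts)
        if len(parts) > 1:
            # Combined term is stored at the position of the final sub-word.
            positions[-1].add("".join(parts))
    return positions

def alternatives(term):
    """A plain term has one alternative; a hyphenated term has two."""
    parts = term.split("-")
    return [parts] if len(parts) == 1 else [parts, ["".join(parts)]]

def match_from(positions, start, elements):
    """Try to match every phrase element in order, with no gaps, from start."""
    if not elements:
        return True
    for alt in elements[0]:
        end = start + len(alt)
        fits = end <= len(positions) and all(
            word in positions[start + i] for i, word in enumerate(alt)
        )
        if fits and match_from(positions, end, elements[1:]):
            return True
    return False

def phrase_matches(text, phrase):
    positions = index_terms(text)
    elements = [alternatives(term) for term in phrase.lower().split()]
    return any(match_from(positions, s, elements) for s in range(len(positions)))

phrase = "buy a cd-rom with plug-in software"
for doc in [
    "I want to buy a cdrom with plugin software",
    "I want to buy a cdrom with plug-in software",
    "I want to buy a cd-rom with plugin software",
    "I want to buy a cd-rom with plug-in software",
]:
    print(phrase_matches(doc, phrase))  # True for each of the four documents
```

The recursion is needed because the two alternatives for a hyphenated term consume different numbers of positions, so the position at which the next word must appear depends on which alternative matched.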
h3. Multiple Hyphens in Terms
Some hyphenated terms have more than two sub-words. For example:
- on-the-run and never-to-be-forgotten
will be interpreted as:
- ("on the run" OR ontherun) and ("never to be forgotten" OR nevertobeforgotten)
Multiple hyphens occur in various special formats, such as phone numbers. For example:
- 646-414-1593 1-800-555-1212
will be interpreted as:
- ("646 414 1593" OR 6464141593) AND ("1 800 555 1212" OR 18005551212)
Social Security numbers and ISBNs also have multiple hyphens. For example:
- 101-23-1234 and 978-3-16-148410-0
will be interpreted as:
- ("101 23 1234" OR 101231234) and ("978 3 16 148410 0" OR 9783161484100)
Part numbers and various ID formats also tend to contain more than one hyphen. These would be treated similarly to the examples above.
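The rewrites shown in this section can be sketched as a simple string transformation. This is an illustration, not the parser's actual code: expand_term and expand_query are hypothetical names, and two behaviors are assumptions inferred from the examples above, namely that adjacent terms are joined with an implicit AND and that and, or and not are recognized as operators regardless of case and passed through as typed.

```python
def expand_term(term):
    """Rewrite a hyphenated term as ("sub words" OR subwords); leave others alone."""
    parts = [part for part in term.split("-") if part]
    if len(parts) < 2:
        return term
    return '("{}" OR {})'.format(" ".join(parts), "".join(parts))

def expand_query(query):
    """Expand every hyphenated term in a whitespace-separated query."""
    operators = {"AND", "OR", "NOT"}  # assumption: matched case-insensitively
    output = []
    previous_was_term = False
    for token in query.split():
        if token.upper() in operators:
            output.append(token)  # keep the operator as the user typed it
            previous_was_term = False
        else:
            if previous_was_term:
                output.append("AND")  # assumption: implicit AND between terms
            output.append(expand_term(token))
            previous_was_term = True
    return " ".join(output)

print(expand_query("on-the-run and never-to-be-forgotten"))
print(expand_query("646-414-1593 1-800-555-1212"))
print(expand_query("101-23-1234 and 978-3-16-148410-0"))
```

A part number such as AB-12-X would be rewritten the same way under this sketch, as ("AB 12 X" OR AB12X).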