While LucidWorks Enterprise provides reasonable defaults for term analysis, you may want more control. To customize these defaults further, you can use the term analysis filters in Solr. Some of these filters are driven by rules contained in text files. These include:
- Stop words
- Synonyms
- Stemming rules
You can edit synonyms and stop words in the Query Settings section of the Administration User Interface. The actual files can be found in the $LWE_HOME/conf/solr/cores/collection1_0/conf directory (assuming the default collection of collection1; if using multiple collections, use the collection name that matches the collection to be changed).
The LucidWorks Enterprise stop words file format is the same as the Solr stopwords file format: one term per line.
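As an illustration of the one-term-per-line layout, the following minimal Python sketch reads stop words from text and filters a token list. The function names and the sample terms are hypothetical, and skipping blank lines and lines starting with "#" is an assumption rather than something stated above. It does not reproduce how Solr itself loads the file.

```python
def load_stop_words(text):
    """Return the set of stop words in `text`, which holds one term per line."""
    stop_words = set()
    for line in text.splitlines():
        term = line.strip()
        # Assumption: blank lines and '#' comment lines are skipped.
        if term and not term.startswith("#"):
            stop_words.add(term)
    return stop_words


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word set."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical file content, not the stop word list that ships with the product.
SAMPLE_STOP_WORDS = "a\nan\nthe\n"

if __name__ == "__main__":
    words = load_stop_words(SAMPLE_STOP_WORDS)
    print(remove_stop_words(["the", "quick", "fox"], words))
```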
The LucidWorks Enterprise synonyms file format is the same as the Solr synonyms file format. Blank lines are ignored, and lines starting with a pound sign (#) are comments. Explicit mappings match any token sequence on the left side of "=>" and replace it with all alternatives on the right side. These mappings ignore the expand parameter in the schema.
Equivalent synonyms may be separated with commas and will give no explicit mapping (that is, the listed terms are equivalent). This allows the same synonym file to be used in different synonym handling strategies.
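The two kinds of synonym lines can be sketched in Python as follows. This is a simplified illustration rather than Solr's parser: the function names and sample entries are hypothetical, and the handling of the expand setting (map every term to the whole group when true, or to the first term only when false) is an assumption based on how the Solr format is commonly described.

```python
def split_terms(text):
    """Split a comma-separated list into trimmed, non-empty terms."""
    return [term.strip() for term in text.split(",") if term.strip()]


def parse_synonyms(text, expand=True):
    """Build a term -> replacements mapping from synonyms-file text."""
    mapping = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '#' comment lines carry no rules
        if "=>" in line:
            # Explicit mapping: `expand` is ignored for these lines.
            left, right = line.split("=>", 1)
            sources = split_terms(left)
            targets = split_terms(right)
        else:
            # Equivalent synonyms: the result depends on `expand` (assumed behavior).
            sources = split_terms(line)
            targets = sources if expand else sources[:1]
        for source in sources:
            replacements = mapping.setdefault(source, [])
            for target in targets:
                if target not in replacements:
                    replacements.append(target)
    return mapping


# Hypothetical entries for illustration only.
SAMPLE_SYNONYMS = """
# explicit mapping
i-pod, i pod => ipod
# equivalent synonyms
couch, sofa
"""

if __name__ == "__main__":
    print(parse_synonyms(SAMPLE_SYNONYMS))
    print(parse_synonyms(SAMPLE_SYNONYMS, expand=False))
```

Because the equivalent-synonym line carries no direction of its own, the same file yields different mappings depending on the strategy chosen at parse time.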
The Lucid plural stemmer is designed to focus on stemming of plural words into their singular forms. It is rule-based, so the rules can be supplemented and tuned to handle a wide range of exceptions. Individual words can be protected from stemming and can be given special-case stem words. But usually, general patterns cover wide classes of words.
Stemming mostly maps plurals to their singular forms, primarily words ending with "s", but some verb forms not ending with "s" fall under the same heuristic rules.
It is understood that this simple heuristic approach will misinterpret some words that should either not be stemmed or should be stemmed differently. The rules try to avoid removing "s" endings that are not plural (or verb conjugations), such as "alias" or "business."
The input token does not need to be lowercase, but any change made by stemming will be lowercase.
The filter (factory) is named com.lucid.analysis.LucidPluralStemFilterFactory. It has a "rules" parameter which names the rules file. The default rules file is named LucidStemRules_en.txt and found in $LWE_HOME/conf/solr/cores/collection1_0/conf. It is expected that each natural language will have its own stemming rules file. This file is also specific to each collection.
If you edit the stemming rules file, adhere to the following format guidelines.
- An exclamation point (!) marks a comment or a comment line, which is ignored.
- White space is extraneous and is ignored.
- Blank lines are ignored.
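To protect a word from stemming, write the word on a line by itself; it will not be changed.
To give a word a special-case stem, write the word followed by its replacement; the word will always be changed to the replacement word. Any of the following forms may be used: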
- word => new-word
- word -> new-word
- word --> new-word
- word = new-word
To protect a class of words, write a pattern followed by a suffix; any matching word will be protected.
- pattern suffix
The pattern may start with an asterisk (*) to indicate variable length. Use zero or more question marks (?), each of which requires one character. Use a trailing slash (/) if a consonant is required.
To replace a suffix, write a pattern and suffix followed by the new suffix; the suffix of any matching word will be replaced with the new suffix.
- pattern suffix => new-suffix
Pattern rules are the same as for protected suffixes. The pattern may be repeated before the replacement suffix for readability.
- *ses => se
- *ses -> *se
- *?/uses => se
- *???s =>
- *???s => *
The last two examples show no new suffix, meaning that the existing suffix is simply removed.
Rules are evaluated in the order that they appear in the rules file, except that whole protected words and replacement words are processed before examining suffixes.
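The rule format and evaluation order described above can be approximated with the Python sketch below. It is not the LucidPluralStemFilterFactory implementation. All function names and the sample rules are hypothetical, and several points are assumptions: matching is done against the lowercased word, a pattern without a leading asterisk must match the whole word length, and the literal text after the wildcards is treated as the suffix. The consonant marker (/) is deliberately rejected because the text does not pin down its exact semantics.

```python
import re

# Any of the four documented separators between a rule and its replacement.
SEPARATOR = re.compile(r"-->|=>|->|=")


def parse_rules(text):
    """Parse rules text into (protected words, replacement words, suffix rules)."""
    protected, replacements, suffix_rules = set(), {}, []
    for raw_line in text.splitlines():
        # Drop '!' comments, then remove all white space (both are ignored).
        line = "".join(raw_line.split("!", 1)[0].split())
        if not line:
            continue
        if "/" in line:
            raise ValueError("the consonant marker is not handled in this sketch")
        parts = SEPARATOR.split(line, maxsplit=1)
        left = parts[0]
        # None marks a protection rule; otherwise the new word or new suffix.
        right = parts[1] if len(parts) == 2 else None
        if not left.startswith(("*", "?")):
            # Whole-word rule: protected word or replacement word.
            if right is None:
                protected.add(left)
            else:
                replacements[left] = right
            continue
        # Split the leading wildcards from the literal suffix.
        wild = left[: len(left) - len(left.lstrip("*?"))]
        suffix = left[len(wild):]
        regex = re.compile(wild.replace("*", ".*").replace("?", ".") + re.escape(suffix))
        if right is not None:
            right = right.lstrip("*?")  # the pattern may be repeated for readability
        suffix_rules.append((regex, suffix, right))
    return protected, replacements, suffix_rules


def stem(word, rules):
    """Apply whole-word rules first, then suffix rules in file order."""
    protected, replacements, suffix_rules = rules
    lower = word.lower()
    if lower in protected:
        return word
    if lower in replacements:
        return replacements[lower]
    for regex, suffix, new_suffix in suffix_rules:
        if regex.fullmatch(lower):
            if new_suffix is None:
                return word  # protected suffix: leave the word unchanged
            return lower[: len(lower) - len(suffix)] + new_suffix
    return word


# Hypothetical rules for illustration, not the shipped LucidStemRules_en.txt.
SAMPLE_RULES = """
! protected word and replacement word
news
mice => mouse
! protected suffix, then replacement suffixes
*ss
*ses -> *se
*???s =>
"""

if __name__ == "__main__":
    rules = parse_rules(SAMPLE_RULES)
    for token in ["Cases", "business", "cats", "news", "Mice"]:
        print(token, "->", stem(token, rules))
```

Keeping suffix rules in a list preserves file order, so an earlier protection rule such as `*ss` wins over a later removal rule such as `*???s =>`.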
To restrict the minimum word length that is to be stemmed, simply create rules consisting of only question marks ('?') to match and protect words of those lengths.
For example, to protect words of fewer than four characters, add the following three rules before any other rules:
- ?
- ??
- ???
Here is the default LucidStemRules_en.txt file that ships with LucidWorks Enterprise: