Some Solr term analysis filters are driven by rules contained in text files. These include:
- Stop words
- Synonyms
- Stemming rules
The actual file names vary and are determined by the filter's entry in schema.xml as well as by the LucidWorks Enterprise admin interface.
Stop Words File Format
Same as Solr stop words file format.
One stop word per line.
Example:
a
an
and
are
as
at
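As a rough illustration of how a one-word-per-line file could be consumed, the following Python sketch loads the words into a set and filters a token list. The function names `load_stop_words` and `remove_stop_words` are hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior of the filter.

```python
def load_stop_words(text):
    """Return the set of stop words, one per non-blank line."""
    # Assumption: words are compared case-insensitively, so store them lowercased.
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]


stop_words = load_stop_words("a\nan\nand\nare\nas\nat\n")
kept = remove_stop_words("Look at a cat and an owl".split(), stop_words)
```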
Synonyms File Format
Same as Solr synonyms file format.
Blank lines are ignored, and lines starting with a pound sign (#) are comments.
Explicit mappings match any token sequence on the LHS of "=>" and replace with all alternatives on the RHS. These types of mappings ignore the expand parameter in the schema.
Examples:
i-pod, i pod => ipod,
sea biscuit, sea biscit => seabiscuit
Equivalent synonyms may be separated with commas and give no explicit mapping. This allows the same synonym file to be used in different synonym handling strategies.
Example:
lawyer, attorney
one, 1
two, 2
three, 3
ten, 10
hundred, 100
thousand, 1000
tv, television
#multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
#is equivalent to
foo => foo bar, baz
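The rules above can be sketched as a small parser. This is only an illustration in Python, not the Solr implementation: `parse_synonyms` and `split_terms` are hypothetical names, and the `expand` argument mimics the schema parameter on the assumption that expand=true maps every equivalent synonym to the whole group, while expand=false maps every synonym to the first entry of the group.

```python
def split_terms(part):
    """Split a comma-separated list, dropping empty entries such as a trailing comma."""
    return [term.strip() for term in part.split(",") if term.strip()]


def parse_synonyms(text, expand=True):
    """Return {source phrase: [replacement phrases]} for a synonyms file."""
    mapping = {}

    def add(source, replacements):
        # Entries for the same source are merged, keeping first-seen order.
        targets = mapping.setdefault(source, [])
        for replacement in replacements:
            if replacement not in targets:
                targets.append(replacement)

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if "=>" in line:
            # Explicit mapping: ignores the expand setting.
            left, right = line.split("=>", 1)
            for source in split_terms(left):
                add(source, split_terms(right))
        else:
            # Equivalent synonyms: handling depends on the expand setting.
            group = split_terms(line)
            targets = group if expand else group[:1]
            for source in group:
                add(source, targets)
    return mapping
```

With this sketch, the two lines `foo => foo bar` and `foo => baz` produce the same entry as the single line `foo => foo bar, baz`, and `tv, television` yields either a two-way mapping or a mapping of both terms to `tv`, depending on `expand`.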
Lucid Plural Stemmer Rules File Format
The Lucid plural stemmer is designed to focus on stemming of plural words into their singular forms. It is not perfect, but it is rule-based and the rules can be supplemented and tuned to handle a wide range of exceptions. Individual words can be protected from stemming and individual words can be given special-case stem words. But usually, general patterns cover wide classes of words.
Most rules map plural words to their singular forms, primarily words ending with "s", but some verb forms ending with "s" fall under the same heuristic rules.
It is understood that this simple heuristic approach will mangle a non-zero fraction of words that should either not be stemmed or should be stemmed differently.
The rules do go to some effort to avoid removing "s" endings that are not plural (or verb conjugations), such as "alias" or "business."
The input token does not need to be lower case, but any change made by stemming will be in lower case.
The filter (factory) is named com.lucid.analysis.LucidPluralStemFilterFactory. It has a "rules" parameter which names the rules file. The default rules file name is LucidStemRules_en.txt. It is expected that each natural language will have its own stemming rules file.
General Formatting rules
- An exclamation point (!) starts a comment; the rest of the line is ignored.
- White space is extraneous and ignored.
- Blank lines are ignored.
Types of stemming rules
Protected word
Write the word by itself, and it will not be changed.
- word
Replacement word
Word will always be changed to a replacement word.
- word => new-word
- word -> new-word
- word --> new-word
- word = new-word
Protected suffixes
Any matching word will be protected.
- pattern suffix
The pattern may start with an asterisk to indicate a variable-length prefix.
Use zero or more question marks; each one stands for one required character.
Use a trailing slash if a consonant is required.
Examples:
- ?ass
- *??ass
- *???/ass
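One way to read these patterns is as a restricted form of regular expression. The Python sketch below is an interpretation, not the Lucid implementation: it assumes that an asterisk matches zero or more characters, that each question mark matches exactly one character, that the slash stands for one additional character that must be a consonant, and that every letter other than a, e, i, o, u (including "y") counts as a consonant. The names `pattern_to_regex` and `is_protected` are hypothetical.

```python
import re

# Assumption: every letter except a, e, i, o, u counts as a consonant.
CONSONANT = "[b-df-hj-np-tv-z]"


def pattern_to_regex(rule):
    """Compile a suffix pattern such as '*???/ass' into a regular expression."""
    pieces = []
    i = 0
    # Leading pattern characters: '*' variable length, '?' one character,
    # '/' one consonant (assumed to be an additional character).
    while i < len(rule) and rule[i] in "*?/":
        pieces.append({"*": ".*", "?": ".", "/": CONSONANT}[rule[i]])
        i += 1
    # Whatever follows the pattern characters is the literal suffix.
    pieces.append(re.escape(rule[i:]))
    return re.compile("".join(pieces))


def is_protected(word, rule):
    """True if the whole word (compared in lower case) matches the pattern."""
    return pattern_to_regex(rule).fullmatch(word.lower()) is not None
```

Under these assumptions, `?ass` matches "bass" but not "class", `*??ass` matches "class" but not "bass", and `*???/ass` matches "compass" but not "bypass", which has only three characters before the suffix.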
Translation suffix
Suffix of matching word will be replaced with new suffix.
- pattern suffix => new-suffix
Pattern rules are the same as for protected suffixes.
The pattern may be repeated before the replacement suffix for readability.
Examples:
- *ses => se
- *ses -> *se
- *?/uses => se
- *???s =>
- *???s => *
The last two examples show no new suffix, meaning that the existing suffix is simply removed.
Rules are evaluated in the order that they appear in the rules file, except that whole protected words and replacement words are processed before examining suffixes.
To restrict the minimum word length that is to be stemmed, simply create rules consisting of only question marks ('?') to match and protect words of those lengths.
For example, to protect words of fewer than four characters in length, add three rules before any other rules:
? ! Protects 1-char words.
?? ! Protects 2-char words.
??? ! Protects 3-char words.
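To show how the rule types and the evaluation order might fit together, here is a minimal Python sketch of a rules file reader and stemmer. It is an illustration under stated assumptions, not the code behind LucidPluralStemFilterFactory: the names `parse_rules`, `stem`, and `RULES` are hypothetical, the first matching suffix rule is assumed to win, the slash is assumed to stand for one additional consonant character, and "y" is treated as a consonant.

```python
import re

CONSONANT = "[b-df-hj-np-tv-z]"  # assumption: 'y' counts as a consonant
OPERATOR = re.compile(r"^(.+?)(?:-->|->|=>|=)(.*)$")


def compile_pattern(lhs):
    """Return (regex, literal suffix) for a pattern such as '*???s'."""
    i = 0
    pieces = []
    while i < len(lhs) and lhs[i] in "*?/":
        pieces.append({"*": ".*", "?": ".", "/": CONSONANT}[lhs[i]])
        i += 1
    suffix = lhs[i:]
    return re.compile("".join(pieces) + re.escape(suffix)), suffix


def parse_rules(text):
    words = {}     # whole word -> replacement word, or None if protected
    suffixes = []  # (regex, suffix, new suffix or None), in file order
    for raw in text.splitlines():
        line = raw.split("!", 1)[0]           # drop the comment
        line = "".join(line.split()).lower()  # white space is ignored
        if not line:
            continue
        match = OPERATOR.match(line)
        lhs, rhs = (match.group(1), match.group(2)) if match else (line, None)
        if lhs[0] in "*?/":
            if rhs is not None:
                rhs = rhs.lstrip("*?/")  # pattern repeated for readability
            regex, suffix = compile_pattern(lhs)
            suffixes.append((regex, suffix, rhs))
        else:
            words[lhs] = rhs
    return words, suffixes


def stem(word, rules):
    words, suffixes = rules
    lower = word.lower()
    # Whole protected words and replacement words are checked first.
    if lower in words:
        return word if words[lower] is None else words[lower]
    # Then suffix rules, in file order; the first match decides (assumption).
    for regex, suffix, new_suffix in suffixes:
        if regex.fullmatch(lower):
            if new_suffix is None:
                return word  # protected suffix: leave the token unchanged
            return lower[: len(lower) - len(suffix)] + new_suffix
    return word


RULES = """
?        ! Protects 1-char words.
??       ! Protects 2-char words.
???      ! Protects 3-char words.
alias    ! Protected word.
mice => mouse
*??ass
*ses -> *se
*???s =>
"""
rules = parse_rules(RULES)
```

With this rule set, `stem("Houses", rules)` gives "house" and `stem("cats", rules)` gives "cat", while "gas", "alias", and "glass" are returned unchanged because a protection rule matches before any translation rule is reached.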