LucidWorks Enterprise provides reasonable defaults for term analysis. If you need more control, you can use the term analysis filters in Solr. Some of these filters are driven by rules contained in text files. These include:
- Stopwords
- Synonyms
- Stemming rules
You can edit synonyms and stop words in the Query Settings section of the Administration User Interface. The actual files can be found in the $LWE_HOME/conf/solr/cores/collection1_0/conf directory (assuming the default collection of collection1; if using multiple collections, use the collection name that matches the collection to be changed).
Stop Words File Format
The LucidWorks Enterprise Stop Words file format is the same as the Solr stopwords file format, which is one term per line, as in:
a
an
and
are
as
at
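As an illustration of the one-term-per-line format, here is a minimal Python sketch that reads such a file and filters tokens with it. The `parse_stopwords` helper is hypothetical and not part of LucidWorks Enterprise or Solr. Skipping blank lines and `#` comment lines is an assumption based on the stock Solr stopwords file.

```python
def parse_stopwords(text):
    """Return the set of stop words found in a one-term-per-line file."""
    words = set()
    for line in text.splitlines():
        term = line.strip()
        # Assumption: blank lines and '#' comment lines are skipped.
        if not term or term.startswith("#"):
            continue
        words.add(term)
    return words


sample = "a\nan\nand\n\n# a comment\nare\nas\nat\n"
stopwords = parse_stopwords(sample)

tokens = "an apple and a pear are at hand".split()
kept = [t for t in tokens if t not in stopwords]
print(kept)  # ['apple', 'pear', 'hand']
```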
Synonyms File Format
The LucidWorks Enterprise Synonyms file format is the same as the Solr synonyms file format. Blank lines are ignored, and lines starting with a pound sign (#) are comments. Explicit mappings match any token sequence on the left side of "=>" and replace it with all alternatives on the right side. These types of mappings ignore the expand parameter in the schema.
Equivalent synonyms may be separated with commas and will give no explicit mapping (that is, the listed terms are equivalent). This allows the same synonym file to be used in different synonym handling strategies.
Example:
lawyer, attorney
one, 1
two, 2
three, 3
ten, 10
hundred, 100
thousand, 1000
tv, television
#multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
#is equivalent to
foo => foo bar, baz
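The merging and equivalence rules can be sketched in Python as follows. This is an illustrative parser, not Solr's implementation, and `parse_synonyms` is a hypothetical name. It assumes that with expand enabled every equivalent term maps to all listed terms, and that with expand disabled every term maps to the first one listed.

```python
def parse_synonyms(text, expand=True):
    """Map each left-hand phrase to its ordered list of replacements."""
    mapping = {}

    def add(source, targets):
        merged = mapping.setdefault(source, [])
        for target in targets:
            if target not in merged:  # merge repeated entries, keep order
                merged.append(target)

    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if "=>" in line:
            # Explicit mapping: the expand setting is ignored.
            left, right = line.split("=>", 1)
            targets = [t.strip() for t in right.split(",") if t.strip()]
            for source in left.split(","):
                if source.strip():
                    add(source.strip(), targets)
        else:
            terms = [t.strip() for t in line.split(",") if t.strip()]
            # Assumption: expand=True maps each term to all terms,
            # expand=False maps each term to the first term only.
            targets = terms if expand else terms[:1]
            for term in terms:
                add(term, targets)
    return mapping


sample = "tv, television\n#merged entries\nfoo => foo bar\nfoo => baz\n"
rules = parse_synonyms(sample)
print(rules["foo"])  # ['foo bar', 'baz']
print(rules["tv"])   # ['tv', 'television']
print(parse_synonyms(sample, expand=False)["television"])  # ['tv']
```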
Lucid Plural Stemming Rules File Format
The Lucid plural stemmer is designed to focus on stemming of plural words into their singular forms. It is rule-based, so the rules can be supplemented and tuned to handle a wide range of exceptions. Individual words can be protected from stemming and can be given special-case stem words. But usually, general patterns cover wide classes of words.
Most rules map plurals to their singular forms, primarily for words ending with "s", but there are also verb forms not ending with "s" that fall under the same heuristic rules.
It is understood that this simple heuristic approach will misinterpret some words that should either not be stemmed or should be stemmed differently. The rules try to avoid removing "s" endings that are not plural (or verb conjugations), such as "alias" or "business."
The input token does not need to be lowercase, but any change made by stemming will be lowercase.
The filter (factory) is named com.lucid.analysis.LucidPluralStemFilterFactory. It has a "rules" parameter which names the rules file. The default rules file is named LucidStemRules_en.txt and is found in $LWE_HOME/conf/solr/cores/collection1_0/conf. It is expected that each natural language will have its own stemming rules file. This file is also specific to each collection.
If you edit the stemming rules file, adhere to the following format guidelines.
- An exclamation point (!) indicates a comment or comment line to be ignored.
- White space is extraneous and ignored.
- Blank lines are ignored.
Types of Stemming Rules
Protected Word
Write the word by itself and it will not be changed.
- word
Replacement Word
The word will always be changed to the replacement word.
- word => new-word
- word -> new-word
- word --> new-word
- word = new-word
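The whole-word rule syntax above, together with the comment and white space guidelines, can be illustrated with a short Python sketch. `parse_word_rule` is a hypothetical helper, not the filter's actual parser, and it makes no attempt to recognize suffix patterns.

```python
import re

# The four accepted separators; the lazy left group makes the regex stop
# at the first separator, so "-->" is not misread as "-" plus "->".
ARROW = re.compile(r"^(.+?)\s*(?:-->|=>|->|=)\s*(.*)$")


def parse_word_rule(line):
    """Return None, ('protect', word) or ('replace', word, new_word)."""
    body = line.split("!", 1)[0].strip()  # drop comment, trim white space
    if not body:
        return None  # blank or comment-only line
    match = ARROW.match(body)
    if match is None:
        return ("protect", body)
    return ("replace", match.group(1), match.group(2))


print(parse_word_rule("news   ! no singular form"))  # ('protect', 'news')
print(parse_word_rule("feet => foot"))               # ('replace', 'feet', 'foot')
print(parse_word_rule("word --> new-word"))          # ('replace', 'word', 'new-word')
print(parse_word_rule("! just a comment"))           # None
```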
Protected Suffixes
Any matching word will be protected.
- pattern suffix
The pattern may start with an asterisk to indicate variable length. Use zero or more question marks, each indicating one required character. Use a trailing slash if a consonant is required.
Examples:
- ?ass
- *??ass
- *???/ass
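One way to read these patterns is as a simple regular expression, sketched below in Python. This is an interpretation rather than the filter's actual matching code. It assumes that the asterisk matches zero or more characters, that each question mark matches exactly one character, and that the slash matches a single ASCII letter other than a, e, i, o, or u. The function names are hypothetical.

```python
import re

# Assumption: "/" stands for one ASCII consonant (any letter except aeiou).
CONSONANT = "[b-df-hj-np-tv-z]"


def suffix_pattern_to_regex(pattern):
    """Translate a suffix pattern such as '*???/ass' into a compiled regex."""
    parts = []
    for ch in pattern.lower():
        if ch == "*":
            parts.append(".*")       # variable-length prefix
        elif ch == "?":
            parts.append(".")        # exactly one required character
        elif ch == "/":
            parts.append(CONSONANT)  # one required consonant
        else:
            parts.append(re.escape(ch))  # literal suffix character
    return re.compile("".join(parts))


def is_protected(word, pattern):
    """True if the whole word matches the protected-suffix pattern."""
    return suffix_pattern_to_regex(pattern).fullmatch(word.lower()) is not None


print(is_protected("bass", "?ass"))         # True
print(is_protected("class", "?ass"))        # False
print(is_protected("class", "*??ass"))      # True
print(is_protected("Compass", "*???/ass"))  # True
```

Under this reading, "?ass" protects only four-letter words ending in "ass", while "*??ass" protects any word with at least two characters before "ass".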
Translation Suffix
The suffix of a matching word will be replaced with the new suffix.
- pattern suffix => new-suffix
Pattern rules are the same as for protected suffixes. The pattern may be repeated before the replacement suffix for readability.
Examples:
- *ses => se
- *ses -> *se
- *?/uses => se
- *???s =>
- *???s => *
The latter two examples show no new suffix, meaning that the existing suffix is simply removed.
Rules are evaluated in the order that they appear in the rules file, except that whole protected words and replacement words are processed before examining suffixes.
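The evaluation order can be sketched as a small Python stemmer. It is a simplified illustration, not the LucidPluralStemFilterFactory implementation. The function names are hypothetical, and the pattern semantics are assumptions: "*" matches zero or more characters, "?" matches one character, and "/" matches one ASCII consonant.

```python
import re

CONSONANT = "[b-df-hj-np-tv-z]"  # assumption: meaning of "/"
ARROW = re.compile(r"^(.+?)\s*(?:-->|=>|->|=)\s*(.*)$")
WILDCARDS = "*?/"
TABLE = {"*": ".*", "?": ".", "/": CONSONANT}


def load_rules(text):
    """Split a rules file into whole-word rules and ordered suffix rules."""
    words, suffixes = {}, []
    for line in text.splitlines():
        body = line.split("!", 1)[0].strip()  # drop comment and white space
        if not body:
            continue
        match = ARROW.match(body)
        left = (match.group(1) if match else body).lower()
        right = match.group(2).lower() if match else None
        if not any(ch in WILDCARDS for ch in left):
            words[left] = right  # None means "protected word"
            continue
        # Everything up to the last wildcard is the pattern; the rest is
        # the literal suffix.
        cut = max(left.rfind(ch) for ch in WILDCARDS) + 1
        prefix = "".join(TABLE.get(ch, re.escape(ch)) for ch in left[:cut])
        regex = re.compile("(" + prefix + ")" + re.escape(left[cut:]))
        # A pattern repeated on the right side is only there for readability.
        new_suffix = None if right is None else right.lstrip(WILDCARDS)
        suffixes.append((regex, new_suffix))
    return words, suffixes


def stem(token, rules):
    words, suffixes = rules
    word = token.lower()
    if word in words:  # whole-word rules are checked first
        return token if words[word] is None else words[word]
    for regex, new_suffix in suffixes:  # then suffix rules, in file order
        match = regex.fullmatch(word)
        if match:
            return token if new_suffix is None else match.group(1) + new_suffix
    return token


RULES = """
?      ! protect 1-char words
??
???
*ss    ! protected suffix: business
*ies => *y
*/s => *
mice -> mouse
news   ! whole word, protected even though */s would match
"""
rules = load_rules(RULES)
samples = ["Countries", "cats", "business", "Mice", "news", "gas"]
print([stem(t, rules) for t in samples])
# ['country', 'cat', 'business', 'mouse', 'news', 'gas']
```

Here "news" appears last in the file but is still protected, because whole-word rules are consulted before any suffix rule.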
To set a minimum length for words that are stemmed, create rules consisting only of question marks ('?') to match and protect words of the shorter lengths.
For example, to protect words of fewer than four characters in length, add three rules before any other rules:
? ! Protects 1-char words.
?? ! Protects 2-char words.
??? ! Protects 3-char words.
Example Stemming Rules File
Here is the default LucidStemRules_en.txt file that ships with LucidWorks Enterprise:
? ! Minimum of four characters before any stemming.
??
???
*ss ! No change : business
*'s ! No change : cat's - Handled in other filters.
*elves => *elf ! selves => self, elves, themselves, shelves
appendices => appendix
*indices => *index ! indices => index, subindices - NOT jaundices
*theses => *thesis ! hypotheses => hypothesis, parentheses, theses
*aderies => aderie ! camaraderie
*ies => *y ! countries => country, flies, fries, ponies, phonies, queries, symphonies
*hes => *h ! dishes => dish, ashes, smashes, matches, batches
*???oes => *o : potatoes => potato, avocadoes, tomatoes, zeroes
goes => go
does => do
?oes => *oe ! toes => toe, foes, hoes, joes, moes - NOT does, goes - but "does" is also plural for "doe"
??oes => ??oe ! floes => floe
*sses => *ss ! passes => pass, bosses, classes, presses, tosses
*igases => *igase ! ligases => ligase
*gases => *gas ! outgases => outgas, gases, degases
*mases => *mas ! Christmases => Christmas, Thomases
*?vases => *vas ! canvases => canvas - NOT vases
*iases => *ias ! aliases => alias, bias, Eliases
*abuses => *abuse ! disabuses => disabuse, abuses
*cuses => *cuse ! accuses => accuse, recuses, excuses
*fuses => *fuse ! diffuses => diffuse, fuses, refuses
*/uses => *us : buses => bus, airbuses, viruses; NOT houses, mouses, causes
*xes => *x ! indexes => index, axes, taxes
*zes => *z ! buzzes => buzz
*es => *e ! spaces => space, files, planes, bases, cases, races, paces
*ras => *ra ! zebras => zebra, agoras, algebras
*us
*/s => * ! cats => cat (require consonant (not "s") or "o" before "s")
*oci => *ocus ! foci => focus
*cti => *ctus ! cacti => cactus
plusses => plus
gasses => gas
classes => class
mice => mouse
data => datum
!bases => basis
amebiases => amebiasis
atlases => atlas
Eliases => Elias
molasses
feet => foot
backhoes => backhoe
calories => calorie
! Some plurals that don't make sense as singular
sales
news
jeans