LucidWorks Platform v2.0

This is the documentation for LucidWorks Platform v2.0. The latest release is v2.1.


LucidWorks Enterprise provides reasonable defaults for term analysis, but you can customize them further with the term analysis filters in Solr. Some of these filters are driven by rules contained in text files. These include:

  • Stopwords
  • Synonyms
  • Stemming rules

You can edit synonyms and stop words in the Query Settings section of the Administration User Interface. The files themselves are in the $LWE_HOME/conf/solr/cores/collection1_0/conf directory. This path assumes the default collection, collection1; if you use multiple collections, substitute the name of the collection you want to change.

Stop Words File Format

The LucidWorks Enterprise Stop Words file format is the same as the Solr stopwords file format, which is one term per line, as in:

a
an
and
are
as
at

Synonyms File Format

The LucidWorks Enterprise Synonyms file format is the same as the Solr synonyms file format. Blank lines are ignored, and lines starting with a pound sign (#) are comments. Explicit mappings match any token sequence on the left side of "=>" and replace it with all alternatives on the right side. These types of mappings ignore the expand parameter in the schema.

Equivalent synonyms may be separated with commas and will give no explicit mapping (that is, the listed terms are equivalent). This allows the same synonym file to be used in different synonym handling strategies.

Example:

lawyer, attorney
one, 1
two, 2
three, 3
ten, 10
hundred, 100
thousand, 1000
tv, television

#multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
#is equivalent to
foo => foo bar, baz
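
To make the two rule types and the merging behavior concrete, here is a minimal Python sketch of a parser for this format. It is an illustration only, not Solr's parser: the function name and return shape are hypothetical, and handling a comma-separated left side as several source terms is an assumption of the sketch.

```python
def parse_synonyms(text):
    """Parse synonym rules into (explicit, groups).

    explicit: dict mapping a left-hand token sequence to its list of alternatives
    groups:   list of lists of equivalent terms (comma-separated lines)
    """
    explicit = {}
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if "=>" in line:
            left, right = line.split("=>", 1)
            alternatives = [t.strip() for t in right.split(",") if t.strip()]
            # Assumption: a comma-separated left side lists several sources.
            for source in (s.strip() for s in left.split(",")):
                if not source:
                    continue
                merged = explicit.setdefault(source, [])
                for alt in alternatives:
                    if alt not in merged:  # merge repeated entries
                        merged.append(alt)
        else:
            groups.append([t.strip() for t in line.split(",") if t.strip()])
    return explicit, groups


split_rules = """
lawyer, attorney
tv, television
#multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
"""
merged_rules = "foo => foo bar, baz"

explicit, groups = parse_synonyms(split_rules)
explicit_merged, _ = parse_synonyms(merged_rules)
```

In this sketch the two `foo` entries produce the same mapping as the single merged entry, which mirrors the equivalence shown in the example above.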

Lucid Plural Stemming Rules File Format

The Lucid plural stemmer is designed to focus on stemming of plural words into their singular forms. It is rule-based, so the rules can be supplemented and tuned to handle a wide range of exceptions. Individual words can be protected from stemming and can be given special-case stem words. But usually, general patterns cover wide classes of words.

Most rules map plurals to singulars, primarily for words ending in "s", but there are also verb forms not ending in "s" that fall under the same heuristic rules.

It is understood that this simple heuristic approach will misinterpret some words that should either not be stemmed or should be stemmed differently. The rules try to avoid removing "s" endings that are not plural (or verb conjugations), such as "alias" or "business."

The input token does not need to be lower case, but any stemming change will be lower case.

The filter (factory) is named com.lucid.analysis.LucidPluralStemFilterFactory. It has a "rules" parameter, which names the rules file. The default rules file is named LucidStemRules_en.txt and is found in $LWE_HOME/conf/solr/cores/collection1_0/conf. It is expected that each natural language will have its own stemming rules file. This file is also specific to each collection.

If you edit the stemming rules file, adhere to the following format guidelines.

  1. An exclamation point (!) indicates a comment or comment line to be ignored.
  2. White space is extraneous and is ignored.
  3. Blank lines are ignored.

Types of Stemming Rules

Protected Word

Write the word by itself; it will not be changed.

  • word

Replacement Word

The word will always be changed to the replacement word.

  • word => new-word
  • word -> new-word
  • word --> new-word
  • word = new-word

Protected Suffixes

Any matching word will be protected.

  • pattern suffix

The pattern may start with an asterisk to indicate variable length. Use zero or more question marks, each indicating one required character. Use a trailing slash if a consonant is required.

Examples:

  • ?ass
  • *??ass
  • *???/ass
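
One way to read these patterns is as a small wildcard language that can be translated into regular expressions. The Python sketch below does that under stated assumptions: an asterisk matches zero or more characters, each question mark matches exactly one character, a slash matches one consonant (taken here to be any letter other than a, e, i, o, u), and matching is case-insensitive. The helper names are hypothetical and this is not the filter's actual implementation.

```python
import re

# Assumption: "consonant" means any ASCII letter other than a, e, i, o, u.
CONSONANT = "[b-df-hj-np-tv-z]"


def compile_pattern(pattern):
    """Translate a suffix pattern such as '*???/ass' into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")       # variable length
        elif ch == "?":
            parts.append(".")        # exactly one required character
        elif ch == "/":
            parts.append(CONSONANT)  # one required consonant
        else:
            parts.append(re.escape(ch))  # literal suffix character
    return re.compile("".join(parts), re.IGNORECASE)


def is_protected(word, pattern):
    """Return True if the whole word matches the suffix pattern."""
    return compile_pattern(pattern).fullmatch(word) is not None
```

Under these assumptions, `?ass` matches "bass" but not "class", `*??ass` matches "class" but not "bass", and `*???/ass` matches "compass" but not "morass".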

Translation Suffix

The suffix of a matching word will be replaced with the new suffix.

  • pattern suffix => new-suffix

Pattern rules are the same as for protected suffixes. The pattern may be repeated before the replacement suffix for readability.

Examples:

  • *ses => se
  • *ses -> *se
  • *?/uses => se
  • *???s =>
  • *???s => *

The last two examples show no new suffix, meaning that the existing suffix is simply removed.

Rules are evaluated in the order that they appear in the rules file, except that whole protected words and replacement words are processed before examining suffixes.

To restrict the minimum word length that is to be stemmed, simply create rules consisting of only question marks ('?') to match and protect words of those lengths.

For example, to protect words of fewer than four characters in length, add three rules before any other rules:

?     ! Protects 1-char words.
??    ! Protects 2-char words.
???   ! Protects 3-char words.
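
The evaluation order described above can be sketched as a small rule engine in Python. This is a simplified model, not the LucidPluralStemFilterFactory implementation. It assumes that the first matching suffix rule wins, that a slash matches any letter other than a vowel, that only the literal part of the suffix is replaced, and that replacement words are lowercased. The names `parse_rules` and `stem` are hypothetical.

```python
import re

WILDCARDS = "*?/"
# Assumption: "/" requires one consonant, i.e. a letter other than a, e, i, o, u.
TO_REGEX = {"*": ".*", "?": ".", "/": "[b-df-hj-np-tv-z]"}


def parse_rules(text):
    """Split rule text into whole-word rules (dict) and ordered suffix rules (list)."""
    words, suffixes = {}, []
    for raw in text.splitlines():
        # Drop the "!" comment, then remove all white space.
        body = "".join(raw.split("!", 1)[0].split())
        if not body:
            continue
        parts = re.split(r"-->|=>|->|=", body, maxsplit=1)
        left = parts[0]
        # The pattern may be repeated on the right side; keep only the new suffix.
        right = parts[1].lstrip(WILDCARDS) if len(parts) == 2 else None
        if not any(c in WILDCARDS for c in left):
            words[left.lower()] = right  # None marks a protected word
            continue
        literal = left.lstrip(WILDCARDS)
        prefix = left[: len(left) - len(literal)]
        regex = "".join(TO_REGEX[c] for c in prefix) + re.escape(literal)
        suffixes.append((re.compile(regex), len(literal), right))
    return words, suffixes


def stem(word, rules):
    """Apply whole-word rules first, then suffix rules in file order."""
    words, suffixes = rules
    lower = word.lower()
    if lower in words:
        replacement = words[lower]
        return word if replacement is None else replacement.lower()
    for regex, literal_len, new_suffix in suffixes:
        if regex.fullmatch(lower):
            if new_suffix is None:
                return word  # protected suffix: leave the word unchanged
            return lower[: len(lower) - literal_len] + new_suffix
    return word


RULES = parse_rules("""
?     ! Protects 1-char words.
??    ! Protects 2-char words.
???   ! Protects 3-char words.
*ss          ! No change: business
*ies => *y   ! countries => country
*/s => *     ! cats => cat
goes => go
mice => mouse
news         ! Protected word, listed last but checked first
""")
```

With these sample rules, `stem("Countries", RULES)` gives "country" and `stem("cats", RULES)` gives "cat". "its" and "business" are left unchanged by the length and `*ss` protections. "news" stays as it is even though the `*/s` rule appears earlier in the file, because whole-word rules are processed before suffix rules.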

Example Stemming Rules File

Here is the default LucidStemRules_en.txt file that ships with LucidWorks Enterprise:

? ! Minimum of four characters before any stemming.
??
???
*ss ! No change : business
*'s ! No change : cat's - Handled in other filters.
*elves => *elf ! selves => self, elves, themselves, shelves
appendices => appendix
*indices => *index ! indices => index, subindices - NOT jaundices
*theses => *thesis ! hypotheses => hypothesis, parentheses, theses
*aderies => aderie ! camaraderie
*ies => *y ! countries => country, flies, fries, ponies, phonies, queries, symphonies
*hes => *h ! dishes => dish, ashes, smashes, matches, batches
*???oes => *o : potatoes => potato, avocadoes, tomatoes, zeroes
goes => go
does => do
?oes => *oe ! toes => toe, foes, hoes, joes, moes - NOT does, goes - but "does" is also plural for "doe"
??oes => ??oe ! floes => floe
*sses => *ss ! passes => pass, bosses, classes, presses, tosses
*igases => *igase ! ligases => ligase
*gases => *gas ! outgases => outgas, gases, degases
*mases => *mas ! Christmases => Christmas, Thomases
*?vases => *vas ! canvases => canvas - NOT vases
*iases => *ias ! aliases => alias, bias, Eliases
*abuses => *abuse ! disabuses => disabuse, abuses
*cuses => *cuse ! accuses => accuse, recuses, excuses
*fuses => *fuse ! diffuses => diffuse, fuses, refuses
*/uses => *us : buses => bus, airbuses, viruses; NOT houses, mouses, causes
*xes => *x ! indexes => index, axes, taxes
*zes => *z ! buzzes => buzz
*es => *e ! spaces => space, files, planes, bases, cases, races, paces
*ras => *ra ! zebras => zebra, agoras, algebras
*us
*/s => * ! cats => cat (require consonant (not "s") or "o" before "s")
*oci => *ocus ! foci => focus
*cti => *ctus ! cacti => cactus
plusses => plus
gasses => gas
classes => class
mice => mouse
data => datum
!bases => basis
amebiases => amebiasis
atlases => atlas
Eliases => Elias
molasses
feet => foot
backhoes => backhoe
calories => calorie

! Some plurals that don't make sense as singular
sales
news
jeans
