LucidWorks Enterprise v1.5

This is the documentation for LucidWorks Enterprise v1.5. The latest version is 2.1

Some Solr term analysis filters are driven by rules contained in text files. These include:

  • Stop words
  • Synonyms
  • Stemming rules

The actual file names will vary and are determined by the filter's entry in solrconfig.xml as well as the LucidWorks Enterprise admin interface.

Stop Words File Format

The format is the same as the Solr stop words file format.

One stop word per line.
Example:

a
an
and
are
as
at
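
As a rough illustration of the one-word-per-line format, the following Python sketch loads a stop words list and filters a token list with it. This is not Solr's or LucidWorks Enterprise's implementation; the function name load_stop_words is hypothetical, and trimming whitespace and skipping blank lines are assumptions.

```python
def load_stop_words(text):
    """Return the set of stop words found in `text`, one word per line."""
    words = set()
    for line in text.splitlines():
        word = line.strip()   # assumption: surrounding whitespace is not significant
        if word:              # assumption: blank lines are skipped
            words.add(word)
    return words


sample = "a\nan\nand\nare\nas\nat\n"
stop_words = load_stop_words(sample)

tokens = ["an", "apple", "and", "a", "pear"]
kept = [t for t in tokens if t not in stop_words]
print(kept)  # ['apple', 'pear']
```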

Synonyms File Format

The format is the same as the Solr synonyms file format.

Blank lines and lines starting with a pound sign (#) are treated as comments and ignored.
Explicit mappings match any token sequence on the left-hand side of "=>" and replace it with all alternatives on the right-hand side. These mappings ignore the expand parameter in the schema.

Examples:

i-pod, i pod => ipod

sea biscuit, sea biscit => seabiscuit

Equivalent synonyms may be listed on one line, separated by commas, with no explicit mapping. This allows the same synonym file to be used with different synonym handling strategies.

Example:

lawyer, attorney
one, 1
two, 2
three, 3
ten, 10
hundred, 100
thousand, 1000
tv, television


#multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
#is equivalent to
foo => foo bar, baz
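
The rules above can be sketched in Python as a small parser that separates explicit mappings from equivalence groups and merges repeated mappings for the same left-hand side. This is only an illustration of the file format, not the parser Solr uses; the function name parse_synonyms and the returned data structures are hypothetical.

```python
def parse_synonyms(text):
    """Split synonym-file text into explicit mappings and equivalence groups."""
    explicit = {}     # left-hand token sequence -> list of right-hand alternatives
    equivalent = []   # each entry is a list of interchangeable terms
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are ignored
        if "=>" in line:
            lhs, rhs = line.split("=>", 1)
            targets = [t.strip() for t in rhs.split(",") if t.strip()]
            for source in (s.strip() for s in lhs.split(",")):
                if source:
                    # Repeated entries for the same source are merged.
                    merged = explicit.setdefault(source, [])
                    merged.extend(t for t in targets if t not in merged)
        else:
            group = [t.strip() for t in line.split(",") if t.strip()]
            equivalent.append(group)
    return explicit, equivalent


sample = """\
i-pod, i pod => ipod
foo => foo bar
foo => baz
# a comment line
lawyer, attorney
"""
explicit, equivalent = parse_synonyms(sample)
print(explicit["foo"])   # ['foo bar', 'baz']
print(equivalent)        # [['lawyer', 'attorney']]
```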

Lucid Plural Stemmer Rules File Format

The Lucid plural stemmer is designed to focus on stemming of plural words into their singular forms. It is not perfect, but it is rule-based and the rules can be supplemented and tuned to handle a wide range of exceptions. Individual words can be protected from stemming and individual words can be given special-case stem words. But usually, general patterns cover wide classes of words.

The rules mostly map plurals to singulars, primarily for words ending with "s", but verb forms ending with "s" fall under the same heuristic rules.

It is understood that this simple heuristic approach will mangle a non-zero fraction of words that should either not be stemmed or should be stemmed differently.

The rules do go to some effort to avoid removing "s" endings that are not plural (or verb conjugations), such as "alias" or "business."

The input token does not need to be lowercase, but any change made by stemming will be lowercase.

The filter (factory) is named com.lucid.analysis.LucidPluralStemFilterFactory. It has a "rules" parameter which names the rules file. The default rules file name is LucidStemRules_en.txt. It is expected that each natural language will have its own stemming rules file.

General Formatting rules

  1. An exclamation point begins a comment; the comment, or the whole line if it starts with one, is ignored.
  2. White space is extraneous and is ignored.
  3. Blank lines are ignored.
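
The three formatting rules can be sketched as a small pre-processing step in Python. The function name clean_rule_lines is hypothetical, and the sketch assumes that white space is removed everywhere in a line (not only at the ends); the actual filter may behave differently.

```python
def clean_rule_lines(text):
    """Return the rule lines in `text` with comments, white space, and blanks removed."""
    cleaned = []
    for raw in text.splitlines():
        body = raw.split("!", 1)[0]    # rule 1: drop everything from "!" onward
        body = "".join(body.split())   # rule 2: assumption, strip all white space
        if body:                       # rule 3: skip lines that are now empty
            cleaned.append(body)
    return cleaned


sample = "! header comment\n\n?     ! Protects 1-char words.\n*ses => se\n"
print(clean_rule_lines(sample))  # ['?', '*ses=>se']
```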

Types of stemming rules

Protected word

Write the word by itself on a line; it will not be changed.

  • word

Replacement word

The word will always be changed to the replacement word.

  • word => new-word
  • word -> new-word
  • word --> new-word
  • word = new-word
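
As a sketch of how the four equivalent separators might be recognized, the Python below splits a whole-word rule into the word and its replacement, returning None as the replacement for a protected word. The function name parse_word_rule and the regular expression are illustrative assumptions, not the filter's actual parsing code.

```python
import re

# The alternation lists "-->" before "->" and "=>" before "=" for readability;
# the non-greedy left side makes the match stop at the first separator.
_RULE = re.compile(r"^(.+?)\s*(=>|-->|->|=)\s*(.*)$")


def parse_word_rule(line):
    """Return (word, replacement); replacement is None for a protected word."""
    line = line.strip()
    match = _RULE.match(line)
    if match is None:
        return line, None          # no separator: a protected word
    return match.group(1), match.group(3)


for rule in ["word", "word => new-word", "word -> new-word",
             "word --> new-word", "word = new-word"]:
    print(parse_word_rule(rule))
```

The first rule yields ('word', None); the other four all yield ('word', 'new-word').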

Protected suffixes

Any matching word will be protected.

  • pattern suffix

The pattern may start with an asterisk to indicate a variable-length prefix.
Use question marks to indicate required characters, one question mark per character; a pattern may contain none.
Use a trailing slash if a consonant is required.

Examples:

  • ?ass
  • *??ass
  • *???/ass
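
One way to read these patterns is as a restricted wildcard syntax. The Python sketch below translates a pattern into a regular expression under the following assumptions, none of which are confirmed by this page: an asterisk matches zero or more characters, each question mark matches exactly one character, a slash matches exactly one consonant (any letter other than a, e, i, o, u), and a pattern must match the whole lowercased word. The function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a suffix pattern into a compiled regex (assumed semantics)."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")                  # assumption: zero or more characters
        elif ch == "?":
            parts.append(".")                   # exactly one required character
        elif ch == "/":
            parts.append("[b-df-hj-np-tv-z]")   # assumption: one consonant
        else:
            parts.append(re.escape(ch))         # literal suffix character
    return re.compile("".join(parts))


def is_protected(word, patterns):
    """True if any pattern matches the whole lowercased word."""
    word = word.lower()
    return any(pattern_to_regex(p).fullmatch(word) for p in patterns)


patterns = ["?ass", "*??ass", "*???/ass"]
print(is_protected("mass", patterns))   # True, matched by ?ass
print(is_protected("class", patterns))  # True, matched by *??ass
print(is_protected("ass", patterns))    # False, too short for every pattern
```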

Translation suffix

The suffix of a matching word will be replaced with the new suffix.

  • pattern suffix => new-suffix

Pattern rules are the same as for protected suffixes.
The pattern may be repeated before the replacement suffix for readability.

Examples:

  • *ses => se
  • *ses -> *se
  • *?/uses => se
  • *???s =>
  • *???s => *

The last two examples show no new suffix, meaning that the existing suffix is simply removed.

Rules are evaluated in the order that they appear in the rules file, except that whole protected words and replacement words are processed before examining suffixes.

To restrict the minimum word length that is to be stemmed, simply create rules consisting of only question marks ('?') to match and protect words of those lengths.

For example, to protect words of fewer than four characters in length, add three rules before any other rules:

?     ! Protects 1-char words.
??    ! Protects 2-char words.
???   ! Protects 3-char words.
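
For a different minimum length, the same kind of rules can be generated rather than typed by hand. The Python sketch below is only a convenience for producing rules-file text; the function name min_length_rules is hypothetical, and the output still has to be placed at the top of the rules file manually.

```python
def min_length_rules(min_length):
    """Return question-mark rules protecting words shorter than `min_length`."""
    return ["?" * n for n in range(1, min_length)]


for rule in min_length_rules(4):
    # Pad the rule and append a comment in the rules-file comment syntax.
    print(f"{rule:<5} ! Protects {len(rule)}-char words.")
```

Calling min_length_rules(4) returns the three rules shown above: "?", "??", and "???".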