You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of <analyzer>:
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
</fieldType>
The class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement org.apache.solr.analysis.TokenizerFactory. A TokenizerFactory's create() method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it passes a Reader object that provides the content of the text field.
Arguments may be passed to tokenizer factories by setting attributes on the <tokenizer> element:
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
  </analyzer>
</fieldType>
The following sections describe the tokenizer factory classes included in this release of Solr. For more information about Solr's tokenizers, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
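Tokenizers discussed in this section: Standard, Classic, Keyword, Letter, Lower Case, N-Gram, Edge N-Gram, ICU, Path Hierarchy, Regular Expression Pattern, Type, UAX29 URL Email, and White Space.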
Standard Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
- Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.
- The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.
Note that words are split at hyphens.
The Standard Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
Factory class: solr.StandardTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"
Classic Tokenizer
The Classic Tokenizer preserves the behavior of the Standard Tokenizer of Solr versions 3.1 and earlier. It does not use the Unicode standard annex UAX#29 word boundary rules that the Standard Tokenizer uses. This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
- Periods (dots) that are not followed by whitespace are kept as part of the token.
- Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
- Recognizes Internet domain names and email addresses and preserves them as a single token.
Factory class: solr.ClassicTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"
Keyword Tokenizer
This tokenizer treats the entire text field as a single token.
Factory class: solr.KeywordTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Letter Tokenizer
This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.
Factory class: solr.LetterTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
In: "I can't."
Out: "I", "can", "t"
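The behavior described above can be approximated outside Solr. The following is a minimal Python sketch, not Solr's implementation; the function name letter_tokenize is hypothetical, and the regular expression only approximates Lucene's notion of a letter.

```python
import re

# Assumption: "letter" is approximated as a Unicode word character
# that is neither a decimal digit nor an underscore.
_LETTER_RUN = re.compile(r"[^\W\d_]+")


def letter_tokenize(text):
    """Return runs of contiguous letters, discarding everything else."""
    return _LETTER_RUN.findall(text)


print(letter_tokenize("I can't."))  # ['I', 'can', 't']
```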
Lower Case Tokenizer
Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and non-letters are discarded.
Factory class: solr.LowerCaseTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
In: "I just LOVE my iPhone!"
Out: "i", "just", "love", "my", "iphone"
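As a rough illustration of the two steps (split at non-letters, then lowercase), here is a small Python sketch. It is an approximation for experimentation only; lower_case_tokenize is a hypothetical name and the regular expression only approximates Lucene's definition of a letter.

```python
import re

# Assumption: a "letter" is a Unicode word character that is
# neither a decimal digit nor an underscore.
_LETTER_RUN = re.compile(r"[^\W\d_]+")


def lower_case_tokenize(text):
    """Split at non-letters, then lowercase each resulting token."""
    return [token.lower() for token in _LETTER_RUN.findall(text)]


print(lower_case_tokenize("I just LOVE my iPhone!"))
# ['i', 'just', 'love', 'my', 'iphone']
```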
N-Gram Tokenizer
Reads the field text and generates n-gram tokens of sizes in the given range.
Factory class: solr.NGramTokenizerFactory
Arguments:
minGramSize: (integer, default 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default 2) The maximum n-gram size, must be >= minGramSize.
Example:
Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character is included in the encoding.
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
In: "hey man"
Out: "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"
Example:
With an n-gram size range of 4 to 5:
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
In: "bicycle"
Out: "bicy", "icyc", "cycl", "ycle", "bicyc", "icycl", "cycle"
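The two examples above can be reproduced with a short Python sketch. This is only an illustration of n-gram generation over the whole field, not Solr's code; the function name ngrams is hypothetical, and the output order (all grams of one size before the next size) is chosen to match the examples in this section.

```python
def ngrams(text, min_gram_size=1, max_gram_size=2):
    """Return every n-gram of text for sizes min_gram_size..max_gram_size."""
    if min_gram_size < 1:
        raise ValueError("min_gram_size must be > 0")
    if max_gram_size < min_gram_size:
        raise ValueError("max_gram_size must be >= min_gram_size")
    grams = []
    for size in range(min_gram_size, max_gram_size + 1):
        # Slide a window of the current size over the whole field,
        # whitespace included.
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


print(ngrams("hey man"))
print(ngrams("bicycle", 4, 5))
```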
Edge N-Gram Tokenizer
Reads the field text and generates edge n-gram tokens of sizes in the given range.
Factory class: solr.EdgeNGramTokenizerFactory
Arguments:
minGramSize: (integer, default 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default 1) The maximum n-gram size, must be >= minGramSize.
side: ("front" or "back", default "front") Whether to compute the n-grams from the beginning (front) of the text or from the end (back).
Example:
Default behavior (min and max default to 1):
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
In: "babaloo"
Out: "b"
Example:
Edge n-gram range of 2 to 5:
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>
In: "babaloo"
Out: "ba", "bab", "baba", "babal"
Example:
Edge n-gram range of 2 to 5, from the back side:
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5" side="back"/>
</analyzer>
In: "babaloo"
Out: "oo", "loo", "aloo", "baloo"
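The effect of minGramSize, maxGramSize, and side can be sketched in a few lines of Python. This is an illustrative approximation rather than Solr's implementation; edge_ngrams and its parameter names are hypothetical.

```python
def edge_ngrams(text, min_gram_size=1, max_gram_size=1, side="front"):
    """Return n-grams anchored at the front or the back of text."""
    if min_gram_size < 1:
        raise ValueError("min_gram_size must be > 0")
    if max_gram_size < min_gram_size:
        raise ValueError("max_gram_size must be >= min_gram_size")
    if side not in ("front", "back"):
        raise ValueError('side must be "front" or "back"')
    grams = []
    # Never ask for a gram longer than the text itself.
    for size in range(min_gram_size, min(max_gram_size, len(text)) + 1):
        grams.append(text[:size] if side == "front" else text[-size:])
    return grams


print(edge_ngrams("babaloo"))                     # ['b']
print(edge_ngrams("babaloo", 2, 5))               # ['ba', 'bab', 'baba', 'babal']
print(edge_ngrams("babaloo", 2, 5, side="back"))  # ['oo', 'loo', 'aloo', 'baloo']
```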
ICU Tokenizer
This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.
Factory class: solr.ICUTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
In: "Testing บริษัทชื่อ נאסק"ר"
Out: "Testing", "บริษัท", "ชื่อ", "נאסק"ר"
Path Hierarchy Tokenizer
This tokenizer creates synonyms from file path hierarchies.
Factory class: solr.PathHierarchyTokenizerFactory
Arguments:
delimiter: (character, no default) You can specify the file path delimiter and replace it with a delimiter you provide. This can be useful for working with backslash delimiters.
replace: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.
Example:
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
  </analyzer>
</fieldType>
In: "c:\usr\local\apache"
Out: "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"
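To make the delimiter and replace arguments concrete, the following Python sketch produces the same tokens as the example above. It is a simplified illustration, not Solr's implementation; path_hierarchy is a hypothetical name, and skipping the empty prefix of an absolute path is an assumption of this sketch.

```python
def path_hierarchy(path, delimiter, replace=None):
    """Return one token per ancestor of path, using replace as the separator."""
    if replace is None:
        replace = delimiter  # assumption: keep the delimiter when no replacement is given
    parts = path.split(delimiter)
    tokens = []
    for depth in range(1, len(parts) + 1):
        prefix = replace.join(parts[:depth])
        if prefix:  # skip the empty prefix produced by a leading delimiter
            tokens.append(prefix)
    return tokens


print(path_hierarchy(r"c:\usr\local\apache", "\\", "/"))
# ['c:', 'c:/usr', 'c:/usr/local', 'c:/usr/local/apache']
```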
Regular Expression Pattern Tokenizer
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.
See the Javadocs for java.util.regex.Pattern for more information on Java regular expression syntax.
Factory class: solr.PatternTokenizerFactory
Arguments:
pattern: (Required) The regular expression, as defined in java.util.regex.Pattern.
group: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that character sequences matching that regex group should be converted to tokens. Group zero refers to the entire regex; groups greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right.
Example:
A comma separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces.
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
In: "fee,fie, foe , fum, foo"
Out: "fee", "fie", "foe", "fum", "foo"
Example:
Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token.
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>
In: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
Out: "Hello", "My", "Inigo", "Montoya", "You", "Prepare"
Example:
Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are numbered by counting left parentheses from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s([0-9-]+)" group="3"/>
</analyzer>
In: "SKU: 1234, Part Number 5678, Part: 126-987"
Out: "1234", "5678", "126-987"
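The two modes selected by the group argument (delimiter when -1, extraction when >= 0) can be tried out with Python's re module. This is a sketch under assumptions: pattern_tokenize is a hypothetical helper, Python's regular expression syntax differs from java.util.regex.Pattern in some details, and dropping empty tokens is a choice made here for simplicity.

```python
import re


def pattern_tokenize(text, pattern, group=-1):
    """Tokenize text with a regex, as a delimiter (group=-1) or an extractor."""
    regex = re.compile(pattern)
    if group < 0:
        # Delimiter mode: the tokens are the text between successive matches.
        tokens, last = [], 0
        for match in regex.finditer(text):
            tokens.append(text[last:match.start()])
            last = match.end()
        tokens.append(text[last:])
        return [token for token in tokens if token]
    # Extraction mode: the tokens are the requested group of each match.
    return [m.group(group) for m in regex.finditer(text) if m.group(group)]


print(pattern_tokenize("fee,fie, foe , fum, foo", r"\s*,\s*"))
print(pattern_tokenize(
    "Hello. My name is Inigo Montoya. You killed my father. Prepare to die.",
    r"[A-Z][A-Za-z]*", group=0))
print(pattern_tokenize(
    "SKU: 1234, Part Number 5678, Part: 126-987",
    r"(SKU|Part(\sNumber)?):?\s([0-9-]+)", group=3))
```

Delimiter mode is implemented with finditer rather than re.split because re.split would also return the text of any capture groups in the pattern.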
Type Tokenizer
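This component removes or keeps tokens according to their type, using either an exclude list or an include list. Although it is listed here with the tokenizers, it is a token filter and is configured with a <filter> element.
Factory class: solr.TypeTokenFilterFactory
Arguments:
types: The location of a file that lists the token types to filter.
enablePositionIncrements: If true, position increments are preserved, so the positions of removed tokens are left as gaps in the token stream.
useWhiteList: If true, the file defined in types is used as an include list; if false, it is used as an exclude list.
Example:
<analyzer>
  <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt" enablePositionIncrements="true" useWhiteList="false"/>
</analyzer>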
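The include and exclude behavior controlled by useWhiteList can be illustrated with a small Python sketch. It is not Solr's implementation: filter_by_type is a hypothetical name, tokens are modeled as (text, type) pairs, and the sample types are borrowed from the token types listed for the Standard Tokenizer.

```python
def filter_by_type(tokens, types, use_white_list=False):
    """Keep or drop (text, type) pairs depending on membership in types."""
    listed = set(types)
    if use_white_list:
        # Include list: keep only tokens whose type is listed.
        return [token for token in tokens if token[1] in listed]
    # Exclude list: drop tokens whose type is listed.
    return [token for token in tokens if token[1] not in listed]


sample = [("room", "<ALPHANUM>"), ("42", "<NUM>"), ("floor", "<ALPHANUM>")]
print(filter_by_type(sample, ["<NUM>"]))
print(filter_by_type(sample, ["<NUM>"], use_white_list=True))
```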
UAX29 URL Email Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
- Periods (dots) that are not followed by whitespace are kept as part of the token.
- Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
- Recognizes Internet domain names with a top-level domain (such as .com); email addresses; file://, http(s)://, and ftp:// addresses; and IPv4 and IPv6 addresses, and preserves each of them as a single token.
The UAX29 URL Email Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <URL>, <EMAIL>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
Factory class: solr.UAX29URLEmailTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
In: "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"
Out: "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail", "bob.cratchet@accarol.com"
White Space Tokenizer
Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation will be included in the tokenization.
Factory class: solr.WhitespaceTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
In: "To be, or what?"
Out: "To", "be,", "or", "what?"
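For comparison, splitting on whitespace alone is what Python's str.split() does when called without arguments, so the example above can be reproduced directly. This is only an analogy for experimentation (whitespace_tokenize is a hypothetical name), and the exact set of characters treated as whitespace may differ from Solr's.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to tokens."""
    return text.split()


print(whitespace_tokenize("To be, or what?"))  # ['To', 'be,', 'or', 'what?']
```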
