Thursday, July 11, 2013

Solr Analyzers Basics


Solr offers several Analyzers to pre-process document fields being indexed and searched. As part of modelling the schema one needs to make an informed choice for the specific chain of Analyzers to be applied to every field (fieldType) defined in the schema.xml.

To start off one needs to understand that different kinds of Analyzers and their purpose:
  1. Char Filters (or CharacterFilterFactories)
    • Always applied first, i.e. before Tokenizers
    • Operates at the Character level (of the field values)
    • Zero or More Char Filters can be chained together. Get applied as per the sequence in schema.xml
  2. Tokenizers (or TokenizerFactories)
    • Converts stream of Characters into a series of Tokens
    • Only One Tokenizer can be there in each Analyzer chain
  3. Token Filters (or TokenFilterFactories)
    • Always applied last, i.e. after Tokenizers
    • Operates at the Tokens level generated by the Tokenizers
    • Zero or More Token Filters can be chained together. Get applied as per the sequence in schema.xml








To take an example, let's say we have a field title with the value (V1) "Mr. James <b>Bond</b> MI007". Now we run it through the following:

1. Character FilterFactory (One): HTMLStripCharFilterFactory (CF1)

(Output: "Mr. James Bond MI007")

2. Tokenizer (One): StandardTokenizerFactory (T)

(Output: Tokens: [ALPHANUM: "Mr.", ALPHANUM:"James", ALPHANUM:"Bond", ALNUM:"MI007"])

3. TokenFilters (Two): WordDelimiterFilterFactory (TF1) & LowerCaseFilterFactory (TF2) 
  • Mr. => WordDelim => Lowercase => mr.
  • James => WordDelim => Lowercase => james
  • Bond => WordDelim => Lowercase => bond
  • MI007 => WordDelim => [MI, 007] => Lowercase => mi, 007
Finally the output text actually indexed: "mr. james bond mi 007"

There are several other options and many more Analyzers that one could.  Among them the different PatternReplace Analyzers, EdgeNGram and the simple WhiteSpaceFilterFactory are the more popular ones.  Finally, if none of the standard ones are adequate for a specific use case then there is also the option of writing a custom analyzer.