Algorithms, Design, Code and more: July 2013

Solr offers several Analyzers to pre-process document fields at index time and at search time. When modelling the schema, one needs to make an informed choice about the specific chain of Analyzers to apply to each field type (fieldType) defined in schema.xml.

To start off, one needs to understand the different kinds of components that make up an analyzer chain, and their purposes:

Char Filters (or CharFilterFactories)

Always applied first, i.e. before the Tokenizer
Operate at the character level (on the raw field value)
Zero or more Char Filters can be chained together; they are applied in the order in which they appear in schema.xml

Tokenizers (or TokenizerFactories)

Convert the stream of characters into a series of tokens
Exactly one Tokenizer is allowed in each analyzer chain

Token Filters (or TokenFilterFactories)

Always applied last, i.e. after the Tokenizer
Operate on the tokens generated by the Tokenizer
Zero or more Token Filters can be chained together; they are applied in the order in which they appear in schema.xml
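
As a rough sketch of where these three kinds of components sit inside a fieldType, the skeleton below shows the smallest possible chain: no Char Filters, one Tokenizer and no Token Filters. The name text_minimal is made up for illustration, and the choice of WhitespaceTokenizerFactory is only an assumption to keep the example short.

```xml
<!-- Hypothetical field type: the name "text_minimal" is illustrative only -->
<fieldType name="text_minimal" class="solr.TextField">
  <analyzer>
    <!-- Zero or more <charFilter> elements go here; they run first, in the order listed -->

    <!-- Exactly one <tokenizer> element -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>

    <!-- Zero or more <filter> (Token Filter) elements go here; they run last, in the order listed -->
  </analyzer>
</fieldType>
```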

To take an example, let's say we have a field title with the value (V1) "Mr. James <b>Bond</b> MI007". Now we run it through the following:

1. Char Filter (One): HTMLStripCharFilterFactory (CF1)

(Output: "Mr. James Bond MI007")

2. Tokenizer (One): StandardTokenizerFactory (T)

(Output: Tokens: [ALPHANUM: "Mr", ALPHANUM: "James", ALPHANUM: "Bond", ALPHANUM: "MI007"]. Note that the StandardTokenizer drops the trailing period of "Mr.")

3. Token Filters (Two): WordDelimiterFilterFactory (TF1) & LowerCaseFilterFactory (TF2)

Mr => WordDelim => Lowercase => mr
James => WordDelim => Lowercase => james
Bond => WordDelim => Lowercase => bond
MI007 => WordDelim => [MI, 007] => Lowercase => mi, 007

(WordDelim splits MI007 at the letter-to-digit transition, which is its default behaviour.)

Finally, the tokens actually indexed: "mr", "james", "bond", "mi", "007"
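
The chain used in this example could be declared in schema.xml roughly as follows. This is a sketch rather than a tested configuration: the field type name text_title is made up, and the WordDelimiterFilterFactory attributes are written out explicitly only to make the splitting behaviour visible (they match what I understand to be the defaults).

```xml
<!-- Hypothetical field type name; the components are listed in the order they are applied -->
<fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- CF1: strips HTML markup such as <b>...</b> from the raw value -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>

    <!-- T: splits the cleaned text into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>

    <!-- TF1: splits tokens on intra-word delimiters and letter/digit transitions (MI007 becomes MI, 007) -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnNumerics="1"/>

    <!-- TF2: lowercases every token -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The title field from the example, bound to the field type above -->
<field name="title" type="text_title" indexed="true" stored="true"/>
```

Since no type attribute is given on the analyzer element, the same chain is used at both index and query time. The Analysis screen of the Solr admin UI is a convenient way to check which tokens each stage actually produces for a given input.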

There are several other options and many more components that one could use. Among them, the PatternReplace family (PatternReplaceCharFilterFactory and PatternReplaceFilterFactory), EdgeNGramFilterFactory and the simple WhitespaceTokenizerFactory are some of the more popular ones. Finally, if none of the standard ones is adequate for a specific use case, there is also the option of writing a custom analyzer.
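
To give a feel for how some of these alternatives combine, here is a hedged sketch of a prefix-matching (autocomplete-style) field type. The name text_prefix, the regular expression and the gram sizes are all assumptions chosen for illustration, not recommendations.

```xml
<!-- Hypothetical field type for prefix matching; name, pattern and gram sizes are illustrative -->
<fieldType name="text_prefix" class="solr.TextField">
  <analyzer type="index">
    <!-- Char filter: replace runs of hyphens or underscores with a single space -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[-_]+" replacement=" "/>
    <!-- Tokenizer: split on whitespace only -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Token filter: emit leading prefixes of each token, from 2 up to 10 characters long -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same normalisation at query time, but without the n-gram step -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[-_]+" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With these settings, the token bond would be indexed as bo, bon and bond, so a query for "bon" matches it. The EdgeNGram step is left out of the query-time analyzer so that the user's input is matched as typed rather than being broken into prefixes itself.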

Thursday, July 11, 2013

Solr Analyzers Basics