Algorithms, Design, Code and more: solr analyzers

Saturday, August 31, 2013

Internals Of Solr/ Lucene Document Scoring

This post continues a discussion in the Solr community about the efficiency of the Solr/Lucene scoring algorithm.

The search algorithm given here can be summarized as:

- Query query = Build query using user's search terms.
- Collector collector = Typically the TopScoreDocCollector
- Searcher searcher = new IndexSearcher(indexReader);
- searcher.search(query, collector);
- Weight weight = query.weight(searcher);
- Scorer scorer = weight.scorer(indexReader); // Typically BooleanScorer2
- scorer.score() => ConjunctionScorer (on every sub-scorer) in a leap frog/ skip ahead mechanism.

Algo needs improvement!

The AND query follows a leap frog/ skip ahead pattern implemented at the BooleanScorer2 (ConjunctionScorer) level.

For example with the query, q=A AND B, where A & B match doc. id's
A -> 1,3,5,7,11,15,17
B -> 2, 6

- The scorer starts at the first doc id of each list, i.e. A -> 1 & B -> 2, with the current highest doc id set to 2.

- In the next few iterations:
A is advanced to its first doc id at or beyond the current highest (2), i.e. 3, & current highest is updated to 3.
B is advanced beyond the current highest 3 to 6 & current highest is set to 6.
A is advanced beyond 6 to 7 & current highest is set to 7.
B has no more docs, so the loop breaks out without any match.

On the other hand if the two had converged/ agreed on a particular doc id, that doc would be scored & collected (added to a min-heap of scores).
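The leap frog walk described above can be sketched in plain Java. This is only a minimal illustration, not Lucene's ConjunctionScorer: the Postings class, its advance method and the intersect helper are hypothetical stand-ins for sub-scorers over sorted doc ids, and "score & collect" is reduced to adding the matching doc id to a list.

```java
import java.util.ArrayList;
import java.util.List;

final class LeapFrogDemo {
    // Sentinel meaning "this list is exhausted" (assumption for this sketch).
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for one sub-scorer: a cursor over sorted doc ids. */
    static final class Postings {
        private final int[] docs;
        private int pos = 0;

        Postings(int... sortedDocIds) {
            this.docs = sortedDocIds;
        }

        /** Moves to the first doc id >= target; returns it, or NO_MORE_DOCS. */
        int advance(int target) {
            while (pos < docs.length && docs[pos] < target) {
                pos++;
            }
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /** Returns the doc ids present in both lists (q = A AND B). */
    static List<Integer> intersect(Postings a, Postings b) {
        List<Integer> matches = new ArrayList<>();
        int docA = a.advance(0);
        int docB = b.advance(0);
        while (docA != NO_MORE_DOCS && docB != NO_MORE_DOCS) {
            if (docA == docB) {
                // Both agree on a doc id: this is where scoring/collecting would happen.
                matches.add(docA);
                docA = a.advance(docA + 1);
                docB = b.advance(docB + 1);
            } else if (docA < docB) {
                // A lags behind: leap it to the current highest doc id.
                docA = a.advance(docB);
            } else {
                // B lags behind: leap it to the current highest doc id.
                docB = b.advance(docA);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // The example from the post: no doc id is common to A and B.
        System.out.println(intersect(new Postings(1, 3, 5, 7, 11, 15, 17), new Postings(2, 6)));
        // A second pair where the two lists do converge.
        System.out.println(intersect(new Postings(1, 3, 5, 7), new Postings(3, 4, 7, 9)));
    }
}
```

For the lists in the post this prints [] (A stops at 7 when B runs out), and for the second pair it prints [3, 7]. Note that A's remaining doc ids 11, 15 and 17 are never visited, which is the point of skipping ahead.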

Thursday, July 11, 2013

Solr Analyzers Basics

Solr offers several Analyzers to pre-process document fields being indexed and searched. As part of modelling the schema one needs to make an informed choice for the specific chain of Analyzers to be applied to every field (fieldType) defined in the schema.xml.

To start off, one needs to understand the different kinds of Analyzer components and their purpose:

Char Filters (or CharFilterFactories)

Always applied first, i.e. before the Tokenizer
Operate at the character level (of the field values)
Zero or more Char Filters can be chained together. They are applied in the order listed in schema.xml

Tokenizers (or TokenizerFactories)

Converts the stream of characters into a series of tokens
Exactly one Tokenizer is allowed in each Analyzer chain

Token Filters (or TokenFilterFactories)

Always applied last, i.e. after the Tokenizer
Operate on the tokens generated by the Tokenizer
Zero or more Token Filters can be chained together. They are applied in the order listed in schema.xml

To take an example, let's say we have a field title with the value (V1) "Mr. James <b>Bond</b> MI007". Now we run it through the following:

1. CharFilterFactory (One): HTMLStripCharFilterFactory (CF1)

(Output: "Mr. James Bond MI007")

2. Tokenizer (One): StandardTokenizerFactory (T)

(Output: Tokens: [ALPHANUM:"Mr", ALPHANUM:"James", ALPHANUM:"Bond", ALPHANUM:"MI007"]. The tokenizer drops the period after "Mr", since punctuation followed by whitespace is treated as a token boundary.)

3. TokenFilters (Two): WordDelimiterFilterFactory (TF1) & LowerCaseFilterFactory (TF2)

Mr => WordDelim => Lowercase => mr
James => WordDelim => Lowercase => james
Bond => WordDelim => Lowercase => bond
MI007 => WordDelim => [MI, 007] => Lowercase => mi, 007

Finally the terms actually indexed: "mr james bond mi 007"
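The chain used in this example could be declared in schema.xml roughly as follows. This is a sketch under assumptions: the fieldType name text_title and the field attributes are made up for illustration, and the WordDelimiterFilterFactory options shown are one plausible setting rather than the configuration behind the example.

```xml
<!-- Hypothetical fieldType: one char filter, one tokenizer, two token filters -->
<fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- CF1: strips HTML markup such as <b>...</b> before tokenizing -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- T: exactly one tokenizer per chain -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- TF1: splits tokens on letter/number transitions, e.g. MI007 into MI and 007 -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
    <!-- TF2: lowercases every token -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title" type="text_title" indexed="true" stored="true"/>
```

With a single analyzer element and no type attribute, the same chain is applied at both index time and query time. The components run in the order they are listed.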

There are several other options and many more Analyzers that one could use. Among them, the PatternReplace variants (PatternReplaceCharFilterFactory and PatternReplaceFilterFactory), the EdgeNGram ones and the simple WhitespaceTokenizerFactory are the more popular. Finally, if none of the standard ones is adequate for a specific use case, there is also the option of writing a custom analyzer.