
Monday, January 22, 2018

Streaming Solutions

In the streaming solutions space, it all begins with event driven architecture. At its simplest this involves events (what triggers everything), handlers (responsible for taking action) & an event loop (for coordinating the two). When things get more involved & complicated, with multiple event streams/ sources, correlation across them, etc., solutions move into the Complex Event Processing (CEP) space.
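To make the three pieces concrete, here's a minimal, framework-agnostic sketch in Java (everything below is illustrative): a queue of events, handlers registered per event type & a loop dispatching one to the other.

import java.util.*;
import java.util.concurrent.*;

public class MiniEventLoop {
    interface Handler { void handle(String event); }   // responsible for taking action

    public static void main(String[] args) throws InterruptedException {
        // handlers registered per event type
        Map<String, List<Handler>> handlers = new HashMap<>();
        handlers.put("doc-indexed",
                Collections.singletonList(e -> System.out.println("refresh search tier for: " + e)));

        // events: whatever triggers everything (here just a queue fed by hand)
        BlockingQueue<String> events = new LinkedBlockingQueue<>();
        events.put("doc-indexed");

        // the event loop: pull events off the queue & dispatch to the registered handlers
        while (!events.isEmpty()) {
            String event = events.take();
            for (Handler h : handlers.getOrDefault(event, Collections.emptyList())) {
                h.handle(event);
            }
        }
    }
}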

Another very popular programming methodology in recent times is Reactive programming. In some senses this is a special case of event driven programming, with the focus on data changes (as the events) & the reactive steps that propagate those changes downstream (as the handlers).

A whole bunch of frameworks for streaming solutions have emerged from the Big Data ecosystem, such as Storm, Spark Streaming, Flink, etc. These allow for quick development of streaming solutions using high level abstractions. Even Solr now has streaming expression support for building distributed streaming search solutions.

Outside of these frameworks, Akka Streams seems promising. It's built on top of Akka's robust Actor model & the Reactive Streams API. Solutions such as Gearpump give a sense of the ground-up solutions possible with Akka Streams.
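For a quick taste of the Source -> transformation -> Sink style it encourages, a minimal Akka Streams sketch in Java (Akka 2.5.x era APIs; the system name & numbers are just placeholders):

import akka.actor.ActorSystem;
import akka.stream.ActorMaterializer;
import akka.stream.Materializer;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class StreamsTaste {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("streams-taste");
        Materializer mat = ActorMaterializer.create(system);

        Source.range(1, 100)                                      // an element/ event source
              .map(i -> i * i)                                    // a transformation stage
              .runWith(Sink.foreach(System.out::println), mat)   // a consuming sink
              .thenRun(system::terminate);                        // shut down once done
    }
}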

Thursday, November 28, 2013

Precision and Recall

Terms popular within search and Information Retrieval (IR) domains.

Precision: Is all about accuracy, i.e. whether all the results that have shown up are relevant.

Recall: Has to do with completeness, i.e. whether all the valid/ relevant results have shown up.

Needs detailing..
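In the meantime, a quick worked example with made-up numbers: precision = (relevant results retrieved) / (total results retrieved), while recall = (relevant results retrieved) / (total relevant results in the collection). So if a query returns 8 documents of which 6 are relevant, and the index holds 10 relevant documents in all, precision = 6/8 = 0.75 and recall = 6/10 = 0.6.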

Saturday, August 31, 2013

Internals Of Solr/ Lucene Document Scoring

This post is in continuation of a discussion on the Solr community mailing list about the efficiency of the Solr/ Lucene scoring algorithm.

The search algorithm given here can be summarized as:

- Query query = build the query from the user's search terms.
- Collector collector = typically a TopScoreDocCollector
- Searcher searcher = new IndexSearcher(indexReader);
- searcher.search(query, collector);
- Weight weight = query.weight(searcher);
- Scorer scorer = weight.scorer(indexReader); // typically BooleanScorer2
- scorer.score() => ConjunctionScorer (over every sub-scorer) using a leap frog/ skip ahead mechanism.

Algo needs improvement!

The AND query shows a leap frog/ skip ahead pattern implemented at the BooleanScorer2 (ConjunctionScorer) level.

For example, with the query q=A AND B, where A & B match doc ids:
A -> 1, 3, 5, 7, 11, 15, 17
B -> 2, 6

- The Scorer starts with the minimum doc id of each, i.e. A -> 1 & B -> 2, with the current highest doc id set to 2.

- In the next few iterations:
A is advanced past the current highest (2) to 3 & the current highest is updated to 3.
B is advanced past the current highest (3) to 6 & the current highest is set to 6.
A is advanced past 6 to 7 & the current highest is set to 7.
B has no more docs, so the loop breaks out without any match.

On the other hand if the two had converged/ agreed on a particular doc id, that doc would be scored & collected (added to a min-heap of scores).
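The leap frog itself can be sketched as below. This isn't the Lucene source, just a self-contained illustration; ArrayDocIdIterator is a hypothetical stand-in for Lucene's DocIdSetIterator contract (nextDoc(), advance(target), NO_MORE_DOCS).

public class LeapFrogDemo {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // a toy postings list iterator over a sorted array of doc ids
    static class ArrayDocIdIterator {
        private final int[] docs;
        private int pos = -1;
        ArrayDocIdIterator(int... docs) { this.docs = docs; }
        int nextDoc() { return ++pos < docs.length ? docs[pos] : NO_MORE_DOCS; }
        int advance(int target) {              // skip ahead to the first doc id >= target
            int doc;
            do { doc = nextDoc(); } while (doc < target);
            return doc;
        }
    }

    public static void main(String[] args) {
        ArrayDocIdIterator a = new ArrayDocIdIterator(1, 3, 5, 7, 11, 15, 17);
        ArrayDocIdIterator b = new ArrayDocIdIterator(2, 6);
        int docA = a.nextDoc(), docB = b.nextDoc();
        while (docA != NO_MORE_DOCS && docB != NO_MORE_DOCS) {
            if (docA == docB) {
                System.out.println("match on doc " + docA);  // would be scored & collected here
                docA = a.nextDoc();
                docB = b.nextDoc();
            } else if (docA < docB) {
                docA = a.advance(docB);        // leap frog A past B's current doc
            } else {
                docB = b.advance(docA);        // leap frog B past A's current doc
            }
        }
        // with the doc ids above nothing matches, exactly as in the walk-through
    }
}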

Thursday, July 11, 2013

Solr Analyzers Basics


Solr offers several Analyzers to pre-process document fields being indexed and searched. As part of modelling the schema, one needs to make an informed choice of the specific chain of analysis factories to be applied to every field type (fieldType) defined in schema.xml.

To start off, one needs to understand the different kinds of analysis factories that make up an Analyzer chain, and their purpose:
  1. Char Filters (or CharFilterFactories)
    • Always applied first, i.e. before the Tokenizer
    • Operate at the character level (on the raw field values)
    • Zero or more Char Filters can be chained together; they are applied in the sequence given in schema.xml
  2. Tokenizers (or TokenizerFactories)
    • Convert the stream of characters into a series of Tokens
    • There can be only one Tokenizer in each Analyzer chain
  3. Token Filters (or TokenFilterFactories)
    • Always applied last, i.e. after the Tokenizer
    • Operate on the Tokens generated by the Tokenizer
    • Zero or more Token Filters can be chained together; they are applied in the sequence given in schema.xml

To take an example, let's say we have a field title with the value (V1) "Mr. James <b>Bond</b> MI007". Now we run it through the following:

1. Character FilterFactory (One): HTMLStripCharFilterFactory (CF1)

(Output: "Mr. James Bond MI007")

2. Tokenizer (One): StandardTokenizerFactory (T)

(Output: Tokens: [ALPHANUM: "Mr.", ALPHANUM: "James", ALPHANUM: "Bond", ALPHANUM: "MI007"])

3. TokenFilters (Two): WordDelimiterFilterFactory (TF1) & LowerCaseFilterFactory (TF2) 
  • Mr. => WordDelim => Lowercase => mr.
  • James => WordDelim => Lowercase => james
  • Bond => WordDelim => Lowercase => bond
  • MI007 => WordDelim => [MI, 007] => Lowercase => mi, 007
Finally the output text actually indexed: "mr. james bond mi 007"
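Wired together in schema.xml, the corresponding fieldType would look roughly like the following (the type name & attribute values here are illustrative):

<fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>   <!-- CF1 -->
    <tokenizer class="solr.StandardTokenizerFactory"/>      <!-- T -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>  <!-- TF1 -->
    <filter class="solr.LowerCaseFilterFactory"/>            <!-- TF2 -->
  </analyzer>
</fieldType>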

There are several other options and many more analysis factories that one could use. Among them, the PatternReplace char/ token filters, EdgeNGram filters and the simple WhitespaceTokenizerFactory are the more popular ones. Finally, if none of the standard ones is adequate for a specific use case, there is also the option of writing a custom analyzer.

Monday, June 10, 2013

Solution for making Long GET Request to Solr via SolrNet

Solr has REST APIs available for performing various searches on indexed documents. The client generally issues GET requests to Solr with different parameters (fields, rows, facets, etc.) set. Since there typically are size/ query length limitations on GET requests (imposed by the container, OS, etc.), Solr allows the same queries to be issued to the Solr RequestHandlers as POST requests as well.

We ran into one such issue with long GET requests to Solr from SolrNet and made a few changes to solve it.

Solr Side Changes:
First up, we increased the headerBufferSize of the application server, as explained on SO here, and increased the maxBooleanClauses parameter in solrconfig.xml. This allowed the Solr side to start responding to much longer GET requests. The problem however wasn't fully solved: the client side was a dot net application running within IIS, which has additional length limitations imposed by the Windows OS & the dot net framework.
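For reference, the two settings were along the lines below (values illustrative; depending on the Jetty version bundled with Solr, the connector attribute may be headerBufferSize or requestHeaderSize):

In jetty.xml, on the connector:
    <Set name="headerBufferSize">65536</Set>

In solrconfig.xml, under the <query> section:
    <maxBooleanClauses>4096</maxBooleanClauses>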

SolrNet Side Changes:
In round two, we went for a better fix and switched over to POST requests in place of the long GET requests. The solution is largely the same as the one mentioned on the SolrNet group here & here. The difference being to switch over to a POST request from within the Get() method of the SolrConnection.cs class whenever the request string is longer than a configurable threshold value.


Update: The PostSolrConnection.cs class has made it to the head branch of SolrNet.

Monday, May 27, 2013

SolrNet Separate Highlighting Query - hl.q

Solr allows highlighting of matched sections in field values.  There are several parameters that can be set by the caller to adjust the highlighting behaviour.

SolrNet, a library to connect to Solr from dot net applications, also has HighlightingParameters exposed in the SolrNet core library. However, only a very small subset of the highlighting parameters is currently exposed.

Recently we needed to use the hl.q parameter, to issue a separate/ more specific highlighting query to Solr. The workaround was to make use of the ExtraParams option from the base CommonQueryOptions class.

The same approach could be used for any of the other parameters not exposed by SolrNet, such as hl.boundaryScanner, per field highlighting, maxScan, etc., essentially all the 3.5.x onward features mentioned on the Solr Highlighting wiki.

Friday, March 1, 2013

Atomic Updates via SolrNet

As of today the SolrNet API doesn't support issuing atomic updates to a running Solr server. While the SolrNet API is supposed to offer this feature sometime in the future, the following alternative can be used in the interim.

1. Build a custom atomic update XML message:
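A message along the lines below would do (field names are illustrative; update="set" replaces a field's value, while update="add" appends to a multi-valued field):

<add>
  <doc>
    <field name="id">book_123</field>
    <field name="price" update="set">45.99</field>
    <field name="tags" update="add">bestseller</field>
  </doc>
</add>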


(See: http://wiki.apache.org/solr/UpdateXmlMessages for more details)

2. Get hold of the connection object (via ServiceLocator):


3. Issue a call to Solr via the connection object:


Will be adding sample code snippets soon..

Friday, February 15, 2013

Solr Cell, Tika And Pages

With Solr Cell, which embeds Apache Tika, you get the power to index content from within a wide set of digital files such as PDFs, Office documents, text files, etc.

Tika however doesn't naturally offer any demarcation of page boundaries. So you can search for content matches within a file, but not for the specific pages of the file that match.

Among the several different ways to solve this problem, one could index each page of the file as a separate document in Solr and do field collapsing/ result grouping on the search results by a common file identifier shared by all pages of the file.
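For instance, a grouped query along these lines (field names illustrative) would return the matching pages collapsed under their parent file:

q=text:abc&group=true&group.field=file_id&group.limit=3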

Since there could be performance overheads with result grouping, another way is to index the combined file as one Solr document (of type combined) & each page as a separate Solr document (of type page) sharing a common file identifier. The search can then be performed first against the combined documents (type:combined AND text:abc) to identify the files that match, & then against the corresponding page type documents (type:page AND file-id:123 AND text:abc) to identify the matching pages.

Friday, November 2, 2012

Using Pentaho Kettle to Index Data in Solr

Pentaho Kettle is a fine open source ETL tool written in Java. There are several implementations, hooks and plugins available off the shelf for performing various Extract (E), Transform (T), Load (L) processes on data from a source location to a destination location.

Solr, on the other hand, is a rich and powerful production grade search engine written on top of Lucene. So how would it be to get the two to function in tandem, using Kettle to load data into Solr for indexing?

The data load phase for indexing in Solr is very similar to an ETL process. The data is sourced (Extract) from a relational database (MySQL, PostgreSQL, etc.). This data is denormalized and transformed into Solr compatible documents (Transform). Finally the transformed data is streamed to Solr for indexing (Load). Kettle excels at each of these steps!

A Kettle ETL job to load data into Solr for indexing is a good alternative to using Solr's very own Data Import Handler (DIH). Since the DIH typically runs off the same Solr setup (with a few common dependencies), there's some intermixing of concerns with such a set-up, between what Solr is good at (search & indexing) versus what the DIH is built to do (import documents). The DIH also competes for resources (CPU, IO) with Solr. Kettle has no such drawbacks and can be run off a different set of physical boxes.

There are additional benefits of using Kettle, such as the availability of stable implementations for working across data sources, querying, bulk loading, and setting up staged workflows with configurable queues & worker threads. Also, Kettle's exception handling, retry mechanism, REST/ WS client, JSON serializer, custom Java code extension, and several handy transformation capabilities all add up in its favour.

On the cons side, given that the call to Solr would be via a standard REST client from Kettle, the set-up would not be SolrCloud or ZooKeeper (ZK) aware, so it can't do any smart routing of documents. One option to solve this could be to use the Custom Java Code step in Kettle and delegate the call to Solr via SolrJ's CloudSolrServer client (which is SolrCloud/ ZK aware).
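The delegated call from that step would then be plain SolrJ usage, roughly along the lines below (SolrJ 4.x style; the ZK host, collection & field names are illustrative):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class KettleSolrCloudLoad {
    public static void main(String[] args) throws Exception {
        // CloudSolrServer is ZK aware & routes documents to the right shard leaders
        CloudSolrServer solr = new CloudSolrServer("zk1:2171");
        solr.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "row-42");
        doc.addField("title", "a denormalized row produced by the Kettle transform");
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}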

Wednesday, September 12, 2012

Remotely Debug Solr Cloud in Eclipse Using JPDA, JDWP, JVMTI & JDI


The acronyms first:
JPDA - Java Platform Debug Architecture
JDWP - Java Debug Wire Protocol
JVMTI - JVM Tool Interface
JDI - Java Debug Interface

To debug any of the open source Java projects such as Solr using Eclipse, one can rely on the JDWP feature available within any standard JVM. You can get a lot more info about the terms and the architecture here.

At a high level, the concept is that there is a JVM to be debugged, the debuggee (Solr), & a debugger JVM on the client side (Eclipse). The two communicate over JDWP. Thanks to the standardized wire protocol, the client may even be a non JVM application that subscribes to the protocol.

One of the two JVMs acts as the debugging server (the one that waits for the client to connect). The other JVM acts as the debugging client which connects to the debugger server, to start the debugging process.

In our case, to keep things simple let Solr be the debugger server, while Eclipse can be the debugger client. The configurations then are as follows.

On Solr side (assuming Solr Cloud):

java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 -Djetty.port=7200 -Dhost=myhost -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -Djava.util.logging.config.file=etc/logging.properties -DnumShards=3 -DzkHost=zk1:2171 -jar start.jar

Note: Since we have set suspend=y, the Solr side will stay suspended until the Eclipse debugger client has connected.

On Eclipse side:
Go to Run > Debug Configurations > Remote Java Application
Then choose Connection Type: Standard (Socket Attach). Host: localhost (or the host's IP). Port: 8000 (the same as set above).

Also in Eclipse you should have checked out the Solr source code from the Solr trunk as a project. This will allow you to put breakpoints at the appropriate locations to help with the debugging. So go on, give this a shot and happy debugging!