Saturday, August 31, 2013

Internals Of Solr/ Lucene Document Scoring

This post continues a discussion in the Solr community about the efficiency of the Solr/ Lucene scoring algorithm.

The search algorithm given here can be summarized as:

- Query query =  Build query using user's search terms.
- Collector collector = Typically the TopScoreDocCollector
- Searcher searcher = new IndexSearcher(indexReader);
- searcher.search(query, collector);
- Weight weight = query.weight(searcher);
- Scorer scorer = weight.scorer(indexReader); // Typically BooleanScorer2
- scorer.score() => ConjunctionScorer (on every sub-scorer) in a leap frog/ skip ahead mechanism.

Algo needs improvement!

The AND query follows a leap frog/ skip ahead pattern implemented at the BooleanScorer2 (ConjunctionScorer) level.

For example, with the query q=A AND B, where A & B match these doc ids:
A -> 1,3,5,7,11,15,17
B -> 2, 6

- Scorer starts with the min. of each, i.e. A -> 1  & B -> 2, & current highest doc id set to 2

- In the next few iterations:
A is advanced past the current highest value to 3 & current highest updated to 3.
B advanced past current highest 3 to 6 & current highest set to 6.
A advanced past 6 to 7 & current highest set to 7.
B has no more docs & this breaks out, without any match.

On the other hand, if the two had converged/ agreed on a particular doc id, that doc would have been scored & collected (added to a min-heap of scores).
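
The walk-through above can be sketched in a few lines of Java. This is only an illustration of the leap frog idea, not Lucene's actual ConjunctionScorer code: plain sorted int arrays stand in for the doc id lists of A and B, and the class and method names (LeapFrogIntersection, intersect, advance) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: sorted int arrays stand in for the
// doc id lists that the real sub-scorers would iterate over.
final class LeapFrogIntersection {

    // Returns the doc ids present in BOTH sorted arrays.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> matches = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                // Both lists agree: this doc would be scored & collected.
                matches.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                // A is behind: leap A ahead to the first id >= B's current id.
                i = advance(a, i, b[j]);
            } else {
                // B is behind: leap B ahead to the first id >= A's current id.
                j = advance(b, j, a[i]);
            }
        }
        // Either list running out ends the search, as in the example above.
        return matches;
    }

    // Index of the first element >= target, starting at 'from';
    // returns ids.length when the list is exhausted.
    // A linear scan keeps the sketch short.
    static int advance(int[] ids, int from, int target) {
        int k = from;
        while (k < ids.length && ids[k] < target) {
            k++;
        }
        return k;
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7, 11, 15, 17};
        int[] b = {2, 6};
        System.out.println(intersect(a, b));                      // prints []
        System.out.println(intersect(a, new int[] {3, 6, 11}));   // prints [3, 11]
    }
}
```

With the lists from the example (A -> 1,3,5,7,11,15,17 & B -> 2,6) the loop visits 1, 2, 3, 6, 7 and then stops because B is exhausted, so nothing is collected.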

Thursday, August 15, 2013

Update Apt Repositories Location for Old Ubuntu Versions

When working with an old version of Ubuntu (11.04, 10.04, etc.), the biggest handicap is the lack of a functional package manager such as apt or synaptic. The package managers stop working because, at the end of support for an old version of Ubuntu, the team behind Ubuntu archives its repositories.

At this point, as a user, you are supposed to upgrade (the recommended practice) to a more recent version. Ubuntu's Update Manager normally sends out enough advance notices and alerts about this. If, however, you have a compelling reason to stick with your current version, here's a way to update your repositories' sources list so that you can install and use old software that is present in the archival repository. This is based on the recommendation made on this forum discussion.

Sunday, August 11, 2013

Resume Large Downloads in Mozilla Firefox

When downloading a large file via Firefox over a slow internet connection, you might get disconnected midway and end up with a partially downloaded file (with a .part file extension). Here's a little trick to resume the download after reconnecting, from the point where it had stopped.

1. Open up the Firefox Downloads window (Tools > Downloads OR use the shortcut Ctrl + Shift + Y). Not sure if this works with recent versions of Firefox.

2. Click on the Resume button next to the partially downloaded file. If this works then great, nothing else to do.

3. On the other hand, if step 2 didn't work, click on the Retry button. This makes the download start all over again from the very beginning. Let it start and go on to step 4.

4. Once a few bytes of the file have been downloaded & the progress meter in the Downloads window indicates that the new download has started (it might also give an estimate of the time left), click on the Pause button next to the download.

5. Now go to your Downloads folder (where Firefox was downloading the file). Rename the first, partially downloaded file (the one with the .part extension) to the name of the new file that just started downloading in step 4, so that it takes the new file's place.

6. Go back to the Firefox Downloads window and click the Resume button next to the download that was paused in step 4.

That's it. The download should resume from the point where the initial, partially downloaded (.part) file had stopped.

Thursday, August 1, 2013

Trees and Graphs

Useful things to know about tree- and graph-based data structures:

These:
  • Binary Trees Vs. Binary Search Trees
  • 2-4 and Red-Black Trees
  • AVL Trees
  • Tries
  • Heaps
  • B & B+ Trees
& these:
  • BFS, DFS
  • Sorting - Quick, Merge, Radix, Timsort
  • Kruskal's & Prim's algorithm for Minimum Spanning Trees
  • Morris Traversal, without extra space or recursion, using Threaded Binary Trees
  • Dijkstra's algorithm for shortest path
  • Topological sorting
That horses for courses is applicable:

The big-O deal:

With Java, well-tested implementations are mostly available:
  • TreeSet
  • TreeMap
  • LinkedHashMap
  • ConcurrentSkipListMap
  • PriorityQueue
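
A quick, non-exhaustive sketch of how these JDK classes behave. The class name JdkTreeStructuresDemo and the topN helper are made up for this example. TreeSet and TreeMap are backed by a red-black tree, LinkedHashMap keeps insertion order, ConcurrentSkipListMap is a sorted, thread-safe map, and PriorityQueue is a binary heap (a min-heap by default).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative demo only; names and sample values are made up.
final class JdkTreeStructuresDemo {

    // Keeps the n largest scores in a min-heap: the smallest
    // retained score always sits at the head of the queue.
    static PriorityQueue<Double> topN(double[] scores, int n) {
        PriorityQueue<Double> heap = new PriorityQueue<>();
        for (double score : scores) {
            heap.offer(score);
            if (heap.size() > n) {
                heap.poll(); // evict the current minimum
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        // TreeSet: sorted iteration plus nearest-element lookups.
        TreeSet<Integer> docIds = new TreeSet<>(Arrays.asList(7, 1, 5, 3));
        System.out.println(docIds);             // prints [1, 3, 5, 7]
        System.out.println(docIds.ceiling(4));  // prints 5 (smallest id >= 4)

        // TreeMap: keys kept in sorted order.
        TreeMap<String, Integer> termCounts = new TreeMap<>();
        termCounts.put("solr", 2);
        termCounts.put("lucene", 5);
        System.out.println(termCounts.firstKey()); // prints lucene

        // LinkedHashMap: iteration follows insertion order.
        Map<String, Integer> ordered = new LinkedHashMap<>();
        ordered.put("solr", 2);
        ordered.put("lucene", 5);
        System.out.println(ordered.keySet());   // prints [solr, lucene]

        // ConcurrentSkipListMap: sorted like TreeMap, safe for concurrent use.
        ConcurrentSkipListMap<Integer, String> byId = new ConcurrentSkipListMap<>();
        byId.put(11, "doc-11");
        byId.put(3, "doc-3");
        System.out.println(byId.firstKey());    // prints 3

        // PriorityQueue: keep only the top 2 scores.
        PriorityQueue<Double> top = topN(new double[] {0.2, 0.9, 0.5, 0.7}, 2);
        System.out.println(top.peek());         // prints 0.7
    }
}
```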