Tuesday, November 20, 2012

C# - A home away from home

For someone with years of experience in Java, a stint coding in C# seemed like a cakewalk. The large-scale ports of popular Java frameworks to .NET, such as NHibernate, Spring.NET and NUnit, make life that much easier for anyone starting to bridge the gap.

There are, however, some bits that trouble us Java natives no end. Particularly anything & everything to do with the web.config file. This one file has enough traps in it to confuse the hell out of any sane-minded developer. The file carries hints for the IDE (Visual Studio), the framework (ASP.NET), the web server (IIS), & everyone else connected to the runtime.
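To illustrate (a made-up fragment, not from any real project), each section of web.config speaks to a different consumer:

```xml
<configuration>
  <appSettings>
    <!-- read by application code at runtime -->
    <add key="CacheTimeoutSeconds" value="30" />
  </appSettings>
  <system.web>
    <!-- hints for the ASP.NET framework -->
    <compilation debug="true" targetFramework="4.0" />
  </system.web>
  <system.webServer>
    <!-- hints for the IIS web server (IIS 7+) -->
    <modules runAllManagedModulesForAllRequests="true" />
  </system.webServer>
</configuration>
```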

The other bit that seems bothersome is how dependencies, packages & DLLs get referenced. Particularly with frequently changing DLLs, & tools optimized for caching, it gets difficult to know which version is really loaded & running.
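One partial remedy (sketched here with a hypothetical assembly & made-up version numbers) is an assembly binding redirect in web.config, which at least pins down which version the runtime should load:

```xml
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <!-- hypothetical assembly & token, for illustration only -->
        <assemblyIdentity name="Some.Library" publicKeyToken="32ab4ba45e0a69a1" culture="neutral" />
        <!-- all references to older versions resolve to 3.3.0.0 -->
        <bindingRedirect oldVersion="0.0.0.0-3.3.0.0" newVersion="3.3.0.0" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>
```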

Anyway, once you get past these issues, the ride ahead is through familiar ground.

Friday, November 2, 2012

Using Pentaho Kettle to Index Data in Solr

Pentaho Kettle is a fine open source ETL tool written in Java. Several implementations, hooks and plugins are available off the shelf for performing Extract (E), Transform (T) and Load (L) operations on data moving from a source location to a destination location.

Solr, on the other hand, is a rich and powerful production-grade search engine built on top of Lucene. So how would it be to get the two to function in tandem, using Kettle to load data into Solr for indexing?

The data load phase for indexing in Solr is very similar to an ETL process. The data is sourced (Extract) from a relational database (MySQL, PostgreSQL, etc.). This data is denormalized and transformed into a Solr-compatible document (Transform). Finally, the transformed data is streamed to Solr for indexing (Load). Kettle excels at performing each of these steps!
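A minimal sketch of the three steps in plain Java with JDBC and SolrJ (the database, query and field names are all made up; in Kettle each step would be a configured transformation rather than hand-written code):

```java
import java.sql.*;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ProductIndexer {
    public static void main(String[] args) throws Exception {
        // Extract: pull denormalized rows from the source database
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "secret");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT p.id, p.name, c.name AS category FROM product p "
              + "JOIN category c ON p.category_id = c.id");

        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        while (rs.next()) {
            // Transform: map each row to a Solr-compatible document
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("id"));
            doc.addField("name", rs.getString("name"));
            doc.addField("category", rs.getString("category"));

            // Load: stream the document to Solr for indexing
            solr.add(doc);
        }
        solr.commit();

        rs.close();
        stmt.close();
        conn.close();
    }
}
```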

A Kettle ETL job that loads data into Solr for indexing is a good alternative to Solr's very own Data Import Handler (DIH). Since the DIH typically runs off the same Solr setup (sharing a few common dependencies), there is some intermixing of concerns with such a set-up, between what Solr is good at (search & indexing) and what the DIH is built to do (import documents). The DIH also competes for resources (CPU, IO) with Solr. Kettle has no such drawbacks and can be run off a different set of physical boxes.
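For contrast, the rough DIH equivalent of the sketch above is a data-config.xml that lives inside the Solr process itself (same made-up schema and query):

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/shop"
              user="user" password="secret"/>
  <document>
    <entity name="product"
            query="SELECT p.id, p.name, c.name AS category FROM product p
                   JOIN category c ON p.category_id = c.id">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="category" name="category"/>
    </entity>
  </document>
</dataConfig>
```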

There are additional benefits to using Kettle, such as the availability of stable implementations for working across data sources, querying, bulk loading, and setting up staged workflows with configurable queues & worker threads. Kettle's exception handling, retry mechanism, REST/WS client, JSON serializer, custom Java code extension, and several handy transformation capabilities all add up in its favour.

On the cons side, given that the call to Solr from Kettle would go via a standard REST client, the set-up would not be SolrCloud or ZooKeeper (ZK) aware and could not do any smart routing of documents. One option to solve this could be to use the Custom Java Code step in Kettle and delegate the call to Solr via SolrJ's CloudSolrServer client (which is SolrCloud/ZK aware), as sketched below.
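A rough sketch of such a delegate (ZK hosts, collection and field names are placeholders; the real thing would sit inside Kettle's Custom Java Code step):

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudAwareLoader {
    private final CloudSolrServer solr;

    public CloudAwareLoader() throws Exception {
        // Connecting via ZooKeeper keeps the client aware of the
        // cluster state, so updates go to live nodes in the cloud
        solr = new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
        solr.setDefaultCollection("products");
    }

    public void index(String id, String name) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("name", name);
        solr.add(doc); // routed by the ZK-aware client
    }

    public void close() {
        solr.shutdown();
    }
}
```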