Thursday, November 28, 2013

Precision and Recall

Terms popular within search and Information Retrieval (IR) domains.

Precision: Is all about accuracy, i.e. what fraction of the results that have shown up are actually relevant.

Recall: Has to do with completeness, i.e. what fraction of all the valid/relevant results have shown up.
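
To make the two definitions concrete, here is a minimal sketch in Python. The function name precision_recall and the document ids are made up purely for illustration, and relevance is assumed to be a simple yes/no judgement per result.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved results.

    retrieved: ids the search engine returned (hypothetical sample data)
    relevant:  ids that are actually relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant results that were actually returned
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# 4 results shown, 2 of them relevant; 5 relevant documents exist in total.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.4
```

Here half of what was shown is relevant (precision 0.5), while only two of the five relevant documents were found (recall 0.4).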

Needs detailing..

Sunday, November 24, 2013

Pentaho 5.0 Community Edition Released

The stable build of Pentaho Community Edition (CE) 5.0 has been released. Many new features have made it into this build. I am particularly keen to try out the enhancements to web services deployment via Carte. More details can be found in the Pentaho 5.0 release notes.

Saturday, November 16, 2013

Pentaho Clusters

Pentaho provides the option to scale out Kettle Transformations via Pentaho Clusters. It is fairly straightforward to set up a Pentaho cluster and elastic/dynamic clusters. The 1-2-3 of what needs to be done is:

1. Start the Carte Instances
There are two kinds of instances: Masters and Slaves. At least one instance must act as the dedicated Master, which takes on the responsibility of managing and distributing transformations/steps to the slaves, handling fail-over/restart, and communicating with the slaves.

The Carte instances need a config file with details about the Master's port, IP/ Hostname etc. For sample config files take a look at the pwd folder in your default Pentaho installation (/data-integration/pwd).

E.g. With defaults, a cluster can be started on localhost with:


2. Set up Cluster & Server Information using Spoon (GUI)
Switch to the View tab, next to the Design tab in the left hand panel of the Spoon GUI.
Click on 'Slave Servers' to add new Slave servers (host, port, name, etc.). Make sure to check the 'is_the_master' checkbox for the Master server.

Next, click on 'Kettle Cluster Schemas' and use 'Select Slave servers' to choose the slave servers. For the ability to dynamically add/remove slave servers, also select the 'Dynamic Cluster' checkbox.

3. Mark Transformation Steps to Execute in Cluster Mode
Right click on the step which needs to be run in the cluster mode, select Clustering & then select the cluster schema. You will now see a symbol next to the step (CxN) indicating that the step is to be executed in a clustered mode.

The cluster settings will be similar to what you see in the left panel of the image below. The right panel shows a transformation with two steps (Random & Replace in String) being run in clustered mode.




Monday, November 11, 2013

Shannon Entropy and Information Gain


Shannon's Information Gain/Entropy theory gets applied a lot in areas such as data encoding, compression and networking. Entropy, as defined by Shannon, is a measure of the unpredictability of a given message. The higher the entropy, the more unpredictable the content of the message is to a receiver.

Correspondingly, a high Entropy message is also high on Information Content. On receiving a message with high Entropy/Information Content, the receiver has a high Information Gain.

On the other hand, when the receiver already knows the contents of the message (or is aware of a certain bias in it), the Information Content of the message is low. On receiving such a message the receiver has less Information Gain. Effectively, once the uncertainty about the content of the message has reduced, the Entropy of the message has also dropped and the Information Gain from receiving such a message has gone down. The reasoning thus far is quite intuitive.

The Entropy (& unpredictability) is highest for a fair coin (example 1.a) and decreases for a biased coin (examples 1.b & 1.c). Due to the bias, the receiver is able to predict the outcome (favouring the known bias) in the latter case, resulting in a lower Entropy.

The observation from the (2-outcomes) coin toss case generalizes to the N-outcomes case, and the Entropy is found to be highest when all N-outcomes are equally likely (fair).
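
The coin toss and N-outcomes observations can be checked numerically with a small Python sketch. It assumes the standard Shannon formula, H = sum of p * log2(1/p) over all outcomes (measured in bits). The specific bias values (0.7 and 0.9) are chosen only for illustration and are not necessarily the ones used in examples 1.b & 1.c.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: H = sum(p * log2(1/p)).

    Outcomes with zero probability contribute nothing and are skipped.
    """
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


# 2-outcome (coin toss) case: entropy drops as the bias grows.
fair_coin = entropy([0.5, 0.5])      # exactly 1 bit
mild_bias = entropy([0.7, 0.3])      # illustrative bias, below 1 bit
strong_bias = entropy([0.9, 0.1])    # illustrative bias, lower still
print(fair_coin, mild_bias, strong_bias)

# N-outcome case: the uniform (fair) distribution has the highest entropy.
uniform_4 = entropy([0.25, 0.25, 0.25, 0.25])  # log2(4) = 2 bits
skewed_4 = entropy([0.7, 0.1, 0.1, 0.1])       # illustrative skew
print(uniform_4, skewed_4)
```

The fair coin comes out at exactly 1 bit, each biased coin comes out lower, and the uniform 4-outcome distribution beats the skewed one, in line with the reasoning above.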