Saturday, June 21, 2014

Getting Table Information From Hive Metastore

The Hive Metastore holds various metadata about Hive tables such as schema names, table names, partitions, fields, permissions, and so on. The metastore is built as a schema within a relational database, outside of Hive, and is accessed by Hive clients via services. An embedded Derby database is configured as the default for experimentation purposes, and must be changed over to something like MySQL, PostgreSQL, etc. for production use.

The Hive Metastore ER diagram is fairly straightforward. Once familiar with the schema, it is easy to query the metastore for information about Hive tables. Here's a sample query to identify all partitioned tables in a given Hive database:
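One way to write such a query, sketched below as a small JDBC program. This assumes a MySQL-backed metastore with the standard DBS, TBLS and PARTITION_KEYS tables; the connection URL, credentials and database name ('default') are placeholders to adjust for your setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PartitionedTables {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for a MySQL-backed metastore
        String url = "jdbc:mysql://localhost:3306/hive_metastore";
        // A table is partitioned if it has at least one entry in PARTITION_KEYS
        String sql = "SELECT d.NAME AS db_name, t.TBL_NAME "
                   + "FROM TBLS t "
                   + "JOIN DBS d ON t.DB_ID = d.DB_ID "
                   + "JOIN PARTITION_KEYS pk ON pk.TBL_ID = t.TBL_ID "
                   + "WHERE d.NAME = ? "
                   + "GROUP BY d.NAME, t.TBL_NAME";
        try (Connection con = DriverManager.getConnection(url, "hiveuser", "hivepass");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, "default");  // the Hive database of interest
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("db_name") + "." + rs.getString("TBL_NAME"));
                }
            }
        }
    }
}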

Sunday, May 18, 2014

Hive Abstract Semantic Analyzer Hook

Hive allows pre- and post-analysis hooks to be added to the normal Hive query plan generation flow via the AbstractSemanticAnalyzerHook class.

A custom hook needs to extend AbstractSemanticAnalyzerHook & override the preAnalyze or postAnalyze method as necessary.

Simple Semantic Analyzer Hook:
A semantic analyzer hook that logs a message in each of the preAnalyze & postAnalyze methods is shown below:
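A minimal sketch of such a hook is given below. The class name is hypothetical and the method signatures follow the AbstractSemanticAnalyzerHook contract of the Hive 0.x/ 1.x line; treat it as an illustration rather than the exact original listing.

import java.io.Serializable;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Task;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
import org.apache.hadoop.hive.ql.parse.SemanticException;

public class LoggingSemanticAnalyzerHook extends AbstractSemanticAnalyzerHook {
    private static final Log LOG = LogFactory.getLog(LoggingSemanticAnalyzerHook.class);

    @Override
    public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
            throws SemanticException {
        LOG.info("preAnalyze called, AST: " + ast.dump());
        return ast;  // return the (possibly modified) AST so normal analysis continues
    }

    @Override
    public void postAnalyze(HiveSemanticAnalyzerHookContext context,
                            List<Task<? extends Serializable>> rootTasks)
            throws SemanticException {
        LOG.info("postAnalyze called, root tasks: " + rootTasks.size());
    }
}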

Configuration for the Simple Semantic Analyzer Hook:
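The hook is registered with Hive via the hive.semantic.analyzer.hook property, typically in hive-site.xml. A small programmatic sketch, using the hypothetical hook class from above:

import org.apache.hadoop.hive.conf.HiveConf;

public class HookConfigExample {
    public static void main(String[] args) {
        HiveConf conf = new HiveConf();
        // Equivalent to setting hive.semantic.analyzer.hook in hive-site.xml
        conf.set("hive.semantic.analyzer.hook", "com.example.LoggingSemanticAnalyzerHook");
        System.out.println(conf.get("hive.semantic.analyzer.hook"));
    }
}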

Monday, April 21, 2014

Urlencode and Urldecode in Hive using the Reflect UDF

Hive doesn't offer an in-built UDF to perform Urlencode or Urldecode. One option could be to write a custom UDF to fill the void.

On the other hand, a rather straightforward alternative to get the same feature, as shown on the forum, is using the very generic reflect UDF.
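For instance, reflect can invoke the static encode/ decode methods of java.net.URLEncoder and java.net.URLDecoder directly from a query. The sketch below shows the underlying Java calls; the HiveQL in the comments uses hypothetical column/ table names.

import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlCodecExample {
    public static void main(String[] args) throws Exception {
        // These are the static methods a reflect based query would dispatch to, e.g.:
        //   SELECT reflect('java.net.URLEncoder', 'encode', some_col, 'UTF-8') FROM some_table;
        //   SELECT reflect('java.net.URLDecoder', 'decode', some_col, 'UTF-8') FROM some_table;
        String encoded = URLEncoder.encode("a b&c=d", "UTF-8");
        String decoded = URLDecoder.decode(encoded, "UTF-8");
        System.out.println(encoded);  // a+b%26c%3Dd
        System.out.println(decoded);  // a b&c=d
    }
}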

Thursday, April 10, 2014

Hive Query Plan Generation

A Hive query is passed through several built-in modules before the final plan is generated.

The stages/ modules are:

Query
  => (1) Parser
    => (2) Semantic Analyzer
      => (3) Logical Plan Generation
        => (4) Optimizer
          => (5) Physical Plan Generation
            => Executor to run on Hadoop

Monday, March 24, 2014

Hive History File

Hive maintains a history of all commands executed via the Hive CLI. These commands are written to a file called .hivehistory in the user's home directory.

Sunday, March 16, 2014

Hive Optimizations

Explain Plan

Mappers and Reducers Count

Map Joins

Sorting

Optimization step:
Between the logical and physical plan generation phases of Hive, the Hive optimizations get executed. The current set of optimizations includes:

  • Column pruning
  • Partition pruning
  • Sample pruning
  • Predicate push down
  • Map join processor
  • Union processor
  • Join reorder
More on each of these optimizations to follow..
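As a quick way to see an individual optimization at work, its flag can be toggled and the explain plan compared before and after. A rough sketch over the HiveServer2 JDBC driver is shown below; the connection URL, table and column names are placeholders, and hive.optimize.ppd is the predicate push down flag.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainPlanCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 instance; requires the hive-jdbc driver on the classpath
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = con.createStatement()) {
            // Toggle predicate push down (set to false to compare against the optimized plan)
            stmt.execute("SET hive.optimize.ppd=true");
            try (ResultSet rs = stmt.executeQuery(
                    "EXPLAIN SELECT * FROM some_table WHERE some_col = 'x'")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}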

Sunday, February 2, 2014

Build Hadoop from Source Code with Native Libraries and Snappy Compression

When running Hadoop using a pre-built Hadoop binary distribution (a downloaded hadoop-<Latest_Version>.tar.gz bundle), Hadoop may not be able to load certain native libraries. The following warning is also displayed at the time of starting up Hadoop:

"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable "

This issue comes up due to the difference in architecture between the machine on which Hadoop is now being run and the machine on which it was originally compiled. While most of Hadoop (written in Java) loads up fine, there are native libraries (compression, etc.) which do not get loaded (more details to follow).

The fix is to compile Hadoop locally & use it in place of the pre-built Hadoop binary (tar.gz). At a high level this requires:

Wednesday, January 15, 2014

Mocks for Unit testing Shell Scripts

1. Look at shunit for a sense of what kind of unit testing can be performed for Shell scripts.

2. For mocking up specific steps/ programs in the script, make use of alias.


Within a shell script testScript.sh this would be something as follows:

Tuesday, December 24, 2013

Mechanical Sympathy

A term that's gaining traction thanks to the LMAX architecture. Low latency applications running on the JVM need to be conscious of the underlying hardware to a large degree, to be able to best leverage the computing power of multi-core/ multi-processor architectures.

More details to follow soon on the topic out here, for the moment you could refer to Martin Fowler's post.

Tuesday, December 3, 2013

Real-time Face Reading

Machines are getting better and better at face reading. Ancient mystics have another reason to worry. It won't be long before recommendation engines of various kinds get built that leverage this sort of technology.

More about algorithms in this space to follow..

Thursday, November 28, 2013

Precision and Recall

Terms popular within search and Information Retrieval (IR) domains.

Precision: Is all about exactness. Of all the results that have shown up, how many are actually relevant.

Recall: Has to do with completeness. Of all the valid/ relevant results out there, how many have shown up.
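In terms of true positives (TP), false positives (FP) and false negatives (FN), the standard formulas are:

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \]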

Needs detailing..

Sunday, November 24, 2013

Pentaho 5.0 Community Edition Released

The stable build of Pentaho Community Edition (CE) 5.0 has been released. Many new features have made it into this build. Particularly keen to try out the enhancements to web services deployment via Carte. More details can be found in the Pentaho 5.0 release notes.

Saturday, November 16, 2013

Pentaho Clusters

Pentaho provides the option to scale out Kettle Transformations via Pentaho Clusters. It is fairly straightforward to set up a Pentaho cluster, including elastic/ dynamic clusters. The 1-2-3 of what needs to be done is:

1. Start the Carte Instances
There are two kinds of instances - Masters & Slaves. At least one instance must act as the dedicated Master which takes on the responsibility of management/ distribution of transformations/ steps to slaves, fail-over/ restart and communicating with the slaves.

The Carte instances need a config file with details about the Master's port, IP/ Hostname etc. For sample config files take a look at the pwd folder in your default Pentaho installation (/data-integration/pwd).

E.g. With defaults, a cluster can be started on localhost with:


2. Set up Cluster & Server Information using Spoon (GUI)
Switch to the View tab, next to the Design tab in the left hand panel of the Spoon GUI.
Click on 'Slave Servers' to add new Slave servers (host, port, name, etc.). Make sure to check the 'is_the_master' checkbox for the Master server.

Next click on the 'Kettle Cluster Schemas' and use 'Select Slave servers' to choose the slave servers. For the ability to dynamically add/ remove slave servers, also select the 'Dynamic Cluster' checkbox.

3. Mark Transformation Steps to Execute in Cluster Mode
Right click on the step which needs to be run in cluster mode, select Clustering & then select the cluster schema. You will now see a symbol next to the step (CxN) indicating that the step is to be executed in clustered mode.

The cluster settings will be similar to what you see in the left panel in the image. You can also see a transformation, with two steps (Random & Replace in String) being run in a clustered mode in the right panel in the image below.

Monday, November 11, 2013

Shannon Entropy and Information Gain


Shannon's Information Gain/ Entropy theory gets applied a lot in areas such as data encoding, compression and networking. Entropy, as defined by Shannon, is a measure of the unpredictability of a given message. The higher the entropy, the more unpredictable the content of the message is to a receiver.

Correspondingly, a high Entropy message is also high on Information Content. On receiving a high Entropy/ high Information Content laden message, the receiver has a high Information Gain.

On the other hand, when the receiver already knows the contents of the message (or knows of a certain bias), the Information Content of the message is low. On receiving such a message the receiver has less Information Gain. Effectively, once the uncertainty about the content of the message has reduced, the Entropy of the message has also dropped and the Information Gain from receiving such a message has gone down. The reasoning this far is quite intuitive.
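The worked coin toss examples (1.a - 1.c) referred to below are not reproduced here; as an illustration with my own numbers, the Shannon entropy of a message with outcome probabilities p_i is

\[ H = -\sum_i p_i \log_2 p_i \]

so for a two-outcome coin toss with probability p of heads, \( H = -p\log_2 p - (1-p)\log_2(1-p) \): a fair coin (p = 0.5) gives H = 1 bit, a coin biased to p = 0.9 gives H \approx 0.47 bits, and p = 0.99 gives H \approx 0.08 bits.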

The Entropy (& unpredictability) is the highest for a fair coin (example 1.a) and decreases for a biased coin (examples 1.b & 1.c). Due to the bias, the receiver is able to predict the outcome (favouring the known bias) in the latter case, resulting in a lower Entropy.

The observation from the (2-outcome) coin toss case generalizes to the N-outcome case, and the Entropy is found to be highest when all N outcomes are equally likely (fair).

Saturday, October 26, 2013

Be Hands On

For as long as you can. Think specialists (surgeons, pilots, etc.) who get better clocking in more hours with/into their art - doing, practicing, persevering. 

Sunday, October 20, 2013

General Availability (GA) for Hadoop 2.x

The Hadoop 2.x GA is nothing less than a big leap forward. Most of the features released, such as YARN (a pluggable resource management framework), Name Node HA, HDFS Federation and so on, were long awaited. As per the official mail to the community, this release includes:

"To recap, this release has a number of significant highlights compared to Hadoop 1.x:
        • YARN - A general purpose resource management system for Hadoop to allow MapReduce and other data processing frameworks and services
        • High Availability for HDFS
        • HDFS Federation
        • HDFS Snapshots
        • NFSv3 access to data in HDFS
        • Support for running Hadoop on Microsoft Windows
        • Binary Compatibility for MapReduce applications built on hadoop-1.x
        • Substantial amount of integration testing with rest of projects in the ecosystem

 Please see the Hadoop 2.2.0 Release Notes for details."


Also as per the official email to the community, users are encouraged to move forward to the 2.x branch which is more stable & backward compatible.

Tuesday, October 1, 2013

Need Support to Lift with Confidence

Brace up, terminologies coming your way...

Support: A measure of the prevalence of an event x in a given set of N data points. Support is effectively a first level indicator of something occurring frequently enough (say in more than 10% of the data points) to be of interest.

In the case of two correlated events x & y,

Confidence: A measure of how predictably event y occurs when event x has occurred. Once confidence is above a certain threshold (say 70%), it means the two events show up together often enough to be used for rules/ decision making, etc.

Lift: A measure of the strength of association between two events, i.e. how much more likely event y is to occur once it is known that event x has occurred, compared to y's baseline likelihood.
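In the usual notation, for events x and y observed over N data points, the standard formulas are:

\[ \text{support}(x) = \frac{\text{count}(x)}{N} \qquad \text{confidence}(x \Rightarrow y) = \frac{\text{support}(x \cap y)}{\text{support}(x)} \qquad \text{lift}(x \Rightarrow y) = \frac{\text{confidence}(x \Rightarrow y)}{\text{support}(y)} \]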

Sunday, September 22, 2013

False Negative, False Positive and the Paradox


First, a bit about the terms False Positive & False Negative. These terms are associated with the nature of error in the results churned out by a system trying to answer an unknown problem, based on a (limited) set of given/ input data points. After analysing the data, the system is expected to come up with a Yes (it is Positive) or a No (it is Negative) type of answer. There is invariably some error in the answer due to noisy data, wrong assumptions, calculation mistakes, unanticipated cases, mechanical errors, surges, etc.

A False Positive is when the system says the answer is Positive, but the reality is Negative. An example would be a car's sensitive burglar alarm system that starts to beep due to heavy lightning & thunder on a rainy day. The alarm at this stage is indicating a positive hit (i.e. a burglary), which is not really happening.

On the other hand, a False Negative is when the system answers in a Negative, where the answer should have been a Positive. False negatives happen often with first level medical tests and scans which are unable to detect the cause of pain or discomfort. The test report of "Nothing Abnormal Detected" at this stage is often a False Negative, as revealed by more detailed tests performed later.

The False Positive Paradox is an interesting phenomenon where the likelihood of a False Positive shoots up significantly (sometimes even exceeding that of the actual positives) when the actual rate of occurrence of a condition within a given sample group is very low. The result follows from basic likelihood calculations, as shown below.

Let's say in a group of size 1,000,000 (1 Mn.), 10% are doctors. Let's say there's a system wherein you feed in a person's Unique ID (UID) and it tells you if the person is a doctor or not. The system has a 0.01% chance of incorrectly reporting a person who is not a doctor to be a doctor (a False Positive).

Now, let's work out our confidence levels of the results given out by the system.
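A worked version of the first case, assuming (for simplicity) that the system never misses an actual doctor: with 100,000 doctors and 900,000 non-doctors, the expected false positives are 0.01% of 900,000, i.e. 90, against 100,000 true positives. So when the system says "doctor":

\[ P(\text{false positive} \mid \text{positive answer}) = \frac{90}{100{,}000 + 90} \approx 0.09\% \]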


On the other hand, if just 0.01% of the people in the group are actually doctors (while the rest of the information remains the same), the confidence level works out to be quite different.
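The corresponding calculation, under the same simplifying assumption: with 0.01% doctors there are 100 doctors and 999,900 non-doctors, so the expected false positives are 0.01% of 999,900, i.e. about 100, against just 100 true positives:

\[ P(\text{false positive} \mid \text{positive answer}) = \frac{100}{100 + 100} = 50\% \]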


This clearly shows that the likelihood of the answer being a False Positive has shot up from much under 1% to as much as 50%, when the occurrence of the condition (number of doctors) within the given population dropped from 10% (i.e. 100,000) to a low value of 0.01% (i.e. 100).

Thursday, September 12, 2013

Transparently


While doing software development you might hear of change being introduced "transparently". What does this mean?

Transparency in this context is similar to how a looking glass is transparent. One can barely make out that it exists. Think of a biker who pulls down the glass visor of his helmet when troubled by wind blowing into his eyes. His sight of the road & beyond continues to function without his noticing the transparent visor layer in between.

Similarly, when a change is introduced transparently on the server side, it means the dependent/ client side applications needn't be told/ made aware of this change on the server side. The old interfaces continue to work as is, communication protocols remain the same, and so on.

The above kind of transparency is different from the transparency of a "transparent person" or a "transparent deal" or a "white box system", where the internals (like thoughts, implementation, ideas, details, etc.) are visible.

Saturday, August 31, 2013

Internals Of Solr/ Lucene Document Scoring

This post is in continuation of a discussion on the Solr community mailing list about the efficiency of the Solr/ Lucene scoring algorithm.

The search algorithm given here can be summarized as:

- Query query =  Build query using user's search terms.
- Collector collector = Typically the TopScoreDocCollector
- Searcher searcher = new IndexSearcher(indexReader);
- searcher.search(query, collector);
- Weight weight = query.weight(searcher);
- Scorer scorer = weight.scorer(indexReader); // Typically BooleanScorer2
- scorer.score() => ConjunctionScorer (on every sub-scorer) in a leap frog/ skip ahead mechanism.

Algo needs improvement!

The AND query shows a leap frog/ skip ahead pattern implemented at the BooleanScorer2 (ConjunctionScorer) level.

For example with the query, q=A AND B, where A & B match doc. id's
A -> 1,3,5,7,11,15,17
B -> 2, 6

- Scorer starts with the min. of each, i.e. A -> 1  & B -> 2, & current highest doc id set to 2

- In the next few iterations:
A is advanced past the current highest value to 3 & current highest updated to 3.
B advanced past current highest 3 to 6 & current highest set to 6.
A advanced past 6 to 7 & current highest set to 7.
B has no more docs, so the loop breaks out without any match.

On the other hand if the two had converged/ agreed on a particular doc id, that doc would be scored & collected (added to a min-heap of scores).
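A simplified sketch of the leap frog idea over two sorted posting lists is given below. It is an illustration of the advance-past-the-current-highest loop only, not Lucene's actual ConjunctionScorer code.

import java.util.ArrayList;
import java.util.List;

public class LeapFrogIntersection {
    // Advance whichever list is behind; collect a doc id only when both lists agree on it.
    public static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> matches = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {           // both sub-scorers agree => score/ collect this doc
                matches.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {     // A is behind, advance it past the current highest
                i++;
            } else {                      // B is behind, advance it past the current highest
                j++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7, 11, 15, 17};
        int[] b = {2, 6};
        System.out.println(intersect(a, b));  // prints [] - the lists never agree, so no match
    }
}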

Thursday, August 15, 2013

Update Apt Repositories Location for Old Ubuntu Versions

When working with an old version of Ubuntu (11.04, 10.04, etc.), the biggest handicap is the lack of a functional package manager such as apt or synaptic. The reason the package managers stop working is that at the end of support for an old version of Ubuntu, the team behind Ubuntu archives its repositories.

At this point, as a user, you are supposed to Upgrade (the recommended practice) to a more recent version. There are normally enough advance notices and alerts sent out by Ubuntu's Update Manager for the same. If, however, you have a compelling reason to stick to your current version, then here's a way to update your repositories' sources list to be able to install and use old software that is present in the archival repository. This is based on the recommendation made in this forum discussion.

Sunday, August 11, 2013

Resume Large Downloads in Mozilla Firefox

When downloading a large file via Firefox over a slow internet connection, you might get disconnected in between and end up with a partially downloaded file (with a .part file extension). Here's a little trick to resume the download after reconnecting, from the point where it had previously stopped.

1. Open up the Firefox Downloads window (Tools > Downloads, OR use the shortcut Ctrl + Shift + Y). Not sure if this works with the recent versions of Firefox.

2. Click on the Resume button next to the file that was partially downloaded. If this works then great, nothing else to do.

3. On the other hand, if step 2 didn't work, then click on the Retry button. This will cause the download to start all over again from the very beginning. Let it start and go over to step 4.

4. Once a few bytes of the file have been downloaded & the progress meter on the Downloads window indicates that the new download has started (it might also give an estimate of the time left), click on the Pause button next to the download.

5. Now go to your Downloads folder (where Firefox was downloading the file). Rename the original partially downloaded file (the one with the .part extension) to the name of the new .part file that just started downloading in step 4, replacing it.

6. Go back to the Firefox Downloads window and click the Resume button next to the download that was Paused in step 4.

That's it. The download should resume from the point where the initial partially downloaded (.part) file had stopped.

Thursday, August 1, 2013

Trees and Graphs

Useful things to know about tree and graph based data structures:

These:
  • Binary Trees Vs. Binary Search Trees
  • 2-4 Trees and Red-Black Trees
  • AVL
  • Tries
  • Heaps
  • B & B+ Trees
& these:
  • BFS, DFS
  • Sorting - Quick, Merge, Radix, Timsort
  • Kruskal's & Prim's algorithms for Minimum Spanning Trees
  • Morris Traversal, without extra space or recursion, using Threaded Binary Trees
  • Dijkstra's algorithm for shortest path
  • Topological sorting
That horses for courses is applicable:

The big-O deal:

With Java, well tested implementations are mostly available:
  • TreeSet
  • TreeMap
  • LinkedHashMap
  • ConcurrentSkipListMap
  • PriorityQueue
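For instance, a quick illustrative look at two of these:

import java.util.PriorityQueue;
import java.util.TreeMap;

public class JdkTreeStructures {
    public static void main(String[] args) {
        // TreeMap: a Red-Black tree that keeps keys sorted; O(log n) put/ get/ remove
        TreeMap<String, Integer> scores = new TreeMap<>();
        scores.put("radix", 3);
        scores.put("merge", 1);
        scores.put("quick", 2);
        System.out.println(scores.firstKey());       // merge (smallest key)
        System.out.println(scores.ceilingKey("q"));  // quick (smallest key >= "q")

        // PriorityQueue: a binary heap; O(log n) offer/ poll, O(1) peek of the minimum
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        heap.offer(7);
        heap.offer(2);
        heap.offer(5);
        System.out.println(heap.poll());  // 2
        System.out.println(heap.poll());  // 5
    }
}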

Thursday, July 11, 2013

Solr Analyzers Basics


Solr offers several Analyzers to pre-process document fields being indexed and searched. As part of modelling the schema one needs to make an informed choice for the specific chain of Analyzers to be applied to every field (fieldType) defined in the schema.xml.

To start off, one needs to understand the different kinds of Analyzers and their purpose:
  1. Char Filters (or CharacterFilterFactories)
    • Always applied first, i.e. before Tokenizers
    • Operates at the Character level (of the field values)
    • Zero or More Char Filters can be chained together. Get applied as per the sequence in schema.xml
  2. Tokenizers (or TokenizerFactories)
    • Converts stream of Characters into a series of Tokens
    • Only One Tokenizer can be there in each Analyzer chain
  3. Token Filters (or TokenFilterFactories)
    • Always applied last, i.e. after Tokenizers
    • Operates at the Tokens level generated by the Tokenizers
    • Zero or More Token Filters can be chained together. Get applied as per the sequence in schema.xml

To take an example, let's say we have a field title with the value (V1) "Mr. James <b>Bond</b> MI007". Now we run it through the following:

1. Character FilterFactory (One): HTMLStripCharFilterFactory (CF1)

(Output: "Mr. James Bond MI007")

2. Tokenizer (One): StandardTokenizerFactory (T)

(Output: Tokens: [ALPHANUM: "Mr.", ALPHANUM: "James", ALPHANUM: "Bond", ALPHANUM: "MI007"])

3. TokenFilters (Two): WordDelimiterFilterFactory (TF1) & LowerCaseFilterFactory (TF2) 
  • Mr. => WordDelim => Lowercase => mr.
  • James => WordDelim => Lowercase => james
  • Bond => WordDelim => Lowercase => bond
  • MI007 => WordDelim => [MI, 007] => Lowercase => mi, 007
Finally the output text actually indexed: "mr. james bond mi 007"
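Conceptually, the chain above behaves like the plain Java sketch below. It only illustrates the char filter -> tokenizer -> token filter ordering; it is not Solr's actual factory code, and the regexes are crude stand-ins for the real filters.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AnalyzerChainSketch {
    public static void main(String[] args) {
        String value = "Mr. James <b>Bond</b> MI007";

        // 1. Char filter stage (cf. HTMLStripCharFilterFactory): works on raw characters
        String stripped = value.replaceAll("<[^>]*>", "");

        // 2. Tokenizer stage (cf. StandardTokenizerFactory): characters -> tokens
        List<String> tokens = new ArrayList<>(Arrays.asList(stripped.split("\\s+")));

        // 3. Token filter stages (cf. WordDelimiterFilterFactory, then LowerCaseFilterFactory):
        //    work on the token stream produced above
        List<String> filtered = new ArrayList<>();
        for (String token : tokens) {
            // crude word-delimiter split on letter/ digit boundaries, e.g. MI007 -> MI, 007
            for (String part : token.split("(?<=\\p{Alpha})(?=\\d)|(?<=\\d)(?=\\p{Alpha})")) {
                filtered.add(part.toLowerCase());
            }
        }
        System.out.println(filtered);  // [mr., james, bond, mi, 007]
    }
}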

There are several other options and many more analyzers that one could use. Among them, the different PatternReplace analyzers, EdgeNGram and the simple WhitespaceTokenizerFactory are the more popular ones. Finally, if none of the standard ones are adequate for a specific use case, there is also the option of writing a custom analyzer.

Monday, June 10, 2013

Solution for making Long GET Request to Solr via SolrNet

Solr has REST APIs available for performing various searches on indexed documents. The client generally issues GET requests to Solr with different parameters (fields, rows, facets, etc.) set. Since there typically are size/ query length limitations on GET requests (imposed by the container, OS, etc.), Solr allows the same queries to be issued to the Solr RequestHandlers as POST requests as well.

We ran into one such issue with long GET requests to Solr from SolrNet and made a few changes to solve it.

Solr Side Changes:
First up, we increased the headerBufferSize of the application server as explained on SO here, and increased the maxBooleanClauses parameter in solrconfig.xml. This allowed the Solr side to start responding to much longer GET requests. The problem however wasn't solved. The client side was a .NET application running within IIS, with additional length limitations imposed by the Windows OS & the .NET framework.

SolrNet Side Changes:
In round two, we went for a better fix and switched over to POST requests in place of long GET requests. The solution is largely the same as mentioned on the SolrNet group here & here. The difference being to switch over to a POST request from within the Get() method of the SolrConnection.cs class, when the request string is longer than a configurable threshold value.
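The general idea, sketched here in Java rather than SolrNet's C# (a rough illustration, not the actual SolrConnection.cs change; the length threshold is an assumed value):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SolrQuerySender {
    private static final int MAX_GET_LENGTH = 4096;  // assumed threshold; tune to the container limits

    // Send the query as a GET when the URL is short enough, otherwise POST the parameters in the body.
    public static HttpURLConnection send(String solrSelectUrl, String queryString) throws Exception {
        String getUrl = solrSelectUrl + "?" + queryString;
        if (getUrl.length() <= MAX_GET_LENGTH) {
            HttpURLConnection conn = (HttpURLConnection) new URL(getUrl).openConnection();
            conn.setRequestMethod("GET");
            return conn;
        }
        HttpURLConnection conn = (HttpURLConnection) new URL(solrSelectUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(queryString.getBytes(StandardCharsets.UTF_8));
        }
        return conn;
    }
}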


Update: The PostSolrConnection.cs class has made it to the head branch of SolrNet.

Tuesday, May 28, 2013

Redmine Project Management Tool

In trying to find an open source Agile project management tool, somewhat of an alternative to Rally, we chanced upon Redmine. The initial feel of the tool has been good so far.

We needed somewhat of an integrated tool that would allow various teams to collaborate. Redmine does well on this count as it has a task tracker, bug tracker, and knowledge repository (file/ document management and wiki), all rolled into one.

Additionally, we have been able to migrate our bugs and user accounts from Bugzilla, to get off the ground quickly. Now it is about letting the rubber hit the road, and having the teams start working with Redmine.