Saturday, June 21, 2014

Getting Table Information From Hive Metastore

The Hive Metastore holds metadata about Hive tables such as schema names, table names, partitions, fields, permissions, and so on. The metastore lives in a schema within a relational database, outside of Hive, and is accessed by Hive clients via services. An embedded Derby database is configured as the default for experimentation purposes, and must be changed over to something like MySQL, PostgreSQL, etc. for production use.

The Hive Metastore ER diagram is fairly straightforward. Once familiar with the schema, it is easy to query the metastore for information about Hive tables. Here's a sample query to identify all partitioned tables in a given Hive database:
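A minimal sketch, assuming a MySQL-backed metastore; DBS, TBLS & PARTITION_KEYS are standard metastore tables, and the database name 'default' is just an example:

    -- list every partitioned table (i.e. any table having partition keys)
    -- in the Hive database named 'default'
    SELECT d.NAME AS db_name,
           t.TBL_NAME,
           GROUP_CONCAT(pk.PKEY_NAME) AS partition_keys
    FROM DBS d
    JOIN TBLS t ON t.DB_ID = d.DB_ID
    JOIN PARTITION_KEYS pk ON pk.TBL_ID = t.TBL_ID
    WHERE d.NAME = 'default'
    GROUP BY d.NAME, t.TBL_NAME;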

Sunday, May 18, 2014

Hive Abstract Semantic Analyzer Hook

Hive allows Pre & Post Analyzer hooks to be added to the normal Hive query plan generation flow via the AbstractSemanticAnalyzerHook class.

A custom hook needs to extend AbstractSemanticAnalyzerHook & override the preAnalyze or postAnalyze method as necessary.

Simple Semantic Analyzer Hook:
A semantic analyzer hook that logs a message in each of the preAnalyze & postAnalyze methods is shown below:
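A minimal sketch; the package name and log messages are illustrative, while the API is that of org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook:

    package com.example.hooks;

    import java.io.Serializable;
    import java.util.List;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.hive.ql.exec.Task;
    import org.apache.hadoop.hive.ql.parse.ASTNode;
    import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
    import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
    import org.apache.hadoop.hive.ql.parse.SemanticException;

    public class SimpleSemanticAnalyzerHook extends AbstractSemanticAnalyzerHook {

      private static final Log LOG = LogFactory.getLog(SimpleSemanticAnalyzerHook.class);

      // invoked before semantic analysis; the AST can be inspected (or rewritten) here
      @Override
      public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
          throws SemanticException {
        LOG.info("preAnalyze invoked for command: " + context.getCommand());
        return ast;
      }

      // invoked after the plan's root tasks have been generated
      @Override
      public void postAnalyze(HiveSemanticAnalyzerHookContext context,
          List<Task<? extends Serializable>> rootTasks) throws SemanticException {
        LOG.info("postAnalyze invoked, root tasks generated: " + rootTasks.size());
      }
    }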

Configurations for the Simple Semantic Analyzer Hook:
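A minimal sketch of wiring the hook in; the jar path is illustrative, and hive.semantic.analyzer.hook is the property Hive consults for the hook class:

    # start the CLI with the hook's jar on the classpath (path is illustrative)
    $ hive --auxpath /path/to/simple-hook.jar

    -- register the hook for the session
    hive> set hive.semantic.analyzer.hook=com.example.hooks.SimpleSemanticAnalyzerHook;

The same property can instead be set permanently in hive-site.xml:

    <property>
      <name>hive.semantic.analyzer.hook</name>
      <value>com.example.hooks.SimpleSemanticAnalyzerHook</value>
    </property>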

Monday, April 21, 2014

Urlencode and Urldecode in Hive using the Reflect UDF

Hive doesn't offer a built-in UDF to perform Urlencode or Urldecode. One option could be to write a custom UDF to fill the void.

On the other hand, a rather straightforward alternative to get the same feature, as shown on the forum, is to use the very generic Reflect UDF.
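A minimal sketch, assuming a table src with at least one row (older Hive versions require a FROM clause); reflect simply invokes the named static Java method:

    -- urlencode via java.net.URLEncoder.encode(String, String)
    SELECT reflect('java.net.URLEncoder', 'encode', 'a b c&d', 'UTF-8') FROM src LIMIT 1;
    -- a+b+c%26d

    -- urldecode via java.net.URLDecoder.decode(String, String)
    SELECT reflect('java.net.URLDecoder', 'decode', 'a+b+c%26d', 'UTF-8') FROM src LIMIT 1;
    -- a b c&d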

Thursday, April 10, 2014

Hive Query Plan Generation

A Hive query is passed through several built-in modules for the final plan to be generated.

The stages/modules are:

Query
  => (1) Parser
       => (2) Semantic Analyzer
            => (3) Logical Plan Generation
                 => (4) Optimizer
                      => (5) Physical Plan Generation
                           => Executor to run on Hadoop
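
The final plan that comes out of this pipeline can be inspected by prefixing a query with EXPLAIN (or EXPLAIN EXTENDED for more detail); the table here is hypothetical:

    -- prints the stage dependency graph and each stage's map/reduce operator tree
    EXPLAIN
    SELECT dept_id, COUNT(*)
    FROM employees
    GROUP BY dept_id;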

Monday, March 24, 2014

Hive History File

Hive maintains a history of all commands executed via the Hive CLI. These commands are written to a file called .hivehistory in the user's home folder.
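
For example, to peek at the most recent commands, or to search past ones:

    # last few commands run through the CLI
    $ tail ~/.hivehistory

    # search past commands for a keyword
    $ grep -i 'create table' ~/.hivehistory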

Sunday, March 16, 2014

Hive Optimizations

  • Explain Plan
  • Mappers and Reducers Count
  • Map Joins
  • Sorting

Optimization step:
Between the logical & physical plan generation phases, Hive optimizations get executed. The current set of optimizations includes:

  • Column pruning
  • Partition pruning
  • Sample pruning
  • Predicate push down
  • Map join processor
  • Union processor
  • Join reorder
More on each of these optimizations to follow...
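
In the meantime, a small illustration of predicate push down & the map join processor in action; the tables are hypothetical, while hive.optimize.ppd & hive.auto.convert.join are standard Hive properties:

    -- enable predicate push down & automatic map join conversion
    SET hive.optimize.ppd=true;
    SET hive.auto.convert.join=true;

    -- the MAPJOIN hint asks for the small table d to be replicated to every
    -- mapper, avoiding a reduce phase; the WHERE filter gets pushed down
    -- close to the table scan
    SELECT /*+ MAPJOIN(d) */ f.order_id, d.dept_name
    FROM orders f
    JOIN departments d ON (f.dept_id = d.dept_id)
    WHERE f.order_date >= '2014-01-01';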

Sunday, February 2, 2014

Build Hadoop from Source Code with Native Libraries and Snappy Compression

When running Hadoop using a pre-built binary distribution (a downloaded hadoop-<Latest_Version>.tar.gz bundle), Hadoop may not be able to load certain native libraries. The following warning is displayed when starting up Hadoop:

"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable "

This issue comes up due to the difference in architecture between the machine on which Hadoop is now being run and the machine on which it was originally compiled. While most of Hadoop (written in Java) loads up fine, the native libraries (compression, etc.) do not get loaded (more details to follow).

The fix is to compile Hadoop locally & use it in place of the pre-built Hadoop binary (tar.gz). At a high level this requires fetching the matching Hadoop source release, installing the build toolchain, and running a native build, as sketched below:
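
A rough outline, assuming a Hadoop 2.x source release and a toolchain of JDK, Maven, protobuf 2.5, CMake, plus the zlib & snappy development packages:

    # fetch & unpack the source release matching the version in use
    $ tar xzf hadoop-<Latest_Version>-src.tar.gz
    $ cd hadoop-<Latest_Version>-src

    # build the distribution with native libraries & snappy support
    $ mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy

    # the rebuilt bundle lands under hadoop-dist/target/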

Wednesday, January 15, 2014

Mocks for Unit testing Shell Scripts

1. Look at shunit for a sense of what kind of unit testing can be performed on shell scripts.

2. For mocking specific steps/programs in the script, make use of alias.


Within a shell script, testScript.sh, this would look something like the following:
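
A minimal sketch, assuming the shunit2 runner sits alongside the script; the function under test and the paths are illustrative:

    #!/bin/bash
    # testScript.sh

    # alias expansion is off in non-interactive shells; enable it explicitly
    shopt -s expand_aliases

    # mock the external program so the test never touches a real cluster;
    # the mock just echoes back the command line it was given
    alias hadoop='echo hadoop'

    # the function under test, which normally shells out to hadoop
    copyToHdfs() {
      hadoop fs -put "$1" /data/inbox
    }

    testCopyToHdfs() {
      output=$(copyToHdfs /tmp/sample.txt)
      assertEquals "hadoop fs -put /tmp/sample.txt /data/inbox" "${output}"
    }

    # load & run the shunit framework (path is an assumption)
    . ./shunit2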