Saturday, June 21, 2014

Getting Table Information From Hive Metastore

The Hive Metastore holds metadata about Hive tables such as schema names, table names, partitions, fields, permissions, and so on. The metastore lives in a schema within a relational database, outside of Hive, and is accessed by Hive clients via services. An embedded Derby database is configured as the default for experimentation purposes, and must be changed over to something like MySQL, PostgreSQL, etc. for production use.

The Hive Metastore ER diagram is fairly straightforward. Once familiar with the schema, it is easy to query the metastore for information about Hive tables. Here's a sample query to identify all partitioned tables in a given Hive database:
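A minimal sketch, assuming a MySQL-backed metastore; DBS, TBLS & PARTITION_KEYS are standard metastore tables, and the database name 'default' is just an example:

    -- list every partitioned table (i.e. any table having partition keys)
    -- in the Hive database named 'default'
    SELECT d.NAME AS db_name,
           t.TBL_NAME,
           GROUP_CONCAT(pk.PKEY_NAME) AS partition_keys
    FROM DBS d
    JOIN TBLS t ON t.DB_ID = d.DB_ID
    JOIN PARTITION_KEYS pk ON pk.TBL_ID = t.TBL_ID
    WHERE d.NAME = 'default'
    GROUP BY d.NAME, t.TBL_NAME;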

Sunday, May 18, 2014

Hive Abstract Semantic Analyzer Hook

Hive allows Pre & Post Analyzer hooks to be added to the normal Hive query plan generation flow via the AbstractSemanticAnalyzerHook class.

A custom hook needs to extend AbstractSemanticAnalyzerHook & override the preAnalyze or postAnalyze method as necessary.

Simple Semantic Analyzer Hook:
A semantic analyzer hook that logs a message in each of the preAnalyze & postAnalyze methods is shown below:
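A minimal sketch; the package name and log messages are illustrative, while the API is that of org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook:

    package com.example.hooks;

    import java.io.Serializable;
    import java.util.List;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.hive.ql.exec.Task;
    import org.apache.hadoop.hive.ql.parse.ASTNode;
    import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
    import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
    import org.apache.hadoop.hive.ql.parse.SemanticException;

    public class SimpleSemanticAnalyzerHook extends AbstractSemanticAnalyzerHook {

      private static final Log LOG = LogFactory.getLog(SimpleSemanticAnalyzerHook.class);

      // invoked before semantic analysis; the AST can be inspected (or rewritten) here
      @Override
      public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
          throws SemanticException {
        LOG.info("preAnalyze invoked for command: " + context.getCommand());
        return ast;
      }

      // invoked after the plan's root tasks have been generated
      @Override
      public void postAnalyze(HiveSemanticAnalyzerHookContext context,
          List<Task<? extends Serializable>> rootTasks) throws SemanticException {
        LOG.info("postAnalyze invoked, root tasks generated: " + rootTasks.size());
      }
    }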

Configurations for the Simple Semantic Analyzer Hook:
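A minimal sketch of wiring the hook in; the jar path is illustrative, and hive.semantic.analyzer.hook is the property Hive consults for the hook class:

    # start the CLI with the hook's jar on the classpath (path is illustrative)
    $ hive --auxpath /path/to/simple-hook.jar

    -- register the hook for the session
    hive> set hive.semantic.analyzer.hook=com.example.hooks.SimpleSemanticAnalyzerHook;

The same property can instead be set permanently in hive-site.xml:

    <property>
      <name>hive.semantic.analyzer.hook</name>
      <value>com.example.hooks.SimpleSemanticAnalyzerHook</value>
    </property>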

Monday, April 21, 2014

Urlencode and Urldecode in Hive using the Reflect UDF

Hive doesn't offer a built-in UDF to perform Urlencode or Urldecode. One option could be to write a custom UDF to fill the void.

On the other hand, a rather straightforward alternative to get the same feature, as shown on the forum, is to use the very generic Reflect UDF.
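A minimal sketch, assuming a table src with at least one row (older Hive versions require a FROM clause); reflect simply invokes the named static Java method:

    -- urlencode via java.net.URLEncoder.encode(String, String)
    SELECT reflect('java.net.URLEncoder', 'encode', 'a b c&d', 'UTF-8') FROM src LIMIT 1;
    -- a+b+c%26d

    -- urldecode via java.net.URLDecoder.decode(String, String)
    SELECT reflect('java.net.URLDecoder', 'decode', 'a+b+c%26d', 'UTF-8') FROM src LIMIT 1;
    -- a b c&d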

Thursday, April 10, 2014

Hive Query Plan Generation

A Hive query is passed through several built-in modules for the final plan to be generated.

The stages/modules are:

Query
  => (1) Parser
       => (2) Semantic Analyzer
            => (3) Logical Plan Generation
                 => (4) Optimizer
                      => (5) Physical Plan Generation
                           => Executor to run on Hadoop
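
The final plan that comes out of this pipeline can be inspected by prefixing a query with EXPLAIN (or EXPLAIN EXTENDED for more detail); the table here is hypothetical:

    -- prints the stage dependency graph and each stage's map/reduce operator tree
    EXPLAIN
    SELECT dept_id, COUNT(*)
    FROM employees
    GROUP BY dept_id;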

Monday, March 24, 2014

Hive History File

Hive maintains a history of all commands executed via the Hive CLI. These commands are written to a file called .hivehistory in the user's home folder.
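
For example, to peek at the most recent commands, or to search past ones:

    # last few commands run through the CLI
    $ tail ~/.hivehistory

    # search past commands for a keyword
    $ grep -i 'create table' ~/.hivehistory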

Sunday, March 16, 2014

Hive Optimizations

  • Explain Plan
  • Mappers and Reducers Count
  • Map Joins
  • Sorting

Optimization step:
Between the logical & physical plan generation phases, Hive optimizations get executed. The current set of optimizations includes:

  • Column pruning
  • Partition pruning
  • Sample pruning
  • Predicate push down
  • Map join processor
  • Union processor
  • Join reorder
More on each of these optimizations to follow...
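
In the meantime, a small illustration of predicate push down & the map join processor in action; the tables are hypothetical, while hive.optimize.ppd & hive.auto.convert.join are standard Hive properties:

    -- enable predicate push down & automatic map join conversion
    SET hive.optimize.ppd=true;
    SET hive.auto.convert.join=true;

    -- the MAPJOIN hint asks for the small table d to be replicated to every
    -- mapper, avoiding a reduce phase; the WHERE filter gets pushed down
    -- close to the table scan
    SELECT /*+ MAPJOIN(d) */ f.order_id, d.dept_name
    FROM orders f
    JOIN departments d ON (f.dept_id = d.dept_id)
    WHERE f.order_date >= '2014-01-01';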

Sunday, February 2, 2014

Build Hadoop from Source Code with Native Libraries and Snappy Compression

When running Hadoop using a pre-built binary distribution (a downloaded hadoop-<Latest_Version>.tar.gz bundle), Hadoop may not be able to load certain native libraries. The following warning is displayed when starting up Hadoop:

"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable "

This issue comes up due to the difference in architecture between the machine on which Hadoop is now being run and the machine on which it was originally compiled. While most of Hadoop (written in Java) loads up fine, the native libraries (compression, etc.) do not get loaded (more details to follow).

The fix is to compile Hadoop locally & use it in place of the pre-built Hadoop binary (tar.gz). At a high level this requires fetching the matching Hadoop source release, installing the build toolchain, and running a native build, as sketched below:
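
A rough outline, assuming a Hadoop 2.x source release and a toolchain of JDK, Maven, protobuf 2.5, CMake, plus the zlib & snappy development packages:

    # fetch & unpack the source release matching the version in use
    $ tar xzf hadoop-<Latest_Version>-src.tar.gz
    $ cd hadoop-<Latest_Version>-src

    # build the distribution with native libraries & snappy support
    $ mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy

    # the rebuilt bundle lands under hadoop-dist/target/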

Wednesday, January 15, 2014

Mocks for Unit testing Shell Scripts

1. Look at shunit for a sense of what kind of unit testing can be performed on shell scripts.

2. For mocking specific steps/programs in the script, make use of alias.


Within a shell script, testScript.sh, this would look something like the following:
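
A minimal sketch, assuming the shunit2 runner sits alongside the script; the function under test and the paths are illustrative:

    #!/bin/bash
    # testScript.sh

    # alias expansion is off in non-interactive shells; enable it explicitly
    shopt -s expand_aliases

    # mock the external program so the test never touches a real cluster;
    # the mock just echoes back the command line it was given
    alias hadoop='echo hadoop'

    # the function under test, which normally shells out to hadoop
    copyToHdfs() {
      hadoop fs -put "$1" /data/inbox
    }

    testCopyToHdfs() {
      output=$(copyToHdfs /tmp/sample.txt)
      assertEquals "hadoop fs -put /tmp/sample.txt /data/inbox" "${output}"
    }

    # load & run the shunit framework (path is an assumption)
    . ./shunit2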