
Saturday, June 21, 2014

Getting Table Information From Hive Metastore

The Hive Metastore holds metadata about Hive tables, such as schema names, table names, partitions, fields, permissions, and so on. The metastore is built as a schema within a relational database, outside of Hive, and is accessed by Hive clients via services. An embedded Derby database is configured as the default for experimentation purposes, and should be changed over to something like MySQL or PostgreSQL for production use.
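For example, pointing the metastore at MySQL is a matter of overriding the JDBC connection properties in hive-site.xml. A minimal sketch, where the host, schema name, and credentials are placeholders:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://db-host:3306/hive_metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hive_password</value>
    </property>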

The Hive Metastore ER diagram is fairly straightforward. Once familiar with the schema, it is easy to query the metastore for information about the Hive tables. Here's a sample query to identify all partitioned tables in a given Hive database:
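A minimal sketch of such a query, run directly against the metastore database (assuming a MySQL-backed metastore with the standard schema; the database name is illustrative):

    SELECT d.NAME AS db_name,
           t.TBL_NAME AS table_name,
           p.PKEY_NAME AS partition_key,
           p.PKEY_TYPE AS partition_key_type
    FROM TBLS t
    JOIN DBS d ON t.DB_ID = d.DB_ID
    JOIN PARTITION_KEYS p ON t.TBL_ID = p.TBL_ID
    WHERE d.NAME = 'my_database'
    ORDER BY t.TBL_NAME, p.INTEGER_IDX;

Any table with rows in PARTITION_KEYS is a partitioned table; the INTEGER_IDX column preserves the declared order of the partition columns.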

Sunday, May 18, 2014

Hive Abstract Semantic Analyzer Hook

Hive allows pre- & post-analyzer hooks to be added to the normal Hive query plan generation flow via the AbstractSemanticAnalyzerHook class.

A custom hook needs to extend AbstractSemanticAnalyzerHook & override the preAnalyze or postAnalyze method as necessary.

Simple Semantic Analyzer Hook:
A semantic analyzer hook that logs a message in each of the preAnalyze & postAnalyze methods is shown below:
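A minimal sketch of such a hook; the package and class names are illustrative, while the hook API comes from org.apache.hadoop.hive.ql.parse:

    package com.example.hive;

    import java.io.Serializable;
    import java.util.List;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.hive.ql.exec.Task;
    import org.apache.hadoop.hive.ql.parse.ASTNode;
    import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
    import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
    import org.apache.hadoop.hive.ql.parse.SemanticException;

    public class SimpleSemanticAnalyzerHook extends AbstractSemanticAnalyzerHook {

      private static final Log LOG = LogFactory.getLog(SimpleSemanticAnalyzerHook.class);

      // Called after parsing, before semantic analysis; the AST may be inspected or replaced.
      @Override
      public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
          throws SemanticException {
        LOG.info("preAnalyze: AST root token is " + ast.getText());
        return ast;
      }

      // Called after semantic analysis, with the root tasks of the generated plan.
      @Override
      public void postAnalyze(HiveSemanticAnalyzerHookContext context,
          List<Task<? extends Serializable>> rootTasks) throws SemanticException {
        LOG.info("postAnalyze: plan has " + rootTasks.size() + " root task(s)");
      }
    }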

Configurations for Simple Semantic Analyzer Hook:
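The hook class must be on Hive's classpath (for example via hive.aux.jars.path), and is registered through the hive.semantic.analyzer.hook property in hive-site.xml (class name as in the sketch above):

    <property>
      <name>hive.semantic.analyzer.hook</name>
      <value>com.example.hive.SimpleSemanticAnalyzerHook</value>
    </property>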

Monday, April 21, 2014

Urlencode and Urldecode in Hive using the Reflect UDF

Hive doesn't offer a built-in UDF to perform urlencode or urldecode. One option could be to write a custom UDF to fill the void.

On the other hand, a rather straightforward alternative that achieves the same functionality, as shown on the forum, is to use the very generic Reflect UDF.
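For example (a sketch; the table and column names are illustrative), reflect can invoke the static encode/decode methods of java.net.URLEncoder and java.net.URLDecoder directly:

    SELECT reflect('java.net.URLEncoder', 'encode', url, 'UTF-8') FROM urls;
    SELECT reflect('java.net.URLDecoder', 'decode', encoded_url, 'UTF-8') FROM urls;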

Thursday, April 10, 2014

Hive Query Plan Generation

A Hive query is passed through several built-in modules before the final plan is generated.

The stages/modules are:

Query
     => (1) Parser
          => (2) Semantic Analyzer
               => (3) Logical Plan Generation
                    => (4) Optimizer
                         => (5) Physical Plan Generation
                              => Executor to run on Hadoop
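The result of this pipeline can be inspected with Hive's EXPLAIN command, which prints the stage dependencies and operator tree of the generated plan (table and column names below are illustrative):

    EXPLAIN SELECT dept, count(*) FROM employees GROUP BY dept;
    -- EXPLAIN EXTENDED adds further physical plan detail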

Monday, March 24, 2014

Hive History File

Hive maintains a history of all commands executed via the Hive CLI. These commands are written to a file called .hivehistory in the user's home folder.
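For example, the most recent commands can be reviewed straight from a shell:

    tail -n 20 ~/.hivehistory    # show the 20 most recently recorded Hive CLI commands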

Sunday, February 2, 2014

Build Hadoop from Source Code with Native Libraries and Snappy Compression

When running Hadoop from a pre-built binary distribution (a downloaded hadoop-<Latest_Version>.tar.gz bundle), Hadoop may not be able to load certain native libraries. The following warning is displayed when starting up Hadoop:

"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable "

This issue comes up because the architecture of the machine on which Hadoop is now being run differs from that of the machine on which the distribution was originally compiled. While most of Hadoop (written in Java) loads up fine, the native libraries (compression, etc.) do not get loaded (more details to follow).

The fix is to compile Hadoop locally & use it in place of the pre-built Hadoop binary (tar.gz). At a high level, this requires installing the native build toolchain, compiling Hadoop with the native profile, and deploying the rebuilt distribution, as sketched below.
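A sketch of the typical flow, assuming Maven, a JDK, protobuf, and the snappy development headers are already installed (exact flags vary by release; see BUILDING.txt in the Hadoop source tree):

    # Unpack the Hadoop source matching the version in use (version illustrative)
    tar xzf hadoop-2.2.0-src.tar.gz && cd hadoop-2.2.0-src

    # Build a binary distribution with the native libraries, requiring Snappy support
    mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy

    # The rebuilt distribution lands under hadoop-dist/target/; deploy it in place
    # of the downloaded tar.gz. Newer 2.x releases can verify native loading with:
    hadoop checknative -a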

Sunday, October 20, 2013

General Availability (GA) for Hadoop 2.x

The Hadoop 2.x GA is nothing less than a big leap forward. Most of the released features, such as YARN (a pluggable resource management framework), NameNode HA, and HDFS Federation, were long awaited. As per the official mail to the community, this release includes:

"To recap, this release has a number of significant highlights compared to Hadoop 1.x:
        • YARN - A general purpose resource management system for Hadoop to allow MapReduce and other data processing frameworks and services
        • High Availability for HDFS
        • HDFS Federation
        • HDFS Snapshots
        • NFSv3 access to data in HDFS
        • Support for running Hadoop on Microsoft Windows
        • Binary Compatibility for MapReduce applications built on hadoop-1.x
        • Substantial amount of integration testing with rest of projects in the ecosystem

 Please see the Hadoop 2.2.0 Release Notes for details."


As per the same mail, users are encouraged to move forward to the 2.x branch, which is more stable & backward compatible.