Showing posts with label Hadoop. Show all posts
Showing posts with label Hadoop. Show all posts

Saturday, June 21, 2014

Getting Table Information From Hive Metastore

The Hive Metastore holds various meta information about Hive tables such as schema names, table names, partitions, fields, permissions, and so on. The metastore is built in a schema within a relational database, outside of Hive, and accessed via services by Hive clients. An embedded Derby database is configured as the default database for experimentation purposes, and must be changed over to something like MySql, Postgre, etc. for production use.

The Hive Metastore ER diagram is fairly straightforward. Once familiar with the schema, it is easy to query the metastore for information about the Hive tables. Here's a sample query to identify all Partitioned tables from a given Hive databases:

Monday, April 21, 2014

Urlencode and Urldecode in Hive using the Reflect UDF

Hive doesn't offer an in-built UDF to perform Urlencode or Urldecode. One option could be to write a custom UDF to fill for the void.

On the other hand, a rather straight forward alternative to have the same feature, as shown on the forum, us using the very generic Reflect UDF.

Thursday, April 10, 2014

Hive Query Plan Generation

Hive query is passed through several built-in modules for the final plan to be generated.

The stages/ modules are:

Query 
     => (1) Parser 
               => (2) Semantic Analyzer 
                     => (3) Logical Plan Generation  
                           => (4) Optimizer 
                                => (5) Physical Plan Generation
                                         => Executor to run on Hadoop

Monday, March 24, 2014

Hive History File

Hive maintains a history of all commands executed via the hive cli. These commands are written to a file called .hivehistory, on the user's home folder.

Sunday, March 16, 2014

Hive Optimizations

Explain Plan

Mappers and Reducers Count

Map Joins

Sorting

Optimization step:
Between the logical & physical plan generation phase of hive, hive optimizations gets executed. The current set of optimizations include:

  • Column pruning
  • Partition pruning
  • Sample pruning
  • Predicate push down
  • Map join processor
  • Union processor
  • Join reorder
  • Union processor
More on each of these optimizations to follow..

Sunday, February 2, 2014

Build Hadoop from Source Code with Native Libraries and Snappy Compression

When running Hadoop using a pre-built Hadoop binary distribution (a downloaded hadoop-<Latest_Version>.tar.gz bundle), Hadoop may not be able to load certain native libraries. The following warning is also displayed at the time of starting up Hadoop:

"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable "

This issue comes up due to the difference in architecture of the particular machine on which Hadoop is being run now vs. that of the machine on which it was orginally compiled. While most of Hadoop (written in Java) loads up fine, there are native libraries (compression, etc.) which do not get loaded (more details to follow).

The fix is to compile Hadoop locally & use it in place of the pre-built Hadoop binary (tar.gz). At a high level this requires:

Sunday, October 20, 2013

General Availability (GA) for Hadoop 2.x

The Hadoop 2.x GA is nothing less that a big leap forward. Most of the features released such as YARN - a pluggable resource management framework, Name Node HA, HDFS Federation and so on were long awaited. As per the official mail to the community, this release includes:

"To recap, this release has a number of significant highlights compared to Hadoop 1.x:
        • YARN - A general purpose resource management system for Hadoop to allow MapReduce and other other data processing frameworks and services
        • High Availability for HDFS
        • HDFS Federation
        • HDFS Snapshots
        • NFSv3 access to data in HDFS
        • Support for running Hadoop on Microsoft Windows
        • Binary Compatibility for MapReduce applications built on hadoop-1.x
        • Substantial amount of integration testing with rest of projects in the ecosystem

 Please see the Hadoop 2.2.0 Release Notes for details."


Also as per the official email to the community, users are encouraged to move forward to the 2.x branch which is more stable & backward compatible.

Monday, September 17, 2012

Workaround for Copy Command from WebHDFS

At the moment the WebHDFS api doesn't offer the Copy command. As a result, the client ends up having to download the file to the local disk and re-upload the files via the Create command. Since this ends up being a lot of round trips all the way to the client (typically a non Java based client) the following workaround can be set up to partly alleviates the problem.

Set up a HDFS Webdav server on one of the DN or NN boxes. Issue the Copy command to the Webdav server via a REST call. Free up the client application, while letting the Webdav server with much better connectivity & proximity to the HDFS complete the Copy command request.

Friday, August 10, 2012

REST based integration with HDFS via WebHDFS

WebHDFS offers a set of perfectly good REST api's for any application to integrate with the HDFS. This can be particularly advantageous for applications written in languages other than Java such as Rails, Dot Net and so on.

Within our lan with commodity desktop class boxes with 2.5 Ghz processors, 8 G Ram set up, and a replication factor of 2, we found about  read/ write speeds of about 27 Mbps via WebHDFS. This was only a shade slower than the 30 Mbps that we were getting via raw file transfers between the same Data Nodes (DN).

Another observation was that our best transfer rates were achieved by setting the buffer size to 22K. We played around with several other buffer size values, but found 22K to be the magic number. Hoping to find some logical explanation for this observation.