Hive maintains a history of all commands executed via the Hive CLI. These commands are written to a file called .hivehistory in the user's home folder.
Insights on Java, Big Data, Search, Cloud, Algorithms, Data Science, Machine Learning...
Monday, March 24, 2014
Sunday, March 16, 2014
Hive Optimizations
Explain Plan
Map Joins
Sorting
Optimization step:
Between the logical and physical plan generation phases, Hive runs its optimizations. The current set of optimizations includes:
- Column pruning
- Partition pruning
- Sample pruning
- Predicate push down
- Map join processor
- Union processor
- Join reorder
More on each of these optimizations to follow..
Sunday, February 2, 2014
Build Hadoop from Source Code with Native Libraries and Snappy Compression
When running Hadoop using a pre-built Hadoop binary distribution (a downloaded hadoop-<Latest_Version>.tar.gz bundle), Hadoop may not be able to load certain native libraries. The following warning is also displayed at the time of starting up Hadoop:
"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable "
This issue comes up due to the difference in architecture between the machine on which Hadoop is being run now and the machine on which it was originally compiled. While most of Hadoop (written in Java) loads up fine, there are native libraries (compression, etc.) which do not get loaded (more details to follow).
The fix is to compile Hadoop locally & use it in place of the pre-built Hadoop binary (tar.gz). At a high level this requires:
Installations:
- Local dev box (Ubuntu 13, etc.)
- Build tools set-up:
- gcc g++ make maven cmake zlib zlib1g-dev libcurl4-openssl-dev
- Native libraries installed: (Snappy, etc)
- libsnappy1, libsnappy-dev
- Protobuf source code: (download here)
- Hadoop source code: (download here)
- Hadoop patch for pom.xml issue
Build:
- mvn package -Pdist,native -DskipTests -Dtar
Export Environment: Finally, export Hadoop environment variables
- export HADOOP_HOME=/path/to/hadoop/folder
- export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
- export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Latest binary: Available at <HADOOP_SOURCE>/hadoop-dist/target/hadoop-<Latest_Version>.tar.gz
Wednesday, January 15, 2014
Mocks for Unit testing Shell Scripts
1. Look at shunit for a sense of what kind of unit testing can be performed for Shell scripts.
2. For mocking up specific steps/ programs in the script, make use of alias.
Expand alias in the script:
shopt -s expand_aliases
Mock up inbuilt program value via alias:
alias find='printf "%s\n" abc.txt efg.txt; #'
Within a shell script testScript.sh this would be something as follows:
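shopt -s expand_aliases
# Mock find: print two fixed file names; the trailing '#' comments out find's real arguments
alias find='printf "%s\n" abc.txt efg.txt; #'
# Source the script under test: aliases are not inherited by a child process
. ./runFindFileScript.sh
assertEquals "2 files found" 2 "$countFilesFound"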
Tuesday, December 24, 2013
Mechanical Sympathy
A term that's gaining traction thanks to the LMAX architecture. Low latency applications running on the JVM need to be hardware-aware to a large degree to best leverage the computing power of multi-core/ multi-processor architectures.
More details to follow soon on the topic out here, for the moment you could refer to Martin Fowler's post.
Tuesday, December 3, 2013
Real-time Face Reading
Machines are getting better and better at face reading. Ancient mystics have another reason to worry. It won't be long before recommendation engines of various kinds get built that leverage this sort of technology.
More about algorithms in this space to follow..
Thursday, November 28, 2013
Precision and Recall
Terms popular within search and Information Retrieval (IR) domains.
Precision: Is all about accuracy. Whether all results that have shown up are relevant.
Recall: Has to do with completeness. Whether all valid/ relevant results have shown up.
Needs detailing..
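As a minimal sketch of the two definitions above: precision is the fraction of returned results that are relevant, and recall is the fraction of relevant results that were returned. The function name and the document ids below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved vs. relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that actually showed up
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 results returned, 3 of them relevant,
# while 6 relevant documents exist in the index.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 7, 8, 9])
print(p, r)  # 0.75 0.5
```

Here three of the four results are relevant (precision 0.75), but only three of the six relevant documents were found (recall 0.5).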
Sunday, November 24, 2013
Pentaho 5.0 Community Edition Released
The stable build of Pentaho Community Edition (CE) 5.0 has been released. Many many new features have made it to this build. Particularly keen to try out the enhancements to web services deployment via Carte. More details can be found in the Pentaho 5.0 release notes.
Saturday, November 16, 2013
Pentaho Clusters
Pentaho provides the option to scale out Kettle Transformations via Pentaho Clusters. It is fairly straightforward to set up a Pentaho cluster and elastic/dynamic clusters. The 1-2-3 of what needs to be done is:
1. Start the Carte Instances
There are two kinds of instances - Masters & Slaves. At least one instance must act as the dedicated Master which takes on the responsibility of management/ distribution of transformations/ steps to slaves, fail-over/ restart and communicating with the slaves.
The Carte instances need a config file with details about the Master's port, IP/ Hostname etc. For sample config files take a look at the pwd folder in your default Pentaho installation (/data-integration/pwd).
E.g. With defaults, a cluster can be started on localhost with:
./carte.sh localhost 8080 (for the master)
./carte.sh localhost 8081 (for slave1)
./carte.sh localhost 8082 (for slave2), and so on.
2. Set up Cluster & Server Information using Spoon (GUI)
Switch to the View tab, next to the Design tab in the left hand panel of the Spoon GUI.
Click on 'Slave Servers' to add new Slave servers (host, port, name, etc.). Make sure to check the 'is_the_master' checkbox for the Master server.
Next click on the 'Kettle Cluster Schemas' and use 'Select Slave servers' to choose the slave servers. For the ability to dynamically add/ remove slave servers, also select the 'Dynamic Cluster' checkbox.
3. Mark Transformation Steps to Execute in Cluster Mode
Right click on the step which needs to be run in the cluster mode, select Clustering & then select the cluster schema. You will now see a symbol next to the step (CxN) indicating that the step is to be executed in a clustered mode.
The cluster settings will be similar to what you see in the left panel in the image. You can also see a transformation, with two steps (Random & Replace in String) being run in a clustered mode in the right panel in the image below.
Monday, November 11, 2013
Shannon Entropy and Information Gain
Shannon's Information Gain/ Entropy theory gets applied a lot in areas such as data encoding, compression and networking. Entropy, as defined by Shannon, is a measure of the unpredictability of a given message. The higher the entropy, the more unpredictable the content of the message is to a receiver.
Correspondingly, a high Entropy message is also high on Information Content. On receiving a high Entropy/ high Information Content laden message, the receiver has a high Information Gain.
On the other hand, when the receiver already knows the contents of the message (or knows of a certain bias in it), the Information Content of the message is low. On receiving such a message the receiver has less Information Gain. Effectively, once the uncertainty about the content of the message has reduced, the Entropy of the message has also dropped and the Information Gain from receiving such a message has gone down. The reasoning this far is quite intuitive.
The formula for Entropy (H) calculation:
H(X) = -Summation[ p(x) * log( p(x) )] over all possible values/ outcomes of x, i.e. {x1, ..., xn}
where p(x) = probability of each of the values/ outcomes of x {x1, ..., xn}.
The log is in a certain base b.
(All calculations in base 2)
Eg 1.a: In the case of a single fair coin toss:
x = {H, T}
& p(x) = {1/2, 1/2}
H(X) = -[1/2 * log(1/2) + 1/2 * log(1/2) ] = 1
Eg 1.b: For a biased coin toss, with three times higher likelihood of a Head:
x = {H, T}
& p(x) = {3/4, 1/4}
H(X) = -[3/4 * log(3/4) + 1/4 * log (1/4) ] = 0.811 (Entropy is lower than 1.a, Information Gain is lower).
Eg 1.c: For a completely biased coin toss, with two Heads:
x = {H, T}
& p(x) = {1, 0}
H(X) = -[ 1*log(1) + 0*log(0) ] = 0 (Entropy is zero; the 0*log(0) term is taken to be 0)
The Entropy (& unpredictability) is the highest for a fair coin (example 1.a) and decreases for a biased coin (examples 1.b & 1.c). Due to the bias the receiver is able to predict the outcome (favouring the known bias) in the latter cases, resulting in a lower Entropy.
The observation from the (2-outcomes) coin toss case generalizes to the N-outcomes case, and the Entropy is found to be highest when all N-outcomes are equally likely (fair).
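The coin toss examples can be reproduced with a few lines of Python. This is a small sketch of the entropy formula in base 2; the function name is an assumption, and outcomes with zero probability are skipped since they contribute nothing to the sum.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: the sum of -p * log2(p) over all outcomes."""
    # Outcomes with p == 0 are skipped: p * log(p) tends to 0 as p tends to 0.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))              # fair coin (Eg 1.a)
print(round(entropy([0.75, 0.25]), 3))  # biased coin (Eg 1.b)
print(entropy([1.0, 0.0]))              # two-headed coin (Eg 1.c)
print(entropy([0.25] * 4))              # four equally likely outcomes
```

The last call illustrates the N-outcomes case: four equally likely outcomes give 2 bits, the maximum possible for N = 4.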
Saturday, October 26, 2013
Be Hands On
For as long as you can. Think specialists (surgeons, pilots, etc.) who get better clocking in more hours with/into their art - doing, practicing, persevering.
Sunday, October 20, 2013
General Availability (GA) for Hadoop 2.x
The Hadoop 2.x GA is nothing less than a big leap forward. Most of the features released, such as YARN (a pluggable resource management framework), Name Node HA, HDFS Federation and so on, were long awaited. As per the official mail to the community, this release includes:
"To recap, this release has a number of significant highlights compared to Hadoop 1.x:
• YARN - A general purpose resource management system for Hadoop to allow MapReduce and other other data processing frameworks and services
• High Availability for HDFS
• HDFS Federation
• HDFS Snapshots
• NFSv3 access to data in HDFS
• Support for running Hadoop on Microsoft Windows
• Binary Compatibility for MapReduce applications built on hadoop-1.x
• Substantial amount of integration testing with rest of projects in the ecosystem
Please see the Hadoop 2.2.0 Release Notes for details."
Also as per the official email to the community, users are encouraged to move forward to the 2.x branch which is more stable & backward compatible.
Tuesday, October 1, 2013
Need Support to Lift with Confidence
Brace up terminologies coming your way...
Support: A measure of the prevalence of an event x in a given set of N data points. Support is effectively a first level indicator of something occurring frequent enough (say greater than 10% of the times) to be of interest.
S(x) = count(x)/ N = P(x), (i.e. probability of x)
where,
count(x) = Total number of times x has occurred
P(x) = Probability of occurrence of x
In the case of two correlated events x & y,
S(xy) = count(xy)/N = P(xy) = P(x ∩ y)
Confidence: A measure of predictability of two events occurring together. Once confidence is above a certain threshold (say 70%), it means the two events show up together often enough to be used for rules/ decision making, etc.
C(y,x) = Support(xy)/ Support(x)
= S(xy)/ S(x)
= P(x ∩ y)/ P(x) = P(y | x), (i.e. conditional probability of y given x)
Lift: A measure of the power of association between two events. It indicates how much more likely event y is to occur once it is known that event x has occurred.
L(y, x) = P(y|x)/ P(y) = C(y,x)/ S(y)
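To make the three formulas concrete, here is a small sketch that computes them over a list of transactions. The function names and the shopping-basket data are hypothetical and chosen only for illustration.

```python
def support(transactions, items):
    """S(items): fraction of transactions that contain all the given items."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, x, y):
    """C(y, x) = S(xy) / S(x), i.e. P(y | x)."""
    return support(transactions, {x, y}) / support(transactions, {x})


def lift(transactions, x, y):
    """L(y, x) = C(y, x) / S(y)."""
    return confidence(transactions, x, y) / support(transactions, {y})


# Hypothetical baskets (N = 5)
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]

print(support(baskets, {"bread"}))                       # S(x)
print(support(baskets, {"bread", "butter"}))             # S(xy)
print(round(confidence(baskets, "bread", "butter"), 3))  # C(y, x)
print(round(lift(baskets, "bread", "butter"), 3))        # L(y, x)
```

A lift above 1 suggests that x makes y more likely than its baseline rate, a lift of 1 suggests independence, and a lift below 1 suggests that x makes y less likely.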
Sunday, September 22, 2013
False Negative, False Positive and the Paradox
First a bit about the terms False Positive & False Negative. These terms are associated with the nature of error in the results churned out by a system trying to answer an unknown problem, based on a (limited) set of given/ input data points. After analysing the data, the system is expected to come up with a Yes (it is Positive) or a No (it is Negative) type answer. There is invariably some error in the answer due to noisy data, wrong assumptions, calculation mistakes, unanticipated cases, mechanical errors, surges, etc.
A False Positive is when the system says the answer is Positive, but the answer is actually wrong. An example would be a car's sensitive burglar alarm system that starts to beep due to heavy lightning & thunder on a rainy day. The alarm at this stage is indicating a positive hit (i.e. a burglary), which is not really happening.
On the other hand, a False Negative is when the system answers in a Negative, where the answer should have been a Positive. False negatives happen often with first level medical tests and scans which are unable to detect the cause of pain or discomfort. The test report of "Nothing Abnormal Detected" at this stage is often a False Negative, as revealed by more detailed tests performed later.
The False Positive Paradox is an interesting phenomenon where the likelihood of a False Positive shoots up significantly (& sometimes beyond the actual positive) when the actual rate of occurrence of a condition within a given sample group is very low. The results are thanks to basic likelihood calculations as shown below.
Let's say in a group of size 1,000,000 (1 Mn.), 10% are doctors. Let's say there's a system wherein you feed in a person's Unique ID (UID) and it tells you if the person is a doctor or not. The system has a 0.1% chance of incorrectly reporting a person who is not a doctor to be a doctor (a False Positive). For simplicity, assume that every actual doctor is reported correctly.
Now, let's work out our confidence levels of the results given out by the system.
Actual No. of Doctors (AD1) = 10% * 1 Mn = 100,000 - (i)
False Positives (FP1) = 0.1% * (Total population that is not a doctor) = 0.1% * 900,000 = 900 - (ii)
Confidence level = AD1/ (AD1 + FP1) = 100,000/ (100,000 + 900) ~ 99%
On the other hand, if just 0.1% of people in the group are actually doctors (while the rest of the info. remains the same), the confidence level works out to be quite different.
Actual No. of Doctors (AD2) = 0.1% * 1 Mn = 1,000 - (iii)
False Positives (FP2) = 0.1% * (1,000,000 - 1,000) = 0.1% * 999,000 = 999 - (iv)
Confidence level = AD2/ (AD2 + FP2) = 1,000/ (1,000 + 999) ~ 50%
This clearly shows that the likelihood of a positive answer being a False Positive has shot up from under 1% to as much as 50%, when the occurrence of a condition (number of doctors) within a given population dropped from 10% (i.e. 100,000) to a low value of 0.1% (i.e. 1,000).
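The arithmetic can be checked with a short sketch. It assumes a false positive rate of 0.1% (the rate that yields the ~50% figure) and that every actual doctor is reported correctly, i.e. there are no false negatives; the function name is made up for illustration.

```python
def positive_confidence(population, prevalence, false_positive_rate):
    """Fraction of positive answers that are correct (no false negatives assumed)."""
    actual_positives = population * prevalence
    false_positives = (population - actual_positives) * false_positive_rate
    return actual_positives / (actual_positives + false_positives)


# 10% of 1 Mn. people are doctors
print(round(positive_confidence(1_000_000, 0.10, 0.001), 3))
# only 0.1% of 1 Mn. people are doctors
print(round(positive_confidence(1_000_000, 0.001, 0.001), 3))
```

With the error rate held constant, only the prevalence changes between the two calls, which is what drives the drop in confidence.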
Thursday, September 12, 2013
Transparently
While doing software development you might hear of change being introduced "transparently". What does this mean?
Transparency in this context is similar to how a looking glass is transparent. One can barely make out that it exists. Think of a biker who pulls down the glass visor of his helmet when troubled by wind blowing into his eyes. His sight of the road & beyond continues to function without his noticing the transparent visor layer in-between.
Similarly, when a change is introduced transparently on the server side, it means the dependent/ client side applications needn't be told/ made aware of this change on the server side. The old interfaces continue to work as is, communication protocols remain the same, and so on.
The above kind of transparency is different from the transparency of a "transparent person" or a "transparent deal" or a "white box system", where the internals (like thoughts, implementation, ideas, details, etc.) are visible.
Tuesday, September 10, 2013
Saturday, August 31, 2013
Internals Of Solr/ Lucene Document Scoring
This post is in continuation of a discussion in the Solr community about the efficiency of the Solr/ Lucene scoring algorithm.
The search algorithm given here can be summarized to:
- Query query = Build query using user's search terms.
- Collector collector = Typically the TopScoreDocCollector
- Searcher searcher = new IndexSearcher(indexReader);
- searcher.search(query, collector);
- Weight weight = query.weight(searcher);
- Scorer scorer = weight.scorer(indexReader); // Typically BooleanScorer2
- scorer.score() => ConjunctionScorer (on every sub-scorer) in a leap frog/ skip ahead mechanism.
Algo needs improvement!
The AND query shows a leap frog/ skip ahead pattern implemented at the BooleanScorer2 (ConjunctionScorer) level.
For example with the query, q=A AND B, where A & B match doc. id's
A -> 1,3,5,7,11,15,17
B -> 2, 6
- Scorer starts with the min. of each, i.e. A -> 1 & B -> 2, & current highest doc id set to 2
- In the next few iterations:
A is advanced past the current highest value to 3 & current highest updated to 3.
B advanced past current highest 3 to 6 & current highest set to 6.
A advanced past 6 to 7 & current highest set to 7.
B has no more docs & this breaks out, without any match.
On the other hand if the two had converged/ agreed on a particular doc id, that doc would be scored & collected (added to a min-heap of scores).
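The leap frog pattern walked through above can be sketched in a few lines of Python. This is only an illustration over two sorted lists of doc ids, not Lucene's actual implementation; the function name is hypothetical, and a binary search (bisect_left) stands in for the scorer's skip-ahead call.

```python
from bisect import bisect_left


def leapfrog_intersect(a, b):
    """Intersect two sorted doc-id lists by advancing whichever list lags behind."""
    matches = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Both lists agree on this doc id: it would be scored and collected.
            matches.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip a ahead to its first doc id >= b's current doc id.
            i = bisect_left(a, b[j], i + 1)
        else:
            # Skip b ahead to its first doc id >= a's current doc id.
            j = bisect_left(b, a[i], j + 1)
    return matches


print(leapfrog_intersect([1, 3, 5, 7, 11, 15, 17], [2, 6]))  # []
print(leapfrog_intersect([1, 3, 5, 7], [3, 7, 9]))           # [3, 7]
```

The first call follows the same steps as the A AND B example: A moves to 3, B to 6, A to 7, and then B runs out of docs without a match.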
Wednesday, August 28, 2013
Thursday, August 15, 2013
Update Apt Repositories Location for Old Ubuntu Versions
When working with old versions of Ubuntu (11.04, 10.04, etc.), the biggest handicap is the lack of a functional package manager such as apt or synaptic. The package managers stop working because, at the end of support for an old version of Ubuntu, the team behind Ubuntu archives its repositories.
At this point, as a user you are supposed to upgrade (the recommended practice) to a more recent version. There are normally enough advance notices and alerts sent out by Ubuntu's Update Manager for the same. If, however, you have a compelling reason to stick with your current version, then here's a way to update your repositories' sources list to be able to install and use old software that is present in the archival repository. This is based on the recommendation made on this forum discussion.
1. Take a backup of your existing sources.list file
sudo cp /etc/apt/sources.list /etc/apt/sources.list_bk
2. Update all occurrences of ubuntu.com (& its child domains) to old-releases.ubuntu.com
sudo sed -i -e 's/archive.ubuntu.com\|security.ubuntu.com/old-releases.ubuntu.com/g' /etc/apt/sources.list
Additionally, update any region specific sub-domains. In my case I had to also replace in.archive.ubuntu.com with old-releases.ubuntu.com.
3. Finally, do a sudo apt-get update
Sunday, August 11, 2013
Resume Large Downloads in Mozilla Firefox
When downloading a large file via Firefox over a slow internet connection, you might get disconnected in between and end up with a partially downloaded file (with a .part file extension). Here's a little trick to resume the download after reconnecting, from the point where it had previously stopped.
1. Open up the Firefox Downloads window (Tools > Downloads OR use shortcut Ctrl+ Shift + Y). Not sure if this works with the recent versions of Firefox.
2. Click on the Resume button next to the file you were downloading/ got downloaded partially. If this works then great, nothing else to do.
3. On the other hand, if step 2 didn't work, then click on the Retry button. This will cause the download to start all over again from the very beginning. Let it start and go on to step 4.
4. Once a few bytes of the file have been downloaded & the progress meter on the Download window indicates that the new download has started (it might also give an estimate of time left), click on the Pause button next to the download.
5. Now go to your Downloads folder (where Firefox was downloading the file). Rename the first, partially downloaded file (the one with the .part extension) to the name of the new file that just started downloading in step 4, overwriting the new file.
6. Go back to the Firefox Downloads window and click the Resume/ Start/ Restart button next to the download process that was Paused in step 4.
That's it. The download should resume from the point where the initial partially downloaded (.part) file had stopped.
Thursday, August 1, 2013
Trees and Graphs
Useful things to know about the trees and graphs based data structures:
These:
- Binary Trees Vs. Binary Search Trees
- 2,4 and Red Black
- AVL
- Tries
- Heaps
- B & B+ Trees
- BFS, DFS
- Sorting - Quick, Merge, Radix, Timsort
- Kruskal's & Prim's algorithm for Minimum Spanning Trees
- Morris Traversal, without extra space or recursion, using Threaded Binary Trees
- Dijkstra's algorithm for shortest path
- Topological sorting
The big-O deal:
With Java, well tested implementations are mostly available:
- TreeSet
- TreeMap
- LinkedHashMap
- ConcurrentSkipListMap
- PriorityQueue
Saturday, July 20, 2013
Thursday, July 11, 2013
Solr Analyzers Basics
Solr offers several Analyzers to pre-process document fields being indexed and searched. As part of modelling the schema one needs to make an informed choice for the specific chain of Analyzers to be applied to every field (fieldType) defined in the schema.xml.
To start off, one needs to understand the different kinds of Analyzers and their purpose:
- Char Filters (or CharacterFilterFactories)
- Always applied first, i.e. before Tokenizers
- Operates at the Character level (of the field values)
- Zero or More Char Filters can be chained together. Get applied as per the sequence in schema.xml
- Tokenizers (or TokenizerFactories)
- Converts stream of Characters into a series of Tokens
- Only One Tokenizer can be there in each Analyzer chain
- Token Filters (or TokenFilterFactories)
- Always applied last, i.e. after Tokenizers
- Operates at the Tokens level generated by the Tokenizers
- Zero or More Token Filters can be chained together. Get applied as per the sequence in schema.xml
To take an example, let's say we have a field title with the value (V1) "Mr. James <b>Bond</b> MI007". Now we run it through the following:
1. Character FilterFactory (One): HTMLStripCharFilterFactory (CF1)
(Output: "Mr. James Bond MI007")
2. Tokenizer (One): StandardTokenizerFactory (T)
(Output: Tokens: [ALPHANUM: "Mr.", ALPHANUM:"James", ALPHANUM:"Bond", ALPHANUM:"MI007"])
3. TokenFilters (Two): WordDelimiterFilterFactory (TF1) & LowerCaseFilterFactory (TF2)
- Mr. => WordDelim => Lowercase => mr.
- James => WordDelim => Lowercase => james
- Bond => WordDelim => Lowercase => bond
- MI007 => WordDelim => [MI, 007] => Lowercase => mi, 007
Finally the output text actually indexed: "mr. james bond mi 007"
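The order of the chain (char filter, then tokenizer, then token filters) can be mimicked with a toy Python pipeline. This is not Solr code and it does not reproduce the real factories' behaviour; each function is a simplified, hypothetical stand-in that only handles the example value above (it needs Python 3.7+ for splitting on empty matches).

```python
import re


def strip_html(text):
    # Char filter stand-in: drop anything that looks like an HTML tag.
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    # Tokenizer stand-in: split the character stream on whitespace.
    return text.split()


def split_letters_digits(token):
    # Token filter stand-in: split where letters meet digits (MI007 -> MI, 007).
    return re.split(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", token)


def analyze(value):
    tokens = tokenize(strip_html(value))                                     # CF1, then T
    tokens = [part for tok in tokens for part in split_letters_digits(tok)]  # TF1
    return [tok.lower() for tok in tokens]                                   # TF2


print(analyze("Mr. James <b>Bond</b> MI007"))
# ['mr.', 'james', 'bond', 'mi', '007']
```

Swapping the order of the stages changes the result, which is why the sequence declared in schema.xml matters.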
There are several other options and many more Analyzers that one could use. Among them, the different PatternReplace Analyzers, EdgeNGram and the simple WhitespaceTokenizerFactory are the more popular ones. Finally, if none of the standard ones are adequate for a specific use case, then there is also the option of writing a custom analyzer.
Monday, June 10, 2013
Solution for making Long GET Request to Solr via SolrNet
Solr has REST APIs available for performing various searches on indexed documents. The client generally issues GET requests to Solr with different parameters (fields, rows, facet, etc.) set. Since there typically are size/ query length limitations on GET requests (imposed by container, OS, etc.), Solr allows the same queries to be issued to the Solr RequestHandlers as POST requests as well.
We ran into one such issue with long GET request to Solr from SolrNet and did a few changes to solve the same.
Solr Side Changes:
First up, we increased the headerBufferSize of the application server as explained on SO here and increased the maxBooleanClauses parameter in solrconfig.xml. This allowed Solr side to start responding to much longer GET requests. The problem however wasn't solved. The client side was a dot net application running within IIS having additional length limitations imposed by Windows OS & the dot net framework.
SolrNet Side Changes:
In round two, we went for a better fix and switched over to POST requests in place of long GET requests. The solution is largely the same as mentioned on the SolrNet group here & here. The difference is that we switch over to a POST request from within the Get() method of the SolrConnection.cs class when the request string is longer than a configurable threshold value.
public string Get(string relativeUrl, IEnumerable<KeyValuePair<string, string>> parameters)
{
    // Build the query string from the request parameters
    string query = GetQuery(parameters);
    if (isQueryLong(query))
    {
        // Too long for a GET: send the same parameters as a form-encoded POST body
        var bytes = Encoding.UTF8.GetBytes(query);
        using (var content = new MemoryStream(bytes))
            return PostStream(relativeUrl, "application/x-www-form-urlencoded", content, null);
    }
    else
    {
        // Do it the normal way via GET request
    }
}
Update: PostSolrConnection.cs class has made it to the head branch of SolrNet.
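The GET/POST switch is independent of SolrNet and can be sketched in a language-neutral way. The following is a minimal illustration, not the SolrNet code: the function name, the 2000 character threshold and the localhost URL are all assumptions made up for the example.

```python
from urllib.parse import urlencode

# Hypothetical threshold; tune it to the limits of your own server stack
MAX_GET_QUERY_LENGTH = 2000

def build_request(base_url, params, max_len=MAX_GET_QUERY_LENGTH):
    """Return (method, url, body) for a Solr search call."""
    query = urlencode(params)
    if len(query) > max_len:
        # Long query: move the parameters into a form-encoded POST body
        return ("POST", base_url, query.encode("utf-8"))
    # Short query: keep the usual GET with the parameters in the URL
    return ("GET", base_url + "?" + query, None)

solr_url = "http://localhost:8983/solr/select"  # assumed endpoint
short_req = build_request(solr_url, [("q", "id:1")])
long_query = " OR ".join("id:%d" % i for i in range(1000))
long_req = build_request(solr_url, [("q", long_query)])
print(short_req[0], long_req[0])
```

The POST body would be sent with the application/x-www-form-urlencoded content type, the same one used in the C# snippet above.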
Tuesday, May 28, 2013
Redmine Project Management Tool
While looking for an open source Agile project management tool, somewhat of an alternative to Rally, I chanced upon Redmine. The initial feel of the tool has been good so far.
We needed an integrated tool that would allow various teams to collaborate. Redmine does well on this count as it has a task tracker, bug tracker and knowledge repository (file/ document management and wiki), all rolled into one.
Additionally, we have been able to migrate our bugs and user accounts from Bugzilla, to get off the ground quickly. Now it is about letting the rubber hit the road, and having the teams start working with Redmine.
Monday, May 27, 2013
SolrNet Separate Highlighting Query - hl.q
Solr allows highlighting of matched sections in field values. There are several parameters that can be set by the caller to adjust the highlighting behaviour.
SolrNet, a library to connect to Solr from dot net applications, also exposes HighlightingParameters in its core library. However, only a small subset of the parameters is currently exposed.
I recently needed to use the hl.q parameter to issue a separate, more specific highlighting query to Solr. The workaround was to make use of the ExtraParams option from the base CommonQueryOptions class.
The same approach could be used for any of the other parameters not exposed by SolrNet, such as hl.BoundaryScanner, per field highlighting, maxScan, etc., essentially all the 3.5x onward features mentioned on the Solr Highlighting wiki.
Dictionary<string, string> extraParams = new Dictionary<string, string>();
extraParams.Add("hl.q", "content:Great");
HighlightingParameters hp= new HighlightingParameters();
hp.Fields = new List<string>() { "content" };
var results = solr2.Query("type:book AND content:Great", new QueryOptions
{
Start = 0,
Rows = 10,
Highlight = hp,
ExtraParams=extraParams
});
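To see what the extra parameter amounts to on the wire, here is a rough sketch of the request parameters such a query would carry. The exact set and order of parameters SolrNet sends is an assumption here; q, start, rows, hl, hl.fl and hl.q are the standard Solr parameter names.

```python
from urllib.parse import urlencode

params = [
    ("q", "type:book AND content:Great"),  # main query: decides which documents match
    ("start", 0),
    ("rows", 10),
    ("hl", "true"),                        # switch highlighting on
    ("hl.fl", "content"),                  # field(s) to highlight
    ("hl.q", "content:Great"),             # separate query used only for highlighting
]
query_string = urlencode(params)
print(query_string)
```

Any other highlighting parameter that SolrNet does not expose can be appended to the list in the same way.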
Friday, May 3, 2013
Php Script To Display Process, Vmstat, Disk Usage, Syslog Of A Linux Server Via A Browser
A Php script that executes some standard shell programs for monitoring resource utilization & processes on a given Linux box. The script directs the output to a web-browser.
Apache web-server should be installed on the server. To run, copy the script to the DocumentRoot (/var/www/html). Appropriate execute rights (-rw-x) need to be given to the apache user (which runs this script, but is not the owner) to execute this Php file & to be able to access /var/log/syslog.
Save this file as: showHealth.php in the /var/www/html folder:
<html>
<body>
<script type="text/javascript">
<!--
function toggle(id){
DIV1=document.getElementById(id);
if (DIV1.style.display=="none") DIV1.style.display="block";
else DIV1.style.display="none";
}
//-->
</script>
<div id="outermost">
<div id="mcdetails">
Showing stats from:
<!-- ifconfig eth0 -->
<?php
echo shell_exec("ip addr show | grep 'inet' | tail -1");
?>
<br/><br/>Server Date is:
<?php
echo shell_exec("date");
?>
<br/><br/>
</div>
<a href="#" onclick="toggle('vmstat');">-------Vmstat----------</a><br/><br/>
<div id="vmstat" style="display:none">
<?php
$output = shell_exec("vmstat| sed 's/$/<br\/>/g'| sed 's/ /./g'");
echo $output;
?>
</div>
<br/><a href="#" onclick="toggle('df');">-------df -m----------</a><br/><br/>
<div id="df" style="display:none">
<?php
$output = shell_exec("df -m | sed 's/$/<br\/>/g'| sed 's/ /./g'");
echo $output;
?>
</div>
<br/><a href="#" onclick="toggle('ps');">-------ps aux----------</a><br/><br/>
<div id="ps" style="display:none">
<?php
$output = shell_exec("ps aux | sed 's/$/<br\/><br\/>/g'| sed 's/ /./g'");
//$output = shell_exec('ls -l');
echo $output;
?>
</div>
<br/><a href="#" onclick="toggle('syslogdiv');">-------syslog----------</a><br/><br/>
<!-- sudo chmod 615 /var/log/syslog -->
<div id="syslogdiv" style="display:none">
<?php
$output = shell_exec("tail -200 /var/log/syslog | sed 's/$/<br\/>/g' ");
//$output = shell_exec('ls -l');
echo $output;
?>
</div>
</div>
</body>
</html>
Saturday, April 20, 2013
Linux/Unix Shell Function For Date Addition and Subtraction
Here is a small shell script to do date addition and subtraction. This works on the bash shell with GNU Date.
#!/bin/bash
function getPreviousDateTime(){
d1=`date -d "$1" +'%Y-%m-%d %H:%M:%S'`;
secs1=`date -d"$d1" +%s`;
secs2=`expr $secs1 - $2`;
# change to %H:%M:%S if you want the nos. left padded
result=`date -d@"$secs2" +'%Y-%m-%d %-H:%-M:%-S'`;
# return formatted string
echo $result;
}
function getNextDateTime(){
d1=`date -d "$1" +'%Y-%m-%d %H:%M:%S'`;
secs1=`date -d"$d1" +%s`;
secs2=`expr $secs1 + $2`;
result=`date -d@"$secs2" +'%Y-%m-%d %H:%M:%S'`;
# return formatted string
echo $result;
}
# Tests
echo `getPreviousDateTime "2012-04-29 18:00:31" 30`
echo `getPreviousDateTime "2012-04-29 00:00:00" 60`
echo `getPreviousDateTime "2012-04-29 02:00:00" 60`
echo `getNextDateTime "2012-04-29 19:00:00" 60`;
echo `getNextDateTime "2012-04-30 23:59:00" 60`;
echo `getNextDateTime "2012-02-28 23:59:00" 60`;
echo `getNextDateTime "2012-02-28 2:59:00" 60`;
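For comparison, the same arithmetic can be done without GNU date. This is a small sketch, not a drop-in replacement: the function names are made up, and unlike the shell version it works on naive timestamps (no time zone or daylight saving handling) and always zero-pads the output.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

def get_previous_datetime(value, seconds):
    # Parse the timestamp, subtract the offset in seconds, format it back
    return (datetime.strptime(value, FMT) - timedelta(seconds=seconds)).strftime(FMT)

def get_next_datetime(value, seconds):
    # Parse the timestamp, add the offset in seconds, format it back
    return (datetime.strptime(value, FMT) + timedelta(seconds=seconds)).strftime(FMT)

# Crossing midnight backwards, and crossing into Feb 29 of a leap year
print(get_previous_datetime("2012-04-29 00:00:00", 60))
print(get_next_datetime("2012-02-28 23:59:00", 60))
```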
Wednesday, April 17, 2013
Upload to Amazon S3 Bucket via Signed Url with Server Side Encryption
Continuing further from my previous post on upload & download from Amazon S3 bucket via signed URLs, here is how to enable Server Side Encryption (SSE) for the file being uploaded to S3.
Add an x-amz-server-side-encryption request parameter to the GeneratePresignedUrlRequest before getting the signed url:
generatePresignedUrlRequest.addRequestParameter("x-amz-server-side-encryption","AES256");
Monday, April 1, 2013
Upload and Download from Amazon AWS S3 Bucket via Signed Url
While the code snippets use the Java AWS SDK, in principle the same steps will work with the other SDKs as well.
1. Get hold of FederatedCredentials using your AWS credentials:
Pass in proper access Policy settings for the FederatedCredentials on the S3 Bucket and/ or Item.
E.g.
Policy policy = new Policy();
policy.withStatements(new Statement(Effect.Allow)
.withActions(S3Actions.GetObject).withResources(new Resource("arn:aws:s3:::<bucketName>/"+<parentLocation>+"*")));
For Download you could additionally set up ResponseHeaderOverrides for withContentDisposition, ContentType, etc.
2. Get BasicSessionCredentials using the Federated Credentials
3. Generate GeneratePresignedUrlRequest
4. Finally, generate a pre-signed url via the S3Client object:
AmazonS3 s3 = new AmazonS3Client(basicSessionCredentials);
URL url = s3.generatePresignedUrl(urlRequest);
5. To test this:
- Download:
Get the url.toString() & hit it from a browser
- Upload:
curl -X PUT -H "content-type:application/octet-stream" -T <someFile.txt> '<s3SignedUrl>'
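The access policy from step 1 is, in the end, a JSON policy document. The sketch below builds a roughly equivalent read-only document by hand to make its shape visible. It is an approximation: the JSON the Java SDK actually serializes may differ in detail, and the function name, bucket name and prefix are made-up placeholders.

```python
import json

def read_only_policy(bucket_name, parent_location):
    # Allow s3:GetObject on every key under the given prefix
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::%s/%s*" % (bucket_name, parent_location),
            }
        ],
    }

# Hypothetical bucket and prefix, for illustration only
policy = read_only_policy("my-bucket", "reports/")
print(json.dumps(policy, indent=2))
```

For uploads the statement would need a write action (such as s3:PutObject) instead of, or in addition to, s3:GetObject.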
Wednesday, March 20, 2013
Uploading Large Files In Chunks To Amazon S3
A collection of best practices based on my experience building a scaled out solution for the server side file upload handler.
1. Authentication/ Authorization
2. Chunking
3. Stateless upload & Session
4. Shared memory for post file operations
5. Retries & Failover
6. Bulk operations
To be completed..
Wednesday, March 13, 2013
Autovue Jump To Page
Autovue is a browser based document viewing & markup application.
To open up a specific page of a document in the viewer simply set up an ONINIT javascript call back method via the applet param.
<PARAM NAME="ONINIT" VALUE="loadPage();">
// Set the file to open & page no params
function loadPage(){
var myApp = window.document.applets["JVue"];
myApp.setFile("http://yourwebiste.com/docs/abc.jpg");
myApp.setPage(pageNoToOpen);
}
You can get more info on this from the Advanced Scripting Functionality section of the InstallConfigGuideCS of Oracle Autovue.
Friday, March 1, 2013
Atomic Updates via SolrNet
As of today the SolrNet API doesn't offer a way to issue atomic updates to a running Solr server. While SolrNet is supposed to offer this feature sometime in the future, the following alternative can be used in the interim.
1. Build a custom atomic update XML message:
string updateXml = "<add> <doc> <field name='employeeId'>05991</field> <field name='office' update='set'>Walla Walla</field> </doc></add>";
(See: http://wiki.apache.org/solr/UpdateXmlMessages for more details)
2. Get hold of the connection object (via ServiceLocator):
string connectionStr = string.Format("{0}.{1}.{2}", typeof(SolrConnection), typeof(T), typeof(SolrConnection));
SolrConnection connection = ServiceLocator.Current.GetInstance<SolrConnection>(connectionStr);
3. Issue a call to Solr via the connection object:
connection.Post("/update", updateXml);
Will be adding sample code snippets soon..
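The update message from step 1 can also be assembled with an XML library instead of string concatenation, which takes care of escaping the field values. A minimal sketch, with a made-up helper name and only the update='set' operation covered:

```python
import xml.etree.ElementTree as ET

def build_atomic_update(key_field, key_value, updates):
    # <add><doc> holding the unique key plus one field per update
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    key = ET.SubElement(doc, "field", {"name": key_field})
    key.text = key_value
    for name, value in updates.items():
        # update="set" asks Solr to replace just this field's value
        field = ET.SubElement(doc, "field", {"name": name, "update": "set"})
        field.text = value
    return ET.tostring(add, encoding="unicode")

update_xml = build_atomic_update("employeeId", "05991", {"office": "Walla Walla"})
print(update_xml)
```

The resulting string is what would be posted to /update in step 3.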
Friday, February 15, 2013
Solr Cell, Tika And Pages
With Solr Cell, aka Tika, you get the power to index content from within a wide set of digital files such as Pdfs, Office, Text, etc.
Tika however doesn't naturally offer any demarcations for page boundaries. So you can search for content matches from a file, but not for specific pages from within these files.
Among several different ways to solve this problem, one way could be to index each page of the file as a separate document in Solr and do a field collapsing/ result grouping on the search results by a common file identifier shared by all pages of the file.
Since there could be performance overheads with result grouping, another way is to index the combined file as one solr document (of type Combined) & each page as a separate solr document (of type Page) with a common file identifier. The search can then be performed initially against the combined document (type:combined AND text:abc) to identify files that match & then against the corresponding page type document (type:page AND file-id:123 AND text:abc) to identify pages.
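The two-phase lookup can be illustrated with a tiny in-memory stand-in for the index. Everything here is made up for the example (documents, field values, helper names); a real setup would send the two queries shown in the comments to Solr.

```python
# Tiny in-memory stand-in for the Solr index (all values are illustrative)
DOCS = [
    {"type": "combined", "file-id": "123", "text": "intro abc summary"},
    {"type": "page", "file-id": "123", "page": 1, "text": "intro"},
    {"type": "page", "file-id": "123", "page": 2, "text": "abc summary"},
    {"type": "combined", "file-id": "456", "text": "unrelated words"},
    {"type": "page", "file-id": "456", "page": 1, "text": "unrelated words"},
]

def find_files(term):
    # Phase 1: type:combined AND text:<term>
    return [d["file-id"] for d in DOCS
            if d["type"] == "combined" and term in d["text"].split()]

def find_pages(file_id, term):
    # Phase 2: type:page AND file-id:<id> AND text:<term>
    return [d["page"] for d in DOCS
            if d["type"] == "page" and d["file-id"] == file_id
            and term in d["text"].split()]

for file_id in find_files("abc"):
    print(file_id, find_pages(file_id, "abc"))
```

The second phase only runs for files that matched in the first, so the page-level documents are touched for a small set of file identifiers rather than for the whole index.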
Wednesday, February 6, 2013
Mocking AWS ELB Behaviour Locally For Testing
Once hosted out of Amazon, you make use of the AWS Elastic Load Balancer (ELB) for balancing load across your EC2 instances within or across Availability Zones (AZ). Since code gets developed and tested locally (outside of Amazon), at times you might want to test load balancer scenarios before deploying to production. Here's one way to mock up the load balancer behaviour for local testing.
Use Apache (you could very well use something like Nginx instead) in a reverse proxy, load balancer set up via mod_proxy & mod_proxy_balancer. Fairly simple for anyone with slight experience with configuring Apache. We used Apache as a load balancer front-end to IIS on local, exactly the way ELB would load balance in front of production IIS.
Additionally, since ELB was also an SSL end point for our production servers, we set up Apache to be the SSL end point (via mod_ssl) on local. Apache was configured to listen on port 443 (using a self-signed certificate), and would forward all traffic from port 443 to backend IIS on port 80.
Once we had that set-up going, we were quickly able to reproduce an issue with application generated Secure cookies not getting set properly across client request/ response. Once we had the fix on the local (which was to set the flag on the cookies in the request, not response) the same worked flawlessly on the AWS as well.
Wednesday, January 23, 2013
Headless Java Monster
You know you are up against the same fellow if you start seeing the java.awt.HeadlessException, typically running off a virtual server, or in the rare case of a dedicated server without a monitor (aka head).
The solution is simple. First shut down the application, tomcat, etc. that got the exception.
1. Install the X display manager.
On a Redhat this might mean that you have to install a x11 display driver (via yum, etc.). Search for something like
xorg-x11-drv*
On a Ubuntu on the other hand, you could install the Xvfb package (via apt, synaptic, etc.)
2. Start X:
Redhat: startx
Ubuntu: /usr/bin/Xvfb :1 -screen 0 1024x768x24 -ac +extension GLX +render -noreset
3. Export display:
With those done, now you should have entered the simpler "No X11 DISPLAY variable" zone. Simply export the display variable to fix this.
export DISPLAY=:0.0
(In the Ubuntu case above you have to export DISPLAY=:1)
4. Allow all users to connect/ use this Display variable:
xhost +
Now restart the application, tomcat, etc. that you were trying to run initially & it should work. Hope nothing headless ever troubles no man!
Tuesday, November 20, 2012
C# - A home away from home
For someone with years of experience in Java, the stint to code in C# seemed like a cake walk. The large scale port of popular Java frameworks into dot net such as NHibernate, Spring.net, NUnit, etc., make life all that much easier for anyone starting to bridge the gap.
There are however some bits that trouble us Java natives no end. Particularly anything & everything to do with the web.config file. This one file has enough traps in it to confuse the hell out of any sane minded developer. The file has hints for the IDE (Visual Studio), the framework (Asp.net), the web server (IIS), & everyone else connected to the runtime.
The other bit that seems bothersome is how dependencies, packages & Dll's get referenced. Particularly with frequently changing Dll's, & the tools optimized for caching, it gets difficult to know what version is really loaded & running.
Anyway, once you get past these issues, the ride ahead is through familiar grounds.
Friday, November 2, 2012
Using Pentaho Kettle to Index Data in Solr
Pentaho Kettle is a fine open source ETL tool written in Java. There are several implementations, hooks and plugins available off the shelf for performing various Extract (E), Transform (T), Load (L) processes on data from a source location to a destination location.
Solr, on the other hand, is a rich and powerful production grade search engine written on top of Lucene. So how would it be to get the two to function in tandem? To use Kettle to load data into Solr for indexing purpose.
The data load phase for indexing in Solr is very similar to an ETL process. The data is sourced (Extract) from a relational database (MySQL, PostgreSQL, etc.). This data is denormalized and transformed to a Solr compatible document (Transform). Finally the transformed data is streamed to Solr for indexing (Load). Kettle excels in performing each of these steps!
A Kettle ETL job to load data into Solr for indexing is a good alternative to using Solr's very own Data Import Handler (DIH). Since DIH typically runs off the same Solr setup (with a few common dependencies), there's some intermixing of concerns with such a set-up, between what Solr is good at (search & indexing) versus what the DIH is built to do (import documents). The DIH also competes for resources (CPU, IO) with Solr. Kettle has no such drawbacks and can be run off a different set of physical boxes.
There are additional benefits of using Kettle such as availability of stable implementations for working across data sources, querying, bulk load, setting up of staged workflows with configurable queues & worker threads. Also Kettle's exception handling, retry mechanism, REST/ WS client, JSON serializer, custom Java code extension, and several handy transformation capabilities, all add up in its favour.
On the cons, given that the call to Solr would be via standard REST client from Kettle, the set-up would not be Solr Cloud or Zookeeper (ZK) aware to be able to do any smart routing of documents. One option to solve this could be to use the Custom Java Code step in Kettle and delegate the call to Solr via the SolrJ's CloudSolrServer client (which is Solr Cloud/ ZK aware).
Thursday, October 25, 2012
Amdahl's Law for Max Utilization and Speedup With Parallelization
There's a lot of talk these days about constraints introduced by the CAP theorem. One other equally relevant law for parallel and distributed systems is Amdahl's law. Amdahl's law talks about the amount of speedup that can be achieved when a given single processor (or single threaded) task is split and handed over to N processors (or threads) to be executed in parallel.
To take an example let's work with the typical entrance examination problem: "if one person takes 2 hours to eat up a cake, how long would four people take to eat the same cake?".
Simple: each person eats up one-fourth of the cake. So time taken = time-for-1-person/ N = 2/ 4 = 0.5 hours. Right? Ya, well, unless it's the very same cake that we were referring to, the one that got eaten ;)
What Amdahl's law says is that if the single processor task has sub-tasks or steps (unlike the cake eating example above), not every sub-task/ step can be parallelized. There is some percentage of sub-tasks (F%) that needs to be run sequentially. As a result the speedup is not N times but less, computed as follows.
- F = % to be run sequentially (used as a fraction in the formula)
- (1 - F) = % to be parallelized
- N = No of Processors on which tasks can be executed in parallel
- Speedup = 1 / [ (% to be run sequentially) + (% that can be parallelized) / (No of processors) ] = 1 / [ F + (1 - F)/N ]
- To go back to our cake eating example,
% of sub-tasks that has to run serially (F) = 0, & N = 4.
So, Speedup = 1 / [ 0 + (1 - 0)/4 ] = 4
i.e. 4 times speedup. Indeed 4 people took 1/4 th the time, i.e. 30 mins, as compared to 2 hours by one person.
- Let's now add some sequential steps before the cake eating. First you got to pay for it at the cashier & then take delivery from the delivery counter.
These have to be done in sequence & only by one person (why pay twice?). Thanks to the monopoly & popularity of the cake vendor, there's invariably a long queue at the cashier. It takes 15 mins to pay and another 15 mins to get the delivery of the cake.
So the steps are:
Step                                   Single Person (mins)   4 People (mins)
(i) Pay at the cashier (One person)    15                     15
(ii) Collect Cake (One Person)         15                     15
(iii) Eat Cake (Parallel)              120                    30
-----------------------------------------------------------------------------
Total                                  150                    60
Speedup calculation is,
% Sequential (F) = 30 / (30 + 120) = 1/5 = 20%
Speedup = 1/ [1/5 + (4/5)/4 ] = 5/ 2 = 2.5 - (A)
Let's validate this,
Speedup = Time taken by Single Person / Time taken by 4 people = 150/ 60 = 2.5 - (B)
A = B. QED.
So you see, due to the 20% sequential tasks the speedup has dropped from 4 times to 2.5 times.
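The formula is easy to play with in code. The sketch below (the function name is made up) recomputes the two cake examples and then varies N to show how the sequential 20% caps the achievable speedup.

```python
def amdahl_speedup(sequential_fraction, processors):
    # Speedup = 1 / [ F + (1 - F) / N ]
    f = sequential_fraction
    return 1.0 / (f + (1.0 - f) / processors)

print(amdahl_speedup(0.0, 4))  # cake eating only, nothing sequential
print(amdahl_speedup(0.2, 4))  # with the 30 minutes of paying and collecting

# More eaters help less and less once 20% of the work is sequential
for n in (2, 4, 8, 1000):
    print(n, round(amdahl_speedup(0.2, n), 2))
```

As N grows the (1 - F)/N term shrinks towards zero, so the speedup approaches 1/F. With F = 20% that limit is 5, no matter how many people share the cake.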
Wednesday, October 10, 2012
Brewer's CAP Theorem
Brewer's CAP theorem talks about Consistency (C), Availability (A), Partition (P) tolerance, as the constraints that primarily govern the design of all distributed systems. There's a lot of literature available online explaining the theorem. The summary is that given that network partitions (P) will happen, pick one of the other two - Consistency (C) or Availability (A) for designing your system on a case by case basis (since you can't have all three)!
A partition could be caused by the failure of some kind of component - hardware (routers, gateway, cables, physical boxes/ nodes, disks, etc.) and/or software. When that happens:
- If you pick Consistency (C) => All your systems, processing, etc. are blocked/ held up until the failed component(s) recover.
This has been the default with traditional RDBMS (thanks to their being ACID compliant). For financial & banking applications this normally has to be the choice.
- On the other hand, if you pick Availability (A) => All systems, other than the currently partitioned/ failed ones, continue to function as is within their own partitions. Seems good? Well, not quite, because this results in inconsistencies across the two (or more) partitioned sections.
Systems designed with Availability (A) as their selection (over C) must therefore be able to live with inconsistencies across different partitions. Such systems also need some automated way to get back to a consistent state later (eventual consistency), once the partitioned/ failed systems have recovered.
This is mostly the design choice with NoSQL stores, and with services such as Amazon AWS where eventual consistency within some reasonable time window (of a few seconds to a few minutes) is acceptable.
Monday, September 17, 2012
Workaround for Copy Command from WebHDFS
At the moment the WebHDFS API doesn't offer a Copy command. As a result, the client ends up having to download the file to the local disk and re-upload it via the Create command. Since this means a lot of round trips all the way to the client (typically a non-Java client), the following workaround can be set up to partly alleviate the problem.
Set up a HDFS Webdav server on one of the DN or NN boxes. Issue the Copy command to the Webdav server via a REST call. Free up the client application, while letting the Webdav server with much better connectivity & proximity to the HDFS complete the Copy command request.
Wednesday, September 12, 2012
Remotely Debug Solr Cloud in Eclipse Using JPDA, JDWP, JVMTI & JDI
The acronyms first:
JPDA - Java Platform Debug Architecture
JDWP - Java Debug Wire Protocol
JVMTI - JVM Tool Interface
JDI - Java Debug Interface
To debug any of the open source Java projects such as Solr using Eclipse, rely on the JDWP feature available within any standard JVM. You can get a lot more info about the terms and architecture here.
At a high level the concept is that there is a JVM to be debugged, the debuggee (Solr), & a client side debugger JVM (Eclipse). The two communicate over JDWP. Thanks to the standardized wire protocol, the client may even be a non-JVM application that implements the protocol.
One of the two JVMs acts as the debugger server (the one that waits for the other side to connect). The other JVM acts as the debugger client, which connects to the debugger server to start the debugging process.
In our case, to keep things simple let Solr be the debugger server, while Eclipse can be the debugger client. The configurations then are as follows.
On Solr side (assuming Solr Cloud):
java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 -Djetty.port=7200 -Dhost=myhost -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -Djava.util.logging.config.file=etc/logging.properties -DnumShards=3 -DzkHost=zk1:2171 -jar start.jar
Note: Since we have set suspend = y, Solr side will stay suspended until the Eclipse debugger client has connected
On Eclipse side:
Go to Run > Debug Configurations > Remote Java Application
Then choose Standard Socket Attach. Host: localhost (or IP). Port: 8000 (the same as set above)
Also, in Eclipse you should have checked out the Solr source code from Solr trunk as a project. This will allow you to put breakpoints at appropriate locations to help with the debugging. So go on, give this a shot, and happy debugging!
Friday, August 10, 2012
REST based integration with HDFS via WebHDFS
WebHDFS offers a set of perfectly good REST APIs for any application to integrate with HDFS. This can be particularly advantageous for applications written in languages and frameworks other than Java, such as Rails, .NET and so on.
Within our LAN, with commodity desktop class boxes (2.5 GHz processors, 8 GB RAM) and a replication factor of 2, we found read/ write speeds of about 27 Mbps via WebHDFS. This was only a shade slower than the 30 Mbps that we were getting via raw file transfers between the same Data Nodes (DN).
Another observation was that our best transfer rates were achieved by setting the buffer size to 22K. We played around with several other buffer size values, but found 22K to be the magic number. Hoping to find some logical explanation for this observation.
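For a feel of what such a REST call looks like, here is a minimal Python sketch that only builds a WebHDFS URL (it makes no network calls). The host name, port and file path are placeholders assumed for illustration, and the helper name webhdfs_url is made up; the /webhdfs/v1 prefix, the op parameter and the buffersize parameter follow the WebHDFS REST API.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, hdfs_path, op, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Hypothetical NameNode host, port and file; 22K buffer as in the observation above.
read_url = webhdfs_url(
    "namenode.example.com", 50070, "/user/demo/data.txt",
    "OPEN", buffersize=22 * 1024,
)
print(read_url)
```

Any HTTP client in any language can then issue a GET on that URL, which is what makes WebHDFS convenient for non-Java applications.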
Friday, July 20, 2012
Managing Hierarchies and Hierarchical Data in Databases and Solr
Databases - Materialized Paths, Nested Sets, Nested Intervals, Closure Sets
Solr - Data flattening, Hierarchical Faceting
To be updated..
Friday, July 6, 2012
Sunday, June 10, 2012
Triangular Matrix
There are two special kinds of square matrices (n X n matrix) known as the Upper Triangular Matrix & the Lower Triangular Matrix.
Upper Triangular Matrix - Elements below the main diagonal are all zero.
i.e. M[i,j] = 0, for i > j
Lower Triangular Matrix - Elements above the main diagonal are all zero.
i.e. M[i,j] = 0, for i < j
where,
Main diagonal - The diagonal running from M[1,1] to M[n,n] of an n X n square matrix.
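The two definitions translate directly into code. A small Python sketch, assuming the matrix is a square list of lists and using 0-based indices (so the main diagonal runs from m[0][0] to m[n-1][n-1]); the function names are made up for this example.

```python
def is_upper_triangular(m):
    """True if every element below the main diagonal (i > j) is zero."""
    n = len(m)
    return all(m[i][j] == 0 for i in range(n) for j in range(n) if i > j)


def is_lower_triangular(m):
    """True if every element above the main diagonal (i < j) is zero."""
    n = len(m)
    return all(m[i][j] == 0 for i in range(n) for j in range(n) if i < j)


upper = [[1, 2, 3],
         [0, 4, 5],
         [0, 0, 6]]
lower = [[1, 0, 0],
         [2, 3, 0],
         [4, 5, 6]]

print(is_upper_triangular(upper), is_lower_triangular(upper))
print(is_upper_triangular(lower), is_lower_triangular(lower))
```

A diagonal matrix satisfies both checks, since it is upper and lower triangular at the same time.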
Sunday, May 20, 2012
Java Classloaders
Key things about Java classloaders:
1. There is a hierarchy among classloaders:
Bootstrap <--- Extension <--- System <--- Custom
(each arrow points from a child classloader to its parent)
- Child classloaders (typically) delegate class loading to parent classloaders. The child classloader's findClass() method is not called if the parent classloader can load the class.
- Custom classloaders can override the default delegation chain to a certain extent.
- Due to the delegation chain of classloaders, ensure classes to be loaded by custom classloaders are not present on the system class path, boot class path, or extension class path.
2. The Bootstrap classloader is a special classloader included with the JVM and written in native code. It is tasked with loading all the core classes that are part of the JRE.
None of the other classloaders can override the Bootstrap classloader's behaviour.
3. All other classloaders are written in Java.
4. Extension classloader loads classes from the extension directories: JAVA_HOME/jre/lib/ext/
5. A more popular alternative to using the Extension classloader is to use the System classloader, which loads classes from the locations listed in the CLASSPATH environment variable.
6. Finally, Custom classloaders can be written to override certain defaults like delegating class loading to parents, etc. Custom classloaders are commonly used by Application servers (such as Tomcat).
7. Separate Namespaces per Classloader:
The same class loaded by two different classloaders is treated as two different classes. Trying to cast an object of one class (loaded by classloader 1) to a reference of the other (loaded by classloader 2, though identical in its fully qualified class name) results in a ClassCastException.
8. Lazy loading and Caching of Classes:
Classloaders load classes lazily. Once a class is loaded, the classloader caches it for its own lifetime (in practice, often the lifetime of the JVM).
Key Methods of Classloaders:
To be detailed..
Dynamic Reloading of Classes:
Because the caching of classes by classloaders cannot be overridden, reloading classes within a running JVM poses problems. To reload a class dynamically (a common use case for app. servers), a new instance of the classloader itself needs to be created.
Once the earlier classloader is orphaned and becomes garbage, the classes loaded & cached by it (and reachable only via that classloader) also become garbage, and can then be collected by the GC.
Sunday, May 13, 2012
Reflexive, Symmetric, Transitive, Associative, Commutative, Distributive Laws and Operators
A refresher:
Reflexive: a = a
(Every element is mapped back to itself)
Symmetric: a = b, => b = a
Transitive: a = b & b = c, => a = c
Associative: (x # y) # z = x # (y # z)
(The operator is immune to different grouping of the operands)
Commutative: x $ y = y $ x
(The operator is immune to changing order of operands)
Distributive: x @ (y ^ z) = (x @ y) ^ (x @ z)
(Outer Operand x can be distributed over the other two y & z without affecting the results)
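To make the three operator laws concrete, here is a small Python sketch that brute-force checks them over a handful of sample integers. Passing on a finite sample is of course not a proof, only a sanity check; the function names are made up for this example.

```python
from itertools import product
from operator import add, mul, sub


def is_associative(op, values):
    """(x # y) # z == x # (y # z) for all sampled x, y, z."""
    return all(op(op(x, y), z) == op(x, op(y, z))
               for x, y, z in product(values, repeat=3))


def is_commutative(op, values):
    """x $ y == y $ x for all sampled x, y."""
    return all(op(x, y) == op(y, x) for x, y in product(values, repeat=2))


def distributes_over(outer, inner, values):
    """x @ (y ^ z) == (x @ y) ^ (x @ z) for all sampled x, y, z."""
    return all(outer(x, inner(y, z)) == inner(outer(x, y), outer(x, z))
               for x, y, z in product(values, repeat=3))


sample = range(-3, 4)
print(is_associative(add, sample), is_associative(sub, sample))
print(is_commutative(mul, sample), is_commutative(sub, sample))
print(distributes_over(mul, add, sample), distributes_over(add, mul, sample))
```

Addition and multiplication pass the associative and commutative checks while subtraction fails both, and multiplication distributes over addition but not the other way round.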
Monday, April 2, 2012
NP, NP-Hard & NP-Complete
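NP: The class of decision problems that have a non-deterministic, polynomial time solution, i.e. that can be solved by a non-deterministic Turing Machine (TM) in polynomial time. (Mind you, it is still polynomial time.) The other way to define it is: given a candidate solution (certificate), one can verify its correctness using a deterministic TM in polynomial time.
NP-Hard: A hard problem - at least as hard as the hardest problems in NP, i.e. every problem in NP can be reduced to it in polynomial time. An NP-Hard problem need not itself be in NP, and no polynomial time algorithm is known for any NP-Hard problem.
NP-Complete: ( NP ) <intersection> ( NP-Hard ), i.e. the problems that are both in NP and NP-Hard.
NP & NP-Hard are different classes, and they overlap exactly in the NP-Complete problems. As a result, many NP-Hard problems are not NP-Complete; the Halting Problem, for instance, is NP-Hard but not in NP.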
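The "verify a certificate in polynomial time" definition is easy to illustrate with Subset Sum, a well known NP-Complete problem: finding a subset of numbers that adds up to a target is hard in general, but checking a proposed subset is a single pass. A Python sketch, with the function name and the sample data made up for this example:

```python
def verify_subset_sum(numbers, target, certificate):
    """Check a certificate (a list of indices into numbers) in linear time."""
    # Each index may be used at most once and must be in range.
    if len(set(certificate)) != len(certificate):
        return False
    if any(i < 0 or i >= len(numbers) for i in certificate):
        return False
    # The chosen numbers must add up to the target.
    return sum(numbers[i] for i in certificate) == target


numbers = [3, 34, 4, 12, 5, 2]
print(verify_subset_sum(numbers, 9, [2, 4]))   # picks 4 and 5
print(verify_subset_sum(numbers, 9, [0, 1]))   # picks 3 and 34
```

The verifier never searches for a solution; it only checks the one it is handed, which is why its running time stays polynomial in the input size.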
Monday, March 5, 2012
Rails Cheat Sheet
This is a living doc. with my notes on getting off the ground and surviving around a Rails app. My experience with Rails is rather patchy, and I have hardly any to speak of with Ruby.
Set up
Ubuntu OS, Rails 2.0, Ruby 1.8 running on Mongrel behind an Apache web server
Books
A good starting point would be "Four Days on Rails" by John McCreesh.
Installing on Ubuntu:
- Install the following through Synaptic
Ruby 1.8
Gem 1.8
mysql client/ server (should be installed to connect & run mysql server locally)
- Further install all essential gems (you can go about this lazily, installing what you need):
sudo gem install -v=2.0.2 rails
Gems installed on my dev:
gem list
actionmailer (2.1.0, 2.0.2)
actionpack (2.1.0, 2.0.2)
activerecord (2.1.0, 2.0.2)
activeresource (2.1.0, 2.0.2)
activesupport (2.1.0, 2.0.2)
cgi_multipart_eof_fix (2.5.0)
daemons (1.0.10)
fastthread (1.0.1)
gem_plugin (0.2.3)
localization_generator (1.0.8)
mongrel (1.1.5)
mongrel_cluster (1.0.5)
mysql (2.7)
rails (2.0.2)
rake (0.8.1)
salted_login_generator (2.0.2)
Creating your first app
rails <Application Name>
edit config/database.yml for correct database settings
./script/generate model <modelName>
./script/generate controller <controllerName>
edit db/migrate/<record name> for the appropriate insert and delete scripts
rake db:migrate
rake db:migrate RAILS_ENV=production
Scaffold the controller (put the text active_scaffold :<modelName> in the controller)
Starting the server
Within the project folder type:
./script/server
Hit http://localhost:9000/ to test. Monitor the console outputs for any issues. Once you find things are working, you can turn this into a daemon (adding & to the script).
./script/server --env=production &