Algorithms, Design, Code and more

Wednesday, February 20, 2019

Professional

Almost all definitions of the word professional have an ethical component mentioned in the definition. Yet, we are flooded by news of frauds and malpractices from organizations across sectors and geographies, run by professionals mostly having top of the line credentials.

The article "Star-studded CVs and moral numbness" provides some interesting insights, esp. from an exercise with a batch of budding professionals from a top tier management school. The authors were perhaps dealing with a relatively young lot of individuals having less exposure (typically with <3 years work experience). Had the exercise been done within the walls of the corporate war-rooms, with hardened professionals clued into the ways of the real world, the outcome would have been totally different. Forget remorse, there would be no realization of any wrong doing. Voices of protests would be snubbed or worse, shown the exit door.

Monday, February 18, 2019

System Reliability

People are surrounded by devices of all kinds. Reliability of the device is one of the key aspects of the user's experience of a device, particularly over the long term. This also has an implication on the general opinion (positive or negative) that the user forms about the device, its brand & manufacturer. An understanding of reliability is thus important for the manufacturer and the user.

Reliability numbers are worked initially at the design phase by the manufacturer. Explicit targets for the product are set which govern the design choices. Later several rounds of testing done by the manufacturer and/ or the certifying authority mostly before device roll-out to ascertain the actual numbers. In certain cases these may need to be re-looked at due to unexplained failures, manufacturing defects, etc. while the device is in-service. Such evaluations can be performed during routine maintenance of the device or via explicit recall of the device to the designated service station. Data collected is analyzed to understand & resolve the underlying issues in the device and the causes of failures.

Reliability Analysis

There are some standard methods adopted by the manufacturers (OEMs), etc. to calculate reliability numbers of the device. These include among others quantitative techniques such as capturing Mean Time to Failure (MTTF), Mean Time Between Failure (MTBF) and Mean Time to Repair (MTTF) at the device and/ or its sub-components level. MTTF is a measure of the time (or number of cycles, runs, etc.) at which the device is likely to fail (failure rate), while MTBF is the equivalent value for repairable devices that accounts for the interval between failure incidents. MTTR is the corresponding time spent in repair. For repairable systems:
MTBF = MTTF + MTTR

These numbers are aggregates applicable to a general population of devices and not at one specific device level. So a device with MTBF value of 30,000 hours, implies that a population of size 30 devices are likely to run for 1000 hours on an average, collectively clocking 30K device hours.

For an exponential Reliability R(t) = exp(-t/MTBF), probability of a specific device surviving upto its rated t=MTBF is:
R(t) = exp(MTBF/MTBF) = exp(-1) = 36.8%

For repairable systems, another term used often is Availability.
Availability = System Up Time/ (System Up Time + System Down Time)

For mission critical systems that can not accept any downtime, Availability equals Reliability!

Weibull Analysis

Statistical techniques such as the Weibull Analysis is also very common for reliability computations. Weibull analysis makes use of data from failed as well as non-failed devices to work out device lifespan & reliability. A set of samples of the device`are observed under test conditions & a statistical distribution (model) is fitted to the data collected from these test samples. The fitted model is thereafter used to make predictions about the reliability of the entire population of devices which would be operating under the real world conditions.

The Weibull model uses three parameters for β: Shape (shape of the distribution), η: Scale parameter (spread), γ: Location (Location in time). Interestingly, the Weibull model is able to nicely capture the standard U-shaped, Bath Tub reliability curve typically seen over various device lifespans. In the early life-span of a device (testing, acceptance stage) the failure & defect rates are high (& has a β < 1). As these get fixed, the failure rate drops quickly to the steady operation ready, Useful Life stage.

In the Useful Life (β = 1) stage the device is stable & ready to roll-out to the end-user. Defects in this second stage are mainly due to design issues, operation, human errors, unexpected failures, etc. Finally, the device enters the Wear-out phase (β > 1), where the device or certain sub-components start showing natural wear & tear. Repairs & maintenance help to keep the device in good working shape for a while. Finally, there comes a time when the repairs are no longer viable due to costs or other reasons & then the device is taken out of service. Decisions around scheduled inspections, overhauls, etc. can be planned based on the different stage of the device life cycle & the corresponding values of β.

There are other exponential distributions such as Poisson, Rayleigh, Gamma, Beta, etc. which are applied to specific types of devices, domains and failures cases. Selection of the appropriate distribution is important for a proper reliability analysis.

Sampling and Confidence Levels

Once devices are live, actual on-ground analysis can also be done for certain categories of devices. Data can be collected from a representative sample of devices operating on ground. Techniques from statistics for reliably sampling & deriving confidence intervals for an underlying population can be applied for this purpose.

The analysis is typically done for a Binomial population of devices where a certain p% of the population (devices) are expected to fail, while (1-p)% are expected to operate fine (without failure). Assuming a confidence interval of c (tolerance interval), the sample size n is worked out by taking a Normal approximating for the Binomial distribution (simplifying the calculations):

n = Z²X (p) X (1 - p)/ c²

where Z is constant chosen based on the desired confidence value from the Standard Normal Curve. Z = 1.96 for Confidence 95%, 2.58 for 99%, and so on.

   (E.g. 1) For example, if for a certain device, 4% devices are expected to fail, p=0.04:

      (1.a) With 99% confidence level, for a 1% confidence interval, c=0.01:
n = 2.58*2.58*0.04*(1-0.04)/(0.01*0.01) = 2,556 samples are needed

    (1.b) For a tighter 0.1% confidence interval, c=0.001:
n = 2.58*2.58*0.04*(1-0.04)/(0.001*0.001) = 255,605 samples (100x more than (1.a))are needed

   (1.c) Similarly, for a higher confidence level of 99.99% (Z=3.891), at the same 1% confidence level:
n = 3.891*3.891*0.04*(1-0.04)/(0.01*0.01) = 5814 samples (more than (1.a)) are needed

The above sample size estimator assumes a very large, or infinite population. In case of finite sized population, the following correction is applied to the cases above:

   n_finite = n / (1 + (n-1)/size_pop)

(1.b.1) Applying the correction to the case (1.b) above, assuming a total population of a fixed 30,000 devices only:
n_finite = 255605/ (1 + (255605-1)/30000) = 28,400 devices, which need to be sampled to achieve a 0.1% confidence interval (tolerance) at the 99% confidence levels.

As discussed earlier, the reliability trends for devices tend to fit lifetime dependent exponential distributions such as Weibull better. Confidence levels in such cases are worked out accordingly using the appropriate distribution. For instance with a small constant failure rate (λ) expected, an exponential or a Poisson reliability model is a better approximation to Binomial than Normal. The confidence interval for λ is worked out as a Chi-Square distribution with 2n degrees of freedom, where n is a count of failures seen over time in the sampled set of devices.

Redundancy

Some systems need high fault tolerance. Reliability for such systems can be improved by introducing redundant systems in parallel, thereby replacing the Single Point of Failure (SPOF). When one device fails an alternate one can perform the job in its place.

Reliability of the redundant system:
R = 1 - p1 X p2 X .. X pk

   where p1,..,pk are the probability of failure of the backup redundant systems.

    (E.g. 2) In the above example where the single device system with a failure rate p=0.04, & a reliability of 96% (1-0.04), if we introduce an identical redundant/ backup device also with p=0.04, reliability goes up to R = 1 - 0.04*0.04 = 99.84%.

k-out-of-n Systems

An alternate set-up is a consensus based (k out of n) system. In this set-up, the system fails only when more than the quorum number (k, typically 50%) of devices fail. The reliability of the quorum system is:
   R_quorum_system = 1 - probability of more than k (quorum) device failures

The reliability is maximized for a majority quorum, i.e. k = n/2+1.

Monitoring Systems

Another typical approach is to introduce monitoring systems. The monitoring systems can be in the form of a sensor (optical, non-optical), a logger, a heart-beat polling unit, a human operator, or a combination of these. Whenever the monitoring system finds the primary system faltering, it raise alarms for corrective measures to be taken which may include stopping/ replacing the faulty device and/ or switching over to a backup system if available.

The reliability of the monitoring systems is assumed to be much higher than the underlying system being monitored, ideally a 100%. Monitoring systems are taken to be operating in sequence to the underlying system, so the reliability of the overall system is:
    R_monitored_system = R_device X R_monitoring

In other words, a failure in either the device or the monitor or both, will result in failure of the system, increasing the overall chances of failure. Yet, monitoring systems are effective on ground since they are the first line of defense for the system. They are able to raise alarms for the human operator to intervene early (lowering MTTR).

In certain set-ups the monitoring system are also enabled to automatically switch over to a backup device when there is a failure with the primary device. This helps reduce the down time (MTTR) to a negligible value, if not zero. With a system that has redundant devices & a single monitoring system the SPOF shifts to the monitoring system. A further refinement to the system design (such as Zab, Paxos, etc.) entails setting up the monitoring system in a k-of-n, typically majority, quorum. All decisions regarding the state of the underlying devices is taken by the quorum. The majority quorum is also resilient upto k=n/2 failures of the monitoring system.

Through good system design & thought, the reliability at the system level can be significantly boosted even if the sub-components are less reliable. Design & engineering teams must possess sound reliability analysis skills to be able to build world class products. An awareness of reliability aspects also helps the end-user to decide on the right device that suits their requirements & continues to function properly over its lifespan.

Tuesday, February 5, 2019

Towards A Clean Ganga

The hope for a clean Ganga river remains eternal in our hearts. There have been several attempts at cleaning the Ganga over time, most recently under the Namami Gange project. The goal is to get to within the acceptable water quality standards, for a clean pollution free river (Nirmal Dhara), with uninterrupted, adequate flow (Aviral Dhara). Progress however, seems rather limited/ slow, with no due date in sight for making Ganga clean again.

For the data oriented, numbers on the current state of the Ganga are available on the CPCB website in real-time. There are some 30+ monitoring stations located at different points along the Ganga. These centres collect data from Ganga & publish it in near real-time. Beyond the rudimentary web portal of the CPCB, an API based access to the data should also be made available. This would allow other people to leverage the underlying data for analytical purposes & build interesting apps. Data can reveal insights on several aspects such as seasonal factors, flow volume, portions with pollution spikes, changes in pollution levels over time, impact due to specific events or interventions, etc. Open-sourcing data is the way to go!

Another source of data on Ganga water quality are the reports that get published by CPCB & other environmentalists/ researchers working in this area. At times the data published in the reports have been collected by the authors themselves & provide a secondary check to the numbers from CPCB & others.

Yet another, though less rigorous, option is to crowd-source the data. For various reasons (religious, tourism, adventure, livelihood, etc.) people visit different spots of the Ganga throughout the year. A few motivated people among them could help baseline the numbers on water quality using low-end, free/ cheap phone based apps & devices, & publish the results out for public use. Hydrocolor is one such phone based app developed as part of a Ph.D. dissertation, that uses the Phone camera (a RGB radiometer) to measure water quality. The app auto-calibrates to handle variations across devices, platforms, weather conditions, etc.

Similarly there is a home-made device called Secchi disks that can be used for measuring the turbidity of water. Aerial, drone & IOT based devices are also being conceived by people across the world as solutions to track health & pollution of water bodies in their respective cities. We could adapt such tools to monitor the state of the river Ganga over time as she progresses towards good health.

Friday, August 31, 2018

OpenDNS FamilyShield - A Safer Community

FamilyShield from OpenDNS is a free & simple DNS offering to block out most adult sites, proxy servers and phishing sites. Over the years since FamilyShield was first rolled out the service has continued to be effective & beneficial to millions of customers, particularly parents, in keeping net usage safe for children & families at home.

Innovations from OpenDNS include ideas such as leveraging the community for tagging domains (DomainTagging), identifying phishing sites (PhishTank), speeding up internet access via the OpenDNS Global Network, & a very clear/ open Anti-censorship policy. These are incorporated within the FamilyShield service to effectively block out harmful content & make internet access better for the user across devices. Finally, once ready to onboard, setting up FamilyShield on the router takes no effort at all!

Thursday, April 26, 2018

Biometric Authentication

Biometrics are being pegged as the next big deal for user authentication, esp. within the banking & financial sectors. Been having a few discussions to understand how good biometric authentication are as compared to traditional Pins? Is a return to being angootha-chaaps (thumb-print based) for all purposes the right thing to do, technically that is? Here's a lay-man attempt at answering some of those questions.

Staring off by calling out known facts & assumptions about thumb-prints (an example biometric):

Thumb-prints are globally unique to every human being.
(Counter: Enough people who don't have/ lose a thumb, or lose their thumb-prints due to some other reason. Also partial thumb-prints of two individuals taken of a portion of the thumb due to faults at the time of scanning, etc. may match.)

Thumb-prints stay consistent over the lifetime of an individual (adult).
(Counter: May not be true due to physical changes in the human body, external injuries, growths, etc.)

Computers are basically binary machines. So whether it's a document (pdf, doc), an image file (jpg, gif, etc.), a video (mp4), a music file (wav), a Java program, a Linux operating system, etc. all of the data, instructions, etc. get encoded into a string of bytes (of 0s & 1s).

The thumb-print scan of an individual is similar to an image file (following a standard protocol), encoded as a string of bytes.
The thumb-prints scans of two different individuals will result in two different strings of bytes, that are unique to the individual.
Subsequent scans of the thumb-print of the same individual will result in exactly the same string of bytes over-time.

That's enough background information for a rough evaluation. A thumb-print scan of a certain size, say 10Kb is just a string of 10,000 bits of 0s & 1s. This is unique to an individual & stays the same over the individual's lifetime.

A 4-digit Pin on the other hand is a combination of four Integer numbers. Each Integer typically gets encoded into a 32-bit string. A 4-digit Pin is therefore a 4 * 32-bit = 128-bit string. The Pin normally stays the same, unless explicitly changed (rather infrequent).

In simplistic terms, when a request to authenticate an individual is made to a computer, it reads the incoming string of bits (from the Pin or the thumb-print) & matches it against a database of known/ all existing (1-to-1 or 1-to-N matches) strings. To the computer other than the difference in length between the two encoded strings of thumb-print (10,000-bit) & Pin (128-bit), there's not much difference between the two.

On the other hand, the Pin seems much better than the thumb-print if it were ever to get compromised due to a breach or a malicious app or something. The Pin can simply be changed & a new 128-bit string can replace the earlier one going forward. But in the case of the thumb-print there's really nothing that can be done as the individual's thumb-print scan will stay the same over time!

Yet another alternative for authentication is to use One Time Password (OTP). The OTP is also a 4-digit number (128-bit string) but it is re-issued each time over a separate out-of-band channel (such as SMS), is short lived, & is valid for just one use. These features make the OTP way more robust & immune to breaches & compromise.

What is a biometric to the human being, is just another string of bits to the machine, very similar to the string of bits of a Pin or an OTP. From the stand-point of safety though, the OTP is far superior to the other two. As is the common practice, it maybe ok to use biometric authentication within environments such as government offices, airports, etc. where the network is tightly regulated & monitored. For end-user authentication however, such as within phone apps, or internet payments, or other channels where the network or device is orders of magnitude more insecure & vulnerable these are not ideal. In general OTPs should be the top pick & biometrics the last option in such cases:

OTP > Pin > Biometrics

Monday, April 9, 2018

Learning Deep

Head out straight to KdNugget's Top 20 Deep Learning Papers of 2018. Has a good listing of research publications spanning over the last 4-5 years. You could further go on to read the papers referred to within these papers & then those referred to in the referred papers & so on for some really deep learning!

Sunday, April 8, 2018

Application Security & OWASP

Lots of applications get developed these days to make the life of customers easy & comfortable. However, a cause for concern is the general lack of awareness of security aspects among app developers. As a result unsafe & buggy apps get released to production by the dozens.

Have come across quite a few such apps in recent times & duly reported them to the respective support/ dev teams. While some of these will get fixed, there does appear to be a lack of knowledge of security issues among the teams. Had they known they would have mostly got it right upfront. Retrospective patching while common for newly discovered vulnerabilities, is no substitute for incorporating current standards & best practices that are well researched & documented.

OWASP is one of the leading open standards on security vulnerabilities. OWASP Top-10 Application Security Risks (latest: 2017) include things like Injection, Broken Authentication, Sensitive Data Exposure, etc. There's a whole bunch of material available online including an e-book with details & fixes for the vulnerabilities for the different stake-holders of the app. These are like the safety-belts that must be incorporated in all apps before allowing them to go-live.

Another major cause for widespread security issues in apps is the use of vulnerable frameworks & third party libraries by them. Buggy Javascript (JS) libraries are particularly guilty of pushing vulnerabilities down to apps.

As per the Northeastern University research of outdated Javascript libraries on the web, of 133K websites evaluated 37% included at least one vulnerable library:
- "not only website administrators, but also the dynamic architecture and developers of third-party services are to blame for the Web’s poor state of library management"
- "libraries included transitively, or via ad and tracking code, are more likely to be vulnerable"

RetireJS initiative keeps a tab on the vulnerabilities in the JS libraries. As do the OWASP cheat sheets on 3rd Party JS & AJAX Security. Static analysers, security testing, sand-boxed executions, etc. are typical ways to address client side JS security vulnerabilities.

Security issues are equally widespread in frameworks & libraries from other languages. Java & Scala are fairly well covered by OWASP (though .Net, Php, etc. aren't). Evaluations of Java Spring framework against OWASP Top-10, listing of Java security framework, hdiv & Scala Frameworks provide context on how best to address security issues in some very popular frameworks.

Wednesday, March 7, 2018

Ubuntu 16.04 32-bit Display Issue/ Corruption on load/ boot

Recently tried refurbishing an old Benq Joybook U102 laptop with an Ubuntu OS. This one has a 32-bit Intel Atom Processor with 1G Ram, so needed the 32-bit Ubuntu-16.04 iso image (from alternative downloads).

Followed this up with the standard routine of creating a bootable usb using the downloaded iso, booting with the usb, plugging in the Lan cable (wifi drivers are initially unavailable/ downloaded later), formatting the disk & doing a fresh install (choosing the option to download all third-party drivers, etc). All this went off smoothly & the laptop was ready to reboot.

After restarting however found the display corrupted. Practically the entire screen from the left to right was covered with bright coloured stripes, dots & squares rendering it unusable. After a bit of fiddling around found a work-around to close the lid & reopen it, forcing the laptop to go into standby & then exit by pressing the power button. This did help restore the display, but felt there had to be a better solution.

A few suggestions online were to downgrade the Ubuntu kernel to an older version 4.12 or lower. Further search revealed the actual Bug in Kernel 4.13 : Intel Mobile Graphics 945 shows 80 % black screen. The work-around solution of Suspend/ Resume, as well as the proper solution of setting GRUB_GFXPAYLOAD_LINUX=text in grub file are mentioned there.

Setting the variable GFXPAYLOAD to text makes Linux boot up in normal text mode, typically to avoid display problems in the early boot sequence. As instructed made the addition to the /etc/default/grub file, ran sudo update-grub, rebooted, & the display issue was gone!

Monday, February 26, 2018

Metro Train

In a city like Delhi that's choking under air-pollution, vehicular traffic, etc. public conveyance & the metro seems like the only viable solution to the mess. For many like me, the metro is a rather convenient way to travel back & forth to work daily.

The nearest metro station from home is about half-an-hours drive which I cover via one of the connectors or in an auto. The metro ride after that is well managed & regular. At times, there are slowdowns, delayed arrivals, etc. but on the whole the Delhi metro is doing a fantastic job helping several million people travel across the lengths & breadths of the city everyday. Sharing a few thoughts on what could make the journey more comfortable for travellers in the future.

Peak Hours Rush, 100+% occupancy

All regulars will tell you to avoid the metro at peak/ office hours. Metro can get really full & crowded at these hours. To encourage off-peak hours travel, metro also offers travel discounts for these hours.

Interestingly, during peak hours not all coaches get equally packed. Certain coaches, typically the ones close to the staircases, are much more congested. Now if only the passengers were notified in advance about the occupancy factor across coaches of the upcoming trains, they might be able to move a little bit on the platform & board a less congested one.

This could be achieved via existing sensors on the train that capture weight, footfall, etc. or via video feeds from the on-board cameras (see references below) within the coaches. Just need to relay this feed in real-time to a screen/ dashboard on the platform (& an app) visible to the customer. These feeds needn't be super accurate, and a reasonable estimate (Low, Medium, High, Very High) of the occupancy should do. This data can also reveal other interesting insights on occupancy across days of the week, events, festivals, seasonality, etc.

Fig 1: Occupancy Across Coaches

Another observation is that typically low to medium occupancy trains follow/ trail the high occupancy ones. Perhaps there's a general tendency in people to board the first available train that shows up. On the other hand, if the feed could also show occupancy stats along with arrival timings of next two to three trains that might help the passenger to wait a few minutes & board a less congested one.

Fig 2: Occupancy & Arrival Timings

Surprisingly, the expected arrival timings of next two or three trains, fairly common elsewhere (like Singapore MRT), is not available on the monitors here. This should probably be easy to introduce right away, even if the other one with the occupancy indicator takes time.

Optimizing Number of Coaches

The current logic to ply trains having 6-coaches in place of 8-coaches should also be improved in the future. Perhaps to reduce costs by roughly 75% (6/8), 6-coach trains are run during off-peak hours. Invariably though, back to back 6-coach trains show up during peak hours leading to overcrowding inside the trains & long spiralling queues at the stations.

Working out the right moment to switch between a 6-coach & an 8-coach (or other smaller) variant seems like a solution to a cost (running) minimization problem while maximizing users' comfort. Key factors being peak hours timings, occupancy levels, end-to-end runtime of the train, cost/ kg to ply the trains, time to hook/ unhook additional coaches, available parking space for spare coaches and so on. Very much worth a look at by data science folks.

Beyond the 8-Coaches Barrier

A bigger challenge for the Metro is handling additional/ future load from a growing population. The metro stations are built for the current size of trains having at most 8-coaches. There's no way to increase this length without undertaking massive reconstruction of the platforms at huge costs, time, etc. Increasing frequency of trains is already being explored, & there's very little scope left with trains running just a few minutes apart. Improving coverage certainly helps to an extent. Beyond that the options seem limited.

Probably adding coaches to existing trains could work. These coaches would have to be either attached to the ends of the train. Since they'll be positioned beyond the platform limits, they'll have to be door-less. Entry/ exit would be from adjacent coaches positioned on the platforms having doors. Doable in theory, though the additional movement across the train aisles, etc. will pose newer engineering & security challenges.

Another option could be to set-up each train to have two well synchronized halts per station. First halt to board coaches 1 to 6. After which the train moves forward & makes a second halt to allow boarding for coaches 6 to 12. Another idea could be to increase capacity vertically instead with double decker trains. Various practical aspects need to be figured out for fanciful ideas like these to work. But given the way we are going & growing might need to solve this very soon. Worthwhile therefore thinking of ideas, even if they seem weird today some might get applied in the future!

References:

Counting Number of People In A Video

Saturday, February 10, 2018

Erring On The Side Of Caution

With the "Go Digital" revolution taking over, swipe, click & pay is now the way currency changes hands. As users we've certainly taken to plastic money, net-banking, & mobile payment apps very well. Yet our understanding of their security aspects are vague, if not outright wrong.

We are at a point where not just all transactions are done online, but our interfacing with the banking & financial institutions are likely to be all virtual. It's therefore important to start thinking about how this virtual world functions. Given that there's hardly any awareness programme for the nouveau digital customers, we are left to fend for ourselves for now at least. Here're some of my ideas that, though half baked, might help get your grey cells activated in the right direction.

Convenience Vs. Caution

We are all for convenience these days. With long queues starting to disappear, 24X7 banking turning a reality, cheques heading to obsolescence we are gearing up for the inevitable fully digitized era. Yet, we shouldn't throw caution to the wind. One should be aware that the keys to your hard earned money is now the cell phone & laptops in your hands. Don't allow it to be misused.

Liabilities

But then as they say you can't just be too careful, can you? So it's important to also know what to do when things go wrong. What exactly are the liabilities of the banks? Where do the banks draw the line & what do they label as the customer's fault? Knowing things like how soon do you need to report a fraud, to what if it took place overseas, in some god forsaken currency, etc. becomes important.

Investigation

The next question then is how do banks investigate financial frauds. Who, how, where, when, & what means do they employ. Especially for frauds cutting across regional and international borders.

For the investigating authorities already cracking under the humongous backlogs, how easy is to investigate? Are there stats around how well they've been doing? Not to mention the other aspect around competence, intent, knowledge, effort, etc., all equally problematic. Best bet therefore is to be safe & steer clear of all this hassle.

Customization/ Personalization

Banks have this tendency to deal with all customers alike. At most they'll label you a standard or a premium category customer - more as marker of your net worth than than your tech./ digital competence. Though it's the later kind of categorization that's more relevant.

There's a whole bunch of different people out there. From people who may be digital novices at one end, to pros at the other end. Why not segregate accordingly and personalize the handling? The novices need a lot more hand holding. The systems should be made as such to double check all their transactions. Allow novices to keep all their limits (daily transaction, max value/ transaction, etc.) low. Ensure that they don't make mistakes. The pros on the other hand can be allowed to operate without much/ any checks.

Explain the implications of each digital category to the customer & allow them to label themselves as appropriate. And please let this be at the account level. A pro here might still be a novice there! Allow customers the option to customize their limits & features. At the moment all limits are mostly set to one fixed value for all customers of a particular bank category or card type, etc. which needs to be made flexible for the customer. There maybe people who require high limits on their cards while others who don't, so give customers the option to set & change the limits as per their convenience. At the same, customers with low limits might temporarily require higher which they can set for a specific duration (day, week, etc.) via one of the bank channels such as net-banking, phone banking, ATM, etc.

Another aspect is to strongly differentiate between the mechanism for getting informational/ read-only statements/ data about your accounts vs. the transactionally activated systems. Once email & mobile numbers are registered with the banks, customers should be able to easily request for balance info., statements, notifications, etc., all read only/ non-transactional information about their accounts (reasonably well supported even today).

However, what happens typically is that once activated for the informational service with the bank other transactional services (fund transfer, bill pay, etc.) also get activated by default. That shouldn't be the case. Banks systems must differentiate between the two kinds of services placed by the customer (read only information vs. transactional) & allow customers to select either of the two as per their convenience. At the same time, for the transactional systems allow setting of customizable limits & validation via multi-factor authentication.

Two-factor/ Multi-factor Authentication

Two-factor & multi-factor authentication are commonly heard terms, that work very well in practice. A user's identity is confirmed with 2 or more factors based on something they have (such as an ATM card) & something they know (a Pin). The general idea being that there's a very low probability of two (or more) factors getting compromised at the same time together. You may loose your card or your phone but not both together, at the same time. A chance of one in several million or so, & therefore considered safe.

Any possibility to bypass the multi-factor authentication is a certain recipe for disaster. Double check with your bank if their digital access & interfacing points between you, the vendor & the bank are all multi-factor based.

While the ATM card + Pin is a perfect 2-factor example in the real/ physical world, the picture changes slightly when doing digital transactions online. In this case, the 1st factor is the Card No + Expiry Date + CVV No combination. That's right all 3 combined make up for the 1st factor. Why? Think of what happens if you were to loose the card, the finder has access to all of them. So whether you are asked to enter 3 details or a 100 details printed on that same card, that's still just 1-factor!

The 2nd factor then, is the Pin that you have to enter, similar to the ATM case. However, one major difference between when you are doing transactions online over the internet vs. when using the ATM case, is that inherently your home network is orders of magnitude more unsafe than the bank's network over which information from the ATM gets routed. There's a much higher likelihood of your computer, phone or network being hacked & someone (virus, man-in-middle, etc.) capturing all the card information & your Pin. These can then be used later to do fraudulent transactions or launch a Replay Attack.

Of course, the banks have known/ thought of this, & therefore allowed you an alternative in the form of One Time Password (OTP). An OTP is much better than the Pin, since they are regenerated each time, delivered to your phone (over a separate out-of-band SMS channel), & can be used just once. So even if they were to be replayed, the subsequent transactions would fail!

Perhaps one less heard of/ used device here for the same one time password generation, is the Security Token, also called a dongle sometimes. A small standalone device, that's immune to viruses, hacks, etc. & can do magic for securing your digital transactions. Transactions get fulfilled only once you enter the temporary pin/ password flashing on the specific security token linked to your account. There are a whole bunch of variants out there, & it's about time the security token becomes the mainstay device in our banking & financial sector.

Interestingly the old SMS based OTP mentioned earlier, is a pretty good substitute for the security token. With one caveat, that the OTP should probably not be sent to a smart phone running apps with data connectivity. That's because most apps (good & malicious ones) can very easily detect/ have access to SMS & therefore form a self-fulfilling loop, violating the 2-factor authentication. (For payment apps, valid 2nd factor is just the Pin that you know & should be changed often over a separate channel other than your smart phone, such as ATM, phone-banking, etc.).

About the 1st factor (Card No, Expiry, CVV recycle)

You now know that either one of Pins or OTP's make up the 2nd factor & why OTPs are always better. Essentially they are short lived, & one time use. So wouldn't it help to make the 1st factor, the details printed on the card, short lived as well? Yes, certainly if the cards could be re-issued often. Though it may not be feasible given the printing/ shipping costs & for other reasons.

Banks tend to issue cards with validities that span several years. Could they instead issue temporary one time use card (similar to OTPs) sent virtually (don't need printed cards)? Well perhaps, but then the temporary one time card details can't be delivered via SMS (or netbanking or email), otherwise it would be using the same channel as the OTP & would violate the 2-factor requirements. Other ways that could possibly work is by phone banking, or via two separate phone nos., or with the security token (aha) - better ideas welcome.

Phone Number Recycling

Yet, another thing that seems weird is this phenomenon of allowing phone no's to get recycled. Things may have been somewhat ok in the past, but now it's absolutely wrong to allow the telecom vendor to cancel a Mr. Sharma's phone due to x,y,z reasons & issue it later to Mr. Verma after 180 days or whatever.

As things stand today:
Phone no recycling = Exposing Bank a/c, Personal Id, etc.. & this needs to stop! Phone companies could still block & disable a no., but can't reissue it to anybody, other than maybe immediate family.

Legacy vs. Digital Bank

Just as we discussed that from the bank's perspective there are different sorts of customers out there, tech. savvy to novices, similarly from the customer's perspective as well, it makes sense to hold accounts with different banks. Use only one or two of those online, & use the rest in a legacy/ offline mode to keep things safe. To continue the legacy offline mode, cheques or something similar will need to survive. Though cheques have been in existence for aeons, in their current forms they seem vulnerable in terms of security.

Cheque involve a long winding offline fulfilment loop for the payout. Cheques also involve a kind of good faith delayed payout understanding between the payer & payee. There's a physical instrument (the cheque) issued by the bank in the possession of the payer (=something you have, reasonable safe, though cheque numbers ought to be randomized), a signature uniquely known & reproducible by the payer (=something you know, unsafe & publicly exposed), a transportation of the cheque from the payer to the bank by the payee (rather unsafe as the cheque might move through the hands of several intermediaries), verification of the payer's details & signature by the payer/ payee bank (safe, online), & finally the payout if all's well.

As mentioned earlier cheque numbers are typically issued in sequence making them prone to hacks/ fakes, & should definitely be replaced with randomly generated numbers. Beyond that, there could be a mechanism to uniquely generate, a limited validity (30 days perhaps) one time signature for the cheque after entering the amount & payee details. The signature could be generated on a bank's site using a card (with multi-factor authentication) or some other offline mechanism (such as phone banking) or via the security token & shared with the payee/ written in place of the signature. The generated signature could also be partly human readable (for the benefit of the payee) & look like:

<AMOUNT>-<GENERATED_ALPHA_NUMERIC_KEY>

At the verification leg, the banks simply need to verify the combination of the cheque number, payee name, amount & the one time signature - no differently from what's done today. This should make this legacy instrument somewhat safer for use if it survives in the future.

Artificial Intelligence (AI)

Finally, in the not so distant future, the next generation of digital technology & AI would act as our sentinels. These AI powered machines, devices, algorithms and apps would detect, block, defer, double confirm, transactions on a case by case basis, to find that sweet spot between customer's convenience & safety. Till then, be safe, be happy!

Monday, January 22, 2018

Streaming Solutions

In the streaming solutions space, it all begins with the event driven architecture. This basically includes events (what triggers everything), the handlers (responsible for taking action) & the event loop (for coordinating). When things get more involved & complicated with multiple event streams/ sources, etc. solutions move into the cep space.

Another very popular programming methodology in recent times is Reactive programming. This in some senses is a special case of event driven programming with the focus on data change (as the event) & the reactive step to do other downstream data changes (as the handlers).

A whole bunch of frameworks for streaming solutions have emerged from the Big Data ecosystem such as Storm, Spark Streaming, Flink, etc. These allow for quick development of streaming solutions using high level abstractions. Even Solr has a streaming expression support now for building distributed streaming search solutions.

Outside of these frameworks, Akka Streams seems promising. It's built on top of Akka's robus Actor model & the Reactive Streams api. Solutions such as Gear Pump can provide a sense of the ground up solutions possible with Akka Streams.

Friday, January 5, 2018

Installing Canon LBP2900 Printer on Ubuntu 16.04

Capturing notes on installation of Canon LBP2900 printer on a 64-bit Ubuntu 16.04 (a first for this version) locally (usb). These are essentially a mix and match of instructions from the pages Canon CAPT Driver, installing canon printer on debian systems, & Canon CAPT driver in a sequence that works.

=> Pre-requisites:

- Cups installation:

(Note: These instructions are from the alternate open source foo2capt library library. I have retained the installs for now, did not apt-remove. Not sure if all of them are actually needed.

As such, the foo2capt code failed to build & install with a whole lot of other missing/ invalid dependency issues. The foo2capt project rewrite seems to be on, so for now dropped the idea of experimenting with it any further.)

- Work around to known CAPT 64-bit OS issues linking to 32-bit libraries:

=> Download & install the Linux CAPT printer driver:

Download Linux_CAPT_PrinterDriver_V260_uk_EN.tar.gz, also linked here.

- Untar & install the 64-bit > Debian packages:

=> Add printer to system:

- Start/ Restart cups service:

- Can be done via the command line using lpadmin:

- Alternatively can be done via System Settings > Printer:

Either type on terminal:

Or via System Tools > System Settings & Search for Printer

At this point a new printer gets created in my system with the name "Canon-LBP2900-CAPT-English".

(NOTE: Important to use the port 59787 (and not 59687). Also note that in the /etc/ccpd.conf, port 59787 is mentioned as UI_Port (& used by the captstatusui to communicate), while port 59687 as PDATA_Port files. Once ccpd services have been started, you can telnet to check that these ports are listening. )

=> Add printer to ccpadmin:

- Add "Canon-LBP2900-CAPT-English" to ccpadmin: (will override any existing/ old entry)

- Now there should be a proper entry corresponding to the printer "Canon-LBP2900-CAPT-English":

=> Restart ccpd services:

The ccpd services need to be restarted since they don't start up automatically, there could also be the one ccpd process issue.

=> View status of printer on captstatusui:

- In case you see a communication error, unplug your printer & plug it in again. On my system this works & the printer status changes to:
"Ready to Print"

(Note: Steps for setting up capt rules for usb add/remove could also be tried out.)

=> Print test page:

Next print a test page on "Canon-LBP2900-CAPT-English" & that's it for the set-up.

Thursday, December 7, 2017

One on Blockchain

The write-up on Hyperledger Fabric & other tools in the ecosystem has details on how to get off the ground. While the Hyperledger ecosystem is made up of several early stage technologies with a lot of ground to cover, yet it's got enough to interest technologists & blockchain enthusiasts.

Another blockchain project that looks promising is Corda. The Corda abstractions & concepts seem properly thought out, esp. given it's finance domain focus. Kotlin as the language of choice for Corda is another interesting read.

David's perspectives on Bockchain are insightful. Though, I don't much agree with his 3rd solution/ depiction to the problem (communal open database + trusted historians), it's similar to stuff that exists today. Corda's abstractions mentioned earlier seem better & might mainstream. His other point about building blockchains backed by databases is right. But specifically building one on Microsoft Sql Server seems weird, guess it's not even part of the Blockchain solution at Microsoft. Expecting to hear a lot more about rdbms & NoSql databases backed Blockchains in the near future.

Another much debated issue in the Blockchain community is around Block Size Limits. Different implementations have different size limits which can have an impact on the design of the Blockchain based application. Corda for instance offers an off-chain solution as attachments, while Ethereum potentially has no upper limit, though not recommended for storing large files.

Tuesday, September 5, 2017

Spark Streaming + AWS

Fundamentals

https://aws.amazon.com/blogs/big-data/optimize-spark-streaming-to-efficiently-process-amazon-kinesis-streams/
https://aws.amazon.com/kinesis/streams/faqs/
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

Spark Streaming - NRT:

Very low values for batchInterval (~10ms)
blockInterval = batchInterval
TBD!

Tuesday, August 1, 2017

On Storm

Storm Fundamentals

http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_storm-component-guide/content/ch_storm-topology-tuning.html
http://grokbase.com/t/gg/storm-user/12c56ep4dk/calculating-the-capacity-of-a-bolt
http://storm.apache.org/releases/2.0.0-SNAPSHOT/Logs.html
https://stackoverflow.com/questions/35864128/how-to-set-storm-workers-jvm-max-heap-size
https://stackoverflow.com/questions/20914631/configuration-of-workers-in-a-storm-cluster
http://storm.apache.org/releases/1.1.0/Concepts.html

Metrics, Debugging, Monitoring, Logging

https://community.hortonworks.com/articles/36151/debugging-an-apache-storm-topology.html
http://storm.apache.org/releases/1.0.3/Metrics.html
https://www.opsclarity.com/monitoring-troubleshooting-apache-storm-opsclarity/
http://storm.apache.org/releases/2.0.0-SNAPSHOT/Logs.html
https://etl.svbtle.com/visualizing-metrics-in-storm-using-statsdgraphite
http://www.brianhsieh.com/2014/06/nagios-for-monitor-kafka.html
https://etl.svbtle.com/visualizing-metrics-in-storm-using-statsdgraphite
https://dzone.com/articles/monitoring-and-troubleshooting-apache-storm-with-o
https://community.hortonworks.com/articles/36151/debugging-an-apache-storm-topology.html (Logs)

Backpressure, Buffer, etc

http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
https://stackoverflow.com/questions/44557915/backpressure-in-storm
http://jobs.one2team.com/apache-storms/
https://issues.apache.org/jira/browse/STORM-1949 (Issues with Backpressure implementation)
http://storm.apache.org/releases/0.10.0/Configuration.html (MaxSpoutPending + acking seems like the only option for now)

Thursday, April 7, 2016

Singapore

Discovery all the way through..

NUS, SoC, Changi, PGP, UTown, Overseas, Friends, Profs, Lab, Conversation, Biopolis, Garden Walks, City lanes, Metro, Bus, Campus, Merlion, Music by the Bay, Clarke Quay...

Research, PhD, Courses, Conferences, Seminars, Publications, Prez, RA/ TA, Summers, Docs++, DS, Stats, AI, ML, Foundations, Deep Learning, Vision, How small can we see?, AlexNet, NLP, Reasoning, Common Sense, Novelty, Minsky et al, ConceptNet, WordNet, Patholody, Medicine, NUH, Brain, CT, fMRI, Primates, Formal Methods, Papers n Papers n Papers...

A time in awe!

Monday, August 10, 2015

Fast Streaming Solution

High level view:

Web API -> Kafka -> Storm (Streaming)

-> Hadoop/ HDFS -> MR/ Hive (Batch)

Specifics TBD..

Thursday, June 18, 2015

Lftp

Lftp a handy utility for all kinds of ftp, sftp, and other file transfer use-cases from the command-line on a *nix system. Give it a shot if there's ever a need..

Monday, May 4, 2015

Atlasian Stack

- Jira
- Confluence
Among many others.. All well integrated enhance dev productivity...

Friday, March 20, 2015

Teradata

Teradata busy getting a chunk of the BigData pie. Teradata Parrallel Transporter (TPT) and Adv. SQL Combo makes querying Big Data sources fast and efficient using state of the art caching and other optimizations.

Saturday, June 21, 2014

Getting Table Information From Hive Metastore

The Hive Metastore holds various meta information about Hive tables such as schema names, table names, partitions, fields, permissions, and so on. The metastore is built in a schema within a relational database, outside of Hive, and accessed via services by Hive clients. An embedded Derby database is configured as the default database for experimentation purposes, and must be changed over to something like MySql, Postgre, etc. for production use.

The Hive Metastore ER diagram is fairly straightforward. Once familiar with the schema, it is easy to query the metastore for information about the Hive tables. Here's a sample query to identify all Partitioned tables from a given Hive databases:

Friday, June 13, 2014

On Bitcoins

For now, the bitcoin gini index is as divergent from the ideal as can be. More on what makes this tick to follow soon..

Sunday, May 18, 2014

Hive Abstract Semantic Analyzer Hook

Hive allows Pre & Post Analyzer hooks to be added to the normal hive plan query generation flow via the AbstractSemanticAnalyzerHook class.

A custom hook needs to extend AbstractSemanticAnalyzerHook & override the preAnalyze or postAnalyze method as necessary.

Simple Sematic Analyzer Hook:
A sematic analyzer hook that logs a message each in the preAnalyze & postAnalyze methods, is shown below:

package org.apache.hadoop.hive.ql.parse;

public class SimpleSemanticPreAnalyzerHook extends AbstractSemanticAnalyzerHook{
static final private Log LOG = LogFactory.getLog(SimpleSemanticPreAnalyzerHook.class.getName());
static final private LogHelper console = new LogHelper(LOG);

@Override
public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context,
ASTNode ast) throws SemanticException {
console.printInfo("!! SimpleSemanticPreAnalyzerHook preAnalyze called !!");
return super.preAnalyze(context, ast);
}

@Override
public void postAnalyze(HiveSemanticAnalyzerHookContext context,
List<Task<? extends Serializable>> rootTasks)
throws SemanticException {
console.printInfo("!! SimpleSemanticPreAnalyzerHook postAnalyze called !!");
super.postAnalyze(context, rootTasks);
}
}

Configurations for Simple Sematic Analyzer Hook:

Monday, April 21, 2014

Urlencode and Urldecode in Hive using the Reflect UDF

Hive doesn't offer an in-built UDF to perform Urlencode or Urldecode. One option could be to write a custom UDF to fill for the void.

On the other hand, a rather straight forward alternative to have the same feature, as shown on the forum, us using the very generic Reflect UDF.

Thursday, April 10, 2014

Hive Query Plan Generation

Hive query is passed through several built-in modules for the final plan to be generated.

The stages/ modules are:

Query
=> (1) Parser
=> (2) Semantic Analyzer
=> (3) Logical Plan Generation
=> (4) Optimizer
=> (5) Physical Plan Generation
=> Executor to run on Hadoop

Monday, March 24, 2014

Hive History File

Hive maintains a history of all commands executed via the hive cli. These commands are written to a file called .hivehistory, on the user's home folder.

Sunday, March 16, 2014

Hive Optimizations

Explain Plan

Mappers and Reducers Count

Map Joins

Sorting

Optimization step:
Between the logical & physical plan generation phase of hive, hive optimizations gets executed. The current set of optimizations include:

Column pruning
Partition pruning
Sample pruning
Predicate push down
Map join processor
Union processor
Join reorder
Union processor

More on each of these optimizations to follow..

Sunday, February 2, 2014

Build Hadoop from Source Code with Native Libraries and Snappy Compression

When running Hadoop using a pre-built Hadoop binary distribution (a downloaded hadoop-<Latest_Version>.tar.gz bundle), Hadoop may not be able to load certain native libraries. The following warning is also displayed at the time of starting up Hadoop:

"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable "

This issue comes up due to the difference in architecture of the particular machine on which Hadoop is being run now vs. that of the machine on which it was orginally compiled. While most of Hadoop (written in Java) loads up fine, there are native libraries (compression, etc.) which do not get loaded (more details to follow).

The fix is to compile Hadoop locally & use it in place of the pre-built Hadoop binary (tar.gz). At a high level this requires:

Installations:

Local dev box (Ubuntu 13, etc.)
Build tools set-up:

gcc g++ make maven cmake zlib zlib1g-dev libcurl4-openssl-dev

Native libraries installed: (Snappy, etc)

libsnappy1, libsnappy-dev

Protobuf source cod: (download here)
Hadoop source code: (download here)
Hadoop patch for pom.xml issue

Build:

mvn package -Pdist,native -DskipTests -Dtar

Export Environment: Finally, export Hadoop environment variables

export HADOOP_HOME=/path/to/hadoop/folder
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

Latest binary: Available at <HADOOP_SOURCE>/hadoop-dist/target/hadoop-<Latest_Version>.tar.gz

Wednesday, January 15, 2014

Mocks for Unit testing Shell Scripts

1. Look at shunit for a sense of what kind of unit testing can be performed for Shell scripts.

2. For mocking up specific steps/ programs in the script, make use of alias.

Within a shell script testScript.sh this would be something as follows:

Tuesday, December 24, 2013

Mechanical Sympathy

A term that's gaining traction thanks to the LMAX architecture. Low latency applications running on the JVM need to be hardware gnostic to a large degree to be able to best leverage the computing power multi-core/ multi-processor architecture.

More details to follow soon on the topic out here, for the moment you could refer to Martin Fowler's post.

Tuesday, December 3, 2013

Real-time Face Reading

The machines getting better and better at face reading. Ancient mystics have another reason to worry. Won't be long before recommendation engines of various kinds get built that leverage this sort of technology.

More about algorithms in this space to follow..

Thursday, November 28, 2013

Precision and Recall

Terms popular within search and Information Retrieval (IR) domains.

Precision: Is all about accuracy. Whether all results that have shown up are relevant.

Recall: Has to do with completeness. Whether all valid/ relevant results have shown up.

Needs detailing..

Sunday, November 24, 2013

Pentaho 5.0 Community Edition Released

The stable build of Pentaho Community Edition (CE) 5.0 has been released. Many many new features have made it to this build. Particularly keen to try out the enhancements to web services deployment via Carte. More details can be found in the Pentaho 5.0 release notes.

Saturday, November 16, 2013

Pentaho Clusters

Pentaho provides the option to scale out Kettle Transformations via Pentaho Clusters. It is fairly straightforward to set up a Pentaho cluster and elastic/dynamic clusters. The 1-2-3 of what needs to be done is:

1. Start the Carte Instances
There are two kinds of instances - Masters & Slaves. At least one instance must act as the dedicated Master which takes on the responsibility of management/ distribution of transformations/ steps to slaves, fail-over/ restart and communicating with the slaves.

The Carte instances need a config file with details about the Master's port, IP/ Hostname etc. For sample config files take a look at the pwd folder in your default Pentaho installation (/data-integration/pwd).

E.g. With defaults, a cluster can be started on localhost with:

2. Set up Cluster & Server Information using Spoon (GUI)
Switch to the View tab, next to the Design tab in the left hand panel of the Spoon GUI.
Click on 'Slave Servers' to add new Slave servers (host, port, name, etc.). Make sure to check the 'is_the_master' checkbox for the Master server.

Next click on the 'Kettle Cluster Schemas' and use 'Select Slave servers' to choose the slave servers. For the ability to dynamically add/ remove slave servers, also select the 'Dynamic Cluster' checkbox.

3. Mark Transformation Steps to Execute in Cluster Mode
Right click on the step which needs to be run in the cluster mode, select Clustering & then select the cluster schema. You will now see a symbol next to the step (CxN) indicating that the step is to be executed in a clustered mode.

The cluster settings will be similar to what you see in the left panel in the image. You can also see a transformation, with two steps (Random & Replace in String) being run in a clustered mode in the right panel in the image below.

Monday, November 11, 2013

Shanon Entropy and Information Gain

Shanon's Information Gain/ Entropy theory gets applied a lot in areas such as data encoding, compression and networking. Entropy, as defined by Shanon, is a measure of the unpredictability of a given message. The higher the entropy the more unpredictable the content of the message is to a receiver.

Correspondingly, a high Entropy message is also high on Information Content. On receiving a high Entropy/ high Information Content laden message, the receiver has a high Information Gain.

On the other hand, when the receiver already knows the contents (or of a certain bias) of the message, the Information Content of the message is low. On receiving such a message the receiver has less Information Gain. Effectively once the uncertainty about the content of the message has reduced, the Entropy of the message has also dropped and the Information Gain from receiving such a message has gone down. The reasoning this far is quite intuitive.

The formula for Entropy (H) calculation:
H(X) = -Summation[ p(x) * log( p(x) )] over all possible values/ outcomes of x, i.e. {x1, ..., xn}
where p(x) = probability of each of the values/ outcomes of x {x1, ..., xn}.
The log is in a certain base b.

(All calculations in base 2)
Eg 1.a: In the case of a single fair coin toss:
x = {H, T}
& p(x) = {1/2, 1/2}

H(X) = -[1/2 * log(1/2) + 1/2 * log(1/2) ] = 1

Eg 1.b: For a biased coin toss, with three times higher likelihood of a Head:
x = {H, T}
& p(x) = {3/4, 1/4}

H(X) = -[3/4 * log(3/4) + 1/4 * log (1/4) ] = 0.811 (Entropy is lower than 1.a, Information Gain is lower).

Eg 1.c: For a completely biased coin toss, which two Heads:
x = {H, T}
& p(x) = {1, 0}

H(X) = -1[ 1*log(1) + 0*log(0) ] = 0 (Entropy is zero)

The Entropy (& unpredictability) is the highest for a fair coin (example 1.a) and decreases for a biased coin (examples 1.b & 1.c). Due to the bias the receiver is able to predict the outcome (favouring the known bias) in the later case resulting in a lower Entropy.

The observation from the (2-outcomes) coin toss case generalizes to the N-outcomes case, and the Entropy is found to be highest when all N-outcomes are equally likely (fair).

Saturday, October 26, 2013

Be Hands On

For as long as you can. Think specialists (surgeons, pilots, etc.) who get better clocking in more hours with/into their art - doing, practicing, persevering.

Sunday, October 20, 2013

General Availability (GA) for Hadoop 2.x

The Hadoop 2.x GA is nothing less that a big leap forward. Most of the features released such as YARN - a pluggable resource management framework, Name Node HA, HDFS Federation and so on were long awaited. As per the official mail to the community, this release includes:

"To recap, this release has a number of significant highlights compared to Hadoop 1.x:
• YARN - A general purpose resource management system for Hadoop to allow MapReduce and other other data processing frameworks and services
• High Availability for HDFS
• HDFS Federation
• HDFS Snapshots
• NFSv3 access to data in HDFS
• Support for running Hadoop on Microsoft Windows
• Binary Compatibility for MapReduce applications built on hadoop-1.x
• Substantial amount of integration testing with rest of projects in the ecosystem

Please see the Hadoop 2.2.0 Release Notes for details."

Also as per the official email to the community, users are encouraged to move forward to the 2.x branch which is more stable & backward compatible.

Tuesday, October 1, 2013

Need Support to Lift with Confidence

Brace up terminologies coming your way...

Support: A measure of the prevalence of an event x in a given set of N data points. Support is effectively a first level indicator of something occurring frequent enough (say greater than 10% of the times) to be of interest.

In the case of two correlated events x & y,

Confidence: A measure of predictability of two events occurring together. Once confidence is above a certain threshold (say 70%), it means the two events show up together often enough to be used for rules/ decision making, etc.

Lift: A measure of the power of association between two events. For an event y that has occurred, how much more likely is event y to occur once it is known that event x has occurred

Sunday, September 22, 2013

False Negative, False Positive and the Paradox

First a bit about the terms False Positive & False Negative. There terms are associated with the nature of error in the results churned out by a system trying to answer an unknown problem, based on a (limited) set of given/ input data points. After analysing the data, the system is expected to come up with a Yes (it is Positive) or a No (it is Negative) type answer. There is invariably some error in the answer due to noisy data, wrong assumptions, calculation mistakes, unanticipated cases, mechanical errors, surges, etc.

A False Positive is when the system says the answer is Positive, but the answer is actually wrong. An example would be a sensitive car's burglar alarm system that starts to beep due to heavy lightning & thunder on a rainy day. The alarm at this stage is indicating a positive hit (i.e. a burglary), which is not really happening.

On the other hand, a False Negative is when the system answers in a Negative, where the answer should have been a Positive. False negatives happen often with first level medical tests and scans which are unable to detect the cause of pain or discomfort. The test report of "Nothing Abnormal Detected" at this stage is often a False Negative, as revealed by more detailed tests performed later.

The False Positive Paradox is an interesting phenomenon where the likelihood of a False Positive shoots up significantly (& sometimes beyond the actual positive) when the actual rate of occurrence of a condition within a given sample group is very low. The results are thanks to basic likelihood calculations as shown below.

Let's say in a group of size 1,000,000 (1 Mn.), 10% are doctors. Let's say there's a system wherein you feed in a person's Unique ID (UID) and it tells you if the person is a doctor or not. The system has a 0.01% chance of incorrectly reporting a person who is not a doctor to be a doctor (a False Positive).

Now, let's work out our confidence levels of the results given out by the system.

On the other hand if just 0.01% of people in the group are actually doctors (while the rest of the info. remains same) the confidence level works out to be quite different.

This clearly shows that the likelihood of the answer being a False Positive has shot up from much under 1% to as much as 50%, when the occurrence of a condition (number of doctors) within a given population dropped from 10% (i.e. 100,000) to a low value of 0.1% (i.e. 1,000).

Thursday, September 12, 2013

Transparently

While doing software development you might hear of change being introduced "transparently". What does this mean?

Transparency in this context is similar to how a looking glass is transparent. One can barely make out that it exists. Think of a biker who pulls down the glass visor of his helmet when troubled by wind blowing into his eyes. His sight of the road & beyond continue to function without his noticing the transparent visor layer in-between.

Similarly, when a change in introduced transparently on the server side, it means the dependent/ client side applications needn't be told/ made aware of this change on the server side. The old interfaces continue to work as is, communication protocols remain the same, and so on.

The above kind of transparency is different from the transparency of a "transparent person" or a "transparent deal" or a "white box system", where the internals (like thoughts, implementation, ideas, details, etc.) are visible.