
Thursday, November 28, 2013

Precision and Recall

Terms popular within search and Information Retrieval (IR) domains.

Precision: Is all about accuracy. Of all the results that have shown up, how many are actually relevant?

Recall: Has to do with completeness. Of all the valid/ relevant results out there, how many have shown up?

Needs detailing..
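
Meanwhile, a quick sketch of the two in terms of the standard confusion-matrix counts, where tp, fp and fn stand for true positives, false positives and false negatives (the counts below are made up purely for illustration):

    def precision(tp, fp):
        # Of everything the system returned, how much was actually relevant?
        return tp / (tp + fp)

    def recall(tp, fn):
        # Of everything that was actually relevant, how much did the system return?
        return tp / (tp + fn)

    # Say a search returns 8 relevant and 2 irrelevant results,
    # while 4 relevant documents are missed entirely.
    print(precision(tp=8, fp=2))   # 0.8
    print(recall(tp=8, fn=4))      # ~0.67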

Tuesday, October 1, 2013

Need Support to Lift with Confidence

Brace up, terminologies coming your way...

Support: A measure of the prevalence of an event x in a given set of N data points. Support is effectively a first-level indicator of something occurring frequently enough (say more than 10% of the time) to be of interest.

In the case of two correlated events x & y,

Confidence: A measure of how predictably the two events occur together, i.e. how often y shows up when x has shown up. Once confidence is above a certain threshold (say 70%), it means the two events show up together often enough to be used for rules/ decision making, etc.

Lift: A measure of the strength of association between two events. For events x & y, how much more likely is y to occur once it is known that x has occurred, compared to y's likelihood on its own.
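
A minimal sketch of the three measures for two events x & y, assuming all we have are counts over N data points (the function names and numbers are made up for illustration):

    def support(count_x, n):
        # Fraction of the N data points (e.g. transactions) in which event x occurs.
        return count_x / n

    def confidence(count_x_and_y, count_x):
        # Of the data points containing x, the fraction that also contain y, i.e. P(y | x).
        return count_x_and_y / count_x

    def lift(count_x_and_y, count_x, count_y, n):
        # How much more likely y becomes once x is known to have occurred,
        # relative to y's baseline rate: P(y | x) / P(y).
        return (count_x_and_y / count_x) / (count_y / n)

    # Made-up numbers: out of 1,000 transactions, 200 contain x,
    # 150 contain y, and 120 contain both.
    print(support(200, 1000))           # 0.2
    print(confidence(120, 200))         # 0.6
    print(lift(120, 200, 150, 1000))    # 4.0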

Sunday, September 22, 2013

False Negative, False Positive and the Paradox


First a bit about the terms False Positive & False Negative. These terms describe the nature of error in the results churned out by a system trying to answer a problem based on a (limited) set of given/ input data points. After analysing the data, the system is expected to come up with a Yes (it is Positive) or a No (it is Negative) type answer. There is invariably some error in the answer due to noisy data, wrong assumptions, calculation mistakes, unanticipated cases, mechanical errors, surges, etc.

A False Positive is when the system says the answer is Positive, but the answer is actually wrong. An example would be an over-sensitive car burglar alarm that starts to beep due to heavy lightning & thunder on a rainy day. The alarm at this stage is indicating a positive hit (i.e. a burglary), which is not really happening.

On the other hand, a False Negative is when the system answers in the Negative, where the answer should have been a Positive. False Negatives happen often with first-level medical tests and scans that are unable to detect the cause of pain or discomfort. The test report of "Nothing Abnormal Detected" at this stage is often a False Negative, as revealed by more detailed tests performed later.

The False Positive Paradox is an interesting phenomenon where the likelihood of a False Positive shoots up significantly (sometimes even exceeding that of a true positive) when the actual rate of occurrence of a condition within a given sample group is very low. This follows from basic likelihood calculations, as shown below.

Let's say in a group of size 1,000,000 (1 Mn.), 10% are doctors. Let's say there's a system wherein you feed in a person's Unique ID (UID) and it tells you if the person is a doctor or not. The system has a 0.01% chance of incorrectly reporting a person who is not a doctor to be a doctor (a False Positive).

Now, let's work out our confidence levels of the results given out by the system.


On the other hand if just 0.01% of people in the group are actually doctors (while the rest of the info. remains same) the confidence level works out to be quite different.
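
A quick sketch of the arithmetic for both cases, assuming the system always recognises an actual doctor correctly (i.e. perfect sensitivity, an assumption the description above doesn't spell out):

    def false_positive_share(population, doctor_fraction, fp_rate):
        # Assumes perfect sensitivity: every actual doctor is reported as a doctor.
        doctors = population * doctor_fraction
        non_doctors = population - doctors
        true_positives = doctors
        false_positives = non_doctors * fp_rate
        # Of all "is a doctor" answers, the fraction that are wrong.
        return false_positives / (true_positives + false_positives)

    # 10% doctors: well under 1% of positive answers are false positives.
    print(false_positive_share(1_000_000, 0.10, 0.0001))     # ~0.0009
    # 0.01% doctors: about half of all positive answers are false positives.
    print(false_positive_share(1_000_000, 0.0001, 0.0001))   # ~0.5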


This clearly shows that the likelihood of the answer being a False Positive has shot up from well under 1% to as much as 50%, when the occurrence of the condition (number of doctors) within the population dropped from 10% (i.e. 100,000) to a low 0.01% (i.e. 100).

Thursday, September 12, 2013

Transparently


While doing software development you might hear of change being introduced "transparently". What does this mean?

Transparency in this context is similar to how a pane of glass is transparent: one can barely make out that it exists. Think of a biker who pulls down the glass visor of his helmet when troubled by wind blowing into his eyes. His view of the road & beyond continues unhindered, without his noticing the transparent visor layer in-between.

Similarly, when a change is introduced transparently on the server side, it means the dependent/ client side applications needn't be told/ made aware of this change on the server side. The old interfaces continue to work as is, communication protocols remain the same, and so on.

The above kind of transparency is different from the transparency of a "transparent person" or a "transparent deal" or a "white box system", where the internals (like thoughts, implementation, ideas, details, etc.) are visible.

Thursday, October 25, 2012

Amdahl's Law for Max Utilization and Speedup With Parallelization


There's a lot of talk these days about constraints introduced by the CAP theorem. One other equally relevant law for Parallel and Distributed systems is Amdahl's law. Amdahl's law talks about the amount of speedup that can be achieved when a given single-processor (or single-threaded) task is split up and handed over to N processors (or threads) to be executed in parallel.

To take an example let's work with the typical entrance examination problem: "if one person takes 2 hours to eat up a cake, how long would four people take to eat the same cake?".

Simple: each person eats up one-fourth of the cake. So time taken = time-for-1-person/ N = 2/ 4 = 0.5 hours. Right? Ya, well, unless it's the very same cake that we were referring to, the one that got eaten ;)

What Amdahl's law says is that if the single-processor task has sub-tasks or steps (unlike the cake eating example above), not every sub-task/ step can be parallelized. There is some fraction of the work (F) that needs to be run sequentially. As a result the speedup is not N times but less, computed as follows:

     Speedup(N) = 1 / (F + (1 - F)/ N)

- To go back to our cake eating example, there are no sequential steps, so F = 0:

     Speedup(4) = 1 / (0 + 1/ 4) = 4

i.e. 4 times speedup. Indeed 4 people took 1/4th the time, i.e. 30 mins, as compared to 2 hours by one person.

- Let's now add some sequential steps before the cake eating. First you've got to pay for it at the cashier & then take delivery from the delivery counter. These have to be done in sequence & only by one person (why pay twice?). Thanks to the monopoly & popularity of the cake vendor, there's invariably a long queue at the cashier. It takes 15 mins to pay and another 15 mins to get delivery of the cake.

One person would now take 15 + 15 + 120 = 150 mins in all, of which the 30 mins of paying & collecting (20% of the total) has to be done sequentially. With F = 0.2 and N = 4:

     Speedup(4) = 1 / (0.2 + 0.8/ 4) = 1 / 0.4 = 2.5

So you see, due to the 20% of sequential tasks the speedup has dropped from 4 times to 2.5 times.
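
The same arithmetic as a tiny Python sketch (the function name is mine, not a standard one), which also shows the diminishing returns of adding more people/ processors:

    def amdahl_speedup(f_sequential, n):
        # f_sequential: fraction of the work that must run sequentially (0.0 to 1.0)
        # n: number of processors/ threads working in parallel
        return 1.0 / (f_sequential + (1.0 - f_sequential) / n)

    # The cake example: no sequential steps vs. 20% sequential steps, 4 people.
    print(amdahl_speedup(0.0, 4))     # 4.0
    print(amdahl_speedup(0.2, 4))     # 2.5
    # Even with 1,000 people, 20% sequential work caps the speedup below 5x.
    print(amdahl_speedup(0.2, 1000))  # ~4.98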

Wednesday, October 10, 2012

Brewer's CAP Theorem


Brewer's CAP theorem talks about Consistency (C), Availability (A) and Partition tolerance (P) as the constraints that primarily govern the design of all distributed systems. There's a lot of literature available online explaining the theorem. The summary: given that network partitions (P) will happen, pick one of the other two, Consistency (C) or Availability (A), for designing your system on a case by case basis (since you can't have all three)!

A partition could be caused by the failure of some kind of component - hardware (routers, gateway, cables, physical boxes/ nodes, disks, etc.) and/or software. When that happens:

- If you pick Consistency (C) => All your systems, processing, etc. are blocked/ held up until the failed component(s) recover.

This has been the default with traditional RDBMS (thanks to their being ACID compliant). For financial & banking applications this normally has to be the choice.

- On the other hand, if you pick Availability (A) => All systems, other than the currently partitioned/ failed systems, continue to function as is within their own partitions. Seems good? Well, not quite, because this obviously results in inconsistencies across the two (or more) partitioned sections.

Systems thus designed with Availability (A) as their choice (over C) must be able to live with inconsistencies across different partitions. Such systems also need some automated way to later get back to a consistent state (eventual consistency) once the partitioned/ failed systems have recovered.

This is mostly the design choice with the NoSQL databases, and also with services such as Amazon AWS, where eventual consistency within some reasonable time window (a few seconds to a few minutes) is acceptable.

Sunday, June 10, 2012

Triangular Matrix

There are two special kinds of square matrices (n x n matrices) known as the Upper Triangular Matrix & the Lower Triangular Matrix.

Upper Triangular Matrix - Elements below the main diagonal are all zero.
     i.e. M[i,j] = 0, for i > j

Lower Triangular Matrix - Elements above the main diagonal are all zero.
     i.e. M[i,j] = 0, for i < j

where,
Main diagonal - The diagonal running through M[1,1] to M[n,n] for an n x n square matrix.
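
A quick sketch of the two checks using NumPy (note that the definitions above use 1-based indices while NumPy is 0-based; the matrix below is just an example):

    import numpy as np

    def is_upper_triangular(m):
        # All elements below the main diagonal are zero: M[i, j] == 0 for i > j.
        return np.allclose(np.tril(m, k=-1), 0)

    def is_lower_triangular(m):
        # All elements above the main diagonal are zero: M[i, j] == 0 for i < j.
        return np.allclose(np.triu(m, k=1), 0)

    m = np.array([[1, 2, 3],
                  [0, 4, 5],
                  [0, 0, 6]])
    print(is_upper_triangular(m))  # True
    print(is_lower_triangular(m))  # False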

Monday, April 2, 2012

NP, NP-Hard & NP-Complete


NP: A problem that can be solved in non-deterministic polynomial time. (Mind you, it is still polynomial time.) The other way to define it is: given a solution (certificate), one can verify the correctness of the solution using a deterministic Turing Machine (TM) in polynomial time.

NP-Hard: A problem at least as hard as the hardest problems in NP, i.e. every problem in NP can be reduced to it in polynomial time. An NP-Hard problem need not itself be in NP.

NP-Complete: ( NP ) ∩ ( NP-Hard ).

NP & NP-Hard are different classes that only partially overlap; the problems in the overlap are the NP-Complete ones. Many NP-Hard problems, as a result, are not NP-Complete.
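
To make the "verify a certificate in polynomial time" part concrete, here's a minimal sketch for Boolean satisfiability (SAT), the classic NP-Complete problem: finding a satisfying assignment is hard, but checking a proposed one is quick. The formula and assignment below are made-up illustrations.

    # A made-up CNF formula: each clause is a tuple of literals, where a positive
    # number k means variable k and a negative number -k means NOT k.
    formula = [(1, -2, 3), (-1, 2, 4), (2, -3, -4)]

    def verify_sat_certificate(formula, assignment):
        # assignment maps each variable to True/False; checking it against the
        # formula takes time linear in the formula size, i.e. polynomial time.
        def literal_true(lit):
            value = assignment[abs(lit)]
            return value if lit > 0 else not value
        return all(any(literal_true(lit) for lit in clause) for clause in formula)

    # A proposed certificate for the formula above.
    print(verify_sat_certificate(formula, {1: True, 2: True, 3: True, 4: True}))  # True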

Friday, January 20, 2012

SEDA - Staged Event Driven Architecture

Welsh, Culler & Brewer's paper at SOSP '01 introduces the concepts around SEDA lucidly. SEDA, in my experience, has been a good architectural choice for building scalable back-end systems. SEDA-based systems not only scale well, but also have trackability, failure recovery, load balancing, etc. introduced very easily/ naturally along the stage boundaries.
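
A minimal sketch of the stage-and-queue idea (the stage names and handlers are purely illustrative, not from the paper): each stage pulls events off its own queue, processes them, and pushes results onto the next stage's queue, so each stage can be monitored, throttled and scaled independently.

    import queue
    import threading

    def run_stage(in_q, out_q, handler):
        # Pull events off this stage's queue, process them, push results downstream.
        while True:
            event = in_q.get()
            if event is None:          # shutdown signal
                if out_q is not None:
                    out_q.put(None)
                break
            result = handler(event)
            if out_q is not None:
                out_q.put(result)

    # Three illustrative stages wired together only through queues.
    parse_q, enrich_q, write_q = queue.Queue(), queue.Queue(), queue.Queue()
    threading.Thread(target=run_stage, args=(parse_q, enrich_q, str.strip)).start()
    threading.Thread(target=run_stage, args=(enrich_q, write_q, str.upper)).start()
    threading.Thread(target=run_stage, args=(write_q, None, print)).start()

    for line in ["  hello ", " seda  "]:
        parse_q.put(line)
    parse_q.put(None)  # propagate shutdown through the stages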