Algorithms, Design, Code and more: core computer science

Showing posts with label core computer science. Show all posts

Monday, November 11, 2013

Shanon Entropy and Information Gain

Shanon's Information Gain/ Entropy theory gets applied a lot in areas such as data encoding, compression and networking. Entropy, as defined by Shanon, is a measure of the unpredictability of a given message. The higher the entropy the more unpredictable the content of the message is to a receiver.

Correspondingly, a high Entropy message is also high on Information Content. On receiving a high Entropy/ high Information Content laden message, the receiver has a high Information Gain.

On the other hand, when the receiver already knows the contents (or of a certain bias) of the message, the Information Content of the message is low. On receiving such a message the receiver has less Information Gain. Effectively once the uncertainty about the content of the message has reduced, the Entropy of the message has also dropped and the Information Gain from receiving such a message has gone down. The reasoning this far is quite intuitive.

The formula for Entropy (H) calculation:
H(X) = -Summation[ p(x) * log( p(x) )] over all possible values/ outcomes of x, i.e. {x1, ..., xn}
where p(x) = probability of each of the values/ outcomes of x {x1, ..., xn}.
The log is in a certain base b.

(All calculations in base 2)
Eg 1.a: In the case of a single fair coin toss:
x = {H, T}
& p(x) = {1/2, 1/2}

H(X) = -[1/2 * log(1/2) + 1/2 * log(1/2) ] = 1

Eg 1.b: For a biased coin toss, with three times higher likelihood of a Head:
x = {H, T}
& p(x) = {3/4, 1/4}

H(X) = -[3/4 * log(3/4) + 1/4 * log (1/4) ] = 0.811 (Entropy is lower than 1.a, Information Gain is lower).

Eg 1.c: For a completely biased coin toss, which two Heads:
x = {H, T}
& p(x) = {1, 0}

H(X) = -1[ 1*log(1) + 0*log(0) ] = 0 (Entropy is zero)

The Entropy (& unpredictability) is the highest for a fair coin (example 1.a) and decreases for a biased coin (examples 1.b & 1.c). Due to the bias the receiver is able to predict the outcome (favouring the known bias) in the later case resulting in a lower Entropy.

The observation from the (2-outcomes) coin toss case generalizes to the N-outcomes case, and the Entropy is found to be highest when all N-outcomes are equally likely (fair).

Wednesday, October 10, 2012

Brewer's CAP Theorem

Brewer's CAP theorem talks about Consistency (C), Availability (A), Partition (P) tolerance, as the constraints that primarily govern the design of all distributed systems. There's a lot of literature available online explaining the theorem. The summary is that given that network partitions (P) will happen, pick one of the other two - Consistency (C) or Availability (A) for designing your system on a case by case basis (since you can't have all three)!

A partition could be caused by the failure of some kind of component - hardware (routers, gateway, cables, physical boxes/ nodes, disks, etc.) and/or software. When that happens:

- If you pick Consistency (C) => All your systems, processing, etc. is blocked/ held up until the failed component(s) recovers.

This has been the default with traditional RDBMS (thanks to their being ACID compliant). For financial & banking applications this normally has to be the choice.

- On the other hand, if you pick Availability (A) => All systems, other than the currently partitioned/ failed systems, continue to function as is within their own partitions. Seems good? Well not quite, cause this obviously results in inconsistencies across the two (or more) partitioned sections.

Systems thus designed with Availability (A) as their selection (over C), must be able to live with inconsistencies across different partitions. Such systems also have some automated way to later get back to consistent state (eventual consistency) once the partitioned/ failed systems have recovered.

This is mostly the design choice with the NoSqls. Also with services such as Amazon AWS where eventual consistency within some reasonable time window (of a few seconds to a minutes) is acceptable.

Friday, July 6, 2012

Queuing System and Little's Law

On its way..

Sunday, June 10, 2012

Triangular Matrix

There are two special kinds of square matrices (n X n matrix) known as the Upper Triangular Matrix & the Lower Triangular Matrix.

Upper Triangular Matrix - Elements below the main diagonal are all zero.
i.e. M[i,j] = 0, for i > j

Lower Triangular Matrix - Elements above the main diagonal are all zero.
i.e. M[i,j] = 0, for i < j

where,

Main diagonal - The diagonal running through M[1,1] to M[n,n] for a nXn square matrix.

Monday, April 2, 2012

NP, NP-Hard & NP-Complete

NP: One that has a non-deterministic, polynomial time solution. (Mind you, it is still polynomial time). The other way to define it is, given a solution (certificate), one can verify correctness of the solution using a deterministic Turing Machine (TM) in polynomial time.

NP-Hard: A hard problem - at least as hard as the the hardest problem known (& unsolvable) so far.

NP-Complete: ( NP ) <intersection> ( NP-Hard ).

NP & NP-Hard are different, and overlap only in certain cases when they are NP-Complete. Many NP-Hard as a result, are not NP-Complete.