Algorithms, Design, Code and more: maths

Showing posts with label maths. Show all posts

Thursday, April 17, 2025

On Quantization

Speed vs Accuracy trade off.
Reduce costs on storage, compute, operations .
Speed up output generation, inference, etc.
Work with lower precision data.
Cast/ map data from Int32, Float32, etc 32-bit or higher precision to lower precision data types such as 16-bit Brain Float (BFloat16) or 4-bit (NFloat)/ int4 or int8, etc.

East mapping Float32 (1-bit Sign, 7-bit Exponent, 23-bit Mantissa) => BFloat16 (1-bit Sign, 7-bit Exponent, 7-bit Mantissa). Just discard the higher 16-bits of mantissa. No overflow!
Straightforward mapping work out max, min, data distribution, mean, variance, etc & then sub-divide into equally sized buckets based on bit size of the lower precision data type. E.g int4 (4-bit) => 2^4 = 16 buckets.
Handle outliers, data skew which can mess up the mapping, yet lead to loss of useful info if discarded randomly.
Work out Bounds wrt Loss of Accuracy.

LLMs, AI/ ML side:

https://newsletter.theaiedge.io/p/reduce-ai-model-operational-costs

Lucene, Search side:

https://www.elastic.co/search-labs/blog/scalar-quantization-101
https://www.elastic.co/search-labs/blog/scalar-quantization-in-lucene

Wednesday, February 26, 2020

Sampling Plan for Binomial Population with Zero Defects

Rough notes on sample size requirement calculations for a given confidence interval for a Binomial Population - having a probability p of success & (1 – p) of failure. The first article of relevance is Binomial Confidence Interval which lists out the different approaches to be taken when dealing with:

Large n (> 15), large p (>0.1) => Normal Approximation
Large n (> 15), small p (<0.1) => Poisson Approximation
Small n (< 15), small p (<0.1) => Binomial Table

On the other side, there are derivatives of the Bayes Success Run theorem such as Acceptance Sampling, Zero Defect Sampling, etc. used to work out statistically valid sampling plans. These approaches are based on a successful run of n tests, in which either zero or a an upper bounded k-failures are seen.

These approaches are used in various industries like healthcare, automotive, military, etc. for performing inspections, checks and certifications of components, parts and devices. The sampling could be single sampling (one sample of size n with confidence c), or double sampling (a first smaller sample n1 with confidences c1 & a second larger sample n2 with confidence c2 to be used if test on sample n1 shows more than c1 failures), and other sequential sampling versions of it. A few rule of thumb approximations have also emerged in practice based on the success run techique:

Rule of 3s: That provides a bound for p=3/n, with a 95% confidence for a given success run of length n, with zero defects.

Success run sample size (n) using Confidence Interval (C) & Reliability (R = 1 -p), when sampling with replacement (sampled item replaced and maybe selected again) taking a Binomial Distribution:

probability of R^n, for a successful run of length n with zero defects

case when a maximum of k defects/ failures is acceptable

For a small population and when sampling without replacement (sampled item not replaced), the Hypergeometric Distribution is used for sampling n items from a population having size N with D defects. Accordingly, Operating Characteristic (OC) curves for the given problem are prepared and used to get values of C & R for a given n.

Footnote on Distributions:

Poisson distribution is used to model events within time/ space that are rare (small p) but show up large number of times (large n) & occur independent of the time since last event. Inter arrival times is an iid exponential random variables.

Poisson can be derived as a limiting case to Binomial when n is large (tends to infinity) while the expected no. of successes (or failures) remains fixed, also referred to as the law of rare event

Poisson confidence interval is derived from Gamma Distribution - which is defined using the two-parameters shape & scale. Exponential, Erlang & Chi-Squared are all special cases of Gamma Distrubtion. Gamma distribution is used in areas such as prediction of wait time, insurance claims, wireless communication signal power fading, age distribution of cancer events, inter-spike intervals, genomics. Gamma is also the conjugate prior of Bayesian statistics & exponential distribution.

Bayesian Success Run can be derived using the Beta Distribution which is the conjugate prior for Binomial. Beta Distribution is defined via two shape parameters. Beta Distribution applications is found in order statistics (selection of k-th smallest from Uniform distribution), subjective logic, wavelet analysis, project management (PERT), etc.

Monday, November 11, 2013

Shanon Entropy and Information Gain

Shanon's Information Gain/ Entropy theory gets applied a lot in areas such as data encoding, compression and networking. Entropy, as defined by Shanon, is a measure of the unpredictability of a given message. The higher the entropy the more unpredictable the content of the message is to a receiver.

Correspondingly, a high Entropy message is also high on Information Content. On receiving a high Entropy/ high Information Content laden message, the receiver has a high Information Gain.

On the other hand, when the receiver already knows the contents (or of a certain bias) of the message, the Information Content of the message is low. On receiving such a message the receiver has less Information Gain. Effectively once the uncertainty about the content of the message has reduced, the Entropy of the message has also dropped and the Information Gain from receiving such a message has gone down. The reasoning this far is quite intuitive.

The formula for Entropy (H) calculation:
H(X) = -Summation[ p(x) * log( p(x) )] over all possible values/ outcomes of x, i.e. {x1, ..., xn}
where p(x) = probability of each of the values/ outcomes of x {x1, ..., xn}.
The log is in a certain base b.

(All calculations in base 2)
Eg 1.a: In the case of a single fair coin toss:
x = {H, T}
& p(x) = {1/2, 1/2}

H(X) = -[1/2 * log(1/2) + 1/2 * log(1/2) ] = 1

Eg 1.b: For a biased coin toss, with three times higher likelihood of a Head:
x = {H, T}
& p(x) = {3/4, 1/4}

H(X) = -[3/4 * log(3/4) + 1/4 * log (1/4) ] = 0.811 (Entropy is lower than 1.a, Information Gain is lower).

Eg 1.c: For a completely biased coin toss, which two Heads:
x = {H, T}
& p(x) = {1, 0}

H(X) = -1[ 1*log(1) + 0*log(0) ] = 0 (Entropy is zero)

The Entropy (& unpredictability) is the highest for a fair coin (example 1.a) and decreases for a biased coin (examples 1.b & 1.c). Due to the bias the receiver is able to predict the outcome (favouring the known bias) in the later case resulting in a lower Entropy.

The observation from the (2-outcomes) coin toss case generalizes to the N-outcomes case, and the Entropy is found to be highest when all N-outcomes are equally likely (fair).

Tuesday, October 1, 2013

Need Support to Lift with Confidence

Brace up terminologies coming your way...

Support: A measure of the prevalence of an event x in a given set of N data points. Support is effectively a first level indicator of something occurring frequent enough (say greater than 10% of the times) to be of interest.

In the case of two correlated events x & y,

Confidence: A measure of predictability of two events occurring together. Once confidence is above a certain threshold (say 70%), it means the two events show up together often enough to be used for rules/ decision making, etc.

Lift: A measure of the power of association between two events. For an event y that has occurred, how much more likely is event y to occur once it is known that event x has occurred

Sunday, September 22, 2013

False Negative, False Positive and the Paradox

First a bit about the terms False Positive & False Negative. There terms are associated with the nature of error in the results churned out by a system trying to answer an unknown problem, based on a (limited) set of given/ input data points. After analysing the data, the system is expected to come up with a Yes (it is Positive) or a No (it is Negative) type answer. There is invariably some error in the answer due to noisy data, wrong assumptions, calculation mistakes, unanticipated cases, mechanical errors, surges, etc.

A False Positive is when the system says the answer is Positive, but the answer is actually wrong. An example would be a sensitive car's burglar alarm system that starts to beep due to heavy lightning & thunder on a rainy day. The alarm at this stage is indicating a positive hit (i.e. a burglary), which is not really happening.

On the other hand, a False Negative is when the system answers in a Negative, where the answer should have been a Positive. False negatives happen often with first level medical tests and scans which are unable to detect the cause of pain or discomfort. The test report of "Nothing Abnormal Detected" at this stage is often a False Negative, as revealed by more detailed tests performed later.

The False Positive Paradox is an interesting phenomenon where the likelihood of a False Positive shoots up significantly (& sometimes beyond the actual positive) when the actual rate of occurrence of a condition within a given sample group is very low. The results are thanks to basic likelihood calculations as shown below.

Let's say in a group of size 1,000,000 (1 Mn.), 10% are doctors. Let's say there's a system wherein you feed in a person's Unique ID (UID) and it tells you if the person is a doctor or not. The system has a 0.01% chance of incorrectly reporting a person who is not a doctor to be a doctor (a False Positive).

Now, let's work out our confidence levels of the results given out by the system.

On the other hand if just 0.01% of people in the group are actually doctors (while the rest of the info. remains same) the confidence level works out to be quite different.

This clearly shows that the likelihood of the answer being a False Positive has shot up from much under 1% to as much as 50%, when the occurrence of a condition (number of doctors) within a given population dropped from 10% (i.e. 100,000) to a low value of 0.1% (i.e. 1,000).

Tuesday, September 10, 2013

Identity Function and Idempotence

Thursday, October 25, 2012

Amdahl's Law for Max Utilization and Speedup With Parallelization

There's a lot of talk these days about constraints introduced by the CAP theorem. One other equally relevant law for Parallel and Distributed systems is the Amdahl's law. Amdahl's law talks about the amount of speedup that can be achieved when a given single processor (or threaded) task is split and handed over to N-processors (or threads) to be executed in parallel.

To take an example let's work with the typical entrance examination problem: "if one person takes 2 hours to eat up a cake, how long would four people take to eat the same cake?".

Simple each person eats up one-fourth of the cake. So time taken = time-for-1-person/ N = 2/ 4 = 0.5 hours. Right? Ya, well, unless it's the very same cake that we were referring to, the one that got eaten ;)

What Amdahl's law says is that if the single processor task has sub-tasks or steps (unlike the cake eating example above) not every sub-task/ step can be parallelized. There is some percentage of sub-tasks (F%) that need to be run sequentially. As a result the speedup is not N-times but less computed as follows.

- F = % to be run sequentially

- (1 - F) = % to be parallelized

- N = No of Processors on which tasks can be executed in parallel

- Speedup = 1
-----------------------------------------------------
[ % to be run sequentially + (% that can be parallelized)/ No of processors) ]

  = 1
   -----------------------------------
   [ F + (1-F)/N ]

- To go back to our cake eating example,

i.e. 4 times speedup. Indeed 4 people took 1/4 th the time, i.e. 30 mins, as compared to 2 hours by one person.

- Let's now add some sequential steps before the cake eating. First you got to pay for it at the cashier & then take delivery from the delivery counter.
These have to be done in sequence & only by one person (why pay twice?). Thanks to the monopoly & popularity of the cake vendor, there's invariably a long queue at the cashier. It takes 15 mins to pay and another 15 mins to get the delivery of the cake.

So the steps are:
Single Person (mins) 4 People (mins)
(i) Pay at the cashier (One person)              15                               15
(ii) Collect Cake (One Person)                     15                                15
(iii) Eat Cake (Parallel)                               120                              30
   ------- ----
Total 150    60

Speedup calculation is,

% Sequential (F) = 30 / ( 30 + 120 ) * 100 = 1/ 5 = 20%

Speedup = 1/ [1/5 + (4/5)/4 ] = 5/ 2 = 2.5 - (A)

Let's validate this,
Speedup = Time take by Single Person / Time take by 4 pepople = 150/ 60 = 2.5 - (B)

A = B. QED.

So you see, due to the 20% sequential tasks the speedup has dropped from 4 times to 2.5 times.

Sunday, May 13, 2012

Reflexive, Symmetric, Transitive, Associate, Commutative, Distributive Laws and Operators

A refresher:

Reflexive: a = a
(Every element is mapped back to itself)

Symmetric: a = b, => b = a

Transitive: a = b & b = c, => a = c

Associative: (x # y) # z = x # (y # z)
(The operator is immune to different grouping of the operands)

Commutative: x $ y = y $ x
(The operator is immune to changing order of operands)

Distributive: x @ (y ^ z) = x @ y ^ x @ z,
(Outer Operand x can be distributed over the other two y & z without affecting the results)