
Monday, March 29, 2021

Doing Better Data Science

In the article titled "Field Notes #1 - Easy Does It", author Will Kurt highlights a key aspect of doing good Data Science: simplicity. This starts, first and foremost, with getting a good understanding of the problem to be solved. Later, among the candidate hypotheses and possible solutions/models, it means favouring the simpler ones, or at least giving them a fair, equal chance to prove their worth in tests that use standardized performance metrics.

Another article of relevance for Data Scientists comes from the allied domain of statistics, titled "The 10 most common mistakes with statistics, and how to avoid them". The article, based on a paper in eLife by Makin and Orban de Xivry, lists the ten most common statistical mistakes in scientific research. The paper also includes tips for reviewers to detect such mistakes and for researchers (authors) to avoid them.

Many of the issues listed are linked to the p-value, which is used to establish the significance of statistical tests and to draw conclusions from them. Incorrect usage, misunderstanding, missing corrections, manipulation, and so on render the tests ineffective and lead to insignificant results getting reported. Issues with sampling and with adequate control groups, along with faulty attempts by authors to establish causation where none exists, are also common in the scientific literature.

As per the authors, these issues typically happen due to ineffective experimental designs, inappropriate analyses and/or flawed reasoning. A strong publication bias and the pressure on researchers to publish significant results, as opposed to correct but failed experiments, make matters worse. Moreover, senior researchers entrusted with mentoring juniors are often unfamiliar with the fundamentals and prone to making these errors themselves. Their aversion to criticism becomes a further roadblock to improvement.

While correct mentoring of early-stage researchers will certainly help, change can also come from making science open access. Open science/research must include details of all aspects of the study and all the materials involved, such as the data and analysis code. At the level of institutions and funders, incentivizing correctness over productivity can also prove beneficial.

Friday, April 17, 2020

Analysis of Deaths Registered In Delhi Between 2015 - 2018

The Directorate of Economics and Statistics & Office of Chief Registrar (Births & Deaths), Government of National Capital Territory (NCT) of Delhi annually publishes its report on registrations of births and deaths that have taken place within the NCT of Delhi. The report, an overview of the Civil Registration System (CRS) in the NCT of Delhi, is a source of very useful statistics on births, deaths, infant mortality and so on within the Delhi region.

The detailed reports can be downloaded as PDF files from the website of the Department of Economics and Statistics, Delhi Government. Anonymized, cleaned data is made available as tables in the section titled "STATISTICAL TABLES" in the PDF files. The births and deaths data is aggregated by attributes like age, profession, gender, etc.

Approach

In this article, an analysis has been done of tables D-4 (DEATHS BY SEX AND MONTH OF OCCURRENCE (URBAN)), D-5 (DEATHS BY TYPE OF ATTENTION AT DEATH (URBAN)) and D-8 (DEATHS BY AGE, OCCUPATION AND SEX (URBAN)) from the above PDFs. Data from these tables for the four years 2015-18 (presently downloadable from the department's website) has been used to evaluate mortality trends in the three most populous urban districts of Delhi: North DMC, South DMC and East DMC.

Analysis




1) Cyclic Trends: Absolute death counts for the period Jan-2015 to Dec-2018 are plotted in table "T1: Trends 2015-18". Another view of the same data, as each month's percentage of the annual total, is shown in table "T-2: Month/ Year_Total %".




Both tables clearly show that there is a spike in the number of deaths in the colder months of Dec to Feb. About 30% of all deaths in Delhi happen within these three months. The percentages are fairly consistent for both genders and across all 3 districts of North, South & East DMCs.

As summer sets in from March, the death percentages start dropping, reaching their lowest points, below 7% per month, in June and July as the monsoons set in. Towards the end of the monsoons a second spike is seen around Aug/Sep, followed by a dip in Oct/Nov before the next winter, when the cyclic trend repeats.


  


The trends reported above are also seen in the moving averages, plotted in table "T-3: 3-Monthly Moving Avg", across the three districts and both genders. Similar trends, though not plotted here, are seen in moving averages over other windows (such as 2 and 4 months).
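
As a rough sketch (not the actual notebook behind these charts), the percentage-of-annual and 3-monthly moving average views can be reproduced with pandas, assuming the monthly counts from table D-4 have been extracted into a hypothetical CSV with district, year, month and deaths columns:

import pandas as pd

# Assumed layout: one row per district and month, with columns
# district, year, month (1-12), deaths. The file name is hypothetical.
df = pd.read_csv("delhi_d4_monthly_deaths.csv")

# Each month's share of that year's total deaths, per district (the T-2 view)
df["annual_total"] = df.groupby(["district", "year"])["deaths"].transform("sum")
df["pct_of_annual"] = 100 * df["deaths"] / df["annual_total"]

# 3-monthly moving average of the absolute counts, per district (the T-3 view)
df = df.sort_values(["district", "year", "month"])
df["deaths_3m_ma"] = (
    df.groupby("district")["deaths"]
      .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)

print(df.head())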

2) Gender Differences: In terms of differences between the genders, far more deaths of males as compared to females were noted during the peak winters of Delhi between 2015-18. This is shown in table "T4: Difference Male & Female".




From a peak gap of about 1,000 in the colder months, the difference drops to the 550-600 range in the summer months, particularly for the North and South DMCs. A narrower gap is seen in the East DMC, largely attributable to its smaller population as compared to the other two districts.






Table "T5: Percentage Male/ Female*100" plots the percentage of male deaths to females over the months. The curves of the three districts though quite wavy primarily stay within the rough band of 1.5 to 1.7 times male deaths as compared to females. The spike of the winter months is clearly visible in table T5 as well.    

3) Cross District Differences in Attention Type: Table "T6: Percentage Attention Type" plots the different forms of attention (hospital, non-institutional, doctor/nurse, family, etc.) received by the person at the time of death.




While in East DMC over 60% of people were in institutional care, the figure is almost 20 percentage points lower for the North and South DMCs. For the latter two districts the percentage of deaths with no medical attention received has remained consistently high, with South DMC particularly high at over 40%.

4) Vulnerable Age: Finally, a plot of the vulnerable age groups is shown in table "T7: Age 55 & Above". A clear spike in deaths is seen in the 55-64 age group, perhaps attributable to retirement from active work and the subsequent lifestyle changes. The gender skew within the 55-64 age group may again be due to the inherent skew in the workforce, which has far more male workers who would be subjected to the effects of retirement. This aspect could be probed further using other data sources.







The 65-69 age group shows far lower mortality, as these individuals are perhaps better adjusted and healthier. Finally, a spike is seen in the number of deaths among super senior citizens aged 70 and above, largely attributable to advancing age and increasingly frail health.

Conclusion

The analysis in this article was done using data published annually by the Directorate of Economics and Statistics & Office of Chief Registrar (Births & Deaths), Government of National Capital Territory (NCT) of Delhi on registrations of births and deaths within the NCT of Delhi. Mortality data from the three most populous districts of Delhi, North DMC, South DMC and East DMC, was analysed. Some specific monthly, yearly and age-group related trends are reported here.

The analysis can easily be performed for the other districts of Delhi, as well as for data from more recent years as and when those are made available by the department. The data may also be used for modelling and simulation purposes and for training machine learning algorithms. More real-time sharing of raw (anonymized, aggregated) data by the department via APIs or other data feeds may be considered in the future. This would benefit the research and data science community, who could put the data to good use for public health and welfare purposes.

Resources:

Downloadable Datasheets For Analysis:

Monday, February 18, 2019

System Reliability

People are surrounded by devices of all kinds. Reliability is one of the key aspects of a user's experience with a device, particularly over the long term. It also shapes the general opinion (positive or negative) that the user forms about the device, its brand and its manufacturer. An understanding of reliability is thus important for both the manufacturer and the user.

Reliability numbers are worked out initially at the design phase by the manufacturer. Explicit targets are set for the product, and these govern the design choices. Later, several rounds of testing are done by the manufacturer and/or the certifying authority, mostly before device roll-out, to ascertain the actual numbers. In certain cases these may need to be re-examined due to unexplained failures, manufacturing defects, etc. while the device is in service. Such evaluations can be performed during routine maintenance of the device or via an explicit recall of the device to a designated service station. The data collected is analyzed to understand and resolve the underlying issues in the device and the causes of failure.

Reliability Analysis

There are some standard methods adopted by manufacturers (OEMs) and others to calculate the reliability numbers of a device. These include, among others, quantitative measures such as the Mean Time To Failure (MTTF), Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR), at the level of the device and/or its sub-components. MTTF is a measure of the time (or number of cycles, runs, etc.) at which the device is likely to fail, while MTBF is the equivalent value for repairable devices and accounts for the interval between failure incidents. MTTR is the corresponding time spent in repair. For repairable systems:
   MTBF = MTTF + MTTR

These numbers are aggregates applicable to a general population of devices, not to one specific device. So an MTBF of 30,000 hours does not mean each individual device will last 30,000 hours; rather, a population of 30 such devices running 1,000 hours each, collectively clocking 30K device-hours, would on average see about one failure.

For an exponential reliability model, R(t) = exp(-t/MTBF), the probability of a specific device surviving up to its rated t = MTBF is:
 R(MTBF) = exp(-MTBF/MTBF) = exp(-1) = 36.8%
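
A tiny numerical check of the exponential model above (a sketch; the 30,000-hour MTBF is the same illustrative figure used earlier):

import math

def reliability_exponential(t_hours, mtbf_hours):
    # R(t) = exp(-t / MTBF): probability that a device with a constant
    # failure rate of 1/MTBF survives beyond time t
    return math.exp(-t_hours / mtbf_hours)

mtbf = 30_000  # hours
print(reliability_exponential(mtbf, mtbf))        # ~0.368 -> 36.8% survive to t = MTBF
print(reliability_exponential(0.5 * mtbf, mtbf))  # ~0.607 survive to half the MTBF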

For repairable systems, another term used often is Availability.
 Availability = System Up Time/ (System Up Time + System Down Time)

For mission critical systems that can not accept any downtime, Availability equals Reliability!
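
As a small worked sketch (made-up numbers), steady-state availability follows from the up and down times per failure cycle, which under the convention above are roughly MTTF and MTTR:

def availability(mttf_hours, mttr_hours):
    # Up time per failure cycle ~ MTTF, down time per cycle ~ MTTR,
    # so Availability = MTTF / (MTTF + MTTR)
    return mttf_hours / (mttf_hours + mttr_hours)

# e.g. a device that runs ~1,000 hours between failures and takes ~2 hours to repair
print(availability(1_000, 2))  # ~0.998, i.e. roughly 99.8% availability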

Weibull Analysis

Statistical techniques such as Weibull analysis are also very common for reliability computations. Weibull analysis makes use of data from failed as well as non-failed devices to work out device lifespan and reliability. A set of sample devices is observed under test conditions and a statistical distribution (model) is fitted to the data collected from these test samples. The fitted model is then used to make predictions about the reliability of the entire population of devices operating under real-world conditions.

The Weibull model uses three parameters: β, the shape parameter (shape of the distribution); η, the scale parameter (spread); and γ, the location parameter (shift in time). Interestingly, the Weibull model nicely captures the standard U-shaped "bath tub" reliability curve typically seen over a device's lifespan. In the early life of a device (testing, acceptance stage) the failure and defect rates are high (β < 1). As these get fixed, the failure rate drops quickly to the steady, operation-ready Useful Life stage.

In the Useful Life stage (β = 1) the device is stable and ready to roll out to the end user. Failures in this second stage are mainly due to design issues, operation, human errors, unexpected failures, etc. Finally, the device enters the Wear-out phase (β > 1), where the device or certain sub-components start showing natural wear and tear. Repairs and maintenance help to keep the device in good working shape for a while, but eventually repairs are no longer viable due to cost or other reasons, and the device is taken out of service. Decisions around scheduled inspections, overhauls, etc. can be planned based on the stage of the device life cycle and the corresponding value of β.
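
As a hedged sketch (not part of the original write-up), SciPy's weibull_min can be used to fit β, η and γ to observed failure times and read off reliability; the failure times below are made up purely for illustration:

import numpy as np
from scipy import stats

# Made-up failure times (in hours) of a test batch of devices
failure_hours = np.array([1200, 1850, 2300, 2900, 3400, 4100, 4800, 5600, 6500, 8000])

# Fit the Weibull model; weibull_min's parameters map to the ones above as
# c = beta (shape), scale = eta, loc = gamma. Fixing loc = 0 is common when
# failures can occur from t = 0 onwards.
beta, gamma, eta = stats.weibull_min.fit(failure_hours, floc=0)
print(f"beta (shape) = {beta:.2f}, eta (scale) = {eta:.0f} h, gamma (location) = {gamma:.0f}")

# Reliability R(t) is the survival function of the fitted model
t = 3000  # hours
r_t = stats.weibull_min.sf(t, beta, loc=gamma, scale=eta)
print(f"Estimated probability of surviving beyond {t} h: {r_t:.2%}")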

Other lifetime distributions, such as the Exponential, Rayleigh, Gamma and Beta, as well as the Poisson model for failure counts, are applied to specific types of devices, domains and failure cases. Selecting the appropriate distribution is important for a proper reliability analysis.

Sampling and Confidence Levels

Once devices are live, actual on-ground analysis can also be done for certain categories of devices. Data can be collected from a representative sample of devices operating on ground. Techniques from statistics for reliably sampling & deriving confidence intervals for an underlying population can be applied for this purpose.

The analysis is typically done for a binomial population of devices, where a fraction p of the devices is expected to fail while the remaining (1 - p) are expected to operate fine (without failure). For a desired confidence interval (tolerance) of c, the sample size n is worked out using a normal approximation to the binomial distribution, which simplifies the calculations:

    n = Z² X p X (1 - p) / c²


where Z is a constant chosen, based on the desired confidence level, from the standard normal curve: Z = 1.96 for 95% confidence, 2.58 for 99%, and so on.

   (E.g. 1) For example, if for a certain device 4% of devices are expected to fail, p = 0.04:

      (1.a) With a 99% confidence level, for a 1% confidence interval, c = 0.01:
n = 2.58*2.58*0.04*(1-0.04)/(0.01*0.01) ≈ 2,556 samples are needed

     (1.b) For a tighter 0.1% confidence interval, c = 0.001:
n = 2.58*2.58*0.04*(1-0.04)/(0.001*0.001) ≈ 255,606 samples (100x more than (1.a)) are needed

    (1.c) Similarly, for a higher confidence level of 99.99% (Z = 3.891), at the same 1% confidence interval:
n = 3.891*3.891*0.04*(1-0.04)/(0.01*0.01) ≈ 5,814 samples (more than (1.a)) are needed

The above sample size estimator assumes a very large, effectively infinite, population. In the case of a finite-sized population, the following correction is applied to the cases above:

   n_finite = n / (1 + (n-1)/size_pop)

  (1.b.1) Applying the correction to case (1.b) above, assuming a total population of only 30,000 devices:
n_finite = 255605/ (1 + (255605-1)/30000) ≈ 26,849 devices need to be sampled to achieve the 0.1% confidence interval (tolerance) at the 99% confidence level.
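
The calculations above can also be scripted; a minimal sketch (the helper names are ad hoc) that reproduces the numbers in examples (1.a) to (1.b.1):

Z = {0.95: 1.96, 0.99: 2.58, 0.9999: 3.891}  # z-values used in the examples above

def sample_size(p, c, z):
    # n = z^2 * p * (1 - p) / c^2, normal approximation, infinite population
    return z * z * p * (1 - p) / (c * c)

def sample_size_finite(p, c, z, pop):
    # finite-population correction: n / (1 + (n - 1) / N)
    n = sample_size(p, c, z)
    return n / (1 + (n - 1) / pop)

print(round(sample_size(0.04, 0.01, Z[0.99])))                  # ~2,556   (1.a)
print(round(sample_size(0.04, 0.001, Z[0.99])))                 # ~255,606 (1.b)
print(round(sample_size(0.04, 0.01, Z[0.9999])))                # ~5,814   (1.c)
print(round(sample_size_finite(0.04, 0.001, Z[0.99], 30_000)))  # ~26,849  (1.b.1)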

As discussed earlier, device reliability trends tend to fit lifetime distributions such as the Weibull better. Confidence intervals in such cases are worked out using the appropriate distribution. For instance, when a small, constant failure rate (λ) is expected, an exponential or Poisson reliability model approximates the situation better than the normal one. The confidence interval for λ is then worked out using a Chi-Square distribution with 2n degrees of freedom, where n is the number of failures seen over time in the sampled set of devices.

Redundancy

Some systems need high fault tolerance. The reliability of such systems can be improved by introducing redundant devices in parallel, thereby removing the Single Point of Failure (SPOF). When one device fails, an alternate one can perform the job in its place.

Reliability of the redundant (parallel) system:
    R = 1 - p1 X p2 X .. X pk

   where p1, .., pk are the failure probabilities of the k devices operating in parallel (the primary and its backups).

    (E.g. 2) In the earlier example of a single-device system with a failure rate p = 0.04, and hence a reliability of 96% (1 - 0.04), if we introduce an identical redundant/backup device also with p = 0.04, the reliability goes up to R = 1 - 0.04*0.04 = 99.84%.
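
A quick sketch of the parallel-redundancy formula, reproducing E.g. 2:

from functools import reduce

def parallel_reliability(failure_probs):
    # R = 1 - p1 * p2 * ... * pk: the parallel system fails only
    # if every one of the devices fails
    return 1 - reduce(lambda acc, p: acc * p, failure_probs, 1.0)

print(parallel_reliability([0.04]))        # 0.96   -> single device, 96%
print(parallel_reliability([0.04, 0.04]))  # 0.9984 -> with an identical backup, 99.84%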

k-out-of-n Systems

An alternate set-up is a consensus-based (k-out-of-n) system. In this set-up, the system stays up as long as at least k of its n devices (the quorum, typically a majority) are working, and fails only once more than n - k devices have failed. The reliability of the quorum system is:
   R_quorum_system = probability that at least k of the n devices are working
                   = 1 - probability of more than (n - k) device failures

For consensus-based systems the quorum is typically set to a majority, i.e. k = floor(n/2) + 1, which keeps the system consistent as long as a majority of the devices are up (for n = 2f + 1 devices, up to f failures can be tolerated).
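
A sketch of the k-out-of-n calculation, assuming independent, identical devices each with the 96% reliability from the earlier example:

from math import comb

def k_of_n_reliability(k, n, r):
    # Probability that at least k of n independent devices (each with
    # reliability r) are working, i.e. the quorum is available
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

r = 0.96  # per-device reliability
print(k_of_n_reliability(1, 2, r))  # plain parallel redundancy: 0.9984
print(k_of_n_reliability(2, 3, r))  # majority quorum of 3: ~0.9953
print(k_of_n_reliability(3, 5, r))  # majority quorum of 5: ~0.9994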

Monitoring Systems

Another typical approach is to introduce monitoring systems. The monitoring system can be in the form of a sensor (optical, non-optical), a logger, a heart-beat polling unit, a human operator, or a combination of these. Whenever the monitoring system finds the primary system faltering, it raises alarms so that corrective measures can be taken, which may include stopping/replacing the faulty device and/or switching over to a backup system if available.

The reliability of the monitoring system is assumed to be much higher than that of the underlying system being monitored, ideally 100%. The monitoring system operates in series with the underlying system, so the reliability of the overall system is:
    R_monitored_system = R_device X R_monitoring 

In other words, a failure in either the device or the monitor, or both, results in failure of the overall system, so monitoring actually increases the chances of failure on paper. Yet monitoring systems are effective on the ground since they are the first line of defense: they raise alarms so that the human operator can intervene early, lowering MTTR.

In certain set-ups the monitoring system is also able to automatically switch over to a backup device when the primary device fails. This helps reduce the downtime (MTTR) to a negligible value, if not zero. With redundant devices and a single monitoring system, the SPOF shifts to the monitoring system. A further refinement of the system design (as in protocols such as Zab and Paxos) is to set up the monitoring system itself as a k-of-n, typically majority, quorum. All decisions regarding the state of the underlying devices are taken by the quorum. A majority quorum of n monitors remains available as long as a majority of them, i.e. at least floor(n/2) + 1, are still up.
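
Putting the pieces together, a rough sketch (illustrative numbers, independence assumed) of two redundant devices watched by a 2-of-3 monitoring quorum that sits in series with them:

from math import comb

def k_of_n(k, n, r):
    # probability that at least k of n independent components are working
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

r_device = 0.96    # single device reliability
r_monitor = 0.999  # single monitor node, assumed far more reliable

r_devices = 1 - (1 - r_device) ** 2   # two devices in parallel
r_monitors = k_of_n(2, 3, r_monitor)  # majority (2-of-3) monitoring quorum
r_system = r_devices * r_monitors     # monitoring in series with the devices

print(f"devices: {r_devices:.4f}, monitors: {r_monitors:.6f}, system: {r_system:.4f}")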

Through good system design and thought, reliability at the system level can be significantly boosted even when the sub-components are less reliable. Design and engineering teams must possess sound reliability analysis skills to build world-class products. An awareness of reliability also helps end users pick the right device, one that suits their requirements and continues to function properly over its lifespan.

Thursday, November 28, 2013

Precision and Recall

Terms popular within search and Information Retrieval (IR) domains.

Precision: Is all about accuracy. Whether all results that have shown up are relevant.

Recall: Has to do with completeness. Whether all valid/ relevant results have shown up.
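
In the standard true/false positive terms, a minimal sketch of the two measures:

def precision(tp, fp):
    # of the results returned, the fraction that were actually relevant
    return tp / (tp + fp)

def recall(tp, fn):
    # of all the relevant items out there, the fraction that were returned
    return tp / (tp + fn)

# e.g. a search that returns 8 relevant and 2 irrelevant results,
# while missing 4 other relevant documents
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=4))     # ~0.67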

Needs detailing..

Sunday, September 22, 2013

False Negative, False Positive and the Paradox


First, a bit about the terms False Positive and False Negative. These terms describe the nature of error in the results churned out by a system trying to answer an unknown problem based on a (limited) set of given/input data points. After analysing the data, the system is expected to come up with a Yes (it is Positive) or a No (it is Negative) type of answer. There is invariably some error in the answer due to noisy data, wrong assumptions, calculation mistakes, unanticipated cases, mechanical errors, surges, etc.

A False Positive is when the system says the answer is Positive, but that answer is actually wrong. An example would be a sensitive car burglar alarm that starts to beep due to heavy lightning and thunder on a rainy day. The alarm is indicating a positive hit (i.e. a burglary) that is not really happening.

On the other hand, a False Negative is when the system answers in the Negative where the answer should have been Positive. False negatives happen often with first-level medical tests and scans that are unable to detect the cause of a pain or discomfort. The report of "Nothing Abnormal Detected" at this stage is often a False Negative, as revealed by more detailed tests performed later.

The False Positive Paradox is an interesting phenomenon where the likelihood of a False Positive shoots up significantly (sometimes exceeding the number of actual positives) when the rate of occurrence of a condition within a given group is very low. This follows from basic likelihood calculations, as shown below.

Let's say in a group of size 1,000,000 (1 Mn.), 10% are doctors. Let's say there's a system wherein you feed in a person's Unique ID (UID) and it tells you if the person is a doctor or not. The system has a 0.01% chance of incorrectly reporting a person who is not a doctor to be a doctor (a False Positive).

Now, let's work out our confidence levels of the results given out by the system.


On the other hand, if just 0.01% of people in the group are actually doctors (while the rest of the info. remains the same), the confidence level works out to be quite different.


This clearly shows that the likelihood of an answer being a False Positive shoots up from well under 1% to as much as 50% when the occurrence of the condition (the number of doctors) within the population drops from 10% (i.e. 100,000 doctors) to a low 0.01% (i.e. just 100 doctors).
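
Since the intermediate calculations are not reproduced above, here is a minimal sketch of them, assuming the system correctly flags every actual doctor (i.e. no false negatives):

def false_positive_share(pop, doctor_frac, fp_rate):
    # fraction of the system's "is a doctor" answers that are false positives,
    # assuming every actual doctor is correctly flagged
    doctors = pop * doctor_frac
    false_positives = (pop - doctors) * fp_rate
    return false_positives / (doctors + false_positives)

POP, FP_RATE = 1_000_000, 0.0001  # 1 Mn people, 0.01% false positive rate

print(false_positive_share(POP, 0.10, FP_RATE))    # ~0.0009 -> well under 1%
print(false_positive_share(POP, 0.0001, FP_RATE))  # ~0.5    -> about half the hits are wrong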