
Monday, March 29, 2021

Doing Better Data Science

In the article titled "Field Notes #1 - Easy Does It", author Will Kurt highlights a key aspect of doing good Data Science: simplicity. This means, first and foremost, getting a good understanding of the problem to be solved, and then favouring the simpler of the candidate hypotheses & possible solutions/ models, or at least giving the simpler ones a fair, equal chance to prove their worth in tests employing standardized performance metrics.

Another article of relevance for Data Scientists, from the allied domain of statistics, is titled "The 10 most common mistakes with statistics, and how to avoid them". The article, based on the eLife paper by Makin and Orban de Xivry, lists the ten most common statistical mistakes in scientific research. The paper also includes tips both for reviewers to detect such mistakes and for researchers (authors) to avoid them.

Many of the issues listed are linked to the p-value computation, which is used to establish the significance of statistical tests and to draw conclusions from them. However, incorrect usage, misunderstanding, missing corrections, manipulation, etc. render the tests ineffective, with insignificant results getting reported. Issues of sampling and inadequate control groups, along with faulty attempts by authors to establish causation where none exists, are also common in the scientific literature.

As per the authors, these issues typically happen due to ineffective experimental designs, inappropriate analyses and/or flawed reasoning. A strong publication bias & pressure on researchers to publish significant results, as opposed to correct but failed experiments, makes matters worse. Moreover, senior researchers entrusted with mentoring juniors are often unfamiliar with the fundamentals themselves and prone to making these errors. Their aversion to criticism becomes a further roadblock to improvement.

While proper mentoring of early-stage researchers will certainly help, change can also come from making science open access. Open science/ research must include details on all aspects of the study and all the materials involved, such as data and analysis code. At the level of institutions and funders, incentivizing correctness over productivity can also prove beneficial.

Friday, April 17, 2020

Analysis of Deaths Registered In Delhi Between 2015 - 2018

The Directorate of Economics and Statistics & Office of Chief Registrar (Births & Deaths), Government of National Capital Territory (NCT) of Delhi annually publishes a report on registrations of births and deaths that have taken place within the NCT of Delhi. The report, an overview of the Civil Registration System (CRS) in the NCT of Delhi, is a source of very useful statistics on births, deaths, infant mortality and so on within the Delhi region.

The detailed reports can be downloaded as PDF files from the website of the Department of Economics and Statistics, Delhi Government. Anonymized, cleaned data is made available in the form of tables in the section titled "STATISTICAL TABLES" in the PDF files. The births and deaths data is aggregated by attributes like age, profession, gender, etc.

Approach

In this article, an analysis has been done of tables D-4 (DEATHS BY SEX AND MONTH OF OCCURRENCE (URBAN)), D-5 (DEATHS BY TYPE OF ATTENTION AT DEATH (URBAN)) & D-8 (DEATHS BY AGE, OCCUPATION AND SEX (URBAN)) from the above PDFs. Data from these tables for the four years 2015-18 (presently downloadable from the department's website) has been used to evaluate mortality trends in the three most populous urban districts of Delhi: North DMC, South DMC & East DMC.

Analysis




1) Cyclic Trends: Data for absolute death counts for the period Jan-2015 to Dec-2018 is plotted in table "T1: Trends 2015-18". Another view of the same data, as each month's percentage of the annual total, is shown in table "T-2: Month/ Year_Total %".




Both tables clearly show that there is a spike in the number of deaths in the colder months of Dec to Feb. About 30% of all deaths in Delhi happen within these three months. The percentages are fairly consistent for both genders and across all 3 districts of North, South & East DMCs.

As summer sets in from March, the death percentages start dropping, reaching their lowest points, below 7% per month, in June & July as the monsoon sets in. Towards the end of the monsoon, a second spike is seen around Aug/ Sep, followed by a dip in Oct/ Nov before the next winter, when the cyclic trend repeats.


  


The trends reported above are also seen in the moving averages, plotted in table "T-3: 3-Monthly Moving Avg", across the three districts and genders. Similar trends, though not plotted here, are seen in moving averages over other window lengths (such as 2 & 4 months).
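As an illustration, here is a minimal pandas sketch of such a moving-average computation; the column name and the sample values are made up, since the actual series would be read from table D-4 of the reports.

    import pandas as pd

    # Hypothetical monthly death counts (illustrative values only); the real series
    # would be extracted from table D-4 for each district & gender, Jan-2015 to Dec-2018.
    data = pd.DataFrame({
        "month": pd.date_range("2015-01-01", periods=6, freq="MS"),
        "north_dmc_male": [2950, 2610, 2480, 2310, 2260, 2190],
    })

    # 3-monthly moving average (as in table "T-3"); change window to 2 or 4 for other lengths.
    data["north_dmc_male_3mma"] = data["north_dmc_male"].rolling(window=3).mean()
    print(data)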

2) Gender Differences: In terms of differences between genders, far more deaths of males than of females were noted during the peak winter months in Delhi between 2015-18. This is shown in table "T4: Difference Male & Female".




From a peak gap of about 1000 in the colder months, the difference drops to the 550-600 range in the summer months, particularly for the North & South DMCs. A narrower gap is seen in the East DMC, largely attributable to its smaller population size as compared to the other two districts.






Table "T5: Percentage Male/ Female*100" plots the percentage of male deaths to females over the months. The curves of the three districts though quite wavy primarily stay within the rough band of 1.5 to 1.7 times male deaths as compared to females. The spike of the winter months is clearly visible in table T5 as well.    

3) Cross-District Differences in Attention Type: Table "T6: Percentage Attention Type" plots the different forms of attention (hospital, non-institutional, doctor/ nurse, family, etc.) received by the person at the time of death.




While in the East DMC over 60% of people were in institutional care, the figure is almost 20 percentage points lower for the North & South DMCs. For the latter two districts, the percentage for No Medical Attention received has remained consistently high, with the South DMC particularly high at over 40%.

4) Vulnerable Age: Finally, a plot of the vulnerable age groups is shown in table "T7: Age 55 & Above". A clear spike in death rates is seen in the 55-64 age group, perhaps attributable to retirement from active work & the subsequent lifestyle changes. The gender skew within the 55-64 age group may again be due to the inherent skew in the workforce, which has a far higher number of male workers who would be subject to the effects of retirement. This aspect could be probed further using other data sources.







The in-between 65-69 age group shows far lower mortality rates, perhaps because this group is better adjusted and healthier. Finally, a spike is seen in the number of deaths among the super senior citizens aged 70 & above, largely attributable to advancing age and the resulting frail health.

Conclusion

The analysis in this article was done using data published annually by the Directorate of Economics and Statistics & Office of Chief Registrar (Births & Deaths), Government of National Capital Territory (NCT) of Delhi, on registrations of births and deaths within the NCT of Delhi. Mortality data from the three most populous districts of Delhi (North DMC, South DMC and East DMC) was analysed, and some specific monthly, yearly and age-group-related trends have been reported here.

The analysis can easily be extended to the other districts of Delhi, as well as to data from more recent years as and when it is made available by the department. The data may also be used for various modeling and simulation purposes and for training machine learning algorithms. More real-time sharing of raw (anonymized, aggregated) data by the department via APIs or other data feeds may be looked at in the future. This would benefit the research and data science community, who could put the data to good use for public health and welfare purposes.

Resources: 

Downloadable Datasheets For Analysis:

Friday, February 28, 2020

Defence R&D Organisation Young Scientists Lab (DYSL)


Recently there was quite a lot of buzz in the media about the launch of the DRDO Young Scientists Labs (DYSL). Five such labs have been formed by DRDO, each headed by a young director under the age of 35! Each lab has its own specialized focus area from among fields such as AI, Quantum Computing, Cognitive Technologies, Asymmetric Technologies and Smart Materials.

When trying to look for specifics on what these labs are doing, particularly the AI lab, there is very little to go by for now. While a lot of information about the long-established DRDO Centre for Artificial Intelligence and Robotics (CAIR) lab is available on the DRDO website, there is practically nothing there regarding the newly formed DRDO Young Scientists Lab for AI (DYSL-AI). Neither are the details available anywhere else in the public domain, at least as of end-Feb 2020. While these would certainly get updated soon, for now there are just these interviews with the directors of the DYSL labs:

  • Doordarshan's Y-Factor Interview with the 5 DYSL Directors Mr. Parvathaneni Shiva Prasad, Mr. Manish Pratap Singh, Mr. Ramakrishnan Raghavan, Mr. Santu Sardar, Mr. Sunny Manchanda







  • Rajya Sabha TV Interview with DYSL-AI Director Mr. Sunny Manchanda





Wednesday, February 26, 2020

Sampling Plan for Binomial Population with Zero Defects

Rough notes on sample size requirement calculations for a given confidence interval for a Binomial population, having a probability p of success & (1 - p) of failure. The first article of relevance is Binomial Confidence Interval, which lists the different approaches to be taken (a quick sketch of the normal-approximation case follows the list) when dealing with:

  • Large n (> 15), large p (>0.1) => Normal Approximation
  • Large n (> 15), small p (<0.1) => Poisson Approximation
  • Small n (< 15), small p (<0.1) => Binomial Table
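
As a quick illustration of the first (normal approximation) case, here is a minimal Python sketch; the counts used in the example are made up.

    import math

    def binomial_ci_normal(successes, n, z=1.96):
        # Normal-approximation (Wald) confidence interval for a binomial proportion;
        # suitable for the large-n, large-p case above. Purely illustrative.
        p_hat = successes / n
        half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
        return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

    # e.g. 40 successes out of 200 trials, ~95% confidence
    print(binomial_ci_normal(40, 200))   # roughly (0.145, 0.255)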

On the other side, there are derivatives of the Bayes Success Run theorem, such as Acceptance Sampling, Zero Defect Sampling, etc., used to work out statistically valid sampling plans. These approaches are based on a successful run of n tests, in which either zero failures or at most an upper-bounded k failures are seen.

These approaches are used in various industries like healthcare, automotive, military, etc. for performing inspections, checks and certifications of components, parts and devices. The sampling could be single sampling (one sample of size n with an allowed number of failures c), double sampling (a first smaller sample n1 with allowed failures c1 & a second larger sample n2 with allowed failures c2, to be used if the test on sample n1 shows more than c1 failures), or other sequential sampling versions of it. A few rule-of-thumb approximations have also emerged in practice based on the success run technique:

  • Rule of 3s: provides an approximate 95% confidence upper bound of p ≤ 3/n for a given success run of length n with zero defects (a small sketch follows below).
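
A minimal sketch of the zero-failure (success run) sample size calculation and the Rule of 3s check, assuming the standard formula n = ln(1 - C)/ln(R) for demonstrating reliability R at confidence C:

    import math

    def zero_failure_sample_size(reliability, confidence=0.95):
        # Success-run sample size: smallest n such that n tests with zero failures
        # demonstrate the given reliability at the given confidence.
        return math.ceil(math.log(1 - confidence) / math.log(reliability))

    def rule_of_three_bound(n):
        # Approximate 95% upper bound on the failure probability after n failure-free tests.
        return 3 / n

    n = zero_failure_sample_size(reliability=0.99)   # i.e. failure probability p = 1%
    print(n)                          # 299 failure-free tests needed
    print(rule_of_three_bound(n))     # ~0.01, consistent with p = 3/n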

Footnote on Distributions:
  • The Poisson confidence interval is derived from the Gamma distribution, which is defined using two parameters, shape & scale. The Exponential, Erlang & Chi-Squared distributions are all special cases of the Gamma distribution. The Gamma distribution is used in areas such as prediction of wait times, insurance claims, wireless communication signal power fading, age distribution of cancer events, inter-spike intervals, and genomics. In Bayesian statistics, Gamma is also the conjugate prior for the rate parameter of the exponential (and Poisson) distributions.

Monday, March 4, 2019

AB Testing To Establish Causation

A/B testing is a form of testing performed to compare the effectiveness of different versions of a product across randomly assigned (i.i.d.) end-user groups. Each group gets to use only one version of the product. Assignment of any particular user to a specific group is done at random, without any bias. The user composition of the different groups is assumed to be similar, to the extent that switching the product versions between any two groups at random would make no difference to the overall results of the test.

A/B testing is an example of a simple randomized controlled trial. Such tests help establish a causal relationship between a particular element of change & the measured outcome. The element of change could be something like moving certain elements of a page, adding/ removing a feature of the product, and so on. The outcome could be the measured impact on purchases, clicks, conversion rate, time of engagement, etc.

During the testing period, users are randomly assigned to the different groups. The key aspect of the test is that this random assignment of otherwise similar users is done in real time, so that unknown confounding factors (demographics, seasonality, tech competence, background, etc.) have no impact on the test objective. When tested with a fairly large number of users, almost every group will end up with a good sample of users that is representative of the underlying population.

One of the groups (the control group) is shown the baseline version (maybe an older version of an existing product) of the product against which the alternate versions are compared. For every group the proportions of users that fulfilled the stated objective (purchased, clicked, converted, etc.) is captured.

The proportions (pi) are then used to compute the Z-value test statistic (assuming samples large enough for the normal approximation to hold), confidence intervals, etc. The null hypothesis is that the proportions (pi) are all similar, i.e. not significantly different from the proportion of the control group (pc).

For the two-version A/B test scenario:

   Null hypothesis H0(p1 = pc) vs. the alternate hypothesis H1(p1 != pc).

   p1 = X1/ N1 (Test group)
   pc = Xc/ Nc (Control group)
   p_total = (X1 + Xc)/(N1 + Nc) (For the combined population) ,
            where X1, Xc: Number of users from groups test & control that fulfilled the objective,
                & N1, Nc: Total number of users from test & control group

  Z = Observed_Difference / Standard_Error
    = (p1 - pc)/ sqrt(p_total * (1 - p_total) * (1/N1 + 1/Nc))



The confidence level for the computed Z value is looked up in a standard normal table. If it is greater than 1.96 (or 2.576), the null hypothesis can be rejected at a confidence level of 95% (or 99%). This would indicate that the behavior of the test group is significantly different from that of the control group, the likely cause being the element of change brought in by the alternate version of the product. On the other hand, if the Z value is less than 1.96, the null hypothesis is not rejected & the element of change is not considered to have made any significant impact on fulfillment of the stated objective.
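
As a concrete illustration, here is a minimal Python sketch of the two-proportion Z-test described above; the group sizes and conversion counts are made up for the example.

    import math
    from scipy.stats import norm

    def two_proportion_z_test(x1, n1, xc, nc):
        # Z-test for an A/B test, following the formulas above.
        # x1, xc: users fulfilling the objective in the test & control groups;
        # n1, nc: total users in the test & control groups.
        p1, pc = x1 / n1, xc / nc
        p_total = (x1 + xc) / (n1 + nc)
        se = math.sqrt(p_total * (1 - p_total) * (1 / n1 + 1 / nc))
        z = (p1 - pc) / se
        p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value
        return z, p_value

    # Hypothetical counts: 320 of 5000 users converted in the test group vs 260 of 5000 in control
    z, p = two_proportion_z_test(320, 5000, 260, 5000)
    print(round(z, 2), round(p, 4))   # |z| > 1.96 would reject H0 at the 95% level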

Sunday, March 3, 2019

Mendelian Randomization

The second law of Mendelian inheritance is about the independent assortment of alleles at the time of gamete (sperm & egg cell) formation. Therefore, within the population of any given species, genetic variants are likely to be distributed at random, independent of any external factors. This insight forms the basis of the Mendelian Randomization (MR) technique, typically applied in studies of epidemiology.

Epidemiological studies try to establish a causal link (given some known association) between a particular risk factor & a disease, e.g. smoking and cancer, blood pressure and stroke, etc. In many cases the association turns out to be non-causal, reverse causal, etc. Establishing the true effect becomes challenging due to the presence of confounding factors: social, behavioral, environmental, physiological, and so on. MR helps tackle the confounding factors in such situations.

In MR, genetic variants (polymorphisms) or genotypes that have an effect similar to the risk factor/ exposure are identified, with the additional constraint that the genotype must not have any direct influence on the disease. The occurrence of the genotype in the population is random, independent of any external influence. So the presence (or absence) of the disease within the sub-population possessing the genotype establishes (or refutes) that the risk factor/ exposure is actually a cause of the disease. Several studies based on Mendelian randomization have been carried out successfully.

Example 1: There could be a study to establish the causal relationship (given an observed association) between raised cholesterol levels & chronic heart disease (CHD). Given the presence of several confounding factors such as age, physiology, smoking/ drinking habits, reverse causation (CHD causing raised cholesterol), etc., the MR approach would be beneficial.

The approach would be to identify a genotype/ gene variant that is known to be linked to an increase in total cholesterol levels (but has no direct bearing on CHD). The propensity for CHD is then measured for all subjects having that particular genotype; if/ when it is found to be much higher than in the general population (not possessing the gene variant), this establishes that raised cholesterol levels have a causal relationship with CHD.

Instrumental Variables

MR is an application of the statistical technique of instrumental variables (IV) estimation. IV technique is also used to establish causal relationships in the presence of confounding factors.

When applied to regression models, the IV technique is particularly beneficial when the explanatory variables (covariates) are correlated with the error term & so give biased estimates. The IV is chosen such that it only induces changes in the explanatory variables, without having any independent effect on the dependent variable; the IV must not be correlated with the error term. Selecting an IV that fulfills these criteria is largely done through an analytical process supported by some observational data, & by leveraging relevant priors about the field of study.

Equating MR to IV 
  • Risk Factor/ Effect = Explanatory Variable, 
  • Disease = Dependent Variable
  • Genotype = Instrument Variable 
Selection of genotype (IV) is based on prior knowledge of genes, from existing sources, literature, etc.
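
To make the IV idea concrete, here is a small simulated two-stage least squares (2SLS) sketch in Python; the genotype, exposure and outcome are synthetic stand-ins, not real data, with a known true effect of 0.5.

    import numpy as np

    # Simulated MR-style setup: genotype G (instrument), exposure X (risk factor),
    # outcome Y (disease score), U an unobserved confounder affecting both X and Y.
    rng = np.random.default_rng(0)
    n = 10_000
    G = rng.binomial(1, 0.3, n).astype(float)    # genotype, randomly distributed
    U = rng.normal(size=n)                       # confounder
    X = 0.8 * G + U + rng.normal(size=n)         # exposure, e.g. cholesterol level
    Y = 0.5 * X + U + rng.normal(size=n)         # outcome; true causal effect = 0.5

    ols = np.polyfit(X, Y, 1)[0]                 # naive OLS slope, biased by U

    # Two-stage least squares: regress X on G, then Y on the fitted values of X.
    b1, b0 = np.polyfit(G, X, 1)                 # stage 1 (slope, intercept)
    x_hat = b0 + b1 * G
    iv = np.polyfit(x_hat, Y, 1)[0]              # stage 2 slope = IV estimate

    print(f"OLS ~ {ols:.2f} (biased upwards), IV/2SLS ~ {iv:.2f} (close to the true 0.5)")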

Friday, February 22, 2019

Taller Woman Dance Partner

In the book Think Stats, there's an exercise to work out the percentage of dance couples in which the woman is taller than the man, when couples are paired up at random. Mean heights (cm) & their variances are given as 178 & 59.4 for men, & 163 & 52.8 for women.

The two height distributions, for men & women, can be assumed to be Normal. The solution is to work out the total probability, over the two curves, of the condition height_woman > height_man, and this needs to cover every height point (h), i.e. the entire spread of the two curves (-∞, ∞). In other words, for each height point h, the integral of the men's curve over (-∞, h) (men with height < h) is weighted by the fraction of women at around that height (the women's curve over a small window at h), and these contributions are summed over all h.

There are also empirical solutions, where N data points are drawn from two Normal distributions with the appropriate mean & sd (variance) for men & women. These are paired up at random (and averaged over k runs) to compute the fraction of pairs with a taller woman.
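
A minimal numpy version of this empirical approach, using the means & variances given above:

    import numpy as np

    # Draw heights, pair the i-th woman with the i-th man (random pairing),
    # and count the couples where the woman is taller; averaged over k runs.
    rng = np.random.default_rng(42)
    k_runs, n = 20, 100_000

    fractions = []
    for _ in range(k_runs):
        men = rng.normal(178, np.sqrt(59.4), n)
        women = rng.normal(163, np.sqrt(52.8), n)
        fractions.append(np.mean(women > men))

    print(f"~{100 * np.mean(fractions):.1f}% of couples have a taller woman")   # about 7.8%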

The area computation can also be approximated by limiting to the ±3 sd range (includes 99.7%) of height values on either side of the two curves (140cm to 185cm). Then by sliding along the height values (h) starting from h=185 down in steps of size (s) s=0.5 or so, compute z at each point:

z = (h - m)/ sd, where m & sd are the corresponding mean & standard deviation of the two curves.

Refer to the one-sided standard normal value to compute the percentage of women to the right of h (> z) & the percentage of men to the left of h (< z). The contribution at point h is the product of the percentage of men below h and the percentage of women falling within that step window (the difference between consecutive right-tail values). Summing these over all h gives the final percentage. The equivalent solution using NORMDIST yields a likelihood of ~7.5%, slightly below the expected value (due to the coarse step size of 0.5).
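
The same stepped summation can be written as a short Python equivalent of the NORMDIST sheet (this mirrors the spreadsheet columns shown further below):

    import numpy as np
    from scipy.stats import norm

    m_w, sd_w = 163, np.sqrt(52.8)   # women
    m_m, sd_m = 178, np.sqrt(59.4)   # men
    step = 0.5

    total, prev_right = 0.0, 0.0
    for h in np.arange(185.5, 140, -step):         # slide down from 185.5
        right_w = 1 - norm.cdf((h - m_w) / sd_w)   # % of women taller than h
        in_window = right_w - prev_right           # % of women within this 0.5 cm window
        left_m = norm.cdf((h - m_m) / sd_m)        # % of men shorter than h
        total += left_m * in_window
        prev_right = right_w

    print(f"~{100 * total:.1f}% couples with a taller woman")   # about 7.5%, as in the sheet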

C1 plots the percentage of women to the right & the percentage of men to the left at different height values. C2 is the likelihood of seeing a couple with a taller woman within each step window of size 0.5. Interestingly, the peak in C2 is between heights 172-172.5, about 1.24 sd from the women's mean (163) & 0.78 sd from the men's mean (178). The spike at the end of curve C2 at point 185.5 is for the likelihood of all heights > 185.5, i.e. the window (185.5, ∞).

Playing around with different height & variance values yields other results. For instance, at the moment the two means are separated by roughly 2 sd. If this is reduced to about 1 sd, by moving the mean height of women (or men) to about 170.5 cm, the final likelihood jumps to about 23%. This is understandable, since the population now has far more taller women. The height variance for men is larger than for women; setting both to the identical value 52.8 (fewer shorter men) lowers the percentage to about 6.9%, while setting both to 59.4 (more taller women) increases it to about 8.1%.





Sample data points from Confidence_Interval_Heights.ods worksheet:


Point (p)   z_f (women, using p)   r_f = % to right   r_f_delta = % in between   z_m (men, using p)   l_m = % to left   l = l_m * r_f_delta
185.5       3.0964606              0.0009792          0.0009792                  0.9731237            0.8347541         0.0008174
185         3.0276504              0.0012323          0.0002531                  0.9082488            0.8181266         0.0002071
184.5       2.9588401              0.0015440          0.0003117                  0.8433739            0.8004903         0.0002495
184         2.8900299              0.0019260          0.0003820                  0.7784989            0.7818625         0.0002987
183.5       2.8212196              0.0023921          0.0004660                  0.7136240            0.7622702         0.0003553
183         2.7524094              0.0029579          0.0005659                  0.6487491            0.7417497         0.0004197
182.5       2.6835992              0.0036417          0.0006838                  0.5838742            0.7203475         0.0004926
182         2.6147889              0.0044641          0.0008224                  0.5189993            0.6981194         0.0005741

...

Thursday, February 21, 2019

Poincaré and the Baker

There is a story about the famous French mathematician Poincaré, in which he kept track of the weight of the loaves of bread that he bought for a year from one single baker. At the end of the year, he complained to the police that the baker had been cheating people, selling loaves with a mean weight of 950g instead of 1000g. The baker was issued a warning & let off.

Poincaré continued weighing the loaves that he bought from the baker through the next year, at the end of which he complained again. This time, even though the mean weight of the loaves was 1000g, he pointed out that the baker was still baking loaves with lower weights, but handing out the heavier ones to Poincaré. The story, though made up, makes for an interesting exercise for an introductory stats class.

The crux of the solution, as shown by others, is to model the baked loaves as a Normal distribution.

At the end of year 1

The expected mean is m=1000g; since the baker had been cheating, the mean shows up as 950g. The standard deviation (sd) is assumed to remain unaffected. To judge whether the drop in mean from the expected value m1=1000 to the observed m2=950 (for identical sd) is significant, a hypothesis test can be performed to ascertain whether the sampled loaves are drawn from the Normal distribution having mean m1=1000, or from a separate one having mean m2=950. The null hypothesis is H0(m1 = m2) vs. the alternate hypothesis H1(m1 != m2).

|z| = |(m1 - m2)/(sd/ sqrt(n))|, where n is the number of samples.

Assuming Poincaré bought bread at the rate of 1 loaf/ day (n=365), & sd=50g,

z=19.1, much larger than the critical value of 2.576 at the 1% level of significance, so the null hypothesis can be rejected. Poincaré was right in calling out the baker's cheating!

From the original story we are only certain about the values m1=1000 & m2=950. Playing around with the other values, for instance changing n to 52 (1 loaf/ week) or sd to 100g (a novice baker, more variance), would yield other z values & possibly different decisions about the null hypothesis.
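
A small Python sketch of the same calculation, which makes it easy to play with n & sd:

    import math

    def z_statistic(m1, m2, sd, n):
        # |z| for testing whether the observed mean m2 differs from the expected mean m1.
        return abs(m1 - m2) / (sd / math.sqrt(n))

    print(round(z_statistic(1000, 950, 50, 365), 1))   # 19.1 for 1 loaf/day, sd=50g
    print(round(z_statistic(1000, 950, 50, 52), 1))    # 7.2 for 1 loaf/week
    print(round(z_statistic(1000, 950, 100, 52), 1))   # 3.6, still above 2.576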

At the end of year 2

What Poincaré observes is a different sort of curve, with a mean of 1000g but far narrower than a Normal. A few quick checks make him realize that he is looking at an Extreme Value Distribution (EVD), or the Gumbel distribution. Using the data from his set of bread samples, he could easily work out the parameters of the EVD (mean, median, location μ, scale β, etc.). Evident from the EVD curve was the baker's modus operandi of handing out specially earmarked heavier loaves to Poincaré (handing out only the heaviest ones would yield the narrowest curve).
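
A simulation sketch of year 2 using numpy/ scipy: the daily batch size of 4 is a made-up assumption, chosen only so that the mean of the handed-out maxima lands near 1000g.

    import numpy as np
    from scipy.stats import gumbel_r

    # The baker still bakes N(950, 50) loaves, but hands Poincaré the heaviest
    # loaf of each day's (hypothetical) batch of 4; the daily maxima follow a
    # narrow, Gumbel-like curve with a mean close to 1000 g.
    rng = np.random.default_rng(1)
    batches = rng.normal(950, 50, size=(365, 4))
    poincare_loaves = batches.max(axis=1)

    loc, scale = gumbel_r.fit(poincare_loaves)      # fitted location (mu) & scale (beta)
    print(f"mean ~ {poincare_loaves.mean():.0f} g, Gumbel fit: mu ~ {loc:.0f}, beta ~ {scale:.0f}")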