Wednesday, March 31, 2021

Flip side to Technology - Extractivism, Exploitation, Inequality, Disparity, Ecological Damage

Anatomy of an AI system is a real eye-opener. It gives a high-level view of the enormous complexity and scale of the supply chains - manufacturers, assemblers, miners, transporters and other links - that collaborate at a global scale to commercialize something like an Amazon Echo device.

The authors explain how the extreme exploitation of human labour, the environment and natural resources at various levels largely remains unacknowledged and unaccounted for. Right from the mining of rare elements, to smelting and refining, to shipping and transportation, to component manufacture and assembly, these activities mostly happen under inhumane conditions, with complete disregard for the health, well-being and safety of workers who are paid miserable wages. These processes also cause irreversible damage to the ecology and the environment at large.

Amazon Echo, as an AI-powered, self-learning device connected to cloud-based web services, opens up several privacy, safety, intrusion and digital exploitation concerns for the end-user. Yet focusing solely on the Echo would amount to missing the forest for the trees! Most issues highlighted here are equally true of technologies from many other traditional and non-AI (or not-yet-AI) powered sectors like automobiles, electronics, telecom, etc. It is time to give these issues some thought and bring a stop to the irreversible damage to human lives, well-being, finances and equality, and to the environment and planetary resources!

Monday, March 29, 2021

Doing Better Data Science

In the article titled "Field Notes #1 - Easy Does It", author Will Kurt highlights a key aspect of doing good Data Science - Simplicity. This means, first and foremost, getting a good understanding of the problem to be solved. Later, among the candidate hypotheses & possible solutions/ models, it means favouring the simpler ones, or at least giving them a fair/ equal chance at proving their worth in tests that employ standardized performance metrics.

Another article of relevance for Data Scientists is from the allied domain of Statistics, titled "The 10 most common mistakes with statistics, and how to avoid them". The article, based on the eLife paper by Makin and Orban de Xivry, lists the ten most common statistical mistakes in scientific research. The paper also includes tips for Reviewers to detect such mistakes and for Researchers (authors) to avoid them.

Many of the issues listed are linked to p-value computations, which are used to establish the significance of statistical tests & draw conclusions from them. Incorrect usage, misinterpretation, missing corrections (such as for multiple comparisons), manipulation, etc. render the tests ineffective and result in insignificant findings getting reported as significant. Issues with sampling and inadequate control groups, along with faulty attempts by authors to establish causation where none exists, are also common in the scientific literature.
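
As a toy illustration of one of these pitfalls, here is a small Python simulation sketch (mine, not from the paper) of the multiple-comparisons problem: even on pure noise, a handful of tests come out "significant" unless a correction such as Bonferroni is applied.

# Toy illustration (not from the paper): uncorrected multiple comparisons on pure noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, alpha = 100, 0.05

# 100 two-sample t-tests where the null hypothesis is true by construction
pvals = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(n_tests)
])

print("uncorrected 'significant' results:", np.sum(pvals < alpha))           # ~5 false positives expected
print("Bonferroni-corrected results:     ", np.sum(pvals < alpha / n_tests)) # usually 0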

As per the authors, these issues typically arise from ineffective experimental designs, inappropriate analyses and/or flawed reasoning. A strong publication bias & the pressure on researchers to publish significant results, as opposed to correct but failed experiments, make matters worse. Moreover, senior researchers entrusted with mentoring juniors are often unfamiliar with the fundamentals and prone to making these errors themselves. Their aversion to criticism becomes a further roadblock to improvement.

While correct mentoring of early-stage researchers will certainly help, change can also come from making science open access. Open science/ research must include details on all aspects of the study and all the materials involved, such as the data and the analysis code. At the level of institutions and funders, incentivizing correctness over productivity can also prove beneficial.

Friday, April 17, 2020

Analysis of Deaths Registered In Delhi Between 2015 - 2018

The Directorate of Economics and Statistics & Office of Chief Registrar (Births & Deaths), Government of National Capital Territory (NCT) of Delhi annually publishes its report on registrations of births and deaths that have taken place within the NCT of Delhi. The report, an overview of the Civil Registration System (CRS) in the NCT of Delhi, is a source of very useful statistics on births, deaths, infant mortality and so on within the Delhi region.

The detailed reports can be downloaded as pdf files from the website of the Department of Economics and Statistics, Delhi Government. Anonymized, cleaned data is made available as tables in the section titled "STATISTICAL TABLES" of the pdf files. The births and deaths data is aggregated by attributes like age, profession, gender, etc.

Approach

In this article, an analysis has been done of tables D-4 (DEATHS BY SEX AND MONTH OF OCCURRENCE (URBAN)), D-5 (DEATHS BY TYPE OF ATTENTION AT DEATH (URBAN)) & D-8 (DEATHS BY AGE, OCCUPATION AND SEX (URBAN)) from the above pdfs. Data for the four years 2015-18 (presently downloadable from the department's website) has been taken from these tables to evaluate mortality trends for the three most populous urban districts of Delhi: North DMC, South DMC & East DMC.
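
For reference, here is a rough sketch of how such tables could be pulled out of the report pdfs programmatically; the file name below is a placeholder, and the generic table extraction still needs per-table cleanup before analysis.

# Rough sketch: extract tables from an annual report pdf (placeholder file name).
import pdfplumber
import pandas as pd

tables = []
with pdfplumber.open("Annual_Report_Registration_Births_Deaths_2018.pdf") as pdf:
    for page in pdf.pages:                      # could be restricted to the "STATISTICAL TABLES" pages
        for table in page.extract_tables():
            if table:                           # skip pages without detectable tables
                tables.append(pd.DataFrame(table[1:], columns=table[0]))

print(len(tables), "tables extracted")          # D-4, D-5, D-8, etc. still need manual identification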

Analysis

1) Cyclic Trends: Absolute death counts for the period Jan-2015 to Dec-2018 are plotted in table "T1: Trends 2015-18". Another view of the same data, as each month's percentage of the annual total, is shown in table "T-2: Month/ Year_Total %".

Both tables clearly show that there is a spike in the number of deaths in the colder months of Dec to Feb. About 30% of all deaths in Delhi happen within these three months. The percentages are fairly consistent for both genders and across all 3 districts of North, South & East DMCs.

As summer sets in from March, the death percentages start dropping, reaching their lowest points, below 7% a month, in June & July as the monsoon arrives. Towards the end of the monsoon a second spike is seen around Aug/ Sep, followed by a dip in Oct/ Nov before the next winter, when the cyclic trend repeats.

The trends reported above are also seen with moving averages, plotted in table "T-3: 3-Monthly Moving Avg", across the three districts and genders. Similar trends, though not plotted here, are seen in moving averages over other window lengths (such as 2 & 4 months).
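
For reference, here is a minimal pandas sketch of the two derived views used above (T-2 and T-3), assuming the D-4 counts have already been flattened into a csv with hypothetical columns year, month, district, gender, deaths.

# Minimal sketch (hypothetical file/ column names): monthly share of the annual total (T-2 view)
# and a 3-month moving average of the raw death counts (T-3 view).
import pandas as pd

df = pd.read_csv("delhi_d4_deaths_2015_18.csv")   # columns: year, month, district, gender, deaths
df = df.sort_values(["district", "gender", "year", "month"])

# Each month's deaths as a percentage of that year's total, per district & gender
df["pct_of_year"] = (
    df["deaths"] / df.groupby(["year", "district", "gender"])["deaths"].transform("sum") * 100
)

# 3-month moving average of the counts, per district & gender
df["moving_avg_3m"] = (
    df.groupby(["district", "gender"])["deaths"].transform(lambda s: s.rolling(3).mean())
)

print(df.head())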

2) Gender Differences: In terms of differences between genders, far more deaths of males than of females were recorded during the peak winters in Delhi between 2015-18. This is shown in table "T4: Difference Male & Female".

From a peak gap of about 1000 in the colder months, the difference drops to the 550-600 range in the summer months, particularly for the North & South DMCs. A narrower gap is seen in the East DMC, largely attributable to its smaller population compared to the other two districts.

Table "T5: Percentage Male/ Female*100" plots the ratio of male to female deaths (as a percentage) over the months. The curves of the three districts, though quite wavy, primarily stay within a rough band of 1.5 to 1.7 times as many male deaths as female deaths. The spike of the winter months is clearly visible in table T5 as well.

3) Cross District Differences in Attention Type: Table "T6: Percentage Attention Type" plots the different forms of attention (hospital, non-institutional, doctor/ nurse, family, etc.) received by the person at the time of death.

While in East DMC over 60% of people were under institutional care at the time of death, the figure is almost 20 percentage points lower for the North & South DMCs. For the latter two districts, the percentage receiving No Medical Attention has remained consistently high, with South DMC particularly high at over 40%.

4) Vulnerable Age: Finally, a plot of the vulnerable age groups is shown in table "T7: Age 55 & Above". A clear spike in deaths is seen in the 55-64 age group, perhaps attributable to retirement from active work & the subsequent lifestyle changes. The gender skew within the 55-64 age group may again be due to the inherent skew in the workforce, which has a far higher number of male workers who would be subject to the effects of retirement. This aspect could be probed further using other data sources.

The 65-69 age group shows far lower mortality, perhaps because this group is better adjusted and healthier. Finally, a spike is seen in the number of deaths among super senior citizens aged 70 & above, largely attributable to advancing age and frail health.

Conclusion

The analysis in this article used the data published annually by the Directorate of Economics and Statistics & Office of Chief Registrar (Births & Deaths), Government of National Capital Territory (NCT) of Delhi on registrations of births and deaths within the NCT of Delhi. Mortality data from the three most populous districts of Delhi - North DMC, South DMC and East DMC - were analysed, and some specific monthly, yearly and age-group related trends are reported here.

The analysis can easily be repeated for the other districts of Delhi, as well as for data from later years as and when those are made available by the department. The data may also be used for various modeling and simulation purposes and for training machine learning algorithms. More real-time sharing of raw (anonymized, aggregated) data by the department, via APIs or other data feeds, may be looked at in the future. This would benefit the research and data science community, who could put the data to good use for public health and welfare purposes.

Resources:

Downloadable Datasheets For Analysis:

Friday, February 28, 2020

Defence R&D Organisation Young Scientists Lab (DYSL)


Recently there was quite a lot of buzz in the media about the launch of the DRDO Young Scientists Labs (DYSL). Five such labs have been formed by DRDO, each headed by a young director under the age of 35! Each lab has its own specialized focus area from among fields such as AI, Quantum Computing, Cognitive Technologies, Asymmetric Technologies and Smart Materials.

When trying to look for specifics on what these labs are doing, particularly the AI lab, there is very little to go by for now. While a lot of information about the long-established DRDO Centre for Artificial Intelligence and Robotics (CAIR) lab is available on the DRDO website, there's practically nothing there regarding the newly formed DRDO Young Scientists Lab on AI (DYSL-AI). Nor are details available anywhere else in the public domain, at least as of end-Feb 2020. While these would certainly get updated soon, for now there are just these interviews with the directors of the DYSL labs:

  • Doordarshan's Y-Factor Interview with the 5 DYSL Directors Mr. Parvathaneni Shiva Prasad, Mr. Manish Pratap Singh, Mr. Ramakrishnan Raghavan, Mr. Santu Sardar, Mr. Sunny Manchanda

  • Rajya Sabha TV Interview with DYSL-AI Director Mr. Sunny Manchanda

Wednesday, February 26, 2020

Sampling Plan for Binomial Population with Zero Defects

Rough notes on sample size requirement calculations at a given confidence level for a Binomial population - one having a probability p of success & (1 - p) of failure. The first article of relevance is Binomial Confidence Interval, which lists the different approaches to be taken when dealing with:

  • Large n (> 15), large p (>0.1) => Normal Approximation
  • Large n (> 15), small p (<0.1) => Poisson Approximation
  • Small n (< 15), small p (<0.1) => Binomial Table
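
To get a feel for how these regimes differ, here is a small comparison sketch (the k, n values are purely illustrative) of the Normal approximation against the exact Clopper-Pearson (Binomial/ Beta based) interval:

# Comparison sketch: Normal (Wald) approximation vs. the exact Clopper-Pearson interval.
import numpy as np
from scipy import stats

def normal_ci(k, n, conf=0.95):
    """Normal approximation: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p = k / n
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    half = z * np.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def exact_ci(k, n, conf=0.95):
    """Clopper-Pearson (exact, Beta-quantile based) interval."""
    a = 1 - conf
    lo = 0.0 if k == 0 else stats.beta.ppf(a / 2, k, n - k + 1)
    hi = 1.0 if k == n else stats.beta.ppf(1 - a / 2, k + 1, n - k)
    return lo, hi

for k, n in [(40, 100), (3, 100), (2, 12)]:     # large p, small p, small n
    print(f"k={k:2d}, n={n:3d}  normal={normal_ci(k, n)}  exact={exact_ci(k, n)}")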

On the other side, there are derivatives of the Bayes Success Run theorem, such as Acceptance Sampling, Zero Defect Sampling, etc., used to work out statistically valid sampling plans. These approaches are based on a successful run of n tests in which either zero or at most an upper-bounded k failures are seen.

These approaches are used in industries like healthcare, automotive, military, etc. for performing inspections, checks and certifications of components, parts and devices. The sampling could be single sampling (one sample of size n with acceptance number c), double sampling (a first smaller sample n1 with acceptance number c1 & a second larger sample n2 with acceptance number c2, used if the test on sample n1 shows more than c1 failures), or other sequential sampling versions of these. A few rule-of-thumb approximations have also emerged in practice based on the success run technique:

  • Rule of 3: for a success run of length n with zero defects, this gives an approximate 95% upper confidence bound on the failure probability of p = 3/n.
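
A minimal sketch of the zero-failure success-run sample-size calculation (one common form is n = ln(1 - C)/ ln(R), for reliability R demonstrated at confidence C) alongside the Rule of 3 bound:

# Success-run sample size for zero failures, and the Rule of 3 approximation.
from math import ceil, log

def success_run_n(reliability: float, confidence: float) -> int:
    """Smallest n with zero failures demonstrating `reliability` at `confidence`: n = ln(1-C)/ln(R)."""
    return ceil(log(1 - confidence) / log(reliability))

def rule_of_three_bound(n: int) -> float:
    """Approximate 95% upper bound on the failure probability after n failure-free trials."""
    return 3.0 / n

print(success_run_n(0.95, 0.95))   # 59 failure-free samples for 95% reliability at 95% confidence
print(success_run_n(0.99, 0.95))   # 299 samples for 99% reliability at 95% confidence
print(rule_of_three_bound(59))     # ~0.05, consistent with the exact bound 1 - 0.05**(1/59)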

Footnote on Distributions:
  • The Poisson confidence interval is derived from the Gamma distribution - which is defined using two parameters, shape & scale. The Exponential, Erlang & Chi-Squared distributions are all special cases of the Gamma distribution. The Gamma distribution is used in areas such as prediction of wait times, insurance claims, wireless communication signal power fading, the age distribution of cancer events, inter-spike intervals and genomics. In Bayesian statistics, Gamma is also the conjugate prior for the rate parameter of the Poisson & exponential distributions.
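
As a small illustration of the first point above, the exact (Garwood) Poisson confidence interval for an observed count k can be written directly in terms of Chi-squared (i.e. Gamma) quantiles:

# Exact Poisson confidence interval via Chi-squared (Gamma) quantiles.
from scipy import stats

def poisson_ci(k: int, conf: float = 0.95):
    a = 1 - conf
    lower = 0.0 if k == 0 else stats.chi2.ppf(a / 2, 2 * k) / 2
    upper = stats.chi2.ppf(1 - a / 2, 2 * (k + 1)) / 2
    return lower, upper

print(poisson_ci(7))   # 7 observed events -> roughly (2.8, 14.4)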

Tuesday, March 26, 2019

Opinions On A Topic

Media agencies of the day are busy flooding us with news - wanted, unwanted, real, fake, good, bad, ugly, whatever. Yet, for the user, the challenge of staying truly updated has never been tougher. Sifting the wheat from the chaff is both computationally & practically hard!

There's a real need to automatically detect, flag & block misleading information from propagating. Though the technology for this doesn't quite exist at the moment, offerings are very likely to come up soon & get refined over time to nail the problem well enough. While we await breakthroughs on that front, the best bet for now is to rely on traditional human judgment.

- Make use of a set (not just one or two) of trusted media sources that employ professional & expert journalists. Rely on their expertise to collect & present the facts correctly. Assuming (hopefully) that these people/ organizations behave professionally, the information that gets through these sources should be far better.

- Fact-check details across the entire set of sources. This helps mitigate the temporary (or permanent), deliberate or inadvertent faltering, manipulation, influence, etc. of the odd source. Use the set as a weak quorum that collectively highlights & prevents the propagation of misinformation. Even if a few members falter, it is unlikely that all would; the majority would not allow the fakes to make it into their respective channels.

- The challenging part is when a certain piece shows up as breaking news on one channel & not the others. The default could be to label it fake/ unverified, with the following possible outcomes for the news piece:

 Case 1: Turns out fake, doesn't show up on the other sources
    => Remains Correctly Marked Fake

 Case 2: Turns out to be genuine & eventually shows up on the other/ majority sources
    => Gets Correctly Marked True

 Case 3: Is genuine, but was acquired via some form of journalistic brilliance (an exposé, undercover journalism, etc.) that can't be re-run, or is about a region/ issue largely ignored by mainstream media unwilling to do the verification, or for some other reason can't be verified
    => Remains Incorrectly Marked Fake

Case 3 is obviously the toughest to crack. While some specifics may be impossible to verify, other allied details could be easier to access & verify. Once some other media groups (beyond the one that reported the piece) get involved in this secondary verification, there is some likelihood of the true facts emerging.

For the marginalized, there are social groups & organizations, governmental & non-governmental, that publish reports on issues from ground zero. At the same time, as connectivity improves, citizens themselves will be able to bring local issues onto national & international platforms. In the interim, these will have to be relied upon until commercial interests & the mainstream media eventually bring the marginalized into their fold. Nonetheless, much more thought & effort is needed to check the spread of misinformation.

Finally, here's a little script 'op-on.sh' / 'op-on.py' (works/ tested on *nix desktops) to look up opinions (buzz) on any given topic across a set of media agencies of repute. Alternatively, a bookmarklet can be added to the browser, which enables looking up the opinions across the sites. The op-on bookmarklet (tested on Firefox & Chrome) can be installed by right-clicking & adding it as a bookmark in the browser (or by copying the script into the URL of a new bookmark). Pop-up blockers in the browser will need to be temporarily disabled (e.g. by clicking allow pop-ups in Firefox) for the script to work.

The set of media agencies that these scripts look up includes groups like TOI, IE, India Today, Times Now, WION, NDTV, The Hindu, HT, The Print, The Quint, The Week, Reuters, BBC, and so on. This might help the curious human reader look up all those sources for opinions on any topic of interest.
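
The original scripts aren't reproduced here, but the underlying idea is simple enough to sketch. The snippet below is an illustrative stand-in (a small subset of sites, using generic site-restricted web searches), not the actual op-on script:

#!/usr/bin/env python3
# Illustrative stand-in for the op-on idea: open one search-results tab per news site for a topic.
import sys
import urllib.parse
import webbrowser

SITES = ["timesofindia.indiatimes.com", "indianexpress.com", "thehindu.com",
         "hindustantimes.com", "ndtv.com", "reuters.com", "bbc.com"]

def lookup_opinions(topic: str) -> None:
    """Open a site-restricted search for the topic, one browser tab per site."""
    for site in SITES:
        query = urllib.parse.quote_plus(f"site:{site} {topic}")
        webbrowser.open_new_tab(f"https://www.google.com/search?q={query}")

if __name__ == "__main__":
    lookup_opinions(" ".join(sys.argv[1:]) or "data privacy")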

Update 1 (16-Sep-19): Some interesting developments:

Friday, February 22, 2019

Taller Woman Dance Partner

In the book Think Stats, there's an exercise to work out the percentage of dance couples in which the woman is taller, when partners are paired up at random. The mean heights (cm) & variances are given as 178 & 59.4 for men, and 163 & 52.8 for women.

The two height distributions for men & women can be assumed Normal. The solution is to work out the probability that height_woman > height_man over the entire spread of the two curves (-∞, ∞). In other words, for every height point (h), take the fraction of women falling in a small window around h (the women's density at h) multiplied by the fraction of men shorter than h (the integral of the men's curve over (-∞, h)), and sum/ integrate this product over all h.

There are also empirical solutions, where N data points are drawn from the two Normal distributions with the appropriate mean & sd for men & women. These are paired up at random (and averaged over k runs) to compute the fraction of pairs with a taller woman.
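
A minimal simulation sketch of that empirical approach (not the book's reference code), using the means & variances given above:

# Empirical estimate: draw N heights per gender, pair them at random, average over K runs.
import numpy as np

rng = np.random.default_rng(42)
N, K = 100_000, 10
sd_m, sd_w = np.sqrt(59.4), np.sqrt(52.8)

fractions = []
for _ in range(K):
    men = rng.normal(178, sd_m, N)
    women = rng.normal(163, sd_w, N)
    fractions.append(np.mean(women > men))   # random pairing == elementwise comparison of fresh samples

print(f"fraction of couples with a taller woman: {np.mean(fractions):.4f}")   # comes out near 0.078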

The area computation can also be approximated by limiting the height values to roughly the ±3 sd range (which covers 99.7%) of the two curves, about 140 cm to 185 cm, with the tail above 185 lumped into the last window. Then, sliding along the height values (h) starting from h=185 downwards in steps of size s=0.5 or so, compute z at each point:

z = (h - m)/ sd, where m & sd are the corresponding mean & standard deviation of the two curves.

Using the one-sided standard normal table, compute the percentage of women to the right of h (> z) & the percentage of men to the left of h (< z). The fraction of women falling within each step window (the difference between successive right-tail values), multiplied by the fraction of men to the left of h, gives the likelihood contribution at point h. A summation of these contributions over all h gives the final percentage. The equivalent solution using NORMDIST yields a likelihood of ~7.5%, slightly below the expected value (due to the coarse step size of 0.5).
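
The same stepped computation can be written as a short script that mirrors the worksheet columns, with the closed form via the difference distribution as a cross-check:

# Discretized sum over 0.5 cm windows (as in the worksheet), plus a closed-form cross-check.
import numpy as np
from scipy.stats import norm

sd_w, sd_m = np.sqrt(52.8), np.sqrt(59.4)
s = 0.5
hs = np.arange(140, 185.5, s)                                               # height grid, ~±3 sd

women_in_window = norm.cdf(hs + s, 163, sd_w) - norm.cdf(hs, 163, sd_w)     # r_f_delta
men_to_left = norm.cdf(hs, 178, sd_m)                                       # l_m
tail = (1 - norm.cdf(185.5, 163, sd_w)) * norm.cdf(185.5, 178, sd_m)        # lumped window (185.5, ∞)
print("stepped sum :", np.sum(women_in_window * men_to_left) + tail)        # ≈ 0.075 on this coarse grid

# Closed form: H_w - H_m ~ N(163 - 178, 52.8 + 59.4), so P(H_w > H_m) = 1 - Phi(15/ sqrt(112.2))
print("closed form :", 1 - norm.cdf(15 / np.sqrt(112.2)))                   # ≈ 0.078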

C1 plots the percentage of women to the right & the percentage of men to the left at different height values. C2 is the likelihood of seeing a couple with a taller woman within each step window of size 0.5. Interestingly, the peak in C2 is between heights 172-172.5, about 1.24 sd from the women's mean (163) & 0.78 sd from the men's mean (178). The spike at the end of curve C2 at point 185.5 is the likelihood for all heights > 185.5, i.e. the window (185.5, ∞).

Playing around with different mean & variance values yields different final results. For instance, the two means are presently separated by about 2 sd. If this is reduced to 1 sd, by moving the mean height of women (or men) to about 170.5 cm, the final likelihood jumps to about 23%. This is understandable, since the population now has far more taller women. The height variance for men is larger than for women: setting both to the identical value of 52.8 (fewer shorter men) lowers the percentage to about 6.9%, while setting both to 59.4 (more taller women) increases it to about 8.1%.

Sample data points from Confidence_Interval_Heights.ods worksheet:


Point (p)   Women: z_f    r_f           r_f_delta       Men: z_m     l_m          Likelihood
            (using p)     (% to right)  (% in between)  (using p)    (% to left)  l = l_m * r_f_delta
185.5       3.0964606     0.0009792     0.0009792       0.9731237    0.8347541    0.0008174
185         3.0276504     0.0012323     0.0002531       0.9082488    0.8181266    0.0002071
184.5       2.9588401     0.0015440     0.0003117       0.8433739    0.8004903    0.0002495
184         2.8900299     0.0019260     0.0003820       0.7784989    0.7818625    0.0002987
183.5       2.8212196     0.0023921     0.0004660       0.7136240    0.7622702    0.0003553
183         2.7524094     0.0029579     0.0005659       0.6487491    0.7417497    0.0004197
182.5       2.6835992     0.0036417     0.0006838       0.5838742    0.7203475    0.0004926
182         2.6147889     0.0044641     0.0008224       0.5189993    0.6981194    0.0005741

...

Thursday, February 21, 2019

Poincaré and the Baker

There is a story about the famous French mathematician Poincaré, who kept track of the weight of the loaves of bread he bought for a year from a single baker. At the end of the year he complained to the police that the baker had been cheating people, selling loaves with a mean weight of 950g instead of 1000g. The baker was issued a warning & let off.

Poincaré continued weighing the loaves he bought from the baker through the next year, at the end of which he complained again. This time, even though the mean weight of the loaves was 1000g, he pointed out that the baker was still baking underweight loaves, but handing out the heavier ones to Poincaré. The story, though made up, makes for an interesting exercise for an introductory stats class.

The crux of the solution, as shown by others, is to model the weights of the baked loaves as a Normal distribution.

At the end of year 1

The expected mean is m1=1000g; since the baker had been cheating, the observed mean shows up as m2=950g. The standard deviation (sd) is assumed to remain unaffected. To check whether the drop from the expected mean m1=1000 to the observed m2=950 is significant (for identical sd), a hypothesis test can be performed to ascertain whether the sampled loaves are drawn from the Normal distribution with mean m1=1000 or from a separate one with mean m2=950. The null hypothesis is H0 (m1 = m2) vs. the alternate hypothesis H1 (m1 != m2).

|z| = |(m1-m2)/(sd/sqrt(n))| where n is the no of samples.

With an assumption of Poincaré's frequency of buying at 1 loaf/ day (n=365), & sd=50,

z = 19.1, much larger than the critical value at the 1% level of significance (2.576), so the null hypothesis can be rejected. Poincaré was right in calling out the baker's cheating!

From the original story we are only certain about the values m1=1000 & m2=950. Playing around with the other values, for instance changing n to 52 (1 loaf/ week) or sd to 100g (a novice baker, with more variance), could yield other z values & different decisions about the null hypothesis.
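
A quick numerical check of the year-1 test under a few different assumptions (only m1=1000 and m2=950 come from the story; the n and sd combinations below are illustrative):

# z statistic for the observed mean vs. the expected mean, under different n and sd assumptions.
from math import sqrt

def z_stat(m1: float, m2: float, sd: float, n: int) -> float:
    return abs(m1 - m2) / (sd / sqrt(n))

for n, sd in [(365, 50), (52, 50), (52, 100), (12, 100)]:
    z = z_stat(1000, 950, sd, n)
    verdict = "reject H0" if z > 2.576 else "cannot reject H0"
    print(f"n={n:3d}, sd={sd:3d} -> z={z:5.1f} ({verdict} at the 1% level)")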

At the end of year 2

What Poincaré observes is a different sort of curve, with a mean of 1000g but far narrower than a Normal. A few quick checks make him realize that he's looking at an Extreme Value Distribution (EVD), i.e. the Gumbel distribution. Using the data from his set of bread samples, he could easily work out the parameters of the EVD (mean, median, location μ, scale β, etc.). Evident from the EVD curve was the baker's modus operandi of handing out specially earmarked heavier loaves (the heaviest ones would yield the narrowest curve) to Poincaré.
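
A small simulation sketch of the year-2 scenario, under the illustrative assumption that the baker still bakes loaves ~ N(950, 50) but always hands Poincaré the heaviest loaf out of a batch of four:

# Year-2 sketch: Poincaré receives the per-batch maximum, i.e. an extreme value sample.
import numpy as np

rng = np.random.default_rng(7)
batches = rng.normal(950, 50, size=(365, 4))    # one batch of 4 loaves per day for a year (assumed)
poincare_loaves = batches.max(axis=1)           # he always gets the heaviest loaf of the batch

print(f"mean = {poincare_loaves.mean():.1f} g")         # comes out close to 1000 g
print(f"sd   = {poincare_loaves.std(ddof=1):.1f} g")    # noticeably narrower than the baker's 50 g
# Fitting the distribution of these per-batch maxima is what an Extreme Value (Gumbel-type) model captures.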