Sunday, February 24, 2019

Memory Gene

Have a conjecture that someone is soon going to discover a memory gene in the human genome. This doesn't seem to have been done or published in any scientific literature so far. The concept of Genetic Memory from an online game comes close, but then that's fiction.

The idea of the memory gene is that this gene on the human genome will act as a memory card. Whatever data the individual writes to the memory gene in their lifetime can later be retrieved by their progeny in their lifetimes. The space available in the memory gene would be small compared to what is available in the brain. If the storage capacity of the human brain is taken to be a Petabyte (= 10^12 Kilobytes), the memory gene would hold about 10 Kilobytes. So very little can be written to the memory gene.

Unlike the brain, to which every bit of information (visual, aural, textual, etc.) can be written at will, writing to the memory gene would require some serious intent & need. Writing to the memory gene would be more akin to etching on a brass plate - strenuous but permanent. The intent would be largely triggered by the individual's experience(s), particularly ones that trigger strong emotions, perhaps ones beneficial to survival. Once written to the memory gene, this information would carry forward to the offspring.

The human genome is known to have about 2% coding DNA & the rest non-coding DNA. The coding portions carry the (genetic) instructions to synthesize proteins, while the purpose of the non-coding portions is not clearly known so far. The memory gene is likely to have a memory-addressing mechanism, followed by the actual memory data stored in the large non-coding portion.

At the early age of 2 or 3 years, when a good portion of brain development has happened in the individual, the memory recovery would begin. The mRNA, ribosome & the rest of the translation machinery would get to work translating the genetic code from the memory gene to synthesize the appropriate proteins & biomolecules of the brain cell. In the process the memory data would be restored block by block in the brain. This would perhaps happen over a period of low activity such as a night's sleep. The individual would later awaken to newly transferred knowledge about unknowns, which would appear to be intuitive. Since the memory recovery would take place at an early age, conflicts between the experiences of the individual and the ancestor wouldn't happen.
  
These are some basic features of the very complex memory gene. As mentioned earlier, this is purely a conjecture and shouldn't be taken otherwise. Look forward to exploring genuine scientific research in this space as it gets formalized & shared.

Update 1 (27-Aug-19):
For some real research take a look at the following:

 =>> Arc Gene:
 The neuronal gene Arc is required for synaptic plasticity and cognition. Its protein resembles retroviral/retrotransposon capsids in the way it transfers genetic material between cells, followed by activity-dependent translation. These studies throw light on a completely new way through which neurons could send genetic information to one another. More details from the 2018 publication available here:
  •    https://www.nih.gov/news-events/news-releases/memory-gene-goes-viral
  •    https://www.cell.com/cell/comments/S0092-8674(17)31504-0
  •    https://www.ncbi.nlm.nih.gov/pubmed/29328915/

  =>> Memories pass between generations (2013):
 When mice of the grandparent generation are taught to fear an odor, the next two generations (children & grandchildren) retain the fear:
  •    https://www.bbc.com/news/health-25156510
  •    https://www.nature.com/articles/nn.3594
  •    https://www.nature.com/articles/nn.3603

 =>> Epigenetics & Cellular Memory (1970s onwards):
  •    https://en.wikipedia.org/w/index.php?title=Genetic_memory_(biology)&oldid=882561903
  •    https://en.wikipedia.org/w/index.php?title=Genomic_imprinting&oldid=908987981

  =>> Psychology - Genetic Memory (1940s onwards): Largely focused on the phenomenon of knowing things that weren't explicitly learned by an individual:
  •    https://blogs.scientificamerican.com/guest-blog/genetic-memory-how-we-know-things-we-never-learned/
  •    https://en.wikipedia.org/w/index.php?title=Genetic_memory_(psychology)&oldid=904552075
  •    http://www.bahaistudies.net/asma/The-Concept-of-the-Collective-Unconscious.pdf
  •    https://en.wikipedia.org/wiki/Collective_unconscious

Friday, February 22, 2019

Taller Woman Dance Partner

In the book Think Stats, there's an exercise to work out the percentage of dance couples in which the woman is taller, when partners are paired up at random. The mean heights (cm) & their variances are given as 178 & 59.4 for men, & 163 & 52.8 for women.

The two height distributions, for men & for women, can be assumed Normal. The solution is to work out the total probability across the two curves where the condition height_woman > height_man holds. This needs to account for every height point (h) across the entire spread of the two curves (-∞, ∞). In other words, integrate over all h the women's height density at h multiplied by the fraction of men shorter than h:

    P(taller woman) = ∫ f_women(h) * F_men(h) dh, over h in (-∞, ∞)

There are empirical solutions in which N data points are drawn from two Normal distributions with the appropriate mean & sd (variance) for men & women. These are paired up at random (averaged over k runs) to compute the fraction of pairs with a taller woman.
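
For instance, a minimal Python sketch of this empirical approach (assuming numpy is available; N, K & the seed are arbitrary choices, while the means & variances are the ones quoted above):

    # Monte Carlo estimate of the fraction of couples with a taller woman
    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 100_000, 10                           # N random couples per run, K runs

    estimates = []
    for _ in range(K):
        men = rng.normal(178.0, np.sqrt(59.4), N)
        women = rng.normal(163.0, np.sqrt(52.8), N)
        estimates.append(np.mean(women > men))   # fraction of couples with a taller woman

    print(f"taller-woman couples: {np.mean(estimates):.2%}")   # typically ~7.8%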

The area computation can also be approximated by limiting it to roughly the ±3 sd range (which includes 99.7%) of height values across the two curves, about 140 cm to 185 cm. Then, sliding along the height values (h) starting from h=185 downwards in steps of size s=0.5 or so, compute z at each point:

z = (h - m)/ sd, where m & sd are the corresponding mean & standard deviation of the two curves.

Refer to the one-sided standard Normal table to get the percentage of women to the right of h (> z) & the percentage of men to the left of h (< z). The percentage of women falling within each step window (the difference between successive right-tail values) multiplied by the percentage of men to the left of h gives the contribution of that window; summing these over all h gives the final percentage. The equivalent solution using NORMDIST yields a likelihood of ~7.5%, slightly below the expected value (due to the coarse step size of 0.5).
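
A rough Python equivalent of the step-wise NORMDIST computation (assuming scipy; the step size 0.5 & the 140 to 185.5 range mirror the description above), with the closed-form answer added for comparison:

    import numpy as np
    from scipy.stats import norm

    men = norm(178.0, np.sqrt(59.4))
    women = norm(163.0, np.sqrt(52.8))

    s = 0.5
    total, prev_right = 0.0, 0.0
    for h in np.arange(185.5, 140.0 - s, -s):    # slide down from h=185.5 to h=140
        r_f = women.sf(h)                        # % of women to the right of h
        r_f_delta = r_f - prev_right             # % of women within this step window
        l_m = men.cdf(h)                         # % of men to the left of h
        total += l_m * r_f_delta                 # l = l_m * r_f_delta, as in the table below
        prev_right = r_f

    print(f"step-wise estimate: {total:.2%}")    # ~7.5%, as with NORMDIST
    # closed form for comparison: P(W - M > 0) where W - M ~ Normal(163-178, 52.8+59.4)
    print(f"closed form       : {norm.sf(0, loc=-15, scale=np.sqrt(112.2)):.2%}")   # ~7.8%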

C1 plots the percentage of women to the right & the percentage of men to the left at different height values. C2 is the likelihood of seeing a couple with a taller woman within each step window of size 0.5. Interestingly, the peak in C2 lies between heights 172 & 172.5, about 1.24 sd above the women's mean (163) & 0.78 sd below the men's mean (178). The spike at the end of curve C2 at point 185.5 is the likelihood for all heights > 185.5, i.e. the window (185.5, ∞).

Playing around with different mean & variance values yields other results. At the moment the two means are separated by about 2 sd. If this is reduced to 1 sd, by moving the mean height of women (or men) to about 170.5 cm, the final likelihood jumps to about 23%. This is understandable since the population now has far more taller women. The height variance for men (59.4) is larger than that for women (52.8); setting both to 52.8 (fewer shorter men) lowers the percentage to about 6.9%, while setting both to 59.4 (more taller women) raises it to about 8.1%.





Sample data points from Confidence_Interval_Heights.ods worksheet:


Point (p) | z_f (women, using p) | r_f = % to Right | r_f_delta = % in between | z_m (men, using p) | l_m = % to Left | Likelihood l = l_m * r_f_delta
185.5 | 3.0964606 | 0.0009792 | 0.0009792 | 0.9731237 | 0.8347541 | 0.0008174
185   | 3.0276504 | 0.0012323 | 0.0002531 | 0.9082488 | 0.8181266 | 0.0002071
184.5 | 2.9588401 | 0.0015440 | 0.0003117 | 0.8433739 | 0.8004903 | 0.0002495
184   | 2.8900299 | 0.0019260 | 0.0003820 | 0.7784989 | 0.7818625 | 0.0002987
183.5 | 2.8212196 | 0.0023921 | 0.0004660 | 0.7136240 | 0.7622702 | 0.0003553
183   | 2.7524094 | 0.0029579 | 0.0005659 | 0.6487491 | 0.7417497 | 0.0004197
182.5 | 2.6835992 | 0.0036417 | 0.0006838 | 0.5838742 | 0.7203475 | 0.0004926
182   | 2.6147889 | 0.0044641 | 0.0008224 | 0.5189993 | 0.6981194 | 0.0005741
...

Thursday, February 21, 2019

Poincaré and the Baker

There is a story about the famous French mathematician Poincaré, who kept track of the weight of the loaves of bread that he bought for a year from a single baker. At the end of the year he complained to the police that the baker had been cheating people, selling loaves with a mean weight of 950g instead of 1000g. The baker was issued a warning & let off.

Poincaré continued weighing the loaves that he bought from the baker through the next year, at the end of which he complained again. This time, even though the mean weight of the loaves was 1000g, he pointed out that the baker was still baking loaves with lower weights, but handing out the heavier ones to Poincaré. The story, though made up, makes for an interesting exercise for an intro stats class.

The crux of the solution, as shown by others, is to model the baked loaves as a Normal distribution.

At the end of year 1

The expected mean is m=1000g; since the baker had been cheating, the mean shows up as 950g. The standard deviation (sd) is assumed to remain unaffected. To check whether the drop from the expected value m1=1000 to the observed m2=950 is significant (for identical sd), a hypothesis test can be performed to ascertain whether the sampled loaves were drawn from a Normal distribution with mean m1=1000, or from a separate one with mean m2=950. The null hypothesis is H0 (m1 = m2) vs. the alternative hypothesis H1 (m1 != m2).

|z| = |(m1 - m2)/(sd/sqrt(n))|, where n is the number of samples.

Assuming Poincaré bought 1 loaf/day (n=365), & taking sd=50g,

z = 19.1, which is much larger than the critical value (2.576) at the 1% level of significance, so the null hypothesis can be rejected. Poincaré was right in calling out the baker's cheating!

From the original story we are only certain about the values m1=1000 & m2=950. Playing around with the other values, for instance changing n to 52 (1 loaf/week), or sd to 100g (a novice baker, with more variance), would yield other z values & possibly different decisions about the null hypothesis.
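
A small Python sketch of this what-if exercise; only m1 = 1000g & m2 = 950g are fixed by the story, the (n, sd) combinations below are illustrative assumptions:

    import math

    m1, m2 = 1000.0, 950.0
    z_crit = 2.576                      # two-sided critical value at the 1% level

    for n, sd in [(365, 50.0), (52, 50.0), (365, 100.0), (52, 100.0)]:
        z = abs(m1 - m2) / (sd / math.sqrt(n))
        verdict = "reject H0" if z > z_crit else "cannot reject H0"
        print(f"n={n:3d}, sd={sd:5.1f} -> |z| = {z:5.1f} ({verdict})")
    # n=365, sd=50 gives |z| ~ 19.1, as worked out above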

At the end of year 2

What Poincaré observes is a different sort of curve with mean=1000g, but one that is far narrower than a Normal. A few quick checks make him realize that he's looking at an Extreme Value Distribution (EVD), the Gumbel distribution. Using the data from his set of bread samples, he could easily work out the parameters of the EVD (mean, median, location μ, scale β, etc.). Evident from the EVD curve was the baker's modus operandi of handing out specially earmarked heavier loaves (the heaviest ones would yield the narrowest curve) to Poincaré.
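
As a rough illustration, the year-2 situation can be simulated by assuming the baker bakes Normal(950, 50) loaves & hands Poincaré the heaviest of k loaves each day (assuming scipy; k = 4 here is purely illustrative):

    import numpy as np
    from scipy.stats import gumbel_r

    rng = np.random.default_rng(1)
    k, days = 4, 365
    batches = rng.normal(950.0, 50.0, size=(days, k))
    received = batches.max(axis=1)            # the loaf Poincare actually gets each day

    loc, scale = gumbel_r.fit(received)       # fit an Extreme Value (Gumbel) model
    print(f"observed mean ~ {received.mean():.0f} g")                # close to 1000 g
    print(f"Gumbel fit: location mu ~ {loc:.1f} g, scale beta ~ {scale:.1f} g")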

Wednesday, February 20, 2019

Professional

Almost all definitions of the word professional include an ethical component. Yet we are flooded with news of fraud and malpractice from organizations across sectors and geographies, run by professionals who mostly hold top-of-the-line credentials.

The article "Star-studded CVs and moral numbness" provides some interesting insights, esp. from an exercise with a batch of budding professionals from a top tier management school. The authors were perhaps dealing with a relatively young lot of individuals having less exposure (typically with <3 years work experience). Had the exercise been done within the walls of the corporate war-rooms, with hardened professionals clued into the ways of the real world, the outcome would have been totally different. Forget remorse, there would be no realization of any wrong doing. Voices of protests would be snubbed or worse, shown the exit door. 

Monday, February 18, 2019

System Reliability

People are surrounded by devices of all kinds. Reliability is one of the key aspects of the user's experience of a device, particularly over the long term. It also shapes the general opinion (positive or negative) that the user forms about the device, its brand & its manufacturer. An understanding of reliability is thus important for both the manufacturer and the user.

Reliability numbers are worked out initially at the design phase by the manufacturer. Explicit targets are set for the product, which govern the design choices. Later, several rounds of testing are done by the manufacturer and/or the certifying authority, mostly before device roll-out, to ascertain the actual numbers. In certain cases these may need to be re-evaluated due to unexplained failures, manufacturing defects, etc. while the device is in service. Such evaluations can be performed during routine maintenance of the device or via an explicit recall of the device to a designated service station. The data collected is analyzed to understand & resolve the underlying issues in the device and the causes of failures.

Reliability Analysis

There are some standard methods adopted by manufacturers (OEMs) to calculate reliability numbers for a device. These include, among others, quantitative measures such as Mean Time To Failure (MTTF), Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR), captured at the device and/or sub-component level. MTTF is a measure of the time (or number of cycles, runs, etc.) at which the device is likely to fail, while MTBF is the equivalent value for repairable devices that accounts for the interval between failure incidents. MTTR is the corresponding time spent in repair. For repairable systems:
   MTBF = MTTF + MTTR

These numbers are aggregates applicable to a general population of devices, not to any one specific device. So an MTBF of 30,000 hours implies that a population of 30 devices, each running for about 1,000 hours (collectively clocking 30K device-hours), is likely to see one failure on average.

For an exponential reliability model R(t) = exp(-t/MTBF), the probability of a specific device surviving up to its rated t = MTBF is:
 R(MTBF) = exp(-MTBF/MTBF) = exp(-1) = 36.8%

For repairable systems, another term used often is Availability.
 Availability = System Up Time/ (System Up Time + System Down Time)

For mission-critical systems that cannot accept any downtime, Availability equals Reliability!
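
A tiny Python sketch of the above arithmetic (the MTBF & MTTR values here are hypothetical):

    import math

    mtbf_hours = 30_000.0
    mttr_hours = 24.0
    mttf_hours = mtbf_hours - mttr_hours       # since MTBF = MTTF + MTTR

    def reliability(t, mtbf=mtbf_hours):
        """Probability of a device surviving up to time t (exponential model)."""
        return math.exp(-t / mtbf)

    print(f"R(MTBF)      = {reliability(mtbf_hours):.1%}")        # ~36.8%
    print(f"Availability = {mttf_hours / mtbf_hours:.4%}")        # up time / total time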

Weibull Analysis

Statistical techniques such as Weibull Analysis are also very common for reliability computations. Weibull analysis makes use of data from failed as well as non-failed devices to work out device lifespan & reliability. A set of sample devices is observed under test conditions & a statistical distribution (model) is fitted to the data collected from these test samples. The fitted model is then used to make predictions about the reliability of the entire population of devices operating under real-world conditions.

The Weibull model uses three parameters: β (shape of the distribution), η (scale, i.e. spread) & γ (location in time). Interestingly, the Weibull model nicely captures the standard U-shaped, bath-tub reliability curve typically seen over a device's lifespan. In the early life of a device (testing, acceptance stage) the failure & defect rates are high (β < 1). As these get fixed, the failure rate drops quickly towards the steady, operation-ready Useful Life stage.

In the Useful Life stage (β = 1) the device is stable & ready to roll out to the end-user. Defects in this second stage are mainly due to design issues, operation, human errors, unexpected failures, etc. Finally, the device enters the Wear-out phase (β > 1), where the device or certain sub-components start showing natural wear & tear. Repairs & maintenance help keep the device in good working shape for a while, but eventually repairs are no longer viable due to costs or other reasons & the device is taken out of service. Decisions around scheduled inspections, overhauls, etc. can be planned based on the stage of the device life cycle & the corresponding value of β.
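
A short Python sketch of the three-parameter Weibull reliability & hazard (failure-rate) functions, showing how β < 1, β = 1 & β > 1 map to the three bath-tub stages (η & the sample time points below are illustrative):

    import math

    def weibull_reliability(t, beta, eta, gamma=0.0):
        """R(t) = exp(-(((t - gamma)/eta)**beta)) for t > gamma."""
        return math.exp(-(((t - gamma) / eta) ** beta)) if t > gamma else 1.0

    def weibull_hazard(t, beta, eta, gamma=0.0):
        """Instantaneous failure rate h(t) = (beta/eta) * ((t - gamma)/eta)**(beta - 1)."""
        return (beta / eta) * ((t - gamma) / eta) ** (beta - 1) if t > gamma else 0.0

    eta = 10_000.0                              # hours, illustrative scale
    for beta, stage in [(0.5, "early life"), (1.0, "useful life"), (3.0, "wear-out")]:
        h1, h2 = weibull_hazard(1_000, beta, eta), weibull_hazard(9_000, beta, eta)
        trend = "falling" if h2 < h1 else ("flat" if h2 == h1 else "rising")
        print(f"beta={beta}: failure rate {h1:.2e} -> {h2:.2e} per hour ({trend}, {stage})")

    # at t = gamma + eta the reliability is exp(-1) ~ 36.8%, regardless of beta
    print(f"R at the characteristic life (t = eta): {weibull_reliability(eta, 1.0, eta):.1%}")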

There are other lifetime/failure distributions such as the Exponential, Poisson, Rayleigh, Gamma, Beta, etc. which are applied to specific types of devices, domains and failure cases. Selection of the appropriate distribution is important for a proper reliability analysis.

Sampling and Confidence Levels

Once devices are live, actual on-ground analysis can also be done for certain categories of devices. Data can be collected from a representative sample of devices operating on the ground. Statistical techniques for reliable sampling & for deriving confidence intervals for the underlying population can be applied for this purpose.

The analysis is typically done for a Binomial population of devices, where a fraction p of the population (devices) is expected to fail, while (1 - p) is expected to operate fine (without failure). For a desired confidence interval (tolerance) c, the sample size n is worked out by taking a Normal approximation to the Binomial distribution (simplifying the calculations):

    n = Z^2 × p × (1 - p) / c^2


where Z is a constant chosen, based on the desired confidence level, from the Standard Normal curve: Z = 1.96 for 95% confidence, 2.58 for 99%, and so on.

   (E.g. 1) For example, if for a certain device 4% of devices are expected to fail, p=0.04:

      (1.a) With a 99% confidence level, for a 1% confidence interval, c=0.01:
n = 2.58*2.58*0.04*(1-0.04)/(0.01*0.01) ≈ 2,556 samples are needed

    (1.b) For a tighter 0.1% confidence interval, c=0.001:
n = 2.58*2.58*0.04*(1-0.04)/(0.001*0.001) ≈ 255,606 samples (100x more than (1.a)) are needed

   (1.c) Similarly, for a higher confidence level of 99.99% (Z=3.891), at the same 1% confidence interval:
n = 3.891*3.891*0.04*(1-0.04)/(0.01*0.01) ≈ 5,814 samples (more than (1.a)) are needed

The above sample-size estimator assumes a very large, or infinite, population. In the case of a finite-sized population, the following correction is applied to the cases above:

   n_finite = n / (1 + (n-1)/size_pop)

  (1.b.1) Applying the correction to case (1.b) above, assuming a fixed total population of only 30,000 devices:
n_finite = 255606/ (1 + (255606-1)/30000) ≈ 26,849 devices, which need to be sampled to achieve a 0.1% confidence interval (tolerance) at the 99% confidence level.
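
The sample-size arithmetic above can be captured in a few lines of Python (the values mirror examples 1.a-1.c & 1.b.1):

    def sample_size(p, c, z):
        """Samples needed for failure fraction p & confidence interval c, at Normal quantile z."""
        return z * z * p * (1 - p) / (c * c)

    def finite_correction(n, pop):
        """Correction for a finite population of size pop."""
        return n / (1 + (n - 1) / pop)

    p = 0.04
    n_1a = sample_size(p, 0.01, 2.58)           # ~2,556
    n_1b = sample_size(p, 0.001, 2.58)          # ~255,606
    n_1c = sample_size(p, 0.01, 3.891)          # ~5,814
    n_1b1 = finite_correction(n_1b, 30_000)     # ~26,849 for a population of 30,000

    for label, n in [("1.a", n_1a), ("1.b", n_1b), ("1.c", n_1c), ("1.b.1", n_1b1)]:
        print(f"{label}: {n:,.0f} samples")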

As discussed earlier, reliability trends for devices tend to fit lifetime-dependent distributions such as the Weibull better. Confidence levels in such cases are worked out using the appropriate distribution. For instance, when a small constant failure rate (λ) is expected, an Exponential or Poisson reliability model approximates the Binomial better than the Normal does. The confidence interval for λ is then worked out using a Chi-Square distribution with 2n degrees of freedom, where n is the count of failures seen over time in the sampled set of devices.
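
As a hedged sketch, one common convention (for a time-terminated test) computes the interval for λ from Chi-Square quantiles as below, assuming scipy; the device-hours T & the failure count r are illustrative, & the exact degrees of freedom depend on the test set-up:

    from scipy.stats import chi2

    T = 500_000.0          # cumulative device-hours observed across the sampled devices
    r = 12                 # failures seen in that time
    alpha = 0.05           # for a 95% two-sided interval

    lam = r / T
    lam_low = chi2.ppf(alpha / 2, 2 * r) / (2 * T)
    lam_high = chi2.ppf(1 - alpha / 2, 2 * (r + 1)) / (2 * T)

    print(f"lambda estimate : {lam:.2e} failures/hour (MTBF ~ {1/lam:,.0f} hours)")
    print(f"95% interval    : ({lam_low:.2e}, {lam_high:.2e}) failures/hour")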

Redundancy

Some systems need high fault tolerance. Reliability of such systems can be improved by introducing redundant devices in parallel, thereby eliminating the Single Point of Failure (SPOF). When one device fails, an alternate one can perform the job in its place.

Reliability of the redundant system:
    R = 1 - p1 X p2 X .. X pk

   where p1,..,pk are the probabilities of failure of the redundant (parallel) devices.

    (E.g. 2) In the above example, where the single-device system has a failure rate p=0.04 & a reliability of 96% (1-0.04), if we introduce an identical redundant/backup device also with p=0.04, the reliability goes up to R = 1 - 0.04*0.04 = 99.84%.

k-out-of-n Systems

An alternate set-up is a consensus-based (k-out-of-n) system. In this set-up, the system fails only when more than the quorum number of devices (k, typically 50%) fail. The reliability of the quorum system is:
   R_quorum_system = 1 - probability of more than k (quorum) device failures

The reliability is maximized for a majority quorum, i.e. k = n/2+1. 
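
A small Python sketch of the parallel-redundancy & k-out-of-n formulas above, assuming identical, independent devices with failure probability p (the numbers are illustrative):

    from math import comb, prod

    def parallel_reliability(failure_probs):
        """System works unless every redundant device fails: R = 1 - p1*p2*..*pk."""
        return 1 - prod(failure_probs)

    def k_of_n_reliability(k, n, p):
        """System works while at least k of the n devices are up."""
        return sum(comb(n, i) * (1 - p) ** i * p ** (n - i) for i in range(k, n + 1))

    p = 0.04
    print(f"single device      : {1 - p:.4f}")
    print(f"1-of-2 (redundant) : {parallel_reliability([p, p]):.4f}")   # 0.9984, as in E.g. 2
    print(f"3-of-5 (majority)  : {k_of_n_reliability(3, 5, p):.6f}")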

Monitoring Systems

Another typical approach is to introduce monitoring systems. The monitoring system can take the form of a sensor (optical or non-optical), a logger, a heart-beat polling unit, a human operator, or a combination of these. Whenever the monitoring system finds the primary system faltering, it raises alarms so that corrective measures can be taken, which may include stopping/replacing the faulty device and/or switching over to a backup system if available.

The reliability of the monitoring system is assumed to be much higher than that of the underlying system being monitored, ideally 100%. The monitoring system operates in series with the underlying system, so the reliability of the overall system is:
    R_monitored_system = R_device X R_monitoring 

In other words, a failure in either the device or the monitor (or both) will result in failure of the system, increasing the overall chances of failure. Yet monitoring systems are effective on the ground since they form the first line of defense for the system. They raise alarms so that the human operator can intervene early (lowering MTTR).

In certain set-ups the monitoring system is also enabled to automatically switch over to a backup device when there is a failure in the primary device. This helps reduce the down time (MTTR) to a negligible value, if not zero. With a system that has redundant devices & a single monitoring system, the SPOF shifts to the monitoring system. A further refinement of the system design (as in Zab, Paxos, etc.) is to set up the monitoring system itself as a k-of-n, typically majority, quorum. All decisions regarding the state of the underlying devices are taken by the quorum. A majority quorum is also resilient to fewer than n/2 failures of the monitoring nodes.

Through good system design & thought, the reliability at the system level can be significantly boosted even if the sub-components are less reliable. Design & engineering teams must possess sound reliability analysis skills to be able to build world class products. An awareness of reliability aspects also helps the end-user to decide on the right device that suits their requirements & continues to function properly over its lifespan.

Tuesday, February 5, 2019

Towards A Clean Ganga

The hope for a clean Ganga river remains eternal in our hearts. There have been several attempts at cleaning the Ganga over time, most recently under the Namami Gange project. The goal is to get within the acceptable water-quality standards, for a clean, pollution-free river (Nirmal Dhara) with uninterrupted, adequate flow (Aviral Dhara). Progress, however, seems rather limited & slow, with no due date in sight for making the Ganga clean again.

For the data-oriented, numbers on the current state of the Ganga are available on the CPCB website in real time. There are some 30+ monitoring stations located at different points along the Ganga. These centres collect data from the Ganga & publish it in near real time. Beyond the rudimentary web portal of the CPCB, API-based access to the data should also be made available. This would allow others to leverage the underlying data for analytical purposes & to build interesting apps. Data can reveal insights on several aspects such as seasonal factors, flow volume, stretches with pollution spikes, changes in pollution levels over time, impact of specific events or interventions, etc. Open-sourcing the data is the way to go!

Another source of data on Ganga water quality is the set of reports published by the CPCB & by other environmentalists/researchers working in this area. At times the data published in these reports has been collected by the authors themselves & provides a secondary check on the numbers from the CPCB & others.

Yet another, though less rigorous, option is to crowd-source the data. For various reasons (religious, tourism, adventure, livelihood, etc.) people visit different spots along the Ganga throughout the year. A few motivated people among them could help baseline the water-quality numbers using low-end, free/cheap phone-based apps & devices, & publish the results for public use. Hydrocolor is one such phone-based app, developed as part of a Ph.D. dissertation, that uses the phone camera (as an RGB radiometer) to measure water quality. The app auto-calibrates to handle variations across devices, platforms, weather conditions, etc.

Similarly, there is a home-made device called the Secchi disk that can be used to measure the turbidity of water. Aerial, drone & IoT-based devices are also being conceived by people across the world as solutions to track the health & pollution of water bodies in their respective cities. We could adapt such tools to monitor the state of the river Ganga over time as she progresses towards good health.