Sunday, February 24, 2019

Memory Gene

Have a conjecture that someone is soon going to discover a memory gene in the human genome. This doesn't seem to have been done or published in any scientific literature so far. The concept of Genetic Memory from an online game comes close, but that's fiction.

The idea is that this gene on the human genome would act as a memory card. Whatever data an individual writes to the memory gene in their lifetime could later be retrieved by their progeny in theirs. The space available in the memory gene would be tiny compared to what is available in the brain: if the storage capacity of the human brain is a Petabyte (= 10^12 Kilobytes), the memory gene would hold about 10 Kilobytes. So very little can be written to it.

Unlike the brain, to which every bit of information (visual, aural, textual, etc.) can be written at will, writing to the memory gene would require serious intent & need. It would be more akin to etching on a brass plate - strenuous but permanent. The intent would largely be triggered by the individual's experiences, particularly ones that evoke strong emotions and are perhaps beneficial to survival. Once written to the memory gene, this information would carry forward to the offspring.

The human genome is known to have about 2% coding DNA & the rest non-coding DNA. The coding portions carry the genetic instructions to synthesize proteins, while the purpose of the non-coding portions is not clearly known so far. The memory gene would likely have a memory addressing mechanism, followed by the actual memory data stored in the large non-coding portion.

At the early age of 2 or 3 years, when a good portion of brain development has happened, memory recovery would begin. The mRNA, ribosomes & the rest of the translation machinery would get to work translating the genetic code from the memory gene to synthesize the appropriate proteins & biomolecules of the brain cells. In the process the memory data would be restored block by block in the brain. This would perhaps happen over a period of low activity such as a night's sleep. The individual would later awaken with newly transferred knowledge about unknowns, which would appear to be intuitive. Since the memory recovery would take place at an early age, conflicts between the experiences of the individual and the ancestor wouldn't arise.
  
These are some basic features of the very complex memory gene. As mentioned earlier, this is purely a conjecture and shouldn't be taken otherwise. Look forward to exploring genuine scientific research in this space as it gets formalized & shared.

Update 1 (27-Aug-19):
For some real research take a look at the following:

 =>> Arc Gene:
 The neuronal gene Arc is required for synaptic plasticity and cognition. Arc proteins resemble retroviruses/retrotransposons in the way they transfer material between cells, followed by activity-dependent translation. These studies throw light on a completely new way through which neurons could send genetic information to one another. More details from the 2018 publication available here:
  •    https://www.nih.gov/news-events/news-releases/memory-gene-goes-viral
  •    https://www.cell.com/cell/comments/S0092-8674(17)31504-0
  •    https://www.ncbi.nlm.nih.gov/pubmed/29328915/

  =>> Memories pass between generations (2013):
 When grand-parent generation mice are taught to fear an odor, their next two generations (children & grand-children) retain the fear:
  •    https://www.bbc.com/news/health-25156510
  •    https://www.nature.com/articles/nn.3594
  •    https://www.nature.com/articles/nn.3603

 =>> Epigenetics & Cellular Memory (1970s onwards):
  •    https://en.wikipedia.org/w/index.php?title=Genetic_memory_(biology)&oldid=882561903
  •    https://en.wikipedia.org/w/index.php?title=Genomic_imprinting&oldid=908987981

  =>> Psychology - Genetic Memory (1940s onwards): Largely focused on the phenomenon of knowing things that weren't explicitly learned by an individual:
  •    https://blogs.scientificamerican.com/guest-blog/genetic-memory-how-we-know-things-we-never-learned/
  •    https://en.wikipedia.org/w/index.php?title=Genetic_memory_(psychology)&oldid=904552075
  •    http://www.bahaistudies.net/asma/The-Concept-of-the-Collective-Unconscious.pdf
  •    https://en.wikipedia.org/wiki/Collective_unconscious

Friday, February 22, 2019

Taller Woman Dance Partner

In the book Think Stats, there's an exercise to work out the percentage of dance couples where the woman is taller, when paired up at random. Mean heights (cm) & their variances are given as 178 & 59.4 for men, & 163 & 52.8 for women.

The two height distributions, for men & women, can be assumed Normal. The solution is to work out the total probability across the two curves where the condition height_woman > height_man holds. This needs to be done over the entire spread of the two curves (-∞, ∞): for each height point (h), take the integral of the men's curve over (-∞, h) (the fraction of men shorter than h), multiplied by the density of the women's curve at h, and integrate this product over all h. In symbols, P(W > M) = ∫ F_men(h) · f_women(h) dh.

There are also empirical solutions where N data points are drawn from the two Normal distributions with the appropriate mean & sd (square root of the variance) for men & women. These are paired up at random (and averaged over k runs) to compute the fraction of pairs with a taller woman.
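A minimal Monte Carlo sketch of this empirical approach (Python/NumPy; the means & variances are the ones from the exercise, while the pair count & run count are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(42)
    n, k = 100_000, 10                              # pairs per run, number of runs
    fractions = []
    for _ in range(k):
        men = rng.normal(178, np.sqrt(59.4), n)     # men's heights (cm): mean 178, var 59.4
        women = rng.normal(163, np.sqrt(52.8), n)   # women's heights (cm): mean 163, var 52.8
        fractions.append(np.mean(women > men))      # fraction of random pairs with a taller woman
    print(np.mean(fractions))                       # comes out around 0.078, i.e. ~7.8%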

The computation can also be approximated by limiting the range of height values to roughly ±3 sd around the two curves (about 140cm to 185cm, which covers ~99.7% of the women's curve; the men's upper tail is handled separately as an end spike). Then, sliding along the height values (h) starting from h=185 downwards in steps of size s=0.5 or so, compute z at each point:

z = (h - m)/ sd, where m & sd are the corresponding mean & standard deviation of the two curves.

Refer to the one-sided standard normal table to compute the percentage of women above h (using the women's z) & the percentage of men below h (using the men's z). The product of the women-in-window fraction and the men-below fraction is the contribution at point h; summing these over all h gives the final percentage. The equivalent spreadsheet solution using NORMDIST yields a likelihood of ~7.5%, slightly below the exact figure (due to the coarse step size of 0.5).
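Here's a sketch of that discretized summation (mirroring the worksheet sample below), along with the exact answer obtained from the difference of the two Normals; both use the same assumed means & variances, with SciPy in place of NORMDIST:

    import numpy as np
    from scipy.stats import norm

    m_m, sd_m = 178, np.sqrt(59.4)   # men: mean, sd
    m_f, sd_f = 163, np.sqrt(52.8)   # women: mean, sd
    s = 0.5                          # step size

    # End spike: all women taller than 185.5, paired with men shorter than 185.5
    total = norm.sf(185.5, m_f, sd_f) * norm.cdf(185.5, m_m, sd_m)
    for h in np.arange(140.0, 185.5, s):              # 140, 140.5, ..., 185
        women_in_window = norm.cdf(h + s, m_f, sd_f) - norm.cdf(h, m_f, sd_f)
        men_to_left = norm.cdf(h, m_m, sd_m)
        total += women_in_window * men_to_left
    print(total)                                      # ~0.075 with the coarse 0.5 step

    # Exact: W - M ~ Normal(163 - 178, 52.8 + 59.4), so P(W > M) = P(W - M > 0)
    print(norm.sf(0, loc=m_f - m_m, scale=np.sqrt(52.8 + 59.4)))   # ~0.078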

C1 plots the percentage of women to the right & the percentage of men to the left at different height values. C2 is the likelihood of seeing a couple with a taller woman within each step window of size 0.5. Interestingly, the peak in C2 is between heights 172-172.5, about 1.24 sd above the women's mean (163) & 0.78 sd below the men's mean (178). The spike at the end of curve C2 at point 185.5 is the likelihood for all heights > 185.5, i.e. the window (185.5, ∞).

Playing around with different mean & variance values yields other results. For instance, at the moment the two means are separated by about 2 sd. If this is reduced to 1 sd, by moving the mean height of women (or of men) to about 170.5 cm, the final likelihood jumps to about 23%. This is understandable since the population now has far more taller women. The height variance for men is larger than that for women; setting both to 52.8 (fewer shorter men) lowers the percentage to about 6.9%, while setting both to 59.4 (more taller women) raises it to 8.1%.

Sample data points from Confidence_Interval_Heights.ods worksheet:


Point (p) | z_f (women, using p) | r_f = % to right | r_f_delta = % in between | z_m (men, using p) | l_m = % to left | l = l_m * r_f_delta
185.5 | 3.0964606 | 0.0009792 | 0.0009792 | 0.9731237 | 0.8347541 | 0.0008174
185   | 3.0276504 | 0.0012323 | 0.0002531 | 0.9082488 | 0.8181266 | 0.0002071
184.5 | 2.9588401 | 0.0015440 | 0.0003117 | 0.8433739 | 0.8004903 | 0.0002495
184   | 2.8900299 | 0.0019260 | 0.0003820 | 0.7784989 | 0.7818625 | 0.0002987
183.5 | 2.8212196 | 0.0023921 | 0.0004660 | 0.7136240 | 0.7622702 | 0.0003553
183   | 2.7524094 | 0.0029579 | 0.0005659 | 0.6487491 | 0.7417497 | 0.0004197
182.5 | 2.6835992 | 0.0036417 | 0.0006838 | 0.5838742 | 0.7203475 | 0.0004926
182   | 2.6147889 | 0.0044641 | 0.0008224 | 0.5189993 | 0.6981194 | 0.0005741

...

Thursday, February 21, 2019

Poincaré and the Baker

There is a story about the famous French mathematician Poincaré, who kept track of the weight of the loaves of bread that he bought for a year from a single baker. At the end of the year he complained to the police that the baker had been cheating people, selling loaves with a mean weight of 950g instead of 1000g. The baker was issued a warning & let off.

Poincaré continued weighing the loaves that he bought from the baker through the next year, at the end of which he complained again. This time, even though the mean weight of the loaves was 1000g, he pointed out that the baker was still baking underweight loaves, but handing out the heavier ones to Poincaré. The story, though made up, makes for an interesting exercise in an intro stats class.

The crux of the solution, as shown by others, is to model the baked loaves as a Normal distribution.

At the end of year 1

The expected mean is m=1000g; since the baker had been cheating, the observed mean shows up as 950g. The standard deviation (sd) is assumed to remain unaffected. To know whether the drop from the expected mean m1=1000 to the observed m2=950 (with identical sd) is significant, a hypothesis test can be performed to ascertain if the sampled loaves are drawn from the Normal distribution with mean m1=1000, or from a separate one with mean m2=950. The null hypothesis is H0 (m1 = m2) vs. the alternate hypothesis H1 (m1 != m2).

|z| = |(m1-m2)/(sd/sqrt(n))| where n is the no of samples.

With an assumption of Poincaré's frequency of buying at 1 loaf/ day (n=365), & sd=50,

z = 19.1, much larger than the critical value of 2.576 at the 1% significance level, so the null hypothesis can be rejected. Poincaré was right in calling out the baker's cheating!

From the original story we are only certain about the values m1=1000 & m2=950. Playing around with the other values, for instance changing n to 52 (1 loaf/week), or sd to 100g (novice baker, more variance), yields other z values & possibly different decisions about the null hypothesis.
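A small sketch of this z-test, looping over the alternate n & sd assumptions discussed above:

    from math import sqrt
    from scipy.stats import norm

    def z_stat(m1, m2, sd, n):
        # Two-sided z statistic for an observed mean m2 against an expected mean m1
        return abs(m1 - m2) / (sd / sqrt(n))

    for n, sd in [(365, 50), (52, 50), (365, 100)]:
        z = z_stat(1000, 950, sd, n)
        p = 2 * norm.sf(z)                   # two-sided p-value
        print(n, sd, round(z, 1), "reject H0" if z > 2.576 else "fail to reject H0", p)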

At the end of year 2

What Poincaré observes is a different sort of curve with a mean of 1000g, but one that is skewed & far narrower than a Normal. A few quick checks make him realize that he's looking at an Extreme Value Distribution (EVD), i.e. the Gumbel distribution. Using data from his set of bread samples, he could easily work out the parameters of the EVD (location μ, scale β, and from these the mean, median, etc.). Evident from the EVD curve was the baker's modus operandi of handing out specially earmarked heavier loaves to Poincaré (picking only the heaviest would yield the narrowest curve).
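A toy simulation of the year-2 scenario (all numbers here are illustrative assumptions, e.g. that Poincaré gets the heaviest out of every batch of 4 loaves baked as N(950, 50)); the sample he collects is then fitted with a Gumbel model:

    import numpy as np
    from scipy.stats import gumbel_r

    rng = np.random.default_rng(1)
    batches = rng.normal(950, 50, size=(365, 4))   # a year of 4-loaf batches from the cheating baker
    poincare_loaves = batches.max(axis=1)          # the heaviest loaf of each batch goes to Poincaré

    print(poincare_loaves.mean(), poincare_loaves.std())   # mean close to 1000g, sd well below 50g
    mu, beta = gumbel_r.fit(poincare_loaves)               # location & scale of the fitted EVD
    print(mu, beta)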

Wednesday, February 20, 2019

Professional

Almost all definitions of the word professional include an ethical component. Yet we are flooded with news of fraud and malpractice from organizations across sectors and geographies, mostly run by professionals with top-of-the-line credentials.

The article "Star-studded CVs and moral numbness" provides some interesting insights, esp. from an exercise with a batch of budding professionals from a top tier management school. The authors were perhaps dealing with a relatively young lot of individuals having less exposure (typically with <3 years work experience). Had the exercise been done within the walls of the corporate war-rooms, with hardened professionals clued into the ways of the real world, the outcome would have been totally different. Forget remorse, there would be no realization of any wrong doing. Voices of protests would be snubbed or worse, shown the exit door. 

Monday, February 18, 2019

System Reliability

People are surrounded by devices of all kinds. Reliability is one of the key aspects of the user's experience of a device, particularly over the long term. It also shapes the general opinion (positive or negative) that the user forms about the device, its brand & manufacturer. An understanding of reliability is thus important for both the manufacturer and the user.

Reliability numbers are worked out initially at the design phase by the manufacturer. Explicit targets for the product are set, which govern the design choices. Later, several rounds of testing are done by the manufacturer and/or the certifying authority, mostly before device roll-out, to ascertain the actual numbers. In certain cases these may need to be re-looked at due to unexplained failures, manufacturing defects, etc. while the device is in service. Such evaluations can be performed during routine maintenance of the device or via an explicit recall of the device to a designated service station. The data collected is analyzed to understand & resolve the underlying issues in the device and the causes of failures.

Reliability Analysis

There are some standard methods adopted by the manufacturers (OEMs), etc. to calculate reliability numbers for a device. These include, among others, quantitative measures such as Mean Time To Failure (MTTF), Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR), at the device and/or sub-component level. MTTF is a measure of the time (or number of cycles, runs, etc.) at which the device is likely to fail, while MTBF is the equivalent value for repairable devices that accounts for the interval between failure incidents. MTTR is the corresponding time spent in repair. For repairable systems:
   MTBF = MTTF + MTTR

These numbers are aggregates applicable to a general population of devices, not to one specific device. So an MTBF of 30,000 hours implies that a population of 30 devices, running about 1,000 hours each on average, would collectively clock 30K device-hours for every failure seen.

For an exponential reliability model R(t) = exp(-t/MTBF), the probability of a specific device surviving up to its rated t = MTBF is:
 R(MTBF) = exp(-MTBF/MTBF) = exp(-1) = 36.8%

For repairable systems, another term used often is Availability.
 Availability = System Up Time/ (System Up Time + System Down Time)

For mission critical systems that can not accept any downtime, Availability equals Reliability!
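A small sketch of these two quantities (the MTBF & MTTR numbers are made-up values for illustration):

    import math

    def reliability(t, mtbf):
        # Exponential model: probability of surviving up to time t
        return math.exp(-t / mtbf)

    def availability(mtbf, mttr):
        # Steady-state availability: up time / (up time + down time)
        return mtbf / (mtbf + mttr)

    mtbf, mttr = 30_000, 24          # hours (assumed values)
    print(reliability(mtbf, mtbf))   # 0.368 -> only ~36.8% of devices survive to the rated MTBF
    print(availability(mtbf, mttr))  # ~0.9992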

Weibull Analysis

Statistical techniques such as Weibull Analysis are also very common for reliability computations. Weibull analysis makes use of data from failed as well as non-failed devices to work out device lifespan & reliability. A set of sample devices is observed under test conditions & a statistical distribution (model) is fitted to the data collected from these test samples. The fitted model is thereafter used to make predictions about the reliability of the entire population of devices operating under real-world conditions.

The Weibull model uses three parameters: β, the shape (shape of the distribution); η, the scale (spread); and γ, the location (position in time). Interestingly, the Weibull model is able to nicely capture the standard U-shaped, bath-tub reliability curve typically seen over a device's lifespan. In the early life of a device (testing, acceptance stage) the failure & defect rates are high (β < 1). As these get fixed, the failure rate drops quickly towards the steady, operation-ready Useful Life stage.

In the Useful Life stage (β = 1) the device is stable & ready to roll out to the end-user. Defects in this second stage are mainly due to design issues, operation, human errors, unexpected failures, etc. Finally, the device enters the Wear-out phase (β > 1), where the device or certain sub-components start showing natural wear & tear. Repairs & maintenance help keep the device in good working shape for a while, but eventually there comes a time when repairs are no longer viable due to costs or other reasons & the device is taken out of service. Decisions around scheduled inspections, overhauls, etc. can be planned based on the stage of the device life cycle & the corresponding value of β.
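A minimal sketch of how β drives this bath-tub behaviour, using the two-parameter Weibull hazard (failure) rate h(t) = (β/η)·(t/η)^(β-1); the β, η & time values below are arbitrary illustrations:

    import numpy as np

    def weibull_hazard(t, beta, eta):
        # Instantaneous failure rate of a two-parameter Weibull distribution
        return (beta / eta) * (t / eta) ** (beta - 1)

    t = np.array([10.0, 100.0, 1_000.0, 10_000.0])  # operating hours (assumed)
    print(weibull_hazard(t, beta=0.5, eta=1_000))   # beta < 1: rate falls over time (early failures)
    print(weibull_hazard(t, beta=1.0, eta=1_000))   # beta = 1: constant rate (useful life)
    print(weibull_hazard(t, beta=3.0, eta=1_000))   # beta > 1: rate rises over time (wear-out)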

There are other related distributions such as the Poisson, Rayleigh, Gamma, Beta, etc. which are applied to specific types of devices, domains and failure cases. Selection of the appropriate distribution is important for a proper reliability analysis.

Sampling and Confidence Levels

Once devices are live, actual on-ground analysis can also be done for certain categories of devices. Data can be collected from a representative sample of devices operating on ground. Techniques from statistics for reliably sampling & deriving confidence intervals for an underlying population can be applied for this purpose.

The analysis is typically done for a Binomial population of devices where a certain fraction p of the devices is expected to fail, while (1-p) is expected to operate fine (without failure). Assuming a desired margin of error c (the confidence/tolerance interval), the sample size n is worked out by taking a Normal approximation to the Binomial distribution (which simplifies the calculations):

    n = Z² X p X (1 - p) / c²


where Z is a constant chosen based on the desired confidence level from the Standard Normal curve: Z = 1.96 for 95% confidence, 2.58 for 99%, and so on.

   (E.g. 1) For example, if for a certain device 4% of devices are expected to fail, p=0.04:

      (1.a) With a 99% confidence level, for a 1% confidence interval, c=0.01:
n = 2.58*2.58*0.04*(1-0.04)/(0.01*0.01) = 2,556 samples are needed

     (1.b) For a tighter 0.1% confidence interval, c=0.001:
n = 2.58*2.58*0.04*(1-0.04)/(0.001*0.001) = 255,605 samples (100x more than (1.a)) are needed

    (1.c) Similarly, for a higher confidence level of 99.99% (Z=3.891), at the same 1% confidence interval:
n = 3.891*3.891*0.04*(1-0.04)/(0.01*0.01) = 5,814 samples (more than (1.a)) are needed

The above sample size estimator assumes a very large, or infinite, population. For a finite-sized population, the following correction is applied to the cases above:

   n_finite = n / (1 + (n-1)/size_pop)

  (1.b.1) Applying the correction to case (1.b) above, assuming a total population of only 30,000 devices:
n_finite = 255605/ (1 + (255605-1)/30000) ≈ 26,850 devices, which need to be sampled to achieve a 0.1% confidence interval (tolerance) at the 99% confidence level.
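A small helper that reproduces the sample-size figures above, including the finite-population correction (the Z, p, c & population values are the ones assumed in the examples):

    def sample_size(p, c, z):
        # Sample size for a Binomial proportion p with margin of error c (Normal approximation)
        return z * z * p * (1 - p) / (c * c)

    def finite_correction(n, pop_size):
        # Adjust the infinite-population sample size n for a finite population
        return n / (1 + (n - 1) / pop_size)

    n_1a = sample_size(0.04, 0.01, 2.58)            # ~2,556
    n_1b = sample_size(0.04, 0.001, 2.58)           # ~255,605
    n_1c = sample_size(0.04, 0.01, 3.891)           # ~5,814
    print(round(n_1a), round(n_1b), round(n_1c))
    print(round(finite_correction(n_1b, 30_000)))   # ~26,850 for a population of 30,000 devices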

As discussed earlier, the reliability trends for devices tend to fit lifetime-dependent distributions such as the Weibull better. Confidence levels in such cases are worked out using the appropriate distribution. For instance, with a small constant failure rate (λ) expected, an Exponential or a Poisson reliability model is a better approximation than the Binomial/Normal. The confidence interval for λ is then worked out using a Chi-Square distribution with 2n degrees of freedom, where n is the count of failures seen over time in the sampled set of devices.
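A hedged sketch of such an interval, under one common convention (a time-terminated test with n failures observed over a total of T device-hours, using 2n & 2n+2 degrees of freedom; the failure & hour counts are assumed for illustration):

    from scipy.stats import chi2

    def failure_rate_ci(n_failures, total_hours, confidence=0.90):
        # Two-sided Chi-Square confidence interval for a constant failure rate (lambda)
        alpha = 1 - confidence
        lower = chi2.ppf(alpha / 2, 2 * n_failures) / (2 * total_hours)
        upper = chi2.ppf(1 - alpha / 2, 2 * n_failures + 2) / (2 * total_hours)
        return lower, upper

    lo, hi = failure_rate_ci(5, 100_000)   # e.g. 5 failures over 100,000 device-hours (assumed)
    print(lo, hi)                          # bounds on lambda; the MTBF bounds are their reciprocals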

Redundancy

Some systems need high fault tolerance. Reliability of such systems can be improved by introducing redundant devices in parallel, thereby removing the Single Point of Failure (SPOF). When one device fails, an alternate one can perform the job in its place.

Reliability of the redundant system:
    R = 1 - p1 X p2 X .. X pk

   where p1,..,pk are the probabilities of failure of the individual redundant devices.

    (E.g. 2) In the above example, where the single-device system has a failure rate p=0.04 & a reliability of 96% (1-0.04), if we introduce an identical redundant/backup device also with p=0.04, reliability goes up to R = 1 - 0.04*0.04 = 99.84%.

k-out-of-n Systems

An alternate set-up is a consensus-based, k-out-of-n system. In this set-up, the system works as long as at least a quorum of k devices (typically a majority) are working, & fails only when more than n-k devices fail. The reliability of the quorum system is:
   R_quorum_system = 1 - probability of more than (n - k) device failures

The typical choice for consensus-style systems is a majority quorum, i.e. k = n/2 + 1, which tolerates just under half the devices failing.
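A short sketch that computes k-out-of-n reliability from the per-device failure probability via the Binomial distribution; it also reproduces the single-device & 1-of-2 (parallel redundancy) figures from (E.g. 2), with p = 0.04 assumed throughout:

    from math import comb

    def k_of_n_reliability(n, k, p_fail):
        # Probability that at least k of the n devices are working
        # (each device fails independently with probability p_fail)
        p_ok = 1 - p_fail
        return sum(comb(n, i) * p_ok**i * p_fail**(n - i) for i in range(k, n + 1))

    print(k_of_n_reliability(1, 1, 0.04))   # single device:      ~0.9600
    print(k_of_n_reliability(2, 1, 0.04))   # 1-of-2 (parallel):  ~0.9984
    print(k_of_n_reliability(3, 2, 0.04))   # 2-of-3 (majority):  ~0.9953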

Monitoring Systems

Another typical approach is to introduce monitoring systems. The monitoring system can be in the form of a sensor (optical, non-optical), a logger, a heart-beat polling unit, a human operator, or a combination of these. Whenever the monitoring system finds the primary system faltering, it raises alarms so that corrective measures can be taken, which may include stopping/replacing the faulty device and/or switching over to a backup system if available.

The reliability of the monitoring system is assumed to be much higher than that of the underlying system being monitored, ideally 100%. The monitoring system operates in series with the underlying system, so the reliability of the overall system is:
    R_monitored_system = R_device X R_monitoring 

In other words, a failure in either the device or the monitor, or both, will result in failure of the system, increasing the overall chances of failure. Yet monitoring systems are effective on the ground since they are the first line of defense for the system: they raise alarms so the human operator can intervene early (lowering MTTR).

In certain set-ups the monitoring system is also able to automatically switch over to a backup device when there is a failure of the primary device. This helps reduce the down time (MTTR) to a negligible value, if not zero. With a system that has redundant devices & a single monitoring system, the SPOF shifts to the monitoring system. A further refinement of the system design (as in Zab, Paxos, etc.) entails setting up the monitoring system itself as a k-of-n, typically majority, quorum. All decisions regarding the state of the underlying devices are taken by the quorum. The majority quorum is also resilient to just under n/2 failures of the monitoring nodes.

Through good system design & thought, the reliability at the system level can be significantly boosted even if the sub-components are less reliable. Design & engineering teams must possess sound reliability analysis skills to be able to build world class products. An awareness of reliability aspects also helps the end-user to decide on the right device that suits their requirements & continues to function properly over its lifespan.

Tuesday, February 5, 2019

Towards A Clean Ganga

The hope for a clean Ganga river remains eternal in our hearts. There have been several attempts at cleaning the Ganga over time, most recently under the Namami Gange project. The goal is to get within acceptable water-quality standards, for a clean, pollution-free river (Nirmal Dhara) with uninterrupted, adequate flow (Aviral Dhara). Progress, however, seems rather limited/slow, with no due date in sight for making the Ganga clean again.

For the data-oriented, numbers on the current state of the Ganga are available on the CPCB website in real time. There are some 30+ monitoring stations located at different points along the Ganga. These centres collect data from the Ganga & publish it in near real time. Beyond the rudimentary web portal of the CPCB, API-based access to the data should also be made available. This would allow others to leverage the underlying data for analytical purposes & build interesting apps. The data can reveal insights on several aspects such as seasonal factors, flow volume, stretches with pollution spikes, changes in pollution levels over time, impact of specific events or interventions, etc. Open-sourcing the data is the way to go!

Another source of data on Ganga water quality is the set of reports published by the CPCB & other environmentalists/researchers working in this area. At times the data published in these reports has been collected by the authors themselves & provides a secondary check on the numbers from the CPCB & others.

Yet another, though less rigorous, option is to crowd-source the data. For various reasons (religious, tourism, adventure, livelihood, etc.) people visit different spots along the Ganga throughout the year. A few motivated people among them could help baseline the water-quality numbers using low-end, free/cheap phone-based apps & devices, & publish the results for public use. Hydrocolor is one such phone-based app, developed as part of a Ph.D. dissertation, that uses the phone camera (as an RGB radiometer) to measure water quality. The app auto-calibrates to handle variations across devices, platforms, weather conditions, etc.

Similarly there is a home-made device called the Secchi disk that can be used for measuring the turbidity of water. Aerial, drone & IoT based devices are also being conceived by people across the world as solutions to track the health & pollution of water bodies in their respective cities. We could adapt such tools to monitor the state of the river Ganga over time as she progresses towards good health.

Friday, August 31, 2018

OpenDNS FamilyShield - A Safer Community

FamilyShield from OpenDNS is a free & simple DNS offering to block out most adult sites, proxy servers and phishing sites. Over the years since FamilyShield was first rolled out the service has continued to be effective & beneficial to millions of customers, particularly parents, in keeping net usage safe for children & families at home.

Innovations from OpenDNS include ideas such as leveraging the community for tagging domains (DomainTagging), identifying phishing sites (PhishTank), speeding up internet access via the OpenDNS Global Network, & a very clear/ open Anti-censorship policy. These are incorporated within the FamilyShield service to effectively block out harmful content & make internet access better for the user across devices. Finally, once ready to onboard, setting up FamilyShield on the router takes no effort at all!

Thursday, April 26, 2018

Biometric Authentication

Biometrics are being pegged as the next big deal for user authentication, esp. within the banking & financial sectors. Have been having a few discussions to understand how good biometric authentication is compared to traditional Pins. Is a return to being angootha-chaaps (thumb-print based) for all purposes the right thing to do, technically speaking? Here's a layman's attempt at answering some of those questions.

Starting off by calling out known facts & assumptions about thumb-prints (an example biometric):

  • Thumb-prints are globally unique to every human being.
    (Counter: Enough people don't have or lose a thumb, or lose their thumb-prints for some other reason. Also, partial thumb-prints of two individuals, taken of a portion of the thumb due to faults at the time of scanning, etc., may match.)
  • Thumb-prints stay consistent over the lifetime of an individual (adult).
    (Counter: May not be true due to physical changes in the human body, external injuries, growths, etc.)
  • Computers are basically binary machines. So whether it's a document (pdf, doc), an image file (jpg, gif, etc.), a video (mp4), a music file (wav), a Java program, a Linux operating system, etc. all of the data, instructions, etc. get encoded into a string of bytes (of 0s & 1s).
  • The thumb-print scan of an individual is similar to an image file (following a standard protocol), encoded as a string of bytes.
    The thumb-print scans of two different individuals will result in two different strings of bytes, each unique to the individual.
    Subsequent scans of the thumb-print of the same individual will result in exactly the same string of bytes over time.

That's enough background information for a rough evaluation. A thumb-print scan of a certain size, say 10 kilobits, is just a string of 10,000 bits of 0s & 1s. This is unique to an individual & stays the same over the individual's lifetime.

A 4-digit Pin, on the other hand, is a combination of four integers. Each integer typically gets encoded as a 32-bit string, so a 4-digit Pin is a 4 * 32-bit = 128-bit string. The Pin normally stays the same unless explicitly changed (which is rather infrequent).

In simplistic terms, when a request to authenticate an individual is made to a computer, it reads the incoming string of bits (from the Pin or the thumb-print) & matches it against a database of known/all existing strings (1-to-1 or 1-to-N matches). To the computer, other than the difference in length between the encoded thumb-print (10,000-bit) & Pin (128-bit) strings, there's not much difference between the two.

On the other hand, the Pin fares much better than the thumb-print if it were ever to get compromised due to a breach or a malicious app. The Pin can simply be changed & a new 128-bit string replaces the earlier one going forward. But in the case of the thumb-print there's really nothing that can be done, as the individual's thumb-print scan will stay the same over time!

Yet another alternative for authentication is the One Time Password (OTP). The OTP is also a 4-digit number (a 128-bit string in the above encoding), but it is re-issued each time over a separate out-of-band channel (such as SMS), is short-lived, & is valid for just one use. These features make the OTP far more robust & immune to breaches & compromise.
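A toy sketch of how such a short-lived, single-use code could be derived (an HMAC over a time window, loosely in the spirit of standard OTP schemes; the secret, the 4-digit length & the 30-second window are illustrative assumptions, not any bank's actual implementation):

    import hmac, hashlib, time

    def generate_otp(secret: bytes, digits: int = 4, window_seconds: int = 30) -> str:
        # Derive a short numeric code from a shared secret & the current time window
        window = int(time.time() // window_seconds)              # changes every 30s -> short-lived
        digest = hmac.new(secret, str(window).encode(), hashlib.sha256).hexdigest()
        return str(int(digest, 16) % 10**digits).zfill(digits)   # truncate to a 4-digit code

    print(generate_otp(b"shared-secret-between-bank-and-user"))

A real deployment would additionally deliver the code out-of-band (SMS, etc.) & track codes already consumed, to enforce the one-time property described above.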

What is a biometric to the human being is just another string of bits to the machine, very similar to the string of bits of a Pin or an OTP. From the standpoint of safety though, the OTP is far superior to the other two. As is common practice, it may be OK to use biometric authentication within environments such as government offices, airports, etc. where the network is tightly regulated & monitored. For end-user authentication however, such as within phone apps, internet payments, or other channels where the network or device is orders of magnitude more insecure & vulnerable, biometrics are not ideal. In general OTPs should be the top pick & biometrics the last option in such cases:

    OTP > Pin > Biometrics

Monday, April 9, 2018

Learning Deep

Head straight to KDnuggets' Top 20 Deep Learning Papers of 2018. It has a good listing of research publications spanning the last 4-5 years. You could further go on to read the papers referred to within these papers, & then those referred to in the referred papers, & so on, for some really deep learning!

Sunday, April 8, 2018

Application Security & OWASP

Lots of applications get developed these days to make the lives of customers easy & comfortable. However, a cause for concern is the general lack of awareness of security aspects among app developers. As a result, unsafe & buggy apps get released to production by the dozen.

Have come across quite a few such apps in recent times & duly reported them to the respective support/dev teams. While some of these will get fixed, there does appear to be a lack of knowledge of security issues among the teams. Had they known, they would mostly have got it right upfront. Retrospective patching, while common for newly discovered vulnerabilities, is no substitute for incorporating current standards & best practices that are well researched & documented.

OWASP is one of the leading open standards on security vulnerabilities. The OWASP Top-10 Application Security Risks (latest: 2017) include things like Injection, Broken Authentication, Sensitive Data Exposure, etc. There's a whole bunch of material available online, including an e-book, with details & fixes for the vulnerabilities for the different stake-holders of the app. These are like the safety-belts that must be incorporated in all apps before allowing them to go live.

Another major cause of widespread security issues in apps is their use of vulnerable frameworks & third-party libraries. Buggy JavaScript (JS) libraries are particularly guilty of pushing vulnerabilities down to apps.

As per the Northeastern University research on outdated JavaScript libraries on the web, of the 133K websites evaluated, 37% included at least one vulnerable library:
  - "not only website administrators, but also the dynamic architecture and developers of third-party services are to blame for the Web’s poor state of library management"
  - "libraries included transitively, or via ad and tracking code, are more likely to be vulnerable"

The RetireJS initiative keeps tabs on the vulnerabilities in JS libraries, as do the OWASP cheat sheets on 3rd Party JS & AJAX Security. Static analysers, security testing, sand-boxed execution, etc. are typical ways to address client-side JS security vulnerabilities.

Security issues are equally widespread in frameworks & libraries from other languages. Java & Scala are fairly well covered by OWASP (though .Net, PHP, etc. aren't). Evaluations of the Java Spring framework against the OWASP Top-10, listings of Java security frameworks, hdiv & Scala frameworks provide context on how best to address security issues in some very popular frameworks.

Wednesday, March 7, 2018

Ubuntu 16.04 32-bit Display Issue/ Corruption on load/ boot

Recently tried refurbishing an old Benq Joybook U102 laptop with an Ubuntu OS. This one has a 32-bit Intel Atom processor with 1 GB RAM, so needed the 32-bit Ubuntu 16.04 iso image (from alternative downloads).

Followed this up with the standard routine of creating a bootable USB using the downloaded iso, booting with the USB, plugging in the LAN cable (wifi drivers are initially unavailable/downloaded later), formatting the disk & doing a fresh install (choosing the option to download all third-party drivers, etc.). All this went off smoothly & the laptop was ready to reboot.

After restarting, however, the display was found corrupted. Practically the entire screen, from left to right, was covered with bright coloured stripes, dots & squares, rendering it unusable. After a bit of fiddling around, found a work-around: close the lid & reopen it, forcing the laptop into standby, & then exit standby by pressing the power button. This did help restore the display, but felt there had to be a better solution.

A few suggestions online were to downgrade the Ubuntu kernel to an older version, 4.12 or lower. Further search revealed the actual bug in kernel 4.13: Intel Mobile Graphics 945 shows 80 % black screen. The work-around of Suspend/Resume, as well as the proper solution of setting GRUB_GFXPAYLOAD_LINUX=text in the grub file, are mentioned there.

Setting the variable GFXPAYLOAD to text makes Linux boot up in normal text mode, typically to avoid display problems in the early boot sequence. As instructed made the addition to the /etc/default/grub file, ran sudo update-grub, rebooted, & the display issue was gone!