Tuesday, March 26, 2019

Opinions On A Topic

Media agencies of the day are busy flooding us with news - wanted, unwanted, real, fake, good, bad, ugly, whatever. Yet for the user, the challenge of staying truly updated has never been this tough. Sifting the wheat from the chaff is both computationally & practically hard!

There's a real need to automatically detect, flag & block misleading information from propagating. Though at the moment the technology doesn't exist, offerings are very likely to come up soon & get refined over time to nail the problem well enough. While we await breakthroughs on that front, for now the best bet is to depend on traditional human judgment.

- Make use of a set (not one or two) of trusted media sources, that employ professionals & expert journalists. Rely on their expertise to do the job of collecting & presenting the facts correctly. Assuming (hopefully) that these people/ organizations behave professionally, the information that gets through to these sources would be far better.

- Fact check details across the entire set of sources. This helps guard against a temporary (or permanent), deliberate or inadvertent faltering, manipulation, influence, etc. of the odd source. Use the set as a weak quorum that collectively highlights & prevents propagation of misinformation. Even if a few members falter, it is unlikely that all would. The majority would not allow the fakes to make it into their respective channels.

- The challenging part is when a certain piece shows up as breaking news on one channel & not on the others. One could default to labeling it as fake/ unverified, with the following considerations for the news piece:

 Case 1: Turns out fake, doesn't show up on the other sources
     => Remains Correctly Marked Fake


 Case 2: Turns out to be genuine & eventually shows up on other/ majority sources
    => Gets Correctly Marked True
 

 Case 3: Is genuine, but acquired via some form of journalistic brilliance (expose, criminal/ undercover journalism, etc.) that can't be re-run, or is about a region/ issue largely ignored by the mainstream media unwilling to do the verification, or for some other reason can't be verified
    => Remains Incorrectly Marked Fake
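
The weak quorum & the default "unverified" label described above can be sketched in a few lines of Python. This is only an illustration: the source names, the function name label_story & the simple-majority threshold of 0.5 are assumptions made for the sketch, not part of the op-on scripts.

```python
from typing import Iterable


def label_story(reporting_sources: Iterable[str],
                trusted_sources: Iterable[str],
                quorum: float = 0.5) -> str:
    """Label a story by the share of trusted sources that carry it.

    The 0.5 (simple majority) threshold is an assumed value.
    """
    trusted = set(trusted_sources)
    if not trusted:
        raise ValueError("at least one trusted source is needed")
    # Only confirmations coming from the trusted set count towards the quorum.
    confirmations = trusted & set(reporting_sources)
    share = len(confirmations) / len(trusted)
    # Default to "unverified" until a majority of the set carries the story.
    return "verified" if share > quorum else "unverified"


# Hypothetical source names, purely for illustration.
TRUSTED = {"source_a", "source_b", "source_c"}
print(label_story({"source_a"}, TRUSTED))              # breaking on one channel only
print(label_story({"source_a", "source_b"}, TRUSTED))  # picked up by a majority
```

A story in Case 2 moves from "unverified" to "verified" as more sources pick it up, while Cases 1 & 3 both stay "unverified", which is exactly why Case 3 cannot be told apart from Case 1 by the quorum alone.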


Case 3 is obviously the toughest to crack. While some specifics may be impossible to verify, other allied details could be easier to access & verify. Once some other media groups (beyond the one that reported) get involved in the secondary verification, there is some likelihood of the true facts emerging.

For those marginalized there are social groups & organizations, governmental & non-governmental that have some reports published on issues from ground zero. At the same time, as connectivity improves, citizens themselves would be able to bring forth local issues onto national & international platforms. In the interim, these will have to be relied upon until commercial interests & mainstream media eventually bring the marginalized into the folds. Nonetheless, much more thought & effort is needed to check the spread of misinformation.

Finally, here's a little script 'op-on.sh' / 'op-on.py' (works/ tested on *nix desktop), to look up opinions (buzz) on any given topic across a set of media agencies, of repute. Alternatively, a bookmarklet could be added to the browser, which would enable looking up the opinions across the sites. The op-on bookmarklet (tested on Firefox & Chrome) can be installed by right clicking & adding as a bookmark in the browser (or by copying the script into the url of a new bookmark). Pop-up blockers in the browser will need to be temporarily disabled (e.g. by clicking allow pop-ups in Firefox) for the script to work.

The set of media agencies that these scripts look up includes groups like TOI, IE, India Today, Times Now, WION, NDTV, Hindu, HT, Print, Quint, Week, Reuters, BBC, and so on. This might help the curious human reader to look up all those sources for opinions on any topic of interest.

Update 1 (16-Sep-19): Some interesting developments:

Friday, March 8, 2019

Secure DNS

Domain Name System (DNS) is one of the backbones of the internet. DNS helps translate a domain name (e.g. blahblah.com) to its corresponding IP address (e.g. 10.02.93.54). Thanks to the DNS, humans can access the internet via human friendly names, rather than having to remember & punch in numeric IP addresses. So much simpler to say "look it up on Google", than saying "look it up on 172.168...".

Working of DNS

The working of the DNS involves looking up DNS servers spread out over the internet. When a user enters a URL in the browser, the address resolver in their system looks up the DNS servers configured at their system (router/ network, ISP, etc.) for the corresponding IP address. The resolver recursively looks up the Root DNS server, then the top level domain (.com, .in), then second level domain (Google, Yahoo, etc.) (the Authoritative server for the domain), & from it finally the sub-domain (www, mail, etc.) to arrive at the corresponding IP address.
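
The order of the lookups described above can be sketched as follows. The helper names resolution_chain & resolve are made up for this illustration; resolve simply hands the name to whatever resolver the operating system is configured with (via Python's standard socket.getaddrinfo), so it needs network access & is only defined here, not called.

```python
import socket
from typing import List


def resolution_chain(hostname: str) -> List[str]:
    """List the DNS zones consulted, from the root down to the full name."""
    labels = hostname.rstrip(".").split(".")
    chain = ["."]  # the root zone
    # Walk from the top level domain down to the sub-domain.
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


def resolve(hostname: str) -> List[str]:
    """Ask the system's configured resolver for the IP addresses of a name."""
    results = socket.getaddrinfo(hostname, None)
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({sockaddr[0] for *_, sockaddr in results})


print(resolution_chain("www.google.com"))
# ['.', 'com.', 'google.com.', 'www.google.com.']
```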

DNS requests are typically made in plain text via UDP or TCP. In addition to the destination name, these requests also carry enough source identifiable information with them. Together with the recursive nature of the lookups via several intermediaries, this makes DNS requests vulnerable to being observed & tracked. The response could even be spoofed by a malicious intermediary that changes the IP address & directs the user to a scam site.

DNS over HTTPS (DoH)

A very recent development has been the introduction of DNS over HTTPS (DoH) in Firefox. HTTPS is the standard protocol used for end-to-end encryption of traffic over the internet. This prevents eavesdropping of the traffic between the client & the server by any intermediary.

To further secure the DNS request, DoH also brings in the concept of Trusted Recursive Resolvers (TRR). The TRR is trusted & of repute, & provides guarantees of privacy & security to the user. The default for Firefox is Cloudflare, though other TRRs are available for the user to choose from. Sadly though, OpenDNS isn't onboard with DoH or TRR, instead has its own offerings called DNSCrypt. Hope to see more convergence as adoption of these technologies improves in the future.
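
For a feel of what a DoH lookup looks like, here is a minimal Python sketch that talks to Cloudflare's resolver over HTTPS. It assumes the JSON flavour of the API (a GET with "name" & "type" parameters & an "application/dns-json" Accept header) rather than the binary DNS wire format that browsers typically use; the function names are made up for this illustration, & the sample response body below is hand-written with a documentation-range IP, not captured output.

```python
import json
import urllib.parse
import urllib.request
from typing import List

# Assumed endpoint for Cloudflare's DoH service (JSON flavour).
DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"


def build_doh_request(name: str, record_type: str = "A",
                      endpoint: str = DOH_ENDPOINT) -> urllib.request.Request:
    """Build the HTTPS GET request for a DNS lookup."""
    query = urllib.parse.urlencode({"name": name, "type": record_type})
    return urllib.request.Request(f"{endpoint}?{query}",
                                  headers={"Accept": "application/dns-json"})


def extract_answers(response_body: str) -> List[str]:
    """Pull the record data (IPs for A records) out of a JSON response."""
    payload = json.loads(response_body)
    return [answer["data"] for answer in payload.get("Answer", [])]


def lookup(name: str, record_type: str = "A") -> List[str]:
    """Perform the lookup over HTTPS (needs network access)."""
    with urllib.request.urlopen(build_doh_request(name, record_type), timeout=10) as resp:
        return extract_answers(resp.read().decode("utf-8"))


# Hand-written sample body, only to show the parsing step.
sample = '{"Status": 0, "Answer": [{"name": "example.com", "type": 1, "TTL": 60, "data": "192.0.2.1"}]}'
print(build_doh_request("example.com").full_url)
print(extract_answers(sample))
```

Since the whole exchange rides on HTTPS, an intermediary on the path sees only an encrypted connection to the resolver, not the name being looked up.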

Setting up DoH with Firefox (ver. 65.0) requires going to Preferences > Network Settings & checking "Enable DNS over HTTPS", with the default Cloudflare TRR. Alternatively, the flags "network.trr.mode" & "network.trr.uri" could be set via about:config.

To confirm if the set-up is correct, navigate to the Cloudflare test my browser page & validate. This should result in successful check marks in green against "Secure DNS" & "TLS 1.3". Some further set-ups may be needed in case the other two checks fail.

For DNSSEC, a DNSSEC compatible DNS server will need to be added. Pick Cloudflare DNS, Google DNS or any other from the DNS servers list. On the other hand, for the Encrypted SNI indicator, the flag "network.security.esni.enabled" can be enabled. Since ESNI is still at an experimental stage, there could be changes (or bugs) that get uncovered & resolved in the future.

Enabling at a Global Level

The DoH setting discussed here is limited to Firefox. DNS lookups done outside of Firefox, from any other browser, application or the OS, are unable to leverage DoH. DoH at the global/ OS level could be set up via proxies. Given that DoH runs over HTTPS, primarily a high level protocol for secure transfer of Hyper Text Documents, it may be preferable to secure DNS directly over the TLS protocol.

In this regard DNS over TLS (DoT) is being developed. Ubuntu ver.18.0 & some Linux flavours offer DoT support experimentally. While DoT has some catching up to do vis-à-vis DoH, raging debates are continuing regarding the merits & demerits of the two options for securing DNS requests. Over time we can hope for the gaps & issues to be resolved, & far better privacy & security offered to the end-user.

Update 1 (25-Feb-21) 
Reference links regarding enabling DoH across devices.

Enabling In Android:
- https://android.stackexchange.com/questions/214574/how-do-i-enable-dns-over-https-on-firefox-for-android 
(In addition to the various network.trr.* settings to use OpenDns Family Shield DoH, additionally lookup the IP for the domain name "doh.familyshield.opendns.com" & set that value to network.trr.uri) 
- https://blog.cloudflare.com/enable-private-dns-with-1-1-1-1-on-android-9-pie/

DoH Providers:
- https://github.com/curl/curl/wiki/DNS-over-HTTPS (Cloudflare also offers a Family shield/ filter)
- https://support.opendns.com/hc/en-us/articles/360038086532-Using-DNS-over-HTTPS-DoH-with-OpenDNS
- https://support.mozilla.org/en-US/kb/dns-over-https-doh-faqs

Tuesday, March 5, 2019

Human Design

"What a blessing (mercy) it would be if we could open and shut our ears as easily as we open and shut our eyes. - Georg C. Lichtenberg"

So true. Many an offensive situation could be defused by simply dropping down the earlids. In a hyper-noisy nation like ours where the chatter never dies down, earmarked (sic!) noise free zones (around hospitals, schools, etc.) wouldn't need to exist. There could even be earlid-downed marches to protest against the high decibel rants pushed at us from all nooks & corners of the planet.

Perhaps Kikazaru/ Mikazaru, the first macaque who prescribed "hear no evil" to us, would be seen jumping around like never before. Only to be reminded the very next minute by his two wise buddies of its futility. And how their respective advice has been largely ignored despite there being lids for the eyes & the mouth. Finally, we would perhaps be able to truly experience the world in the way that people who can't hear experience it, even today. So yes, I agree with Mr. Lichtenberg that it would be a real blessing!

In that same spirit, we could also do with another design change, one that might already exist in a parallel universe somewhere. Would be nice to shift humans from a 4-hourly hunger cycle to a more pragmatic 4-monthly one. No getting hungry every few hours, no snacking, no gorging, no fun (seriously)?

There'd instead be a triannual feasting day for the individual. That would be the day to celebrate, bigger than any birthday & anniversary combined. The person concerned would probably down a few hundred kilos of their favorite gourmet dishes. Gastronomic desires fulfilled like there's no tomorrow. There really wouldn't be one for the next four months. Guests meanwhile, would be making merry - singing, dancing, & everything else - awaiting their day of feasting.

There are stories about Indian mystics & sadhus who achieved a state of being, or were just built differently, where they didn't need any food for days together. But they seem to have gone extinct, save for some hunger artists. On the other hand, many animal species are known to feed in cycles with long fasting breaks in between. The camel for instance carries a special biological organ (the hump) to store food (fat) reserves, & can go without food & water for weeks together. In nature the concept is not so rare, a few hundred genes at play that's all.     

Yet, the impact from a triannual feeding cycle to our social structures would be unimaginable. For instance the movie screenplay where the protagonist is complaining about the paapi pet (evil stomach) would simply be gone. Hunger, malnourishment, perhaps even poverty would be over. Or is that taking it too far? Newer enterprises would no doubt emerge that would work their way to profitability around the altered version of this fundamental base human need for food. In any case, there would be a paradigm shift on our social, economic & policy frameworks all over. Our entire existence would be markedly different, & hopefully better.

Monday, March 4, 2019

AB Testing To Establish Causation

A/B testing is a form of testing performed to compare the effectiveness of different versions of a product with randomly distributed (i.i.d.) end-user groups. Each group gets to use only one version of the product. Assignment of any particular user to a specific group is done at random, without any biases, etc. The user composition of the different groups is assumed to be similar, to the extent that switching the version of the product between any two groups at random would make no difference to the overall results of the test.

A/B testing is an example of a simple randomized controlled trial. This sort of test helps establish a causal relationship between a particular element of change & the measured outcome. The element of change could be something like a change in the location of certain elements of a page, adding/ removing a feature of the product, and so on. The outcome measured could be the impact on additional purchases, clicks, conversion rate, time of engagement, etc.

During the testing period, users are randomly assigned to the different groups. The key aspect of the test is that the random assignment of otherwise similar users is done in real-time, so that no unknown confounding factors (demographics, seasonality, tech. competence, background, etc.) have any impact on the test objective. When tested with a fairly large number of users, almost every group will end up with a good sample of users that is representative of the underlying population.

One of the groups (the control group) is shown the baseline version (maybe an older version of an existing product) of the product against which the alternate versions are compared. For every group the proportion of users that fulfilled the stated objective (purchased, clicked, converted, etc.) is captured.
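
As a rough illustration of the random assignment step, the following Python sketch assigns each incoming user to a group with equal probability. The function name assign_groups, the group labels & the fixed seed (used only to make the run repeatable) are assumptions for this sketch.

```python
import random
from collections import Counter
from typing import Dict, Hashable, Iterable, Optional, Sequence


def assign_groups(user_ids: Iterable[Hashable],
                  groups: Sequence[str] = ("control", "test"),
                  seed: Optional[int] = None) -> Dict[Hashable, str]:
    """Assign every user to one group, uniformly at random."""
    rng = random.Random(seed)
    # The choice ignores everything about the user, so no attribute can bias it.
    return {uid: rng.choice(groups) for uid in user_ids}


assignment = assign_groups(range(10_000), seed=42)
group_sizes = Counter(assignment.values())
print(group_sizes)  # the two groups come out roughly equal in size
```

With a large number of users the groups end up roughly equal in size, & (by the same argument) roughly equal in their mix of any hidden attributes.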

The proportions (pi) are then used to compute the test statistics Z-value (assuming a large normally distributed user base), confidence intervals, etc. The null hypothesis being that the proportions (pi) are all similar/ not significantly different from the proportion of the control group (pc).

For the two version A/B test scenario

   Null hypothesis H0(p1 = pc) vs. the alternate hypothesis H1(p1 != pc).

   p1 = X1/ N1 (Test group)
   pc = Xc/ Nc (Control group)
   p_total = (X1 + Xc)/(N1 + Nc) (For the combined population) ,
            where X1, Xc: Number of users from groups test & control that fulfilled the objective,
                & N1, Nc: Total number of users from test & control group

  Z = Observed_Difference / Standard_Error
    = (p1 - pc)/ sqrt(p_total * (1 - p_total) * (1/N1 + 1/Nc))



The computed Z value is compared against the critical values from a standard normal table. If the magnitude of Z is greater than 1.96 (or 2.58), the null hypothesis can be rejected at a confidence level of 95% (or 99%). This would indicate that the behavior of the test group is significantly different from the control group, the likely cause for which being the element of change brought in by the alternate version of the product. On the other hand, if the magnitude of Z is less than 1.96, the null hypothesis is not rejected & the element of change is not considered to have made any significant impact on fulfillment of the stated objective.
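
The Z computation above translates directly into a few lines of Python. The function name two_proportion_z & the sample counts are made up for illustration; the two-sided p-value is an addition that saves the normal table lookup, using NormalDist from the standard library.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def two_proportion_z(x1: int, n1: int, xc: int, nc: int) -> Tuple[float, float]:
    """Return (Z, two-sided p-value) for H0: p1 = pc."""
    p1 = x1 / n1                       # test group proportion
    pc = xc / nc                       # control group proportion
    p_total = (x1 + xc) / (n1 + nc)    # pooled proportion
    standard_error = sqrt(p_total * (1 - p_total) * (1 / n1 + 1 / nc))
    z = (p1 - pc) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts: 120 of 1000 test users converted vs. 100 of 1000 control users.
z, p_value = two_proportion_z(120, 1000, 100, 1000)
print(f"Z = {z:.2f}, p-value = {p_value:.3f}")
print("reject H0 at 95%" if abs(z) > 1.96 else "do not reject H0 at 95%")
```

For these made-up counts Z comes to about 1.43, below 1.96, so the 2 percentage point difference would not be considered significant at the 95% level.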

Sunday, March 3, 2019

Mendelian Randomization

The second law of Mendelian inheritance is about independent assortment of alleles at the time of gamete (sperm & egg cells) formation. Therefore within the population of any given species, genetic variants are likely to be distributed at random, independent of any external factors. This insight forms the basis of Mendelian Randomization (MR) technique, typically applied in studies of epidemiology.

Studies of epidemiology try to establish the causal link (given some known association) between a particular risk factor & a disease. For example, smoking to cancer, blood pressure to stroke, etc. The association in many cases is found to be non-causal, or reverse causal, etc. Establishing the true effect becomes challenging due to the presence of confounding factors such as social, behavioral, environmental, physiological, etc. MR helps to tackle the confounding factors in such situations.

In MR, genetic variants (polymorphisms) or genotypes that have an effect similar to the risk factor/ exposure are identified. An additional constraint is that the genotype must not have any direct influence on the disease. The existence of the genotype in the population is random, independent of any external influence. So a raised (or unchanged) incidence of the disease within the population possessing the genotype supports (or refutes) that the risk factor is actually a cause of the disease. Several studies based on Mendelian randomization have been done successfully.

Example 1: There could be a study to establish the causal relationship (given observed association) between raised cholesterol levels & coronary heart disease (CHD). Given the presence of several confounding factors such as age, physiology, smoking/ drinking habits, reverse causation (CHD causing raised cholesterol), etc., the MR approach would be beneficial.

The approach would be to identify a genotype/ gene variant that is known to be linked to an increase in total cholesterol levels (but has no direct bearing on CHD). The propensity for CHD is tested for all subjects having the particular genotype, which if/ when found much higher than the general population (not possessing the gene variant) establishes that raised cholesterol levels have a causal relationship with CHD.

Instrumental Variables

MR is an application of the statistical technique of instrumental variables (IV) estimation. IV technique is also used to establish causal relationships in the presence of confounding factors.

When applied to regression models, the IV technique is particularly beneficial when the explanatory variables (covariates) are correlated with the error term & give biased results. The choice of IV is such that it only induces changes in the explanatory variables, without having any independent effect on the dependent variable. The IV must not be correlated with the error term. Selecting an IV that fulfills these criteria is largely done through an analytical process supported by some observational data, & by leveraging relevant priors about the field of study.

Equating MR to IV 
  • Risk Factor/ Effect = Explanatory Variable, 
  • Disease = Dependent Variable
  • Genotype = Instrumental Variable 
Selection of genotype (IV) is based on prior knowledge of genes, from existing sources, literature, etc.
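
To see why the instrument helps, here is a small simulated sketch in Python. Everything in it is an assumption made for illustration: the data is synthetic, the true causal effect is fixed at 0.5, a hidden confounder pushes both the risk factor & the disease outcome, & the IV estimate used is the simple ratio cov(genotype, outcome)/ cov(genotype, risk factor), which is one basic form of IV estimation for a single instrument.

```python
import random
from typing import List, Tuple


def covariance(a: List[float], b: List[float]) -> float:
    """Population covariance of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def simulate(n: int = 20_000, true_effect: float = 0.5,
             seed: int = 0) -> Tuple[float, float]:
    """Return (naive regression slope, IV estimate) on synthetic data."""
    rng = random.Random(seed)
    genotype, risk_factor, outcome = [], [], []
    for _ in range(n):
        g = 1.0 if rng.random() < 0.5 else 0.0   # variant present/ absent, at random
        u = rng.gauss(0, 1)                      # hidden confounder
        x = g + u + rng.gauss(0, 1)              # risk factor: raised by genotype & confounder
        y = true_effect * x + 2 * u + rng.gauss(0, 1)  # outcome: no direct genotype effect
        genotype.append(g)
        risk_factor.append(x)
        outcome.append(y)
    naive = covariance(risk_factor, outcome) / covariance(risk_factor, risk_factor)
    iv = covariance(genotype, outcome) / covariance(genotype, risk_factor)
    return naive, iv


naive, iv = simulate()
print(f"naive slope = {naive:.2f}, IV estimate = {iv:.2f}, true effect = 0.50")
```

The naive slope is inflated by the confounder, whereas the genotype-based estimate lands close to the true effect, since the genotype is unrelated to the confounder by construction.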

Saturday, March 2, 2019

Mendelian Inheritance

Gregor Mendel was a phenomenal scientist of the nineteenth century. A monk by profession, he is considered the founder of modern genetics. In the 1850-60s, in the garden of his monastery, he performed systematic hybridization experiments with the pea plant over successive generations (second - F2, third - F3, etc.). Through these experiments he was able to conclude that traits are inherited by the progeny from their ancestors as discrete units with a perfectly binary (either/ or) characteristic, as opposed to the then existing notion of a blending of traits.

The following are the laws of Mendelian inheritance:
  • Law of Segregation: During gamete (sperm or egg cell) formation, allele pairs separate out at random & only one of the alleles for each gene is carried by each gamete.
  • Law of Independent Assortment: Allele pairs of genes for different traits segregate independently of one another during the formation of gametes.
  • Law of Dominance: Some alleles are dominant while others are recessive. When present the dominant ones dominate.

Where,

 Human body is

  made of -> Cells
        (Building block of life, contain biomolecules such as Protein, DNA)

    containing --> Chromosomes
        (One DNA Molecule + some proteins in the cell's nucleus, double helix shape,
         46 in humans: 23 inherited from each parent)

      having ---> Genes
        (Code to synthesize proteins & biocomponents, 2 Alleles or variant forms
         of a trait, one inherited per parent)

        that get coded to ----> Proteins
        (Large biomolecules of amino acid chains, participate in vast variety of
         cellular processes & biological functions, metabolic reactions, signaling,
         etc. Exist within & get recycled by the cells)
                                                 
While the findings of Mendel were not popular initially, they were re-discovered almost half a century later. Though Mendel limited the experiments to traits that were governed by a single gene, the results were significant. These helped form our understanding of genetics & heredity (genes) that continues to this day.
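
The laws of segregation & dominance can be illustrated with a tiny Python sketch of a single-gene cross (a Punnett square). The notation is an assumption of the sketch: a genotype is a two letter string, with an upper case letter for the dominant allele & a lower case letter for the recessive one; the function names are made up.

```python
from collections import Counter
from itertools import product


def cross(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes: each parent passes on one of its two alleles."""
    # sorted() puts the upper case (dominant) allele first, so "aA" becomes "Aa".
    offspring = ("".join(sorted(pair)) for pair in product(parent1, parent2))
    return Counter(offspring)


def phenotypes(genotypes: Counter) -> Counter:
    """A single dominant allele is enough for the dominant trait to show."""
    counts = Counter()
    for genotype, n in genotypes.items():
        shows_dominant = any(allele.isupper() for allele in genotype)
        counts["dominant" if shows_dominant else "recessive"] += n
    return counts


f2 = cross("Aa", "Aa")   # two hybrid parents
print(f2)                # genotypes in a 1 : 2 : 1 ratio (AA : Aa : aa)
print(phenotypes(f2))    # traits in the 3 : 1 ratio Mendel observed
```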

Sunday, February 24, 2019

Memory Gene

Have a conjecture that soon someone's going to be discovering a memory gene in the human genome. This doesn't seem to have been done or published in any scientific literature so far. The concept of Genetic Memory from an online game is close, but then that's fiction.

The idea of the memory gene is that this gene on the human genome will act as a memory card. Whatever data the individual writes to the memory gene in their lifetime can be later retrieved by their progenies in their lifetimes. The space available in the memory gene would be small compared to what is available in the brain. If the disk space of the human brain is a Petabyte (= 10^12 Kilobytes), space of the memory gene would be about 10 Kilobytes. So very little can be written to the memory gene.

Unlike the brain, to which every bit of information (visual, aural, textual, etc.) can be written at will, writing to the memory gene would require some serious intent & need. Writing to the memory gene would be more akin to etching on a brass plate - strenuous but permanent. The intent would be largely triggered by the individual's experience(s), particularly ones that trigger strong emotions perhaps beneficial to survival. Once written to the memory gene this information would carry forward to the offspring.

The human genome is known to have about 2% coding DNA & the rest non-coding DNA. The coding portions carry the instructions (genetic instructions) to synthesize proteins, while the purpose of the non-coding portions is not clearly known so far. The memory gene is likely to have a memory addressing mechanism, followed by the actual memory data stored in the large non-coding portion.

At the early age of 2 or 3 years, when a good portion of brain development has happened in the individual, the memory recovery will begin. The mRNA, ribosome & the rest of the translation machinery will get to work in translating the genetic code from the memory gene to synthesize the appropriate proteins & biomolecules of the brain cell. In the process the memory data would be restored block by block in the brain. This would perhaps happen over a period of low activity such as night's sleep. The individual would later awaken to new transferred knowledge about unknowns, that would appear to be intuitive. Since the memory recovery would take place at an early age, conflicts in experiences between the individual and the ancestor wouldn't happen.
  
These are some basic features of the very complex memory gene. As mentioned earlier, this is purely a conjecture and shouldn't be taken otherwise. Look forward to exploring genuine scientific research in this space as it gets formalized & shared.

Update 1 (27-Aug-19):
For some real research take a look at the following:

 =>> Arc Gene:
 The neuronal gene Arc is required for synaptic plasticity and cognition. Its products resemble retroviruses/ retrotransposons in the way they transfer between cells, followed by activity dependent translation. These studies throw light on a completely new way through which neurons could send genetic information to one another. More details from the 2018 publication available here:
  •    https://www.nih.gov/news-events/news-releases/memory-gene-goes-viral
  •    https://www.cell.com/cell/comments/S0092-8674(17)31504-0
  •    https://www.ncbi.nlm.nih.gov/pubmed/29328915/

  =>> Memories pass between generations (2013):
 When grand-parent generation mice are taught to fear an odor, their next two generations (children & grand-children) retain the fear:
  •    https://www.bbc.com/news/health-25156510
  •    https://www.nature.com/articles/nn.3594
  •    https://www.nature.com/articles/nn.3603

 =>> Epigenetics & Cellular Memory (1970s onwards):
  •    https://en.wikipedia.org/w/index.php?title=Genetic_memory_(biology)&oldid=882561903
  •    https://en.wikipedia.org/w/index.php?title=Genomic_imprinting&oldid=908987981

  =>> Psychology - Genetic Memory (1940s onwards): Largely focused on the phenomenon of knowing things that weren't explicitly learned by an individual:
  •    https://blogs.scientificamerican.com/guest-blog/genetic-memory-how-we-know-things-we-never-learned/
  •    https://en.wikipedia.org/w/index.php?title=Genetic_memory_(psychology)&oldid=904552075
  •    http://www.bahaistudies.net/asma/The-Concept-of-the-Collective-Unconscious.pdf
  •    https://en.wikipedia.org/wiki/Collective_unconscious

Friday, February 22, 2019

Taller Woman Dance Partner

In the book Think Stats, there's an exercise to work out the percentage of dance couples where the woman is taller, when paired up at random. Mean heights (cm) & their variances are given as 178 & 59.4 for men, & 163 & 52.8 for women.

The two height distributions for men & women can be assumed Normal. The solution is to work out the total likelihood, across the two curves, of the condition height_woman > height_man holding. This will need to be done for every single height point (h), i.e. over the entire spread of the two curves (-∞, ∞). In other words, at each h the share of women with height in a small window at h is multiplied by the integral of the height curve for men over (-∞, h), i.e. the men having height < h, & these products are summed over all h.
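
As a cross-check on the integral, there is also a shortcut (a different technique from the one described above): if the two heights are independent & Normal, then their difference (woman - man) is also Normal, with mean 163 - 178 & variance 52.8 + 59.4, & the answer is simply the share of that curve lying above zero. A minimal Python sketch, assuming independence & using NormalDist from the standard library (the function name is made up):

```python
from math import sqrt
from statistics import NormalDist


def prob_woman_taller(mean_m: float = 178.0, var_m: float = 59.4,
                      mean_w: float = 163.0, var_w: float = 52.8) -> float:
    """P(height_woman > height_man) for independent Normal heights."""
    # The difference of two independent Normals is Normal:
    # the means subtract, the variances add.
    difference = NormalDist(mu=mean_w - mean_m, sigma=sqrt(var_w + var_m))
    return 1.0 - difference.cdf(0.0)


exact = prob_woman_taller()
print(f"{exact:.2%}")  # a little under 8%
```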

There are empirical solutions where N data points are drawn from two Normal distributions with appropriate mean & sd (variance) for men & women. These are paired up at random (averaged over k-runs) to compute the number of pairs with taller women.
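
A minimal sketch of such an empirical solution in Python, using a single run rather than an average over k runs. The function name, the number of pairs & the seed are arbitrary choices for the illustration.

```python
import random
from math import sqrt


def simulate_taller_women(pairs: int = 100_000, seed: int = 1) -> float:
    """Share of randomly paired couples in which the woman is taller."""
    rng = random.Random(seed)
    sd_women, sd_men = sqrt(52.8), sqrt(59.4)
    # Draw one woman & one man per pair; True counts as 1 in the sum.
    taller = sum(rng.gauss(163.0, sd_women) > rng.gauss(178.0, sd_men)
                 for _ in range(pairs))
    return taller / pairs


estimate = simulate_taller_women()
print(f"{estimate:.2%}")
```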

The area computation can also be approximated by limiting to the ±3 sd range (includes 99.7%) of height values on either side of the two curves (140cm to 185cm). Then by sliding along the height values (h) starting from h=185 down in steps of size (s) s=0.5 or so, compute z at each point:

z = (h - m)/ sd, where m & sd are the corresponding mean & standard deviation of the two curves.

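Refer to the one-sided standard normal value to compute the percentage of women to the right of h (>z) & the percentage of men to the left of h (<z). The percentage of women falling within the step window at h is the difference between the women's right-side values at h & at the next point (h + s). The product of this in-between percentage for women & the percentage of men to the left of h is the corresponding value at point h. A summation of the same over all h results in the final percentage value. The equivalent solution using NORMDIST yields a likelihood of ~7.5%, slightly below expected (due to the coarse step size of 0.5).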
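
The step-wise summation can be sketched in Python as below, with NormalDist from the standard library standing in for the normal table/ NORMDIST. The function name & the lower bound of 140 are assumptions of this sketch; the open-ended top window follows the worksheet's first row at 185.5.

```python
from math import sqrt
from statistics import NormalDist


def stepwise_likelihood(low: float = 140.0, high: float = 185.5,
                        step: float = 0.5) -> float:
    """Sum, over height windows, (share of women in window) * (share of men below it)."""
    women = NormalDist(163.0, sqrt(52.8))
    men = NormalDist(178.0, sqrt(59.4))
    # Open-ended top window (high, infinity): all women taller than `high`.
    total = (1.0 - women.cdf(high)) * men.cdf(high)
    for i in range(round((high - low) / step)):
        h = low + i * step
        women_in_window = women.cdf(h + step) - women.cdf(h)  # r_f_delta
        men_to_left = men.cdf(h)                              # l_m
        total += women_in_window * men_to_left
    return total


approx = stepwise_likelihood()
print(f"{approx:.2%}")
```

Since the men's percentage is taken at the left edge of each window, every term slightly undercounts, which is why the sum comes out a little below the exact figure; a smaller step narrows the gap.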

C1 plots the percent. of women to the right & percent. of men to the left at different height values. C2 is the likelihood of seeing a couple with a taller woman within each step window of size 0.5. Interestingly, the peak in C2 is between heights 172-172.5, about 1.24 sd from women's mean (163) & 0.78 sd from men's mean (178). The spike at the end of the curve C2 at point 185.5 is for the likelihood of all heights > 185.5, i.e. the window (185.5,∞).

Playing around with different height & variance values yields other final results. For instance, at the moment the two means are separated by about 2*sd. If this is reduced to 1*sd, by moving the mean height of women up (or of men down) to about 170.5 cm, the final likelihood jumps to about 23%. This is understandable since the population now has far more taller women. The height variance for men is higher than that for women. Setting both to the identical value of 52.8 (fewer shorter men) lowers the percentage to about 6.9%, while setting both to 59.4 (more taller women) increases it to about 8.1%.





Sample data points from Confidence_Interval_Heights.ods worksheet:


Columns: 
  • Point (p): height point
  • Women: z_f (z using p), r_f (% to right), r_f_delta (% in between)
  • Men: z_m (z using p), l_m (% to left)
  • Likelihood (l) = l_m * r_f_delta

Point (p)   z_f         r_f         r_f_delta   z_m         l_m         l
185.5       3.0964606   0.0009792   0.0009792   0.9731237   0.8347541   0.0008174
185         3.0276504   0.0012323   0.0002531   0.9082488   0.8181266   0.0002071
184.5       2.9588401   0.0015440   0.0003117   0.8433739   0.8004903   0.0002495
184         2.8900299   0.0019260   0.0003820   0.7784989   0.7818625   0.0002987
183.5       2.8212196   0.0023921   0.0004660   0.7136240   0.7622702   0.0003553
183         2.7524094   0.0029579   0.0005659   0.6487491   0.7417497   0.0004197
182.5       2.6835992   0.0036417   0.0006838   0.5838742   0.7203475   0.0004926
182         2.6147889   0.0044641   0.0008224   0.5189993   0.6981194   0.0005741

...