In the book Think Stats, there's an exercise to work out the percentage of dance couples in which the woman is taller, when partners are paired up at random. The mean heights (cm) & their variances (cm²) are given as 178 & 59.4 for men, & 163 & 52.8 for women.
The two height distributions for men & women can be assumed Normal. The solution is to work out the total probability, under the two curves, that the condition height_woman > height_man holds. This needs to be done at every single height point (h), i.e. over the entire spread of the two curves (-∞, ∞). In other words, at each h, the share of women whose height falls in a small window at h is multiplied by the integral of the height curve for men over (-∞, h), i.e. the share of men with height < h, & these products are added up over all h.
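As a cross-check on the integral described above (not part of the original worksheet), the same probability has a closed form: if the two heights are independent, which random pairing implies, the difference woman minus man is itself Normal with mean 163 - 178 and variance 52.8 + 59.4. A minimal sketch using Python's standard library, with variable names chosen here for illustration:

```python
from math import sqrt
from statistics import NormalDist

# Parameters from the exercise (means in cm, variances in cm^2).
mu_w, var_w = 163.0, 52.8   # women
mu_m, var_m = 178.0, 59.4   # men

# Assumption: heights are independent, so D = woman - man is Normal
# with mean mu_w - mu_m and variance var_w + var_m.
diff = NormalDist(mu_w - mu_m, sqrt(var_w + var_m))

# P(woman taller) = P(D > 0)
p_taller_woman = 1.0 - diff.cdf(0.0)
print(f"P(woman taller) = {p_taller_woman:.4f}")
```

This comes out at roughly 7.8%, which gives a reference value for the approximations discussed here.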
There are empirical solutions where N data points are drawn from each of the two Normal distributions, with the appropriate mean & sd (square root of the variance) for men & women. These are paired up at random, & the fraction of pairs with a taller woman is computed (averaged over k runs).
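The empirical approach can be sketched roughly as follows. This is an illustration only: the function name `simulate`, the sample sizes & the seed are assumptions, not values from the book or the worksheet. Drawing one woman & one man independently for each pair is equivalent to pairing two samples at random.

```python
import random

def simulate(n_pairs=50_000, k_runs=5, seed=42):
    """Estimate the share of random couples in which the woman is taller."""
    rng = random.Random(seed)            # fixed seed so runs are repeatable
    sd_w, sd_m = 52.8 ** 0.5, 59.4 ** 0.5  # sd = sqrt(variance)
    shares = []
    for _ in range(k_runs):
        # Each iteration draws one woman and one man independently.
        taller = sum(
            rng.gauss(163.0, sd_w) > rng.gauss(178.0, sd_m)
            for _ in range(n_pairs)
        )
        shares.append(taller / n_pairs)
    return sum(shares) / k_runs          # average over the k runs

estimate = simulate()
print(f"simulated share of couples with a taller woman: {estimate:.3f}")
```

With a few hundred thousand pairs in total, the estimate should settle within a fraction of a percentage point of the analytical answer.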
The area computation can also be approximated by limiting the height values to roughly the ±3 sd range (which includes 99.7%) around the women's mean (140cm to 185cm). Then, sliding along the height values (h) from h=185 downwards in steps of size s=0.5 or so, compute z for each curve at each point:
z = (h - m)/sd, where m & sd are the mean & standard deviation of the corresponding curve.
Look up the one-sided standard normal value to get the percentage of women to the right of h (>z) & the percentage of men to the left of h (<z). The difference between the women's values at successive points gives the percentage of women within each step window. The product of that & the percentage of men to the left is the corresponding value at point h. Summing these over all h gives the final percentage value. The equivalent solution using NORMDIST yields a likelihood of ~7.5%, slightly below expected (due to the coarse step size of 0.5).
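The sliding-window sum above can be sketched in Python as a stand-in for the spreadsheet's NORMDIST columns. The function name `taller_woman_share` & the exact grid bounds (140 to 185.5, taken from the range & the sample rows in this post) are assumptions for illustration. The variable names in the comments follow the worksheet's column labels.

```python
from statistics import NormalDist

def taller_woman_share(mu_w=163.0, var_w=52.8, mu_m=178.0, var_m=59.4,
                       lo=140.0, hi=185.5, step=0.5):
    """Sum, over windows of width `step`, (women in window) * (men below h)."""
    women = NormalDist(mu_w, var_w ** 0.5)
    men = NormalDist(mu_m, var_m ** 0.5)
    total = 0.0
    prev_right = 0.0                      # share of women above +infinity
    n_steps = round((hi - lo) / step)
    for i in range(n_steps + 1):
        h = hi - i * step                 # slide downwards from hi to lo
        right = 1.0 - women.cdf(h)        # r_f: share of women to the right of h
        in_window = right - prev_right    # r_f_delta: women between h and previous point
        left = men.cdf(h)                 # l_m: share of men to the left of h
        total += left * in_window         # l = l_m * r_f_delta
        prev_right = right
    return total

share = taller_woman_share()
print(f"window-sum estimate: {share:.4f}")
```

Because each window uses the men's share at its lower edge, the sum slightly undercounts, which is consistent with the result landing a little below the exact value. A smaller `step` should narrow that gap.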
C1 plots the percentage of women to the right & the percentage of men to the left at different height values. C2 is the likelihood of seeing a couple with a taller woman within each step window of size 0.5. Interestingly, the peak in C2 is between heights 172-172.5, about 1.24 sd above the women's mean (163) & 0.78 sd below the men's mean (178). The spike at the end of curve C2 at point 185.5 is the likelihood for all heights > 185.5, i.e. the window (185.5, ∞).
Playing around with different height & variance values yields other final results. For instance, at the moment the two means are separated by about 2*sd. If we reduce this to 1*sd by moving the mean height of women (or men) to about 170.5 cm, the final likelihood jumps to about 23%. This is understandable since the population now has far more taller women. The height variance for men is larger than that for women. Setting both to the same value of 52.8 (fewer shorter men) lowers the percentage to about 6.9%, whereas setting both to 59.4 (more taller women) increases it to about 8.1%.
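These what-if scenarios can be re-run with a small parameterised version of the window sum. This is a sketch under assumptions: the helper `window_sum` & the scenario labels are hypothetical, & it assumes the 140 to 185.5 grid with step 0.5 was kept fixed across scenarios, which the post does not state.

```python
from statistics import NormalDist

def window_sum(mu_w, var_w, mu_m, var_m, lo=140.0, hi=185.5, step=0.5):
    """Left-edge window sum of (women in window) * (men below h)."""
    women = NormalDist(mu_w, var_w ** 0.5)
    men = NormalDist(mu_m, var_m ** 0.5)
    points = [hi - i * step for i in range(round((hi - lo) / step) + 1)]
    total, prev_right = 0.0, 0.0
    for h in points:
        right = 1.0 - women.cdf(h)              # women to the right of h
        total += men.cdf(h) * (right - prev_right)
        prev_right = right
    return total

# (women's mean, women's variance, men's mean, men's variance)
scenarios = {
    "baseline": (163.0, 52.8, 178.0, 59.4),
    "women's mean 170.5": (170.5, 52.8, 178.0, 59.4),
    "both variances 52.8": (163.0, 52.8, 178.0, 52.8),
    "both variances 59.4": (163.0, 59.4, 178.0, 59.4),
}
results = {name: window_sum(*params) for name, params in scenarios.items()}
for name, value in results.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions the four scenarios should come out close to the percentages quoted above.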
Sample data points from Confidence_Interval_Heights.ods worksheet:
Point (p) | z_f (women, using p) | r_f = % of women to right | r_f_delta = % of women in between | z_m (men, using p) | l_m = % of men to left | l = l_m * r_f_delta (likelihood) |
185.5 | 3.0964606 | 0.0009792 | 0.0009792 | 0.9731237 | 0.8347541 | 0.0008174 |
185 | 3.0276504 | 0.0012323 | 0.0002531 | 0.9082488 | 0.8181266 | 0.0002071 |
184.5 | 2.9588401 | 0.0015440 | 0.0003117 | 0.8433739 | 0.8004903 | 0.0002495 |
184 | 2.8900299 | 0.0019260 | 0.0003820 | 0.7784989 | 0.7818625 | 0.0002987 |
183.5 | 2.8212196 | 0.0023921 | 0.0004660 | 0.7136240 | 0.7622702 | 0.0003553 |
183 | 2.7524094 | 0.0029579 | 0.0005659 | 0.6487491 | 0.7417497 | 0.0004197 |
182.5 | 2.6835992 | 0.0036417 | 0.0006838 | 0.5838742 | 0.7203475 | 0.0004926 |
182 | 2.6147889 | 0.0044641 | 0.0008224 | 0.5189993 | 0.6981194 | 0.0005741 |
...