class: center, middle, inverse, title-slide # Discrete Probability Distributions ##
Introduction to Global Health Data Science ###
Back to Website
###
Prof. Amy Herring --- layout: true <div class="my-footer"> <span> <a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a> </span> </div> --- class: middle # Probability Distributions --- ### Probability Distributions When we learned how to find probabilities using basic principles, we focused on one particular outcome or event, like the probability of being a smoker. We now explore the bigger picture by exploring all possible values of a discrete random variable, along with their associated probabilities. This list of possible values and probabilities is called the *probability distribution* of the random variable. --- ### Why do we care about probability distributions? We need probability distributions to make statements about how likely an event may be. For example, suppose we are interested in comparing two groups of pancreatic cancer patients, with one group on the standard treatment and one group on an experimental new treatment. The statistical test we use to compare the two groups depends on the variable of interest `\(X\)` and its probability distribution. One of you could define `\(X\)` to be a 0/1 variable indicating whether each patient survives 1 year, and someone else could instead define X as the number of months a patient survives. In these two cases, each definition of `\(X\)` has a different probability distribution, and you would use a different type of test for each case. --- ### Example: Y Chromosomes Consider Y chromosomes in the US population. 51.2% of births are to babies with Y chromosomes, and 48.8% of babies do not have Y chromosomes. Thus the probability of having a baby with a Y chromosome is 0.512. - The probability of an event is `\(\geq 0\)` and `\(\leq 1\)`. - The sum of the probabilities of all possible events is exactly 1 (e.g., 0.512+0.488=1). - So the probability of having a baby without a Y chromosome is 0.488. --- ### Bernoulli Distribution Consider a dichotomous (two-level) random variable `\(Y\)`. By definition, `\(Y\)` must assume one of two possible values, e.g. - failure or success - dead or alive - Duke student or not - current smoker or not - heads or tails (coin flip) A random variable of this type is known as a *Bernoulli* random variable, and we describe the probability of response using the parameter `\(\pi\)`. The distribution is named for [Jacob Bernoulli](https://en.wikipedia.org/wiki/Jacob_Bernoulli), who was part of a famous family of mathematicians in Switzerland. --- ### Bernoulli random variable These variables are often coded so that `\(Y=1\)` is called an *event* or *success*, `\(Y=0\)` is called a *failure*, and `\(\pi\)` is defined as the probability of a success. (This is not required, but you need to remember which level maps to `\(\pi\)`, and the other level then has probability `\(1-\pi\)`.) Examples: - Coin flip: let `\(Y=1\)` if heads and `\(Y=0\)` if tails, then `\(P(Y=1)=\pi=0.5\)` - Vegetarian in US: `\(Y=1\)` if vegetarian and `\(Y=0\)` if not, then `\(P(Y=1)=\pi=0.05\)` and `\(P(Y=0)=1-\pi=1-0.05=0.95\)` - Vegetarian in India: `\(Y=1\)` if vegetarian and `\(Y=0\)` if not, then `\(P(Y=1)=\pi=0.31\)` and `\(P(Y=0)=1-\pi=1-0.31=0.69\)` --- ### Bernoulli distribution - `\(Y\)` takes value 1 with probability `\(\pi\)` and value 0 with probability `\(1-\pi\)` - `\(P(Y=y)=\pi^y(1-\pi)^{1-y}\)` - `\(P(Y=1)=\pi^1(1-\pi)^0=\pi\)` (remember `\(x^0=1\)` for any `\(x\)`) - `\(P(Y=0)=\pi^0(1-\pi)^1=1-\pi\)` - We don't really need all this formality as it's probably easier just to keep track of `\(\pi\)` and `\(1-\pi\)` - However, we want to extend this to a more complex setting -- for example, in a randomly-selected group of 3 high school students, how surprising would it be to get 2 who have smoked e-cigarettes in the past month? --- ### Case Study: E-Cigarettes - The [CDC reports](https://www.cdc.gov/tobacco/data_statistics/fact_sheets/youth_data/tobacco_use/index.htm) that 19.6% of high school students have smoked e-cigarettes in the past 30 days. We'll round this to 20% for simplicity. - `\(P(Y=1)=P(Smoker)=\pi=0.2\)` and `\(P(Y=0)=0.8\)` - Now suppose we randomly select two independent high school students and define a new random variable `\(X\)` representing the number of smokers. X can take the values 0, 1, or 2. Let `\(Y_1\)` be the smoking status of the first student and `\(Y_2\)` be the smoking status of the second student, where `\(Y_j\)`=1 if student `\(j\)` smokes and takes value 0 otherwise. .pull-left[ Next we'll talk about how to get the *probability distribution* of `\(X\)`. ] .pull-right[ | `\(Y_1\)` | `\(Y_2\)` | `\(X\)` | `\(P(X)\)` | |:----:|:----:|:----:|:----:| | 0 | 0 | 0 | | | 1 | 0 | 1 | | | 0 | 1 | 1 | | | 1 | 1 | 2 | | ] --- Let `\(A_1\)` be the event that `\(Y_1=1\)` and let `\(A_2\)` be the event that `\(Y_2=1\)`. Because the people are independent, then `$$\begin{aligned} P(Y_1=Y_2=1) & = P(A_1 \cap A_2) \\ & = P(A_1) P(A_2 \mid A_1) \\ & = P(A_1) P(A_2) \\ & = \pi \times \pi = 0.2(0.2)=0.04. \end{aligned}$$` --- .pull-left[ Now we can fill in the bottom row of the probability distribution of `\(X\)`. ] .pull-right[ | `\(Y_1\)` | `\(Y_2\)` | `\(X\)` | `\(P(X)\)` | |:----:|:----:|:----:|:----:| | 0 | 0 | 0 | | | 1 | 0 | 1 | | | 0 | 1 | 1 | | | 1 | 1 | 2 | `\(0.02 \times 0.02 =0.04\)` | ] --- .pull-left[ It's straightforward to fill in the rest of the table. ] .pull-right[ | `\(Y_1\)` | `\(Y_2\)` | `\(X\)` | `\(P(X)\)` | |:----:|:----:|:----:|:----:| | 0 | 0 | 0 | `\(0.8 \times 0.8 = 0.64\)` | | 1 | 0 | 1 | `\(0.2 \times 0.8 = 0.16\)` | | 0 | 1 | 1 | `\(0.8 \times 0.2 = 0.16\)` | | 1 | 1 | 2 | `\(0.2 \times 0.2 =0.04\)` | ] --- .pull-left[ Recall our table: | `\(Y_1\)` | `\(Y_2\)` | `\(X\)` | `\(P(X)\)` | |:----:|:----:|:----:|:----:| | 0 | 0 | 0 | `\(0.8 \times 0.8 = 0.64\)` | | 1 | 0 | 1 | `\(0.2 \times 0.8 = 0.16\)` | | 0 | 1 | 1 | `\(0.8 \times 0.2 = 0.16\)` | | 1 | 1 | 2 | `\(0.2 \times 0.2 =0.04\)` | ] .pull-right[ We can clean up the table to get the probability distribution of `\(X\)`: | | | | | |:--:|:--:|:--:|:--:| | `\(X\)` | 0 | 1 | 2 | | `\(P(X=x)\)` | 0.64 | 0.32 | 0.04 | So if we randomly sample two US high schoolers, the probability that both are recent e-cig smokers is 0.04 (4% chance), the probability only one recently smoked is 0.32 (this can happen two ways -- either only the first smoked or only the second smoked), and the probability neither smoked e-cigs recently is 0.64. ] --- Now suppose we randomly sample 3 independent high school students! | `\(Y_1\)` | `\(Y_2\)` | `\(Y_3\)` | `\(X\)` | `\(P(X)\)` | |:----:|:----:|:----:|:----:|:----:| | 0 | 0 | 0 | 0 | | | 1 | 0 | 0 | 1 | | | 0 | 1 | 0 | 1 | | | 0 | 0 | 1 | 1 | | | 1 | 1 | 0 | 2 | | | 1 | 0 | 1 | 2 | | | 0 | 1 | 1 | 2 | | | 1 | 1 | 1 | 3 | | Better get started with the multiplication! --- Because these are independent high school students, we can calculate the probabilities in the same manner as before. | `\(Y_1\)` | `\(Y_2\)` | `\(Y_3\)` | `\(X\)` | `\(P(X)\)` | |:----:|:----:|:----:|:----:|:----:| | 0 | 0 | 0 | 0 | 0.8(0.8)(0.8)=0.512 | | 1 | 0 | 0 | 1 | 0.2(0.8)(0.8)=0.128 | | 0 | 1 | 0 | 1 | 0.8(0.2)(0.8)=0.128 | | 0 | 0 | 1 | 1 | 0.8(0.8)(0.2)=0.128 | | 1 | 1 | 0 | 2 | 0.2(0.2)(0.8)=0.032 | | 1 | 0 | 1 | 2 | 0.2(0.8)(0.2)=0.032 | | 0 | 1 | 1 | 2 | 0.8(0.2)(0.2)=0.032 | | 1 | 1 | 1 | 3 | 0.2(0.2)(0.2)=0.008 | The chance that 2 of 3 are recent e-cig smokers is 0.032+0.032+0.032=0.096 or 9.6% --- .pull-left[ | `\(Y_1\)` | `\(Y_2\)` | `\(Y_3\)` | `\(X\)` | `\(P(X)\)` | |:----:|:----:|:----:|:----:|:----:| | 0 | 0 | 0 | 0 | 0.8(0.8)(0.8)=0.512 | | 1 | 0 | 0 | 1 | 0.2(0.8)(0.8)=0.128 | | 0 | 1 | 0 | 1 | 0.8(0.2)(0.8)=0.128 | | 0 | 0 | 1 | 1 | 0.8(0.8)(0.2)=0.128 | | 1 | 1 | 0 | 2 | 0.2(0.2)(0.8)=0.032 | | 1 | 0 | 1 | 2 | 0.2(0.8)(0.2)=0.032 | | 0 | 1 | 1 | 2 | 0.8(0.2)(0.2)=0.032 | | 1 | 1 | 1 | 3 | 0.2(0.2)(0.2)=0.008 | ] .pull-right[ The probability distribution of `\(X\)`, the number of recent e-cig smokers out of three high school students, is now | | | | | | |:--:|:--:|:--:|:--:|:---:| | `\(X\)` | 0 | 1 | 2 | 3| | `\(P(X=x)\)` | 0.512 | 0.384 | 0.096 | 0.008 | ] --- ### Oh gosh, is she going to make us do this for 4 students? This is getting ridiculous! Hopefully now you can see the need for a formula to simplify things. We can use the *binomial distribution* to describe these patterns in general. --- ### Binomial distribution The *binomial distribution* is used to give us the probability of `\(X\)` "successes" from a sequence of `\(n\)` independent Bernoulli trials. In our example, each student would be an independent Bernoulli trial (either an e-cig smoker, or not). This distribution involves three assumptions. - There is a fixed number `\(n\)` of Bernoulli trials, each of which results in one of two mutually-exclusive outcomes - The outcomes of the `\(n\)` trials are independent - The probability of success `\(\pi\)` is the same for each trial The distribution is `\(P(X=x)=\begin{pmatrix} n \\ x \end{pmatrix}\pi^x(1-\pi)^{n-x},\)` and it has mean `\(n\pi\)` and variance `\(n\pi(1-\pi)\)`. --- ### Binomial distribution `$$P(X=x)=\begin{pmatrix} n \\ x \end{pmatrix}\pi^x(1-\pi)^{n-x}$$` It's really not so bad! First, look at the second part, `\(\pi^x(1-\pi)^{n-x}\)`. This is just multiplying the right combination of `\(\pi\)` and `\(1-\pi\)` as in the previous tables. For example, if we want the probability of 3 e-cig smokers, `\(X=3\)`, the second part is just `\(\pi^x(1-\pi)^{n-x}=0.2^3(0.8)^0=0.008\)`, just as in the table. --- ### Binomial distribution `$$P(X=x)=\begin{pmatrix} n \\ x \end{pmatrix}\pi^x(1-\pi)^{n-x}$$` - If we want the probability of 2 e-cig smokers, `\(X=2\)`, the second part is just `\(\pi^x(1-\pi)^{n-x}=0.2^2(0.8)^{3-2}=0.032\)`, which is what we see in any single row in which we have two smokers and one non-smoker. - This is the probability of any one specific combination of 2 smokers and 1 nonsmoker. Then we need to figure out how many combinations of 2 smokers and 1 nonsmoker we could get. - The first part, `\(\begin{pmatrix}n \\x \end{pmatrix}\)`, accounts for all the possible ways in which we can have 2 smokers out of 3 people. --- ### Combinations The first part, `\(\begin{pmatrix} n \\ x \end{pmatrix}\)`, is pronounced "n choose x" and is called a *combination*. In this particular setting, it is also called the *binomial coefficient*. It is a formula for the number of ways to pick `\(x\)` subjects from a larger group of `\(n\)` and is defined as `$$\begin{pmatrix} n \\ x \end{pmatrix} = \frac{n!}{x!(n-x)!}.$$` What's n!? Not hard either...n! (pronounced "n factorial") is just shorthand for the recursive multiplication `\(n!=n(n-1)(n-2)\cdots (1)\)`. So `\(3!=3(2)(1)=6\)`, `\(4!=4(3)(2)(1)=24\)`, and so forth. We define `\(0!=1\)`. How many ways can we pick 3 subjects from a group of 3? (Ok, that's easy, there's just one way.) Using the binomial coefficient, we see it is `\(\begin{pmatrix}3 \\ 3 \end{pmatrix}=\frac{3!}{3!0!}=\frac{3(2)(1)}{3(2)(1)1}=1\)`. --- ### Combinations Going back to the table, we see there are 3 ways to pick 2 subjects from a group of 3 students. We can also calculate `\(\begin{pmatrix}3 \\ 2 \end{pmatrix}=\frac{3!}{2!1!}=\frac{3(2)(1)}{(2)(1)(1)}=3\)`. --- ### Probability of getting 3 smokers in a group of 3 students Using the binomial distribution, the probability of getting 3 recent e-cig smokers in our sample of 3 students is `$$\begin{aligned} \begin{pmatrix} n \\ x \end{pmatrix}\pi^x(1-\pi)^{n-x} & = \frac{n!}{x!(n-x)!}\pi^x(1-\pi)^{n-x} \\ \begin{pmatrix} 3 \\ 3 \end{pmatrix}0.2^3(1-0.2)^{3-3} & = \frac{3!}{3!(3-3)!}0.2^3(1-0.2)^{3-3} \\ & \frac{3(2)(1)}{3(2)(1)1}0.2^3(0.8)^{0}=0.2^3 \\ & = 0.008. \end{aligned}$$` --- ### Distribution of e-cigarette smokers Let's use R to take a look at the distribution of the number of recent e-cig smokers in groups of varying size. We can actually run a big experiment in R using the CDC estimate of the percentage of high schoolers who have smoked e-cigarettes in the past 30 days, 19.6%. Let's draw a single random sample of 1000 high schoolers with smoking probability 0.196. Our random variable `\(X\)` is now the number of smokers in this big sample. In this experiment, how many smokers do we expect to see? A binomial random variable has expectation (mean) `\(n\pi\)`, so we expect to see somewhere around 1000(0.196)=196 smokers. --- ### Distribution of e-cig smokers ```r # If I want to reproduce my random draw exactly, I have to set # the seed. Otherwise, I'll get a different random sample every time set.seed(121372) # in R, what we called "n" is called size # R's "n" is the number of experiments to run # in this case, that's the # of samples of size 1000 to simulate rbinom(n=1,size=1000,prob=0.196) ``` ``` ## [1] 208 ``` In this case, we got `\(X=208\)` smokers, close to the expectation of `\(n\pi=1000(0.196)=196\)`. --- Now suppose instead of taking a single random sample, we want to run our experiment 10 times. ```r rbinom(n=10,size=1000,prob=0.196) ``` ``` ## [1] 197 186 220 179 174 199 189 188 182 193 ``` We can see that for each of our 10 experiments, `\(X\)` is pretty close to its expectation (population mean) of 196. --- If we run the experiment a very large number of times, we can get a good estimate of the distribution, which in this case is a binomial(n=1000,0.196) distribution. ```r binomdata = data.frame(X = rbinom(1000000, 1000, 0.196)) dim(binomdata) ``` ``` ## [1] 1000000 1 ``` --- ```r ggplot(binomdata, aes(x = X)) + geom_histogram() + labs(x = "Number of Smokers", title = "Binomial(1000000,0.196) Distribution") ``` <img src="w3-l02-discretedistributions_files/figure-html/plotbinomdata-1.png" width="60%" style="display: block; margin: auto;" /> --- ### In Vitro Fertilization (IVF) Infertility is a global health problem that is inadequately addressed worldwide. Some estimates indicate up to 1-2% of conceptions in Western countries are through IVF. Suppose the average success (live birth) rate of this procedure per embryo is 35% (in reality, success rates depend on parental characteristics such as age). In the US and in many other countries, IVF is expensive and unlikely to be covered by insurance. Because of the expense, parents wish to minimize the needed number of IVF treatments (i.e., maximize the success rate of a given IVF attempt). --- ### Embryos .pull-left[ Here's what embryos look like in the first days of life. Note that days are counted differently at the beginning in regular pregnancies and in IVF. In a regular pregnancy, dating is done back to the first day of the last menstrual period (there is no good marker of conception). So on day 1 of pregnancy, a woman is not pregnant! Typically ovulation is around day 14 of the cycle (and thus of pregnancy). In IVF, counting starts on the day the egg is fertilized (corresponding to around day 14 of a regular pregnancy if the sperm meets the egg right away!). ] .pull-right[ <img src="embryo.jpg" width="100%" style="display: block; margin: auto;" /> ] --- ### The Challenge To maximize the success rate, specialists usually implant more than one embryo. We want to determine the optimum number of embryos to implant so that the likelihood of having at least one baby is high, but the likelihood of having triplets, quadruplets, or higher-order births is low. Suppose we prefer a singleton birth, twins are acceptable (though not without increased risk), and the risk of higher-order births should be kept to a minimum. Assume the probability of each embryo surviving is independent of the others. --- How do we address this mathematically? - Let `\(Y\)` be the number of live births from a single IVF procedure that implants `\(n\)` embryos. Here's what we need to know. - Probability of having at least one baby: `\(P(Y \geq 1)\)`, which we want to be large - Probability of having triplets, quadruplets, or a higher-order birth: `\(P(Y>2)\)`, which we want to be small --- Binomial distribution: `\(P(Y=y)=\begin{pmatrix} n \\ y \end{pmatrix}\pi^y(1-\pi)^{n-y}\)` - `\(y\)` is the # of live births from `\(n\)` implanted embryos - `\(\pi=0.35\)` is the probability of a live birth - If we implant just 1 embryo, `\(n=1\)`. If we implant two embryos, `\(n=2\)`. Medical societies (e.g., American Society for Reproductive Medicine) have guidelines on the number of embryos to implant, ranging from 1-5, depending on embryo quality and the mother's age*. - We can use the binomial distribution to find `\(P(Y \geq 1)\)` and `\(P(Y>2)\)` for `\(n=1,2,3,4,5\)` `\(^*\)` So technically, `\(\pi\)` depends on embryo quality and the mother's age. We'll learn how to accommodate this formally when we study logistic regression models; we ignore it for now. --- Binomial distribution: `\(P(Y=y)=\begin{pmatrix} n \\ y \end{pmatrix}\pi^y(1-\pi)^{n-y}\)` - Quick trick: `\(P(Y \geq 1) = 1-P(Y=0)\)` (you either have no live births, or one or more live births) - Similarly, `\(P(Y>2)=1-P(Y=0)-P(Y=1)-P(Y=2)\)` if needed - In the easy case in which `\(n=1\)`, `\(P(Y \geq 1)=0.35\)` and `\(P(Y>2)=0\)`* *OK, not quite, as it is still possible for a single implanted embryo to divide, yielding twins, but we'll assume that isn't happening for now, as it's not terribly common. --- Binomial distribution: `\(P(Y=y)=\begin{pmatrix} n \\ y \end{pmatrix}0.35^y(1-0.35)^{n-y}\)` More generally, `$$\begin{aligned} P(Y \geq 1) & =1-P(Y=0)=1-\begin{pmatrix}n \\ 0 \end{pmatrix} 0.35^0 (0.65)^{n-0}=1-0.65^n \\ P(Y>2) & = 1-P(Y=0)-P(Y=1)-P(Y=2) \\ & = 1 - 0.65^n - \begin{pmatrix} n \\ 1 \end{pmatrix}0.35^1(0.65)^{n-1} - \begin{pmatrix} n \\ 2 \end{pmatrix}0.35^2(0.65)^{n-2} \\ & = 1 - 0.65^n - n(0.35)(0.65)^{n-1} - \frac{n!}{2!(n-2)!}0.35^2(0.65)^{n-2} \\ & = 1 - 0.65^n - n(0.35)(0.65)^{n-1} - \frac{n(n-1)}{2}0.35^2(0.65)^{n-2} \end{aligned}$$` --- Thus we have - `\(P(Y\geq 1)=1-0.65^n\)` - `\(P(Y > 2)=1 - 0.65^n - n(0.35)(0.65)^{n-1} - \frac{n(n-1)}{2}0.35^2(0.65)^{n-2}\)` If `\(n=5\)`, then we have - `\(P(Y\geq 1)=1-0.65^5=0.88\)` (good chance of a baby!) - `\(P(Y > 2)=1 - 0.65^5 - 5(0.35)(0.65)^{5-1} - \frac{5(4)}{2}0.35^2(0.65)^{5-2}=0.24\)` (good chance of lots o' babies!) --- Binomial distribution: `\(P(Y=y)=\begin{pmatrix} n \\ y \end{pmatrix}0.35^y(1-0.35)^{n-y}\)` | Embryos implanted (n) | P(at least 1 baby) | P(>2 babies) | |:-------:|:-------:|:-------:| | 1 | 0.35 | 0 | | 2 | 0.58 | 0 | | 3 | 0.73 | 0.04 | | 4 | 0.82 | 0.13 | | 5 | 0.88 | 0.24| Using this table of probabilities, and keeping in mind our criteria, how many embryos would you implant, and why? --- R has built-in functions for computing binomial probabilities that can save you a lot of work (especially when `\(n\)` is large) | Probability | R function | |:-----:|:-----:| | `\(P(Y=y \mid \pi, n)\)` | dbinom(y,n,pi) | | `\(P(Y \leq y \mid \pi, n)\)` | pbinom(y,n,pi) | For the last row of the table on the prior slide, `\(n=5\)` and `\(\pi=0.35\)`. The probability of at least one baby, `\(P(Y \geq 1)=1-P(Y=0)\)` can be calculated using in R the code *1-dbinom(0,5,0.35)*, which yields the value 0.8839709.