Probability

# Probability
## <br><br> Introduction to Data Science
### <a href="https://sta198f2021.github.io/website/">Course Website</a>
### <br> Prof. Amy Herring

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---

# Goal

---

## Goal of statistics

One goal of **statistics** is to answer a **research question**, by making **inferences** about a **population** based on data in one or more **samples**.
- Population: the entire group we would like to make conclusions about (e.g., all people aged 15 and up in China, all pregnancies worldwide)
- Sample: specific group we have collected data from (e.g., a random sample of non-institutionalized people aged 15 and up in sampled households, from sampled villages from sampled counties; pregnancies receiving prenatal care at Duke Hospital)
- The validity of our inferences depends on a variety of factors, including the **representativeness** of our sample

Question: how representative of worldwide pregnancies would  a sample from Duke Hospital be?

---

## Drawing Conclusions

In order to draw principled conclusions from our data, we rely on a formal probabilistic framework that allows us to *quantify uncertainty*.

**Statistical inference** is built upon the foundation of **probability theory**.

---

## Probability

The **probability** of an event tells us how likely an event is to occur, and it can take values from 0 to 1, inclusive. It can be viewed as the proportion of times the event would occur if it could be observed an infinite number of times. It can also be viewed as our degree of belief an event will happen.

---

# Formalizing Probability

---

## Events

An **event** is the basic element to which probability is applied, e.g. the result of an observation or experiment.

- **A** is the event that a person is a cigarette smoker
- **B** is the event that a person identifies as female
- **C** is the event that a person has blood type A+

---

## Sample Spaces

A **sample space** is the set of all possible outcomes.  So for example, the sample space could be all possible blood types, the event C is the event that your blood type is A+, and the rest of the sample space, `$\overline{C}$` as we will see in a minute, contains all other blood types.

The [prevalence of blood types varies widely around the world](https://en.wikipedia.org/wiki/Blood_type_distribution_by_country). About 36% of the US population has blood type A+, and we can say `$P(A+)=0.36$` in the US.

Sample spaces depend on the research question.  For exploration of ABO blood types, the sample space is {A+, A-, B+, B-, AB+, AB-, O+, O-}.

---

## Operations on Events

- The **union** of A and B, denoted `$A \cup B$`, is the event that A, or B, or both A and B, occur. Here, `$A \cup B$` is the event that a person is a smoker, or identifies as female, or both.

- The **intersection** of A and B, denoted `$A \cap B$`, is the event that both A and B occur. Here `$A \cap B$` is the event that a person both smokes and identifies as female. A and B are **disjoint** or **mutually exclusive** if `$A \cap B = \emptyset$` (A and B can't occur simultaneously). That's clearly not the case for smoking and identifying as female!

- The **complement** of A, denoted `$A^c$` or `$\overline{A}$`, is the event A does not occur (nonsmoker). `$A$` and `$\overline{A}$` are **mutually exclusive**.

---

## Venn Diagram

<img src="img/covidallergy.jpg" width="75%" style="display: block; margin: auto auto auto 0;" />
---

## Venn Diagram

<img src="img/santavenn.jpg" width="45%" style="display: block; margin: auto auto auto 0;" />
---

# How Do Probabilities Work?

---

## Probability Rules

- The probability of any event in the sample space is between 0 and 1, inclusive.

- The probability of the entire sample space is 1.

- If we know the probability of `$A$`, often denoted `$P(A)$`, it is easy to calculate the probability of `$\overline{A}$` as `$P(\overline{A})=1-P(A)$`. This is called the **complement rule**:  `$P(A)+P(\overline{A})=1$`.

---

## Additive Rule of Probability

When events are mutually exclusive (cannot occur together), `$P(A\cup B)=P(A)+P(B)$`.  When two events can occur simultaneously (think about the overlapping sections in the Venn diagrams), then we need to avoid double-counting when calculating the probability either of two events will occur.

The general **additive rule of probability** is therefore `$P(A \cup B)=P(A)+P(B)-P(A \cap B)$` -- because `$A \cap B$` is part of the event A and part of the event B, we need to avoid double-counting it.

If two events A and B are mutually exclusive, then `$P(A \cap B)=0$`.

---

## Computing Probabilities

Intuitively, we can think of the probability of an outcome (or set of outcomes) as the proportion of times the outcome (or set of outcomes) would occur if we observed the random process infinitely many times.

If all the outcomes in our random process (sample space `$\mathcal{S}$`) are equally likely, then for some event `$E$`, `$$P(E)=\frac{\text{# of outcomes in } E}{\text{# of total outcomes in } \mathcal{S}}.$$`

---

# Practice

---

## Vaccine Hesitancy in the UK

We will consider a population motivated by the findings of this UK study on vaccine hesitancy, which explored hesitancy across a variety of UK ethnic groups.

---

## Vaccine Hesitancy: Cohort of 10,000 People

| <div style="width:120px;text-align:left">Ethnicity</div> | <div style="width:340px;text-align:center">Vaccine Hesitant</div> | <div style="width:340px;text-align:center">Not Hesitant</div> |
|:------|------:|-------:|
| White British or Irish | 1362 | 7368 |
| Other white background | 71 | 199 |
| Mixed | 55 | 115 |
| Asian or Asian British - Indian | 37 | 143 |
| Asian or Asian British - Pakistani/Bangladeshi | 85 | 115 |
| Asian or Asian British - other | 15 | 95 |
| Black or Black British | 136 | 54 |
| Other Ethnic Group or Not Specified | 31 | 119 |

---

## Practicing with Probabilities

Define events A=vaccine hesitant and B=Asian or Asian British-Indian. Calculate the following probabilities for a randomly-selected person in this cohort.

- `$P(A)$`

- `$P(B)$`

- `$P(A \cap B)$`

- `$P(A \cup B)$`

- `$P(A \cup \overline{B})$`