Processing math: 100%
+ - 0:00:00
Notes for current slide
Notes for next slide

Probability



Introduction to Data Science

Course Website


Prof. Amy Herring

1 / 19

Goal

2 / 19

Goal of statistics

One goal of statistics is to answer a research question, by making inferences about a population based on data in one or more samples.

  • Population: the entire group we would like to make conclusions about (e.g., all people aged 15 and up in China, all pregnancies worldwide)
  • Sample: specific group we have collected data from (e.g., a random sample of non-institutionalized people aged 15 and up in sampled households, from sampled villages from sampled counties; pregnancies receiving prenatal care at Duke Hospital)
  • The validity of our inferences depends on a variety of factors, including the representativeness of our sample

Question: how representative of worldwide pregnancies would a sample from Duke Hospital be?

3 / 19

Drawing Conclusions

In order to draw principled conclusions from our data, we rely on a formal probabilistic framework that allows us to quantify uncertainty.

Statistical inference is built upon the foundation of probability theory.

4 / 19

Probability

The probability of an event tells us how likely an event is to occur, and it can take values from 0 to 1, inclusive. It can be viewed as the proportion of times the event would occur if it could be observed an infinite number of times. It can also be viewed as our degree of belief an event will happen.

5 / 19

Formalizing Probability

6 / 19

Events

An event is the basic element to which probability is applied, e.g. the result of an observation or experiment.

  • A is the event that a person is a cigarette smoker
  • B is the event that a person identifies as female
  • C is the event that a person has blood type A+
7 / 19

Sample Spaces

A sample space is the set of all possible outcomes. So for example, the sample space could be all possible blood types, the event C is the event that your blood type is A+, and the rest of the sample space, ¯C as we will see in a minute, contains all other blood types.

The prevalence of blood types varies widely around the world. About 36% of the US population has blood type A+, and we can say P(A+)=0.36 in the US.

Sample spaces depend on the research question. For exploration of ABO blood types, the sample space is {A+, A-, B+, B-, AB+, AB-, O+, O-}.

8 / 19

Operations on Events

  • The union of A and B, denoted AB, is the event that A, or B, or both A and B, occur. Here, AB is the event that a person is a smoker, or identifies as female, or both.

  • The intersection of A and B, denoted AB, is the event that both A and B occur. Here AB is the event that a person both smokes and identifies as female. A and B are disjoint or mutually exclusive if AB= (A and B can't occur simultaneously). That's clearly not the case for smoking and identifying as female!

  • The complement of A, denoted Ac or ¯A, is the event A does not occur (nonsmoker). A and ¯A are mutually exclusive.

9 / 19

Venn Diagram

10 / 19

Venn Diagram

11 / 19

How Do Probabilities Work?

12 / 19

Probability Rules

  • The probability of any event in the sample space is between 0 and 1, inclusive.

  • The probability of the entire sample space is 1.

  • If we know the probability of A, often denoted P(A), it is easy to calculate the probability of ¯A as P(¯A)=1P(A). This is called the complement rule: P(A)+P(¯A)=1.

13 / 19

Additive Rule of Probability

When events are mutually exclusive (cannot occur together), P(AB)=P(A)+P(B). When two events can occur simultaneously (think about the overlapping sections in the Venn diagrams), then we need to avoid double-counting when calculating the probability either of two events will occur.

The general additive rule of probability is therefore P(AB)=P(A)+P(B)P(AB) -- because AB is part of the event A and part of the event B, we need to avoid double-counting it.

If two events A and B are mutually exclusive, then P(AB)=0.

14 / 19

Computing Probabilities

Intuitively, we can think of the probability of an outcome (or set of outcomes) as the proportion of times the outcome (or set of outcomes) would occur if we observed the random process infinitely many times.

If all the outcomes in our random process (sample space S) are equally likely, then for some event E, P(E)=# of outcomes in E# of total outcomes in S.

15 / 19

Practice

16 / 19

Vaccine Hesitancy in the UK

We will consider a population motivated by the findings of this UK study on vaccine hesitancy, which explored hesitancy across a variety of UK ethnic groups.

17 / 19

Vaccine Hesitancy: Cohort of 10,000 People

Ethnicity
Vaccine Hesitant
Not Hesitant
White British or Irish 1362 7368
Other white background 71 199
Mixed 55 115
Asian or Asian British - Indian 37 143
Asian or Asian British - Pakistani/Bangladeshi 85 115
Asian or Asian British - other 15 95
Black or Black British 136 54
Other Ethnic Group or Not Specified 31 119
18 / 19

Practicing with Probabilities

Define events A=vaccine hesitant and B=Asian or Asian British-Indian. Calculate the following probabilities for a randomly-selected person in this cohort.

  • P(A)

  • P(B)

  • P(AB)

  • P(AB)

  • P(A¯B)

19 / 19

Goal

2 / 19
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow