# Welcome to STA 198/GLHLTH 298!
## <br><br> Introduction to Global Health Data Science
### <a href="https://sta198f2021.github.io/website/">Website</a>
### <br> Prof. Amy Herring

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---

## Data science

.pull-left-wide[
- Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.

- This is a course on health data science, with an emphasis on statistical thinking and global health challenges.

- Our process involves
  - forming a question of interest,
  - (collecting) and summarizing data,
  - and interpreting and communicating results.

]

---

## Global health data science

STA 198 will

- provide a tour of basic statistical methods useful in public health and biomedical research

- emphasize intuition and understanding of the methods, with a focus on critical assessment of evidence, data-driven decision-making, and effective communication of insights from data

- make use of timely, relevant examples from global health science

- utilize free, modern software and reproducible research methods for transparency and data sharing.


---

## Course FAQ

**Q - Is this an intro stat course?**  
A - While statistics `$\ne$` data science, the two are closely related and overlap substantially. Hence, this course is a great way to get started with statistics. However, this course is *not* your typical high school statistics course.

**Q - Will we be doing computing?**   
A - Yes, extensively.

---

## Course FAQ

**Q - What computing language will we learn?**  
A - R.

**Q - Why not language X?**  
A - We can discuss that over ☕.

---

## Course info online...

... where you can find everything except your grades!

<br>

(You can also get there through the Sakai syllabus link.)

---

# Software

---

.pull-left-wide[
<img src="img/excel.png" width="75%" style="display: block; margin: auto auto auto 0;" />
]
.pull-right[ 
We'll combine the ease of viewing in Excel with ...
]
---
.pull-left-wide[
<img src="img/r.png" width="60%" style="display: block; margin: auto auto auto 0;" />
]
.pull-right[ 
the rigor of the R programming language ...
]
---
.pull-left-wide[
<img src="img/rstudio.png" width="73%" style="display: block; margin: auto auto auto 0;" />
]
.pull-right[
in an integrated environment. To learn more, check out the [introductory video](https://www.youtube.com/watch?v=Q2QN1RpvLq8) for our computing toolkit.
]
---

# Data science life cycle

---

A study at Penn (Chen et al., 2008) found that men presenting at emergency departments with acute nontraumatic abdominal pain received painkillers more quickly than women.

---

# Let's dive in!

---

## Men's Health Gap

- Health gaps are differences in the prevalence of disease, access to healthcare, or health outcomes across different groups.

- The earlier slide accompanied an article about a small (n `$\approx$` 900) study in one emergency room that showed men were more likely to receive painkillers, and received them more quickly, than women, given similar presentations.

- Mortality data and life expectancy from infancy tell a story about a health gap in the opposite direction.


---

## Dilemma: Measurement of Gender Minorities

.pull-left-wide[
- Research on health gaps among gender minorities is limited due to a variety of factors, including
  - binary gender construction,
  - pre-populated vs. open-field survey items,
  - and many others.
  
- These measurement issues lead to misclassification and impede research on important health issues affecting gender minorities

- Once data are collected, analysts may have limited options (survey design is critical!)

]

---

## Life Expectancy

- The [Institute for Health Metrics and Evaluation (IHME)](http://www.healthdata.org) is a resource for data on a variety of important health outcomes worldwide.

- IHME maintains the [Global Burden of Disease (GBD)](http://ghdx.healthdata.org) tool, a valuable resource for policymakers and others that quantifies health loss due to a variety of risk factors, diseases, and injuries.

- We consider data from IHME on (estimated) infant life expectancy from the years 1994-2019 as a function of location (primarily country) and binary gender.

- Here **life expectancy** is the number of years an infant can expect to live if mortality rates in the current year remain unchanged for the rest of their life. Life expectancy usually underestimates how long the baby will actually live, as mortality rates have been declining over time.
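To make the definition concrete, here is a minimal base R sketch of a period life expectancy calculation. It is not IHME's method: the function name `period_life_expectancy()`, the made-up death probabilities, and the assumption that deaths occur on average halfway through each year of age are all illustrative choices. It shows how applying one unusually deadly year's mortality rates to an entire lifetime pulls the estimate down sharply.

```r
# Period life expectancy at birth from single-year death probabilities.
# qx[i] is the probability of dying during the year of age i - 1;
# the last entry should be 1 so that the table closes.
period_life_expectancy <- function(qx) {
  lx <- c(1, cumprod(1 - qx))            # share still alive at each exact age
  Lx <- (lx[-length(lx)] + lx[-1]) / 2   # person-years lived in each age interval
  sum(Lx)                                # expected years lived from birth
}

# Hypothetical mortality schedule (made-up numbers, not IHME estimates)
qx_typical <- c(0.05, rep(0.005, 59), rep(0.05, 40), 1)

# The same schedule in a crisis year: every death probability is 10 times higher
qx_crisis <- pmin(1, 10 * qx_typical)

e0_typical <- period_life_expectancy(qx_typical)
e0_crisis  <- period_life_expectancy(qx_crisis)

c(typical = e0_typical, crisis = e0_crisis)
```

Because the calculation assumes the current year's rates persist forever, a single catastrophic year can produce a very low value even though infants born that year who survive it would go on to face lower mortality.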

---

# What is in a dataset?

---

## Dataset terminology

- Each row is an **observation**
- Each column is a **variable**

```r
life <- readr::read_csv("lifeexp/lifeexpectancy_infant.csv")

life
```

```
## # A tibble: 12,240 x 4
##   location sex     year lifeexp
##   <chr>    <chr>  <dbl>   <dbl>
## 1 Japan    Female  2019    87.7
## 2 Japan    Female  2018    87.6
## 3 Japan    Female  2017    87.6
## 4 Japan    Female  2016    87.4
## 5 Japan    Female  2015    87.3
## 6 Japan    Female  2014    87.1
## # … with 12,234 more rows
```


---

## What's in the life expectancy data?

Take a `glimpse` at the data:

```r
glimpse(life)
```

```
## Rows: 12,240
## Columns: 4
## $ location <chr> "Japan", "Japan", "Japan", "Japan", "Japan", "…
## $ sex      <chr> "Female", "Female", "Female", "Female", "Femal…
## $ year     <dbl> 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2019…
## $ lifeexp  <dbl> 87.65871, 87.64361, 87.64209, 87.44221, 87.328…
```

---

.question[
How many rows and columns does this dataset have?
What does each row represent?
What does each column represent?
]

---

```r
nrow(life) # number of rows
```

```
## [1] 12240
```

```r
ncol(life) # number of columns
```

```
## [1] 4
```

```r
dim(life)  # dimensions (rows, columns)
```

```
## [1] 12240     4
```

---

# Exploratory data analysis

---

## What is EDA?

- Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics
- Often, this is visual -- this is what we'll focus on first
- But we might also calculate summary statistics and perform data wrangling/manipulation/transformation at (or before) this stage of the analysis -- this is what we'll focus on next
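As a small illustration of the summary-statistics side of EDA, the sketch below groups a toy data set by `sex` and computes a few summaries with dplyr. The tibble `toy_life` uses made-up values and hypothetical country names; it only borrows the column names of the course data.

```r
library(dplyr)

# Toy stand-in for the course data: made-up values, same column names
toy_life <- tibble(
  location = rep(c("Country A", "Country B"), each = 4),
  sex      = rep(c("Female", "Male"), times = 4),
  year     = rep(c(2018, 2018, 2019, 2019), times = 2),
  lifeexp  = c(80.1, 75.0, 80.3, 75.2, 65.4, 61.9, 65.8, 62.1)
)

# One row of summary statistics per value of sex
sex_summary <- toy_life %>%
  group_by(sex) %>%
  summarise(
    n_obs        = n(),
    mean_lifeexp = mean(lifeexp),
    min_lifeexp  = min(lifeexp),
    max_lifeexp  = max(lifeexp)
  )

sex_summary
```

The same `group_by()` and `summarise()` pattern should carry over to the full data, for example grouping by `year` or `location` instead of `sex`.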

---

## Life expectancy over time

.question[ 
How would you describe the relationship between year and life expectancy?
What other variables would help us understand data points that don't follow the overall trend?
What is causing the **outliers** at the bottom?
]

`$~~~~~~~$` OK, ugly plot! We'll break it down soon -- but first, the outliers...

---

```r
# we want to look at the bottom 5 values of life expectancy
# (5 just in case some points are exactly the same)
life %>% top_n(-5, lifeexp)
```

```
## # A tibble: 5 x 4
##   location sex     year lifeexp
##   <chr>    <chr>  <dbl>   <dbl>
## 1 Burundi  Male    1997   40.8 
## 2 Haiti    Female  2010   36.9 
## 3 Haiti    Male    2010   28.8 
## 4 Rwanda   Female  1994   11.2 
## 5 Rwanda   Male    1994    9.14
```

Are these data errors, realistic estimates, or neither?  (Hint: recall how life expectancy is calculated)
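The dplyr documentation marks `top_n()` as superseded by `slice_min()` and `slice_max()`. The sketch below shows the newer function on a toy tibble with made-up values and hypothetical country names, including how ties at the cutoff are handled.

```r
library(dplyr)

# Made-up values; two rows share the second-lowest life expectancy
toy_life <- tibble(
  location = c("Country A", "Country A", "Country B", "Country B", "Country C"),
  year     = c(2000, 2001, 2000, 2001, 2000),
  lifeexp  = c(70.2, 41.5, 66.0, 66.3, 66.0)
)

# Ask for the 2 lowest values; ties are kept by default,
# so more than 2 rows can come back
lowest <- toy_life %>% slice_min(lifeexp, n = 2)

# Drop ties to get exactly 2 rows
lowest_no_ties <- toy_life %>% slice_min(lifeexp, n = 2, with_ties = FALSE)

lowest
lowest_no_ties
```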

---

# Data visualization

---

## Data visualization

> *"The simple graph has brought more information to the data analyst's mind than any other device." --- John Tukey*

- Data visualization is the creation and study of the visual representation of data
- Many tools for visualizing data -- R is one of them
- Many approaches/systems within R for making data visualizations -- **ggplot2** is one of them, and that's what we're going to use

---

## ggplot2 `$\in$` tidyverse

.pull-left[
<img src="img/ggplot2-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" />
] 
.pull-right[ 
- **ggplot2** is tidyverse's data visualization package 
- `gg` in "ggplot2" stands for Grammar of Graphics 
- Inspired by the book **The Grammar of Graphics** by Leland Wilkinson
]

---

## Grammar of Graphics

.pull-left-narrow[
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic
]
.pull-right-wide[
<img src="img/grammar-of-graphics.png" width="100%" style="display: block; margin: auto;" />
]

.footnote[ Source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html)]

---

## Men's gap in life expectancy by year

Let's subset to a few countries to de-clutter the plot.

```r
life %>%
  filter(location %in% c("United States of America", "Rwanda", "China")) %>%
  ggplot(aes(x = year, y = lifeexp, shape = sex, color = location)) +
  geom_point()
```

We'll learn to make this a lot better later!

---

.question[ 
- What are the functions doing the plotting?
- What is the dataset being plotted?
- Which variables map to which features (aesthetics) of the plot?
]

```r
life %>%
  filter(location %in% c("United States of America", "Rwanda", "China")) %>%
  ggplot(aes(x = year, y = lifeexp, shape = sex, color = location)) +
  geom_point()
```

---

## Hello ggplot2!

.pull-left-wide[
- `ggplot()` is the main function in ggplot2
- Plots are constructed in layers
- Structure of the code for plots can be summarized as

```r
ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
```

- The ggplot2 package comes with the tidyverse

```r
library(tidyverse)
```

- For help with ggplot2, see [ggplot2.tidyverse.org](http://ggplot2.tidyverse.org/)
]
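One way to fill in the template, as a sketch only: the data frame `toy_life` below holds made-up values, and the choice of `geom_point()`, `geom_line()`, and `labs()` is just one possible set of layers and options.

```r
library(ggplot2)

# Made-up values for a single hypothetical country
toy_life <- data.frame(
  year    = rep(2015:2019, times = 2),
  sex     = rep(c("Female", "Male"), each = 5),
  lifeexp = c(80.0, 80.2, 80.3, 80.5, 80.6, 75.1, 75.2, 75.4, 75.5, 75.7)
)

p <- ggplot(data = toy_life,                                   # [dataset]
            mapping = aes(x = year, y = lifeexp, color = sex)) +  # aesthetics
  geom_point() +                                               # layer 1: points
  geom_line() +                                                # layer 2: lines
  labs(x = "Year", y = "Life expectancy (years)", color = "Sex")  # other options

p
```

Each `+` adds a layer or option to the plot object, which is why the code can be built up and rerun one line at a time.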

---

# Why do we visualize?

---

## Anscombe's quartet/datasaurus dozen

.pull-left[

```
## # A tibble: 142 x 3
##   dataset     x     y
##   <chr>   <dbl> <dbl>
## 1 dino     55.4  97.2
## 2 dino     51.5  96.0
## 3 dino     46.2  94.5
## 4 dino     42.8  91.4
## 5 dino     40.8  88.3
## 6 dino     38.7  84.9
## # … with 136 more rows
```

```
## # A tibble: 142 x 3
##   dataset     x     y
##   <chr>   <dbl> <dbl>
## 1 star     58.2  91.9
## 2 star     58.2  92.2
## 3 star     58.7  90.3
## 4 star     57.3  89.9
## 5 star     58.1  92.0
## 6 star     57.5  88.1
## # … with 136 more rows
```
] 
.pull-right[

```
## # A tibble: 142 x 3
##   dataset      x     y
##   <chr>    <dbl> <dbl>
## 1 bullseye  51.2  83.3
## 2 bullseye  59.0  85.5
## 3 bullseye  51.9  85.8
## 4 bullseye  48.2  85.0
## 5 bullseye  41.7  84.0
## 6 bullseye  37.9  82.6
## # … with 136 more rows
```

```
## # A tibble: 142 x 3
##   dataset     x     y
##   <chr>   <dbl> <dbl>
## 1 dots     51.1  90.9
## 2 dots     50.5  89.1
## 3 dots     50.2  85.5
## 4 dots     50.1  83.1
## 5 dots     50.6  82.9
## 6 dots     50.3  83.0
## # … with 136 more rows
```
]

---

## Summarising Anscombe's quartet/datasaurus dozen

```r
library(datasauRus)  # provides the datasaurus_dozen data

summdat <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(
    mean_x = mean(x), 
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    r = cor(x, y)
  )
```

---

## Summarising Anscombe's quartet/datasaurus dozen

```
## # A tibble: 13 x 6
##    dataset    mean_x mean_y  sd_x  sd_y       r
##    <chr>       <dbl>  <dbl> <dbl> <dbl>   <dbl>
##  1 away         54.3   47.8  16.8  26.9 -0.0641
##  2 bullseye     54.3   47.8  16.8  26.9 -0.0686
##  3 circle       54.3   47.8  16.8  26.9 -0.0683
##  4 dino         54.3   47.8  16.8  26.9 -0.0645
##  5 dots         54.3   47.8  16.8  26.9 -0.0603
##  6 h_lines      54.3   47.8  16.8  26.9 -0.0617
##  7 high_lines   54.3   47.8  16.8  26.9 -0.0685
##  8 slant_down   54.3   47.8  16.8  26.9 -0.0690
##  9 slant_up     54.3   47.8  16.8  26.9 -0.0686
## 10 star         54.3   47.8  16.8  26.9 -0.0630
## 11 v_lines      54.3   47.8  16.8  26.9 -0.0694
## 12 wide_lines   54.3   47.8  16.8  26.9 -0.0666
## 13 x_shape      54.3   47.8  16.8  26.9 -0.0656
```

---

## Visualizing the data
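The same point can be sketched with Anscombe's quartet, which ships with base R as the `anscombe` data frame, so no extra data package is needed. The object names `quartet`, `quartet_summary`, and `quartet_plot` are illustrative. The four sets have nearly identical summary statistics, yet the faceted scatterplot shows very different shapes.

```r
library(dplyr)
library(ggplot2)

# Stack the built-in anscombe data (columns x1..x4, y1..y4) into long format
quartet <- data.frame(
  set = rep(paste("Set", 1:4), each = nrow(anscombe)),
  x   = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  y   = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

# Same summaries as for the datasaurus dozen: means, SDs, correlation
quartet_summary <- quartet %>%
  group_by(set) %>%
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    r      = cor(x, y)
  )

quartet_summary

# One panel per set makes the differences visible
quartet_plot <- ggplot(quartet, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set, ncol = 4)

quartet_plot
```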

<!--

ggplot(quartet, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set, ncol = 4)
  
life %>%
    ggplot(aes(x = year, y = lifeexp))+geom_point()

# let's see what's going on with the low outliers
life %>% top_n(-5,lifeexp)

life %>%
  filter(location %in% c("United States of America","Rwanda","China")) %>%
      ggplot(aes(x = year, y = lifeexp))+geom_point()
      
# our country name looks too long
life[["location"]] <-life[["location"]] %>%
  str_replace( pattern = "United States of America", replacement = "USA")

life %>%
  filter(location == "USA") %>%
      ggplot(aes(x = year, y = lifeexp))+geom_point()

life %>%
  filter(location=="USA") %>%
      ggplot(aes(x = year, y = lifeexp,group=sex))+geom_point()+geom_line(aes(color=sex))

life %>%
  filter(location %in% c("USA","Rwanda","China")) %>%
      ggplot(aes(x = year, y = lifeexp,group=sex))+geom_point()+geom_line(aes(color=sex)) + facet_grid(location ~ .)

life %>%
  filter(location %in% c("USA","Brazil","China")) %>%
      ggplot(aes(x = year, y = lifeexp,group=sex))+geom_point()+geom_line(aes(color=sex)) + facet_grid(location ~ .)

life %>%
  filter(location %in% c("USA","Brazil","China")) %>%
      ggplot(aes(x = year, y = lifeexp,group=location))+geom_point()+geom_line(aes(color=location)) + facet_grid(sex ~ .)
      
      
# compare male vs female life expectancy

life %>%
  filter(year == '2019') %>%
     spread(sex,lifeexp) %>%
      ggplot(aes(x=Female,y=Male))+geom_point()+
      xlab("Female Life Expectancy (Years)")+ ylab("Male Life Expectancy (Years)") + ggtitle ("2019 Life Expectancy")
      
#add line with slope 1 intercept 0      
life %>%
  filter(year == '2019') %>%
     spread(sex,lifeexp) %>%
      ggplot(aes(x=Female,y=Male))+geom_point()+
      xlab("Female Life Expectancy (Years)")+ ylab("Male Life Expectancy (Years)") + ggtitle ("2019 Life Expectancy")+geom_abline(intercept=0,slope=1)
      
#points below line female life expectancy > male
#what are the points above the line?

life %>%
  filter(year == '2019') %>%
     spread(sex,lifeexp) %>%
       top_n(-3,Female-Male)

#is men's health gap increasing over time?
# to get decent scatter plot graph women life exp vs men
#in a single year
#calculate women-men for a country in each year
#need then to manipulate data to put on one row -- future lab on wrangling?

-->