Tidying data

# Tidying data
## <br><br> Introduction to Global Health Data Science
### <a href="https://sta198f2021.github.io/website/">Back to Website</a>
### <br> Prof. Amy Herring

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---

# .hand[We...]

This slide deck adapted from Data Science in a Box!

---

Today we'll do the following:

1. Learn how to `tidy` data

2. Practice data tidying skills to reproduce a plot of presidential approval ratings

3. Step back and review. We revisit the COVID example from week 1 and explore how those data were wrangled to create the plot.

---

## Data: Transparency in Medical Pricing

The US Centers for Medicare and Medicaid Services now requires hospitals to [release pricing information to patients before providing services](https://www.healthaffairs.org/do/10.1377/hblog20200304.157067/full/). Hospitals have varied in how accessible they have made this data to consumers. For example, Duke Hospital refers patients [to a website](https://www.dukehealth.org/paying-for-care/what-duke-charges-services) that contains a .csv file, and it directs patients to [a financial counselor](https://www.dukehealth.org/paying-for-care/estimate-care-will-cost) to help get estimates (the .csv file is just a list and provides no information about how many procedures may be charged, etc).

---

## Data: Duke Medicare Prices

<br>

```
## # A tibble: 3 × 4
##      ID proc_1       proc_2       proc_3     
##   <dbl> <chr>        <chr>        <chr>      
## 1     1 stent_nocomp othercardio  <NA>       
## 2     2 appendectomy <NA>         <NA>       
## 3     3 stent_comp   appendectomy othercardio
```
]

--
.pull-right[
### .pink[We want...]

```
## # A tibble: 9 × 3
##      ID proc_no procedure   
##   <dbl> <chr>   <chr>       
## 1     1 proc_1  stent_nocomp
## 2     1 proc_2  othercardio 
## 3     1 proc_3  <NA>        
## 4     2 proc_1  appendectomy
## 5     2 proc_2  <NA>        
## 6     2 proc_3  <NA>        
## 7     3 proc_1  stent_comp  
## 8     3 proc_2  appendectomy
## 9     3 proc_3  othercardio
```
]

---

## A grammar of data tidying

.pull-left[
<img src="img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[
The goal of tidyr is to help you tidy your data via

- pivoting for going between wide and long data
- splitting and combining character columns
- nesting and unnesting columns
- clarifying how `NA`s should be treated
]

---

# Pivoting data

---

## Not this...

---

## but this!

.center[
<img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]

---

## Wider vs. longer

--
.pull-right[
### .pink[longer]
more rows

---

## `pivot_longer()`

```r
pivot_longer(
* data,
  cols, 
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

```r
pivot_longer(
  data, 
* cols,
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
]
.pull-right[

```r
pivot_longer(
  data, 
  cols, 
* names_to = "name",
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
- `values_to`: name of the column where data in pivoted variables go (character string)
]
.pull-right[

```r
pivot_longer(
  data, 
  cols, 
  names_to = "name", 
* values_to = "value"
  )
```
]

---

## Patients `\(\rightarrow\)` interventions

```r
interventions <- patients %>%
* pivot_longer(
*   cols = proc_1:proc_3,  # variables proc_1 to proc_3
*   names_to = "proc_no",  # column names -> new column called proc_no
*   values_to = "procedure"     # values in columns -> new column called procedure
*   )

interventions
```

---

## Why pivot?

Most likely, because the next step of your analysis needs it

```r
prices
```

```
## # A tibble: 4 × 2
##   procedure    medicare_price
##   <chr>                 <dbl>
## 1 stent_nocomp         15827.
## 2 appendectomy         10717.
## 3 othercardio          22591.
## 4 stent_comp           31719.
```
]
.pull-right[

```r
interventions %>%
* left_join(prices)
```

```
## # A tibble: 9 × 4
##      ID proc_no procedure    medicare_price
##   <dbl> <chr>   <chr>                 <dbl>
## 1     1 proc_1  stent_nocomp         15827.
## 2     1 proc_2  othercardio          22591.
## 3     1 proc_3  <NA>                    NA 
## 4     2 proc_1  appendectomy         10717.
## 5     2 proc_2  <NA>                    NA 
## 6     2 proc_3  <NA>                    NA 
## 7     3 proc_1  stent_comp           31719.
## 8     3 proc_2  appendectomy         10717.
## 9     3 proc_3  othercardio          22591.
```
]

---

## Interventions `\(\rightarrow\)` patients

.pull-left-narrow[
- `data` (as usual)
- `names_from`: which column in the long format contains what should be column names in the wide format
- `values_from`: which column in the long format contains what should be values in the new columns in the wide format
]
.pull-right-wide[

```r
interventions %>%
* pivot_wider(
*   names_from = proc_no,
*   values_from = procedure
* )
```

---

# Line Graph Case Study: Approval Rating of President Trump

---

.footnote[
Source: [FiveThirtyEight](https://projects.fivethirtyeight.com/trump-approval-ratings/adults/)
]

---

## Data

```r
trump
```

```
## # A tibble: 2,702 × 4
##    subgroup date       approval disapproval
##    <chr>    <date>        <dbl>       <dbl>
##  1 Voters   2020-10-04     44.7        52.2
##  2 Adults   2020-10-04     43.2        52.6
##  3 Adults   2020-10-03     43.2        52.6
##  4 Voters   2020-10-03     45.0        51.7
##  5 Adults   2020-10-02     43.3        52.4
##  6 Voters   2020-10-02     44.5        52.1
##  7 Voters   2020-10-01     44.1        52.8
##  8 Adults   2020-10-01     42.7        53.3
##  9 Adults   2020-09-30     42.2        53.7
## 10 Voters   2020-09-30     44.2        52.7
## # … with 2,692 more rows
```

---

## Goal

.pull-left-wide[
<img src="w4-l03-tidying_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]
--
.pull-right-narrow[
**Aesthetic mappings:**  
✅  x = `date`  
❌      y = `rating_value`  
❌      color = `rating_type`

**Facet:**  
✅  `subgroup` (Adults and Voters)
]

---

## Pivot

```r
trump_longer <- trump %>%
  pivot_longer(
    cols = c(approval, disapproval),
    names_to = "rating_type",
    values_to = "rating_value"
  )

trump_longer
```

```
## # A tibble: 5,404 × 4
##    subgroup date       rating_type rating_value
##    <chr>    <date>     <chr>              <dbl>
##  1 Voters   2020-10-04 approval            44.7
##  2 Voters   2020-10-04 disapproval         52.2
##  3 Adults   2020-10-04 approval            43.2
##  4 Adults   2020-10-04 disapproval         52.6
##  5 Adults   2020-10-03 approval            43.2
##  6 Adults   2020-10-03 disapproval         52.6
##  7 Voters   2020-10-03 approval            45.0
##  8 Voters   2020-10-03 disapproval         51.7
...
```

---

## Plot

```r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup)
```

---

```r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
* scale_color_manual(values = c("darkgreen", "orange")) +
* labs(
*   x = "Date", y = "Rating",
*   color = NULL,
*   title = "How (un)popular is President Trump?",
*   subtitle = "Estimates based on polls of all adults and polls of likely/registered voters",
*   caption = "Source: FiveThirtyEight modeling estimates"
* )
```
]

.panel[.panel-name[Plot]
<img src="w4-l03-tidying_files/figure-html/unnamed-chunk-23-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---

```r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
  scale_color_manual(values = c("darkgreen", "orange")) + 
  labs( 
    x = "Date", y = "Rating", 
    color = NULL, 
    title = "How (un)popular is President Trump?", 
    subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", 
    caption = "Source: FiveThirtyEight modeling estimates" 
  ) 
```
]

.panel[.panel-name[Plot]
<img src="w4-l03-tidying_files/figure-html/unnamed-chunk-24-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---

## Coronavirus Deaths

First, we should call the libraries we used to make the plot during the first week of class. Because the `coronavirus` package is updated daily, we should re-install it to get the most up to date data.

```r
library(devtools) #to install package from GitHub
#reinstall coronavirus package to update #'s (they update daily)
devtools::install_github("RamiKrispin/coronavirus")
library(coronavirus)
library(lubridate) #to facilitate working with dates
library(glue) #we use in formatting the date
library(scales) #use to control notation on axis scale
library(ggrepel) #for writing names on plot
```

---

## Recall the Data

The data frame called `coronavirus` in the coronavirus package provides a daily summary of the Coronavirus (COVID-19) cases by country. Each row in the data frame represents a country (or, where relevant, state/province).  Note that the data provided in this package provides daily number of deaths, confirmed cases, and recovered cases. Today we will focus on deaths.

We will start by making our selection for the countries we want to explore.

---

```r
head(coronavirus) #recall variables
```

```
##         date province country     lat      long      type cases
## 1 2020-01-22  Alberta  Canada 53.9333 -116.5765 confirmed     0
## 2 2020-01-23  Alberta  Canada 53.9333 -116.5765 confirmed     0
## 3 2020-01-24  Alberta  Canada 53.9333 -116.5765 confirmed     0
## 4 2020-01-25  Alberta  Canada 53.9333 -116.5765 confirmed     0
## 5 2020-01-26  Alberta  Canada 53.9333 -116.5765 confirmed     0
## 6 2020-01-27  Alberta  Canada 53.9333 -116.5765 confirmed     0
```

```r
countries <- c(
  "China",
  "France",
  "United Kingdom",
  "US",
  "Turkey"
)
```

---

How did we wrangle those data?  First, we filtered to a smaller list of countries, edited long country names if needed, and summed all the deaths by country and date. For some countries like Canada on the previous slide, we have data on the provincial level, so we need to summarize to put data on the country level.

```r
country_data <- coronavirus %>%
  # filter for deaths in countries of interest
  filter(
    type == "death",
    country %in% countries
  ) %>%
  # fix country labels for pretty plotting
  mutate(
    country = case_when(
      country == "United Kingdom" ~ "UK",
      TRUE ~ country
    )
  ) %>%
  # calculate number of total cases for each country and date
  group_by(country, date) %>%
  summarise(tot_cases = sum(cases)) %>%
```

---

Next, we ordered by date and calculated the cumulative case counts. Then, we excluded days before the 10th confirmed death due to irregularities in early reporting, and we converted calendar date to days elapsed.

```r
  # arrange by date in ascending order
  arrange(date) %>%
  # record daily cumulative cases 
  mutate(cumulative_cases = cumsum(tot_cases)) %>%
  # only use days since the 10th confirmed death
  filter(cumulative_cases > 9) %>%
  # record days elapsed, end date, and end label
  mutate(
    days_elapsed = as.numeric(date - min(date)),
    end_date     = if_else(date == max(date), TRUE, FALSE),
    end_label    = if_else(end_date, country, NULL)
  ) %>%
  # ungroup
  ungroup()
```
---

We also need to take a note of the "as of date" for the data so that we can properly label our visualization.

```r
as_of_date <- country_data %>% 
  summarise(max(date)) %>% 
  pull()
as_of_date_formatted <- glue("{wday(as_of_date, label = TRUE)}, {month(as_of_date, label = TRUE)} {day(as_of_date)}, {year(as_of_date)}")
```

These data are as of Thu, Sep 16, 2021.

---

```r
ggplot(data = country_data,
       mapping = aes(x = days_elapsed, 
                     y = cumulative_cases, 
                     color = country, 
                     label = end_label)) +
  geom_line(size = 0.7, alpha = 0.8) +
  geom_point(data = country_data %>% filter(end_date)) +
  geom_label_repel(nudge_y = 1, direction = "y", hjust = 1) + 
  guides(color = FALSE) +
  scale_color_viridis_d() +
  scale_y_continuous(labels = label_comma()) +
  theme_minimal() +
  labs(
    x = "Days since 10th confirmed death",
    y = "Cumulative number of deaths",
    title = "Cumulative deaths from COVID-19, selected countries",
    subtitle = glue("Data as of", as_of_date_formatted, .sep = " "),
    caption = "Source: github.com/RamiKrispin/coronavirus"
  )
```
]
.panel[.panel-name[Output]
<img src="w4-l03-tidying_files/figure-html/visualise2-1.png" width="60%" style="display: block; margin: auto;" />

]
]

---

We might prefer a plot of daily deaths (here I filter to US - pick your own countries).

```r
country_data %>%
  filter(country == "US") %>%
  ggplot(aes(x = date,
             y = tot_cases,
             color = country,)) +
  geom_line() +
  labs(
    x = "Date",
    y = "Daily number of deaths",
    title = "Daily US deaths from COVID-19",
    subtitle = glue("Data as of", as_of_date_formatted, .sep = " "),
    caption = "Source: github.com/RamiKrispin/coronavirus"
  )
```
]
.panel[.panel-name[Output]
<img src="w4-l03-tidying_files/figure-html/daily2-1.png" width="60%" style="display: block; margin: auto;" />

]
]

---

Or of daily confirmed cases....

```r
confirmed_case_US <- coronavirus %>%
  # filter for deaths in countries of interest
  filter(
    type == "confirmed",
    country == "US"
  ) %>%
  # calculate number of total cases for each country and date
  group_by(country, date) %>%
  summarise(tot_cases = sum(cases)) %>%
  # arrange by date in ascending order
  arrange(date) %>%
  # record daily cumulative cases 
  mutate(cumulative_cases = cumsum(tot_cases)) %>%
  # only use days since the 10th confirmed case
  filter(cumulative_cases > 9) %>%
  # record days elapsed, end date, and end label
  mutate(
    days_elapsed = as.numeric(date - min(date)),
    end_date     = if_else(date == max(date), TRUE, FALSE),
    end_label    = if_else(end_date, country, NULL)
  ) %>%
  # ungroup
  ungroup()
```

---
.panelset[
.panel[.panel-name[Code]

```r
confirmed_case_US %>%
  ggplot(aes(x = date,
             y = tot_cases)) +
  geom_line() +
  scale_y_continuous(labels = label_comma()) +
  labs(
    x = "Date",
    y = "Daily number of confirmed cases",
    title = "Daily US Cases of COVID-19",
    subtitle = glue("Data as of", as_of_date_formatted, .sep = " "),
    caption = "Source: github.com/RamiKrispin/coronavirus"
  )
```
]
.panel[.panel-name[Output]
<img src="w4-l03-tidying_files/figure-html/dplotconfirmed2-1.png" width="60%" style="display: block; margin: auto;" />

]
]