class: center, middle, inverse, title-slide # Tidying data ##
Introduction to Global Health Data Science ###
Back to Website
###
Prof. Amy Herring --- layout: true <div class="my-footer"> <span> <a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a> </span> </div> --- class: middle # .hand[We...] .huge[.green[have]] .hand[data organised in a suboptimal manner for our analysis] .huge[.pink[want]] .hand[to reorganise the data to carry on with our analysis] This slide deck adapted from Data Science in a Box! --- Today we'll do the following: 1. Learn how to `tidy` data 2. Practice data tidying skills to reproduce a plot of presidential approval ratings 3. Step back and review. We revisit the COVID example from week 1 and explore how those data were wrangled to create the plot. --- ## Data: Transparency in Medical Pricing The US Centers for Medicare and Medicaid Services now requires hospitals to [release pricing information to patients before providing services](https://www.healthaffairs.org/do/10.1377/hblog20200304.157067/full/). Hospitals have varied in how accessible they have made this data to consumers. For example, Duke Hospital refers patients [to a website](https://www.dukehealth.org/paying-for-care/what-duke-charges-services) that contains a .csv file, and it directs patients to [a financial counselor](https://www.dukehealth.org/paying-for-care/estimate-care-will-cost) to help get estimates (the .csv file is just a list and provides no information about how many procedures may be charged, etc). --- ## Data: Duke Medicare Prices <br> .pull-left[ ### .green[We have...] ``` ## # A tibble: 3 × 4 ## ID proc_1 proc_2 proc_3 ## <dbl> <chr> <chr> <chr> ## 1 1 stent_nocomp othercardio <NA> ## 2 2 appendectomy <NA> <NA> ## 3 3 stent_comp appendectomy othercardio ``` ] -- .pull-right[ ### .pink[We want...] ``` ## # A tibble: 9 × 3 ## ID proc_no procedure ## <dbl> <chr> <chr> ## 1 1 proc_1 stent_nocomp ## 2 1 proc_2 othercardio ## 3 1 proc_3 <NA> ## 4 2 proc_1 appendectomy ## 5 2 proc_2 <NA> ## 6 2 proc_3 <NA> ## 7 3 proc_1 stent_comp ## 8 3 proc_2 appendectomy ## 9 3 proc_3 othercardio ``` ] --- ## A grammar of data tidying .pull-left[ <img src="img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" /> ] .pull-right[ The goal of tidyr is to help you tidy your data via - pivoting for going between wide and long data - splitting and combining character columns - nesting and unnesting columns - clarifying how `NA`s should be treated ] --- class: middle # Pivoting data --- ## Not this... <img src="img/pivot.gif" width="70%" style="display: block; margin: auto;" /> --- ## but this! .center[ <img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] --- ## Wider vs. longer .pull-left[ ### .green[wider] more columns ``` ## # A tibble: 3 × 4 ## ID proc_1 proc_2 proc_3 ## <dbl> <chr> <chr> <chr> ## 1 1 stent_nocomp othercardio <NA> ## 2 2 appendectomy <NA> <NA> ## 3 3 stent_comp appendectomy othercardio ``` ] -- .pull-right[ ### .pink[longer] more rows ``` ## # A tibble: 9 × 3 ## ID proc_no procedure ## <dbl> <chr> <chr> ## 1 1 proc_1 stent_nocomp ## 2 1 proc_2 othercardio ## 3 1 proc_3 <NA> ## 4 2 proc_1 appendectomy ## 5 2 proc_2 <NA> ## 6 2 proc_3 <NA> ## 7 3 proc_1 stent_comp ## 8 3 proc_2 appendectomy ## 9 3 proc_3 othercardio ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) ] .pull-right[ ```r pivot_longer( * data, cols, names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format ] .pull-right[ ```r pivot_longer( data, * cols, names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format - `names_to`: name of the column where column names of pivoted variables go (character string) ] .pull-right[ ```r pivot_longer( data, cols, * names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format - `names_to`: name of the column where column names of pivoted variables go (character string) - `values_to`: name of the column where data in pivoted variables go (character string) ] .pull-right[ ```r pivot_longer( data, cols, names_to = "name", * values_to = "value" ) ``` ] --- ## Patients `\(\rightarrow\)` interventions ```r interventions <- patients %>% * pivot_longer( * cols = proc_1:proc_3, # variables proc_1 to proc_3 * names_to = "proc_no", # column names -> new column called proc_no * values_to = "procedure" # values in columns -> new column called procedure * ) interventions ``` ``` ## # A tibble: 9 × 3 ## ID proc_no procedure ## <dbl> <chr> <chr> ## 1 1 proc_1 stent_nocomp ## 2 1 proc_2 othercardio ## 3 1 proc_3 <NA> ## 4 2 proc_1 appendectomy ## 5 2 proc_2 <NA> ## 6 2 proc_3 <NA> ## 7 3 proc_1 stent_comp ## 8 3 proc_2 appendectomy ## 9 3 proc_3 othercardio ``` --- ## Why pivot? Most likely, because the next step of your analysis needs it -- .pull-left[ ```r prices ``` ``` ## # A tibble: 4 × 2 ## procedure medicare_price ## <chr> <dbl> ## 1 stent_nocomp 15827. ## 2 appendectomy 10717. ## 3 othercardio 22591. ## 4 stent_comp 31719. ``` ] .pull-right[ ```r interventions %>% * left_join(prices) ``` ``` ## # A tibble: 9 × 4 ## ID proc_no procedure medicare_price ## <dbl> <chr> <chr> <dbl> ## 1 1 proc_1 stent_nocomp 15827. ## 2 1 proc_2 othercardio 22591. ## 3 1 proc_3 <NA> NA ## 4 2 proc_1 appendectomy 10717. ## 5 2 proc_2 <NA> NA ## 6 2 proc_3 <NA> NA ## 7 3 proc_1 stent_comp 31719. ## 8 3 proc_2 appendectomy 10717. ## 9 3 proc_3 othercardio 22591. ``` ] --- ## Interventions `\(\rightarrow\)` patients .pull-left-narrow[ - `data` (as usual) - `names_from`: which column in the long format contains what should be column names in the wide format - `values_from`: which column in the long format contains what should be values in the new columns in the wide format ] .pull-right-wide[ ```r interventions %>% * pivot_wider( * names_from = proc_no, * values_from = procedure * ) ``` ``` ## # A tibble: 3 × 4 ## ID proc_1 proc_2 proc_3 ## <dbl> <chr> <chr> <chr> ## 1 1 stent_nocomp othercardio <NA> ## 2 2 appendectomy <NA> <NA> ## 3 3 stent_comp appendectomy othercardio ``` ] --- class: middle # Line Graph Case Study: Approval Rating of President Trump --- <img src="img/trump-approval.png" width="70%" style="display: block; margin: auto;" /> .footnote[ Source: [FiveThirtyEight](https://projects.fivethirtyeight.com/trump-approval-ratings/adults/) ] --- ## Data ```r trump ``` ``` ## # A tibble: 2,702 × 4 ## subgroup date approval disapproval ## <chr> <date> <dbl> <dbl> ## 1 Voters 2020-10-04 44.7 52.2 ## 2 Adults 2020-10-04 43.2 52.6 ## 3 Adults 2020-10-03 43.2 52.6 ## 4 Voters 2020-10-03 45.0 51.7 ## 5 Adults 2020-10-02 43.3 52.4 ## 6 Voters 2020-10-02 44.5 52.1 ## 7 Voters 2020-10-01 44.1 52.8 ## 8 Adults 2020-10-01 42.7 53.3 ## 9 Adults 2020-09-30 42.2 53.7 ## 10 Voters 2020-09-30 44.2 52.7 ## # … with 2,692 more rows ``` --- ## Goal .pull-left-wide[ <img src="w4-l03-tidying_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right-narrow[ **Aesthetic mappings:** ✅ x = `date` ❌ y = `rating_value` ❌ color = `rating_type` **Facet:** ✅ `subgroup` (Adults and Voters) ] --- ## Pivot ```r trump_longer <- trump %>% pivot_longer( cols = c(approval, disapproval), names_to = "rating_type", values_to = "rating_value" ) trump_longer ``` ``` ## # A tibble: 5,404 × 4 ## subgroup date rating_type rating_value ## <chr> <date> <chr> <dbl> ## 1 Voters 2020-10-04 approval 44.7 ## 2 Voters 2020-10-04 disapproval 52.2 ## 3 Adults 2020-10-04 approval 43.2 ## 4 Adults 2020-10-04 disapproval 52.6 ## 5 Adults 2020-10-03 approval 43.2 ## 6 Adults 2020-10-03 disapproval 52.6 ## 7 Voters 2020-10-03 approval 45.0 ## 8 Voters 2020-10-03 disapproval 51.7 ... ``` --- ## Plot ```r ggplot(trump_longer, aes(x = date, y = rating_value, color = rating_type, group = rating_type)) + geom_line() + facet_wrap(~ subgroup) ``` <img src="w4-l03-tidying_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" /> --- .panelset[ .panel[.panel-name[Code] ```r ggplot(trump_longer, aes(x = date, y = rating_value, color = rating_type, group = rating_type)) + geom_line() + facet_wrap(~ subgroup) + * scale_color_manual(values = c("darkgreen", "orange")) + * labs( * x = "Date", y = "Rating", * color = NULL, * title = "How (un)popular is President Trump?", * subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", * caption = "Source: FiveThirtyEight modeling estimates" * ) ``` ] .panel[.panel-name[Plot] <img src="w4-l03-tidying_files/figure-html/unnamed-chunk-23-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- .panelset[ .panel[.panel-name[Code] ```r ggplot(trump_longer, aes(x = date, y = rating_value, color = rating_type, group = rating_type)) + geom_line() + facet_wrap(~ subgroup) + scale_color_manual(values = c("darkgreen", "orange")) + labs( x = "Date", y = "Rating", color = NULL, title = "How (un)popular is President Trump?", subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", caption = "Source: FiveThirtyEight modeling estimates" ) ``` ] .panel[.panel-name[Plot] <img src="w4-l03-tidying_files/figure-html/unnamed-chunk-24-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- ## Coronavirus Deaths First, we should call the libraries we used to make the plot during the first week of class. Because the `coronavirus` package is updated daily, we should re-install it to get the most up to date data. ```r library(devtools) #to install package from GitHub #reinstall coronavirus package to update #'s (they update daily) devtools::install_github("RamiKrispin/coronavirus") library(coronavirus) library(lubridate) #to facilitate working with dates library(glue) #we use in formatting the date library(scales) #use to control notation on axis scale library(ggrepel) #for writing names on plot ``` --- ## Recall the Data The data frame called `coronavirus` in the coronavirus package provides a daily summary of the Coronavirus (COVID-19) cases by country. Each row in the data frame represents a country (or, where relevant, state/province). Note that the data provided in this package provides daily number of deaths, confirmed cases, and recovered cases. Today we will focus on deaths. We will start by making our selection for the countries we want to explore. --- ```r head(coronavirus) #recall variables ``` ``` ## date province country lat long type cases ## 1 2020-01-22 Alberta Canada 53.9333 -116.5765 confirmed 0 ## 2 2020-01-23 Alberta Canada 53.9333 -116.5765 confirmed 0 ## 3 2020-01-24 Alberta Canada 53.9333 -116.5765 confirmed 0 ## 4 2020-01-25 Alberta Canada 53.9333 -116.5765 confirmed 0 ## 5 2020-01-26 Alberta Canada 53.9333 -116.5765 confirmed 0 ## 6 2020-01-27 Alberta Canada 53.9333 -116.5765 confirmed 0 ``` ```r countries <- c( "China", "France", "United Kingdom", "US", "Turkey" ) ``` --- How did we wrangle those data? First, we filtered to a smaller list of countries, edited long country names if needed, and summed all the deaths by country and date. For some countries like Canada on the previous slide, we have data on the provincial level, so we need to summarize to put data on the country level. ```r country_data <- coronavirus %>% # filter for deaths in countries of interest filter( type == "death", country %in% countries ) %>% # fix country labels for pretty plotting mutate( country = case_when( country == "United Kingdom" ~ "UK", TRUE ~ country ) ) %>% # calculate number of total cases for each country and date group_by(country, date) %>% summarise(tot_cases = sum(cases)) %>% ``` --- Next, we ordered by date and calculated the cumulative case counts. Then, we excluded days before the 10th confirmed death due to irregularities in early reporting, and we converted calendar date to days elapsed. ```r # arrange by date in ascending order arrange(date) %>% # record daily cumulative cases mutate(cumulative_cases = cumsum(tot_cases)) %>% # only use days since the 10th confirmed death filter(cumulative_cases > 9) %>% # record days elapsed, end date, and end label mutate( days_elapsed = as.numeric(date - min(date)), end_date = if_else(date == max(date), TRUE, FALSE), end_label = if_else(end_date, country, NULL) ) %>% # ungroup ungroup() ``` --- We also need to take a note of the "as of date" for the data so that we can properly label our visualization. ```r as_of_date <- country_data %>% summarise(max(date)) %>% pull() as_of_date_formatted <- glue("{wday(as_of_date, label = TRUE)}, {month(as_of_date, label = TRUE)} {day(as_of_date)}, {year(as_of_date)}") ``` These data are as of Thu, Sep 16, 2021. --- .panelset[ .panel[.panel-name[Code] ```r ggplot(data = country_data, mapping = aes(x = days_elapsed, y = cumulative_cases, color = country, label = end_label)) + geom_line(size = 0.7, alpha = 0.8) + geom_point(data = country_data %>% filter(end_date)) + geom_label_repel(nudge_y = 1, direction = "y", hjust = 1) + guides(color = FALSE) + scale_color_viridis_d() + scale_y_continuous(labels = label_comma()) + theme_minimal() + labs( x = "Days since 10th confirmed death", y = "Cumulative number of deaths", title = "Cumulative deaths from COVID-19, selected countries", subtitle = glue("Data as of", as_of_date_formatted, .sep = " "), caption = "Source: github.com/RamiKrispin/coronavirus" ) ``` ] .panel[.panel-name[Output] <img src="w4-l03-tidying_files/figure-html/visualise2-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- We might prefer a plot of daily deaths (here I filter to US - pick your own countries). .panelset[ .panel[.panel-name[Code] ```r country_data %>% filter(country == "US") %>% ggplot(aes(x = date, y = tot_cases, color = country,)) + geom_line() + labs( x = "Date", y = "Daily number of deaths", title = "Daily US deaths from COVID-19", subtitle = glue("Data as of", as_of_date_formatted, .sep = " "), caption = "Source: github.com/RamiKrispin/coronavirus" ) ``` ] .panel[.panel-name[Output] <img src="w4-l03-tidying_files/figure-html/daily2-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- Or of daily confirmed cases.... ```r confirmed_case_US <- coronavirus %>% # filter for deaths in countries of interest filter( type == "confirmed", country == "US" ) %>% # calculate number of total cases for each country and date group_by(country, date) %>% summarise(tot_cases = sum(cases)) %>% # arrange by date in ascending order arrange(date) %>% # record daily cumulative cases mutate(cumulative_cases = cumsum(tot_cases)) %>% # only use days since the 10th confirmed case filter(cumulative_cases > 9) %>% # record days elapsed, end date, and end label mutate( days_elapsed = as.numeric(date - min(date)), end_date = if_else(date == max(date), TRUE, FALSE), end_label = if_else(end_date, country, NULL) ) %>% # ungroup ungroup() ``` --- .panelset[ .panel[.panel-name[Code] ```r confirmed_case_US %>% ggplot(aes(x = date, y = tot_cases)) + geom_line() + scale_y_continuous(labels = label_comma()) + labs( x = "Date", y = "Daily number of confirmed cases", title = "Daily US Cases of COVID-19", subtitle = glue("Data as of", as_of_date_formatted, .sep = " "), caption = "Source: github.com/RamiKrispin/coronavirus" ) ``` ] .panel[.panel-name[Output] <img src="w4-l03-tidying_files/figure-html/dplotconfirmed2-1.png" width="60%" style="display: block; margin: auto;" /> ] ]