Data Wrangling I

class: center, middle, inverse, title-slide

# Data Wrangling I
## <br><br> Introduction to Global Health Data Science
### <a href="https://sta198f2021.github.io/website/">Back to Website</a>
### <br> Prof. Amy Herring

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---

# Main Ideas

- Organizing our data according to a consistent set of "tidy" principles makes
data easy to work with and leverages the ways `R` is effective.

- Often we need to wrangle our data in order to extract meaning (including
creating new variables, calculating summary statistics, subsetting data, etc).

- Using only **seven key verbs** we can accomplish a wide variety of data 
wrangling tasks.

---

## Tidy Principles

Tidy data have three related characteristics

1. Each variable forms a column.

2. Each observation forms a row.

3. Each value has its own cell.

---

## Rural and Urban Patterns of Global Obesity

Obesity is a major risk factor for numerous non-communicable diseases, including diabetes, cardiovascular disease, hypertension, stroke, and some cancers. We consider data from an exercise developed by the [Johns Hopkins Open Case Studies Project](https://www.opencasestudies.org/ocs-bp-rural-and-urban-obesity/#Motivation) on average body mass index around the world.

Body mass index (BMI) is calculated as weight in kg divided by the square of height in m:  `$BMI=\frac{kg}{m^2}$`. While it is not a perfect surrogate for adiposity (e.g., people may have a high BMI due to muscle mass rather than body fat), it is an easily-obtained surrogate that is related to health outcomes of interest.

---

## WHO Classification of BMI

---

## Data

Our data include summaries of BMI from 200 countries worldwide, both at the national level and in urban and rural areas separately, for the years 1985 and 2017, separately for men and women.

---

## Data Wrangling

Often we need to wrangle our data to extract meaning. This includes calculating
new variables, summary statistics, grouping by variables, renaming, reordering, 
selecting subsets of data, filtering by various conditions, etc.

We can accomplish a great deal of wrangling by learning just 
**seven key verbs**. Each of these functions takes a data frame as input and 
returns a data frame as output.

- filter
- arrange
- select
- slice
- mutate
- summarize
- group_by

---

We first explore the data a bit.

```r
load("BMI_long.Rdata")
glimpse(BMI_long)
```

```
## Rows: 2,400
## Columns: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "A…
## $ Sex     <chr> "Men", "Men", "Men", "Men", "Men", "Men", "Wome…
## $ Region  <chr> "National", "Rural", "Urban", "National", "Rura…
## $ Year    <chr> "1985", "1985", "1985", "2017", "2017", "2017",…
## $ BMI     <dbl> 20.2, 19.7, 22.4, 22.8, 22.5, 23.6, 20.6, 20.1,…
```

```r
names(BMI_long)
```

```
## [1] "Country" "Sex"     "Region"  "Year"    "BMI"
```

---

```r
head(BMI_long)
```

```
## # A tibble: 6 × 5
##   Country     Sex   Region   Year    BMI
##   <chr>       <chr> <chr>    <chr> <dbl>
## 1 Afghanistan Men   National 1985   20.2
## 2 Afghanistan Men   Rural    1985   19.7
## 3 Afghanistan Men   Urban    1985   22.4
## 4 Afghanistan Men   National 2017   22.8
## 5 Afghanistan Men   Rural    2017   22.5
## 6 Afghanistan Men   Urban    2017   23.6
```

The `head()` function returns "A tibble: 6 x 5" and then the first six rows of
the `BMI_long` data. A **tibble** is a tweaked, opinionated version of the `R`
data frame.

---

There are a few differences a `tidyverse` tibble and an `R` data frame. We discuss two primary differences.

First, a tibble provides more information than a data frame. When you print a tibble, 
it will show the first ten rows and all of the columns that fit on the screen,
along with the type of each column. Try this with the `BMI_long` data.

You can 
modify the number of rows and columns shown using the `print()` function options.

Second, tibbles are somewhat more strict than data frames when it comes to 
subsetting data.

---

### `select()`

The `select()` function picks off one or more columns by name.

Let's say we want a dataset that only contains the variable BMI.

```r
select(BMI_long, BMI)
```

```
## # A tibble: 2,400 × 1
##     BMI
##   <dbl>
## 1  20.2
## 2  19.7
## 3  22.4
## 4  22.8
## 5  22.5
## 6  23.6
## # … with 2,394 more rows
```

---

We can also use `select()` to exclude variables. Let's exclude `BMI` but
keep all other variables.

```r
select(BMI_long, -BMI)
```

```
## # A tibble: 2,400 × 4
##   Country     Sex   Region   Year 
##   <chr>       <chr> <chr>    <chr>
## 1 Afghanistan Men   National 1985 
## 2 Afghanistan Men   Rural    1985 
## 3 Afghanistan Men   Urban    1985 
## 4 Afghanistan Men   National 2017 
## 5 Afghanistan Men   Rural    2017 
## 6 Afghanistan Men   Urban    2017 
## # … with 2,394 more rows
```

---

We can also use `select()` to select a range of variables. Here, we select the 
first three variables.

```r
select(BMI_long, Country:Region)
```

```
## # A tibble: 2,400 × 3
##   Country     Sex   Region  
##   <chr>       <chr> <chr>   
## 1 Afghanistan Men   National
## 2 Afghanistan Men   Rural   
## 3 Afghanistan Men   Urban   
## 4 Afghanistan Men   National
## 5 Afghanistan Men   Rural   
## 6 Afghanistan Men   Urban   
## # … with 2,394 more rows
```

---

### `arrange()`

The `arrange()` function orders rows (observations) in specific ways.

Let's arrange the data by descending BMI, with large BMI values on top.

```r
arrange(BMI_long, desc(BMI))
```

```
## # A tibble: 2,400 × 5
##   Country        Sex   Region   Year    BMI
##   <chr>          <chr> <chr>    <chr> <dbl>
## 1 American Samoa Women Urban    2017   35.4
## 2 American Samoa Women National 2017   35.3
## 3 American Samoa Women Rural    2017   35  
## 4 Samoa          Women Urban    2017   34.7
## 5 American Samoa Men   Rural    2017   34.6
## 6 American Samoa Women Urban    1985   34.4
## # … with 2,394 more rows
```

---

Or with low BMI values on top.

```r
arrange(BMI_long, BMI)
```

```
## # A tibble: 2,400 × 5
##   Country     Sex   Region   Year    BMI
##   <chr>       <chr> <chr>    <chr> <dbl>
## 1 Bangladesh  Women Rural    1985   17.7
## 2 Bangladesh  Women National 1985   18.2
## 3 Timor-Leste Women Rural    1985   18.2
## 4 Viet Nam    Women Rural    1985   18.2
## 5 Nepal       Women Rural    1985   18.3
## 6 Ethiopia    Men   Rural    1985   18.4
## # … with 2,394 more rows
```

---

Or by two factors...

```r
arrange(BMI_long, desc(Country), BMI)
```

```
## # A tibble: 2,400 × 5
##   Country  Sex   Region   Year    BMI
##   <chr>    <chr> <chr>    <chr> <dbl>
## 1 Zimbabwe Men   Rural    1985   20.9
## 2 Zimbabwe Men   National 1985   21.2
## 3 Zimbabwe Men   Rural    2017   21.9
## 4 Zimbabwe Men   Urban    1985   22.1
## 5 Zimbabwe Men   National 2017   22.3
## 6 Zimbabwe Men   Urban    2017   23.3
## # … with 2,394 more rows
```

---

### The Pipe

The pipe is a technique for passing information from one process to another.

```r
BMI_long %>%
  arrange(desc(BMI)) %>%
  select(Country, BMI)
```

```
## # A tibble: 2,400 × 2
##   Country          BMI
##   <chr>          <dbl>
## 1 American Samoa  35.4
## 2 American Samoa  35.3
## 3 American Samoa  35  
## 4 Samoa           34.7
## 5 American Samoa  34.6
## 6 American Samoa  34.4
## # … with 2,394 more rows
```

When reading code "in English", say "and then" whenever you see a pipe.

---

### `slice()`

Slice selects rows based on their position.

Here we slice off the first 5 rows of the `BMI_long` data.

```r
BMI_long %>%
  slice(1:5)
```

```
## # A tibble: 5 × 5
##   Country     Sex   Region   Year    BMI
##   <chr>       <chr> <chr>    <chr> <dbl>
## 1 Afghanistan Men   National 1985   20.2
## 2 Afghanistan Men   Rural    1985   19.7
## 3 Afghanistan Men   Urban    1985   22.4
## 4 Afghanistan Men   National 2017   22.8
## 5 Afghanistan Men   Rural    2017   22.5
```

---

**Question:** What is the code chunk below accomplishing? Guess before running
the code.

.panelset[
.panel[.panel-name[Code]

```r
BMI_long %>%
  arrange(desc(BMI)) %>%
  slice(1:5)
```
]
.panel[.panel-name[Output]

```
## # A tibble: 5 × 5
##   Country        Sex   Region   Year    BMI
##   <chr>          <chr> <chr>    <chr> <dbl>
## 1 American Samoa Women Urban    2017   35.4
## 2 American Samoa Women National 2017   35.3
## 3 American Samoa Women Rural    2017   35  
## 4 Samoa          Women Urban    2017   34.7
## 5 American Samoa Men   Rural    2017   34.6
```
]
]

---

To add comments to code, use the pound sign. This is helpful for debugging as 
well - you can temporarily disable a line.

**Question:** What will happen if you comment out the line containing 
`arrange()` in the previous code chunk? Try it.

.panelset[
.panel[.panel-name[Code]

```r
BMI_long %>%
  #arrange(desc(BMI)) %>%
  slice(1:5)
```
]
.panel[.panel-name[Output]

---

### `filter()`

`filter()` selects rows satisfying certain conditions.

We can use a single condition. Here we select all rows where the country is Iceland.

```r
BMI_long %>%
  filter(Country == "Iceland")
```

```
## # A tibble: 12 × 5
##   Country Sex   Region   Year    BMI
##   <chr>   <chr> <chr>    <chr> <dbl>
## 1 Iceland Men   National 1985   24.8
## 2 Iceland Men   Rural    1985   25.2
## 3 Iceland Men   Urban    1985   24.8
## 4 Iceland Men   National 2017   27.1
## 5 Iceland Men   Rural    2017   27.3
## 6 Iceland Men   Urban    2017   27.1
## # … with 6 more rows
```

---

We can use more than one condition. Here we select all rows where the 
Country is Iceland and the year is 2017.

```r
BMI_long %>%
  filter(Country == "Iceland", Year == 2017)
```

```
## # A tibble: 6 × 5
##   Country Sex   Region   Year    BMI
##   <chr>   <chr> <chr>    <chr> <dbl>
## 1 Iceland Men   National 2017   27.1
## 2 Iceland Men   Rural    2017   27.3
## 3 Iceland Men   Urban    2017   27.1
## 4 Iceland Women National 2017   25.5
## 5 Iceland Women Rural    2017   25.9
## 6 Iceland Women Urban    2017   25.5
```

---

More complex conditions are possible!

**Question:** In plain English, what is the code below accomplishing?

```r
BMI_long %>%
  filter(Country %in% c("Iceland", "Ireland"),
         Year == 2017 | Region == "National")
```

```
## # A tibble: 16 × 5
##   Country Sex   Region   Year    BMI
##   <chr>   <chr> <chr>    <chr> <dbl>
## 1 Iceland Men   National 1985   24.8
## 2 Iceland Men   National 2017   27.1
## 3 Iceland Men   Rural    2017   27.3
## 4 Iceland Men   Urban    2017   27.1
## 5 Iceland Women National 1985   24.3
## 6 Iceland Women National 2017   25.5
## # … with 10 more rows
```

---

The table of logical operators below will be helpful as you work with filtering.

|operator | definition |
|:----------|:--------|
| `<` |   is less than? |
| `<=` |   is less than or equal to? |
| `>` |   is greater than? |
| `>=` |   is greater than or equal to? |
| `==` |   is exactly equal to? |
| `!=` |   is not equal to? |
| `x & y` |   is x AND y? |
| x `$\mid$` y |   is x OR y? |

---

The table of logical operators below will be helpful as you work with filtering.

|operator | definition |
|:----------|:--------|
| `is.na(x)` |   is x NA? |
| `!is.na(x)` |   is x not NA? |
| `x %in% y` |   is x in y? |
| `!(x %in% y)` |   is x not in y? |
| `!x` |   is not x? |

The final operator only makes sense if `x` is logical (TRUE / FALSE).

R denotes missing values by NA. Missing values can cause biased or misleading results in any analysis, and it is important to understand why values are missing before drawing conclusions from data, especially if the fraction of missing values is not very small.

---

### `mutate()`

`mutate()` creates a new variable. Here we categorize BMI using the WHO classification.

Here's our code:

Here's the output!

```
## # A tibble: 2,400 × 2
##     BMI bmiclass    
##   <dbl> <fct>       
## 1  20.2 Normal Range
## 2  19.7 Normal Range
## 3  22.4 Normal Range
## 4  22.8 Normal Range
## 5  22.5 Normal Range
## 6  23.6 Normal Range
## # … with 2,394 more rows
```

---

Using `==` in the mutate creates a new variable `normal_range` that is TRUE if the BMI is in the normal range and FALSE if it is not.

.panelset[
.panel[.panel-name[Code]

```r
BMI_long %>%
  mutate(bmiclass = cut(
    BMI,
    breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
    labels = c(
      "Underweight",
      "Normal Range",
      "Overweight",
      "Obese I",
      "Obese II",
      "Obese III"
    ),
    right = FALSE
  )) %>%
  mutate(normal_range = bmiclass == "Normal Range") %>%
  print()
```
]
.panel[.panel-name[Output]

```
## # A tibble: 2,400 × 7
##   Country     Sex   Region   Year    BMI bmiclass     normal_range
##   <chr>       <chr> <chr>    <chr> <dbl> <fct>        <lgl>       
## 1 Afghanistan Men   National 1985   20.2 Normal Range TRUE        
## 2 Afghanistan Men   Rural    1985   19.7 Normal Range TRUE        
## 3 Afghanistan Men   Urban    1985   22.4 Normal Range TRUE        
## 4 Afghanistan Men   National 2017   22.8 Normal Range TRUE        
## 5 Afghanistan Men   Rural    2017   22.5 Normal Range TRUE        
## 6 Afghanistan Men   Urban    2017   23.6 Normal Range TRUE        
## # … with 2,394 more rows
```
]
]

---

### `summarize()`

`summarize` calculates summary statistics. It collapses rows into summary 
statistics and removes columns irrelevant to the calculation. Be sure to name your columns!

.panelset[
.panel[.panel-name[Code]

```r
BMI_long %>%
  mutate(bmiclass = cut(BMI,
    breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
    labels = c(
      "Underweight",
      "Normal Range",
      "Overweight",
      "Obese I",
      "Obese II",
      "Obese III"
    ),
    right = FALSE
  )) %>%
  mutate(normal_range = bmiclass == "Normal Range") %>%
  summarize(fracnormal = mean(normal_range))
```
]
.panel[.panel-name[Output]

```
## # A tibble: 1 × 1
##   fracnormal
##        <dbl>
## 1         NA
```
]
]

---

**Question:** The code chunk returns an NA. What is going wrong?

.panelset[
.panel[.panel-name[Code]

```r
BMI_long %>%
  mutate(bmiclass = cut(BMI,
    breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
    labels = c(
      "Underweight",
      "Normal Range",
      "Overweight",
      "Obese I",
      "Obese II",
      "Obese III"
    ),
    right = FALSE
  )) %>%
  count(bmiclass) %>%
  group_by(bmiclass) 
```
]
.panel[.panel-name[Output]

```
## # A tibble: 6 × 2
## # Groups:   bmiclass [6]
##   bmiclass         n
##   <fct>        <int>
## 1 Underweight      7
## 2 Normal Range  1185
## 3 Overweight    1091
## 4 Obese I         96
## 5 Obese II         3
## 6 <NA>            18
```
Ahh, count() is a very useful function!
]
]

---
Ahh, there were 18 missing values of the variable `bmiclass`!  First, let's check to be sure we didn't code that new variable incorrectly by seeing if the source variable, BMI, also has some missing values.

.panelset[
.panel[.panel-name[Code]

```r
BMI_long %>%
  count(BMI=="NA")
```
]
.panel[.panel-name[Output]

```
## # A tibble: 2 × 2
##   `BMI == "NA"`     n
##   <lgl>         <int>
## 1 FALSE          2382
## 2 NA               18
```

OK great, so the problem was with the source variable. We can just drop those few NA's when we take the mean, so that we get the mean of the observed values. In general, one would follow up on the NA values to understand why data were missing. What seems to be the issue here?
]
]

---

Let's print all 18 variables and see if we can tell what's going on.

```r
BMI_long %>%
  filter(is.na(BMI)==TRUE) %>%
  print(n=18)
```

---

```
## # A tibble: 18 × 5
##    Country               Sex   Region Year    BMI
##    <chr>                 <chr> <chr>  <chr> <dbl>
##  1 Bermuda               Men   Rural  1985     NA
##  2 Bermuda               Men   Rural  2017     NA
##  3 Bermuda               Women Rural  1985     NA
##  4 Bermuda               Women Rural  2017     NA
##  5 China (Hong Kong SAR) Men   Rural  2017     NA
##  6 China (Hong Kong SAR) Women Rural  2017     NA
##  7 Nauru                 Men   Rural  1985     NA
##  8 Nauru                 Men   Rural  2017     NA
##  9 Nauru                 Women Rural  1985     NA
## 10 Nauru                 Women Rural  2017     NA
## 11 Singapore             Men   Rural  1985     NA
## 12 Singapore             Men   Rural  2017     NA
## 13 Singapore             Women Rural  1985     NA
## 14 Singapore             Women Rural  2017     NA
## 15 Tokelau               Men   Urban  1985     NA
## 16 Tokelau               Men   Urban  2017     NA
## 17 Tokelau               Women Urban  1985     NA
## 18 Tokelau               Women Urban  2017     NA
```

This makes sense! Hong Kong, Singapore, Bermuda, and Nauru do not classify any of their population as rural; Tokelau does not classify any of its population as urban.

---

Finally, let's get the proportion of values with mean BMI in the normal range.

.panelset[
.panel[.panel-name[Code]

```r
BMI_long %>%
  mutate(bmiclass = cut(BMI,
    breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
    labels = c(
      "Underweight",
      "Normal Range",
      "Overweight",
      "Obese I",
      "Obese II",
      "Obese III"
    ),
    right = FALSE
  )) %>%
  mutate(normal_range = bmiclass == "Normal Range") %>%
  summarize(fracnormal = mean(normal_range,na.rm=TRUE))
```
]
.panel[.panel-name[Output]

```
## # A tibble: 1 × 1
##   fracnormal
##        <dbl>
## 1      0.497
```
]
]
---

### `group_by()`

`group_by()` is powerful when paired with
summarize to calculate summary statistics by group. Here we find the proportion of means that are in the normal range for each region.

.panelset[
.panel[.panel-name[Code]

```r
BMI_long %>%
  mutate(bmiclass = cut(BMI,
    breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
    labels = c(
      "Underweight",
      "Normal Range",
      "Overweight",
      "Obese I",
      "Obese II",
      "Obese III"
    ),
    right = FALSE
  )) %>%
  mutate(normal_range = bmiclass == "Normal Range") %>%
  group_by(Region) %>%
  summarize(fracnormal = mean(normal_range,na.rm=TRUE))
```
]
.panel[.panel-name[Output]

```
## # A tibble: 3 × 2
##   Region   fracnormal
##   <chr>         <dbl>
## 1 National      0.506
## 2 Rural         0.541
## 3 Urban         0.446
```
]
]

---

We can calculate more than one summary statistic in `summarize()`.

.panelset[
.panel[.panel-name[Code]

```
## # A tibble: 3 × 3
##   Region   fracnormal max_BMI
##   <chr>         <dbl>   <dbl>
## 1 National      0.52     35.3
## 2 Rural         0.559    35  
## 3 Urban         0.462    35.4
```
]
]

---

**Question:** What is the code chunk below doing?

.panelset[
.panel[.panel-name[Code]

```r
BMI_long %>%
  count(Year) %>%
  mutate(prop = n / sum(n))
```
]
.panel[.panel-name[Output]

```
## # A tibble: 2 × 3
##   Year      n  prop
##   <chr> <int> <dbl>
## 1 1985   1200   0.5
## 2 2017   1200   0.5
```
]
]

---

### distinct()

One more nifty command is the `distinct` command in R, which eliminates duplicate rows.

```r
age <- as_tibble(c(1,1,1,2,3))
age
```

```
## # A tibble: 5 × 1
##   value
##   <dbl>
## 1     1
## 2     1
## 3     1
## 4     2
## 5     3
```

```r
distinct(age)
```

```
## # A tibble: 3 × 1
##   value
##   <dbl>
## 1     1
## 2     2
## 3     3
```

---

### Additional Resources

- https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
- https://style.tidyverse.org/
- https://www.opencasestudies.org/ocs-bp-rural-and-urban-obesity/