Processing math: 100%
+ - 0:00:00
Notes for current slide
Notes for next slide

Data Wrangling I



Introduction to Global Health Data Science

Back to Website


Prof. Amy Herring

1 / 39

Main Ideas

  • Organizing our data according to a consistent set of "tidy" principles makes data easy to work with and leverages the ways R is effective.

  • Often we need to wrangle our data in order to extract meaning (including creating new variables, calculating summary statistics, subsetting data, etc).

  • Using only seven key verbs we can accomplish a wide variety of data wrangling tasks.

2 / 39

Tidy Principles

Tidy data have three related characteristics

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each value has its own cell.

3 / 39

Rural and Urban Patterns of Global Obesity

Obesity is a major risk factor for numerous non-communicable diseases, including diabetes, cardiovascular disease, hypertension, stroke, and some cancers. We consider data from an exercise developed by the Johns Hopkins Open Case Studies Project on average body mass index around the world.

Body mass index (BMI) is calculated as weight in kg divided by the square of height in m: BMI=kgm2. While it is not a perfect surrogate for adiposity (e.g., people may have a high BMI due to muscle mass rather than body fat), it is an easily-obtained surrogate that is related to health outcomes of interest.

4 / 39

WHO Classification of BMI

5 / 39

Data

Our data include summaries of BMI from 200 countries worldwide, both at the national level and in urban and rural areas separately, for the years 1985 and 2017, separately for men and women.

6 / 39

Data Wrangling

Often we need to wrangle our data to extract meaning. This includes calculating new variables, summary statistics, grouping by variables, renaming, reordering, selecting subsets of data, filtering by various conditions, etc.

We can accomplish a great deal of wrangling by learning just seven key verbs. Each of these functions takes a data frame as input and returns a data frame as output.

  • filter
  • arrange
  • select
  • slice
  • mutate
  • summarize
  • group_by
7 / 39

We first explore the data a bit.

load("BMI_long.Rdata")
glimpse(BMI_long)
## Rows: 2,400
## Columns: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "A…
## $ Sex <chr> "Men", "Men", "Men", "Men", "Men", "Men", "Wome…
## $ Region <chr> "National", "Rural", "Urban", "National", "Rura…
## $ Year <chr> "1985", "1985", "1985", "2017", "2017", "2017",…
## $ BMI <dbl> 20.2, 19.7, 22.4, 22.8, 22.5, 23.6, 20.6, 20.1,…
names(BMI_long)
## [1] "Country" "Sex" "Region" "Year" "BMI"
8 / 39
head(BMI_long)
## # A tibble: 6 × 5
## Country Sex Region Year BMI
## <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan Men National 1985 20.2
## 2 Afghanistan Men Rural 1985 19.7
## 3 Afghanistan Men Urban 1985 22.4
## 4 Afghanistan Men National 2017 22.8
## 5 Afghanistan Men Rural 2017 22.5
## 6 Afghanistan Men Urban 2017 23.6

The head() function returns "A tibble: 6 x 5" and then the first six rows of the BMI_long data. A tibble is a tweaked, opinionated version of the R data frame.

9 / 39

There are a few differences a tidyverse tibble and an R data frame. We discuss two primary differences.

First, a tibble provides more information than a data frame. When you print a tibble, it will show the first ten rows and all of the columns that fit on the screen, along with the type of each column. Try this with the BMI_long data.

You can modify the number of rows and columns shown using the print() function options.

Second, tibbles are somewhat more strict than data frames when it comes to subsetting data.

10 / 39

select()

The select() function picks off one or more columns by name.

Let's say we want a dataset that only contains the variable BMI.

select(BMI_long, BMI)
## # A tibble: 2,400 × 1
## BMI
## <dbl>
## 1 20.2
## 2 19.7
## 3 22.4
## 4 22.8
## 5 22.5
## 6 23.6
## # … with 2,394 more rows
11 / 39

We can also use select() to exclude variables. Let's exclude BMI but keep all other variables.

select(BMI_long, -BMI)
## # A tibble: 2,400 × 4
## Country Sex Region Year
## <chr> <chr> <chr> <chr>
## 1 Afghanistan Men National 1985
## 2 Afghanistan Men Rural 1985
## 3 Afghanistan Men Urban 1985
## 4 Afghanistan Men National 2017
## 5 Afghanistan Men Rural 2017
## 6 Afghanistan Men Urban 2017
## # … with 2,394 more rows
12 / 39

We can also use select() to select a range of variables. Here, we select the first three variables.

select(BMI_long, Country:Region)
## # A tibble: 2,400 × 3
## Country Sex Region
## <chr> <chr> <chr>
## 1 Afghanistan Men National
## 2 Afghanistan Men Rural
## 3 Afghanistan Men Urban
## 4 Afghanistan Men National
## 5 Afghanistan Men Rural
## 6 Afghanistan Men Urban
## # … with 2,394 more rows
13 / 39

arrange()

The arrange() function orders rows (observations) in specific ways.

Let's arrange the data by descending BMI, with large BMI values on top.

arrange(BMI_long, desc(BMI))
## # A tibble: 2,400 × 5
## Country Sex Region Year BMI
## <chr> <chr> <chr> <chr> <dbl>
## 1 American Samoa Women Urban 2017 35.4
## 2 American Samoa Women National 2017 35.3
## 3 American Samoa Women Rural 2017 35
## 4 Samoa Women Urban 2017 34.7
## 5 American Samoa Men Rural 2017 34.6
## 6 American Samoa Women Urban 1985 34.4
## # … with 2,394 more rows
14 / 39

Or with low BMI values on top.

arrange(BMI_long, BMI)
## # A tibble: 2,400 × 5
## Country Sex Region Year BMI
## <chr> <chr> <chr> <chr> <dbl>
## 1 Bangladesh Women Rural 1985 17.7
## 2 Bangladesh Women National 1985 18.2
## 3 Timor-Leste Women Rural 1985 18.2
## 4 Viet Nam Women Rural 1985 18.2
## 5 Nepal Women Rural 1985 18.3
## 6 Ethiopia Men Rural 1985 18.4
## # … with 2,394 more rows
15 / 39

Or by two factors...

arrange(BMI_long, desc(Country), BMI)
## # A tibble: 2,400 × 5
## Country Sex Region Year BMI
## <chr> <chr> <chr> <chr> <dbl>
## 1 Zimbabwe Men Rural 1985 20.9
## 2 Zimbabwe Men National 1985 21.2
## 3 Zimbabwe Men Rural 2017 21.9
## 4 Zimbabwe Men Urban 1985 22.1
## 5 Zimbabwe Men National 2017 22.3
## 6 Zimbabwe Men Urban 2017 23.3
## # … with 2,394 more rows
16 / 39

The Pipe

The pipe is a technique for passing information from one process to another.

BMI_long %>%
arrange(desc(BMI)) %>%
select(Country, BMI)
## # A tibble: 2,400 × 2
## Country BMI
## <chr> <dbl>
## 1 American Samoa 35.4
## 2 American Samoa 35.3
## 3 American Samoa 35
## 4 Samoa 34.7
## 5 American Samoa 34.6
## 6 American Samoa 34.4
## # … with 2,394 more rows

When reading code "in English", say "and then" whenever you see a pipe.

17 / 39

slice()

Slice selects rows based on their position.

Here we slice off the first 5 rows of the BMI_long data.

BMI_long %>%
slice(1:5)
## # A tibble: 5 × 5
## Country Sex Region Year BMI
## <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan Men National 1985 20.2
## 2 Afghanistan Men Rural 1985 19.7
## 3 Afghanistan Men Urban 1985 22.4
## 4 Afghanistan Men National 2017 22.8
## 5 Afghanistan Men Rural 2017 22.5
18 / 39

Question: What is the code chunk below accomplishing? Guess before running the code.

BMI_long %>%
arrange(desc(BMI)) %>%
slice(1:5)
## # A tibble: 5 × 5
## Country Sex Region Year BMI
## <chr> <chr> <chr> <chr> <dbl>
## 1 American Samoa Women Urban 2017 35.4
## 2 American Samoa Women National 2017 35.3
## 3 American Samoa Women Rural 2017 35
## 4 Samoa Women Urban 2017 34.7
## 5 American Samoa Men Rural 2017 34.6
19 / 39

To add comments to code, use the pound sign. This is helpful for debugging as well - you can temporarily disable a line.

Question: What will happen if you comment out the line containing arrange() in the previous code chunk? Try it.

BMI_long %>%
#arrange(desc(BMI)) %>%
slice(1:5)
## # A tibble: 5 × 5
## Country Sex Region Year BMI
## <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan Men National 1985 20.2
## 2 Afghanistan Men Rural 1985 19.7
## 3 Afghanistan Men Urban 1985 22.4
## 4 Afghanistan Men National 2017 22.8
## 5 Afghanistan Men Rural 2017 22.5
20 / 39

filter()

filter() selects rows satisfying certain conditions.

We can use a single condition. Here we select all rows where the country is Iceland.

BMI_long %>%
filter(Country == "Iceland")
## # A tibble: 12 × 5
## Country Sex Region Year BMI
## <chr> <chr> <chr> <chr> <dbl>
## 1 Iceland Men National 1985 24.8
## 2 Iceland Men Rural 1985 25.2
## 3 Iceland Men Urban 1985 24.8
## 4 Iceland Men National 2017 27.1
## 5 Iceland Men Rural 2017 27.3
## 6 Iceland Men Urban 2017 27.1
## # … with 6 more rows
21 / 39

We can use more than one condition. Here we select all rows where the Country is Iceland and the year is 2017.

BMI_long %>%
filter(Country == "Iceland", Year == 2017)
## # A tibble: 6 × 5
## Country Sex Region Year BMI
## <chr> <chr> <chr> <chr> <dbl>
## 1 Iceland Men National 2017 27.1
## 2 Iceland Men Rural 2017 27.3
## 3 Iceland Men Urban 2017 27.1
## 4 Iceland Women National 2017 25.5
## 5 Iceland Women Rural 2017 25.9
## 6 Iceland Women Urban 2017 25.5
22 / 39

More complex conditions are possible!

Question: In plain English, what is the code below accomplishing?

BMI_long %>%
filter(Country %in% c("Iceland", "Ireland"),
Year == 2017 | Region == "National")
## # A tibble: 16 × 5
## Country Sex Region Year BMI
## <chr> <chr> <chr> <chr> <dbl>
## 1 Iceland Men National 1985 24.8
## 2 Iceland Men National 2017 27.1
## 3 Iceland Men Rural 2017 27.3
## 4 Iceland Men Urban 2017 27.1
## 5 Iceland Women National 1985 24.3
## 6 Iceland Women National 2017 25.5
## # … with 10 more rows
23 / 39

The table of logical operators below will be helpful as you work with filtering.

operator definition
< is less than?
<= is less than or equal to?
> is greater than?
>= is greater than or equal to?
== is exactly equal to?
!= is not equal to?
x & y is x AND y?
x y is x OR y?
24 / 39

The table of logical operators below will be helpful as you work with filtering.

operator definition
is.na(x) is x NA?
!is.na(x) is x not NA?
x %in% y is x in y?
!(x %in% y) is x not in y?
!x is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).

R denotes missing values by NA. Missing values can cause biased or misleading results in any analysis, and it is important to understand why values are missing before drawing conclusions from data, especially if the fraction of missing values is not very small.

25 / 39

mutate()

mutate() creates a new variable. Here we categorize BMI using the WHO classification.

Here's our code:

BMI_long %>%
mutate(bmiclass = cut(BMI,
breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
labels = c(
"Underweight",
"Normal Range",
"Overweight",
"Obese I",
"Obese II",
"Obese III"
),
right = FALSE #intervals [a,b)
)) %>%
select(BMI, bmiclass)
26 / 39

Here's the output!

## # A tibble: 2,400 × 2
## BMI bmiclass
## <dbl> <fct>
## 1 20.2 Normal Range
## 2 19.7 Normal Range
## 3 22.4 Normal Range
## 4 22.8 Normal Range
## 5 22.5 Normal Range
## 6 23.6 Normal Range
## # … with 2,394 more rows
27 / 39

Using == in the mutate creates a new variable normal_range that is TRUE if the BMI is in the normal range and FALSE if it is not.

BMI_long %>%
mutate(bmiclass = cut(
BMI,
breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
labels = c(
"Underweight",
"Normal Range",
"Overweight",
"Obese I",
"Obese II",
"Obese III"
),
right = FALSE
)) %>%
mutate(normal_range = bmiclass == "Normal Range") %>%
print()
## # A tibble: 2,400 × 7
## Country Sex Region Year BMI bmiclass normal_range
## <chr> <chr> <chr> <chr> <dbl> <fct> <lgl>
## 1 Afghanistan Men National 1985 20.2 Normal Range TRUE
## 2 Afghanistan Men Rural 1985 19.7 Normal Range TRUE
## 3 Afghanistan Men Urban 1985 22.4 Normal Range TRUE
## 4 Afghanistan Men National 2017 22.8 Normal Range TRUE
## 5 Afghanistan Men Rural 2017 22.5 Normal Range TRUE
## 6 Afghanistan Men Urban 2017 23.6 Normal Range TRUE
## # … with 2,394 more rows
28 / 39

summarize()

summarize calculates summary statistics. It collapses rows into summary statistics and removes columns irrelevant to the calculation. Be sure to name your columns!

BMI_long %>%
mutate(bmiclass = cut(BMI,
breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
labels = c(
"Underweight",
"Normal Range",
"Overweight",
"Obese I",
"Obese II",
"Obese III"
),
right = FALSE
)) %>%
mutate(normal_range = bmiclass == "Normal Range") %>%
summarize(fracnormal = mean(normal_range))
## # A tibble: 1 × 1
## fracnormal
## <dbl>
## 1 NA
29 / 39

Question: The code chunk returns an NA. What is going wrong?

BMI_long %>%
mutate(bmiclass = cut(BMI,
breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
labels = c(
"Underweight",
"Normal Range",
"Overweight",
"Obese I",
"Obese II",
"Obese III"
),
right = FALSE
)) %>%
count(bmiclass) %>%
group_by(bmiclass)
## # A tibble: 6 × 2
## # Groups: bmiclass [6]
## bmiclass n
## <fct> <int>
## 1 Underweight 7
## 2 Normal Range 1185
## 3 Overweight 1091
## 4 Obese I 96
## 5 Obese II 3
## 6 <NA> 18

Ahh, count() is a very useful function!

30 / 39
Ahh, there were 18 missing values of the variable bmi! First, let's check to be sure we didn't code that new variable incorrectly by seeing if the source variable, BMI, also has some missing values.
BMI_long %>%
count(BMI=="NA")
## # A tibble: 2 × 2
## `BMI == "NA"` n
## <lgl> <int>
## 1 FALSE 2382
## 2 NA 18

OK great, so the problem was with the source variable. We can just drop those few NA's when we take the mean, so that we get the mean of the observed values. In general, one would follow up on the NA values to understand why data were missing. What seems to be the issue here?

31 / 39

Let's print all 18 variables and see if we can tell what's going on.

BMI_long %>%
filter(is.na(BMI)==TRUE) %>%
print(n=18)
32 / 39
## # A tibble: 18 × 5
## Country Sex Region Year BMI
## <chr> <chr> <chr> <chr> <dbl>
## 1 Bermuda Men Rural 1985 NA
## 2 Bermuda Men Rural 2017 NA
## 3 Bermuda Women Rural 1985 NA
## 4 Bermuda Women Rural 2017 NA
## 5 China (Hong Kong SAR) Men Rural 2017 NA
## 6 China (Hong Kong SAR) Women Rural 2017 NA
## 7 Nauru Men Rural 1985 NA
## 8 Nauru Men Rural 2017 NA
## 9 Nauru Women Rural 1985 NA
## 10 Nauru Women Rural 2017 NA
## 11 Singapore Men Rural 1985 NA
## 12 Singapore Men Rural 2017 NA
## 13 Singapore Women Rural 1985 NA
## 14 Singapore Women Rural 2017 NA
## 15 Tokelau Men Urban 1985 NA
## 16 Tokelau Men Urban 2017 NA
## 17 Tokelau Women Urban 1985 NA
## 18 Tokelau Women Urban 2017 NA

This makes sense! Hong Kong, Singapore, Bermuda, and Nauru do not classify any of their population as rural; Tokelau does not classify any of its population as urban.

33 / 39

Finally, let's get the proportion of values with mean BMI in the normal range.

BMI_long %>%
mutate(bmiclass = cut(BMI,
breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
labels = c(
"Underweight",
"Normal Range",
"Overweight",
"Obese I",
"Obese II",
"Obese III"
),
right = FALSE
)) %>%
mutate(normal_range = bmiclass == "Normal Range") %>%
summarize(fracnormal = mean(normal_range,na.rm=TRUE))
## # A tibble: 1 × 1
## fracnormal
## <dbl>
## 1 0.497
34 / 39

group_by()

group_by() is powerful when paired with summarize to calculate summary statistics by group. Here we find the proportion of means that are in the normal range for each region.

BMI_long %>%
mutate(bmiclass = cut(BMI,
breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
labels = c(
"Underweight",
"Normal Range",
"Overweight",
"Obese I",
"Obese II",
"Obese III"
),
right = FALSE
)) %>%
mutate(normal_range = bmiclass == "Normal Range") %>%
group_by(Region) %>%
summarize(fracnormal = mean(normal_range,na.rm=TRUE))
## # A tibble: 3 × 2
## Region fracnormal
## <chr> <dbl>
## 1 National 0.506
## 2 Rural 0.541
## 3 Urban 0.446
35 / 39

We can calculate more than one summary statistic in summarize().

BMI_long %>%
mutate(bmiclass = cut(
BMI,
breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
labels = c(
"Underweight",
"Normal Range",
"Overweight",
"Obese I",
"Obese II",
"Obese III"
),
right = FALSE
)) %>%
mutate(normal_range = bmiclass == "Normal Range") %>%
group_by(Region) %>%
summarize(
fracnormal = mean(normal_range, na.rm = TRUE),
max_BMI = max(BMI, na.rm = TRUE))
## # A tibble: 3 × 3
## Region fracnormal max_BMI
## <chr> <dbl> <dbl>
## 1 National 0.52 35.3
## 2 Rural 0.559 35
## 3 Urban 0.462 35.4
36 / 39

Question: What is the code chunk below doing?

BMI_long %>%
count(Year) %>%
mutate(prop = n / sum(n))
## # A tibble: 2 × 3
## Year n prop
## <chr> <int> <dbl>
## 1 1985 1200 0.5
## 2 2017 1200 0.5
37 / 39

distinct()

One more nifty command is the distinct command in R, which eliminates duplicate rows.

age <- as_tibble(c(1,1,1,2,3))
age
## # A tibble: 5 × 1
## value
## <dbl>
## 1 1
## 2 1
## 3 1
## 4 2
## 5 3
distinct(age)
## # A tibble: 3 × 1
## value
## <dbl>
## 1 1
## 2 2
## 3 3
38 / 39

Main Ideas

  • Organizing our data according to a consistent set of "tidy" principles makes data easy to work with and leverages the ways R is effective.

  • Often we need to wrangle our data in order to extract meaning (including creating new variables, calculating summary statistics, subsetting data, etc).

  • Using only seven key verbs we can accomplish a wide variety of data wrangling tasks.

2 / 39
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow