Practice data wrangling skills
Explore mercury exposure levels among individuals living in the Peruvian Amazon
Artisinal and small scale gold mining has grown significantly over the past twenty years. It has the potential to provide strong socioeconomic benefits to individuals and communities. However, unlike large-scale mining, which is governed by regulatory controls and state standards, regulation of artisinal and small scale mining is often poorly-policed with a lack of enforcement, leading to human and labor rights conflicts, safety issues, and environmental degradation. One of the major associated environmental contaminants is mercury – the miners themselves are subject to mercury vapor inhalation, and contamination of fish in nearby waterways is a concern for the general population.
Professor Bill Pan in Duke’s Global Health Institute in collaboration with colleagues in the US and Peru designed and conducted a population-based mercury exposure assessment in 23 communities around the Amarakaeri Communal Reserve, which is bordered by an area of significant artisinal and small scale gold mining activity. In this lab we explore mercury levels in hair samples (lhairHg
, on the scale of log ppm), which are an indicator of chronic exposure over roughly one year.
Other variables in the dataset of interest for this assignment include hid
(household ID), withinid
(ID of the individual within the household), cid
(community ID), community
(community name), longitude
, and latitude
(these refer to the household location).
library(tidyverse)
library(sf)
library(viridis)
library(ggspatial) #for scale annotation
# the classic dark-on-light theme for ggplot2 is nice for maps
theme_set(theme_bw())
<- readr::read_csv("05/mercury_amazon.csv") mercury
By design, not all participants were asked to provide hair samples for mercury measurement. The variable lhairHg
contains log-transformed exposure variables, with units of log(ppm). Create a new dataset, named mercury2
, that excludes all participants with missing hair mercury measurements. In this dataset, create a new variable, hairHg
, that contains the original untransformed exposure variablels on the ppm scale (hint: the exp() function can be used to place lhairHg
back on its original scale of ppm). Include this code in the homework you submit on Gradescope. What fraction of the original participants are excluded due to lack of hair mercury measures?
Using the new dataset mercury2
, create two histograms of hair mercury levels: one of the log transformed data, and one on the original (ppm) scale. Be sure your histograms are professionally formatted and include informative labels. Does the log transformation succeed in its goal of making the exposure variable have a more symmetric distribution?
Provide a summary table (printed directly from R) of the number of observations from each community with measured hair mercury concentrations, arranged in descending order with the community providing the largest number of samples listed first, and the community providing the smallest number of samples listed last. This table should have number of rows equal to the number of communities.
Acceptable levels of mercury in the body have been hotly debated in the literature, given that much of the general population’s exposure comes through fish consumption, which has health benefits. The World Health Organization (WHO) in one report noted a level of 10 ppm in hair mercury of pregnant women to be a potential concern for fetal development. Create a new variable that takes value TRUE if hair mercury levels are \(\geq\) 10 ppm and FALSE if hair mercury levels are < 10 ppm. Include this code in the report you submit on Gradescope. What fraction of hair mercury levels are \(\geq\) 10 ppm?
Let’s look at the data by community. The code below reads a shape file for Peruvian waterways and plots the communities. Edit this code so that the community points are colored according to mean log mercury levels in hair samples, enlarging the points, making labels informative, and making other adaptations to improve the plot.
#shapefile courtesy of mapcruizin.com
<- st_read("05/waterways.shp")
peru_waterways
<- mercury2 %>% #you'll need to create these data in order to run this chunk
mercury3 group_by(community) %>%
summarize(
meanlat = mean(latitude),
meanlong = mean(longitude)
)
ggplot(data = peru_waterways) +
geom_sf() + # here we're plotting the waterways
coord_sf(xlim = c(-72, -69),
#restricting to the area of study
ylim = c(-14, -12),
#instead of all of Peru
expand = FALSE) + geom_point(
data = mercury3,
#add points for study communities
aes(x = meanlong, y = meanlat),
alpha = 0.7
+ annotation_scale() )
mercury2
data, restrict the data to contain observations only from the Masenawa community. Then, select the variables hid
(household ID), withinid
(indicates the person within the household), and lhairHg
. Print a table showing just these variables for all households in the community in the mercury2
data. Then, use R to print the table again, this time with one line per household ID and columns corresponding to the within household ID. Include this code in the assignment you submit to Gradescope. Explain why there are NA values in your second table.Total: 50 pts
Exercise 1: 7 pts
Exercise 2: 7 pts
Exercise 3: 7 pts
Exercise 4: 7 pts
Exercise 5: 7 pts
Exercise 6: 7 pts
Overall: 8 pts