Introduction to Modelling

# Introduction to Modelling
## <br><br> Introduction to Global Health Data Science
### <a href="https://sta198f2021.github.io/website/">Back to Website</a>
### <br> Prof. Amy Herring

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---

# What is a model?

---

## Modelling

- Use models to explain the relationship between variables and to make predictions
- For now we will focus on **linear** models (but remember there are *many* *many* other types of models too!)

.pull-left[
<img src="w8-l01-intro-model_files/figure-html/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w8-l01-intro-model_files/figure-html/unnamed-chunk-3-1.png" width="100%" style="display: block; margin: auto;" />
]
---

# Data: Artisanal and Small-Scale Gold Mining in the Peruvian Amazon

---

## Mercury and Assets

Are wealthier individuals less likely to have high mercury exposures?

Today we will explore the relationship between levels of mercury in hair and an index measuring socio-economic position as a function of household assets. Note that asset-based indices are not at all a perfect proxy and may perform poorly, particularly in low income countries (the World Bank classifies Peru as an upper-middle income country).

Because the scale of this variable  is not easily interpreted, we will convert it to standard deviation units by subtracting the means and dividing by the standard deviation (like getting a z-score). Then a one-unit increase in the variable is a one standard deviation increase.

---
  
### Packages and Data Manipulation

```r
library(tidyverse)
library(tidymodels)
library(readr)
mercury <- readr::read_csv("mercury_reg.csv")
mercury <-
  mercury %>%
  # scale() subtracts the mean and divides by the SD to make the units "standard deviations" like a z-score
  mutate(assets_sc=scale(SESassets)) %>%
  #another variable we may use later
  mutate(form_min_sc=scale(FM_buffer)) %>%
  #so I don't have to remember coding
  mutate(sex,sex_cat=ifelse(sex==1,"Male","Female")) %>%
  mutate(native,native_cat=ifelse(native==1,"Native","Non-native")) 
```

---

Recall the distribution of hair mercury in ppm is skewed.

```r
ggplot(data = mercury, aes(x = exp(lhairHg))) +
  geom_histogram() +
  labs(x = "Mercury (ppm)", y = NULL)
```

---

```r
ggplot(data = mercury, aes(x = lhairHg)) +
  geom_histogram() +
  labs(x = "Mercury (log ppm)", y = NULL)
```

Because the models we will discuss to start prefer data that are approximately normally distributed, we will work with the log of hair mercury concentrations.

---

In our model, we will use assets to *predict* our *response variable*, the log of hair mercury concentrations. The regression model does not require predictors to follow any specific distributions, which is a good thing here!

```r
ggplot(data = mercury, aes(x = assets_sc)) +
  geom_histogram() +
  labs(x = "Assets (standardized)", y = NULL)
```

<img src="w8-l01-intro-model_files/figure-html/assetshist-1.png" width="40%" style="display: block; margin: auto;" />
---

## Models as functions

- We can represent relationships between variables using **functions**
- A function is a mathematical concept: the relationship between an output and one or more inputs
  - Plug in the inputs and receive back the output
  - Example: The formula `$y = 3x + 7$` is a function with input `$x$` and output `$y$`. If `$x$` is `$5$`, `$y$` is `$22$`, `$y = 3 \times 5 + 7 = 22$`

---

## Hair mercury as a function of assets

.panelset[
.panel[.panel-name[Plot]
<img src="w8-l01-intro-model_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(data = mercury, aes(x = assets_sc, y = lhairHg)) +
  geom_point(alpha=.1) +
  geom_smooth(method = "lm") +
  labs(
    title = "Hair mercury as a function of assets",
    subtitle = "Peruvian Amazon",
    x = "Household assets (standardized)",
    y = "Hair mercury (log ppm)"
  )
```
]
]

---

## With no error bars

.panelset[
.panel[.panel-name[Plot]
<img src="w8-l01-intro-model_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(data = mercury, aes(x = assets_sc, y = lhairHg)) +
  geom_point(alpha=.1) +
  geom_smooth(method = "lm", 
*             se = FALSE) +
  labs(
    title = "Hair mercury as a function of assets",
    subtitle = "Peruvian Amazon",
    x = "Household assets (standardized)",
    y = "Hair mercury (log ppm)"
  )
```
]
]
---

## With different cosmetic choices...
.panelset[
.panel[.panel-name[Plot]
<img src="w8-l01-intro-model_files/figure-html/unnamed-chunk-6-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(data = mercury, aes(x = assets_sc, y = lhairHg)) +
  geom_point(alpha=.1) +
  geom_smooth(method = "lm", se = FALSE,
*             color = "#8E2C90", linetype = "dashed", size = 3) +
  labs(
    title = "Hair mercury as a function of assets",
    subtitle = "Peruvian Amazon",
    x = "Household assets (standardized)",
    y = "Hair mercury (log ppm)"
  )
```
]
]

---

## Using a different smoother: GAM

.panelset[
.panel[.panel-name[Plot]
<img src="w8-l01-intro-model_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(data = mercury, aes(x = assets_sc, y = lhairHg)) +
  geom_point(alpha=.1) +
* geom_smooth(method = "gam",
              se = FALSE, color = "#8E2C90") +
  labs(
    title = "Hair mercury as a function of assets",
    subtitle = "Peruvian Amazon",
    x = "Household assets (standardized)",
    y = "Hair mercury (log ppm)"
  )
```
]
]

---

## Vocabulary

- **Response variable:** Variable whose behavior or variation you are trying to understand, on the y-axis

--
- **Explanatory variables:** Other variables that you want to use to explain the variation in the response, on the x-axis

--
- **Predicted value:** Output of the **model function**
  - The model function gives the typical (expected) value of the response variable *conditioning* on the explanatory variables
  
--
- **Residuals:** A measure of how far each case is from its predicted value (based on a particular model)
  - Residual = Observed value - Predicted value
  - Tells how far above/below the expected value each case is

---

## Residuals

.panelset[
.panel[.panel-name[Plot]
<img src="w8-l01-intro-model_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]
.small[

```r
hg_asset_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(lhairHg ~ assets_sc, data = mercury)

hg_asset_fit_tidy <- tidy(hg_asset_fit$fit) 
hg_asset_fit_aug  <- augment(hg_asset_fit$fit) %>%
  mutate(res_cat = ifelse(.resid > 0, TRUE, FALSE))

ggplot(data = hg_asset_fit_aug) +
  geom_point(aes(x = assets_sc, y = lhairHg, color = res_cat)) +
  geom_line(aes(x = assets_sc, y = .fitted), size = 0.75, color = "#8E2C90") + 
  labs(
    title = "Hair mercury by assets",
    subtitle = "Peruvian Amazon",
    x = "Household assets (standardized)",
    y = "Hair mercury (log ppm)"
  ) +
  guides(color = FALSE) +
  scale_color_manual(values = c("#260b27", "#e6b0e7")) +
  geom_text(aes(x = -2, y = 5), label = "Positive residual", color = "#e6b0e7", hjust = 0, size = 8) +
  geom_text(aes(x = 0, y = -5), label = "Negative residual", color = "#260b27", hjust = 0, size = 8)
```
]
]
]

---

## Multiple explanatory variables

.panelset[
.panel[.panel-name[Plot]
.pull-left-narrow[
.question[
How, if at all, does the relationship between hair mercury and household assets of individuals vary by whether or not they live in a town classified as native?
]
]
.pull-right-wide[
<img src="w8-l01-intro-model_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[Code]

```r
ggplot(data = mercury, aes(x = assets_sc, y = lhairHg, color = native_cat)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Hair mercury as a function of assets, by village type",
    subtitle = "Peruvian Amazon",
    x = "Household assets (standardized)",
    y = "Hair mercury (log ppm)",
    color = "NULL"
  ) +
  scale_color_manual(values = c("#E48957", "#071381"))
```
]
]

---

## Extending regression lines

The prior lines were nice in that they did not extrapolate beyond the range of observed data. Extrapolation beyond the range of data observed is quite risky and is generally best avoided, though sometimes the goal is to do so (e.g., weather forecasting, election prediction).

.panelset[
.panel[.panel-name[Plot]
<img src="w8-l01-intro-model_files/figure-html/unnamed-chunk-10-1.png" width="45%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(data = mercury, aes(x = assets_sc, y = lhairHg, color = native_cat)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm",
              se = FALSE,
*             fullrange = TRUE) +
  labs(
    title = "Hair mercury as a function of assets, by village type",
    subtitle = "Peruvian Amazon",
    x = "Household assets (standardized)",
    y = "Hair mercury (log ppm)",
    color = "NULL"
  )  +
  scale_color_manual(values = c("#E48957", "#071381"))
```
]
]

---

## Models - upsides and downsides

- Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data. 
- There is a real risk, however, that a model is imposing structure that is not really there on the scatter of data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted.

---

## Variation around the model...

is just as important as the model, if not more!

*Statistics is the explanation of variation in the context of what remains unexplained.*

- The scatter suggests that there might be other factors that account for large parts of variability in hair mercury levels, or perhaps just that randomness plays a big role.
- Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We'll talk more about this later.)

---

## How do we use models?

- Explanation: Characterize the relationship between `$y$` and `$x$` via *slopes* for numerical explanatory variables or *differences* for categorical explanatory variables
- Prediction: Plug in `$x$`, get the predicted `$y$`