Model Fitting and Interpretation

# Model Fitting and Interpretation
## <br><br> Introduction to Global Health Data Science
### <a href="https://sta198f2021.github.io/website/">Back to Website</a>
### <br> Prof. Amy Herring

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---

# Models with numeric explanatory variables

---

# Data: Artisanal and Small-Scale Gold Mining in the Peruvian Amazon

---

## Mercury and Assets

Are mercury concentrations in hair related to household socioeconomic position, as measured by assets?

```r
library(tidyverse)
library(tidymodels)
library(readr)
mercury <- readr::read_csv("mercury_reg.csv")
mercury <-
  mercury %>%
  # scale() subtracts the mean and divides by the SD to make the units "standard deviations" like a z-score
  mutate(assets_sc=scale(SESassets)) %>%
  #another variable we may use later
  mutate(form_min_sc=scale(FM_buffer)) %>%
  #so I don't have to remember coding
  mutate(sex,sex_cat=ifelse(sex==1,"Male","Female")) %>%
  mutate(native,native_cat=ifelse(native==1,"Native","Non-native")) 
```

---
## Goal: Predict mercury from assets

`$$\widehat{y}_{i} = \widehat{\beta}_0 + \widehat{\beta}_1 \times x_{i}$$`
Here `$y_i$` represents log hair mercury for subject `$i$`, and `$x_i$` represents standardized household income.
<img src="w8-l02-fitting-interpretation_files/figure-html/hg_asset_plot-1.png" width="60%" style="display: block; margin: auto;" />

---

---

## Step 1: Specify model

```r
linear_reg()
```

```
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm
```

---

## Step 2: Set model fitting *engine*

```r
linear_reg() %>%
  set_engine("lm") # lm: linear model
```

```
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm
```

---

## Step 3: Fit model & estimate parameters

... using **formula syntax**

```r
linear_reg() %>%
  set_engine("lm") %>%
  fit(lhairHg ~ assets_sc, data = mercury)
```

```
#> parsnip model object
#> 
#> Fit time:  1ms 
#> 
#> Call:
#> stats::lm(formula = lhairHg ~ assets_sc, data = data)
#> 
#> Coefficients:
#> (Intercept)    assets_sc  
#>      0.2795      -0.3577
```

---

## A closer look at model output

---

## A tidy look at model output

```r
linear_reg() %>%
  set_engine("lm") %>%
  fit(lhairHg ~ assets_sc, data = mercury) %>%
  tidy()
```

```
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    0.280    0.0213      13.1 7.49e-38
#> 2 assets_sc     -0.358    0.0216     -16.5 3.90e-58
```

---

## Slope and intercept

- **Slope:** For each standard deviation increase in assets, the log hair mercury level is expected to be lower, on average, by 0.3577 log ppm.

- perhaps not a very useful statement given the scale

- **Intercept:** Individuals with household assets at the mean level (`assets_sc`=0) are expected to have hair mercury concentrations of 0.2795 log ppm, on average

- Remember `assets_sc` was standardized, so `assets_sc`=0 does not mean "no assets" but instead means "average assets" as it corresponds to an assets z-score of 0

---
class: middle

# Models with log transformation

---

## Working with logs

- Subtraction and logs: `$log(a) − log(b) = log(a / b)$`

--
- Natural logarithm: `$e^{log(x)} = x$`

--
- We can use these identities to "undo" the log transformation

---

## Interpreting the slope

The slope coefficient for the log transformed model is -0.3577, meaning the log mercury difference between people whose household incomes are one SD apart is predicted to be -0.3577 log ppm.

.question[
Using this information, and properties of logs that we just reviewed, fill in the blanks in the following alternate interpretation of the slope:

>For each additional SD the household assets are greater, the hair mercury concentration is expected to be `___` , on average, by a factor of `___`.
]

---

>For each additional increase in scaled assets, hair mercury content is expected to be `___` , on average, by a factor of `___`.

$$ log(\text{hair Hg for assets x+1}) - log(\text{hair Hg for assets x}) = -0.3577 $$

$$ log\left(\frac{\text{hair Hg for assets x+1}}{\text{hair Hg for assets x}}\right) = -0.3577 $$

$$ e^{log\left(\frac{\text{Hg for assets x+1}}{\text{Hg for assets x}}\right)} = e^{-0.3577} $$

$$ \frac{\text{Hg for assets x+1}}{\text{Hg for assets x}} \approx 0.70 $$

For each SD increase in assets, the hair Hg is expected to be lower, on average, by a factor of 0.70.

---

## Correlation does not imply causation

Remember this when interpreting model coefficients

---

# Parameter estimation

---

## Linear model with a single predictor

- We're interested in `$\beta_0$` (population parameter for the intercept) and `$\beta_1$` (population parameter for the slope) in the following model:

`$$y_{i} = \beta_0 + \beta_1~x_{i}+\varepsilon_i$$`
where `$\varepsilon$` represents random error around our mean

--
- Tough luck, you can't have them...

--
- So we use sample statistics to estimate them, using the notation  `$b$` or `$\widehat{\beta}$` to distinguish our estimates from the true parameters `$\beta$`

`$$\widehat{y}_{i} = b_0 + b_1~x_{i}$$` or `$$\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1~x_i$$`

---

## Least squares regression

- The regression line minimizes the sum of squared residuals.

--
- If `$e_i = y_i - \hat{y}_i$`, then, the regression line minimizes 
`$\sum_{i = 1}^n e_i^2$`.

---

## Visualizing residuals

---

## Visualizing residuals (cont.)

---

## Visualizing residuals (cont.)

---

## Properties of least squares regression

- The regression line goes through the center of mass point, the coordinates corresponding to average `$x$` and average `$y$`, `$(\bar{x}, \bar{y})$`:

`$$\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}$$`

--
- The slope has the same sign as the correlation coefficient: `$b_1 = r \frac{s_y}{s_x}$`

--
- The sum of the residuals is zero: `$\sum_{i = 1}^n e_i = 0$`

--
- The residuals and `$x$` values are uncorrelated

---

## Aside: Correlation Coefficient

The *correlation coefficient* describes the linear relationship between `$x$` and `$y$` on the standard deviation scale.

```r
cor(mercury$lhairHg,mercury$assets_sc,use="complete.obs")
```

```
#>           [,1]
#> [1,] -0.326132
```

Here, for each one SD increase in assets, we expect a -0.33 SD increase (so thus a 0.33 decrease) in standard deviations of log hair mercury.

---

# Models with categorical explanatory variables

---

## Categorical predictor with 2 levels

```
#> # A tibble: 2,308 × 3
#>    lhairHg assets_sc[,1] native_cat
#>      <dbl>         <dbl> <chr>     
#>  1  0.676         -0.837 Non-native
#>  2  0.0908         0.197 Non-native
#>  3  1.67          -0.280 Non-native
#>  4  0.450         -0.280 Non-native
#>  5  0.704         -0.280 Non-native
#>  6 -0.512         -0.280 Non-native
#>  7 -0.125          0.826 Non-native
#>  8 -0.103          0.826 Non-native
#>  9  0.351         -1.27  Non-native
#> 10  0.0444        -1.27  Non-native
#> 11  0.662         -0.359 Non-native
#> 12  1.01           1.29  Non-native
#> 13 -0.149          1.29  Non-native
#> 14  0.881          0.517 Non-native
#> 15  1.51           0.517 Non-native
#> 16  0.328         -0.658 Non-native
#> 17  0.0553         0.533 Non-native
#> 18  0.768          1.29  Non-native
#> 19  0.431          1.30  Non-native
#> 20  1.55           1.30  Non-native
#> # … with 2,288 more rows
```
]
]
.pull-right-wide[
- `Native`: community classified as native by Peruvian government
- `Non-native`: community does not have native classification
]

---

## Mercury & native community relationship

```r
linear_reg() %>%
  set_engine("lm") %>%
  fit(lhairHg ~ native_cat, data = mercury) %>%
  tidy()
```

```
#> # A tibble: 2 × 5
#>   term                 estimate std.error statistic   p.value
#>   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)              1.12    0.0416      26.9 4.97e-139
#> 2 native_catNon-native    -1.11    0.0476     -23.2 2.94e-107
```

---

## Mercury and native communities

Define `$\text{Non-native}_i$`=1 if non-native community and 0 otherwise.

`$$\widehat{y_i} = 1.12 - 1.11~\text{Non-native}_i$$`

- **Slope:** Residents of non-native communities, on average, are expected to have Hg concentrations that are -1.11 log ppm lower, on average, than residents of native communities.  Alternatively, the hair Hg is expected to be lower, on average, by a factor of `$e^{-1.11}=0.33$` among residents of non-native communities.
  - Compares baseline level (`Native`) to the other level (`Non-native`)
    
- **Intercept:** Individuals living in native communities (`Non-native`=0) are expected to have hair mercury exposures of 1.12 log ppm.

---

## Relationship between mercury and individual community

```r
linear_reg() %>%
  set_engine("lm") %>%
  fit(lhairHg ~ community, data = mercury) %>%
  tidy()
```

```
#> # A tibble: 23 × 5
#>    term                        estimate std.error statistic  p.value
#>    <chr>                          <dbl>     <dbl>     <dbl>    <dbl>
#>  1 (Intercept)                    0.614    0.0644      9.54 3.55e-21
#>  2 communityBoca Isiriwe          1.20     0.159       7.56 5.77e-14
#>  3 communityBoca Manu             0.833    0.129       6.43 1.56e-10
#>  4 communityCaychihue            -1.10     0.132      -8.36 1.09e-16
#>  5 communityChoque               -0.778    0.117      -6.68 3.04e-11
#>  6 communityDiamante              1.06     0.109       9.67 1.00e-21
#>  7 communityHuepetuhe            -0.694    0.0732     -9.48 5.96e-21
#>  8 communityIsla de los Valles    1.12     0.245       4.57 5.18e- 6
#>  9 communityMasenawa              1.20     0.401       3.00 2.74e- 3
#> 10 communityPalotoa Teparo        0.350    0.148       2.36 1.82e- 2
#> # … with 13 more rows
```

---

## Dummy or indicator variables

- When the categorical explanatory variable has many levels, they're encoded to **dummy** or **indicator variables**
- Each coefficient describes the expected difference between log mercury concentrations in that particular community compared to the baseline community

---

## Categorical predictor with 3+ levels

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> community </th>
   <th style="text-align:center;"> BI </th>
   <th style="text-align:center;"> BM </th>
   <th style="text-align:center;"> Cay </th>
   <th style="text-align:center;"> Cho </th>
   <th style="text-align:center;"> Dia </th>
   <th style="text-align:center;"> Hue </th>
   <th style="text-align:center;"> Isl </th>
   <th style="text-align:left;"> Mas </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Boca Colorado </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:left;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Boca Isiriwe </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:left;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Boca Manu </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:left;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Caychihue </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:left;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Choque </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:left;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Diamante </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:left;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Huepetuhe </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:left;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Isla de los Valles </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td>
   <td style="text-align:left;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Masenawa </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td>
   <td style="text-align:left;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td>
  </tr>
</tbody>
</table>

etc. for the other communities. Note Boca Colorado, the *referent community*, =0 for all vars

---

## Relationship between mercury and community

- **Boca Colorado** hair mercury levels are expected, on average, to be **0.614 log ppm**.
- **Boca Isiriwe** hair mercury levels are expected, on average, to be **1.20 log ppm higher** than *Boca Colorado* levels, at **1.814 log ppm** on average.
- **Boca Manu** hair mercury levels are expected, on average, to be **0.833 log ppm higher** than *Boca Colorado* levels, at **1.447 log ppm** on average.
- **Choque** hair mercury levels are expected, on average, to be **0.778 log ppm lower** than *Boca Colorado* levels, at **-0.164 log ppm** on average.
- etc. It can be nice to have a table showing the estimated mean (along with uncertainty measures -- coming soon) for each community, and a plot can also be useful.
]