class: center, middle, inverse, title-slide

# Residual Analysis and Transformations
## Introduction to Global Health Data Science
### Back to Website
### Prof. Amy Herring

---

layout: true

<div class="my-footer"> <span> <a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a> </span> </div>

---

## Read in Data

```r
library(tidyverse)
library(tidymodels)
library(readr)
library(ggtext)
library(knitr)
library(kableExtra)

mercury <- readr::read_csv("mercury_reg.csv")

mercury <- mercury %>%
  # scale() subtracts the mean and divides by the SD to make the units
  # "standard deviations," like a z-score
  mutate(assets_sc = scale(SESassets)) %>%
  mutate(hairHg = exp(lhairHg))
```

---

class: middle

# Model checking

---

## "Linear" models

- We're fitting a "linear" model, which assumes a linear relationship between our explanatory and response variables.
- But how do we assess this?

---

## Graphical diagnostic: residuals plot (ppm units)

.panelset[
.panel[.panel-name[Plot]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-2-1.png" width="60%" style="display: block; margin: auto;" />

]
.panel[.panel-name[Code]

```r
hg_asset_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(hairHg ~ assets_sc, data = mercury)
*hg_asset_fit_aug <- augment(hg_asset_fit$fit)

ggplot(hg_asset_fit_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "gray", lty = "dashed") +
  labs(x = "Predicted mercury (ppm)", y = "Residuals")
```

]
.panel[.panel-name[Augment]

```r
hg_asset_fit_aug
```

```
#> # A tibble: 2,300 × 9
#>   .rownames hairHg assets_sc[,1] .fitted .resid     .hat .sigma
#>   <chr>      <dbl>         <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
#> 1 1          1.97         -0.837    3.05 -1.08  0.000750   2.77
#> 2 3          1.10          0.197    2.12 -1.03  0.000452   2.77
#> 3 11         5.34         -0.280    2.55  2.79  0.000471   2.77
#> 4 13         1.57         -0.280    2.55 -0.981 0.000471   2.77
#> 5 14         2.02         -0.280    2.55 -0.528 0.000471   2.77
#> 6 15         0.599        -0.280    2.55 -1.95  0.000471   2.77
#> # … with 2,294 more rows, and 2 more variables: .cooksd <dbl>,
#> #   .std.resid <dbl>
```

]
]

---

## More on `augment()`

```r
glimpse(hg_asset_fit_aug)
```

```
#> Rows: 2,300
#> Columns: 9
#> $ .rownames  <chr> "1", "3", "11", "13", "14", "15", "16", "17"…
#> $ hairHg     <dbl> 1.9652, 1.0951, 5.3366, 1.5683, 2.0218, 0.59…
#> $ assets_sc  <dbl[,1]> <matrix[21 x 1]>
#> $ .fitted    <dbl> 3.045165, 2.124332, 2.549503, 2.549503, …
#> $ .resid     <dbl> -1.0799648, -1.0292319, 2.7870974, -0.981202…
#> $ .hat       <dbl> 0.0007495801, 0.0004516563, 0.0004705556, 0.…
#> $ .sigma     <dbl> 2.769771, 2.769779, 2.769252, 2.769787, 2.76…
#> $ .cooksd    <dbl> 5.708620e-05, 3.122263e-05, 2.385430e-04, 2.…
#> $ .std.resid <dbl> -0.3901294, -0.3717471, 1.0066782, -0.354402…
```

---

## Looking for...

- Residuals distributed randomly around 0
- With no visible pattern along the x or y axes

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Fan shapes**
]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Groups of patterns**
]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Residuals correlated with predicted values**
]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Any patterns!**
]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />

---

.question[
What patterns does the residuals plot reveal that should make us question whether a linear model is a good fit for modeling the relationship between mercury (ppm) and assets?
]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />

---

class: middle

# Exploring linearity

---

## Data: Mercury

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-11-1.png" width="70%" style="display: block; margin: auto;" />

---

## Mercury vs. assets

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-12-1.png" width="70%" style="display: block; margin: auto;" />

---

## Mercury vs. assets

.question[
Which plot shows a more linear relationship?
]

.small[
.pull-left[
<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Mercury and assets, residuals

.question[
Which plot shows residuals that are uncorrelated with the model's predicted values? Also, what are the units of the residuals?
]

.pull-left[
<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Transforming the data

- We saw that `hairHg` has a right-skewed distribution, and the residuals of that model don't look great.

--

- In these situations, a transformation applied to the response variable may be useful.

--

- To decide which transformation to use, we should examine the distribution of the response variable.

--

- The extremely right-skewed distribution suggests that a log transformation may be useful.
- log = natural log, `\(\ln\)`
- The default base of the `log` function in R is the natural log: <br>
`log(x, base = exp(1))`

---

## Transformations

- Non-constant variance is one of the most common model violations; however, it is usually fixable by transforming the response (y) variable.

--

- The most common transformation for a right-skewed response variable is the log transform, `\(log(y)\)`.

--

- This transformation is also useful for variance stabilization.

--

- When using a log transformation on the response variable, the interpretation of the slope changes: *"For each unit increase in x, y is expected on average to be higher/lower <br> by a factor of `\(e^{b_1}\)`."*

--

- Another useful transformation is the square root, `\(\sqrt{y}\)`, which is especially useful when the response variable is a count.

---

## Transform, or learn more?

- Data transformations may also be useful when the relationship is non-linear.
- However, in those cases a polynomial regression may be more appropriate.
  + This is beyond the scope of this course, but you're welcome to try it for your final project, and I'd be happy to provide further guidance.

---

## Aside: when `\(y = 0\)`

In some cases the value of the response variable might be 0, and

```r
log(0)
```

```
#> [1] -Inf
```

--

The trick is to add a very small number to the value of the response variable for these cases so that the `log` function can still be applied:

```r
log(0 + 0.00001)
```

```
#> [1] -11.51293
```
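---

## Putting it together

The workflow above can be sketched end to end in base R. This is a minimal illustration using simulated right-skewed data, not the mercury dataset; the variables `x` and `y` below are made up for the example:

```r
# Simulate a right-skewed response (illustrative data, not the mercury study)
set.seed(123)
x <- rnorm(200)
y <- exp(0.5 + 0.3 * x + rnorm(200, sd = 0.4))

# Guard against zeros before taking logs, as in the aside above
y_adj <- ifelse(y == 0, y + 0.00001, y)

# Fit the model on the log scale; log() defaults to the natural log
fit <- lm(log(y_adj) ~ x)
b1 <- coef(fit)[["x"]]

# Back-transform the slope: for each unit increase in x,
# y is expected on average to change by a factor of exp(b1)
exp(b1)
```

Since the simulated slope on the log scale is 0.3, `exp(b1)` should land near `\(e^{0.3} \approx 1.35\)`: each one-unit increase in `x` multiplies the expected `y` by roughly 1.35.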