class: center, middle, inverse, title-slide

# Cross-Validation
## Introduction to Global Health Data Science
### [Back to Website](https://sta198f2021.github.io/website/)
### Prof. Amy Herring

---

layout: true

<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---

class: middle

# Data and exploration

---

## Mercury data

```r
mercury <- readr::read_csv("mercury_reg.csv")

mercury <- mercury %>%
  # scale() subtracts the mean and divides by the SD, so the units are
  # "standard deviations," like a z-score
  mutate(assets_sc = scale(SESassets)) %>%
  # another variable we may use later
  mutate(form_min_sc = scale(FM_buffer)) %>%
  # recode the indicators so I don't have to remember the coding
  mutate(sex_cat = ifelse(sex == 1, "Male", "Female")) %>%
  mutate(native_cat = ifelse(native == 1, "Native", "Non-native"))

# limit to one observation per household (within-household ID = 1)
mercury1 <- mercury %>%
  filter(withinid == 1)

# interaction of assets and native status, used later in a recipe
mercury1$an_int <- mercury1$assets_sc * mercury1$native
```

---

# Modeling

---

## Train / test

- Create an initial split

```r
mercury_split <- initial_split(mercury1) # prop = 3/4 by default
```

--

.pull-left[
- Save training data

```r
mercury_train <- training(mercury_split)
dim(mercury_train)
```

```
#> [1] 844  23
```
]

--

.pull-right[
- Save testing data

```r
mercury_test <- testing(mercury_split)
dim(mercury_test)
```

```
#> [1] 282  23
```
]

---

## Specify model

```r
mercury_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(lhairHg ~ assets_sc * native_cat + form_min_sc + sex_cat + age + urban,
      data = mercury_train)

mp <- tidy(mercury_fit)
print(mp, n = 10)
```

```
#> # A tibble: 8 × 5
#>   term                      estimate std.error statistic p.value
#>   <chr>                        <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)                 0.509    0.177       2.87  4.23e-3
#> 2 assets_sc                  -0.331    0.112      -2.96  3.25e-3
#> 3 native_catNon-native       -0.789    0.199      -3.95  8.81e-5
#> 4 form_min_sc                 0.0227   0.0514      0.442 6.59e-1
#> 5 sex_catMale                -0.0855   0.0950     -0.901 3.68e-1
#> 6 age                         0.0120   0.00283     4.24  2.65e-5
#> 7 urban                      -0.0446   0.126      -0.355 7.22e-1
#> 8 assets_sc:native_catNon-…   0.253    0.129       1.97  4.93e-2
```

---

class: middle

# Evaluate model

---

## Make predictions for training data

```r
mercury_train_pred <- predict(mercury_fit, mercury_train) %>%
  bind_cols(mercury_train %>% select(lhairHg))

mercury_train_pred
```

```
#> # A tibble: 844 × 2
#>     .pred lhairHg
#>     <dbl>   <dbl>
#> 1  0.331   -0.897
#> 2  0.183   -0.268
#> 3 -0.158   NA
#> 4  0.0895  NA
#> 5 -0.0276   0.510
#> 6 -0.0288  NA
#> # … with 838 more rows
```

---

## R-squared

Percentage of variability in hair mercury explained by the model

```r
rsq(mercury_train_pred, truth = lhairHg, estimate = .pred)
```

```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rsq     standard       0.209
```

--

.question[
Are models with high or low `\(R^2\)` preferable?
]

---

## RMSE

An alternative model performance statistic: **root mean square error**

$$ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} $$

--

```r
rmse(mercury_train_pred, truth = lhairHg, estimate = .pred)
```

```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        1.00
```

--

.question[
Are models with high or low RMSE preferable?
]
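---

## RMSE by hand

To connect the formula to the computation, here is a minimal sketch that computes RMSE directly from the training predictions; it should match `yardstick::rmse()` above (rows with missing `lhairHg` are dropped first, mirroring the default `na_rm = TRUE`).

```r
# a minimal sketch: RMSE computed directly from the formula above,
# after dropping rows where the outcome is missing
mercury_train_pred %>%
  filter(!is.na(lhairHg)) %>%
  summarise(rmse = sqrt(mean((lhairHg - .pred)^2)))
```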
---

## Interpreting RMSE

.question[
Is this RMSE considered low or high?
]

```r
rmse(mercury_train_pred, truth = lhairHg, estimate = .pred)
```

```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        1.00
```

```r
mercury_train %>%
  summarise(min = min(lhairHg, na.rm = TRUE),
            max = max(lhairHg, na.rm = TRUE))
```

```
#> # A tibble: 1 × 2
#>     min   max
#>   <dbl> <dbl>
#> 1 -6.91  3.02
```

---

class: middle

.hand[
.light-blue[
but, really, who cares about predictions on .pink[training] data?
]
]

---

## Make predictions for testing data

```r
mercury_test_pred <- predict(mercury_fit, mercury_test) %>%
  bind_cols(mercury_test %>% select(lhairHg))

mercury_test_pred
```

```
#> # A tibble: 282 × 2
#>     .pred lhairHg
#>     <dbl>   <dbl>
#> 1  0.140   NA
#> 2 -0.111   NA
#> 3  0.201    1.51
#> 4 -0.0172   0.768
#> 5 NA       NA
#> 6 NA       NA
#> # … with 276 more rows
```

---

## Evaluate performance on testing data

- RMSE of model fit to testing data

```r
rmse(mercury_test_pred, truth = lhairHg, estimate = .pred)
```

```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        1.01
```

- `\(R^2\)` of model fit to testing data

```r
rsq(mercury_test_pred, truth = lhairHg, estimate = .pred)
```

```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rsq     standard       0.112
```

---

## Training vs. testing

<br>

|metric    | train|  test|comparison                    |
|:---------|-----:|-----:|:-----------------------------|
|RMSE      | 1.005| 1.012|RMSE lower for training       |
|R-squared | 0.209| 0.112|R-squared higher for training |

---

## Evaluating performance on training data

- The training set does not have the capacity to be a good arbiter of performance.

--

- It is not an independent piece of information; predicting the training set can only reflect what the model already knows.

--

- Suppose you give a class a test, then give them the answers, then administer the same test again. The students' scores on the second test do not accurately reflect what they know about the subject; these scores would probably be higher than their results on the first test.

.footnote[
.small[
Source: [tidymodels.org](https://www.tidymodels.org/start/resampling/)
]
]
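---

## Aside: computing both metrics at once

One way the train/test comparison table could be assembled is with `yardstick::metric_set()`, which bundles metrics into a single function. A sketch (`mercury_metrics` is a name introduced here, not from the original code):

```r
# a sketch: bundle rmse and rsq, then apply the bundle to the training
# and testing predictions (mercury_metrics is a made-up name)
mercury_metrics <- metric_set(rmse, rsq)

bind_rows(
  train = mercury_metrics(mercury_train_pred, truth = lhairHg, estimate = .pred),
  test  = mercury_metrics(mercury_test_pred,  truth = lhairHg, estimate = .pred),
  .id = "data"
)
```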
---

class: middle

# Cross validation

---

## Cross validation

More specifically, **k-fold cross validation**:

- Shuffle your data into `\(k\)` partitions
- Use 1 partition for validation, and the remaining `\(k-1\)` partitions for training
- Repeat `\(k\)` times
- Note: our R function calls this **v-fold** cross-validation

---

## Cross validation

<img src="img/cross-validation.png" width="100%" style="display: block; margin: auto;" />

---

## Split data into folds

.pull-left[
```r
set.seed(345)

folds <- vfold_cv(mercury_train, v = 5)
folds
```

```
#> #  5-fold cross-validation 
#> # A tibble: 5 × 2
#>   splits            id   
#>   <list>            <chr>
#> 1 <split [675/169]> Fold1
#> 2 <split [675/169]> Fold2
#> 3 <split [675/169]> Fold3
#> 4 <split [675/169]> Fold4
#> 5 <split [676/168]> Fold5
```
]

.pull-right[
<img src="img/cross-validation.png" width="100%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Fit resamples

.pull-left[
```r
mercury_mod <- linear_reg() %>%
  set_engine("lm")

mercury_rec <- recipe(lhairHg ~ assets_sc + native_cat + form_min_sc +
                        sex_cat + age + urban + an_int,
                      data = mercury_train)

mercury_wflow <- workflow() %>%
  add_model(mercury_mod) %>%
  add_recipe(mercury_rec)

mercury_fit_rs <- mercury_wflow %>%
  fit_resamples(folds)

mercury_fit_rs
```

```
#> # Resampling results
#> # 5-fold cross-validation 
#> # A tibble: 5 × 4
#>   splits            id    .metrics         .notes          
#>   <list>            <chr> <list>           <list>          
#> 1 <split [675/169]> Fold1 <tibble [2 × 4]> <tibble [0 × 1]>
#> 2 <split [675/169]> Fold2 <tibble [2 × 4]> <tibble [0 × 1]>
#> 3 <split [675/169]> Fold3 <tibble [2 × 4]> <tibble [0 × 1]>
#> 4 <split [675/169]> Fold4 <tibble [2 × 4]> <tibble [0 × 1]>
#> 5 <split [676/168]> Fold5 <tibble [2 × 4]> <tibble [0 × 1]>
```
]

.pull-right[
<img src="img/cross-validation-animated.gif" width="100%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Collect CV metrics

```r
collect_metrics(mercury_fit_rs)
```

```
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   1.01      5  0.0670 Preprocessor1_Model1
#> 2 rsq     standard   0.196     5  0.0374 Preprocessor1_Model1
```

---

## Deeper look into CV metrics

.panelset[
.panel[.panel-name[Raw]

```r
collect_metrics(mercury_fit_rs, summarize = FALSE) %>%
  print(n = 10)
```

```
#> # A tibble: 10 × 5
#>    id    .metric .estimator .estimate .config             
#>    <chr> <chr>   <chr>          <dbl> <chr>               
#>  1 Fold1 rmse    standard       1.00  Preprocessor1_Model1
#>  2 Fold1 rsq     standard       0.198 Preprocessor1_Model1
#>  3 Fold2 rmse    standard       1.05  Preprocessor1_Model1
#>  4 Fold2 rsq     standard       0.117 Preprocessor1_Model1
#>  5 Fold3 rmse    standard       1.24  Preprocessor1_Model1
#>  6 Fold3 rsq     standard       0.200 Preprocessor1_Model1
#>  7 Fold4 rmse    standard       0.831 Preprocessor1_Model1
#>  8 Fold4 rsq     standard       0.330 Preprocessor1_Model1
#>  9 Fold5 rmse    standard       0.943 Preprocessor1_Model1
#> 10 Fold5 rsq     standard       0.136 Preprocessor1_Model1
```
]

.panel[.panel-name[Tidy]

|Fold  |  RMSE| R-squared|
|:-----|-----:|---------:|
|Fold1 | 1.000|     0.198|
|Fold2 | 1.046|     0.117|
|Fold3 | 1.238|     0.200|
|Fold4 | 0.831|     0.330|
|Fold5 | 0.943|     0.136|
]
]
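---

## Tidying CV metrics

The Tidy panel above can be built from the raw per-fold metrics; a minimal sketch, assuming `tidyr` is loaded (it comes with the tidyverse):

```r
# a sketch: reshape the long per-fold metrics into one row per fold
collect_metrics(mercury_fit_rs, summarize = FALSE) %>%
  select(id, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)
```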
---

## How does RMSE compare to `\(y\)`?

- Cross validation RMSE stats

```
#> # A tibble: 1 × 6
#>     min   max  mean   med    sd   IQR
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.831  1.24  1.01  1.00 0.150 0.103
```

- Training data mercury stats

```
#> # A tibble: 1 × 6
#>     min   max  mean   med    sd   IQR
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -6.91  3.02 0.202 0.166  1.13  1.55
```

---

## Compare to simpler model

Let's check how accurate this model's predictions are (RMSE = 1.01, `\(R^2 = 0.196\)`) compared to predictions from a simpler model with only age, assets, and native community status.

```r
mercury_rec_2 <- recipe(lhairHg ~ assets_sc + native_cat + age,
                        data = mercury_train)

mercury_wflow_2 <- workflow() %>%
  add_model(mercury_mod) %>%
  add_recipe(mercury_rec_2)

mercury_fit_rs_2 <- mercury_wflow_2 %>%
  fit_resamples(folds)

collect_metrics(mercury_fit_rs_2)
```

```
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   1.00      5  0.0645 Preprocessor1_Model1
#> 2 rsq     standard   0.205     5  0.0363 Preprocessor1_Model1
```

---

## What's next?

<img src="img/post-cv-testing.png" width="100%" style="display: block; margin: auto 0 auto auto;" />

---

## Picking a model

When choosing between these two models, we see essentially the same `\(R^2\)` and RMSE, but one model is much simpler than the other. It makes sense to use the simpler model to facilitate interpretation. Let's practice interpreting results from this model, now fit to the full data!

```r
final_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(lhairHg ~ assets_sc + native_cat + age,
      data = mercury1)

final_fit
tidy(final_fit, conf.int = TRUE)
```

---

```
#> parsnip model object
#> 
#> Fit time:  1ms 
#> 
#> Call:
#> stats::lm(formula = lhairHg ~ assets_sc + native_cat + age, data = data)
#> 
#> Coefficients:
#>          (Intercept)             assets_sc  
#>             0.718469             -0.108647  
#> native_catNon-native                   age  
#>            -0.941823              0.009656  
```

```
#> # A tibble: 4 × 7
#>   term    estimate std.error statistic  p.value conf.low conf.high
#>   <chr>      <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Inte…   0.718     0.120        5.98 3.65e- 9  0.482      0.954 
#> 2 asset…  -0.109     0.0468      -2.32 2.06e- 2 -0.201     -0.0167
#> 3 nativ…  -0.942     0.114       -8.30 5.64e-16 -1.16      -0.719 
#> 4 age      0.00966   0.00240      4.03 6.19e- 5  0.00495    0.0144
```
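---

## Interpreting on the original scale

If `lhairHg` is hair mercury on the log scale (an assumption based on the variable name), the coefficients back-transform to multiplicative effects via `exp()`; a quick sketch:

```r
# assuming lhairHg = log(hair mercury), exponentiated coefficients are
# multiplicative effects on the original concentration scale
exp(-0.942)  # ~0.39: non-native residents' hair Hg is about 39% of that
             # of comparable native residents
exp(0.00966) # ~1.01: roughly a 1% increase per additional year of age
```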