class: center, middle, inverse, title-slide

# Cross-Validation
## Introduction to Global Health Data Science
### [Back to Website](https://sta198f2021.github.io/website/)
### Prof. Amy Herring

---

layout: true

<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---

class: middle

# Data and exploration

---

## Mercury data

```r
mercury <- readr::read_csv("mercury_reg.csv")

mercury <- mercury %>%
  # scale() subtracts the mean and divides by the SD, so the units are
  # "standard deviations," like a z-score
  mutate(assets_sc = scale(SESassets)) %>%
  # another variable we may use later
  mutate(form_min_sc = scale(FM_buffer)) %>%
  # recode the indicators so I don't have to remember the coding
  mutate(sex_cat = ifelse(sex == 1, "Male", "Female")) %>%
  mutate(native_cat = ifelse(native == 1, "Native", "Non-native"))

# limit to one observation per household (within-household ID = 1)
mercury1 <- mercury %>%
  filter(withinid == 1)

# interaction of assets and native status, used later in a recipe
mercury1$an_int <- mercury1$assets_sc * mercury1$native
```

---

# Modeling

---

## Train / test

- Create an initial split

```r
mercury_split <- initial_split(mercury1) # prop = 3/4 by default
```

--

.pull-left[
- Save training data

```r
mercury_train <- training(mercury_split)
dim(mercury_train)
```

```
#> [1] 844  23
```
]

--

.pull-right[
- Save testing data

```r
mercury_test <- testing(mercury_split)
dim(mercury_test)
```

```
#> [1] 282  23
```
]

---

## Specify model

```r
mercury_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(lhairHg ~ assets_sc * native_cat + form_min_sc + sex_cat + age + urban,
      data = mercury_train)

mp <- tidy(mercury_fit)
print(mp, n = 10)
```

```
#> # A tibble: 8 × 5
#>   term                      estimate std.error statistic p.value
#>   <chr>                        <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)                 0.509    0.177       2.87  4.23e-3
#> 2 assets_sc                  -0.331    0.112      -2.96  3.25e-3
#> 3 native_catNon-native       -0.789    0.199      -3.95  8.81e-5
#> 4 form_min_sc                 0.0227   0.0514      0.442 6.59e-1
#> 5 sex_catMale                -0.0855   0.0950     -0.901 3.68e-1
#> 6 age                         0.0120   0.00283     4.24  2.65e-5
#> 7 urban                      -0.0446   0.126      -0.355 7.22e-1
#> 8 assets_sc:native_catNon-…   0.253    0.129       1.97  4.93e-2
```

---

class: middle

# Evaluate model

---

## Make predictions for training data

```r
mercury_train_pred <- predict(mercury_fit, mercury_train) %>%
  bind_cols(mercury_train %>% select(lhairHg))

mercury_train_pred
```

```
#> # A tibble: 844 × 2
#>     .pred lhairHg
#>     <dbl>   <dbl>
#> 1  0.331   -0.897
#> 2  0.183   -0.268
#> 3 -0.158   NA
#> 4  0.0895  NA
#> 5 -0.0276   0.510
#> 6 -0.0288  NA
#> # … with 838 more rows
```

---

## R-squared

Percentage of variability in hair mercury explained by the model

```r
rsq(mercury_train_pred, truth = lhairHg, estimate = .pred)
```

```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rsq     standard       0.209
```

--

.question[
Are models with high or low `\(R^2\)` preferable?
]

---

## RMSE

An alternative model performance statistic: **root mean square error**

$$ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} $$

--

```r
rmse(mercury_train_pred, truth = lhairHg, estimate = .pred)
```

```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        1.00
```

--

.question[
Are models with high or low RMSE preferable?
]
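---

## RMSE by hand

To connect the formula to the computation, here is a minimal sketch that computes RMSE directly from the training predictions; it should match `yardstick::rmse()` above (rows with missing `lhairHg` are dropped first, mirroring the default `na_rm = TRUE`).

```r
# a minimal sketch: RMSE computed directly from the formula above,
# after dropping rows where the outcome is missing
mercury_train_pred %>%
  filter(!is.na(lhairHg)) %>%
  summarise(rmse = sqrt(mean((lhairHg - .pred)^2)))
```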
---

## Interpreting RMSE

.question[
Is this RMSE considered low or high?
]

```r
rmse(mercury_train_pred, truth = lhairHg, estimate = .pred)
```

```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        1.00
```

```r
mercury_train %>%
  summarise(min = min(lhairHg, na.rm = TRUE),
            max = max(lhairHg, na.rm = TRUE))
```

```
#> # A tibble: 1 × 2
#>     min   max
#>   <dbl> <dbl>
#> 1 -6.91  3.02
```

---

class: middle

.hand[
.light-blue[
but, really, who cares about predictions on .pink[training] data?
]
]

---

## Make predictions for testing data

```r
mercury_test_pred <- predict(mercury_fit, mercury_test) %>%
  bind_cols(mercury_test %>% select(lhairHg))

mercury_test_pred
```

```
#> # A tibble: 282 × 2
#>     .pred lhairHg
#>     <dbl>   <dbl>
#> 1  0.140   NA
#> 2 -0.111   NA
#> 3  0.201    1.51
#> 4 -0.0172   0.768
#> 5 NA       NA
#> 6 NA       NA
#> # … with 276 more rows
```

---

## Evaluate performance on testing data

- RMSE of model fit to testing data

```r
rmse(mercury_test_pred, truth = lhairHg, estimate = .pred)
```

```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        1.01
```

- `\(R^2\)` of model fit to testing data

```r
rsq(mercury_test_pred, truth = lhairHg, estimate = .pred)
```

```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rsq     standard       0.112
```

---

## Training vs. testing

<br>

|metric    | train|  test|comparison                    |
|:---------|-----:|-----:|:-----------------------------|
|RMSE      | 1.005| 1.012|RMSE lower for training       |
|R-squared | 0.209| 0.112|R-squared higher for training |

---

## Evaluating performance on training data

- The training set does not have the capacity to be a good arbiter of performance.

--

- It is not an independent piece of information; predicting the training set can only reflect what the model already knows.

--

- Suppose you give a class a test, then give them the answers, then administer the same test again. The students' scores on the second test do not accurately reflect what they know about the subject; these scores would probably be higher than their results on the first test.

.footnote[
.small[
Source: [tidymodels.org](https://www.tidymodels.org/start/resampling/)
]
]
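---

## Aside: computing both metrics at once

One way the train/test comparison table could be assembled is with `yardstick::metric_set()`, which bundles metrics into a single function. A sketch (`mercury_metrics` is a name introduced here, not from the original code):

```r
# a sketch: bundle rmse and rsq, then apply the bundle to the training
# and testing predictions (mercury_metrics is a made-up name)
mercury_metrics <- metric_set(rmse, rsq)

bind_rows(
  train = mercury_metrics(mercury_train_pred, truth = lhairHg, estimate = .pred),
  test  = mercury_metrics(mercury_test_pred,  truth = lhairHg, estimate = .pred),
  .id = "data"
)
```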
---

class: middle

# Cross validation

---

## Cross validation

More specifically, **k-fold cross validation**:

- Shuffle your data into `\(k\)` partitions
- Use 1 partition for validation, and the remaining `\(k-1\)` partitions for training
- Repeat `\(k\)` times
- Note: our R function calls this **v-fold** cross-validation

---

## Cross validation

<img src="img/cross-validation.png" width="100%" style="display: block; margin: auto;" />

---

## Split data into folds

.pull-left[
```r
set.seed(345)

folds <- vfold_cv(mercury_train, v = 5)
folds
```

```
#> #  5-fold cross-validation 
#> # A tibble: 5 × 2
#>   splits            id   
#>   <list>            <chr>
#> 1 <split [675/169]> Fold1
#> 2 <split [675/169]> Fold2
#> 3 <split [675/169]> Fold3
#> 4 <split [675/169]> Fold4
#> 5 <split [676/168]> Fold5
```
]

.pull-right[
<img src="img/cross-validation.png" width="100%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Fit resamples

.pull-left[
```r
mercury_mod <- linear_reg() %>%
  set_engine("lm")

mercury_rec <- recipe(lhairHg ~ assets_sc + native_cat + form_min_sc +
                        sex_cat + age + urban + an_int,
                      data = mercury_train)

mercury_wflow <- workflow() %>%
  add_model(mercury_mod) %>%
  add_recipe(mercury_rec)

mercury_fit_rs <- mercury_wflow %>%
  fit_resamples(folds)

mercury_fit_rs
```

```
#> # Resampling results
#> # 5-fold cross-validation 
#> # A tibble: 5 × 4
#>   splits            id    .metrics         .notes          
#>   <list>            <chr> <list>           <list>          
#> 1 <split [675/169]> Fold1 <tibble [2 × 4]> <tibble [0 × 1]>
#> 2 <split [675/169]> Fold2 <tibble [2 × 4]> <tibble [0 × 1]>
#> 3 <split [675/169]> Fold3 <tibble [2 × 4]> <tibble [0 × 1]>
#> 4 <split [675/169]> Fold4 <tibble [2 × 4]> <tibble [0 × 1]>
#> 5 <split [676/168]> Fold5 <tibble [2 × 4]> <tibble [0 × 1]>
```
]

.pull-right[
<img src="img/cross-validation-animated.gif" width="100%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Collect CV metrics

```r
collect_metrics(mercury_fit_rs)
```

```
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   1.01      5  0.0670 Preprocessor1_Model1
#> 2 rsq     standard   0.196     5  0.0374 Preprocessor1_Model1
```

---

## Deeper look into CV metrics

.panelset[
.panel[.panel-name[Raw]

```r
collect_metrics(mercury_fit_rs, summarize = FALSE) %>%
  print(n = 10)
```

```
#> # A tibble: 10 × 5
#>    id    .metric .estimator .estimate .config             
#>    <chr> <chr>   <chr>          <dbl> <chr>               
#>  1 Fold1 rmse    standard       1.00  Preprocessor1_Model1
#>  2 Fold1 rsq     standard       0.198 Preprocessor1_Model1
#>  3 Fold2 rmse    standard       1.05  Preprocessor1_Model1
#>  4 Fold2 rsq     standard       0.117 Preprocessor1_Model1
#>  5 Fold3 rmse    standard       1.24  Preprocessor1_Model1
#>  6 Fold3 rsq     standard       0.200 Preprocessor1_Model1
#>  7 Fold4 rmse    standard       0.831 Preprocessor1_Model1
#>  8 Fold4 rsq     standard       0.330 Preprocessor1_Model1
#>  9 Fold5 rmse    standard       0.943 Preprocessor1_Model1
#> 10 Fold5 rsq     standard       0.136 Preprocessor1_Model1
```
]

.panel[.panel-name[Tidy]

|Fold  |  RMSE| R-squared|
|:-----|-----:|---------:|
|Fold1 | 1.000|     0.198|
|Fold2 | 1.046|     0.117|
|Fold3 | 1.238|     0.200|
|Fold4 | 0.831|     0.330|
|Fold5 | 0.943|     0.136|
]
]
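---

## Tidying CV metrics

The Tidy panel above can be built from the raw per-fold metrics; a minimal sketch, assuming `tidyr` is loaded (it comes with the tidyverse):

```r
# a sketch: reshape the long per-fold metrics into one row per fold
collect_metrics(mercury_fit_rs, summarize = FALSE) %>%
  select(id, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)
```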
---

## How does RMSE compare to `\(y\)`?

- Cross validation RMSE stats

```
#> # A tibble: 1 × 6
#>     min   max  mean   med    sd   IQR
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.831  1.24  1.01  1.00 0.150 0.103
```

- Training data mercury stats

```
#> # A tibble: 1 × 6
#>     min   max  mean   med    sd   IQR
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -6.91  3.02 0.202 0.166  1.13  1.55
```

---

## Compare to simpler model

Let's check how accurate this model's predictions are (RMSE = 1.01, `\(R^2 = 0.196\)`) compared to predictions from a simpler model with only age, assets, and native community status.

```r
mercury_rec_2 <- recipe(lhairHg ~ assets_sc + native_cat + age,
                        data = mercury_train)

mercury_wflow_2 <- workflow() %>%
  add_model(mercury_mod) %>%
  add_recipe(mercury_rec_2)

mercury_fit_rs_2 <- mercury_wflow_2 %>%
  fit_resamples(folds)

collect_metrics(mercury_fit_rs_2)
```

```
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   1.00      5  0.0645 Preprocessor1_Model1
#> 2 rsq     standard   0.205     5  0.0363 Preprocessor1_Model1
```

---

## What's next?

<img src="img/post-cv-testing.png" width="100%" style="display: block; margin: auto 0 auto auto;" />

---

## Picking a model

When choosing between these two models, we see essentially the same `\(R^2\)` and RMSE, but one model is much simpler than the other. It makes sense to use the simpler model to facilitate interpretation. Let's practice interpreting results from this model, now fit to the full data!

```r
final_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(lhairHg ~ assets_sc + native_cat + age,
      data = mercury1)

final_fit
tidy(final_fit, conf.int = TRUE)
```

---

```
#> parsnip model object
#> 
#> Fit time:  1ms 
#> 
#> Call:
#> stats::lm(formula = lhairHg ~ assets_sc + native_cat + age, data = data)
#> 
#> Coefficients:
#>          (Intercept)             assets_sc  
#>             0.718469             -0.108647  
#> native_catNon-native                   age  
#>            -0.941823              0.009656  
```

```
#> # A tibble: 4 × 7
#>   term    estimate std.error statistic  p.value conf.low conf.high
#>   <chr>      <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Inte…   0.718     0.120        5.98 3.65e- 9  0.482      0.954 
#> 2 asset…  -0.109     0.0468      -2.32 2.06e- 2 -0.201     -0.0167
#> 3 nativ…  -0.942     0.114       -8.30 5.64e-16 -1.16      -0.719 
#> 4 age      0.00966   0.00240      4.03 6.19e- 5  0.00495    0.0144
```
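---

## Interpreting on the original scale

If `lhairHg` is hair mercury on the log scale (an assumption based on the variable name), the coefficients back-transform to multiplicative effects via `exp()`; a quick sketch:

```r
# assuming lhairHg = log(hair mercury), exponentiated coefficients are
# multiplicative effects on the original concentration scale
exp(-0.942)  # ~0.39: non-native residents' hair Hg is about 39% of that
             # of comparable native residents
exp(0.00966) # ~1.01: roughly a 1% increase per additional year of age
```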