class: center, middle, inverse, title-slide

# Residual Analysis and Transformations
## Introduction to Global Health Data Science
### Back to Website
### Prof. Amy Herring

---

layout: true

<div class="my-footer"> <span> <a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a> </span> </div>

---

## Read in Data

```r
library(tidyverse)
library(tidymodels)
library(readr)
library(ggtext)
library(knitr)
library(kableExtra)

mercury <- readr::read_csv("mercury_reg.csv")

mercury <- mercury %>%
  # scale() subtracts the mean and divides by the SD to make the units
  # "standard deviations," like a z-score
  mutate(assets_sc = scale(SESassets)) %>%
  mutate(hairHg = exp(lhairHg))
```

---

class: middle

# Model checking

---

## "Linear" models

- We're fitting a "linear" model, which assumes a linear relationship between our explanatory and response variables.
- But how do we assess this?

---

## Graphical diagnostic: residuals plot (ppm units)

.panelset[
.panel[.panel-name[Plot]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-2-1.png" width="60%" style="display: block; margin: auto;" />

]
.panel[.panel-name[Code]

```r
hg_asset_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(hairHg ~ assets_sc, data = mercury)
*hg_asset_fit_aug <- augment(hg_asset_fit$fit)

ggplot(hg_asset_fit_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "gray", lty = "dashed") +
  labs(x = "Predicted mercury (ppm)", y = "Residuals")
```

]
.panel[.panel-name[Augment]

```r
hg_asset_fit_aug
```

```
#> # A tibble: 2,300 × 9
#>   .rownames hairHg assets_sc[,1] .fitted .resid     .hat .sigma
#>   <chr>      <dbl>         <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
#> 1 1          1.97         -0.837    3.05 -1.08  0.000750   2.77
#> 2 3          1.10          0.197    2.12 -1.03  0.000452   2.77
#> 3 11         5.34         -0.280    2.55  2.79  0.000471   2.77
#> 4 13         1.57         -0.280    2.55 -0.981 0.000471   2.77
#> 5 14         2.02         -0.280    2.55 -0.528 0.000471   2.77
#> 6 15         0.599        -0.280    2.55 -1.95  0.000471   2.77
#> # … with 2,294 more rows, and 2 more variables: .cooksd <dbl>,
#> #   .std.resid <dbl>
```

]
]

---

## More on `augment()`

```r
glimpse(hg_asset_fit_aug)
```

```
#> Rows: 2,300
#> Columns: 9
#> $ .rownames  <chr> "1", "3", "11", "13", "14", "15", "16", "17"…
#> $ hairHg     <dbl> 1.9652, 1.0951, 5.3366, 1.5683, 2.0218, 0.59…
#> $ assets_sc  <dbl[,1]> <matrix[21 x 1]>
#> $ .fitted    <dbl> 3.045165, 2.124332, 2.549503, 2.549503, …
#> $ .resid     <dbl> -1.0799648, -1.0292319, 2.7870974, -0.981202…
#> $ .hat       <dbl> 0.0007495801, 0.0004516563, 0.0004705556, 0.…
#> $ .sigma     <dbl> 2.769771, 2.769779, 2.769252, 2.769787, 2.76…
#> $ .cooksd    <dbl> 5.708620e-05, 3.122263e-05, 2.385430e-04, 2.…
#> $ .std.resid <dbl> -0.3901294, -0.3717471, 1.0066782, -0.354402…
```

---

## Looking for...

- Residuals distributed randomly around 0
- With no visible pattern along the x or y axes

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Fan shapes**
]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Groups of patterns**
]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Residuals correlated with predicted values**
]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Any patterns!**
]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />

---

.question[
What patterns does the residuals plot reveal that should make us question whether a linear model is a good fit for modeling the relationship between mercury (ppm) and assets?
]

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />

---

class: middle

# Exploring linearity

---

## Data: Mercury

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-11-1.png" width="70%" style="display: block; margin: auto;" />

---

## Mercury vs. assets

<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-12-1.png" width="70%" style="display: block; margin: auto;" />

---

## Mercury vs. assets

.question[
Which plot shows a more linear relationship?
]

.small[
.pull-left[
<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Mercury and assets, residuals

.question[
Which plot shows residuals that are uncorrelated with the model's predicted values? Also, what are the units of the residuals?
]

.pull-left[
<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w9-l01-resid-xform_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Transforming the data

- We saw that `hairHg` has a right-skewed distribution, and the residuals of that model don't look great.

--

- In these situations, a transformation applied to the response variable may be useful.

--

- To decide which transformation to use, we should examine the distribution of the response variable.

--

- The extremely right-skewed distribution suggests that a log transformation may be useful.
- log = natural log, `\(\ln\)`
- The default base of the `log` function in R is the natural log: <br>
`log(x, base = exp(1))`

---

## Transformations

- Non-constant variance is one of the most common model violations; however, it is usually fixable by transforming the response (y) variable.

--

- The most common transformation for a right-skewed response variable is the log transform, `\(log(y)\)`.

--

- This transformation is also useful for variance stabilization.

--

- When using a log transformation on the response variable, the interpretation of the slope changes: *"For each unit increase in x, y is expected on average to be higher/lower <br> by a factor of `\(e^{b_1}\)`."*

--

- Another useful transformation is the square root, `\(\sqrt{y}\)`, which is especially useful when the response variable is a count.

---

## Transform, or learn more?

- Data transformations may also be useful when the relationship is non-linear.
- However, in those cases a polynomial regression may be more appropriate.
  + This is beyond the scope of this course, but you're welcome to try it for your final project, and I'd be happy to provide further guidance.

---

## Aside: when `\(y = 0\)`

In some cases the value of the response variable might be 0, and

```r
log(0)
```

```
#> [1] -Inf
```

--

The trick is to add a very small number to the value of the response variable for these cases so that the `log` function can still be applied:

```r
log(0 + 0.00001)
```

```
#> [1] -11.51293
```
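---

## Putting it together

The workflow above can be sketched end to end in base R. This is a minimal illustration using simulated right-skewed data, not the mercury dataset; the variables `x` and `y` below are made up for the example:

```r
# Simulate a right-skewed response (illustrative data, not the mercury study)
set.seed(123)
x <- rnorm(200)
y <- exp(0.5 + 0.3 * x + rnorm(200, sd = 0.4))

# Guard against zeros before taking logs, as in the aside above
y_adj <- ifelse(y == 0, y + 0.00001, y)

# Fit the model on the log scale; log() defaults to the natural log
fit <- lm(log(y_adj) ~ x)
b1 <- coef(fit)[["x"]]

# Back-transform the slope: for each unit increase in x,
# y is expected on average to change by a factor of exp(b1)
exp(b1)
```

Since the simulated slope on the log scale is 0.3, `exp(b1)` should land near `\(e^{0.3} \approx 1.35\)`: each one-unit increase in `x` multiplies the expected `y` by roughly 1.35.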