Practice assessing predictive ability of models
Practice selecting among models using cross-validation
Practice exploring model performance in a completely new data set
In this exercise, we consider predicting infant birth weight BWTG
as a function of several important predictors, including gestational age at delivery GEST
(babies tend to gain weight each week of gestation), infant gender SEX
(boys tend to weigh more on average at the same gestational age), maternal smoking status SMOKE
(smoking has been linked with lower birth weight), plurality PLURFAC
(singleton births tend to be heavier than twins, PLURFAC
=2, or triplets, PLURFAC
=3), and whether the mother has given birth previously MULTIP
(women who have given prior birth, also called multiparous women, tend to have heavier babies). Our data are a random sample of Triangle-area births over the last several years and are contained in the file areabirths.Rmd
.
The relationship between gestational week and birth weight is not linear. For example, a typical fetus may weigh 150g at the end of 15 weeks of pregnancy, but in the latter weeks of pregnancy, the fetus may be gaining 150g or more per week. Thus polynomials like week\(^2\), week\(^3\), etc. are often used to model birth weight, in order to allow the rate of fetal weight gain to vary over the course of a pregnancy.
While polynomials are often a great way to model nuanced relationships, they are not without potential drawbacks. Sometimes people complain that they hamper interpretations of results, and while that may be true for individual coefficients, e.g. the parameter corresponding to week\(^3\), it is always possible to graph a non-linear relationship, so that interpretability isn’t a big issue. Another potential problem is overfitting the data, especially near the lower and upper ends of the data range. You can see these effects in selected polynomial fits for boys and girls below (these are not adjusted for any other covariates or predictor variables, though estimates are stratified by sex in the plots).
load("areabirths.RData")
library(tidyverse)
library(infer)
library(tidymodels)
library(kableExtra)
<- durhambirth %>%
m1plot ggplot(aes(x=GEST,y=BWTG,group=SEX)) +
geom_point(aes(color=SEX),alpha=0.2) +
stat_smooth(aes(color=SEX), method='lm',formula = y~poly(x,1)) +
labs(title="Linear Term",
x = "Gestational Age (weeks)",
y = "Birth Weight (g)")
<- durhambirth %>%
m5plot ggplot(aes(x=GEST,y=BWTG,group=SEX)) +
geom_point(aes(color=SEX),alpha=0.2) +
stat_smooth(aes(color=SEX), method='lm',formula = y~poly(x,5)) +
labs(title="Fifth Order Polynomial",
x = "Gestational Age (weeks)",
y = "Birth Weight (g)")
<- durhambirth %>%
m9plot ggplot(aes(x=GEST,y=BWTG,group=SEX)) +
geom_point(aes(color=SEX),alpha=0.2) +
stat_smooth(aes(color=SEX), method='lm',formula = y~poly(x,9)) +
labs(title="Ninth Order Polynomial",
x = "Gestational Age (weeks)",
y = "Birth Weight (g)")
m1plot
m5plot
m9plot
In this lab, your job is to explore a variety of models for birth weight using cross validation.
# pick your own seed - should not be this one!
set.seed(13121972)
# before creating high-order polynomials, we should center and scale gestational age
<- durhambirth %>%
durhambirth # now GEST_CS has mean 0 and variance 1
mutate(GEST_CS=scale(GEST)) %>%
mutate(GEST2=GEST_CS*GEST_CS) %>%
mutate(GEST3=GEST2*GEST_CS) %>%
mutate(GEST4=GEST3*GEST_CS) %>%
mutate(GEST5=GEST4*GEST_CS) %>%
mutate(GEST6=GEST5*GEST_CS) %>%
mutate(GEST7=GEST6*GEST_CS) %>%
mutate(GEST8=GEST7*GEST_CS) %>%
mutate(GEST9=GEST8*GEST_CS)
# create folds
# folds <-
<- linear_reg() %>%
birth_mod set_engine("lm")
<- recipe(BWTG ~ GEST_CS+SEX+PLURFAC+SMOKE+MULTIP, data=durhambirth)
b1_rec
<- workflow() %>%
b1_wflow add_model(birth_mod) %>%
add_recipe(b1_rec)
<- b1_wflow %>%
b1_fit fit_resamples(folds)
<- collect_metrics(b1_fit, summarize = TRUE) %>%
s1 select(.metric, mean) %>%
pivot_wider(names_from = .metric, values_from = mean) %>%
add_column(degree=1,.before=TRUE)
<- recipe(BWTG ~ GEST_CS + GEST2 + GEST3++GEST4+GEST5+SEX+PLURFAC+SMOKE+MULTIP, data=durhambirth)
b5_rec
<- workflow() %>%
b5_wflow add_model(birth_mod) %>%
add_recipe(b5_rec)
<- b5_wflow %>%
b5_fit fit_resamples(folds)
<- collect_metrics(b5_fit, summarize = TRUE) %>%
s5 select(.metric, mean) %>%
pivot_wider(names_from = .metric, values_from = mean) %>%
add_column(degree=5,.before=TRUE)
<- recipe(BWTG ~ GEST_CS + GEST2 + GEST3++GEST4+GEST5+GEST6+GEST7+GEST8+GEST9+SEX+PLURFAC+SMOKE+MULTIP, data=durhambirth)
b9_rec
<- workflow() %>%
b9_wflow add_model(birth_mod) %>%
add_recipe(b9_rec)
<- b9_wflow %>%
b9_fit fit_resamples(folds)
<- collect_metrics(b9_fit, summarize = TRUE) %>%
s9 select(.metric, mean) %>%
pivot_wider(names_from = .metric, values_from = mean) %>%
add_column(degree=9,.before=TRUE)
=full_join(s1,s5)
a=full_join(a,s9)
fintab
fintab
Based on the results in the table, which model appears to be doing the best job at prediction? Refer to specific values in the table to justify your choice of a single “best” model. Then, fit this model in the full Triangle area births sample and provide a plot of the residuals versus predicted values for this model. Comment on any concerns.
Assume your residual plot looks great! Now take your “best” model and assess its performance in data from Mecklenburg County (Charlotte is in Mecklenburg County), which are contained in the file mecklenburgbirth.RData
. When you assess performance, because we feel the populations are quite similar, you can re-fit the model using data from Mecklenburg County (so that the \(\hat{beta}\)’s are estimated from the Mecklenburg County data), using the form of the model you found superior in #3. How does the \(R^2\) in this brand new sample compare to the \(R^2\) you obtained in the test samples in the Triangle area data?
load("mecklenburgbirth.RData")
<- mecklenburgbirth %>%
mecklenburgbirth mutate(GEST_CS=scale(GEST)) %>%
mutate(GEST2=GEST_CS*GEST_CS) %>%
mutate(GEST3=GEST2*GEST_CS) %>%
mutate(GEST4=GEST3*GEST_CS) %>%
mutate(GEST5=GEST4*GEST_CS) %>%
mutate(GEST6=GEST5*GEST_CS) %>%
mutate(GEST7=GEST6*GEST_CS) %>%
mutate(GEST8=GEST7*GEST_CS) %>%
mutate(GEST9=GEST8*GEST_CS)
Total: 50 pts
Exercise 1: 5 pts
Exercise 2: 10 pts
Exercise 3: 10 pts
Exercise 4: 10 pts
Exercise 5: 10 pts
Overall: 5 pts