Processing math: 100%
+ - 0:00:00
Notes for current slide
Notes for next slide

Model Fitting and Interpretation



Introduction to Global Health Data Science

Back to Website


Prof. Amy Herring

1 / 33

Models with numeric explanatory variables

2 / 33

Data: Artisanal and Small-Scale Gold Mining in the Peruvian Amazon

3 / 33

Mercury and Assets

Are mercury concentrations in hair related to household socioeconomic position, as measured by assets?

library(tidyverse)
library(tidymodels)
library(readr)
mercury <- readr::read_csv("mercury_reg.csv")
mercury <-
mercury %>%
# scale() subtracts the mean and divides by the SD to make the units "standard deviations" like a z-score
mutate(assets_sc=scale(SESassets)) %>%
#another variable we may use later
mutate(form_min_sc=scale(FM_buffer)) %>%
#so I don't have to remember coding
mutate(sex,sex_cat=ifelse(sex==1,"Male","Female")) %>%
mutate(native,native_cat=ifelse(native==1,"Native","Non-native"))
4 / 33

Goal: Predict mercury from assets

ˆyi=ˆβ0+ˆβ1×xi Here yi represents log hair mercury for subject i, and xi represents standardized household income.

5 / 33

Step 1: Specify model

linear_reg()
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
7 / 33

Step 2: Set model fitting engine

linear_reg() %>%
set_engine("lm") # lm: linear model
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
8 / 33

Step 3: Fit model & estimate parameters

... using formula syntax

linear_reg() %>%
set_engine("lm") %>%
fit(lhairHg ~ assets_sc, data = mercury)
#> parsnip model object
#>
#> Fit time: 1ms
#>
#> Call:
#> stats::lm(formula = lhairHg ~ assets_sc, data = data)
#>
#> Coefficients:
#> (Intercept) assets_sc
#> 0.2795 -0.3577
9 / 33

A closer look at model output

#> parsnip model object
#>
#> Fit time: 1ms
#>
#> Call:
#> stats::lm(formula = lhairHg ~ assets_sc, data = data)
#>
#> Coefficients:
#> (Intercept) assets_sc
#> 0.2795 -0.3577

ˆyi=0.27950.3577×xi

10 / 33

A tidy look at model output

linear_reg() %>%
set_engine("lm") %>%
fit(lhairHg ~ assets_sc, data = mercury) %>%
tidy()
#> # A tibble: 2 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.280 0.0213 13.1 7.49e-38
#> 2 assets_sc -0.358 0.0216 -16.5 3.90e-58

ˆyi=0.280.358×xi

11 / 33

Slope and intercept

ˆyi=0.27950.3577×xi

12 / 33

Slope and intercept

ˆyi=0.27950.3577×xi

  • Slope: For each standard deviation increase in assets, the log hair mercury level is expected to be lower, on average, by 0.3577 log ppm.
  • perhaps not a very useful statement given the scale
  • Intercept: Individuals with household assets at the mean level (assets_sc=0) are expected to have hair mercury concentrations of 0.2795 log ppm, on average

    • Remember assets_sc was standardized, so assets_sc=0 does not mean "no assets" but instead means "average assets" as it corresponds to an assets z-score of 0
12 / 33

Models with log transformation

13 / 33

Working with logs

  • Subtraction and logs: log(a)log(b)=log(a/b)
14 / 33

Working with logs

  • Subtraction and logs: log(a)log(b)=log(a/b)
  • Natural logarithm: elog(x)=x
14 / 33

Working with logs

  • Subtraction and logs: log(a)log(b)=log(a/b)
  • Natural logarithm: elog(x)=x
  • We can use these identities to "undo" the log transformation
14 / 33

Interpreting the slope

The slope coefficient for the log transformed model is -0.3577, meaning the log mercury difference between people whose household incomes are one SD apart is predicted to be -0.3577 log ppm.

15 / 33

Interpreting the slope

The slope coefficient for the log transformed model is -0.3577, meaning the log mercury difference between people whose household incomes are one SD apart is predicted to be -0.3577 log ppm.

Using this information, and properties of logs that we just reviewed, fill in the blanks in the following alternate interpretation of the slope:

For each additional SD the household assets are greater, the hair mercury concentration is expected to be ___ , on average, by a factor of ___.

15 / 33

For each additional increase in scaled assets, hair mercury content is expected to be ___ , on average, by a factor of ___.

log(hair Hg for assets x+1)log(hair Hg for assets x)=0.3577

16 / 33

For each additional increase in scaled assets, hair mercury content is expected to be ___ , on average, by a factor of ___.

log(hair Hg for assets x+1)log(hair Hg for assets x)=0.3577

log(hair Hg for assets x+1hair Hg for assets x)=0.3577

16 / 33

For each additional increase in scaled assets, hair mercury content is expected to be ___ , on average, by a factor of ___.

log(hair Hg for assets x+1)log(hair Hg for assets x)=0.3577

log(hair Hg for assets x+1hair Hg for assets x)=0.3577

elog(Hg for assets x+1Hg for assets x)=e0.3577

16 / 33

For each additional increase in scaled assets, hair mercury content is expected to be ___ , on average, by a factor of ___.

log(hair Hg for assets x+1)log(hair Hg for assets x)=0.3577

log(hair Hg for assets x+1hair Hg for assets x)=0.3577

elog(Hg for assets x+1Hg for assets x)=e0.3577

Hg for assets x+1Hg for assets x0.70

16 / 33

For each additional increase in scaled assets, hair mercury content is expected to be ___ , on average, by a factor of ___.

log(hair Hg for assets x+1)log(hair Hg for assets x)=0.3577

log(hair Hg for assets x+1hair Hg for assets x)=0.3577

elog(Hg for assets x+1Hg for assets x)=e0.3577

Hg for assets x+1Hg for assets x0.70

For each SD increase in assets, the hair Hg is expected to be lower, on average, by a factor of 0.70.

16 / 33

Correlation does not imply causation

Remember this when interpreting model coefficients

Source: XKCD, Cell phones

17 / 33

Parameter estimation

18 / 33

Linear model with a single predictor

  • We're interested in β0 (population parameter for the intercept) and β1 (population parameter for the slope) in the following model:

yi=β0+β1 xi+εi where ε represents random error around our mean

19 / 33

Linear model with a single predictor

  • We're interested in β0 (population parameter for the intercept) and β1 (population parameter for the slope) in the following model:

yi=β0+β1 xi+εi where ε represents random error around our mean

  • Tough luck, you can't have them...
19 / 33

Linear model with a single predictor

  • We're interested in β0 (population parameter for the intercept) and β1 (population parameter for the slope) in the following model:

yi=β0+β1 xi+εi where ε represents random error around our mean

  • Tough luck, you can't have them...
  • So we use sample statistics to estimate them, using the notation b or ˆβ to distinguish our estimates from the true parameters β

ˆyi=b0+b1 xi or ˆyi=ˆβ0+ˆβ1 xi

19 / 33

Least squares regression

  • The regression line minimizes the sum of squared residuals.
20 / 33

Least squares regression

  • The regression line minimizes the sum of squared residuals.
  • If ei=yiˆyi, then, the regression line minimizes ni=1e2i.
20 / 33

Visualizing residuals

21 / 33

Visualizing residuals (cont.)

22 / 33

Visualizing residuals (cont.)

23 / 33

Properties of least squares regression

  • The regression line goes through the center of mass point, the coordinates corresponding to average x and average y, (ˉx,ˉy):

ˉy=b0+b1ˉx  b0=ˉyb1ˉx

24 / 33

Properties of least squares regression

  • The regression line goes through the center of mass point, the coordinates corresponding to average x and average y, (ˉx,ˉy):

ˉy=b0+b1ˉx  b0=ˉyb1ˉx

  • The slope has the same sign as the correlation coefficient: b1=rsysx
24 / 33

Properties of least squares regression

  • The regression line goes through the center of mass point, the coordinates corresponding to average x and average y, (ˉx,ˉy):

ˉy=b0+b1ˉx  b0=ˉyb1ˉx

  • The slope has the same sign as the correlation coefficient: b1=rsysx
  • The sum of the residuals is zero: ni=1ei=0
24 / 33

Properties of least squares regression

  • The regression line goes through the center of mass point, the coordinates corresponding to average x and average y, (ˉx,ˉy):

ˉy=b0+b1ˉx  b0=ˉyb1ˉx

  • The slope has the same sign as the correlation coefficient: b1=rsysx
  • The sum of the residuals is zero: ni=1ei=0
  • The residuals and x values are uncorrelated
24 / 33

Aside: Correlation Coefficient

The correlation coefficient describes the linear relationship between x and y on the standard deviation scale.

cor(mercury$lhairHg,mercury$assets_sc,use="complete.obs")
#> [,1]
#> [1,] -0.326132

Here, for each one SD increase in assets, we expect a -0.33 SD increase (so thus a 0.33 decrease) in standard deviations of log hair mercury.

25 / 33

Models with categorical explanatory variables

26 / 33

Categorical predictor with 2 levels

#> # A tibble: 2,308 × 3
#> lhairHg assets_sc[,1] native_cat
#> <dbl> <dbl> <chr>
#> 1 0.676 -0.837 Non-native
#> 2 0.0908 0.197 Non-native
#> 3 1.67 -0.280 Non-native
#> 4 0.450 -0.280 Non-native
#> 5 0.704 -0.280 Non-native
#> 6 -0.512 -0.280 Non-native
#> 7 -0.125 0.826 Non-native
#> 8 -0.103 0.826 Non-native
#> 9 0.351 -1.27 Non-native
#> 10 0.0444 -1.27 Non-native
#> 11 0.662 -0.359 Non-native
#> 12 1.01 1.29 Non-native
#> 13 -0.149 1.29 Non-native
#> 14 0.881 0.517 Non-native
#> 15 1.51 0.517 Non-native
#> 16 0.328 -0.658 Non-native
#> 17 0.0553 0.533 Non-native
#> 18 0.768 1.29 Non-native
#> 19 0.431 1.30 Non-native
#> 20 1.55 1.30 Non-native
#> # … with 2,288 more rows
  • Native: community classified as native by Peruvian government
  • Non-native: community does not have native classification
27 / 33

Mercury & native community relationship

linear_reg() %>%
set_engine("lm") %>%
fit(lhairHg ~ native_cat, data = mercury) %>%
tidy()
#> # A tibble: 2 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 1.12 0.0416 26.9 4.97e-139
#> 2 native_catNon-native -1.11 0.0476 -23.2 2.94e-107
28 / 33

Mercury and native communities

Define Non-nativei=1 if non-native community and 0 otherwise.

^yi=1.121.11 Non-nativei

  • Slope: Residents of non-native communities, on average, are expected to have Hg concentrations that are -1.11 log ppm lower, on average, than residents of native communities. Alternatively, the hair Hg is expected to be lower, on average, by a factor of e1.11=0.33 among residents of non-native communities.

    • Compares baseline level (Native) to the other level (Non-native)
  • Intercept: Individuals living in native communities (Non-native=0) are expected to have hair mercury exposures of 1.12 log ppm.

29 / 33

Relationship between mercury and individual community

linear_reg() %>%
set_engine("lm") %>%
fit(lhairHg ~ community, data = mercury) %>%
tidy()
#> # A tibble: 23 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.614 0.0644 9.54 3.55e-21
#> 2 communityBoca Isiriwe 1.20 0.159 7.56 5.77e-14
#> 3 communityBoca Manu 0.833 0.129 6.43 1.56e-10
#> 4 communityCaychihue -1.10 0.132 -8.36 1.09e-16
#> 5 communityChoque -0.778 0.117 -6.68 3.04e-11
#> 6 communityDiamante 1.06 0.109 9.67 1.00e-21
#> 7 communityHuepetuhe -0.694 0.0732 -9.48 5.96e-21
#> 8 communityIsla de los Valles 1.12 0.245 4.57 5.18e- 6
#> 9 communityMasenawa 1.20 0.401 3.00 2.74e- 3
#> 10 communityPalotoa Teparo 0.350 0.148 2.36 1.82e- 2
#> # … with 13 more rows
30 / 33

Dummy or indicator variables

#> # A tibble: 23 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.614 0.0644 9.54 3.55e-21
#> 2 communityBoca Isiriwe 1.20 0.159 7.56 5.77e-14
#> 3 communityBoca Manu 0.833 0.129 6.43 1.56e-10
#> 4 communityCaychihue -1.10 0.132 -8.36 1.09e-16
#> 5 communityChoque -0.778 0.117 -6.68 3.04e-11
#> 6 communityDiamante 1.06 0.109 9.67 1.00e-21
#> 7 communityHuepetuhe -0.694 0.0732 -9.48 5.96e-21
#> 8 communityIsla de los Valles 1.12 0.245 4.57 5.18e- 6
#> 9 communityMasenawa 1.20 0.401 3.00 2.74e- 3
#> 10 communityPalotoa Teparo 0.350 0.148 2.36 1.82e- 2
#> # … with 13 more rows
  • When the categorical explanatory variable has many levels, they're encoded to dummy or indicator variables
  • Each coefficient describes the expected difference between log mercury concentrations in that particular community compared to the baseline community
31 / 33

Categorical predictor with 3+ levels

community BI BM Cay Cho Dia Hue Isl Mas
Boca Colorado 0 0 0 0 0 0 0 0
Boca Isiriwe 1 0 0 0 0 0 0 0
Boca Manu 0 1 0 0 0 0 0 0
Caychihue 0 0 1 0 0 0 0 0
Choque 0 0 0 1 0 0 0 0
Diamante 0 0 0 0 1 0 0 0
Huepetuhe 0 0 0 0 0 1 0 0
Isla de los Valles 0 0 0 0 0 0 1 0
Masenawa 0 0 0 0 0 0 0 1

etc. for the other communities. Note Boca Colorado, the referent community, =0 for all vars

32 / 33

Relationship between mercury and community

#> # A tibble: 23 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.614 0.0644 9.54 3.55e-21
#> 2 communityBoca Isiriwe 1.20 0.159 7.56 5.77e-14
#> 3 communityBoca Manu 0.833 0.129 6.43 1.56e-10
#> 4 communityCaychihue -1.10 0.132 -8.36 1.09e-16
#> 5 communityChoque -0.778 0.117 -6.68 3.04e-11
#> 6 communityDiamante 1.06 0.109 9.67 1.00e-21
#> 7 communityHuepetuhe -0.694 0.0732 -9.48 5.96e-21
#> 8 communityIsla de los Valles 1.12 0.245 4.57 5.18e- 6
#> 9 communityMasenawa 1.20 0.401 3.00 2.74e- 3
#> 10 communityPalotoa Teparo 0.350 0.148 2.36 1.82e- 2
#> # … with 13 more rows
  • Boca Colorado hair mercury levels are expected, on average, to be 0.614 log ppm.
  • Boca Isiriwe hair mercury levels are expected, on average, to be 1.20 log ppm higher than Boca Colorado levels, at 1.814 log ppm on average.
  • Boca Manu hair mercury levels are expected, on average, to be 0.833 log ppm higher than Boca Colorado levels, at 1.447 log ppm on average.
  • Choque hair mercury levels are expected, on average, to be 0.778 log ppm lower than Boca Colorado levels, at -0.164 log ppm on average.
  • etc. It can be nice to have a table showing the estimated mean (along with uncertainty measures -- coming soon) for each community, and a plot can also be useful.
33 / 33

Models with numeric explanatory variables

2 / 33
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow