class: center, middle, inverse, title-slide # Go to End for Aggregate Coding! ##
Introduction to Global Health Data Science ###
Back to Website
###
Prof. Amy Herring --- layout: true <div class="my-footer"> <span> <a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a> </span> </div> --- class: middle # Logistic Regression --- ### Data We consider data from the Global Adult Tobacco Survey (GATS), which is designed to provide nationally-representative data on non-institutionalized people 15 years and older. This survey is a global standard for systematically monitoring adult tobacco use and is produced by the Centers for Disease Control (CDC) in collaboration with the World Health Organization (WHO), RTI International, and Johns Hopkins University. China has the largest smoking population in the world and accounts for roughly 40% of tobacco consumption worldwide. We will focus on GATS data from China in 2018 (the most recent survey year), but note data from other countries are available from the [WHO's Microdata Repository](https://extranet.who.int/ncdsmicrodata/index.php/home). --- ```r gats <- readr::read_csv("data/gats_rev.csv") gats$RESIDENCE=factor(gats$RESIDENCE,levels=c(1,2),labels=c("Urban","Rural")) gats$PROVINCE=factor(gats$PROVINCE,levels=seq(1:31),labels=c("Beijing","Tianjin","Hebei","Shanxi","Neimenggu","Liaoning","Jilin","Heilongjiang","Shanghai","Jiangsu","Zhejiang","Anhui","Fujian","Jiangxi","Shangdong","Henan","Hubei","Hunan","Guangdong","Guangxi","Hainan","Chongqing","Sichuan","Guizhou","Yunnan","Xizang","Shaanxi","Gansu","Qinghai","Ningxia","Xinjiang")) gats$REGION6=factor(gats$REGION6,levels=seq(1:6),labels=c("North","North-East","East","South-Central","South-West","North-West")) gats$REGION3=factor(gats$REGION3,levels=seq(1:3),labels=c("East","Central","West")) gats$GENDER=factor(gats$GENDER,levels=c(1,2),labels=c("Male","Female")) gats$CURRENTSMOKE=factor(gats$CURRENTSMOKE,levels=c(1,2,7),labels=c("Yes","No","Don't Know")) gats$EDUCATION=factor(gats$EDUCATION,levels=c(1,2,3,4,5,6,7,8,77,99),labels=c("None","Less than Primary School","Primary School","Less than Secondary School","Secondary School","High School","University","Postgraduate","Don't Know","Refused")) gats$OCCUPATION=factor(gats$OCCUPATION,levels=c(1,2,3,4,5,6,7,8,9,10,77,99),labels=c("Farming","Government","Business","Teacher","Healthcare","Student","Soldier","Unemployed","Retired","Other","Don't Know","Refused")) gats$HEARDOFECIG=factor(gats$HEARDOFECIG,levels=c(1,2,9),labels=c("Yes","No","Refused")) gats$ECIGUSE=factor(gats$ECIGUSE,levels=c(1,2,3,9),labels=c("Daily","Less than Daily","Not at All","Refused")) gats$TRYSTOP=factor(gats$TRYSTOP,levels=c(1,2,9),labels=c("Yes","No","Refused")) gats$HOMESMOKERULES=factor(gats$HOMESMOKERULES,levels=c(1,2,3,4,7,9),labels=c("Allowed","Not Allowed but Exceptions","Never Allowed","No Rules","Don't Know","Refused")) gats$SMOKESICK=factor(gats$SMOKESICK,levels=c(1,2,7,9),labels=c("Yes","No","Don't Know","Refused")) gats$SMOKESICKBINARY=ifelse(gats$SMOKESICK=="Yes","Yes","No") gats$SMOKESICKBINARY=as.factor(gats$SMOKESICKBINARY) gats$SMOKECANCER=factor(gats$SMOKECANCER,levels=c(1,2,7,9),labels=c("Yes","No","Don't Know","Refused")) #scale age into decades gats$DECADES=gats$AGE/10 ``` --- Let's explore factors related to a respondent's belief that smoking makes you sick. First, filter to data to exclude anyone who did not respond to the education question or who reported they didn't know their education level. Then, conduct some initial exploratory analysis of how the belief smoking makes you sick varies across levels of gender and education. Fit a logistic regression model with gender and education as predictors. Write a paragraph or two describing your findings. ```r gats2 <- gats %>% filter(EDUCATION != "Don't Know") %>% filter(EDUCATION != "Refused") %>% filter(SMOKESICK != "Don't Know") %>% filter(SMOKESICK != "Refused") table(gats2$SMOKESICK) ``` ``` #> #> Yes No Don't Know Refused #> 15932 953 0 0 ``` --- ```r ggplot(gats2, aes(x = GENDER, fill =SMOKESICKBINARY)) + geom_bar() + labs(x = 'Gender', fill = 'Smoking Makes You Sick?') + scale_fill_manual(values = c("#E48957", "#071381","#07813e")) ``` <img src="w11-ae_files/figure-html/unnamed-chunk-2-1.png" width="60%" style="display: block; margin: auto;" /> --- ```r ggplot(gats2, aes(x = EDUCATION, fill =SMOKESICKBINARY)) + geom_bar() + labs(x = 'Education', fill = 'Smoking Makes You Sick?') + scale_fill_manual(values = c("#E48957", "#071381","#07813e")) + theme(axis.text.x = element_text(angle = 45,hjust=1)) ``` <img src="w11-ae_files/figure-html/unnamed-chunk-3-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Output (on the log OR scale) ```r sick_m1 <- logistic_reg() %>% set_engine("glm") %>% fit(SMOKESICKBINARY ~ GENDER + EDUCATION, data=gats2, family="binomial") result<-tidy(sick_m1, conf.int=TRUE) ``` --- ```r print(result[,1],n=9) ``` ``` #> # A tibble: 9 × 1 #> term #> <chr> #> 1 (Intercept) #> 2 GENDERFemale #> 3 EDUCATIONLess than Primary School #> 4 EDUCATIONPrimary School #> 5 EDUCATIONLess than Secondary School #> 6 EDUCATIONSecondary School #> 7 EDUCATIONHigh School #> 8 EDUCATIONUniversity #> 9 EDUCATIONPostgraduate ``` --- ```r print(result,n=9) ``` ``` #> # A tibble: 9 × 7 #> term estimate std.error statistic p.value conf.low conf.high #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 (Inte… 1.83 0.105 17.5 1.09e-68 1.63 2.04 #> 2 GENDE… 0.801 0.0716 11.2 4.48e-29 0.662 0.943 #> 3 EDUCA… 0.187 0.126 1.48 1.39e- 1 -0.0618 0.434 #> 4 EDUCA… 0.513 0.127 4.04 5.36e- 5 0.263 0.761 #> 5 EDUCA… 0.551 0.163 3.37 7.46e- 4 0.235 0.876 #> 6 EDUCA… 0.760 0.118 6.46 1.05e-10 0.528 0.990 #> 7 EDUCA… 1.09 0.137 7.93 2.28e-15 0.820 1.36 #> 8 EDUCA… 1.00 0.139 7.19 6.45e-13 0.729 1.28 #> 9 EDUCA… 0.684 0.432 1.58 1.14e- 1 -0.0822 1.64 ``` --- ## Try an interaction (exponentiating output here) ```r sick_m2 <- logistic_reg() %>% set_engine("glm") %>% fit(SMOKESICKBINARY ~ GENDER + EDUCATION * GENDER*EDUCATION, data=gats2, family="binomial") result2<-tidy(sick_m2, conf.int=TRUE, exponentiate=TRUE) ``` --- ```r print(result2[,1],n=16) ``` ``` #> # A tibble: 16 × 1 #> term #> <chr> #> 1 (Intercept) #> 2 GENDERFemale #> 3 EDUCATIONLess than Primary School #> 4 EDUCATIONPrimary School #> 5 EDUCATIONLess than Secondary School #> 6 EDUCATIONSecondary School #> 7 EDUCATIONHigh School #> 8 EDUCATIONUniversity #> 9 EDUCATIONPostgraduate #> 10 GENDERFemale:EDUCATIONLess than Primary School #> 11 GENDERFemale:EDUCATIONPrimary School #> 12 GENDERFemale:EDUCATIONLess than Secondary School #> 13 GENDERFemale:EDUCATIONSecondary School #> 14 GENDERFemale:EDUCATIONHigh School #> 15 GENDERFemale:EDUCATIONUniversity #> 16 GENDERFemale:EDUCATIONPostgraduate ``` --- ```r print(result2,n=16) ``` ``` #> # A tibble: 16 × 7 #> term estimate std.error statistic p.value conf.low conf.high #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 (Int… 6.51 0.168 11.2 5.70e-29 4.75 9.18 #> 2 GEND… 2.10 0.201 3.70 2.18e- 4 1.41 3.09 #> 3 EDUC… 1.13 0.197 0.635 5.26e- 1 0.763 1.66 #> 4 EDUC… 1.57 0.197 2.30 2.16e- 2 1.06 2.29 #> 5 EDUC… 1.67 0.230 2.22 2.62e- 2 1.06 2.62 #> 6 EDUC… 2.05 0.186 3.85 1.19e- 4 1.41 2.93 #> 7 EDUC… 2.89 0.205 5.17 2.29e- 7 1.92 4.29 #> 8 EDUC… 2.63 0.210 4.62 3.92e- 6 1.73 3.95 #> 9 EDUC… 2.71 0.617 1.62 1.06e- 1 0.940 11.5 #> 10 GEND… 1.12 0.265 0.434 6.64e- 1 0.671 1.90 #> 11 GEND… 1.12 0.268 0.440 6.60e- 1 0.670 1.92 #> 12 GEND… 1.05 0.351 0.150 8.81e- 1 0.537 2.13 #> 13 GEND… 1.07 0.249 0.264 7.92e- 1 0.660 1.75 #> 14 GEND… 1.02 0.292 0.0635 9.49e- 1 0.580 1.82 #> 15 GEND… 1.04 0.292 0.133 8.94e- 1 0.590 1.86 #> 16 GEND… 0.440 0.864 -0.950 3.42e- 1 0.0751 2.58 ``` Whew, no important interaction terms apparent here. --- ## Have aggregate data like counts and/or rates? ```r # create 0/1 version of SMOKESICKBINARY where 1=yes makes you sick and 0=no #right now 1=yes and 2=no gats2$SMOKESICKBIN2=ifelse(gats2$SMOKESICKBINARY=="Yes",1,0) ## Turn our data into aggregate counts ## Sickyes = # by gender and education who think smoking makes you sick ## TOTAL: # in each gender and education group (denominator for prevalence) ## Prevalence: probability people in group think smoking makes you sick, calc as Sickyes/TOTAL ## Sickno: TOTAL-Sickyes gats2_agg <- gats2%>%group_by(GENDER, EDUCATION) %>% summarise(TOTAL=n(), Sickyes=sum(as.numeric(SMOKESICKBIN2))) %>% ungroup() %>% mutate(prevalence=Sickyes/TOTAL) gats2_agg$Sickno <- gats2_agg$TOTAL- gats2_agg$Sickyes ``` --- ##Now that we have aggregate data, how do we fit the model? ```r modelagg<-glm(cbind(Sickyes, Sickno) ~ GENDER+EDUCATION, data=gats2_agg, family = binomial) ``` --- Compare this to the output we got from the individual observations! ```r summary(modelagg) ``` ``` #> #> Call: #> glm(formula = cbind(Sickyes, Sickno) ~ GENDER + EDUCATION, family = binomial, #> data = gats2_agg) #> #> Deviance Residuals: #> Min 1Q Median 3Q Max #> -0.81858 -0.16035 -0.00327 0.14003 0.63091 #> #> Coefficients: #> Estimate Std. Error z value #> (Intercept) 1.83277 0.10464 17.516 #> GENDERFemale 0.80144 0.07161 11.192 #> EDUCATIONLess than Primary School 0.18665 0.12628 1.478 #> EDUCATIONPrimary School 0.51271 0.12693 4.039 #> EDUCATIONLess than Secondary School 0.55071 0.16331 3.372 #> EDUCATIONSecondary School 0.76047 0.11772 6.460 #> EDUCATIONHigh School 1.08889 0.13740 7.925 #> EDUCATIONUniversity 1.00133 0.13926 7.191 #> EDUCATIONPostgraduate 0.68403 0.43250 1.582 #> Pr(>|z|) #> (Intercept) < 2e-16 *** #> GENDERFemale < 2e-16 *** #> EDUCATIONLess than Primary School 0.139399 #> EDUCATIONPrimary School 5.36e-05 *** #> EDUCATIONLess than Secondary School 0.000746 *** #> EDUCATIONSecondary School 1.05e-10 *** #> EDUCATIONHigh School 2.28e-15 *** #> EDUCATIONUniversity 6.45e-13 *** #> EDUCATIONPostgraduate 0.113741 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> (Dispersion parameter for binomial family taken to be 1) #> #> Null deviance: 213.0720 on 15 degrees of freedom #> Residual deviance: 1.4187 on 7 degrees of freedom #> AIC: 107.13 #> #> Number of Fisher Scoring iterations: 4 ```