Go to End for Aggregate Coding!

class: center, middle, inverse, title-slide

# Go to End for Aggregate Coding!
## <br><br> Introduction to Global Health Data Science
### <a href="https://sta198f2021.github.io/website/">Back to Website</a>
### <br> Prof. Amy Herring

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---

class: middle

# Logistic Regression

---

### Data

We consider data from the Global Adult Tobacco Survey (GATS), which is designed to provide nationally-representative data on non-institutionalized people 15 years and older. This survey is a global standard for systematically monitoring adult tobacco use and is produced by the Centers for Disease Control (CDC) in collaboration with the World Health Organization (WHO), RTI International, and Johns Hopkins University.

China has the largest smoking population in the world and accounts for roughly 40% of tobacco consumption worldwide. We will focus on GATS data from China in 2018 (the most recent survey year), but note data from other countries are available from the [WHO's Microdata Repository](https://extranet.who.int/ncdsmicrodata/index.php/home).

---

```r
gats <- readr::read_csv("data/gats_rev.csv")

gats$RESIDENCE=factor(gats$RESIDENCE,levels=c(1,2),labels=c("Urban","Rural"))
gats$PROVINCE=factor(gats$PROVINCE,levels=seq(1:31),labels=c("Beijing","Tianjin","Hebei","Shanxi","Neimenggu","Liaoning","Jilin","Heilongjiang","Shanghai","Jiangsu","Zhejiang","Anhui","Fujian","Jiangxi","Shangdong","Henan","Hubei","Hunan","Guangdong","Guangxi","Hainan","Chongqing","Sichuan","Guizhou","Yunnan","Xizang","Shaanxi","Gansu","Qinghai","Ningxia","Xinjiang"))
gats$REGION6=factor(gats$REGION6,levels=seq(1:6),labels=c("North","North-East","East","South-Central","South-West","North-West"))
gats$REGION3=factor(gats$REGION3,levels=seq(1:3),labels=c("East","Central","West"))
gats$GENDER=factor(gats$GENDER,levels=c(1,2),labels=c("Male","Female"))
gats$CURRENTSMOKE=factor(gats$CURRENTSMOKE,levels=c(1,2,7),labels=c("Yes","No","Don't Know"))
gats$EDUCATION=factor(gats$EDUCATION,levels=c(1,2,3,4,5,6,7,8,77,99),labels=c("None","Less than Primary School","Primary School","Less than Secondary School","Secondary School","High School","University","Postgraduate","Don't Know","Refused"))
gats$OCCUPATION=factor(gats$OCCUPATION,levels=c(1,2,3,4,5,6,7,8,9,10,77,99),labels=c("Farming","Government","Business","Teacher","Healthcare","Student","Soldier","Unemployed","Retired","Other","Don't Know","Refused"))
gats$HEARDOFECIG=factor(gats$HEARDOFECIG,levels=c(1,2,9),labels=c("Yes","No","Refused"))
gats$ECIGUSE=factor(gats$ECIGUSE,levels=c(1,2,3,9),labels=c("Daily","Less than Daily","Not at All","Refused"))
gats$TRYSTOP=factor(gats$TRYSTOP,levels=c(1,2,9),labels=c("Yes","No","Refused"))
gats$HOMESMOKERULES=factor(gats$HOMESMOKERULES,levels=c(1,2,3,4,7,9),labels=c("Allowed","Not Allowed but Exceptions","Never Allowed","No Rules","Don't Know","Refused"))
gats$SMOKESICK=factor(gats$SMOKESICK,levels=c(1,2,7,9),labels=c("Yes","No","Don't Know","Refused"))
gats$SMOKESICKBINARY=ifelse(gats$SMOKESICK=="Yes","Yes","No")
gats$SMOKESICKBINARY=as.factor(gats$SMOKESICKBINARY)
gats$SMOKECANCER=factor(gats$SMOKECANCER,levels=c(1,2,7,9),labels=c("Yes","No","Don't Know","Refused"))
#scale age into decades
gats$DECADES=gats$AGE/10
```

---

Let's explore factors related to a respondent's belief that smoking makes you sick.  First, filter to data to exclude anyone who did not respond to the education question or who reported they didn't know their education level.

Then, conduct some initial exploratory analysis of how the belief smoking makes you sick varies across levels of gender and education. Fit a logistic regression model with gender and education as predictors. Write a paragraph or two describing your findings.

```r
gats2 <- gats %>%
  filter(EDUCATION != "Don't Know") %>%
  filter(EDUCATION != "Refused") %>%
  filter(SMOKESICK != "Don't Know") %>%
  filter(SMOKESICK != "Refused")

table(gats2$SMOKESICK)
```

```
#> 
#>        Yes         No Don't Know    Refused 
#>      15932        953          0          0
```

---

```r
ggplot(gats2, aes(x = GENDER,
                 fill =SMOKESICKBINARY)) +
  geom_bar() + 
  labs(x = 'Gender',
       fill = 'Smoking Makes You Sick?') +
   scale_fill_manual(values = c("#E48957", "#071381","#07813e"))
```

---

```r
ggplot(gats2, aes(x = EDUCATION,
                 fill =SMOKESICKBINARY)) +
  geom_bar() + 
  labs(x = 'Education',
       fill = 'Smoking Makes You Sick?') +
   scale_fill_manual(values = c("#E48957", "#071381","#07813e")) + 
  theme(axis.text.x = element_text(angle = 45,hjust=1))
```

---

## Output (on the log OR scale)

```r
sick_m1 <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(SMOKESICKBINARY ~ GENDER + EDUCATION, data=gats2, family="binomial")
result<-tidy(sick_m1, conf.int=TRUE)
```

---

```r
print(result[,1],n=9)
```

```
#> # A tibble: 9 × 1
#>   term                               
#>   <chr>                              
#> 1 (Intercept)                        
#> 2 GENDERFemale                       
#> 3 EDUCATIONLess than Primary School  
#> 4 EDUCATIONPrimary School            
#> 5 EDUCATIONLess than Secondary School
#> 6 EDUCATIONSecondary School          
#> 7 EDUCATIONHigh School               
#> 8 EDUCATIONUniversity                
#> 9 EDUCATIONPostgraduate
```

---

```r
print(result,n=9)
```

```
#> # A tibble: 9 × 7
#>   term   estimate std.error statistic  p.value conf.low conf.high
#>   <chr>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Inte…    1.83     0.105      17.5  1.09e-68   1.63       2.04 
#> 2 GENDE…    0.801    0.0716     11.2  4.48e-29   0.662      0.943
#> 3 EDUCA…    0.187    0.126       1.48 1.39e- 1  -0.0618     0.434
#> 4 EDUCA…    0.513    0.127       4.04 5.36e- 5   0.263      0.761
#> 5 EDUCA…    0.551    0.163       3.37 7.46e- 4   0.235      0.876
#> 6 EDUCA…    0.760    0.118       6.46 1.05e-10   0.528      0.990
#> 7 EDUCA…    1.09     0.137       7.93 2.28e-15   0.820      1.36 
#> 8 EDUCA…    1.00     0.139       7.19 6.45e-13   0.729      1.28 
#> 9 EDUCA…    0.684    0.432       1.58 1.14e- 1  -0.0822     1.64
```

---
## Try an interaction (exponentiating output here)

```r
sick_m2 <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(SMOKESICKBINARY ~ GENDER + EDUCATION * GENDER*EDUCATION, data=gats2, family="binomial")
result2<-tidy(sick_m2, conf.int=TRUE, exponentiate=TRUE)
```

---

```r
print(result2[,1],n=16)
```

```
#> # A tibble: 16 × 1
#>    term                                            
#>    <chr>                                           
#>  1 (Intercept)                                     
#>  2 GENDERFemale                                    
#>  3 EDUCATIONLess than Primary School               
#>  4 EDUCATIONPrimary School                         
#>  5 EDUCATIONLess than Secondary School             
#>  6 EDUCATIONSecondary School                       
#>  7 EDUCATIONHigh School                            
#>  8 EDUCATIONUniversity                             
#>  9 EDUCATIONPostgraduate                           
#> 10 GENDERFemale:EDUCATIONLess than Primary School  
#> 11 GENDERFemale:EDUCATIONPrimary School            
#> 12 GENDERFemale:EDUCATIONLess than Secondary School
#> 13 GENDERFemale:EDUCATIONSecondary School          
#> 14 GENDERFemale:EDUCATIONHigh School               
#> 15 GENDERFemale:EDUCATIONUniversity                
#> 16 GENDERFemale:EDUCATIONPostgraduate
```

---

```r
print(result2,n=16)
```

```
#> # A tibble: 16 × 7
#>    term  estimate std.error statistic  p.value conf.low conf.high
#>    <chr>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#>  1 (Int…    6.51      0.168   11.2    5.70e-29   4.75        9.18
#>  2 GEND…    2.10      0.201    3.70   2.18e- 4   1.41        3.09
#>  3 EDUC…    1.13      0.197    0.635  5.26e- 1   0.763       1.66
#>  4 EDUC…    1.57      0.197    2.30   2.16e- 2   1.06        2.29
#>  5 EDUC…    1.67      0.230    2.22   2.62e- 2   1.06        2.62
#>  6 EDUC…    2.05      0.186    3.85   1.19e- 4   1.41        2.93
#>  7 EDUC…    2.89      0.205    5.17   2.29e- 7   1.92        4.29
#>  8 EDUC…    2.63      0.210    4.62   3.92e- 6   1.73        3.95
#>  9 EDUC…    2.71      0.617    1.62   1.06e- 1   0.940      11.5 
#> 10 GEND…    1.12      0.265    0.434  6.64e- 1   0.671       1.90
#> 11 GEND…    1.12      0.268    0.440  6.60e- 1   0.670       1.92
#> 12 GEND…    1.05      0.351    0.150  8.81e- 1   0.537       2.13
#> 13 GEND…    1.07      0.249    0.264  7.92e- 1   0.660       1.75
#> 14 GEND…    1.02      0.292    0.0635 9.49e- 1   0.580       1.82
#> 15 GEND…    1.04      0.292    0.133  8.94e- 1   0.590       1.86
#> 16 GEND…    0.440     0.864   -0.950  3.42e- 1   0.0751      2.58
```

Whew, no important interaction terms apparent here.
---

## Have aggregate data like counts and/or rates?

```r
# create 0/1 version of SMOKESICKBINARY where 1=yes makes you sick and 0=no
#right now 1=yes and 2=no

gats2$SMOKESICKBIN2=ifelse(gats2$SMOKESICKBINARY=="Yes",1,0)

## Turn our data into aggregate counts
## Sickyes = # by gender and education who think smoking makes you sick
## TOTAL: # in each gender and education group (denominator for prevalence)
## Prevalence: probability people in group think smoking makes you sick, calc as Sickyes/TOTAL

## Sickno: TOTAL-Sickyes

gats2_agg <- gats2%>%group_by(GENDER, EDUCATION) %>% 
  summarise(TOTAL=n(), Sickyes=sum(as.numeric(SMOKESICKBIN2))) %>%
  ungroup() %>% 
  mutate(prevalence=Sickyes/TOTAL)

gats2_agg$Sickno <- gats2_agg$TOTAL- gats2_agg$Sickyes
```

---

##Now that we have aggregate data, how do we fit the model?

```r
modelagg<-glm(cbind(Sickyes, Sickno) ~ GENDER+EDUCATION, data=gats2_agg, family = binomial)
```

---

Compare this to the output we got from the individual observations!

```r
summary(modelagg)
```

```
#> 
#> Call:
#> glm(formula = cbind(Sickyes, Sickno) ~ GENDER + EDUCATION, family = binomial, 
#>     data = gats2_agg)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -0.81858  -0.16035  -0.00327   0.14003   0.63091  
#> 
#> Coefficients:
#>                                     Estimate Std. Error z value
#> (Intercept)                          1.83277    0.10464  17.516
#> GENDERFemale                         0.80144    0.07161  11.192
#> EDUCATIONLess than Primary School    0.18665    0.12628   1.478
#> EDUCATIONPrimary School              0.51271    0.12693   4.039
#> EDUCATIONLess than Secondary School  0.55071    0.16331   3.372
#> EDUCATIONSecondary School            0.76047    0.11772   6.460
#> EDUCATIONHigh School                 1.08889    0.13740   7.925
#> EDUCATIONUniversity                  1.00133    0.13926   7.191
#> EDUCATIONPostgraduate                0.68403    0.43250   1.582
#>                                     Pr(>|z|)    
#> (Intercept)                          < 2e-16 ***
#> GENDERFemale                         < 2e-16 ***
#> EDUCATIONLess than Primary School   0.139399    
#> EDUCATIONPrimary School             5.36e-05 ***
#> EDUCATIONLess than Secondary School 0.000746 ***
#> EDUCATIONSecondary School           1.05e-10 ***
#> EDUCATIONHigh School                2.28e-15 ***
#> EDUCATIONUniversity                 6.45e-13 ***
#> EDUCATIONPostgraduate               0.113741    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 213.0720  on 15  degrees of freedom
#> Residual deviance:   1.4187  on  7  degrees of freedom
#> AIC: 107.13
#> 
#> Number of Fisher Scoring iterations: 4
```