class: center, middle, inverse, title-slide

# Visualising categorical data
## Introduction to Global Health Data Science
### Course Website
### Prof. Amy Herring

---
layout: true

<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---
class: middle

# Recap

---

## Variables

- **Numerical** variables can be classified as **continuous** or **discrete** based on whether the variable can take any value within a range (continuous) or only distinct, countable values (discrete).
- If the variable is **categorical**, we can determine if it is **ordinal** based on whether or not the levels have a natural ordering.

---

### Data

We consider data from the Global Adult Tobacco Survey (GATS), which is designed to provide nationally representative data on non-institutionalized people 15 years and older. This survey is a global standard for systematically monitoring adult tobacco use and is produced by the Centers for Disease Control and Prevention (CDC) in collaboration with the World Health Organization (WHO), RTI International, and Johns Hopkins University.

China has the largest smoking population in the world and accounts for roughly 40% of tobacco consumption worldwide. We will focus on GATS data from China in 2018 (the most recent survey year), but note that data from other countries are available from the [WHO's Microdata Repository](https://extranet.who.int/ncdsmicrodata/index.php/home).

---

### Data

```r
glimpse(gats)
```

```
## Rows: 19,376
## Columns: 18
## $ CASEID         <dbl> 601010, 601012, 601013, 601014, 601015, …
## $ RESIDENCE      <fct> Urban, Urban, Urban, Urban, Urban, Urban…
## $ PROVINCE       <fct> Beijing, Beijing, Beijing, Beijing, Beij…
## $ REGION6        <fct> North, North, North, North, North, North…
## $ REGION3        <fct> East, East, East, East, East, East, East…
## $ AGE            <dbl> 33.95342, 35.92877, 70.52055, 56.95342, …
## $ GENDER         <fct> Female, Male, Male, Male, Female, Female…
## $ CURRENTSMOKE   <fct> No, No, No, No, Yes, No, No, No, No, Yes…
## $ EDUCATION      <fct> High School, Postgraduate, Secondary Sch…
## $ OCCUPATION     <fct> Other, Other, Retired, Other, Retired, B…
## $ AGESTART       <dbl> NA, NA, NA, NA, 20, NA, NA, NA, NA, 14, …
## $ CIGS_DAY       <dbl> NA, NA, NA, NA, 10, NA, NA, NA, NA, 5, N…
## $ HEARDOFECIG    <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, No, Ye…
## $ ECIGUSE        <fct> NA, Not at All, Not at All, Not at All, …
## $ TRYSTOP        <fct> NA, NA, NA, NA, Yes, NA, NA, NA, NA, No,…
## $ HOMESMOKERULES <fct> Never Allowed, Never Allowed, Never Allo…
## $ SMOKESICK      <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, …
## $ SMOKECANCER    <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, …
```

---

### Selected variables

<br>

.midi[
variable        | description
----------------|-------------
`CURRENTSMOKE`  | yes, no, or don't know
`AGE`           | computed from date of birth
`EDUCATION`     | highest level of education completed
`GENDER`        | interviewer instructions were "Record gender from observation. Ask if necessary"; options for male, female, missing/NA
`PROVINCE`      | province in which the individual resides
]

The file contains other variables as well. Sample survey weights are not included, although they should be used to obtain nationally representative estimates (our unweighted estimates are fairly close for the quantities we consider today).
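
---

### Sketch: weighted versus unweighted proportions

As a rough illustration of why survey weights matter, the sketch below compares an unweighted and a weighted smoking proportion. The data frame `toy` and its `WEIGHT` column are hypothetical (the GATS file used here has no weights), and the values are made up purely to show the arithmetic.

```r
# Hypothetical toy data: five respondents with made-up survey weights.
# WEIGHT is an assumed column name; it is not part of the gats data.
toy <- data.frame(
  CURRENTSMOKE = c("Yes", "No", "No", "Yes", "No"),
  WEIGHT       = c(2, 1, 1, 4, 2)
)

# Indicator equal to 1 for current smokers and 0 otherwise
is_smoker <- as.numeric(toy$CURRENTSMOKE == "Yes")

# Unweighted proportion: every respondent counts equally
unweighted <- mean(is_smoker)

# Weighted proportion: each respondent counts in proportion to WEIGHT
weighted <- weighted.mean(is_smoker, w = toy$WEIGHT)

c(unweighted = unweighted, weighted = weighted)
```

In this toy example the smokers carry larger weights, so the weighted proportion is higher than the unweighted one. A full analysis of a complex survey would also account for the sampling design, which this sketch does not attempt.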

---
class: middle

# Bar plot

---

## Bar plot

```r
ggplot(gats, aes(x = CURRENTSMOKE)) +
  geom_bar()
```

<img src="w2-l02-viz-cat_files/figure-html/unnamed-chunk-3-1.png" width="60%" style="display: block; margin: auto;" />

---

## Stacked bar plot

```r
ggplot(gats, aes(x = CURRENTSMOKE,
*                fill = GENDER)) +
  geom_bar()
```

<img src="w2-l02-viz-cat_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" />

---

## Segmented bar plot

```r
ggplot(gats, aes(x = CURRENTSMOKE, fill = GENDER)) +
* geom_bar(position = "fill")
```

<img src="w2-l02-viz-cat_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" />

---

.question[
Which bar plot is a more useful representation for visualizing the relationship between smoking and gender?
]

.pull-left[
<img src="w2-l02-viz-cat_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="w2-l02-viz-cat_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Customizing bar plots

.panelset[

.panel[.panel-name[Plot]
<img src="w2-l02-viz-cat_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Code]

```r
ggplot(gats, aes(y = CURRENTSMOKE, fill = GENDER)) +
  geom_bar(position = "fill") +
* labs(
*   x = "Proportion",
*   y = "Current Smoker?",
*   fill = "Gender",
*   title = "Smoking by Gender",
*   subtitle = "2018"
* )
```
]

]

---

## Side-by-side bar plot

```r
ggplot(gats, aes(x = CURRENTSMOKE,
*                fill = GENDER)) +
  geom_bar(position = position_dodge())
```

<img src="w2-l02-viz-cat_files/figure-html/sidebyside-1.png" width="60%" style="display: block; margin: auto;" />

---
class: middle

# Relationships between numerical and categorical variables

---

## Already talked about...

- Colouring and faceting histograms and density plots
- Side-by-side box plots

---

## Violin plots

Violin plots are like box plots, but instead of showing the quartiles (25th, 50th, and 75th percentiles), they show rotated density plots on each side.

---

## Violin plots

```r
ggplot(gats, aes(x = AGE, y = EDUCATION)) +
  geom_violin()
```

<img src="w2-l02-viz-cat_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />

---

## Ridge plots

Ridge plots also show density estimates across categorical groups, drawn as partially overlapping density curves stacked along a shared axis.

---

## Ridge plots

```r
library(ggridges)
ggplot(gats, aes(x = AGE, y = EDUCATION, fill = EDUCATION, color = EDUCATION)) +
  geom_density_ridges(alpha = 0.5)
```

<img src="w2-l02-viz-cat_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />
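
---

## Violin plots with quartiles

Because a violin plot on its own does not mark the quartiles, one option is to overlay a narrow box plot on each violin. The sketch below is only an illustration: the data frame `sim` and its values are simulated stand-ins (not GATS data), with column names chosen to mirror `AGE` and `EDUCATION`.

```r
library(ggplot2)

# Simulated stand-in for the survey data (hypothetical values, not GATS)
set.seed(198)
sim <- data.frame(
  EDUCATION = rep(c("Primary", "Secondary", "College"), each = 100),
  AGE       = c(rnorm(100, 60, 12), rnorm(100, 45, 14), rnorm(100, 35, 10))
)

# Violin shows the density; the narrow box plot adds the quartiles.
# Horizontal geoms like this assume ggplot2 3.3.0 or later.
p <- ggplot(sim, aes(x = AGE, y = EDUCATION)) +
  geom_violin() +
  geom_boxplot(width = 0.1)

p
```

Keeping the box plot narrow (`width = 0.1`) lets the density shape stay visible while the median and quartiles remain readable.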