class: center, middle, inverse, title-slide

# Visualising numerical data
## Introduction to Global Health Data Science
### Course Website
### Prof. Amy Herring

---

layout: true

<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---

class: middle

# Terminology

---

## Number of variables involved

- Univariate data analysis - distribution of a single variable
- Bivariate data analysis - relationship between two variables
- Multivariable/multivariate data analysis - relationship between many variables at once, sometimes focusing on the relationship between two while conditioning on the others. (Often we reserve *multivariate* for multiple outcomes, and *multivariable* for multiple predictors.)

---

## Types of variables

- **Numerical variables** can be classified as **continuous** or **discrete**: a continuous variable can take any value within an interval, while a discrete variable takes only distinct, countable values (e.g., counts).
- A **categorical** variable is **ordinal** if its levels have a natural ordering.

---

class: middle

# Data

---

## Data: Life Expectancy

.pull-left-wide[
- We focus again on the IHME data on estimated life expectancy in 2019 for a variety of countries and locations worldwide.

```r
glimpse(lifespan2019)
```

```
## Rows: 408
## Columns: 5
## $ location        <chr> "Afghanistan", "Afghanistan", "Albania"…
## $ worldbankregion <chr> "South Asia", "South Asia", "Europe and…
## $ pop             <dbl> 38041757, 38041757, 2854191, 2854191, 4…
## $ sex             <chr> "Male", "Female", "Female", "Male", "Fe…
## $ lifeexp         <dbl> 63.46025, 63.22770, 81.38516, 75.83942,…
```
]

---

## Selected variables

<br>

.midi[
variable          | description
------------------|-------------
`location`        | Country or area name
`worldbankregion` | World region, as classified by World Bank
`pop`             | Estimated 2019 population of the location
`sex`             | Binary sex as reported by the country
`lifeexp`         | Life expectancy from infancy
]

---

## Variable types

<br>

variable          | type
------------------|-------------
`location`        | categorical, not ordinal
`worldbankregion` | categorical, not ordinal
`pop`             | numerical, discrete
`sex`             | categorical, not ordinal
`lifeexp`         | numerical, continuous

In the full data (not subset to 2019), `year` is an additional numerical, discrete variable.

---

class: middle

# Visualizing numerical data

---

## Describing shapes of numerical distributions

- shape:
    - skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
    - modality: unimodal, bimodal, multimodal, uniform
- center: mean (`mean`), median (`median`), mode
- spread: range (`range`), standard deviation (`sd`), variance (square of `sd`), inter-quartile range (`IQR`)
- unusual observations

---

## Measures of center

Consider a random variable `\(x_i\)` that is the estimated life expectancy for country `\(i\)`, `\(i=1,\ldots,n\)` among `\(n\)` countries.

- mean: estimated as `\(\overline{x}=\frac{\sum_{i=1}^n x_i}{n}\)`
- median: middle number/50th %ile of the ordered data
    - for `\(n\)` odd, the median is the middle number, `\(x_{((n+1)/2)}\)`
    - for `\(n\)` even, the median is the mean of the two middle numbers
    - just remember it splits the ordered data in half
- mode: most frequent value in data set

---

## Measures of spread

- variance: average squared distance from mean. The sample variance `\(s^2\)` is estimated in a single sample as `\(\frac{\sum_{i=1}^n (x_i-\overline{x})^2}{n-1}\)`
- standard deviation (sd): square root of variance (on same scale as data)
- range: difference between highest and lowest values, i.e. `\(x_{(n)}-x_{(1)}\)`
- interquartile range: difference between 75th and 25th %iles

---

class: middle

# Relationships among numerical variables

---

## Scatterplot

Previously we viewed a scatterplot showing the relationship of life expectancy of females to that of males in each location.

```r
ggplot(lifeexpwide2019, aes(x = Female, y = Male)) +
  geom_point() +
  labs(
    title = "Life expectancy",
    subtitle = "2019",
    x = "Female life expectancy",
    y = "Male life expectancy"
  )
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-3-1.png" width="35%" style="display: block; margin: auto;" />

---

class: middle

# Histogram

---

## Histogram

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_histogram()
```

```
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-4-1.png" width="40%" style="display: block; margin: auto;" />

---

## Histograms and binwidth

.panelset[
.panel[.panel-name[binwidth = 0.5]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_histogram(binwidth = 0.5)
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-5-1.png" width="40%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 3]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_histogram(binwidth = 3)
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-6-1.png" width="40%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 10]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_histogram(binwidth = 10)
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-7-1.png" width="40%" style="display: block; margin: auto;" />
]
]

---

## Customizing histograms

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_histogram(binwidth = 3) +
  labs(x = "Female life expectancy (years)",
       y = "Frequency",
       title = "2019 Life Expectancy of Women")
```
]
]

---

## Fill with a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp,
*            fill = worldbankregion)) +
  geom_histogram(binwidth = 3,
*                alpha = 0.5) +
  labs(x = "Female life expectancy (years)",
       y = "Frequency",
       title = "2019 Life Expectancy of Women")
```
]
]

---

## Facet with a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img
src="w2-l01-viz-num_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp,
             fill = worldbankregion)) +
  geom_histogram(binwidth = 3,
                 alpha = 0.5) +
  labs(x = "Female life expectancy (years)",
       y = "Frequency",
       title = "2019 Life Expectancy of Women") +
* facet_wrap( ~ worldbankregion, nrow = 3)
```
]
]

---

## Facet with a categorical variable (fixing labels)

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp,
             fill = worldbankregion)) +
  geom_histogram(binwidth = 3,
                 alpha = 0.5) +
  labs(x = "Female life expectancy (years)",
       y = "Frequency",
       fill = "World Bank Region",
       title = "2019 Life Expectancy of Women") +
  facet_wrap( ~ worldbankregion, nrow = 3) +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank())
```
]
]

---

class: middle

# Density plot

---

## Density plot

.pull-left[

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
* geom_density()
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-12-1.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[
A *density function* is a function whose value at any given point can be interpreted as the relative likelihood of values near that point. That is, higher values of the density function indicate values of the random variable that are more likely to be observed.
]

---

## Density plots and adjusting bandwidth

.panelset[
.panel[.panel-name[adjust = 0.5]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
* geom_density(adjust = 0.5)
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-13-1.png" width="40%" style="display: block; margin: auto;" />
]
.panel[.panel-name[adjust = 1]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_density(adjust = 1) # default bandwidth
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-14-1.png" width="40%" style="display: block; margin: auto;" />
]
.panel[.panel-name[adjust = 2]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
* geom_density(adjust = 2)
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-15-1.png" width="40%" style="display: block; margin: auto;" />
]
]

---

## Customizing density plots

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-16-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_density(adjust = 1) +
* labs(
*   x = "Female life expectancy (years)",
*   y = "Density",
*   title = "2019 Life Expectancy of Women")
```
]
]

---

## Adding a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp,
*            fill = worldbankregion)) +
  geom_density(adjust = 1) +
  labs(x = "Female life expectancy (years)",
       y = "Density",
       title = "2019 Life Expectancy of Women",
*      fill = "Region")
```
]
]

---

class: middle

# Box plot

---

## Box plot

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
* geom_boxplot()
```

<img src="w2-l01-viz-num_files/figure-html/boxp-1.png" width="60%"
style="display: block; margin: auto;" />

---

## Box plot

.pull-left[
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[
- Technical specs vary across software packages
- Median: line in middle of box
- Hinges (25th and 75th %iles): edges of box
- Upper whisker extends to the largest data point no more than `\(1.5 \times IQR\)` above the upper hinge (75th %ile); the lower whisker is defined similarly below the lower hinge (25th %ile)
- Outliers (beyond whiskers) plotted individually
]

---

## Box plot

The distribution of population is highly skewed...

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
* ggplot(aes(x = pop)) +
  geom_boxplot()
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-19-1.png" width="50%" style="display: block; margin: auto;" />

---

## Box plot

Here we plot the natural logarithm of the population because of the heavy right skew (driven by China and India). On the log scale, the outliers are the small locations.

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
* ggplot(aes(x = log(pop))) +
  geom_boxplot()
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-20-1.png" width="50%" style="display: block; margin: auto;" />

---

## Customizing box plots

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-21-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_boxplot() +
* labs(
*   x = "Female life expectancy (years)",
*   y = NULL,
*   title = "2019 Life Expectancy of Women") +
* theme(
*   axis.ticks.y = element_blank(),
*   axis.text.y = element_blank())
```
]
]

---

## Adding a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp,
*            y = worldbankregion)) +
  geom_boxplot() +
  labs(
    x = "Female life expectancy (years)",
    y = NULL,
    title = "2019 Life Expectancy of Women",
*   subtitle = "By region"
  )
```
]
]
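
---

## Box plot: checking the whisker rule by hand

A minimal sketch of the whisker rule from the earlier box plot slide, using a small made-up vector `x` rather than the IHME data (the values and all variable names here are assumptions for illustration only). The hinges are computed with R's default `quantile()` method; other software may compute them slightly differently.

```r
# Hypothetical life expectancies (years); made up for illustration,
# not IHME estimates.
x <- c(52, 68, 70, 71, 73, 74, 75, 76, 78, 80, 82)

q1  <- quantile(x, 0.25, names = FALSE)  # lower hinge (25th %ile)
q3  <- quantile(x, 0.75, names = FALSE)  # upper hinge (75th %ile)
iqr <- q3 - q1                           # interquartile range

# Whiskers reach the most extreme data points that lie
# within 1.5 * IQR of the hinges
upper_whisker <- max(x[x <= q3 + 1.5 * iqr])
lower_whisker <- min(x[x >= q1 - 1.5 * iqr])

# Points beyond the whiskers are the ones plotted individually
outliers <- x[x < lower_whisker | x > upper_whisker]
```

With these made-up values, only the lowest point falls beyond the lower whisker, so it is the one a box plot would draw individually.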