Visualising numerical data

# Visualising numerical data
## <br><br> Introduction to Global Health Data Science
### <a href="https://sta198f2021.github.io/website/">Course Website</a>
### <br> Prof. Amy Herring

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://sta198f2021.github.io/website/" target="_blank">Back to website</a>
</span>
</div>

---

# Terminology

---

## Number of variables involved

- Univariate data analysis - distribution of single variable
- Bivariate data analysis - relationship between two variables
- Multivariable/multivariate data analysis - relationship between many variables at once, sometimes focusing on the relationship between two while conditioning for others. (Often we reserve *multivariate* for multiple outcomes, and *multivariable* for multiple predictors.)

---

## Types of variables

- **Numerical variables** can be classified as **continuous** or **discrete** based on whether or not the variable can take on an infinite number of values or only a finite number of distinct values (e.g., counts), respectively.
- If the variable is **categorical**, we can determine if it is **ordinal** based on whether or not the levels have a natural ordering.

---

# Data

---

## Data: Life Expectancy

- We focus again on the IHME data on estimated life expectancy for a variety of countries and locations worldwide, in the year 2019.

```r
glimpse(lifespan2019)
```

```
## Rows: 408
## Columns: 5
## $ location        <chr> "Afghanistan", "Afghanistan", "Albania"…
## $ worldbankregion <chr> "South Asia", "South Asia", "Europe and…
## $ pop             <dbl> 38041757, 38041757, 2854191, 2854191, 4…
## $ sex             <chr> "Male", "Female", "Female", "Male", "Fe…
## $ lifeexp         <dbl> 63.46025, 63.22770, 81.38516, 75.83942,…
```

]

---

## Selected variables

<br>

.midi[
variable        | description
----------------|-------------
`location`      |	Country or area name
`worldbankregion` |	World region, as classified by World Bank
`pop`	          | Estimated 2019 population of the location
`sex`	          | Binary sex as reported by the country
`lifeexp`       |	Life expectancy from infancy
]

---

## Variable types

<br>

variable        | type
----------------|-------------
`location`   |	categorical, not ordinal
`worldbankregion` |	categorical, not ordinal
`pop`	        | numerical, discrete
`sex`	        | categorical, not ordinal
`lifeexp`         |	numerical, continuous

In the full data (not subset to 2019), `year` is an additional numerical, discrete variable.

---

# Visualizing numerical data

---

## Describing shapes of numerical distributions

- shape:
    - skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
    - modality: unimodal, bimodal, multimodal, uniform
- center: mean (`mean`), median (`median`), mode 
- spread: range (`range`), standard deviation (`sd`), variance (square of `sd`), inter-quartile range (`IQR`)
- unusual observations

---

## Measures of center

Consider a random variable `$x_i$` that is the estimated life expectancy for country `$i$`, `$i=1,\ldots,n$` among `$n$` countries.

- mean: estimated as `$\overline{x}=\frac{\sum_{i=1}^n x_i}{n}$`

- median: middle number/50th %ile, `$x_{(n+1)/2}$`
  - for `$n$` odd, the median is the middle number
  - for `$n$` even, the median is the mean of the two middle numbers
  - just remember it splits the ordered data in half
  
- mode: most frequent value in data set

---

## Measures of spread

- variance: average squared distance from mean

- standard deviation (sd): square root of variance (on same scale as data). The sample variance `$s^2$` is estimated in a single sample as `$\frac{\sum_{i=1}^n (x_i-\overline{x})^2}{n-1}$`

- range: difference between highest and lowest values, e.g. `$x_{(n)}-x_{(1)}$`

- interquartile range: difference between 75th and 25th %iles

---

# Relationships among numerical variables

---

## Scatterplot

Previously we viewed a scatterplot showing the relationship of life expectancy of females to that of males in each location.

```r
ggplot(lifeexpwide2019, aes(x = Female, y = Male)) +
  geom_point() +
  labs(
    title = "Life expectancy",
    subtitle = "2019",
    x = "Female life expectancy",
    y = "Male life expectancy"
  )
```

---

# Histogram

---

## Histogram

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_histogram() 
```

```
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
```

---

## Histograms and binwidth

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_histogram(binwidth = 0.5) 
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-5-1.png" width="40%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 3]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_histogram(binwidth = 3) 
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-6-1.png" width="40%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 10]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_histogram(binwidth = 10) 
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-7-1.png" width="40%" style="display: block; margin: auto;" />
]
]

---

## Customizing histograms

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_histogram(binwidth = 3) +
  labs(x = "Female life expectancy (years)",
       y = "Frequency",
       title = "2019 Life Expectancy of Women")
```
]
]

---

## Fill with a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp,
*            fill = worldbankregion)) +
  geom_histogram(binwidth = 3,
*                alpha = 0.5) +
  labs(x = "Female life expectancy (years)",
       y = "Frequency",
       title = "2019 Life Expectancy of Women")
```
]
]

---

## Facet with a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp,
             fill = worldbankregion)) +
  geom_histogram(binwidth = 3,
                 alpha = 0.5) +
  labs(x = "Female life expectancy (years)",
       y = "Frequency",
       title = "2019 Life Expectancy of Women") +
* facet_wrap( ~ worldbankregion, nrow = 3)
```
]
]

---

## Facet with a categorical variable (fixing labels)

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp,
             fill = worldbankregion)) +
  geom_histogram(binwidth = 3,
                 alpha = 0.5) +
  labs(x = "Female life expectancy (years)",
       y = "Frequency",
       fill = "World Bank Region",
       title = "2019 Life Expectancy of Women") +
  facet_wrap( ~ worldbankregion, nrow = 3) +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank())
```
]
]

---

# Density plot

---

## Density plot

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
* geom_density()
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-12-1.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[
A *density function* is a function whose value at any given point can be interpreted as providing a relative likelihood of values.  That is, higher values of the density function indicate values of the random variable that are more likely to be observed.
]
---

## Density plots and adjusting bandwidth

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
* geom_density(adjust = 0.5)
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-13-1.png" width="40%" style="display: block; margin: auto;" />
]
.panel[.panel-name[adjust = 1]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_density(adjust = 1)  #<< default bandwidth
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-14-1.png" width="40%" style="display: block; margin: auto;" />
]
.panel[.panel-name[adjust = 2]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
* geom_density(adjust = 2)
```

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-15-1.png" width="40%" style="display: block; margin: auto;" />
]
]

---

## Customizing density plots

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-16-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_density(adjust = 1) +
* labs(
*   x = "Female life expectancy (years)",
*   y = "Density",
*   title = "2019 Life Expectancy of Women")
```
]
]

---

## Adding a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp,
*            fill = worldbankregion)) +
  geom_density(adjust = 1) +
  labs(x = "Female life expectancy (years)",
       y = "Density",
       title = "2019 Life Expectancy of Women",
*      fill = "Region")
```
]
]

---

# Box plot

---

## Box plot

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
* geom_boxplot()
```

---

## Box plot

<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" />
]

.pull-right[
- Technical specs vary across software packages
- Median: line in middle of box
- Hinges (25th and 75th %iles): edges of box
- Upper whisker extends to largest data point no more than `$1.5 \times IQR$` from the hinge (75th %ile); similar definition for lower whisker
- Outliers (beyond whiskers) plotted individually
]

---

## Box plot

The distribution of population is highly skewed...

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
* ggplot(aes(x = pop)) +
  geom_boxplot()
```

---

## Box plot

Here we plot the natural logarithm of the population, due to the heavy skew (China, India). Now the outliers are the small locations.

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
* ggplot(aes(x = log(pop))) +
  geom_boxplot()
```

---

## Customizing box plots

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-21-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp)) +
  geom_boxplot() +
* labs(
*   x = "Female life expectancy (years)",
*   y = NULL,
    title = "2019 Life Expectancy of Women") + #<<)
*   theme(
*     axis.ticks.y = element_blank(),
*     axis.text.y = element_blank())
```
]
]

---

## Adding a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img src="w2-l01-viz-num_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
lifespan2019 %>%
  filter(sex == 'Female') %>%
  ggplot(aes(x = lifeexp,
*            y = worldbankregion)) +
  geom_boxplot() +
  labs(
    x = "Female life expectancy (years)",
    y = NULL, 
    title = "2019 Life Expectancy of Women",
*   subtitle = "By region"
  ) 
```
]
]