
Survival Analysis



Introduction to Global Health Data Science



Prof. Amy Herring

1 / 38

Survival Data

2 / 38

Goal

Survival analysis is a complex topic! The goal of our coverage is to give you the skills you need to understand results of simple descriptive statistics in this setting, and we will not have time to discuss more complex modeling of survival data in this course. Come see me in the future if you need to analyze survival data!

3 / 38

Examples

In many studies, the outcome of interest is the amount of time from an initial observation until the occurrence of some event of interest, e.g.

  • Time from transplant surgery until new organ failure
  • Time to death in a pancreatic cancer trial
  • Time to first sex
  • Time to menarche
  • Time to divorce
  • Time to graduation

Typically, the event of interest is called a failure (even if it is a good thing). The time interval between a starting point and the failure is known as the survival time and is often represented by t.

4 / 38

Characteristics of Survival Data

Certain aspects of survival data make data analysis particularly challenging.

  • Typically, not all the individuals are observed until their times of failure
    • An organ transplant recipient may die in an automobile accident before the new organ fails
    • A student may withdraw to start a multi-billion dollar health company
    • Not everyone gets divorced
    • A pancreatic cancer patient may move to Aitutaki instead of undergoing further treatment
  • In this case, an observation is said to be right censored at the last point of contact

When you hear someone talk about censoring, typically they mean right censoring, but there are other types of censoring that you may encounter.
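As a rough illustration of right censoring, the sketch below builds the two columns a survival dataset usually stores: the observed time and a status indicator. All values and variable names are hypothetical.

```r
# Hypothetical true failure times (in months) and the time at which
# contact with each participant ended (loss to follow-up or study end).
failure_time <- c(5, 12, 30, 8)
last_contact <- c(10, 9, 24, 20)

# We only ever observe whichever of the two comes first.
observed_time <- pmin(failure_time, last_contact)

# status = 1 if the failure was observed, 0 if the record is right censored.
status <- as.integer(failure_time <= last_contact)

data.frame(observed_time, status)
```

For a censored record, all we know is that the true failure time is larger than the observed time.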

5 / 38

Censoring

Data may be censored in multiple ways:

  • Right censored: the usual type of censoring in health data, accommodated by standard models. In this case you see a study participant at the beginning of the study, but you may not follow them until the outcome occurs (or the outcome may never occur)

  • Left censored: a study participant may have experienced the event before the study begins. For example, you may want to study when kindergarteners reach a certain reading level, but some may be proficient readers already at the start of your study.

  • Interval censored: you don't know exactly when an event happened, only that it happened between two times. For example, if a patient enrolled in a cancer screening study is cancer-free at the time of the first screening, $t_1$, but has cancer at the time of the second screening, $t_2$, you only know that cancer developed sometime after $t_1$ but before $t_2$.
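The three censoring types can be encoded with `Surv()` from the survival package. This is only a sketch: the ages and screening times below are made up for illustration.

```r
library(survival)

# Right censored: event = 0 means follow-up ended at `time` without the event.
right <- Surv(time = c(4.5, 8.5), event = c(1, 0))

# Left censored: event = 0 means the event had already happened by `time`
# (e.g. a child who was already a proficient reader at age 5).
left <- Surv(time = c(6.5, 5.0), event = c(1, 0), type = "left")

# Interval censored: the event happened somewhere between time and time2
# (e.g. between two screening visits, in months).
interval <- Surv(time = c(12, 24), time2 = c(24, 36), type = "interval2")

right
left
interval
```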

6 / 38

Calendar Time and Participant Time

It is important to distinguish between calendar time and participant time. In this plot, an "X" at the end of a line represents a failure, and an "O" represents a censored observation.

7 / 38

  • A study may start enrolling patients in September and continue until all 500 patients have been enrolled
  • This is likely to take months or years
  • Time is typically converted from calendar time to participant time (time between enrollment and failure or censoring) before analysis
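A minimal sketch of the conversion from calendar time to participant time, using hypothetical enrollment and end-of-follow-up dates:

```r
# Hypothetical calendar dates for three patients.
enrolled <- as.Date(c("2021-09-01", "2021-11-15", "2022-03-10"))
ended    <- as.Date(c("2022-01-01", "2022-11-15", "2022-06-30"))
failed   <- c(TRUE, FALSE, TRUE)  # FALSE = censored on the `ended` date

# Participant time: days between enrollment and failure or censoring.
participant_days <- as.numeric(difftime(ended, enrolled, units = "days"))

data.frame(participant_days, failed)
```

After the conversion every participant starts at time 0, regardless of when they enrolled.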
8 / 38

The distribution of survival times is characterized by the survival function, represented by S(t). For a continuous random variable T, S(t)=Pr(T>t) and S(t) represents the proportion of individuals who have not yet failed.

The graph of S(t) versus t is called a survival curve. The survival curve shows the proportion of survivors at any given time.
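When nobody is censored, the proportion who have not yet failed can be computed directly. The failure times below are hypothetical, and `surv_prop` is just an illustrative helper name.

```r
# Hypothetical failure times with no censoring.
failure_times <- c(2, 3, 3, 5, 8, 13)

# Proportion of individuals who have not yet failed at time t: Pr(T > t).
surv_prop <- function(t) mean(failure_times > t)

surv_prop(0)   # everyone is still event-free
surv_prop(3)   # 3 of the 6 times are greater than 3
surv_prop(13)  # nobody survives past the largest time
```

With censoring this simple proportion no longer works, which is what motivates a dedicated estimator.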

9 / 38

Survival of Children in Burkina Faso by Vaccination Status

10 / 38

Estimating Survival Curves

11 / 38

Small Study of 10 Patients

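| Patient | Event time x | Event type |
|---|---|---|
| 1 | 4.5 | Death |
| 2 | 7.5 | Death |
| 3 | 8.5 | Censored |
| 4 | 11.5 | Death |
| 5 | 13.5 | Censored |
| 6 | 15.5 | Death |
| 7 | 16.5 | Death |
| 8 | 17.5 | Censored |
| 9 | 19.5 | Death |
| 10 | 21.5 | Censored |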

How do we estimate the survival curve for these data?

12 / 38

Kaplan-Meier Estimate

Perhaps the most popular estimate of a survival curve is the Kaplan-Meier or product-limit estimate. This method is actually fairly intuitive. Define the following quantities.

  • $I_t$: # at risk of failure at time $t$; those who did not fail before $t$ and were not censored before $t$; also known as the risk set
  • $d_t$: # who fail at time $t$
  • $q_t = \frac{d_t}{I_t}$: estimated probability of failing at time $t$
  • $S(t)$: cumulative probability of surviving beyond time $t$, estimated as $\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$.
  • the symbol $\prod$ is for multiplication, e.g. $\prod_{i=1}^{3} x_i = x_1 x_2 x_3$ and $\prod_{i=1}^{5} i = 1 \times 2 \times 3 \times 4 \times 5$.
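The quantities defined above can be computed by hand in a few lines of base R. This is a sketch for the 10-patient example; `km_by_hand` is a hypothetical helper name, and individuals censored exactly at a failure time are counted as still at risk at that time (a common convention, assumed here).

```r
# The 10-patient example: follow-up time and whether a death was observed.
time  <- c(4.5, 7.5, 8.5, 11.5, 13.5, 15.5, 16.5, 17.5, 19.5, 21.5)
death <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)  # 0 = censored

km_by_hand <- function(time, death) {
  fail_times <- sort(unique(time[death == 1]))
  # I_t: number still under observation just before each failure time
  at_risk <- sapply(fail_times, function(t) sum(time >= t))
  # d_t: number who fail at each failure time
  failed <- sapply(fail_times, function(t) sum(time == t & death == 1))
  # q_t: estimated probability of failing at each failure time
  q <- failed / at_risk
  # S_hat: running product of the conditional survival probabilities
  data.frame(t = fail_times, I_t = at_risk, d_t = failed,
             q_t = q, S_hat = cumprod(1 - q))
}

km <- km_by_hand(time, death)
print(round(km, 4))
```

Only failure times appear as rows; censored observations enter the calculation by shrinking the risk set at later failure times.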
13 / 38

Intuitive how?

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

At each time $t$, the probability of surviving is just $1 - \Pr(\text{failing})$ (remember the complement rule!). Before there are any failures in the data, our estimated $\hat{S}(t) = 1$. At the time of the first failure, this probability falls below 1 and is simply one minus the probability of failing at that time, or $1 - \frac{\#\text{ failures}}{\#\text{ at risk of failing}}$.

After the first failure, things get more complicated. At the time of the second failure, you can calculate $1 - \frac{\#\text{ failures}}{\#\text{ at risk of failing}}$, but this doesn't provide the whole picture, as someone else has already died. In fact, this is the conditional probability of surviving past the second failure time given that you've made it past the time of the first failure.

14 / 38

Welcome Back, Probability!

I promised you'd see probability laws again! :)

Coming back to you next: the multiplicative rule!

15 / 38

Multiplicative Rule Saves the Day!

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

How do you then calculate the total (unconditional) probability of survival? That is just the probability of surviving past the first failure time multiplied by the conditional probability of surviving beyond the second failure time given that you made it past the first, or

$$\begin{aligned}
\Pr(\text{survive past first and second times}) &= \Pr(\text{survive past first time}) \, \Pr(\text{survive past second time} \mid \text{survive past first time}) \\
&= \left(1 - \frac{\#\text{ failures at failure time 1}}{\#\text{ at risk of failing at failure time 1}}\right) \left(1 - \frac{\#\text{ failures at failure time 2}}{\#\text{ at risk of failing at failure time 2}}\right)
\end{aligned}$$

If someone is censored, they are no longer at risk of failing at the next failure time and are taken out of the risk set and out of the calculation.
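As a worked instance of the rule, take the 10-patient example: one death among 10 at risk at time 4.5, one death among 9 at risk at time 7.5, one patient censored at time 8.5, and then one death among the 7 still at risk at time 11.5.

```latex
\begin{aligned}
\hat{S}(7.5)  &= \left(1 - \frac{1}{10}\right)\left(1 - \frac{1}{9}\right)
               = \frac{9}{10} \cdot \frac{8}{9} = 0.8 \\
\hat{S}(11.5) &= \hat{S}(7.5) \left(1 - \frac{1}{7}\right)
               = 0.8 \cdot \frac{6}{7} \approx 0.686
\end{aligned}
```

The censored patient contributes no factor of their own; they simply reduce the risk set from 8 to 7 before the next death.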

16 / 38

Kaplan-Meier (KM) Estimate

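| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | | |
| 4.5 | 1 | 0 | | |
| 7.5 | 1 | 0 | | |
| 8.5 | 0 | 1 | | |
| 11.5 | 1 | 0 | | |
| 13.5 | 0 | 1 | | |
| 15.5 | 1 | 0 | | |
| 16.5 | 1 | 0 | | |
| 17.5 | 0 | 1 | | |
| 19.5 | 1 | 0 | | |
| 21.5 | 0 | 1 | | |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

Remember $d_t$ is the # who fail at time $t$, and $I_t$ is the # at risk of failure at time $t$ (have not been censored and have not failed).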

17 / 38
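| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | 10 | 1 |
| 4.5 | 1 | 0 | | |
| 7.5 | 1 | 0 | | |
| 8.5 | 0 | 1 | | |
| 11.5 | 1 | 0 | | |
| 13.5 | 0 | 1 | | |
| 15.5 | 1 | 0 | | |
| 16.5 | 1 | 0 | | |
| 17.5 | 0 | 1 | | |
| 19.5 | 1 | 0 | | |
| 21.5 | 0 | 1 | | |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

$\hat{S}(0) = 1$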

18 / 38
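| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | 10 | 1 |
| 4.5 | 1 | 0 | 9 | $1 \times \left(1 - \frac{1}{10}\right) = 0.9$ |
| 7.5 | 1 | 0 | | |
| 8.5 | 0 | 1 | | |
| 11.5 | 1 | 0 | | |
| 13.5 | 0 | 1 | | |
| 15.5 | 1 | 0 | | |
| 16.5 | 1 | 0 | | |
| 17.5 | 0 | 1 | | |
| 19.5 | 1 | 0 | | |
| 21.5 | 0 | 1 | | |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

$\Pr(\text{survive past time 4.5}) = \Pr(\text{survive past time 0}) \times \Pr(\text{survive past time 4.5} \mid \text{survive past time 0})$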

19 / 38
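| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | 10 | 1 |
| 4.5 | 1 | 0 | 9 | 0.9 |
| 7.5 | 1 | 0 | 8 | $0.9 \times \left(1 - \frac{1}{9}\right) = 0.8$ |
| 8.5 | 0 | 1 | | |
| 11.5 | 1 | 0 | | |
| 13.5 | 0 | 1 | | |
| 15.5 | 1 | 0 | | |
| 16.5 | 1 | 0 | | |
| 17.5 | 0 | 1 | | |
| 19.5 | 1 | 0 | | |
| 21.5 | 0 | 1 | | |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

$\Pr(\text{survive past time 7.5}) = \Pr(\text{survive past time 4.5}) \times \Pr(\text{survive past time 7.5} \mid \text{survive past time 4.5})$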

20 / 38
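| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | 10 | 1 |
| 4.5 | 1 | 0 | 9 | 0.9 |
| 7.5 | 1 | 0 | 8 | 0.8 |
| 8.5 | 0 | 1 | 7 | $0.8 \times \left(1 - \frac{0}{8}\right) = 0.8$ |
| 11.5 | 1 | 0 | | |
| 13.5 | 0 | 1 | | |
| 15.5 | 1 | 0 | | |
| 16.5 | 1 | 0 | | |
| 17.5 | 0 | 1 | | |
| 19.5 | 1 | 0 | | |
| 21.5 | 0 | 1 | | |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

$\Pr(\text{survive past time 8.5}) = \Pr(\text{survive past time 7.5}) \times \Pr(\text{survive past time 8.5} \mid \text{survive past time 7.5})$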

21 / 38
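| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | 10 | 1 |
| 4.5 | 1 | 0 | 9 | 0.9 |
| 7.5 | 1 | 0 | 8 | 0.8 |
| 8.5 | 0 | 1 | 7 | 0.8 |
| 11.5 | 1 | 0 | 6 | $0.8 \times \left(1 - \frac{1}{7}\right) \approx 0.69$ |
| 13.5 | 0 | 1 | | |
| 15.5 | 1 | 0 | | |
| 16.5 | 1 | 0 | | |
| 17.5 | 0 | 1 | | |
| 19.5 | 1 | 0 | | |
| 21.5 | 0 | 1 | | |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

$\Pr(\text{survive past time 11.5}) = \Pr(\text{survive past time 8.5}) \times \Pr(\text{survive past time 11.5} \mid \text{survive past time 8.5})$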

22 / 38
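| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | 10 | 1 |
| 4.5 | 1 | 0 | 9 | 0.9 |
| 7.5 | 1 | 0 | 8 | 0.8 |
| 8.5 | 0 | 1 | 7 | 0.8 |
| 11.5 | 1 | 0 | 6 | 0.69 |
| 13.5 | 0 | 1 | 5 | $0.69 \times \left(1 - \frac{0}{6}\right) = 0.69$ |
| 15.5 | 1 | 0 | | |
| 16.5 | 1 | 0 | | |
| 17.5 | 0 | 1 | | |
| 19.5 | 1 | 0 | | |
| 21.5 | 0 | 1 | | |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

$\Pr(\text{survive past time 13.5}) = \Pr(\text{survive past time 11.5}) \times \Pr(\text{survive past time 13.5} \mid \text{survive past time 11.5})$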

23 / 38
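| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | 10 | 1 |
| 4.5 | 1 | 0 | 9 | 0.9 |
| 7.5 | 1 | 0 | 8 | 0.8 |
| 8.5 | 0 | 1 | 7 | 0.8 |
| 11.5 | 1 | 0 | 6 | 0.69 |
| 13.5 | 0 | 1 | 5 | 0.69 |
| 15.5 | 1 | 0 | 4 | $0.69 \times \left(1 - \frac{1}{5}\right) = 0.552$ |
| 16.5 | 1 | 0 | | |
| 17.5 | 0 | 1 | | |
| 19.5 | 1 | 0 | | |
| 21.5 | 0 | 1 | | |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

$\Pr(\text{survive past time 15.5}) = \Pr(\text{survive past time 13.5}) \times \Pr(\text{survive past time 15.5} \mid \text{survive past time 13.5})$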

24 / 38
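| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | 10 | 1 |
| 4.5 | 1 | 0 | 9 | 0.9 |
| 7.5 | 1 | 0 | 8 | 0.8 |
| 8.5 | 0 | 1 | 7 | 0.8 |
| 11.5 | 1 | 0 | 6 | 0.69 |
| 13.5 | 0 | 1 | 5 | 0.69 |
| 15.5 | 1 | 0 | 4 | 0.552 |
| 16.5 | 1 | 0 | 3 | $0.552 \times \left(1 - \frac{1}{4}\right) = 0.414$ |
| 17.5 | 0 | 1 | | |
| 19.5 | 1 | 0 | | |
| 21.5 | 0 | 1 | | |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

$\Pr(\text{survive past time 16.5}) = \Pr(\text{survive past time 15.5}) \times \Pr(\text{survive past time 16.5} \mid \text{survive past time 15.5})$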

25 / 38
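| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | 10 | 1 |
| 4.5 | 1 | 0 | 9 | 0.9 |
| 7.5 | 1 | 0 | 8 | 0.8 |
| 8.5 | 0 | 1 | 7 | 0.8 |
| 11.5 | 1 | 0 | 6 | 0.69 |
| 13.5 | 0 | 1 | 5 | 0.69 |
| 15.5 | 1 | 0 | 4 | 0.552 |
| 16.5 | 1 | 0 | 3 | 0.414 |
| 17.5 | 0 | 1 | 2 | $0.414 \times \left(1 - \frac{0}{3}\right) = 0.414$ |
| 19.5 | 1 | 0 | | |
| 21.5 | 0 | 1 | | |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

$\Pr(\text{survive past time 17.5}) = \Pr(\text{survive past time 16.5}) \times \Pr(\text{survive past time 17.5} \mid \text{survive past time 16.5})$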

26 / 38
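| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | 10 | 1 |
| 4.5 | 1 | 0 | 9 | 0.9 |
| 7.5 | 1 | 0 | 8 | 0.8 |
| 8.5 | 0 | 1 | 7 | 0.8 |
| 11.5 | 1 | 0 | 6 | 0.69 |
| 13.5 | 0 | 1 | 5 | 0.69 |
| 15.5 | 1 | 0 | 4 | 0.552 |
| 16.5 | 1 | 0 | 3 | 0.414 |
| 17.5 | 0 | 1 | 2 | 0.414 |
| 19.5 | 1 | 0 | 1 | $0.414 \times \left(1 - \frac{1}{2}\right) = 0.207$ |
| 21.5 | 0 | 1 | | |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

$\Pr(\text{survive past time 19.5}) = \Pr(\text{survive past time 17.5}) \times \Pr(\text{survive past time 19.5} \mid \text{survive past time 17.5})$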

27 / 38
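| $t$ | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
|---|---|---|---|---|
| 0.0 | 0 | 0 | 10 | 1 |
| 4.5 | 1 | 0 | 9 | 0.9 |
| 7.5 | 1 | 0 | 8 | 0.8 |
| 8.5 | 0 | 1 | 7 | 0.8 |
| 11.5 | 1 | 0 | 6 | 0.69 |
| 13.5 | 0 | 1 | 5 | 0.69 |
| 15.5 | 1 | 0 | 4 | 0.552 |
| 16.5 | 1 | 0 | 3 | 0.414 |
| 17.5 | 0 | 1 | 2 | 0.414 |
| 19.5 | 1 | 0 | 1 | 0.207 |
| 21.5 | 0 | 1 | 0 | $0.207 \times \left(1 - \frac{0}{1}\right) = 0.207$ |

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$

$\Pr(\text{survive past time 21.5}) = \Pr(\text{survive past time 19.5}) \times \Pr(\text{survive past time 21.5} \mid \text{survive past time 19.5})$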

28 / 38

KM Estimate

In between failure times, the KM estimate stays constant. This gives the estimated survival function its step-like appearance (we call this type of function a step function).
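The step behaviour can be checked with `survfit()` from the survival package on the 10-patient example. This is a sketch; the variable names are ours.

```r
library(survival)

# The 10-patient example: follow-up time and whether a death was observed.
time  <- c(4.5, 7.5, 8.5, 11.5, 13.5, 15.5, 16.5, 17.5, 19.5, 21.5)
death <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)  # 0 = censored

fit <- survfit(Surv(time, death) ~ 1)

# Estimate at every observed time; it only changes at the death times.
data.frame(t = fit$time, n_risk = fit$n.risk, n_event = fit$n.event,
           S_hat = round(fit$surv, 4))

# Between event times the estimate is flat: the value at t = 10 equals
# the value reached at the last death before it (t = 7.5).
flat <- summary(fit, times = c(7.5, 10))$surv
flat
```

The later values differ slightly from the hand-worked table (for example about 0.549 rather than 0.552 at time 15.5) because the table carries the rounded value 0.69 forward, while the software keeps full precision. Calling `plot(fit)` draws the corresponding step curve.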

29 / 38

Tumors in Children, 2012 Neuro-Oncology

ATCT is an imaging-based biomarker of tumor prognosis.

  • Which biomarker values are associated with the best survival?
  • Which values are associated with the worst survival?
  • What is the median survival time in the group with the smallest ATCT values?
    • Median survival is the time at which $\hat{S}(t) = 0.5$
  • If a child is in the group with the largest ATCT values, what is their estimated 5-year survival probability?
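Because a KM curve is a step function, it rarely equals 0.5 exactly, so in practice the median is read off as the first time the curve reaches or drops below 0.5. A sketch using the 10-patient example from earlier (unrounded estimates, variable names ours):

```r
# Death times and numbers at risk just before each death in the
# 10-patient example (one death at each of these times).
fail_times <- c(4.5, 7.5, 11.5, 15.5, 16.5, 19.5)
at_risk    <- c(10, 9, 7, 5, 4, 2)

# KM estimate at each death time.
S_hat <- cumprod(1 - 1 / at_risk)

# Median survival: first time at which the estimate is at or below 0.5.
median_survival <- fail_times[which(S_hat <= 0.5)[1]]
median_survival
```

On a plot this corresponds to drawing a horizontal line at 0.5 and reading the time at which it first crosses the curve.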
30 / 38

Lung Cancer Data

The lung dataset is available from the survival package in R. The data contain subjects with advanced lung cancer. Variables include the following.

time: Survival time in days

status: censoring status 1=censored, 2=dead (failure) (note: another common coding recognized by R is to let 0=censored and 1=failure)

ph.ecog: Eastern Cooperative Oncology Group (ECOG) performance score, where 0=asymptomatic, 1=symptomatic but ambulatory, 2=in bed <50% of day, 3=in bed >50% of day but not bedbound, 4=bedbound
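The two status codings mentioned above can be compared directly. This sketch adds a hypothetical `event` column with the 0/1 coding and checks that both give the same Kaplan-Meier estimate.

```r
library(survival)

lung <- survival::lung

# Coding in the data: 1 = censored, 2 = dead.
table(lung$status)

# Equivalent 0/1 coding: 0 = censored, 1 = dead.
lung$event <- as.integer(lung$status == 2)

fit_12 <- survfit(Surv(time, status) ~ 1, data = lung)
fit_01 <- survfit(Surv(time, event) ~ 1, data = lung)

# Both codings lead to the same estimated survival curve.
same_curve <- isTRUE(all.equal(fit_12$surv, fit_01$surv))
same_curve
```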

31 / 38

KM Plot for Lung Cancer Data

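library(survival)
library(survminer)

# Fit Kaplan-Meier curves separately for each ECOG performance score
fit <- survfit(Surv(time, status) ~ ph.ecog, data = lung)

# Plot the estimated survival curves, one per group
ggsurvplot(
  fit,
  data = lung,
  title = "Survival by Performance Score",
  xlab = "Days",
  ylab = "Survival probability"
)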
32 / 38

Estimating Median Survival

One quantity of interest is the median survival time.

survfit(Surv(time, status) ~ ph.ecog, data = lung)
#> Call: survfit(formula = Surv(time, status) ~ ph.ecog, data = lung)
#>
#>    1 observation deleted due to missingness
#>             n events median 0.95LCL 0.95UCL
#> ph.ecog=0  63     37    394     348     574
#> ph.ecog=1 113     82    306     268     429
#> ph.ecog=2  50     44    199     156     288
#> ph.ecog=3   1      1    118      NA      NA

Note: because there was only 1 patient with ph.ecog=3 (in bed >50% of the day but not bedbound), the median survival time for that group is simply that patient's survival time, and no confidence interval is provided.
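If the medians are needed as numbers rather than printed output, one option is the `table` component of `summary()` for a `survfit` object. A brief sketch:

```r
library(survival)

fit <- survfit(Surv(time, status) ~ ph.ecog, data = survival::lung)

# One row per group; the "median" column holds the median survival times.
medians <- summary(fit)$table[, "median"]
medians
```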

33 / 38

Comparing Survival Across Groups

The log-rank test is a standard test for comparing groups when we have survival data. Here we use it to test the null hypothesis that there is no difference in survival between the groups, versus the alternative that there is a difference in survival.

survdiff(Surv(time, status) ~ ph.ecog, data = lung)
#> Call:
#> survdiff(formula = Surv(time, status) ~ ph.ecog, data = lung)
#>
#> n=227, 1 observation deleted due to missingness.
#>
#>             N Observed Expected (O-E)^2/E (O-E)^2/V
#> ph.ecog=0  63       37   54.153    5.4331    8.2119
#> ph.ecog=1 113       82   83.528    0.0279    0.0573
#> ph.ecog=2  50       44   26.147   12.1893   14.6491
#> ph.ecog=3   1        1    0.172    3.9733    4.0040
#>
#> Chisq= 22  on 3 degrees of freedom, p= 7e-05

Here we see p<0.0001 and reject the null hypothesis, concluding that there is evidence in the data of a difference in survival across the groups.
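The p-value in the output can be recovered from the chi-square statistic stored in the `survdiff` object. A short sketch (object names are ours):

```r
library(survival)

lr <- survdiff(Surv(time, status) ~ ph.ecog, data = survival::lung)

# Chi-square statistic and its degrees of freedom (number of groups - 1).
chisq <- lr$chisq
df <- length(lr$n) - 1

# The upper-tail chi-square probability is the log-rank p-value.
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)

c(chisq = chisq, df = df, p_value = p_value)
```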

34 / 38

Cox Proportional Hazards Model

Suppose we have multiple covariates of interest. For example, in the lung cancer data, we also have covariates like wt.loss (weight loss in the past 6 months) and age. If we want to fit a model to survival data, the Cox proportional hazards model is a popular choice.

The Cox proportional hazards model has a couple of important assumptions that are beyond the scope of this course (non-informative censoring, meaning censoring times are unrelated to the unobserved failure time, and the proportional hazards assumption, meaning that the hazards of failure are proportional across groups). You should explore these more if you plan to model survival data.
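For readers who do want to explore the proportional hazards assumption, the survival package provides `cox.zph()`. The sketch below applies it to a model with the covariates mentioned above; it is a starting point, not a full diagnostic workflow.

```r
library(survival)

cox_fit <- coxph(Surv(time, status) ~ as.factor(ph.ecog) + wt.loss + age,
                 data = survival::lung)

# Test of proportional hazards based on scaled Schoenfeld residuals.
# The result includes a GLOBAL row summarizing the whole model.
ph_check <- cox.zph(cox_fit)
ph_check
```

Small p-values in this output would suggest evidence against proportional hazards for the corresponding row.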

35 / 38

Cox Model for Lung Cancer Data

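library(survival)
library(gtsummary)

# Cox model with performance score (as categories), weight loss, and age
coxph(
  Surv(time, status) ~ as.factor(ph.ecog) + wt.loss + age,
  data = lung
) %>%
  # exponentiate = TRUE reports hazard ratios rather than log hazard ratios
  tbl_regression(exponentiate = TRUE)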
| Characteristic | HR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| as.factor(ph.ecog) | | | |
| 0 | | | |
| 1 | 1.47 | 0.98, 2.20 | 0.064 |
| 2 | 2.50 | 1.52, 4.10 | <0.001 |
| 3 | 9.86 | 1.29, 75.5 | 0.028 |
| wt.loss | 0.99 | 0.98, 1.01 | 0.3 |
| age | 1.01 | 0.99, 1.03 | 0.2 |

¹ HR = Hazard Ratio, CI = Confidence Interval

The estimates in the HR column (labeled exp(coef) in the raw coxph output) are interpreted as hazard ratios. That is, patients who are in bed <50% of the day (group 2) have 2.5 times the hazard of failure of patients with no symptoms (group 0), conditional on age and weight loss.

36 / 38

Hazard Ratio (HR)

  • The HR represents the ratio of hazards between two groups at any particular point in time.

  • The HR compares instantaneous rates of occurrence of the event of interest in those who are still at risk for the event.

  • HR < 1 indicates reduced hazard of death; HR > 1 indicates an increased hazard of death

So our HR = 2.5 implies that, at any given time, patients who are in bed <50% of the day (ph.ecog=2) and still at risk are dying at around 2.5 times the rate of asymptomatic patients who are still at risk, conditional on age and weight loss.
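Hazard ratios can also be pulled straight from the fitted model by exponentiating its coefficients. A sketch using the lung cancer model (object names are ours):

```r
library(survival)

cox_fit <- coxph(Surv(time, status) ~ as.factor(ph.ecog) + wt.loss + age,
                 data = survival::lung)

# Hazard ratios are the exponentiated coefficients.
hr <- exp(coef(cox_fit))

# 95% confidence limits, also moved to the hazard-ratio scale.
hr_ci <- exp(confint(cox_fit))

round(cbind(HR = hr, hr_ci), 2)
```

The second row is the comparison of ph.ecog=2 with ph.ecog=0, the hazard ratio of about 2.5 discussed above.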

37 / 38

Interpretation Summary (No Interactions in Model)

  • Linear regression: interpret estimate $\hat{\beta}_1$ corresponding to covariate $x_1$ as the expected increase in $y_i$ corresponding to a one-unit increase in $x_1$, holding all other factors constant

  • Logistic regression: interpret estimate $\exp(\hat{\beta}_1)$ corresponding to covariate $x_1$ as the ratio of odds of $y_i = 1$ comparing those with $x_1 = c + 1$ to those with $x_1 = c$ (or corresponding to a 1-unit increase in the value of $x_1$), holding all other factors constant

  • Cox proportional hazards (survival) model: interpret estimate $\exp(\hat{\beta}_1)$ corresponding to covariate $x_1$ as the ratio of hazards of failure comparing those with $x_1 = c + 1$ to those with $x_1 = c$ (or corresponding to a 1-unit increase in the value of $x_1$), holding all other factors constant
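To see where the Cox interpretation comes from, it may help to write the model out. The notation here is an assumption on our part: $h_0(t)$ denotes the baseline hazard and $p$ the number of covariates.

```latex
h(t \mid x_1, \dots, x_p) = h_0(t) \exp(\beta_1 x_1 + \dots + \beta_p x_p)

\frac{h(t \mid x_1 = c + 1, x_2, \dots, x_p)}{h(t \mid x_1 = c, x_2, \dots, x_p)}
  = \frac{h_0(t) \exp\{\beta_1 (c + 1) + \beta_2 x_2 + \dots + \beta_p x_p\}}
         {h_0(t) \exp\{\beta_1 c + \beta_2 x_2 + \dots + \beta_p x_p\}}
  = \exp(\beta_1)
```

The baseline hazard and the other covariates cancel, so the ratio does not depend on $t$, which is the proportional hazards assumption in action.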
38 / 38
