Survival analysis is a complex topic! The goal of our coverage is to give you the skills you need to understand results of simple descriptive statistics in this setting, and we will not have time to discuss more complex modeling of survival data in this course. Come see me in the future if you need to analyze survival data!
In many studies, the outcome of interest is the amount of time from an initial observation until the occurrence of some event of interest, e.g.
Typically, the event of interest is called a failure (even if it is a good thing). The time interval between a starting point and the failure is known as the survival time and is often represented by t.
Certain aspects of survival data make data analysis particularly challenging.
When you hear someone talk about censoring, typically they mean right censoring, but there are other types of censoring that you may encounter.
Data may be censored in multiple ways:
Right censored: the usual type of censoring in health data, accommodated by standard models. In this case you see a study participant at the beginning of the study, but you may not follow them until the outcome occurs (or the outcome may never occur)
Left censored: a study participant may have experienced the event before the study begins. For example, you may want to study when kindergarteners reach a certain reading level, but some may be proficient readers already at the start of your study.
Interval censored: you don't know exactly when an event happened, only that it happened between two times. For example, if a patient enrolled in a cancer screening study is cancer-free at the time of the first screening, t1, but has cancer at the time of the second screening, t2, you only know that cancer developed sometime after t1 but before t2.
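As a rough sketch of how these three censoring patterns might be encoded in R, the `Surv()` function from the `survival` package (the package used later in these slides) accepts a `type` argument. All times and indicators below are made up for illustration.

```r
library(survival)

# Hypothetical follow-up times (illustrative values only)

# Right censoring: event = 1 if the failure was observed, 0 if follow-up ended first
right_cens <- Surv(time = c(5, 8, 12), event = c(1, 0, 1), type = "right")

# Left censoring: event = 0 means the failure had already happened by `time`
left_cens <- Surv(time = c(5, 8, 12), event = c(0, 1, 1), type = "left")

# Interval censoring: the failure happened somewhere between time and time2
interval_cens <- Surv(time = c(2, 4, 6), time2 = c(3, 7, 9), type = "interval2")

right_cens
```

Everything else in these slides uses the right-censored form.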
It is important to distinguish between calendar time and participant time. In this plot, an "X" at the end of a line represents a failure, and an "O" represents a censored observation.
The distribution of survival times is characterized by the survival function, represented by S(t). For a continuous random variable T, S(t)=Pr(T>t) and S(t) represents the proportion of individuals who have not yet failed.
The graph of S(t) versus t is called a survival curve. The survival curve shows the proportion of survivors at any given time.
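When nobody is censored, S(t) can be estimated directly as the proportion of subjects whose failure time exceeds t. The sketch below uses made-up failure times and a hypothetical helper `S_hat()`; it is only meant to make the definition concrete, and it does not handle censoring.

```r
# Hypothetical, fully observed failure times (no censoring), in months
times <- c(2, 5, 5, 9, 14)

# Empirical survival function: proportion of subjects with T > t
S_hat <- function(t) mean(times > t)

S_hat(0)   # everyone is still event-free at time 0
S_hat(5)   # subjects failing at exactly t = 5 no longer count as survivors
S_hat(14)  # nobody survives past the largest failure time
```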
Patient | Event time x | Event type |
---|---|---|
1 | 4.5 | Death |
2 | 7.5 | Death |
3 | 8.5 | Censored |
4 | 11.5 | Death |
5 | 13.5 | Censored |
6 | 15.5 | Death |
7 | 16.5 | Death |
8 | 17.5 | Censored |
9 | 19.5 | Death |
10 | 21.5 | Censored |
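One way these ten observations might be entered in R is sketched below. The data frame name `toy` and the 0/1 coding of `status` are choices made here for illustration (1 = death, 0 = censored).

```r
library(survival)

# The ten-patient example: status = 1 for death (failure), 0 for censored
toy <- data.frame(
  patient = 1:10,
  time    = c(4.5, 7.5, 8.5, 11.5, 13.5, 15.5, 16.5, 17.5, 19.5, 21.5),
  status  = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)

# Surv() pairs each time with its event indicator;
# censored times print with a trailing "+"
with(toy, Surv(time, status))
```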
How do we estimate the survival curve for these data?
Perhaps the most popular estimate of a survival curve is the Kaplan-Meier or product-limit estimate. This method is actually fairly intuitive. Define the following quantities.
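$d_{t_i}$: the number who fail at time $t_i$
$I_{t_i}$: the number at risk of failure at time $t_i$ (have not been censored and have not failed)
$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$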
At each time t, the probability of surviving is just 1 − Pr(failing) (remember the complement rule!). Before there are any failures in the data, our estimated $\hat{S}(t) = 1$. At the time of the first failure, this probability falls below 1 and is simply one minus the probability of failing at that time, or $1 - \frac{\#\text{ failures}}{\#\text{ at risk of failing}}$.
After the first failure, things get more complicated. At the time of the second failure, you can calculate $1 - \frac{\#\text{ failures}}{\#\text{ at risk of failing}}$, but this doesn't provide the whole picture, as someone else has already died. In fact, this is the conditional probability of surviving past the second failure time, given that you've made it past the time of the first failure.
I promised you'd see probability laws again! :)
Coming back to you next: the multiplicative rule!
How do you then calculate the total (unconditional) probability of survival? That is just the product of the probability of surviving past the first failure times the conditional probability of surviving beyond the second failure given that you made it past the first, or
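$$\begin{aligned}
\Pr(\text{survive past first and second times}) &= \Pr(\text{survive past first time}) \times \Pr(\text{survive past second time} \mid \text{survive past first time}) \\
&= \left(1 - \frac{\#\text{ failures at failure time 1}}{\#\text{ at risk of failing at failure time 1}}\right) \left(1 - \frac{\#\text{ failures at failure time 2}}{\#\text{ at risk of failing at failure time 2}}\right)
\end{aligned}$$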
If someone is censored, they are no longer counted as at risk at the next failure time: they are taken out of the risk set and out of the calculation from then on.
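The product of conditional survival probabilities can be sketched in a few lines of base R. This uses the ten-patient data from the table above, with a 0/1 `status` coding chosen here for illustration (1 = death, 0 = censored), and it relies on there being no tied times in these data.

```r
# Toy data from the table above: status = 1 for death, 0 for censored
time   <- c(4.5, 7.5, 8.5, 11.5, 13.5, 15.5, 16.5, 17.5, 19.5, 21.5)
status <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)

# One subject leaves the risk set at each time (no ties in these data),
# so the number at risk just before each time counts down from 10
at_risk <- length(time) - seq_along(time) + 1

# d / I is the conditional probability of failing at each time;
# it is 0 at censoring times, so those factors equal 1
cond_surv <- 1 - status / at_risk

# Multiplicative rule: running product of the conditional survival probabilities
km <- cumprod(cond_surv)

round(data.frame(time, d = status, at_risk, km), 3)
```

Censored subjects contribute a factor of 1 at their own time, and their only effect on later factors is to shrink the number at risk.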
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | ||
4.5 | 1 | 0 | ||
7.5 | 1 | 0 | ||
8.5 | 0 | 1 | ||
11.5 | 1 | 0 | ||
13.5 | 0 | 1 | ||
15.5 | 1 | 0 | ||
16.5 | 1 | 0 | ||
17.5 | 0 | 1 | ||
19.5 | 1 | 0 | ||
21.5 | 0 | 1 |
$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{I_{t_i}}\right)$$
Remember $d_t$ is the # who fail at time t, and $I_t$ is the # at risk of failure at time t (have not been censored and have not failed).
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | 10 | 1 |
4.5 | 1 | 0 | ||
7.5 | 1 | 0 | ||
8.5 | 0 | 1 | ||
11.5 | 1 | 0 | ||
13.5 | 0 | 1 | ||
15.5 | 1 | 0 | ||
16.5 | 1 | 0 | ||
17.5 | 0 | 1 | ||
19.5 | 1 | 0 | ||
21.5 | 0 | 1 |
$$\hat{S}(0) = 1$$
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | 10 | 1 |
4.5 | 1 | 0 | 9 | $1 \times \left(1 - \frac{1}{10}\right) = 0.9$ |
7.5 | 1 | 0 | ||
8.5 | 0 | 1 | ||
11.5 | 1 | 0 | ||
13.5 | 0 | 1 | ||
15.5 | 1 | 0 | ||
16.5 | 1 | 0 | ||
17.5 | 0 | 1 | ||
19.5 | 1 | 0 | ||
21.5 | 0 | 1 |
Pr(survive past time 4.5) =Pr(survive past time 0) ×Pr(survive past time 4.5∣survive past time 0)
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | 10 | 1 |
4.5 | 1 | 0 | 9 | 0.9 |
7.5 | 1 | 0 | 8 | $0.9 \times \left(1 - \frac{1}{9}\right) = 0.8$ |
8.5 | 0 | 1 | ||
11.5 | 1 | 0 | ||
13.5 | 0 | 1 | ||
15.5 | 1 | 0 | ||
16.5 | 1 | 0 | ||
17.5 | 0 | 1 | ||
19.5 | 1 | 0 | ||
21.5 | 0 | 1 |
Pr(survive past time 7.5) =Pr(survive past time 4.5) ×Pr(survive past time 7.5∣survive past time 4.5)
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | 10 | 1 |
4.5 | 1 | 0 | 9 | 0.9 |
7.5 | 1 | 0 | 8 | 0.8 |
8.5 | 0 | 1 | 7 | $0.8 \times \left(1 - \frac{0}{8}\right) = 0.8$ |
11.5 | 1 | 0 | ||
13.5 | 0 | 1 | ||
15.5 | 1 | 0 | ||
16.5 | 1 | 0 | ||
17.5 | 0 | 1 | ||
19.5 | 1 | 0 | ||
21.5 | 0 | 1 |
Pr(survive past time 8.5) =Pr(survive past time 7.5) ×Pr(survive past time 8.5∣survive past time 7.5)
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | 10 | 1 |
4.5 | 1 | 0 | 9 | 0.9 |
7.5 | 1 | 0 | 8 | 0.8 |
8.5 | 0 | 1 | 7 | 0.8 |
11.5 | 1 | 0 | 6 | $0.8 \times \left(1 - \frac{1}{7}\right) = 0.69$ |
13.5 | 0 | 1 | ||
15.5 | 1 | 0 | ||
16.5 | 1 | 0 | ||
17.5 | 0 | 1 | ||
19.5 | 1 | 0 | ||
21.5 | 0 | 1 |
Pr(survive past time 11.5) =Pr(survive past time 8.5) ×Pr(survive past time 11.5∣survive past time 8.5)
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | 10 | 1 |
4.5 | 1 | 0 | 9 | 0.9 |
7.5 | 1 | 0 | 8 | 0.8 |
8.5 | 0 | 1 | 7 | 0.8 |
11.5 | 1 | 0 | 6 | 0.69 |
13.5 | 0 | 1 | 5 | $0.69 \times \left(1 - \frac{0}{6}\right) = 0.69$ |
15.5 | 1 | 0 | ||
16.5 | 1 | 0 | ||
17.5 | 0 | 1 | ||
19.5 | 1 | 0 | ||
21.5 | 0 | 1 |
Pr(survive past time 13.5) =Pr(survive past time 11.5) ×Pr(survive past time 13.5∣survive past time 11.5)
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | 10 | 1 |
4.5 | 1 | 0 | 9 | 0.9 |
7.5 | 1 | 0 | 8 | 0.8 |
8.5 | 0 | 1 | 7 | 0.8 |
11.5 | 1 | 0 | 6 | 0.69 |
13.5 | 0 | 1 | 5 | 0.69 |
15.5 | 1 | 0 | 4 | $0.69 \times \left(1 - \frac{1}{5}\right) = 0.552$ |
16.5 | 1 | 0 | ||
17.5 | 0 | 1 | ||
19.5 | 1 | 0 | ||
21.5 | 0 | 1 |
Pr(survive past time 15.5) =Pr(survive past time 13.5) ×Pr(survive past time 15.5∣survive past time 13.5)
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | 10 | 1 |
4.5 | 1 | 0 | 9 | 0.9 |
7.5 | 1 | 0 | 8 | 0.8 |
8.5 | 0 | 1 | 7 | 0.8 |
11.5 | 1 | 0 | 6 | 0.69 |
13.5 | 0 | 1 | 5 | 0.69 |
15.5 | 1 | 0 | 4 | 0.552 |
16.5 | 1 | 0 | 3 | $0.552 \times \left(1 - \frac{1}{4}\right) = 0.414$ |
17.5 | 0 | 1 | ||
19.5 | 1 | 0 | ||
21.5 | 0 | 1 |
Pr(survive past time 16.5) =Pr(survive past time 15.5) ×Pr(survive past time 16.5∣survive past time 15.5)
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | 10 | 1 |
4.5 | 1 | 0 | 9 | 0.9 |
7.5 | 1 | 0 | 8 | 0.8 |
8.5 | 0 | 1 | 7 | 0.8 |
11.5 | 1 | 0 | 6 | 0.69 |
13.5 | 0 | 1 | 5 | 0.69 |
15.5 | 1 | 0 | 4 | 0.552 |
16.5 | 1 | 0 | 3 | 0.414 |
17.5 | 0 | 1 | 2 | $0.414 \times \left(1 - \frac{0}{3}\right) = 0.414$ |
19.5 | 1 | 0 | ||
21.5 | 0 | 1 |
Pr(survive past time 17.5) =Pr(survive past time 16.5) ×Pr(survive past time 17.5∣survive past time 16.5)
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | 10 | 1 |
4.5 | 1 | 0 | 9 | 0.9 |
7.5 | 1 | 0 | 8 | 0.8 |
8.5 | 0 | 1 | 7 | 0.8 |
11.5 | 1 | 0 | 6 | 0.69 |
13.5 | 0 | 1 | 5 | 0.69 |
15.5 | 1 | 0 | 4 | 0.552 |
16.5 | 1 | 0 | 3 | 0.414 |
17.5 | 0 | 1 | 2 | 0.414 |
19.5 | 1 | 0 | 1 | $0.414 \times \left(1 - \frac{1}{2}\right) = 0.207$ |
21.5 | 0 | 1 |
Pr(survive past time 19.5) =Pr(survive past time 17.5) ×Pr(survive past time 19.5∣survive past time 17.5)
t | # Failed: $d_t$ | # Censored | # Left: $I_{t+1}$ | $\hat{S}(t)$ |
---|---|---|---|---|
0.0 | 0 | 0 | 10 | 1 |
4.5 | 1 | 0 | 9 | 0.9 |
7.5 | 1 | 0 | 8 | 0.8 |
8.5 | 0 | 1 | 7 | 0.8 |
11.5 | 1 | 0 | 6 | 0.69 |
13.5 | 0 | 1 | 5 | 0.69 |
15.5 | 1 | 0 | 4 | 0.552 |
16.5 | 1 | 0 | 3 | 0.414 |
17.5 | 0 | 1 | 2 | 0.414 |
19.5 | 1 | 0 | 1 | 0.207 |
21.5 | 0 | 1 | 0 | $0.207 \times \left(1 - \frac{0}{1}\right) = 0.207$ |
Pr(survive past time 21.5) =Pr(survive past time 19.5) ×Pr(survive past time 21.5∣survive past time 19.5)
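The hand calculation can be checked against `survfit()` from the `survival` package. This is a sketch using the ten-patient data with a 0/1 `status` coding chosen here (1 = death, 0 = censored). Because software carries full precision rather than the rounded 0.69, the later estimates should come out near 0.686, 0.549, 0.411, and 0.206 instead of 0.69, 0.552, 0.414, and 0.207.

```r
library(survival)

# Ten-patient example: status = 1 for death, 0 for censored
time   <- c(4.5, 7.5, 8.5, 11.5, 13.5, 15.5, 16.5, 17.5, 19.5, 21.5)
status <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)

# "~ 1" requests a single overall Kaplan-Meier curve (no grouping)
km_fit <- survfit(Surv(time, status) ~ 1)

# summary() reports the estimate at each failure time
km_table <- summary(km_fit)
data.frame(
  time   = km_table$time,
  n_risk = km_table$n.risk,
  surv   = round(km_table$surv, 3)
)
```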
In between failure times, the KM estimate does not change but is constant. This gives the estimated survival function its step-like appearance (we call this type of function a step function).
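To make the step-function idea concrete, base R's `stepfun()` can turn the estimates at the failure times into a function that can be evaluated at any t. This is a sketch for the ten-patient example; the name `km_step` is introduced here, and the values are computed at full precision rather than taken from the rounded table.

```r
# Failure times and numbers at risk from the worked example
fail_times <- c(4.5, 7.5, 11.5, 15.5, 16.5, 19.5)
at_risk    <- c(10, 9, 7, 5, 4, 2)

# One death at each failure time, so each factor is 1 - 1 / (number at risk)
km_at_fail <- cumprod(1 - 1 / at_risk)

# stepfun() builds a piecewise-constant function; with its default settings
# the value at a failure time is the estimate after the drop
km_step <- stepfun(fail_times, c(1, km_at_fail))

# Before the first failure, at it, between failures, and after the last one
km_step(c(3, 4.5, 10, 20))
```

Censoring times (such as 8.5) do not appear as jump points because the estimate does not change there.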
ATCT is an imaging-based biomarker of tumor prognosis.
The `lung` dataset is available from the `survival` package in R. The data contain subjects with advanced lung cancer. Variables include the following.
`time`: Survival time in days
`status`: censoring status 1=censored, 2=dead (failure) (note: another common coding recognized by R is to let 0=censored and 1=failure)
`ph.ecog`: Eastern Cooperative Oncology Group (ECOG) performance score, where 0=asymptomatic, 1=symptomatic but ambulatory, 2=in bed <50% of day, 3=in bed >50% of day but not bedbound, 4=bedbound
    library(survival)
    library(survminer)

    # Kaplan-Meier curves, one per ECOG performance score
    ggsurvplot(
      fit = survfit(Surv(time, status) ~ ph.ecog, data = lung),
      title = "Survival by Performance Score",
      xlab = "Days",
      ylab = "Survival probability"
    )
One quantity of interest is the median survival time.
    survfit(Surv(time, status) ~ ph.ecog, data = lung)
    #> Call: survfit(formula = Surv(time, status) ~ ph.ecog, data = lung)
    #>
    #>    1 observation deleted due to missingness
    #>             n events median 0.95LCL 0.95UCL
    #> ph.ecog=0  63     37    394     348     574
    #> ph.ecog=1 113     82    306     268     429
    #> ph.ecog=2  50     44    199     156     288
    #> ph.ecog=3   1      1    118      NA      NA
Note: because there was only 1 patient with a performance score of 3, the median survival time for that group is simply that patient's survival time, and no confidence interval is provided.
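To see where a median survival time comes from, it can be read off a Kaplan-Meier curve as the first time the estimated survival reaches 0.5 or below. The sketch below does this for the ten-patient example from earlier (0/1 `status` coding chosen here: 1 = death, 0 = censored) and compares it with what `survfit` reports through `quantile()`.

```r
library(survival)

# Ten-patient example: status = 1 for death, 0 for censored
time   <- c(4.5, 7.5, 8.5, 11.5, 13.5, 15.5, 16.5, 17.5, 19.5, 21.5)
status <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)

km_fit <- survfit(Surv(time, status) ~ 1)

# First time at which the estimated survival is at or below 0.5
manual_median <- min(km_fit$time[km_fit$surv <= 0.5])
manual_median

# The median as reported by the survival package
quantile(km_fit, probs = 0.5, conf.int = FALSE)
```

If a curve never drops to 0.5, the median cannot be estimated from the data and is reported as missing.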
The log-rank test is a standard test for comparing groups when we have survival data. Here we use it to test the null hypothesis that there is no difference in survival between the groups, versus the alternative that there is a difference in survival.
    survdiff(Surv(time, status) ~ ph.ecog, data = lung)
    #> Call:
    #> survdiff(formula = Surv(time, status) ~ ph.ecog, data = lung)
    #>
    #> n=227, 1 observation deleted due to missingness.
    #>
    #>             N Observed Expected (O-E)^2/E (O-E)^2/V
    #> ph.ecog=0  63       37   54.153    5.4331    8.2119
    #> ph.ecog=1 113       82   83.528    0.0279    0.0573
    #> ph.ecog=2  50       44   26.147   12.1893   14.6491
    #> ph.ecog=3   1        1    0.172    3.9733    4.0040
    #>
    #>  Chisq= 22  on 3 degrees of freedom, p= 7e-05
Here we see p<0.0001 and reject the null hypothesis, concluding that there is evidence in the data of a difference in survival across the groups.
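The p-value in the output is the upper-tail probability of the chi-square statistic, with degrees of freedom equal to the number of groups minus one. A minimal sketch of that calculation, assuming the same model as above (the object name `lr` is introduced here):

```r
library(survival)

lr <- survdiff(Surv(time, status) ~ ph.ecog, data = lung)

# Degrees of freedom = number of groups - 1
df <- length(lr$n) - 1

# Upper-tail chi-square probability of the log-rank test statistic
p_value <- pchisq(lr$chisq, df = df, lower.tail = FALSE)
p_value
```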
Suppose we have multiple covariates of interest. For example, in the lung cancer data, we also have covariates like `wt.loss` (weight loss in the past 6 months) and `age`. If we want to fit a model to survival data, the Cox proportional hazards model is a popular choice.
The Cox proportional hazards model has a couple of important assumptions that are beyond the scope of this course (non-informative censoring, meaning censoring times are unrelated to the unobserved failure time, and the proportional hazards assumption, meaning that the hazards of failure are proportional across groups). You should explore these more if you plan to model survival data.
    library(gtsummary)

    # Cox model; exponentiate = TRUE reports hazard ratios instead of log hazard ratios
    coxph(Surv(time, status) ~ as.factor(ph.ecog) + wt.loss + age, data = lung) %>%
      tbl_regression(exponentiate = TRUE)
Characteristic | HR | 95% CI | p-value |
---|---|---|---|
as.factor(ph.ecog) | |||
0 | — | — | |
1 | 1.47 | 0.98, 2.20 | 0.064 |
2 | 2.50 | 1.52, 4.10 | <0.001 |
3 | 9.86 | 1.29, 75.5 | 0.028 |
wt.loss | 0.99 | 0.98, 1.01 | 0.3 |
age | 1.01 | 0.99, 1.03 | 0.2 |
HR = Hazard Ratio, CI = Confidence Interval
The estimates in the HR column (labeled `exp(coef)` in the `coxph` summary output) are interpreted as hazard ratios. That is, patients who are in bed less than 50% of the day (group 2) have 2.5 times the hazard of failure of patients with no symptoms, conditional on age and weight loss.
The HR represents the ratio of hazards between two groups at any particular point in time.
The HR compares instantaneous rates of occurrence of the event of interest in those who are still at risk for the event.
HR < 1 indicates reduced hazard of death; HR > 1 indicates an increased hazard of death
So our HR = 2.5 implies that, at any given time, patients with a performance score of 2 (in bed <50% of the day) who are still at risk are dying at around 2.5 times the rate of asymptomatic patients who are still at risk, conditional on age and weight loss.
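The hazard ratios in the table are the exponentiated model coefficients. As a sketch, they can be pulled out of the fitted model directly (the object names `cox_fit`, `hr`, and `ci` are introduced here):

```r
library(survival)

cox_fit <- coxph(Surv(time, status) ~ as.factor(ph.ecog) + wt.loss + age,
                 data = lung)

# Coefficients are log hazard ratios; exponentiate to get hazard ratios
hr <- exp(coef(cox_fit))

# Exponentiate the confidence limits in the same way
ci <- exp(confint(cox_fit))

round(cbind(HR = hr, ci), 2)
```

The row named `as.factor(ph.ecog)2` corresponds to the hazard ratio of about 2.5 discussed above.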
Linear regression: interpret estimate ˆβ1 corresponding to covariate x1 as the expected increase in yi corresponding to a one-unit increase in x1, holding all other factors constant
Logistic regression: interpret estimate exp(ˆβ1) corresponding to covariate x1 as the ratio of odds of yi=1 comparing those with x1=c+1 to those with x1=c (or corresponding to a 1-unit increase in the value of x1), holding all other factors constant
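For comparison, the analogous statement for Cox regression can be written out as follows. This is a sketch in standard notation; the baseline hazard $h_0(t)$ and the hazard function $h(t \mid \cdot)$ are symbols introduced here, not defined elsewhere in these slides.

$$h(t \mid x_1, \ldots, x_p) = h_0(t) \exp(\beta_1 x_1 + \cdots + \beta_p x_p)$$

$$\frac{h(t \mid x_1 = c + 1, x_2, \ldots, x_p)}{h(t \mid x_1 = c, x_2, \ldots, x_p)} = \exp(\beta_1)$$

Under this model, $\exp(\hat{\beta}_1)$ is interpreted as the hazard ratio comparing those with $x_1 = c + 1$ to those with $x_1 = c$, holding all other factors constant. The baseline hazard cancels in the ratio, which is why the ratio does not depend on t.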