Bootstrapping LOWESS

In chapter 1 of their 2016 book Computer Age Statistical Inference, Bradley Efron and Trevor Hastie use a study of kidney function as an example, comparing results from linear regression and lowess (locally weighted scatterplot smoothing). The purpose of this blog post is to reproduce their results using R.

I will not copy the plots from the book here (for copyright reasons), but you can download a pdf of the book for free from a website created by the authors: https://web.stanford.edu/~hastie/CASI/index.html. I am specifically interested in Figures 1.1 and 1.2, as well as Table 1.1.

Let’s first grab the kidney function data (also kindly provided by the authors on their website):

library(tidyverse)

kidney_data <- read_delim('https://web.stanford.edu/~hastie/CASI_files/DATA/kidney.txt', delim = ' ')

kidney_data
## # A tibble: 157 x 2
##      age   tot
##    <dbl> <dbl>
##  1    18  2.44
##  2    19  3.86
##  3    19 -1.22
##  4    20  2.3 
##  5    21  0.98
##  6    21 -0.5 
##  7    21  2.74
##  8    22 -0.12
##  9    22 -1.21
## 10    22  0.99
## # ... with 147 more rows

Just getting the linear regression line is straightforward:

base_plot <- kidney_data %>%
  ggplot(aes(x = age, y = tot)) +
  geom_point(shape = 8, color = 'blue', size = 1) +
  theme_bw() +
  scale_x_continuous(breaks = seq(20, 90, 10)) +
  scale_y_continuous(breaks = seq(-6, 4, 2)) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

linear_plot <- base_plot +
  geom_smooth(method = 'lm', color = 'green', se = FALSE)

linear_plot

Here I separated out the plot of just the data (base_plot) so that I can use it for both the linear regression plot and the lowess plot. I also went to a little bit of trouble to make the plot in a similar style to Figure 1.1 in the book. It is not an exact copy, but close enough.

I turned off the standard-error display in geom_smooth (se = FALSE), since the default is to draw the error as a continuous ribbon, while Figure 1.1 only shows the error at a few discrete points. Let’s next calculate the standard error at those points:

library(broom)

linear_model <- lm(tot ~ age, data = kidney_data)

error_points <- data.frame(age = seq(20, 80, 10))

# se_fit = TRUE is needed in recent versions of broom to get the .se.fit column
linear_error <- augment(linear_model, newdata = error_points, se_fit = TRUE)

linear_error
## # A tibble: 7 x 3
##     age .fitted .se.fit
## * <dbl>   <dbl>   <dbl>
## 1    20   1.29    0.207
## 2    30   0.502   0.155
## 3    40  -0.284   0.147
## 4    50  -1.07    0.189
## 5    60  -1.86    0.258
## 6    70  -2.64    0.337
## 7    80  -3.43    0.420

Now add the error bars to the linear plot:

linear_plot +
  geom_segment(data = linear_error, 
               aes(x = age, xend = age,
                   y = .fitted - 2*.se.fit, yend = .fitted + 2*.se.fit))

The plot shows \(\pm2\) standard errors at each of the 7 selected age values. This is a good replication of Figure 1.1.

For Figure 1.2, the authors use R’s lowess function. There is a newer, more general implementation in R named loess, and I assume there is some mapping between the two sets of parameters, but it is not immediately obvious, so I will stick with lowess.
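
For what it’s worth, my best guess at the correspondence is sketched below; this is an approximation on my part (span, degree, and family are the loess arguments that seem to play the roles of lowess’s f and robustness iterations), not an exact equivalence:

# Rough loess analogue of lowess(age, tot, f = 1/3): span plays the
# role of f, degree = 1 gives locally linear fits, and family =
# 'symmetric' mimics lowess's robustness iterations. The curves
# should be close but not identical.
loess_model <- loess(tot ~ age, data = kidney_data,
                     span = 1/3, degree = 1, family = 'symmetric')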

Again, obtaining just the lowess fit is fairly straightforward, except for the fact that geom_smooth uses loess and not lowess, so we have to calculate the fit ‘by hand’ (i.e. outside of ggplot) and then add it to the plot:

lowess_model <- lowess(kidney_data$age, kidney_data$tot, f = 1/3) %>%
  as_tibble()

lowess_plot <- base_plot +
  geom_line(data = lowess_model, aes(x = x, y = y), color = 'green')

lowess_plot

The f = 1/3 passed to lowess is the smoother span: the fraction of the data that influences each local fit, and hence the parameter governing the smoothness of the curve. Now, to get the standard errors for the lowess fit, the authors use the bootstrap. This is not the place for a full-fledged introduction to the bootstrap, but in brief it is a way of approximating a standard error by repeatedly resampling with replacement from the original data and recomputing the statistic of interest on each resample.
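
To make the resampling idea concrete, here is a minimal ‘by hand’ bootstrap of the standard error of the mean of tot, using only base R (an illustration of the idea, not part of the replication):

# Resample tot with replacement 1000 times, compute the mean of each
# resample, and take the standard deviation of those means as the
# bootstrap estimate of the standard error of the mean.
boot_means <- replicate(1000, mean(sample(kidney_data$tot, replace = TRUE)))
sd(boot_means)

There are many ways to run a bootstrap sample in R: for the replication itself I’ll use bootstraps from the rsample package: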

library(rsample)

bootstrap_samples <- bootstraps(kidney_data, times = 25)

bootstrap_samples
## # Bootstrap sampling 
## # A tibble: 25 x 2
##    splits           id         
##    <list>           <chr>      
##  1 <split [157/65]> Bootstrap01
##  2 <split [157/57]> Bootstrap02
##  3 <split [157/57]> Bootstrap03
##  4 <split [157/59]> Bootstrap04
##  5 <split [157/54]> Bootstrap05
##  6 <split [157/58]> Bootstrap06
##  7 <split [157/62]> Bootstrap07
##  8 <split [157/55]> Bootstrap08
##  9 <split [157/62]> Bootstrap09
## 10 <split [157/60]> Bootstrap10
## # ... with 15 more rows

Each sample (split) contains an “analysis” data set, which is a resample with replacement of the same size as the original kidney_data, and an “assessment” data set containing all the original rows that did not get drawn into the resample.
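
A quick check of the first split confirms this (the assessment count of 65 matches the [157/65] shown in the printout above):

first_split <- bootstrap_samples$splits[[1]]
nrow(analysis(first_split))    # 157: a resample the size of the original data
nrow(assessment(first_split))  # 65: the original rows not drawn into this resample

Now, for each of these bootstrap samples, we want to apply a lowess fit, so let’s write a function that does that and then map it over the bootstraps: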

# Fit a lowess curve to the analysis (resampled) part of a split
lowess_fit <- function(split){
  resample <- analysis(split)
  lowess(resample$age, resample$tot, f = 1/3)
}

bootstrap_samples %>%
  mutate(lowess_points = map(splits, lowess_fit)) %>%
  select(id, lowess_points) %>%
  mutate(lowess_points = map(lowess_points, as_tibble)) %>%
  unnest(lowess_points)
## # A tibble: 3,925 x 3
##    id              x     y
##    <chr>       <dbl> <dbl>
##  1 Bootstrap01    18 1.21 
##  2 Bootstrap01    19 1.16 
##  3 Bootstrap01    19 1.16 
##  4 Bootstrap01    20 1.07 
##  5 Bootstrap01    21 0.955
##  6 Bootstrap01    21 0.955
##  7 Bootstrap01    21 0.955
##  8 Bootstrap01    21 0.955
##  9 Bootstrap01    21 0.955
## 10 Bootstrap01    22 0.953
## # ... with 3,915 more rows

Let’s plot these 25 bootstrapped replications of the lowess fit, along with the original data. I’ll add a distinct() call, since multiple identical points make no difference for plotting purposes:

lowess_25 <- bootstrap_samples %>%
  mutate(lowess_points = map(splits, lowess_fit)) %>%
  select(id, lowess_points) %>%
  mutate(lowess_points = map(lowess_points, as_tibble)) %>%
  unnest(lowess_points) %>%
  distinct()

base_plot +
  geom_line(data = lowess_25, aes(x = x, y = y, color = id),
            alpha = 0.5) +
  theme(legend.position = 'none')

This gives a rough idea of the standard error of the lowess fit. Now, as did Efron and Hastie, we’ll take 250 bootstrap samples and calculate the standard error. First, we need a function that will give the lowess fit at the same 7 points where we calculated the standard error of the linear model. Since lowess outputs a series of \(x\) and \(y\) values, I’ll just use linear interpolation between these values:

lowess_predict <- function(fit){
  # Linearly interpolate the bootstrapped lowess curve at ages 20, 30, ..., 80
  approx(fit %>% as_tibble() %>% distinct(), xout = seq(20, 80, 10), rule = 2)
}

(Setting rule = 2 means that points requested outside the input range are given the value at the nearest data extreme rather than NA, which is necessary since a bootstrap sample won’t always contain the extrema of the original data set.)
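
A toy call shows the behavior; here 15 and 35 fall outside the input range and pick up the values at the nearest endpoints:

approx(x = c(20, 30), y = c(1, 0), xout = c(15, 25, 35), rule = 2)
## $x
## [1] 15 25 35
## 
## $y
## [1] 1.0 0.5 0.0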

Now, for the full 250 replications. I’ll break this into 2 steps so it is clearer what is going on.

step_1 <- bootstraps(kidney_data, times = 250) %>%
  mutate(lowess_points = map(splits, lowess_fit)) %>%
  mutate(prediction = map(lowess_points, lowess_predict)) %>%
  select(id, prediction) %>%
  mutate(prediction = map(prediction, as_tibble)) %>%
  unnest(prediction)

step_1
## # A tibble: 1,750 x 3
##    id               x      y
##    <chr>        <dbl>  <dbl>
##  1 Bootstrap001    20  1.74 
##  2 Bootstrap001    30  0.723
##  3 Bootstrap001    40 -0.383
##  4 Bootstrap001    50 -1.30 
##  5 Bootstrap001    60 -2.22 
##  6 Bootstrap001    70 -3.13 
##  7 Bootstrap001    80 -4.02 
##  8 Bootstrap002    20  1.84 
##  9 Bootstrap002    30  0.595
## 10 Bootstrap002    40 -0.933
## # ... with 1,740 more rows

step_2 <- step_1 %>%
  group_by(x) %>%
  summarise(mean = round(mean(y), 2),
            sd = round(sd(y), 2))

step_2
## # A tibble: 7 x 3
##       x  mean    sd
##   <dbl> <dbl> <dbl>
## 1    20  1.56  0.7 
## 2    30  0.65  0.25
## 3    40 -0.65  0.34
## 4    50 -1.28  0.33
## 5    60 -1.92  0.38
## 6    70 -2.69  0.49
## 7    80 -3.48  0.71

These mean and standard deviation values match fairly well with the values given in Table 1.1 of Computer Age Statistical Inference, with the caveat that both the authors’ run and ours use only 250 bootstrap replications, so the estimates are themselves noisy and we shouldn’t expect exact agreement.

Let’s add the bootstrap standard errors to the plot of the lowess fit:

lowess_plot + 
  geom_segment(data = step_2, 
               aes(x = x, xend = x,
                   y = mean - 2*sd, yend = mean + 2*sd))

This provides a good replication of Figure 1.2 in the book.

As always, I am sure that there are other ways to accomplish the same result in R, likely including ones that are more concise and clear. If you have a nice approach (or if you find a mistake in what I’ve done), please send me an email and let me know!
