## Bootstrap Resampling

Bootstrap resampling is a statistical technique for estimating the error in a statistic that has been computed from a sample. It is a simple yet powerful method that relies heavily on computational power. The basic premise is that, instead of using a theoretical or mathematical model for the parent distribution from which our observed samples were drawn, we can use the distribution of the observed samples as an approximation for the parent distribution.

### The Algorithm

Let’s say we observe ${\displaystyle N}$ data samples, denoted as ${\displaystyle {\vec {x}}=(x_{1},x_{2},x_{3},...,x_{N})}$, and we want to compute a statistic ${\displaystyle {\hat {\theta }}=s({\vec {x}})}$. This statistic could be the mean or median of our samples, but could also be something much more complex. In measuring ${\displaystyle {\hat {\theta }}}$ from our data, we want to know how close our estimator is to the true value of ${\displaystyle \theta }$, so we need to compute an error estimate for ${\displaystyle {\hat {\theta }}}$. This can be done using the following bootstrap resampling algorithm:

1. Make a bootstrap sample ${\displaystyle x^{\star }}$ by sampling with replacement from the original data samples. This bootstrap sample should also be of length ${\displaystyle N}$ and may contain repetitions of the same data sample (since we sampled with replacement).
2. Repeat this process to create ${\displaystyle B}$ bootstrap samples. Generally, ${\displaystyle B}$ is chosen between 1000 and 10000, in order to reduce the amount of random scatter in the measurement of the bootstrap error.
3. Compute the same desired statistic for each of the bootstrap samples, ${\displaystyle {\hat {\theta }}^{\star ,b}=s(x^{\star ,b})}$, where ${\displaystyle b}$ ranges from 1 to ${\displaystyle B}$. We will call the quantities ${\displaystyle {\hat {\theta }}^{\star ,b}}$ our bootstrap replications.
4. From the ${\displaystyle B}$ bootstrap replications, compute the bootstrap variance of the measured value ${\displaystyle {\hat {\theta }}}$ as
${\displaystyle \sigma _{\mathrm {boot} }^{2}=\sum _{b=1}^{B}\left[{\hat {\theta }}^{\star ,b}-\langle {\hat {\theta }}^{\star }\rangle \right]^{2}/(B-1),\,\!}$

where the mean of the bootstrap replications is given by

${\displaystyle \langle {\hat {\theta }}^{\star }\rangle =\sum _{b=1}^{B}{\hat {\theta }}^{\star ,b}/B.\,\!}$
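
The four steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function names `bootstrap_replications` and `bootstrap_std`, the choice of the median as the statistic, the value of `n_boot`, and the synthetic data are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_replications(data, statistic, n_boot=2000, seed=0):
    """Return n_boot bootstrap replications of statistic(data)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(data)
    replications = []
    for _ in range(n_boot):
        # Steps 1-2: draw N values with replacement from the observed data.
        resample = rng.choices(data, k=n)
        # Step 3: evaluate the same statistic on the bootstrap sample.
        replications.append(statistic(resample))
    return replications


def bootstrap_std(replications):
    """Step 4: square root of the bootstrap variance (B - 1 denominator)."""
    return statistics.stdev(replications)


# Illustrative data only (an assumption): 50 draws from a unit normal.
data_rng = random.Random(42)
data = [data_rng.gauss(0.0, 1.0) for _ in range(50)]

reps = bootstrap_replications(data, statistics.median)
theta_hat = statistics.median(data)
sigma_boot = bootstrap_std(reps)
print(f"median = {theta_hat:.3f} +/- {sigma_boot:.3f}")
```

`statistics.stdev` divides by ${\displaystyle B-1}$, matching the variance formula above. Any function that maps a list of samples to a number can be passed in place of `statistics.median`.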

### Confidence Intervals

In the limit as ${\displaystyle N\rightarrow \infty }$, the distribution of the bootstrap replications approaches a normal (Gaussian) distribution for many commonly used statistics. So, in the limit of large ${\displaystyle N}$, the standard deviation measured from the distribution of bootstrap replications, ${\displaystyle \sigma _{\mathrm {boot} }}$, can be treated as the standard deviation of a normal distribution. In particular, ${\displaystyle {\hat {\theta }}\pm \sigma _{\mathrm {boot} }}$ marks the 68.3% confidence interval on the measurement of ${\displaystyle {\hat {\theta }}}$, as is usual for confidence intervals of a normal distribution.

For small ${\displaystyle N}$, however, the distribution of the bootstrap replications does not have to resemble a normal distribution, and often will not. In this case, ${\displaystyle \sigma _{\mathrm {boot} }}$ cannot be interpreted as marking the 68.3% confidence interval. Instead, the cumulative distribution function (CDF) of the bootstrap replications can be used to measure confidence intervals. For example, for a 90% confidence interval, the CDF can be used to find the bounds ${\displaystyle [{\hat {\theta }}_{\mathrm {low} },{\hat {\theta }}_{\mathrm {high} }]}$ such that 5% of the bootstrap replications are below ${\displaystyle {\hat {\theta }}_{\mathrm {low} }}$ and 5% are above ${\displaystyle {\hat {\theta }}_{\mathrm {high} }}$.

This way of computing confidence intervals, known as the percentile method, is the most basic. Newer methods have been developed that produce correct results over a broader range of problems. These methods are beyond the scope of this description, but for those interested, one of the main algorithms used is known as the bias-corrected and accelerated (BCa) algorithm.
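
As a rough illustration of reading a confidence interval off the distribution of bootstrap replications, the sketch below sorts the replications and picks the values that leave equal fractions in each tail. The function name `percentile_interval`, the rounding rule used to choose the order statistics, and the small skewed data set are assumptions for the example. Other conventions for picking the tail positions exist and give slightly different bounds.

```python
import random
import statistics


def percentile_interval(replications, confidence=0.90):
    """Bounds leaving (1 - confidence) / 2 of the replications in each tail."""
    ordered = sorted(replications)
    b = len(ordered)
    tail = (1.0 - confidence) / 2.0
    # Positions of the order statistics at each tail, clamped to valid indices.
    low_index = max(0, round(tail * b))
    high_index = min(b - 1, round((1.0 - tail) * b) - 1)
    return ordered[low_index], ordered[high_index]


# Illustrative data only (an assumption): 15 draws from a skewed distribution,
# where the replications are not expected to look normal.
rng = random.Random(1)
data = [rng.expovariate(1.0) for _ in range(15)]

# Bootstrap replications of the mean.
reps = [statistics.mean(rng.choices(data, k=len(data))) for _ in range(5000)]

low, high = percentile_interval(reps, confidence=0.90)
sigma_boot = statistics.stdev(reps)
theta_hat = statistics.mean(data)

print(f"90% percentile interval: [{low:.3f}, {high:.3f}]")
print(f"theta_hat +/- sigma_boot: [{theta_hat - sigma_boot:.3f}, {theta_hat + sigma_boot:.3f}]")
```

For a skewed sample like this one, the percentile bounds are generally not symmetric about ${\displaystyle {\hat {\theta }}}$, which is the behaviour that the normal approximation cannot capture.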