# Bootstrap resampling




## Latest revision as of 18:33, 13 December 2012

### Prerequisites

### Short topical video

- [Bootstrap Resampling (by Nick Hand)](http://www.youtube.com/watch?v=ZCXg64l9R_4&feature=youtu.be)

### Reference Material

- Efron, Bradley; Tibshirani, Robert J. (1993). *An Introduction to the Bootstrap*
- Bootstrapping (Wikipedia)

## Bootstrap Resampling

Bootstrap resampling is a statistical technique for measuring the error in a statistic that has been computed from a sample population. It is a simple yet powerful method that relies heavily on computational power. The basic premise is that instead of using a theoretical or mathematical model for the parent distribution from which our observed samples were drawn, we can use the distribution of the observed samples themselves as an approximation to the parent distribution. This allows us to estimate the standard error and confidence intervals of an estimator without any knowledge of the shape of the parent distribution, which proves especially useful when the theoretical distribution of the statistic is complex or even unknown.

### The Algorithm

Let's say we observe $N$ data samples, denoted $\vec{x} = (x_1, x_2, x_3, \ldots, x_N)$, and we want to compute a statistic $\hat{\theta} = s(\vec{x})$. This statistic could be the mean or median of our samples, but could also be something much more complex. In measuring $\hat{\theta}$ from our data, we want to know how close our estimator is to the true value of $\theta$, so we need to compute an error estimate for $\hat{\theta}$. This can be done using the following bootstrap resampling algorithm:

1. Make a bootstrap sample $x^{\star}$ by sampling with replacement from the original data samples. This bootstrap sample should also be of length $N$ and may contain repetitions of the same data sample (since we sampled with replacement).
2. Repeat this process to create $B$ bootstrap samples. Generally, $B = 1000$–$10000$, in order to reduce the amount of random scatter in the measurement of the bootstrap error.
3. Compute the same desired statistic for each of the bootstrap samples, $\hat{\theta}^{\star, b} = s(x^{\star, b})$, where $b$ ranges from 1 to $B$. We will call the quantities $\hat{\theta}^{\star, b}$ our *bootstrap replications*.
4. From the $B$ bootstrap replications, compute the bootstrap variance of the measured value $\hat{\theta}$ as

$$
\sigma_\mathrm{boot}^2 = \frac{1}{B-1} \sum_{b=1}^{B} \left[ \hat{\theta}^{\star, b} - \langle \hat{\theta}^{\star} \rangle \right]^2,
$$

where the mean of the bootstrap replications is given by

$$
\langle \hat{\theta}^{\star} \rangle = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{\star, b}.
$$

The use of sampling with replacement works because we are approximating the true parent distribution of our sample with the actual sample values. However, the method essentially swaps out statistically independent samples for values that are correlated with each other (i.e., the same value appearing more than once). Thus, the resampling algorithm cannot reproduce the noise exactly, but in the large-$N$ limit the resampled estimate of $\sigma_\mathrm{boot}$ will asymptote to its true value. It is also important to note that sampling with replacement relies on the assumption that the samples are drawn from an identical parent distribution and are statistically independent. In the case of dependent data, more sophisticated resampling techniques are needed.
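The steps above can be sketched in Python as a minimal illustration using NumPy; the function name and interface are our own choices for this sketch, not part of any standard library:

```python
import numpy as np

def bootstrap_error(x, statistic, B=10000, rng=None):
    """Estimate the standard error of `statistic` over the sample `x`
    by bootstrap resampling (steps 1-4 above)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    N = len(x)
    # Steps 1-3: draw B bootstrap samples of length N with replacement
    # and compute one bootstrap replication per sample.
    replications = np.array(
        [statistic(rng.choice(x, size=N, replace=True)) for _ in range(B)]
    )
    # Step 4: standard deviation of the replications (ddof=1 gives the
    # 1/(B-1) normalization in the variance).
    return replications.std(ddof=1), replications
```

For example, `bootstrap_error(data, np.mean, B=2000)` returns the bootstrap standard error of the sample mean together with the replications themselves, which can then be inspected or used for confidence intervals.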

### Confidence Intervals

In the limit as $N \rightarrow \infty$, the distribution of the bootstrap replications will asymptote to a normal (Gaussian) distribution. So, in the limit of large $N$, the standard deviation measured from the distribution of bootstrap replications, $\sigma_\mathrm{boot}$, can be treated as the standard deviation of a normal distribution. In particular, $\hat{\theta} \pm \sigma_\mathrm{boot}$ will mark the 68.3% confidence interval on the measurement of $\hat{\theta}$, as is usual for confidence intervals of a normal distribution. However, in the limit of small $N$, the distribution of the bootstrap replications need not resemble a normal distribution, and it likely will not. In that case, $\sigma_\mathrm{boot}$ cannot be interpreted as marking the 68.3% confidence interval; instead, the cumulative distribution function (CDF) of the bootstrap replications can be used to measure confidence intervals. For example, for a 90% confidence interval, the CDF can be used to find the bounds $[\hat{\theta}_{\mathrm{low}}, \hat{\theta}_{\mathrm{high}}]$ such that 5% of the bootstrap replications lie below $\hat{\theta}_{\mathrm{low}}$ and 5% lie above $\hat{\theta}_{\mathrm{high}}$. This method of computing confidence intervals using bootstrap resampling is the most basic. It has been shown that, on average, this simple method results in intervals that are slightly too narrow. Better methods have been developed that produce correct results over a broader range of problems. These are beyond the scope of this description, but for those interested, one of the main algorithms used is known as the bias-corrected and accelerated (BCa) algorithm.
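The percentile lookup described above can be sketched as follows (a hypothetical helper, assuming an array of bootstrap replications has already been computed; `np.percentile` performs the CDF inversion):

```python
import numpy as np

def percentile_interval(replications, level=0.90):
    """Basic percentile confidence interval from bootstrap replications:
    cut off (1 - level)/2 of the replications in each tail of the CDF."""
    tail = 100.0 * (1.0 - level) / 2.0      # e.g. 5.0 for a 90% interval
    low, high = np.percentile(replications, [tail, 100.0 - tail])
    return low, high
```

Note that this is the basic percentile method discussed in the text; the BCa algorithm mentioned above adjusts these percentile cutoffs to correct for bias and skewness.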

### Assessing the error in $\sigma_\mathrm{boot}$

In general, there will be two sources of error associated with the measurement of $\sigma_\mathrm{boot}$ using the bootstrap resampling algorithm. The first is sampling variability, caused by the fact that we sample the parent distribution with only a finite number of samples $N$. The second is resampling variability, caused by the fact that we take only $B$ resamples and not an infinite number. It can be shown that the variance of $\sigma_\mathrm{boot}$ has two corresponding terms,

$$
\mathrm{var}(\sigma_\mathrm{boot}) \propto \frac{1}{N^2} + \frac{1}{NB}.
$$

In the limit of small sample size, the variance of the bootstrap error will be larger, due simply to sampling variability. We can reduce the resampling variability by choosing large $B$; generally one chooses $B = 1000$–$10000$. In practice, $B$ should be the largest value possible given time and computational constraints.
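The resampling-variability term can be demonstrated with a short, self-contained experiment (the data, seed, and $B$ values here are arbitrary choices for illustration): repeated bootstrap runs on the same data scatter less as $B$ grows, while the sampling-variability term, fixed by $N$, remains.

```python
import numpy as np

def sigma_boot(x, B, rng):
    """One bootstrap estimate of the standard error of the sample mean."""
    N = len(x)
    reps = [rng.choice(x, size=N, replace=True).mean() for _ in range(B)]
    return np.std(reps, ddof=1)

rng = np.random.default_rng(42)
data = rng.normal(size=100)

# Repeated bootstrap runs scatter less as B grows: the 1/(NB)
# resampling-variability term shrinks while the 1/N^2 term stays fixed.
for B in (50, 500, 2000):
    runs = [sigma_boot(data, B, rng) for _ in range(10)]
    print(B, np.std(runs))
```

The printed scatter across the ten runs decreases as $B$ increases, illustrating why one picks $B$ as large as computational constraints allow.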