24. Bootstrapping Tools

from datascience import *
import numpy as np

%matplotlib inline

import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from scipy import stats

Bootstrap Confidence Intervals

Steps in bootstrapping

Step

Example

1.

Ask Question

What is the weight of babies born to smoking mothers?

2.

Get sample data

Get weights of 100 babies born to smokers

3.

Choose a statistic

Mean baby weight

4.

If population data available, calculate the statistic of the original sample

The mean weight of the 100 babies is 114 oz.

5.

Bootstrap the sample and calculate the statistic on the bootstrap sample

The mean weight of the 100 babies in bootstrap sample 1 is 115.3 oz.

6.

Repeat over many bootstrapped samples and record the results

Usually 1k samples is more than enough.

7.

Plot the distrubution and observe the uncertainty inherent from samples of a given size

8.

Find the 2.5 and 97.5 percentiles of the statistic over all of the bootstrap samples to compute a 95% confidence interval for the population parameter.

Creating boot_one function to generate a single bootstrap sample

The function below will resample from the first column of a table. Ensure your data table has a numeric variables in its first column.

def boot_one(table, samp_size):
    resample = table.sample(samp_size)
    return np.average(resample.column(0))

Let’s see how it works by loading sample data. Let’s resample the ‘Perfectionism’ variable.

pers = Table.read_table('http://faculty.ung.edu/rsinn/perfnarc.csv')
pers.show(5)
Sex G21 Greek AccDate Stress1 Stress2 Perf Narc
F N N N 9 7 99 3
F Y N Y 11 13 86 2
F N Y N 15 14 118 4
F N N Y 16 15 113 2
F Y N Y 17 17 107 8

... (143 rows omitted)

We will feed the function a one-column table with the perfectionism scores. The function will resample (n = 25) and return the mean of that sample. Re-execute the code block several times to see what typical values in the distribution look like.

boot_one(pers.select('Perf'),25)
115.24

Step 1. Ask Question

What is the mean weight of babies born to smokers?

Step 2. Get Sample Data

Read in the Data. The data might be population data or sample data. Be aware of the diiference. If the data is population data, you will need to first create a sample from the population data, then resample it.

# This is sample data
births = Table.read_table('http://faculty.ung.edu/rsinn/baby.csv')
births.show(5)
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
120 284 27 62 100 False
113 282 33 64 135 False
128 279 28 64 115 True
108 282 23 67 125 True
136 286 25 62 93 False

... (1169 rows omitted)

Step 3. Choose a statistic (mean)

The default statistic is the mean. If we want to estimate a different statistic like the median, then we would need to take the function boot_one and replace the np.average with np.median. In almost every case, especially in an introductory statistics course, we will estimate the mean.

Select the the birth weight of babies born to smokers as a one-column table of numeric values.

smoked = births.where('Maternal Smoker',True).select('Birth Weight')
smoked.show(5)
Birth Weight
128
108
143
144
141

... (454 rows omitted)

Step 4. Calculate the statistic of the original sample

Our original data is a sample, not a population, so every time we sample that sample we are resampling which is another name for bootstrapping.

Step 5. Bootstrap the sample and calculate the statistic on the bootstrap sample

Choose a resample size and sample with replacement.

boot_one(smoked,100)
113.46

Step 6. Repeat over many bootstrap samples

We use our standard for loop setup with a blank results array and np.append to add new results to it. Our resample size is 50, but we need 500 - 1,000 resamples for an accurate bootstrap distribution.

boot_samp = make_array()
resamp_size = 50

# Never need more than 1k reps, use 500 or fewer if working the cloud.
resample_reps = 1000

for i in range(resample_reps):
    new_boot = boot_one(smoked,resamp_size)
    boot_samp = np.append(boot_samp, new_boot)
    
# Remove the hashtag comment symbol to see the boot_samp results array
#boot_samp
def boot_hist (array):
    left = round(percentile(2.5, array),2)
    right = round(percentile(97.5, array),2)
    avg = round(np.average(array),2)
    tab = Table().with_column('Bootstrapped Sample',array)
    tab.hist(0)
    _ = plots.title('95% Confidence Interval')
    _ = plots.plot([left, left], [0, 0.1], color='red', lw=2)
    _ = plots.plot([right, right], [0, 0.1], color='red', lw=2)
    _ = plots.scatter(avg, 0, color="gold", s = 200,zorder=2);
    print("The 95% confidence interval lies between ", left," and ", right, ",")
    print("and the gold dot at x = ", avg, " is the mean of the bootstrapped sample distribution.")
boot_hist(boot_samp)
The 95% confidence interval lies between  108.3  and  118.88 ,
and the gold dot at x =  113.91  is the mean of the bootstrapped sample distribution.
../_images/4_22_1.png

The 95% confidence interval endpoints are shown above the histogram: we are 95% confident that the true mean lies between 108.9 and 118.8.

Compare to resamples of size 200

Note that this resample size is 4 times larger than we used above. The law of large numbers indicates this larger (re)sample size should lead to a more accurate estimate.

boot_samp_2 = make_array()
resamp_size = 200

# Never need more than 1k reps, use 500 or fewer if working the cloud.
resample_reps = 1000

for i in range(resample_reps):
    new_boot = boot_one(smoked,resamp_size)
    boot_samp_2 = np.append(boot_samp_2, new_boot)
    
# Remove the hashtag comment symbol to see the boot_samp results array
#boot_samp
boot_hist(boot_samp_2)
The 95% confidence interval lies between  111.38  and  116.1 ,
and the gold dot at x =  113.86  is the mean of the bootstrapped sample distribution.
../_images/4_26_1.png

Compared to the confidence interval generated with a resample size of \(n=50\), the width of the new confidence interval is approximately 5 units less.