Sampling Distributions, the Central Limit Theorem and the Law of Large Numbers#

A key connection between probability and statistics is the concept of sampling distributions.

Sampling#

Before we dig into sampling distributions and statistical theory, we need to understand sampling. We have several pre-packaged commands in R that allow us to quickly generate sample data from a specific distribution. We can also sample rows from a data frame. The pages in this section will demonstrate how to use these functions together with specific examples.
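As a minimal sketch of these two kinds of sampling, the block below uses base R's rnorm() to draw values from a normal distribution and sample() to pick rows from a data frame. The data frame toy and its columns are made up for illustration; they are not part of the data sets used later.

```r
set.seed(1)                                  # make the random draws reproducible

# Draw 5 values from a normal distribution with mean 100 and sd 15
x <- rnorm(5, mean = 100, sd = 15)
x

# A small made-up data frame (hypothetical, for illustration only)
toy <- data.frame(id = 1:8, score = c(3, 7, 4, 9, 5, 6, 2, 8))

# Sample 3 rows without replacement by sampling row numbers
rows <- sample(nrow(toy), 3)
toy_sample <- toy[rows, ]
toy_sample
```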

Sampling Distributions#

Definition. For a fixed population and a fixed sample size, the collection of all possible values of the sample mean, taken over all possible samples of that size, forms what we call the sampling distribution of the mean.
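For a very small population, the definition can be carried out literally. The sketch below assumes a made-up population of five values and lists every possible sample of size 2 (drawn without replacement) with base R's combn(); the means of those samples make up the sampling distribution.

```r
# Hypothetical population of 5 values (not from the course data)
population <- c(2, 4, 6, 8, 10)

# Every possible sample of size 2 (without replacement), one per column
all_samples <- combn(population, 2)

# The mean of each possible sample: together these form the sampling distribution
sample_means <- colMeans(all_samples)

table(sample_means)        # how often each possible mean occurs
mean(sample_means)         # equals the population mean, 6
```

For realistic populations there are far too many possible samples to list, which is why the examples on this page draw many random samples instead.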

The statistical theory involved includes:

  • Central Limit Theorem (CLT)

  • Law of Large Numbers

Law of Large Numbers#

If \(\overline{X}_n\) is the average of \(n\) many \(x_i\) all drawn from the same population/distribution with mean \(\mu\) then as \(n\) increases, \(\overline{X}_n\) will approach \(\mu\).

When the sample size is very large, we can have some confidence that the sample average is “pretty close” to the population average.
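A quick simulation can illustrate this. The sketch below assumes a normal population with mean 5 and standard deviation 2 (values chosen only for illustration) and tracks the running average as more and more values are drawn.

```r
set.seed(42)
mu <- 5                                       # population mean (an assumed value)
draws <- rnorm(10000, mean = mu, sd = 2)      # 10,000 draws from the population

# Running average after 1, 2, ..., 10000 draws
running_mean <- cumsum(draws) / seq_along(draws)

running_mean[c(10, 100, 1000, 10000)]         # averages at increasing n
plot(running_mean, type = "l", xlab = "n", ylab = "Sample average")
abline(h = mu, col = "blue")                  # the population mean
```

The plotted line should wander at first and then settle near the blue line as \(n\) grows.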

Central Limit Theorem#

Assume \(\overline{X}_n\) is the average of \(n\) many \(x_i\) all drawn from the same population/distribution with mean \(\mu\) and population standard deviation \(\sigma\). Then \(\overline{X}_n\) is a member of a sampling distribution. For large values of \(n\), this sampling distribution can be assumed approximately normal. Specifically, the sampling distribution can be assumed to be normal with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\):

\[N\left(\mu, \frac{\sigma}{\sqrt{n}} \right)\]
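The claim can be checked by simulation. The sketch below assumes an exponential population with rate 1 (so \(\mu = 1\) and \(\sigma = 1\)), chosen only because it is clearly not normal. It draws many samples of size \(n = 40\) and compares the spread of the sample means with \(\sigma/\sqrt{n}\).

```r
set.seed(123)
n <- 40                       # sample size
reps <- 5000                  # number of samples to draw

# Exponential population with rate 1: mean mu = 1 and sd sigma = 1 (skewed, not normal)
mu <- 1
sigma <- 1

# Mean of each of the 5000 samples of size n
xbar <- replicate(reps, mean(rexp(n, rate = 1)))

mean(xbar)                    # should be close to mu
sd(xbar)                      # should be close to sigma / sqrt(n)
sigma / sqrt(n)               # theoretical standard deviation of the sample mean
hist(xbar)                    # roughly bell-shaped despite the skewed population
```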

Getting Started#

To prepare for the examples and demonstrations, we need two things. First, we need data to work with. Second, we need our main sampling function: sample.data.frame.

Run the cell below to load 4 data sets.

united <- read.csv('http://faculty.ung.edu/rsinn/data/united.csv')
p <- read.csv('http://faculty.ung.edu/rsinn/data/personality.csv')
airports <- read.csv('http://faculty.ung.edu/rsinn/data/airports.csv')
births <-  read.csv('http://faculty.ung.edu/rsinn/data/baby.csv')

Now that we have data sets to sample from, we need the function that actually performs the sampling. Again, this code is adapted from the documentation of the classic mosaic package, which is still available in R provided you have a compatible version of R and all of mosaic’s required dependencies.

Run the cell below to activate the function:

sample.data.frame

sample.data.frame <- function(x, size, replace = FALSE, prob = NULL, groups=NULL, 
                              orig.ids = TRUE, fixed = names(x), shuffled = c(),
                              invisibly.return = NULL, ...) {
  if( missing(size) ) size = nrow(x)
  if( is.null(invisibly.return) ) invisibly.return = size>50 
  shuffled <- intersect(shuffled, names(x))
  fixed <- setdiff(intersect(fixed, names(x)), shuffled)
  n <- nrow(x)
  ids <- 1:n
  groups <- eval( substitute(groups), x )
  newids <- sample(n, size, replace=replace, prob=prob, ...)
  origids <- ids[newids]
  result <- x[newids, , drop=FALSE]
  
  idsString <- as.character(origids)
  
  for (column in shuffled) {
    cids <- sample(newids, groups=groups[newids])
    result[,column] <- x[cids,column]
    idsString <- paste(idsString, ".", cids, sep="")
  }
  
  result <-  result[ , union(fixed,shuffled), drop=FALSE]
  if (orig.ids) result$orig.id <- idsString
  
  
  if (invisibly.return) { return(invisible(result)) } else {return(result)}
}
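Stripped of its optional arguments, the function above does roughly the following in its default use: it picks row numbers at random and returns those rows. The block below is a simplified base-R sketch of that idea, not the mosaic code itself; the helper name sample_rows and the data frame toy are made up for illustration.

```r
# Hypothetical simplified helper; not the mosaic function itself
sample_rows <- function(x, size, replace = FALSE) {
  newids <- sample(nrow(x), size, replace = replace)   # random row numbers
  x[newids, , drop = FALSE]                            # keep those rows, all columns
}

set.seed(2)
toy <- data.frame(id = 1:20, Narc = rep(1:5, times = 4))   # made-up data
sample_rows(toy, 4)
```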

Example: Estimating Narcissism#

Let’s work with an example from the personality data set: narcissism. Let’s generate many, many samples of the same size. We’ll find the averages from each sample and use them to estimate the average level of narcissism for students at UNG.

First Step: Generating Samples of Size \(n=10\)#

Let’s begin with the R commands necessary to sample the Narc column in the personality data frame. We will use the sample.data.frame() function to draw a sample.

Run the cell below to see how this works, and notice:

  • The function inputs:

  1. Name of the data frame to sample from.

  2. Sample size to be drawn.

  • The output: 10 rows from the data frame with all columns present.

s <- sample.data.frame(p, 10, orig.ids = FALSE)
head(s,15)
    Age Yr Sex G21 Corps Res Greek VarsAth Honor  GPA ... Perf OCD Play Extro Narc HSAF HSSE HSAG HSSD PHS
112  20  3   F   N     N   3     N       N     N 2.50 ...  107  13  140     3    1   38   28   25   24  AG
 82  20  2   M   N     N   1     N       Y     N 3.40 ...  106   0  146    15    4   46   44   40   36  AG
 29  21  3   M   Y     N   2     N       Y     N 3.40 ...  110   8  112    13    7   48   34   43   40  AG
 42  20  2   F   N     Y   1     Y       N     N 3.02 ...  112  10  147    10    6   33   21   31   33  SD
 70  22  4   F   Y     N   2     N       N     N 3.78 ...  112  11  117     4    4   43   37   28   34  SD
121  25  2   F   Y     N   3     N       N     N 2.27 ...  110   7  147     7    1   40   38   32   35  SD
  3  22  3   M   Y     N   2     N       N     N 3.06 ...   73   1  134    15   11   48   42   44   29  AG
 87  17  1   F   N     N   3     N       N     N 3.47 ...   99  12  130     3    3   44   29   26   19  AF
 32  19  1   F   N     N   1     N       N     N 2.33 ...  123   4  133    12    7   39   32   29   29  AG
 95  20  2   F   N     N   1     N       N     N 2.51 ...  104   5  147     6    3   47   19   19   21  AF

We can find the average narcissism for these 10 persons by subsetting our sample data frame s.

mean(s[ , 'Narc'])
4.7

Putting it Together. Eventually, we want to run a loop that does this a thousand or more times. Thus, we prefer a single line of code that will do it for us all at once. We wrap the sample.data.frame() function inside the mean function as shown below.

Run the code below multiple times to see that each run draws a new sample and reports its average narcissism level.

mean(sample.data.frame(p, 10, orig.ids = F)[ , 'Narc'])
4.5

Step 2: Creating a for Loop#

The steps make sense if we consider them separately:

  1. Create all_means, an initially empty vector where we plan to store our sample means.

  2. Create a for loop that will run a thousand times.

  3. Inside the loop, we will:

  • Gather a sample of size \(n=10\).

  • Calculate the mean.

  • Add this value to the all_means vector.

all_means <- c()                                         # Empty vector to store all the sample means
for (count in 1:1000){
    samp <- sample.data.frame(p, 10, orig.ids = FALSE)   # Generate a sample (size n=10)
    all_means[count] <- mean(samp[ , 'Narc'])            # Save the mean of this sample in the vector
}
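The same kind of loop can also be written with base R's replicate(), which repeats an expression and collects the results in a vector. This is only a sketch under assumptions: to stay self-contained it samples from a made-up vector narc rather than the Narc column of p, and it stores the result under a different name, all_means_alt.

```r
set.seed(10)
narc <- c(1, 3, 4, 4, 5, 6, 7, 3, 2, 11, 4, 6, 1, 7, 3)   # made-up scores, not the real data

# Repeat 1000 times: draw 10 values without replacement and take their mean
all_means_alt <- replicate(1000, mean(sample(narc, 10)))

length(all_means_alt)      # 1000 sample means
summary(all_means_alt)
```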

Notice that we now have a vector all_means, so we can display the distribution in a histogram and calculate various statistics.

summary(all_means)
hist(all_means)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   4.100   4.700   4.708   5.300   7.700 
_images/fb5de1d4ee78c2631eae0f1661107b18ad6c87a90c15aa5cdd9f012e006730c4.png

Step 3: The Middle 90% of the Distribution#

Because we intend to use the sampling distribution to estimate the population average, we need a way to construct an interval. This interval will be our estimated range of values. For the moment, let’s use the middle 90% of the all_means vector. We will need the endpoints, i.e. the 5th and 95th percentiles of the vector.

lower <- quantile(all_means, probs = 0.05)    # Calculate the 5th percentile.
upper <- quantile(all_means, probs = 0.95)    # Calculate the 95th percentile.
cat('The middle 90% of the all_means vector is (',lower,',',upper,').')
The middle 90% of the all_means vector is ( 3.3 , 6.1 ).
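One way to sanity-check such an interval is to count the fraction of values that fall between the two endpoints; it should be close to 0.90. The sketch below uses simulated values in a stand-in vector sim_means (an assumption for illustration, not the all_means computed from the personality data).

```r
set.seed(5)
sim_means <- rnorm(1000, mean = 4.7, sd = 0.85)     # stand-in for a vector of sample means

sim_lower <- quantile(sim_means, probs = 0.05)      # 5th percentile
sim_upper <- quantile(sim_means, probs = 0.95)      # 95th percentile

# Proportion of values inside the interval; about 0.90 by construction
inside <- mean(sim_means >= sim_lower & sim_means <= sim_upper)
inside
```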

Step 4: The Histogram with Vertical Lines Showing the 5th and 95th Percentiles#

We use the function abline() to superimpose vertical lines onto our histogram. We’ve already calculated the values for the 5th and 95th percentiles. We need only to use the argument v, which draws a vertical line at the value indicated. The col argument is not vital for our purposes, but a splash of color is visually appealing.

As we go forward, we will see that increased sample size will lead to a narrower bell-shape. In other words, the size of the standard deviation will become important, so let’s include that in the text we print out using the cat() function.

cat("Standard deviation of sampling distribution:", sd(all_means), '\nThe middle 90% of the sampling distribution is (',lower,',',upper,').')
hist(all_means)
abline(v = lower, col="blue")
abline(v = upper, col="blue")
Standard deviation of sampling distribution: 0.8697034 
The middle 90% of the sampling distribution is ( 3.3 , 6.1 ).
_images/2a1314c86b0ed554e215dc821acb836d05b0a91ce62caffc4c188a23d9995e79.png
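The effect of sample size on the spread can be previewed with a small simulation. The sketch below assumes a made-up skewed population (not the personality data) and a hypothetical helper sd_of_means(); it compares the standard deviation of the sample means for \(n = 10\) and \(n = 40\). Since the spread shrinks roughly like \(1/\sqrt{n}\), quadrupling the sample size should roughly halve it.

```r
set.seed(7)
population <- rexp(5000, rate = 0.2)       # made-up skewed population (hypothetical)

# Standard deviation of `reps` sample means for a given sample size n
sd_of_means <- function(n, reps = 1000) {
  sd(replicate(reps, mean(sample(population, n))))
}

sd_10 <- sd_of_means(10)
sd_40 <- sd_of_means(40)
c(n10 = sd_10, n40 = sd_40)                # the n = 40 value should be roughly half
```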

Step 5: Performing all Tasks in 1 Code Block#

Now that we have unpacked each command needed, we can put it all together into one code block. We have also added the parameters reps and samp_size as the top 2 lines so that each is set in a single place. This makes it quick to generate sampling distributions for different sample sizes n.

reps <- 1000         # Number of repetitions of the for loop
samp_size <- 10      # Sample size to be drawn
all_means <- c()     # Empty vector to store all the sample means

for (count in 1:reps){
    samp <- sample.data.frame(p, samp_size, orig.ids = FALSE)   # Draw a sample of size samp_size
    all_means[count] <- mean(samp[ , 'Narc'])                   # Store this sample's mean
}

upper <- quantile(all_means, probs = 0.95)    # 95th percentile
lower <- quantile(all_means, probs = 0.05)    # 5th percentile
cat("Standard deviation of sampling distribution:", sd(all_means), '\nThe middle 90% of sampling distribution: (',lower,',',upper,').')
hist(all_means)
abline(v = lower, col="blue")
abline(v = upper, col="blue")
Standard deviation of sampling distribution: 0.8247401 
The middle 90% of sampling distribution: ( 3.3 , 6.005 ).
_images/d54eae7809ae8e81b967f082644780075a5a6887141dacbe999f08768985856b.png
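To compare several sample sizes without editing the block each time, the same steps could be wrapped in a function. The sketch below is one possible way to do that, not part of the original notes: the helper name sampling_dist() is hypothetical, and it works on a plain numeric vector (here a made-up fake_narc) so that it runs without the personality data.

```r
# Hypothetical helper (not part of the original notes): returns the sample
# means, their standard deviation, and the middle 90% interval
sampling_dist <- function(values, samp_size, reps = 1000) {
  means <- replicate(reps, mean(sample(values, samp_size)))
  list(means    = means,
       sd       = sd(means),
       interval = quantile(means, probs = c(0.05, 0.95)))
}

set.seed(11)
fake_narc <- sample(0:15, 150, replace = TRUE)   # made-up scores standing in for p$Narc

for (n in c(10, 20, 40)) {
  res <- sampling_dist(fake_narc, samp_size = n)
  cat("n =", n, " sd =", round(res$sd, 3),
      " middle 90%: (", res$interval[1], ",", res$interval[2], ")\n")
}
```

With the personality data loaded, the first argument would presumably be p[ , 'Narc'] instead of fake_narc.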