Lab 6: Central Limit Theorem

Learning Objectives

Be able to create sampling distributions for a sample mean and sample proportion in R
Visualize how sampling distributions change as we increase the size of our sample

The Central Limit Theorem tells us that sampling distributions for means and proportions will become more bell-shaped and symmetric (i.e., more Normal) as we increase the size of our sample. The standard deviation of the sampling distribution (i.e., our standard error) will also decrease as our sample size increases.

Don’t forget to “set the seed” of the random number generator in R so that each time you render the lab template you get the same answer (it helps with writing up your answers :).
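As a small illustration of what setting the seed does, here is a sketch in base R. The seed value 2024 is an arbitrary choice for this example, not one the lab requires.

```r
# Setting the same seed before a random draw reproduces that draw exactly.
set.seed(2024)                         # 2024 is an arbitrary illustrative seed
first_draw <- sample(1:164, size = 5)  # pick 5 of 164 quadrat labels at random

set.seed(2024)                         # reset to the same seed
second_draw <- sample(1:164, size = 5)

identical(first_draw, second_draw)     # TRUE: both draws pick the same quadrats
```

Without the second call to set.seed, the two draws would almost always differ.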

Lab Exercises

Counting Polar Bears

Today, we will explore a data set constructed by Seth Stapleton and Michelle LaRue, both of whom received their PhDs from the University of Minnesota. The data consist of counts of polar bears within roughly 3 x 3 km quadrats on Rowley Island in northern Foxe Basin, Nunavut (see Figure 1 [right]).

**Figure 1:** Left, a photo of a polar bear taken by Seth Stapleton. Right, a map of Rowley Island: bears are indicated by red dots, and gray lines show quadrats. Data for today’s lab consist of the number of bears contained within each of 164 quadrats intersecting the island.

You might wonder how they count polar bears… Seth and Michelle have pioneered a method using high-resolution satellite imagery. You can read about the technique here. They are able to distinguish bears from other light-colored spots by comparing images collected on multiple dates. However, the digital imagery is costly (not to mention that scanning the images for bears is very time intensive!). To save time and money, one could consider viewing images from a random sample of quadrats rather than scanning all quadrats intersecting the island. An estimate of the total number of bears on the island could then be obtained by multiplying the sample mean (i.e., mean number of bears in a quadrat) by the total number of quadrats (sampled and unsampled) that intersect the island. There are 164 quadrats, so we can estimate the population size using:

\[\hat{N} = 164\bar{x}\]

Seth and Michelle have actually used simulation approaches to explore the sampling distribution of \(\hat{N}\) for different sampling efforts (number of quadrats randomly sampled). Today, you will get a chance to do this too!

The data are contained in a data set named bears.csv; R code to read in the data set is included in your template for today’s lab. Explore the distribution of the number of bears/quadrat (for the 164 surveyed quadrats) by calculating some summary statistics using favstats(~Num.Bears, data=bdat).
Construct a histogram of the sample data using the gf_histogram function. How would you describe the distribution of bears in the different quadrats?

The sampling distribution tells us just how much variability we should expect (across different samples of size \(n\) quadrats) when estimating \(N\) (the total number of bears).
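For reference, standard sampling theory gives the size of this variability in closed form. This is a sketch that assumes simple random sampling of quadrats without replacement. The symbol \(\sigma\) is introduced here for illustration: it is the standard deviation of the number of bears per quadrat across all 164 quadrats (computed with divisor 164).

\[\mathrm{SE}(\hat{N}) = 164\,\mathrm{SE}(\bar{x}) = 164\,\frac{\sigma}{\sqrt{n}}\sqrt{\frac{164 - n}{164 - 1}}\]

The last factor is the finite population correction. It shrinks toward 0 as \(n\) approaches 164, because a sample that contains every quadrat estimates \(N\) with no error.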

In this lab, we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or even impossible. Because of this, we often take a smaller sample of the population, and use bootstrapping to approximate the sampling distribution (as in the previous set of exercises).

Because we have data on the whole population, we can actually explore the real sampling distribution, and also explore how the sampling distribution changes as we increase the size of our sample. For the exercises below, use a combination of do and sample(datasetname, size=[given sample size], replace=FALSE) to explore sampling distributions for a range of sample sizes. I’ve included the code for creating the sampling distribution when taking a sample of size 5. Make sure you understand what it is doing! Then, create sampling distributions for successively larger sample sizes.
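The repeated-sampling idea described above can also be sketched in base R, with replicate standing in for do. This is only an illustration, not the code from the lab template. The vector pop_counts is a made-up stand-in population (164 counts that sum to 92 bears), NOT the contents of bears.csv, and the helper name sim_Nhat is hypothetical.

```r
# Hypothetical stand-in population: 164 quadrat counts that sum to 92 bears.
# These values are made up for illustration; they are NOT the bears.csv data.
pop_counts <- c(rep(0, 109), rep(1, 30), rep(2, 16), rep(3, 6), rep(4, 3))

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Draw `reps` samples of `n` quadrats without replacement and return
# the estimate N-hat = 164 * (sample mean) from each sample.
sim_Nhat <- function(n, reps = 5000) {
  replicate(reps, 164 * mean(sample(pop_counts, size = n, replace = FALSE)))
}

Nhat_5  <- sim_Nhat(5)
Nhat_30 <- sim_Nhat(30)
Nhat_75 <- sim_Nhat(75)

# Center (mean) and spread (SD) of each simulated sampling distribution
centers <- c(n5 = mean(Nhat_5), n30 = mean(Nhat_30), n75 = mean(Nhat_75))
spreads <- c(n5 = sd(Nhat_5),   n30 = sd(Nhat_30),   n75 = sd(Nhat_75))
print(round(rbind(center = centers, spread = spreads), 1))
```

The centers should all sit near the total in the stand-in population, while the spreads shrink as the sample size grows.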

Create a sampling distribution of \(\hat{N}\) by taking 5000 different samples of size \(n = 5\). Plot the distribution using gf_dhistogram(~x, data=) %>% gf_fitdistr(dist = "dnorm").

NOTE: we need to use gf_dhistogram rather than gf_histogram if we want to compare the sampling distribution to that of a normal distribution. Adding the d scales the counts so that the total area under the histogram is 1. It may be instructive to compare the y-axis between plots constructed using gf_histogram and gf_dhistogram.
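The difference between the count scale and the density scale can be checked numerically. The sketch below uses base R's hist function rather than gf_histogram and gf_dhistogram, and the Poisson counts in x are illustrative values, not the lab data.

```r
set.seed(1)                      # arbitrary seed for reproducibility
x <- rpois(1000, lambda = 0.5)   # illustrative counts, not the lab data

h <- hist(x, plot = FALSE)       # compute the bins without drawing a plot

# Count scale: bar heights are frequencies and add up to the number of values
total_count <- sum(h$counts)

# Density scale: bar height times bin width is the bar's area, and the
# areas add up to 1, the same total area as under a normal density curve
total_area <- sum(h$density * diff(h$breaks))

c(total_count = total_count, total_area = total_area)
```

Because a normal density curve also has total area 1, it only lines up with the bars when the histogram is drawn on the density scale.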

Create a sampling distribution of \(\hat{N}\) by taking 5000 different samples of size \(n = 30\). Plot the distribution, using gf_dhistogram with gf_fitdistr to overlay a normal distribution.

NOTE: if you don’t see the normal distribution, make sure you used gf_dhistogram rather than gf_histogram!

Create a sampling distribution of \(\hat{N}\) by taking 5000 different samples of size \(n = 75\). Plot the distribution, using gf_dhistogram with gf_fitdistr to overlay a normal distribution.
For each sample size, describe the sampling distribution. Consider its shape and center. How does the sampling distribution change as you increase the sample size?

Central Limit Theorem (CLT)

The Central Limit Theorem (CLT) states that, for a large enough sample size, the sample mean is approximately normally distributed (and, from this result, it follows that \(\hat{N}\), which is just the sample mean multiplied by 164, will also be approximately normally distributed). The CLT is one of the most important theorems in statistics!

Reflect on your results in this section. Does \(n = 5\) appear to be large enough for the CLT to apply? What about \(n = 30\)? \(n = 75\)?

Now, let’s mimic what happens in real life - we take a single sample and estimate the population size, \(N\), along with a confidence interval for \(N\).

Take a single random sample of size \(n = 75\) quadrats and store the data in a dataset called sample.dat: i.e., sample.dat<-sample(datasetname, size=75, replace=FALSE).
Estimate the mean number of bears per quadrat, \(\mu\), using the mean of this sample, \(\bar{x}\): mean(~Num.Bears, data=sample.dat). Also, estimate the population size using \(\hat{N}=164\bar{x}\), i.e., 164*mean(~Num.Bears, data=sample.dat).
Create a bootstrap distribution for \(\hat{N}\) by resampling the 75 quadrats in sample.dat 5000 times, calculating the sample mean (for each resampled data set), and then \(\hat{N} = 164\bar{x}\) (for each resampled data set). Calculate the standard error of the bootstrap distribution for \(\hat{N}\). HINT: to create your bootstrap distribution, your code should look something like:

do(5000)*{ 
# Calculate the mean number of bears in a quadrat  
 mean.boot<-mean(~variable, data=resample(dataset))  
# Estimate the total population size by multiplying this mean
# by the number of quadrats on the island.
 N.boot<-164*mean.boot 
}

Or, you can use a single line of code as in:

do(5000)*{ 
# Estimate N
 164*mean(~variable, data=resample(dataset))  
}
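For comparison, the same bootstrap steps can be written in base R, using replicate in place of do and sample(..., replace = TRUE) in place of resample. This is a sketch under stated assumptions: pop_counts is a made-up stand-in population, NOT the bears.csv data, and the two intervals shown (percentile, and estimate plus or minus 2 standard errors) are two common choices rather than the one method the lab requires.

```r
# Hypothetical stand-in population (made up; NOT the bears.csv data):
# 164 quadrat counts summing to 92 bears.
pop_counts <- c(rep(0, 109), rep(1, 30), rep(2, 16), rep(3, 6), rep(4, 3))

set.seed(1)  # arbitrary seed

# One observed sample of 75 quadrats, drawn without replacement
obs  <- sample(pop_counts, size = 75, replace = FALSE)
Nhat <- 164 * mean(obs)

# Bootstrap: resample the 75 observed counts WITH replacement, 5000 times,
# and recompute N-hat for each resampled data set
boot_Nhat <- replicate(
  5000,
  164 * mean(sample(obs, size = length(obs), replace = TRUE))
)

boot_se <- sd(boot_Nhat)                          # bootstrap standard error
ci_pct  <- quantile(boot_Nhat, c(0.025, 0.975))   # 95% percentile interval
ci_se   <- Nhat + c(-2, 2) * boot_se              # estimate +/- 2 SE interval
```

The key difference from the sampling-distribution code is that each resample is drawn with replacement from the 75 observed quadrats, not from the full set of 164.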

Create a 95% confidence interval for \(N\) using the bootstrap distribution above. Does your interval contain the true population size = 92 bears? Are you surprised by your result - why or why not?

Depending on which 75 quadrats you selected, your estimate of \(N\) could be a bit above or a bit below the true population size of 92 bears. This is why we usually give an interval estimate, rather than just a point estimate. But in general, the sample mean turns out to be a pretty good estimate of the population mean, and thus, \(\hat{N}\) does a good job of estimating \(N\). If we had 100 lab groups, each with its own random sample, we would expect about 95 of the 100 intervals to contain \(N\). These results are pretty incredible, I think, given that we only sampled \(\approx\) 46% of the quadrats (75 of 164)!
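The idea of many lab groups each building their own interval can be simulated directly. The sketch below is in base R and again uses a made-up stand-in population (pop_counts, NOT the bears.csv data). The helper name covers_once, the 200 simulated groups, and the 500 bootstrap resamples per group are arbitrary choices made to keep the run time short.

```r
# Hypothetical stand-in population (made up; NOT the bears.csv data)
pop_counts <- c(rep(0, 109), rep(1, 30), rep(2, 16), rep(3, 6), rep(4, 3))
true_N <- sum(pop_counts)   # total bears in the stand-in population

set.seed(1)  # arbitrary seed

# One simulated "lab group": sample 75 quadrats, bootstrap N-hat, and report
# whether the 95% percentile interval contains the true total.
covers_once <- function(n = 75, boot_reps = 500) {
  obs  <- sample(pop_counts, size = n, replace = FALSE)
  boot <- replicate(boot_reps,
                    164 * mean(sample(obs, size = n, replace = TRUE)))
  ci   <- quantile(boot, c(0.025, 0.975))
  ci[1] <= true_N && true_N <= ci[2]
}

covered  <- replicate(200, covers_once())
coverage <- mean(covered)   # proportion of the 200 intervals that contain N
coverage
```

The proportion may come out somewhat above 0.95 here: sampling without replacement from a small population makes \(\hat{N}\) a little less variable than the bootstrap, which resamples with replacement, assumes.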

The Central Limit Theorem also applies to proportions. Based on a survey conducted in 2012, the US Census Bureau reported that 4.1% of the population in Minneapolis bikes to work. The population of Minneapolis in 2013 was estimated to be 400,070.

Using this information, let’s create our population of bikers and non-bikers.

library(mosaic)
options(digits = 2) #number of significant digits for output
nbikers<-round(400070*0.041) # Number of bikers in Minneapolis
Bike.Y.N<-data.frame(bike=c(rep("Yes", nbikers), rep("No", 400070-nbikers))) # bikers & non-bikers
tally(~bike, data=Bike.Y.N, format = "proportion")

bike
   No   Yes 
0.959 0.041

Now, let’s simulate the process of randomly sampling 50 different people from Minneapolis and asking if they bike to work so that we can get a \(\hat{p}\). Note - you will get a different answer every time you type this!

prop(~bike, data=sample(Bike.Y.N, size=50, replace=FALSE), success="Yes")

Important: if we try to use this code (i.e., the prop function along with do) to create our sampling distributions for different sample sizes, we will occasionally get a sample without any bikers and this will give R fits. So, instead, we will calculate the sampling distribution for the proportion of individuals that do not bike using:

prop(~bike, sample(Bike.Y.N, size=50, replace=FALSE), success = "No")

prop_No 
      1
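To get a sense of how often a sample of 50 contains no bikers at all (as in the output above), the probability can be computed directly. This sketch uses base R's dhyper function and the population figure of 400,070 quoted in the text. The variable names are illustrative.

```r
pop_size  <- 400070                   # population figure quoted in the text
nbikers   <- round(pop_size * 0.041)  # number of bikers in the population
samp_size <- 50                       # people sampled

# Exact probability of drawing 0 bikers when sampling without replacement:
# dhyper(x, m, n, k) with m = bikers, n = non-bikers, k = sample size
p_zero_exact <- dhyper(0, m = nbikers, n = pop_size - nbikers, k = samp_size)

# Near-identical shortcut that treats the 50 draws as independent
p_zero_approx <- (1 - 0.041)^samp_size

round(c(exact = p_zero_exact, approx = p_zero_approx), 3)
```

Both calculations give about 0.12, so roughly one sample in eight has no bikers.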

Use the above code (along with the do function) to generate the sampling distributions for the questions below.

Generate a sampling distribution of \(\hat{p}\) using a sample size of 50. Use gf_dhistogram(~x, data=) %>% gf_fitdistr(dist="dnorm") to plot the sampling distribution for the proportion of non-bikers. Don’t forget to see what R names the column of \(\hat{p}\)’s so that you can replace x with the correct variable name!

NOTE: if you don’t see the normal distribution, make sure you used gf_dhistogram rather than gf_histogram!

Repeat with a sample size of 100.
Repeat with a sample size of 300.
For each sample size, describe the sampling distribution. Consider its shape and center. How does the sampling distribution change as you increase the sample size?
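One way to check how the spread changes with sample size is to compare simulated standard errors with the usual formula \(\sqrt{p(1-p)/n}\). The sketch below swaps in a different technique from the do, sample, and prop approach used in the lab: it draws the number of non-bikers in each sample directly with base R's rhyper function. The helper name sim_phat is hypothetical.

```r
pop_size <- 400070                    # population figure quoted in the text
nbikers  <- round(pop_size * 0.041)   # number of bikers in the population
p_no     <- 1 - nbikers / pop_size    # population proportion of non-bikers

set.seed(1)  # arbitrary seed

# Simulate `reps` sample proportions of non-bikers for a given sample size.
# rhyper(nn, m, n, k) returns the number of non-bikers (m) in each of nn
# samples of k people drawn without replacement.
sim_phat <- function(samp_size, reps = 5000) {
  rhyper(reps, m = pop_size - nbikers, n = nbikers, k = samp_size) / samp_size
}

sizes  <- c(50, 100, 300)
sim_se <- sapply(sizes, function(s) sd(sim_phat(s)))  # simulated SEs
clt_se <- sqrt(p_no * (1 - p_no) / sizes)             # formula-based SEs

data.frame(n = sizes, simulated = round(sim_se, 4), formula = round(clt_se, 4))
```

The two columns should agree closely. Agreement in spread says nothing about shape, so the histograms are still needed to judge how normal each distribution looks.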

A first draft of this lab was adapted from a lab created by Dr. Kari Lock-Morgan (which I can no longer find or access). In addition to changing much of the text, I have used a different data set and modified the coding exercises.

The lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.