Lab 4: Confidence Intervals

Learning Objectives

Today, we will construct two types of bootstrap confidence intervals:

Using the standard error of the bootstrap distribution
Using percentile-based methods

We will also use R to simulate capture-mark-recapture data to evaluate the Lincoln-Petersen estimator.

Before we get started, don’t forget to set the seed of R’s random number generator. Doing so will make sure you get the same results every time you render your document.

Lab Exercises

Quoted from an article published in the MN Conservation Volunteer

In response to Minnesota’s lack of large pike, the DNR launched a study in 1989 to examine whether special regulations could produce more large pike in Minnesota lakes. The study experimented with several types of length limits that restricted which pike anglers could keep. For example, a slot limit of either 20 to 30 inches or 22 to 30 inches was implemented on five north-central lakes (all pike within the slot had to be released).

One of these lakes was Medicine Lake near Bemidji; a slot limit of 22–30 inches was instituted in 1989. Data from this lake, collected using trap nets, were highlighted in the article (see the graphic, below):

Picture showing the distribution of pike lengths pre and post regulation.

The “post-regulation” length data in the figure come from trap nets set in 2002-2003. Today, we will look to see if changes were evident as early as 1993. Length data for fish sampled in 1988 and 1993 are in a data file called Pikedata.csv.

I’ve added a line in your template file to read in the data, pike<-read.csv("Pikedata.csv"). I’ve also included the statement: pike$year<-as.factor(pike$year). This will tell R that we want to treat year as a categorical variable. Run this code, then look at the first several rows using the head function to see the different variable names.

Before analyzing data, it is always good to explore whether there are any strange observations. In this case, there is an observation with a recorded length of 0. We can see this observation if we use favstats(~lenght.inches, data=pike). Before you begin the exercises, below, add a line of code to drop this observation using the filter function.

For each exercise, include any relevant output (tables, summary statistics, plots) in your answer. Doing this is easy! Use the template provided, lab4.qmd. Place any relevant R code in a code chunk, and click Render.

Produce side-by-side histograms showing the length data from 1988 and 1993, similar to the graphic above (but without the fish ;). Use: gf_histogram(~y | x , data=). Remember, the | creates 2 panels, one for each group. R will put position the two histograms side-by-side. To stack them on top of each other, we could use: gf_histogram(~length.inches, data=pike) + facet_wrap(~year, ncol=1) (facet_wrap is a function in the ggplot2 library that makes it easy to create multi-panel plots). This also works: gf_histogram(~length.inches, data=pike) %>% gf_facet_wrap(~year, ncol=1).
Another good way to visualize the data is to use the gf_density function. In this case, use the fill=~z argument to overlay the two distributions on the same plot: i.e., gf_density(~y, fill=~z, data= ). You could also consider side-by-side boxplots using: gf_boxplot(y~x, data=).
What is the mean length (in inches) of pike sampled in 1988 and in 1993? Remember to use: mean(y~x, data=).
The diffmean function can be used to calculate the difference in means. Specifically, diffmean(y~x, data=) will calculate the difference in the mean of y for each category of x. Use diffmean to see if the average size of fish has changed over time.
Create 5000 bootstrap statistics (i.e., 5000 differences in means, each calculated using a different bootstrap data set). You will need to use the do function. We will want to resample the observations separately for each year (to mimic how the data were collected). We can do this by specifying the groups= argument with the resample function. I.e., use resample(pike, groups=year).
Use the bootstrap standard error method to create a 95% confidence interval for the difference in mean lengths. For this question, use sd(~x, data=); (where you fill in x; hint: if you are not sure what to plug in for x, look at the names of the variables associated with the object you created in step 5).
Use the percentile method to create a 95% confidence interval for the mean difference in mean lengths. To calculate the endpoints (quantiles) of the bootstrap distribution, use qdata(~x, p=c(0.025, 0.975), data=). Save the result to an object called int.95 using: int.95<-qdata(...) (filling in the appropriate information where you see ...). You will use this object in the next exercise.

Note: you can save AND print the result from running a line of code by surrounding the code in a set of parentheses, e.g.:

(int.95<-qdata(...))

Just make sure you have an equal number of parentheses on both sides. If you don’t, you will likely see a + sign in the R console (see below). This tells you that R is waiting for additional input. To get unstuck, click the “Esc” (escape) button! Then add any missing parentheses. Picture illustrating when R is looking for additional input and where the escape button allows you to recover the ability to write code in the console.

Plot a histogram of your bootstrap statistics using: gf_histogram(~x, data=) %>% gf_vline(xintercept = int.95, col="red"). R will return two warnings in the console, but you can ignore these. If successful, you will see your interval overlaid on the bootstrap distribution! Note, this works because we saved the output from qdata in the previous exercise in an object called int.95. Some details - R saves the output from qdata as a “numeric vector” (essentially, two numbers) - which you can see if you type class(int.95). By contrast, R will save the output from the confint function as a “data.frame”. This has implications, as we will see, if we want to overlay an interval calculated using the confint function (we will need slightly different syntax).
How do your answers in [6] and [7] compare? Choose one of the two intervals, above, and interpret the result in the context of the problem.
The mosaic library has a function confint that can also be used to calculate bootstrap confidence intervals once you have created an objecting holding all your bootstrap statistics. Try out the following: confint(object, level=0.95, method="stderr"), where object is the name of the object you created when computing the bootstrap distribution in step 5. Also, try changing the method to percentile. Do your results agree with your answers in parts 6 and 7? What happens if you change the level argument to 0.99?

Note: although confint is a handy function for calculating confidence intervals, it is slightly more difficult to add the confidence intervals to our bootstrap distribution when using confint. This is because the upper and lower values are stored in the same row of a data.frame (rather than as a numeric vector as we saw when using the qdata function). Instead, we can use:

conf<-confint(object, level=0.95, method="percentile")

gf_histogram(~diffmean, data=object) %>% gf_vline(xintercept=~lower, data=conf, col="red") %>% gf_vline(xintercept=~upper, data=conf, col="red")

Do these results prove that the regulation has led to larger pike? Why or why not?

One of the simplest approaches to estimating abundance of wildlife populations is the Lincoln-Petersen estimator. The approach requires two independent samples of the population. The first sample is used to ``mark’’ individuals. The second sample is used to estimate the proportion of marked individuals in the population, which facilitates estimation of the total population size, $N$. These steps are summarized, below.

Step 1: An initial sample of $n_1$ individuals is taken. All of these individuals are marked (e.g. by tagging, radiocollaring, paintball marking, or some other means to distinguish marked and unmarked individuals). They are then returned to the population.

Step 2 A second, independent sample of $n_2$ individuals is taken (usually at a later point in time, assuming marked and unmarked individuals have had time to “mix” together). From this sample, we determine:

$m_2$ = the number marked individuals in the second sample
$n_2$ = the total number of individuals in the second sample

The proportion of marked individuals in the second sample is used to estimate the proportion of marked individuals in the population ($p = \frac{n_1}{N}$). I.e.:
\[\hat{p} =\frac{m_2}{n_2}\]

Step 3: Setting $\hat{p}$ equal to $p$, motivates the Lincoln-Petersen estimator: $\hat{p} = p \Rightarrow \frac{m_2}{n_2} = \frac{n_1}{N} \Rightarrow \hat{N} = \frac{n_1n_2}{m_2}$.

Note, if we do not see any marks in the second sample ($m_2=0$), then $\hat{N} = \infty$. To avoid this problem, we usually use a slightly modified estimator (the Chapman estimator):

\[\hat{N} = \frac{(n_1+1)(n_2+1)}{(m_2+1)}-1\]

I wrote a function to simulate the process of data collection and abundance estimation using the Chapman version of the Lincoln-Petersen estimator. The function has 3 arguments:

N = population size
$n_1$ = size of the first sample (all will be marked and released)
$n_2$ = size of the second sample

lp.est<-function(N,n1,n2){
# Create a population of n1 marked individuals and N-n1 unmarked individuals 
# for the second sampling occassion
  pop.data<-c(rep("M",n1), rep("U", N-n1)) 

# Take a sample of n2 individuals (without replacement) from this population  
  samp.data<-sample(pop.data, size=n2, replace=FALSE) 

# Estimate p
  m2<-sum(samp.data=="M")
  N.hat<-(n1+1)*(n2+1)/(m2+1)-1
  return(N.hat)
}

This code is contained in thelpfuncs.R file (contained in your lab4 directory). To be able to use this function, you have to run this code (this can be done by “sourcing” the code (source("lpfuncs.R")). I.e., source(file) will run the code contained in file.

You can generate “data” and an estimate of N by calling this function (with specific values of $N, n_1$, and $n_2$). For example, assume: a) the population consists of 2000 individuals, b) in our first sample we capture and mark 300 individuals, and c) in the second sample, we capture 300 individuals ($N =2000, n_1 = 300$, and $n_2 = 300$). To generate an estimate of $N$ under these assumptions and sampling effort, we would use:

lp.est(N=2000, n1=300, n2=300)

[1] 2058.114

We see that lp.est has generated an estimate of $N$ equal to 2058 (slightly bigger than the true population size, $N=2000$). Note: you will get a different estimate each time you use this function, since it generates the samples randomly. Simulation (e.g., using R!) is a great way to evaluate statistical estimators and also the effect of sample size (e.g., for planning purposes).

Create a sampling distribution for $\hat{N}$, assuming a population size of 2000 individuals, an initial sample of 50 individuals, and a second sample of 100 individuals. Specifically, generate 5000 different estimates using do and the lpest function with $N=2000$, $n_1$ = 50, and $n_2$ = 100. Then calculate the mean $\hat{N}$ and plot a histogram of the resulting estimates.
Repeat the process, but with $N=2000$, $n_1$ = 300, and $n_2$ = 300.
Compare your results from [1] and [2]. What happened when we increased the size of the samples, $n_1$ and $n_2$?

Simulation is also a great way to explore the effect of assumption violations on estimator peformance. In the case of the Lincoln-Petersen/Chapman estimator, we make the following assumptions:

The population is closed to additions (births, immigrants) and deletions (deaths, emigrants).
All animals are equally likely to be captured in each sample.
Marks are not lost and are not overlooked by the observer.

I wrote a second function to explore what might happen if the second assumption was violated. In many cases, it is plausible that animals might change their behavior after being caught, making it more difficult to capture them again. When this happens, we say the animals become “trap-shy”. In other cases, some individuals may be easier to catch after having been previously caught, in which case we say these animals are “trap-happy”.

The lp.est.het function, below, allows us to generate data that violate this second assumption by passing an additional argument, ps that gives the relative probability of sampling marked and unmarked individuals in the second sample. For example: ps = c(2,1) suggests that marked individuals are twice as likely to be captured as unmarked individuals on the second sampling occasion.

lp.est.het<-function(N,n1,n2, ps){
# Create a population of n1 marked individuals and N-n1 unmarked individuals 
# for the second sampling occassion
  pop.data<-c(rep("M",n1), rep("U", N-n1)) 

# Take a sample of n2 individuals (without replacement) from this population  
# ps = relative probability of capturing marked and unmarked individuals.  
# For example: ps = c(2,1) suggests that marked individuals are twice as likely to 
# be captured as unmarked individuals on the second sampling occassion.
  samp.data<-sample(pop.data, size=n2, replace=FALSE, prob=c(rep(ps[1],n1), rep(ps[2],N-n1))) 

# Estimate p
  m2<-sum(samp.data=="M")
  N.hat<-(n1+1)*(n2+1)/(m2+1)-1
  return(N.hat)
}

Use the lp.est.het function to generate 1000 samples, and thus a sampling distribution for the case where $N=2000$, $n_1$ = 300, $n_2$ = 300, and ps = c(2,1) (animals are trap-happy). Calculate the mean $\hat{N}$ and plot a histogram of the resulting estimates.
Repeat the process, but in this case with ps = c(1,2) (animals are trap-shy).
One way that we evaluate estimators is in terms of Bias = mean(estimate) - true parameter. Calculate the bias for the Lincoln-Petersen estimator in the case of trap-shy and trap-happy individuals simulated in [1] and [2], above. How does heterogeneous detection probabilities (trap-happy or trap-shy individuals) impact our estimates?
Come up with at least one reason why animals might exhibit a trap-happy response.

A first draft of this lab was adapted from a lab created by Dr. Kari Lock-Morgan (which I can no longer find or access). In addition to changing much of the text, I have used a different data set and modified the coding exercises.

The lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported.