Lab 5: Hypothesis Tests

Learning Objectives

  • Be able to create randomization distributions to conduct hypothesis tests using R
  • Understand how confidence intervals and hypothesis tests complement each other

Lab Exercises

Pseudoscorpions

For the first part of this lab, we will explore an example from Whitlock and Schluter’s The Analysis of Biological Data (p. 545).

Picture of a pseudoscorpion.

Pseudoscorpions of the species Cordylochernes scorpioides live in tropical forests, where they ride on the backs of harlequin beetles to reach the rotting figs on which they feed. Females of the species are promiscuous and mate with multiple males over their short lifetimes. It is unclear what advantage a female gains by mating multiple times, because the males don’t help care for her young, and mating just once provides all the sperm she needs to fertilize her eggs.

One possible advantage is that the sperm of some males is genetically incompatible with a given female and, by mating multiple times, a female increases the chances of mating with at least one male whose sperm is compatible with her. To investigate this idea, Newcomer et al. (1999) recorded the number of successful broods by female pseudoscorpions randomly assigned to one of the two treatments. One group of females was each mated to two different males (DM), whereas females in the other group were each mated twice to the same male (SM). By mating each female twice, the same total amount of sperm was provided in both treatments, but DM females received genetically more diverse sperm than SM females. The researchers compared the mean number of successful broods in each treatment group.

You can access the data by loading the abd library and typing data(Pseudoscorpions).

  1. Calculate the mean number of successful broods in each group and the difference in means (DM - SM). Remember the mean and diffmean functions! Note that, by default, R calculates SM - DM. If you want DM - SM, you can add a minus sign out front: -diffmean(y~x, data=). Just be consistent (if you use a minus sign here, also use one when creating the randomization distribution in the next exercise). We will want to refer back to the difference in means during step [4], so assign the difference in means to a named object. For example, sample.diffmean = diffmean(y~x, data=).

Remember, you can save AND print the result from running a line of code by surrounding the code in a set of parentheses, e.g.:

(sample.diffmean = diffmean(y~x, data=))

Just make sure you have an equal number of parentheses on both sides!

  2. Using the do function, create a randomization distribution for the difference in means. Plot the randomization distribution. Hint: you will want to use the shuffle function: diffmean(y~shuffle(x), data=).
  3. Where is the distribution centered? Why?
  4. Calculate the p-value. Is the difference statistically significant? If it is statistically significant, can you infer that the observed difference is caused by the treatment (DM versus SM)? To answer this question, you need to consider how the data were collected.

Hint: You can calculate the p-value using the prop function. Make sure you plot the randomization distribution and overlay the statistic for your original data set. In this case, since the original statistic is negative and the randomization distribution is centered on 0, we can use 2*prop(~ diffmean <= object, data=), where object is the name of the object holding the difference in means from step [1]. The prop call gives the proportion of the randomization distribution that is at least as extreme as our sample statistic in the lower tail, and multiplying by 2 makes the test two-sided. If the original statistic were positive, then we would use 2*prop(~ diffmean >= object, data=). For other possibilities, including p-values for one-sided tests, see the examples near the end of the slides for Section 4.2.
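The logic of steps [1] to [4] can also be sketched in base R, without the mosaic helpers. This is only an illustration under stated assumptions: the data frame toy, its columns treatment and broods, and the values in it are made up (they are not the Pseudoscorpions data), and replicate and sample stand in for do and shuffle.

```r
# Hypothetical toy data (NOT the Pseudoscorpions data): two groups of 6
toy <- data.frame(
  treatment = rep(c("DM", "SM"), each = 6),
  broods    = c(4, 3, 5, 2, 4, 3, 2, 1, 3, 2, 0, 2)
)

# Difference in group means, DM - SM
diff_means <- function(y, g) mean(y[g == "DM"]) - mean(y[g == "SM"])

# Wrapping the assignment in parentheses saves AND prints the result
(obs_diff <- diff_means(toy$broods, toy$treatment))

# Randomization distribution: shuffle the labels, recompute the statistic
set.seed(1)
rand_diffs <- replicate(
  5000,
  diff_means(toy$broods, sample(toy$treatment))
)

# Centre of the randomization distribution (close to 0 under the null)
mean(rand_diffs)

# Two-sided p-value: double the proportion in the tail on the same
# side as the observed statistic
if (obs_diff >= 0) {
  p_value <- 2 * mean(rand_diffs >= obs_diff)
} else {
  p_value <- 2 * mean(rand_diffs <= obs_diff)
}
(p_value <- min(p_value, 1))
```

Because the labels are shuffled without regard to the response, the shuffled differences scatter around 0, which is why the tail proportion can be doubled to get a two-sided p-value.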

Home range overlap: comparing VHF and GPS data

For this next set of exercises, we will consider locations of white-tailed deer collected by Chris Kochanny in the early 2000s during his master’s thesis in the Department of Fisheries, Wildlife, and Conservation Biology here at the UMN. Chris collared 14 deer with GPS collars; the collars also sent out a VHF signal, allowing Chris to obtain locations via triangulation. Doing so provided a means to compare home ranges derived from the two different data collection methods (automated GPS data collection and manually collected VHF data). Chris conducted this study when GPS collars were first showing up on the market, and it was not clear how well they would perform relative to the VHF collars that were more common at the time. Nighttime locations can be difficult to obtain with VHF data. Thus, he also compared GPS daytime and GPS nighttime locations to help understand the potential implications of daytime-only sampling when using VHF collars.

For today’s lab, we will use data from his study to compare GPS- and VHF-based home ranges, using daytime locations only. We will explore two different methods for generating an outer home-range boundary:

  • Minimum convex polygon (MCP)
  • Kernel density estimate (KDE)

The minimum convex polygon (MCP) method essentially connects outer points, with the restriction that the resulting polygon is convex (Figure 1).

Multi-panel plot showing MCP-based home ranges for each deer (using GPS and VHF data)

Figure 1. MCP-based home range contours for white-tailed deer with locations collected using GPS or VHF. Each panel represents a different individual.
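The idea behind the MCP can be tried out in base R. The sketch below is an illustration only: the coordinates in locs are made up, chull (from the grDevices package that loads with R) finds the outer points, and polygon_area is a small helper written here (not a function from any home range package) that applies the shoelace formula.

```r
# Hypothetical locations (made-up coordinates): the four corners of a
# unit square plus two interior points
locs <- data.frame(
  x = c(0, 1, 1, 0, 0.5, 0.25),
  y = c(0, 0, 1, 1, 0.5, 0.75)
)

# chull() returns the row indices of the points on the convex hull,
# in order around the polygon
hull_idx <- chull(locs$x, locs$y)
hull <- locs[hull_idx, ]

# Area of a polygon from its ordered vertices (shoelace formula)
polygon_area <- function(x, y) {
  x_next <- c(x[-1], x[1])
  y_next <- c(y[-1], y[1])
  abs(sum(x * y_next - x_next * y)) / 2
}

(mcp_area <- polygon_area(hull$x, hull$y))
```

The two interior points do not affect the result: only the outermost locations determine the polygon, which is one reason MCP home ranges can be sensitive to a few unusual locations.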

The kernel density estimator (KDE) essentially produces a smooth histogram (we saw this in chapter 2 with the gf_density function). For home range estimation, one starts by creating a smooth 2-dimensional histogram using x and y coordinates of the animal locations. This smooth density curve can then be used to determine an outer boundary that encompasses 95% of the volume above the (x, y) plane (Figure 2). This boundary can be interpreted as enclosing the area where we are (most) likely to find the animal 95% of the time during the period of data collection.

Multi-panel plot showing KDE-based home ranges for each deer (using GPS and VHF data)

Figure 2. KDE-based home range contours for white-tailed deer with locations collected using GPS or VHF. Each panel represents a different individual.
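To make the "95% of the volume" idea concrete, here is a rough sketch using kde2d from the MASS package. It is an illustration under stated assumptions, not the code used to produce the home ranges in this lab: the locations are simulated, the default bandwidth of kde2d is used, and the 95% contour level is found by ranking grid cells from highest to lowest density.

```r
library(MASS)  # provides kde2d(); MASS ships with standard R installations

# Hypothetical locations (simulated, NOT deer data)
set.seed(42)
x <- rnorm(500)
y <- rnorm(500)

# Smooth 2-D density on a grid that is padded beyond the data range
dens <- kde2d(x, y, n = 100,
              lims = c(range(x) + c(-2, 2), range(y) + c(-2, 2)))
cell_area <- diff(dens$x[1:2]) * diff(dens$y[1:2])

# Rank grid cells from highest to lowest density and accumulate their
# share of the total volume until 95% is reached
z_sorted  <- sort(as.vector(dens$z), decreasing = TRUE)
cum_share <- cumsum(z_sorted) / sum(z_sorted)
level95   <- z_sorted[which(cum_share >= 0.95)[1]]

# Home range size: total area of the cells at or above the 95% level
(area95 <- sum(dens$z >= level95) * cell_area)
```

The resulting area depends on the bandwidth (the amount of smoothing), so different software settings can give noticeably different home range sizes for the same locations.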

If you take FW5603 Habitats and Regulation of Wildlife, then you will likely explore these approaches in more detail. The adehabitatHR and rhr libraries in R have functions for determining home range contours using both of these methods (and several others as well). For today’s lab, we will work with home ranges that were generated using R code and functions in the adehabitatHR library. The data are contained in the file hrests.csv.

Comparison of GPS and VHF estimates of home range size

Let’s begin by considering just the KDE-based home ranges (Figure 2). Are the home ranges similar in size when using GPS and VHF data collection methods? We can explore this question by calculating the mean of the differences in home range size (across individuals). Let’s start by creating a new variable that contains, for each individual, the difference between the GPS and VHF estimates of home range size when using a kernel density estimator to determine the home range.

  hrests<-mutate(hrests, GPSminusVHF.KDE=gps.kde-vhf.kde)   
  mean(~GPSminusVHF.KDE, data=hrests) 
[1] -11.02764

We see that the GPS home ranges are slightly smaller on average - but is this difference statistically significant? Let’s assume we can treat these individuals as though they are a random sample from a much larger population of individuals at Camp Ripley that could be monitored using these 2 methods. How could we test whether the population mean difference (between GPS-based and VHF-based home ranges) is 0?

When creating our randomization distribution, we need to account for the fact that each individual has 2 home ranges (one generated from VHF locations and one from GPS locations), i.e., the data come from a paired design. In addition, we need to generate data that are consistent with the null hypothesis (of no difference). One way we could do this would be to randomly shuffle the observations within each row. This would preserve the structure of the data (paired observations). If the null hypothesis is true, then the labels “GPS” and “VHF” do not mean anything (which is why we permute the observations within each row). Now, we just have to figure out how to do this with R!

It turns out that shuffling the data within the rows is equivalent to randomly choosing the sign of the difference in home range size for each individual. To see this, consider the first 3 rows of the data set:

hrests[1:3,c(1:3,6)] 
    ID  gps.kde  vhf.kde GPSminusVHF.KDE
1 001A 173.4233 197.9480       -24.52469
2 001B 163.8496 181.2088       -17.35923
3 020A  87.7458 179.1769       -91.43109

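If we swap the GPS and VHF measurements, we just change the sign of the difference. For deer 001A, for example, GPS - VHF = 173.4233 - 197.9480 is about -24.52, whereas swapping the two labels gives 197.9480 - 173.4233, or about +24.52.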

In other words, we can generate samples from the randomization distribution by first randomly choosing the sign of each difference (GPS - VHF) and then multiplying these signs by the actual differences as follows:

# Generate random signs for each difference
signs<-sample(c(-1,1), size=nrow(hrests), replace=TRUE)
# Multiply these signs by the original differences to create a data set
# generated under the null hypothesis
(randdat<-signs*hrests$GPSminusVHF.KDE)
 [1]   24.5246919  -17.3592305   91.4310851   19.2262663   -3.3763502
 [6]   -0.1109666  -13.2612132  -49.1634054   45.1281194  -30.8170907
[11] -150.5696609  -18.5350533   -6.3849785   51.1735010

Taking the mean of these values (i.e., the average difference in home range size across all animals) gives us 1 observation from the randomization distribution:

mean(randdat)
[1] -4.149592

To recap: choosing random signs for the difference in home range size (GPS - VHF) for each animal is equivalent to shuffling the (GPS and VHF) labels for each animal and then calculating the difference in home range size (“GPS” - “VHF”). Taking the mean of these differences thus gives us 1 observation from a randomization distribution that is consistent with the null hypothesis (mean home range size is the same for VHF and GPS data collection methods) and also consistent with how the data were collected (two values for each individual).
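The recap can be checked with a small base R sketch. Everything here is for illustration only: the data frame toy and its values are made up (they are not the deer data), one_rand_mean is a helper defined just for this example, and replicate is used in place of do.

```r
# Hypothetical paired measurements (made-up numbers, NOT the deer data)
toy <- data.frame(
  gps = c(120, 95, 150, 80, 110, 130),
  vhf = c(135, 90, 170, 100, 105, 160)
)
toy$diff <- toy$gps - toy$vhf
obs_mean <- mean(toy$diff)

# Swapping the two labels within a row only flips the sign of that
# row's difference
swapped_diff <- toy$vhf - toy$gps
all(swapped_diff == -toy$diff)

# One randomization sample: a random sign for each difference, then the mean
one_rand_mean <- function(d) {
  signs <- sample(c(-1, 1), size = length(d), replace = TRUE)
  mean(signs * d)
}

# Repeat many times to build the randomization distribution
set.seed(2)
rand_means <- replicate(5000, one_rand_mean(toy$diff))

# Two-sided p-value: share of randomization means at least as far
# from 0 as the observed mean
(p_value <- mean(abs(rand_means) >= abs(obs_mean)))
```

Because random sign flips make the randomization distribution symmetric about 0, counting both tails with abs gives essentially the same answer as doubling the proportion in one tail.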

  1. Generate a scatterplot relating VHF-based and GPS-based home range sizes when using the KDE method. Make sure to use informative x and y axis labels. Remember: gf_point(y~x, data=, xlab= , ylab= ). Comment on the relationship between the GPS and VHF estimates of home-range size (are they associated, and if so, in what way?).
  2. I have added code to your template that: 1) calculates the mean difference for the sample data; and 2) creates 1 observation from the randomization distribution (using the approach above). Use the do function to create 1000 observations from the randomization distribution.

Hint: the do function tells R to do the same thing many times. You just need to copy the code for creating 1 observation from the randomization distribution inside the { } next to the do function! If you get stuck, have a look back at the introductory slides for Lab 5.

  3. Is the difference in home range sizes (GPS versus VHF using the KDE method) statistically significant? State your null and alternative hypotheses, plot the randomization distribution, generate a p-value, and interpret the results in the context of the problem. You can calculate the p-value using the prop() function (see lecture notes from Section 4.2).

  4. Now, create a bootstrap distribution of the mean difference between GPS-based and VHF-based home range sizes. To do this, resample the paired differences stored in the variable GPSminusVHF.KDE (i.e., use do along with mean(~GPSminusVHF.KDE, data=resample(hrests))). If you get stuck, have a look back at the introductory slides for Lab 5.

  5. Use the confint function to calculate a 95% confidence interval for \(\mu_{GPS}-\mu_{VHF}\). Use the confidence interval to conduct the hypothesis test in step [3].

  6. What information does the confidence interval give that the p-value doesn’t? What information does the p-value give that the confidence interval doesn’t?

  7. Which procedure (hypothesis test or confidence interval) is most useful in this case? Justify your answer.
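The link between a bootstrap confidence interval and a two-sided test can be sketched in base R. This is an illustration under stated assumptions: the differences in d are made up (not the deer data), replicate and sample stand in for do and resample, and the interval is a simple percentile interval, which may differ from what confint reports depending on the method it uses.

```r
# Hypothetical paired differences (GPS - VHF); made-up numbers,
# NOT the deer data
d <- c(-15, 5, -20, -20, 5, -30, -8, 12, -25, -3)

# Bootstrap: resample the differences with replacement, take the mean
set.seed(3)
boot_means <- replicate(5000, mean(sample(d, replace = TRUE)))

# 95% percentile interval for the population mean difference
(ci <- quantile(boot_means, probs = c(0.025, 0.975)))

# A two-sided test of "mean difference = 0" at the 5% level corresponds
# (approximately) to checking whether 0 falls outside this interval
(zero_inside <- ci[1] <= 0 && 0 <= ci[2])
```

Note that the bootstrap distribution is centered near the sample mean difference, whereas a randomization distribution is centered on the null value of 0.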

If you have time, consider one or more of the comparisons below:

  • (VHF, MCP) vs. (GPS, MCP) = (comparison of VHF and GPS data, using MCPs)
  • (GPS, MCP) vs. (GPS, KDE) = (comparison of home range estimators, using GPS data)
  • (VHF, MCP) vs. (VHF, KDE) = (comparison of home range estimators, using VHF data)

Literature Cited

Kochanny, C. O., G. D. DelGiudice, and J. Fieberg. 2009. Comparing winter home ranges of white-tailed deer using GPS and VHF telemetry. Journal of Wildlife Management 73:779-787.

Newcomer, S. D., J. A. Zeh, and D. W. Zeh. 1999. Genetic benefits enhance the reproductive success of polyandrous females. Proceedings of the National Academy of Sciences 96:10236-10241.

A first draft of this lab was adapted from a lab created by Dr. Kari Lock-Morgan (which I can no longer find or access). In addition to changing much of the text, I have used a different data set and modified the coding exercises.

The lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.