Lab 2: Exploring Data, One and Two Variable Summaries

Learning Objectives for Lab 2

The goal of this lab is to become familiar with functions for plotting and summarizing data, many of which are listed in the tabs below.

Single categorical predictor

  • tally(~x, data=) for a frequency table
  • pie(tally(~x, data=)) to produce a pie chart
  • gf_bar(~x, data=) and gf_percents(~x, data=) for bargraphs showing counts and percentages, respectively

Two categorical predictors

  • tally(y~x, data=) or tally(~x+y, data=) or tally(~x|y, data=) for producing frequency tables
  • mosaicplot(y~x, data=) to produce a mosaic plot
  • gf_bar(~x, fill=~y, data=, position=position_dodge()) for side-by-side bargraphs
  • gf_bar(~x, fill=~y, data=) for segmented bargraphs
  • gf_bar(~x|y, data=) for multi-panel bargraphs

Single quantitative variable

  • gf_dotplot(~x, data=, method="histodot") for a dotplot
  • gf_histogram(~x, data=) for a histogram showing counts
  • gf_dhistogram(~x, data=) for a density histogram (scaled so the total area is 1)
  • gf_boxplot(~x, data=) for a boxplot
  • gf_density(~x, data=) for a density plot (smooth histogram)

One quantitative and one categorical variable

  • gf_histogram(~x|y, data=) for side-by-side histograms for different groups (y)
  • gf_boxplot(y~x, data=) for side-by-side boxplots
  • gf_density(~y, fill=~x, data=) for overlaid density plots for multiple groups (x)

Two quantitative variables

  • gf_point(y~x, data=) for a scatterplot
  • gf_point(y~x, data=) %>% gf_line() for a scatterplot with lines

For specific examples, see Handout slides from Ch 2.
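
As a quick illustration of the formula template shared by the functions listed above, here is a minimal sketch. It assumes the mosaic package is installed and uses R's built-in mtcars data as a stand-in; the object names cars and cyl_counts are made up for this example and are not part of the lab.

```r
library(mosaic)  # also loads ggformula, which provides the gf_ functions

# Assumption: the built-in mtcars data stands in for a lab data set
cars <- mtcars
cars$cyl <- factor(cars$cyl)  # treat number of cylinders as categorical

# One categorical variable: frequency table and bargraph
cyl_counts <- tally(~cyl, data = cars)
cyl_counts
gf_bar(~cyl, data = cars)

# One quantitative and one categorical variable: side-by-side boxplots
gf_boxplot(mpg ~ cyl, data = cars)
```

In each call, the variable names go in the formula and the data frame goes in the data argument.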

Getting Started

If you open the Lab 2 Assignment on Posit Cloud, you will again find a template for today’s lab (lab2.qmd). Please add your name next to the author line.

You will see at the top of the template that I included code to tell R that we want to use the knitr, mosaic, and dplyr packages.

library(knitr)
library(mosaic)
library(dplyr)

For today’s lab, we will start by exploring some data that were collected by Glenn DelGiudice and his colleagues. Glenn is an adjunct professor here at the University of Minnesota and a Research Scientist with the Minnesota Department of Natural Resources. The data were collected as part of a long-term study exploring the impact of winter severity and timber harvesting on the ecology of white-tailed deer. Deer were captured at 4 different study sites and radiocollared to monitor their survival. Each year, new deer were collared to replace individuals that died or were lost to follow-up (e.g., because of battery failure).

Each record (row in the data set) corresponds to a different deer captured during the study. Many different characteristics were measured on each individual when they were captured (represented by the columns in the data set). The capture methods are described in the following paper:

DelGiudice, G. D., B.A. Sampson, D.W. Kuehn, M. Cartensen, and J. Fieberg. 2005. Understanding margins of safe capture, chemical immobilization, and handling of free-ranging white-tailed deer. Wildlife Society Bulletin, 33:677-687.

To read in the data set and assign it to an object named deerdat, type the following into your lab2.qmd or lab2.R file and run it:

deerdat<-read.csv("DeerCaptures.csv")

We can look at the first 6 records of the data set (6 is the default for head) using:

head(deerdat)

What if we wanted to look at the first 10 records? Can we add another argument? Let’s take a look at the head function.

?head

R help files will likely be difficult to navigate at first. It helps, however, to know that all help files are structured similarly. Let’s walk through the sections of the help file for the head function.

  • Description: provides information about what the function does. Here, we learn that head “Returns the first or last parts of a vector, matrix, table, data frame or function.”
  • Usage: gives examples of how you would use the function. Here, we see that we need at least 1 argument (x) and that the function works with many different types of objects (data frames, matrices, tables, etc.).
  • Arguments: provides a list of options that you can supply to the function. Here, we see that the argument n determines the “size of the resulting object”.
  • Details: gives more information about how the function works.
  • Value: describes what is returned when you run the function. Here, we see that the function returns an object like x, but smaller.
  • Author: lists who wrote the code.
  • Examples: provides examples illustrating the use of the code. These can be really helpful. You can highlight and run the examples to gain greater insight into the function. Here, we find that there is also a tail function that can be used to display the last several rows of the data set.

Let’s explore the examples for the head function.

letters
head(letters, n=-6L)
tail(letters, 2)

Here, we see that we can specify the number of observations using the argument n, and that we can ask for all but the last 6 records by setting n=-6L. The last example shows how we can select just the last 2 observations of a data set using the tail function. So, by looking at the examples at the end of the help file, we learn more about the function and how it can be used, and we even learn about another function we didn’t previously know about!
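
The same n argument works with data frames. Below is a minimal sketch that uses R's built-in mtcars data as a stand-in (an assumption made so the code runs without the lab file); the object names are made up for illustration.

```r
# Assumption: the built-in mtcars data frame stands in for a lab data set
first_ten <- head(mtcars, n = 10)   # first 10 rows
nrow(first_ten)

last_three <- tail(mtcars, n = 3)   # last 3 rows
nrow(last_three)

# A negative n drops elements from the end: all but the last 6 letters
all_but_last_six <- head(letters, n = -6L)
length(all_but_last_six)
```

Replacing mtcars with your own data frame should behave the same way.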

Lab 2: Exercises

As with the first lab, I have put together a series of exercises. This time, the exercises are spread out among the 4 numbered tabs below. You should enter your answers in lab2.qmd. Remember, all code needs to go in code chunks and all text needs to go outside of the code chunks. A few resources that will help you along the way include:

  • the tabs at the top of this html file that show the different functions in R
  • the Handouts for the lectures in class
  • the cheat sheets I shared on Canvas

Let’s look at the different variable names using the names function:

names(deerdat)

Many of the variables can be inferred from their names (e.g., deer_id and site identify each unique deer and the different study sites, agecapt is the age at capture). Here is a bit more information about each of the variables:

  • cdate = capture date
  • agecapt = age at capture (in years)
  • cwgtkg = weight to the nearest 0.5 (kg) at time of capture
  • year = year of study
  • cort = Cortisol (ug/dl), measure of stress?
  • indtime = time elapsed between intramuscular injection of the xylazine-ketamine combination and when the deer is safe to handle without response.
  • immobtime = time elapsed between induction and recovery.
  • recovt = time elapsed between intravenous injection of yohimbine and when deer is walking away from the release site.
  • ageclass = 1 if juvenile, 2 if adult
  • wsi = winter severity index, calculated by accumulating 1 point for each day with an ambient temperature < –17.7 °C and 1 point for each day with a snow depth > 38 cm between 1 November and the end of the Julian week of the capture.
  • wsnow, wtemp = snow and temperature components of wsi
  • xdosage = xylazine dosage (mg/kg)
  • kdosage = ketamine dosage (mg/kg)

Use these data to complete the exercises below: make sure to include your code and typed answers in lab2.qmd.

  1. What are the cases? Which variables are categorical and which are quantitative?

Notes: You can see how R treats the variables by typing class(deerdat$variable_name), where variable_name is the name of the variable. Categorical variables will be listed as “character” (the default for text columns read in with read.csv in R 4.0 or later) or “factor” (we will talk about factors in a later lab). The $ is used to select a variable in a data frame (similar to the select function).

deer_id looks numeric, but like your x500, should probably be thought of as categorical. Typically, dates are represented in terms of the number of days since a particular reference point, in which case they are best thought of as numeric. Months or days of the week, by contrast, might be thought of as categorical. If you have time at the end of the lab (or on your own), see http://www.statmethods.net/input/dates.html for some information on how to work with dates in R.
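
To see how class and $ behave, here is a small sketch using a hypothetical toy data frame (the data frame toy and its column names and values are made up for illustration and are not the deer data).

```r
# Hypothetical toy data frame; names and values are made up for illustration
toy <- data.frame(
  animal_id = c(101, 102, 103),
  site      = c("A", "B", "A"),
  weight_kg = c(45.5, 60.0, 52.5),
  stringsAsFactors = FALSE   # keep text columns as character
)

class(toy$site)       # "character"
class(toy$weight_kg)  # "numeric"
class(toy$animal_id)  # "numeric", even though it is only a label

# An ID stored as a number can be converted so R treats it as categorical
toy$animal_id <- factor(toy$animal_id)
class(toy$animal_id)  # "factor"
```

R reports how a variable is stored, which is not always how it should be interpreted.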

  1. How many deer were captured at each study site? Use the tally function illustrated during the lecture to find out. Remember, you can type ?tally to find out more about this function.

  2. What proportion of these deer were captured at Dirty Nose (DN)? Again, use the tally function to find out. Hint: You will have to add the format argument to get proportions rather than counts. If you get stuck, look at the examples at the end of the help file for tally.

  3. Create a bargraph showing the number of animals captured at each of the 4 study sites. Use the function gf_bar to create the bargraph. By default, this bargraph will display counts at each of the sites. Create a second bargraph, but this time using gf_percents as in: gf_percents(~variable, data=datasetname).

  4. Use the pie function along with the tally function to make a pie chart illustrating the proportion of captured deer associated with each study site. Hint: example code can be found in the Section 2.1 lecture.

  5. Consider the graphs above, and the frequency table: which site had the most deer included in the study?

  6. Suppose a researcher compares survival rates among the four study sites and finds that survival is lower at Dirty Nose (DN) than at the other 3 sites. After this discovery, they decide to explore whether more fawns might have been captured at Dirty Nose (since fawns survive at lower rates than adults). Create an appropriate bargraph (side-by-side or multi-panel) that helps to explore this question. Note: you can create side-by-side bargraphs using gf_percents(~ageclass, fill=~site, position = position_dodge(), data = deerdat). By default, however, R will use the total number of deer in the data set as the denominator when calculating the percentages. To get percentages within each site, add the denom argument to gf_percents. Try: gf_percents(~ageclass, fill=~site, position = position_dodge(), denom = ~ fill, data = deerdat)

  7. Create a mosaic plot to further explore this question. You can use mosaicplot(y~x, data= ) to create the plot.

  8. Lastly, look at the relative frequencies of adults and fawns at each site using: tally(~ageclass|site, data=deerdat, format="proportion"). Do you think differences in ages among the 4 study sites might be important when comparing survival rates? Briefly justify your answer by referring to the table and the plots above.
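
To check how the | in tally sets the denominator, here is a minimal sketch on R's built-in mtcars data (an assumption, used so the code runs without the lab file; the mosaic package must be installed). Proportions are computed within each level of the variable after the |, so each column of the resulting table should sum to 1.

```r
library(mosaic)

# Assumption: mtcars stands in for the lab data; am (transmission) and
# cyl (number of cylinders) play the roles of the two categorical variables
cars <- mtcars
cars$am  <- factor(cars$am, labels = c("automatic", "manual"))
cars$cyl <- factor(cars$cyl)

# Proportion of each transmission type within each cylinder group
prop_within_cyl <- tally(~am | cyl, data = cars, format = "proportion")
prop_within_cyl

# Each column (one per cylinder group) sums to 1
colSums(prop_within_cyl)
```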

  1. Create a dotplot illustrating the ages of the captured deer using the function gf_dotplot. The default settings don’t work very well here. Try adding binwidth=0.1, as in gf_dotplot(~x, data=, method="histodot", binwidth=0.1). NOTE: you will get a warning that 19 rows were removed (you can ignore this warning; R is just letting us know that 19 cases were missing an age at capture). Describe the shape of the distribution (is it symmetric or skewed [and, if skewed, in which direction]?).

  2. Calculate the mean age at capture. Note: there are some individuals with missing ages. You will have to add the argument na.rm=TRUE to make the mean function work, i.e., mean(~variable, data=, na.rm=TRUE). Missing data in R are given values = NA, so na.rm = TRUE tells R to remove (rm) the missing values.

  3. Create a histogram of weights at capture using the function gf_histogram. Again, describe the shape of the distribution (is it symmetric or skewed [and, if skewed, in which direction]? is it bimodal?). Note: you can again ignore the warning message here, which is just informing us that there are 99 observations that have missing weight information.
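
The effect of na.rm on missing values can be seen with a small base R sketch. The vector ages below is made up for illustration and is not the deer data.

```r
# Made-up ages with one missing value (NA)
ages <- c(0.5, 1.5, NA, 3.5, 2.5)

mean(ages)                # NA: a single missing value makes the mean missing
mean(ages, na.rm = TRUE)  # 2: the NA is removed before averaging
sum(is.na(ages))          # 1: number of missing values
```

The formula version from the mosaic package, mean(~variable, data=, na.rm=TRUE), handles missing values in the same way.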

  1. Create side-by-side boxplots, summarizing weights at capture for fawns and adults. Use the gf_boxplot function. Here, and in the next few questions, you can again ignore the warning from R (which is alerting us to the fact that there are cases with missing weights at capture).

  2. Create side-by-side histograms illustrating the distribution of weights at capture for fawns and adults.

  3. Create smooth histograms (kernel density estimates), illustrating the distribution of weights at capture for fawns and adults using the gf_density function.

  4. Calculate the mean weight for fawns and also for adults (hint: you can do this with one line of code). Again, note that you will have to supply the na.rm=TRUE argument to the mean function.

  5. Why do you think the distribution of weights (in question 3 of Section 2.2) was bimodal?

If you finish early, try exploring some of the other variables. Or, explore some of the data within the Lock5Data and abd packages. To see what data are available, first load these packages using:

library(abd)
library(Lock5Data)

Then, you can see what data are available by typing:

data(package="abd")
data(package="Lock5Data")

To load a data set, you can type data(name of data set). For example, to load the SpiderSpeed data set which we will see in the Lecture covering Section 2.4, type:

data(SpiderSpeed)
head(SpiderSpeed)

You can also find more information about the data set by exploring the help file:

?SpiderSpeed

Now, explore some of these data using your newly developed(ing) graphical skills!

The introduction to this lab was adapted from a lab created by Dr. Kari Lock-Morgan (which I can no longer find or access).

In addition to changing some of the text, I have used a different data set and modified the coding exercises.

The lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.