Lab 1: Introduction to R and RStudio
Learning Objectives for Lab 1
- Become familiar with R and RStudio
- Be able to read in a data set, select variables from this data set, select a subset of records, and produce a scatterplot
- Calculate simple summary statistics (e.g., mean, min, max) for a data set
- Write your first reproducible report.
Getting Started
NOTE: This document contains a lot of information that can be accessed by clicking on the different tabs, below. You should read through the first 4 tabs (Introduction to R and Rstudio, Installing packages…, Creating a reproducible …, and Wolf and Moose Counts…) BEFORE lab. Then, complete the exercises on the last tab (Lab 1 Exercises) during our scheduled lab time.
The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the textbook and also to analyze real data. I find that students are often confused about the difference between R and RStudio (many list RStudio on their CV, when it is probably more important to list R or both R and RStudio!). R is the name of the programming language itself and RStudio is a graphical user interface (or GUI) used to interact with R. RStudio lets us run R in an enhanced working environment by providing us with additional functionality (e.g., menu options, multiple windows for plots, code, help files, etc.).
The panel in the upper right contains your workspace as well as a history of the commands that you’ve previously entered. Any plots that you generate will show up in the panel in the lower right corner.
The panel on the left is where the action happens. It’s called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output.
As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R.
From the pre-lab, hopefully you remember that you can name and store results in objects using either an equal sign (=) or a <- arrow. For example, we can store a vector containing the numbers 1, 2, and 3 (generated by typing 1:3 in R) into an object called a using:
a <- 1:3
To refer to the stored object, we can just type its name (a in this case):
a
[1] 1 2 3
This will display the object we created. We can also use a in future calculations. For example, we can add 2 to all of these numbers by typing:
a + 2
[1] 3 4 5
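As a small sketch of the same idea, the lines below store a vector, reuse it in further calculations, and save a new result under a second name (b is a name made up for this illustration). Note that arithmetic is applied to every element of the vector at once.

```r
# Store the numbers 1, 2, 3 in an object called a
a <- 1:3

# The equal sign also works for assignment; b is a hypothetical second object
b = a + 2

# Both objects can be used in later calculations
a * b        # element-by-element product of a and b
sum(a)       # adds up the elements of a
length(b)    # number of elements stored in b
```

Typing the name of either object on its own line will print its current contents.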
For more complex problems, you will want to write code in a file that we can save, share, and access at a later point in time.
Many users contribute code to do all sorts of things in R. They do this by writing “packages” (bundles of code, sometimes combined with data) and making them available for public download. Accessing this code requires 2 steps:
A. The package has to be downloaded (“installed”) onto your computer. This step can be accomplished using the function install.packages() or via the menus in RStudio (Tools -> Install Packages). I have installed the packages you will need for each lab as part of the different projects in our RStudio Cloud workspace. However, you will need to install packages if you are going to use your own personal computer to work through lab and homework exercises. Packages only need to be installed once.
B. Each time we open R, we have to “tell R” if we want to use any of the add-on packages that we have downloaded. We do this by typing library(packagename) (replacing packagename with the name of the package we are interested in using).
We will make extensive use of three packages in our labs: mosaic, knitr, and dplyr. We can tell R that we want to access the objects in these packages by typing:
library(knitr)
library(mosaic)
library(dplyr)
We will also frequently check out the Lock5Data and abd packages and make use of data in these packages.
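If you are working on your own computer, it can help to check which of these packages are already installed before calling library(). The following is only a sketch of one way to do that in base R; the object names pkgs, is_installed, and to_install are made up for this illustration.

```r
# Packages mentioned in this lab
pkgs <- c("knitr", "mosaic", "dplyr", "Lock5Data", "abd")

# TRUE for each package that is already installed on this computer
is_installed <- sapply(pkgs, requireNamespace, quietly = TRUE)
is_installed

# Names of the packages that still need to be installed
to_install <- pkgs[!is_installed]
to_install

# Uncomment the next line to install anything that is missing (only needed once)
# install.packages(to_install)
```

On our cloud workspace, to_install should be empty because the packages have already been installed for you.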
You could produce homework and lab reports by running code at the command prompt or by cutting and pasting code from a text file. You could also cut and paste output from running code into a separate document (e.g., in MS Word). But, I think we (i.e., you) can learn to do better. We will use some additional functionality in RStudio that makes producing reproducible reports “easy.” To do this, we will write reports using something called Quarto. Don’t worry, I will provide you with templates that you can use for completing your assignments.
If you log in to RStudio Cloud and open the workspace for Biometry and then select the Lab1 Assignment, you will see a file called lab1.qmd in the Files tab on the pane in the bottom right corner of your RStudio window. We will refer to this file as your “R quarto file” or “your report”. Click on the file name to open the file. All you need to do to complete the lab is to type up your brief answers and the R code (when necessary) in the spaces provided in the document.
Structure of quarto documents
There are three parts to a .qmd file (for a cheatsheet, see this link):
- A YAML header: the text at the top of the document where you can supply a title, your name (next to author:), and additional options that let you customize the output created from running your code. This is a good time to add your name next to “author:” in lab1.qmd.
- Text sections: here you can take advantage of Markdown syntax for formatting your report (see this quick overview to learn more). You will be adding text in response to questions as part of each lab.
- Code chunks: R code that can be run, with the output combined with your text to create a reproducible lab report. A code chunk looks like this:
```{r}
# code goes here
```
Important things to keep in mind when working with .qmd files:
- All R code must appear in a code chunk - i.e., within a:
```{r}
# code goes here
```
- All code that you need to complete your analysis must be contained in your .qmd file. When you create a reproducible report using a .qmd file, R will “forget” everything you have previously run in the console.
- All text should appear OUTSIDE of your code chunks, or be preceded by a leading # (the hash symbol is used to add comments, which R will not try to run as code).
- There is no need to cut and paste results from running your code into your .qmd file. All results will be included in your output document by default when you “render” it.
This is a good time to see what happens when you click “Render” in RStudio.
For today’s lab, we will explore population counts of moose and wolves on Isle Royale, Michigan. We can access these data by typing the following command:
isleRoyale<-read.csv("isleRoyale.csv")
This command instructs R to read in data held in a comma-delimited file. R is able to find the data since I put the data in the same folder I used to create the lab1 project on Posit Cloud (otherwise, we would have had to tell R where the data “sit”, i.e., the directory containing the data). Using projects helps with staying organized, makes it easy to tell R where to find data associated with a project, and also helps with sharing files. Imagine writing code that tells R that the data sit in a directory called Fiebergs_documents/fall2020/myfavoritestatsclass/ (this code would only work on my computer). If we use an RStudio project, we can send anyone the directory containing all of our files and the code should run!
Working from home: If you are using R and RStudio on your own computer, you can create a new project by going to the File menu in RStudio, selecting “New project”, then “New Directory”, and then specifying a name for the directory and location on your computer to hold all your files associated with a project. This will create a “.Rproj” file in this new directory. Double clicking on this file will open RStudio and automatically associate the session with your project. Alternatively, you can move between projects by going to the File menu, selecting “Open Project” and selecting the project you want to work on.
After running the read.csv command, you should see that the workspace area in the upper righthand corner of the RStudio window (under the Environment tab) now lists a data set called isleRoyale that has 60 observations on 3 variables. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.
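To see what read.csv does without needing the course files, here is a minimal sketch that writes a tiny comma-delimited file to a temporary folder and reads it back in. The three rows are the first three rows of the Isle Royale data (printed later in this lab); the file name isleRoyale_small.csv and the object names csv_file and small are made up for this illustration.

```r
# Write a tiny comma-delimited file to a temporary location.
# The first line holds the column names; each later line is one year.
csv_file <- file.path(tempdir(), "isleRoyale_small.csv")
writeLines(c("Year,Wolf,Moose",
             "1959,20,538",
             "1960,22,564",
             "1961,22,572"),
           csv_file)

# Read the file into a data frame, just as read.csv("isleRoyale.csv")
# does for the full data set
small <- read.csv(csv_file)

# Compact summary: number of observations, number of variables, column types
str(small)
```

Because csv_file stores the full path to the file, this sketch runs from any working directory. Inside a project, the file name alone is enough, as in the command above.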
On Isle Royale, wolves are counted from fixed-wing aircraft each winter. Up until 2002, moose numbers were estimated by reconstructing the population from recoveries of dead moose of known ages. Since 2002, aerial surveys have been used to count moose. You can find out more about moose and wolves on Isle Royale here: http://www.isleroyalewolf.org/.
We can take a look at the data by typing its name into the console:
isleRoyale
What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second column gives the year of the case, and the third and fourth columns give the number of estimated wolves and moose on the island in each year. Use the scroll bar on the right side of the console window to examine the complete data set.
Note that the row numbers in the first column are not part of the isleRoyale data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored the Isle Royale data in a kind of spreadsheet or table called a data frame.
We can use the row indices to select specific rows. For example, we could have looked at the first three rows by typing:
isleRoyale[1:3,]
  Year Wolf Moose
1 1959   20   538
2 1960   22   564
3 1961   22   572
Note that 1:3 creates a vector containing the numbers 1, 2, and 3. The [1:3 part says give me the first 3 rows, and the ,] part says give me all of the columns. If you wanted only the first 2 columns, you could have typed:
isleRoyale[1:3,1:2]
  Year Wolf
1 1959   20
2 1960   22
3 1961   22
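The bracket notation follows the pattern data[rows, columns]. The sketch below rebuilds the three rows printed above as a small data frame (the name small is made up for this illustration) and shows a few more ways of picking out rows and columns.

```r
# Rebuild the three rows printed above as a small data frame
small <- data.frame(Year  = c(1959, 1960, 1961),
                    Wolf  = c(20, 22, 22),
                    Moose = c(538, 564, 572))

small[2, ]                          # second row, all columns
small[, 2]                          # all rows of the second column (a vector)
small[c(1, 3), c("Year", "Moose")]  # rows 1 and 3, columns chosen by name
dim(small)                          # number of rows, then number of columns
```

Leaving the space before or after the comma empty means "all rows" or "all columns", respectively.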
You can see the names of these columns (or variables) by typing:
names(isleRoyale)
[1] "Year"  "Wolf"  "Moose"
At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The names function, for example, took a single argument, the name of a data frame.
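To make the idea of functions and arguments concrete, here is a short sketch using a few base R functions. The vector wolves holds the wolf counts from the first three rows printed earlier; its name is made up for this illustration.

```r
# Wolf counts from the first three rows of the data
wolves <- c(20, 22, 22)

# A function with a single argument
mean(wolves)

# A function with two arguments; the second one is supplied by name
round(mean(wolves), digits = 1)

# A function with three named arguments
seq(from = 1959, to = 1961, by = 1)
```

Naming arguments (digits = 1, from = 1959) is optional when they are given in the expected order, but it usually makes code easier to read.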
One advantage of RStudio is that it comes with a built-in data viewer. Click on the name isleRoyale in the upper right window that lists the objects in your workspace. This will bring up an alternative display of the data in the upper left window. You can close the data viewer by clicking on the “x” in the upper lefthand corner.
A note on expectations: For each exercise, include any relevant output (tables, summary statistics, plots) in your answer. Doing this is easy! Just place any relevant R code in a code chunk, and click the Render button.
Let’s start by looking at the data a little more closely. We can access a single variable in the data frame using the select function:
select(isleRoyale, Wolf)
This command will only show the number of wolves in each year (if it doesn’t work, there is a good chance you have a typo or, more likely, the error is due to R being case-sensitive. For example, note that Wolf has a capital W, so select(isleRoyale, wolf) will return an error!). Another way to select a variable from a data frame is to refer to the data frame name, followed by a $ and then the variable name as in the following:
isleRoyale$Wolf
- What command would you use to extract the counts of moose in each year? Try it! Add your code to lab1.qmd.
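The two approaches return the same numbers but package them differently. The sketch below, run on the three rows printed earlier (the object names small, wolf_df, and wolf_vec are made up for this illustration), shows that select gives back a data frame with one column, whereas $ gives back a plain vector.

```r
library(dplyr)

# The first three rows of the data, rebuilt as a small data frame
small <- data.frame(Year  = c(1959, 1960, 1961),
                    Wolf  = c(20, 22, 22),
                    Moose = c(538, 564, 572))

wolf_df  <- select(small, Wolf)   # a data frame with a single column
wolf_vec <- small$Wolf            # a numeric vector

class(wolf_df)
class(wolf_vec)
```

Which form is more convenient depends on what you plan to do next: functions that expect a data frame work with the first, and functions that expect a vector of numbers work with the second.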
R has some powerful functions for making graphs. This year, we are going to use a new package called ggformula for creating all of our plots. This package provides a simple way to interface with the ggplot2 package (ggplot2 is the most popular graphing package among R users). The ggformula package will allow us to use a simple and common syntax for creating plots. We can create a simple scatterplot of the number of wolves counted in each year using:
gf_point(Wolf~Year, data=isleRoyale)
If you add this code (in a code chunk) to your lab1.qmd file and run it, you will see a big increase throughout the 1980s and 1990s after the wolf population crashed following the introduction of canine parvovirus. The moose population crashed in 1996 (attributed to a severe winter and an extreme outbreak of winter ticks).
The plot itself should appear under the “Plots” tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with two arguments separated by a comma. The first argument is a formula, y ~ x, that specifies the variables for the y-axis (Wolf) and the x-axis (Year); the second argument, data, names the data frame that holds these variables.
We can add a line to the plot using the following code:
gf_point(Wolf~Year, data=isleRoyale) %>% gf_line()
This code combines two different functions and can be read from left to right as “create a plot showing points and then add a line.” The %>% is called the piping operator. Basically, it takes the output of the code on its left and passes it as the first argument to the function on its right. Although we won’t use this functionality much in this class, it is a powerful way to combine tasks in R.
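The pipe is not specific to plotting. As a minimal sketch, the lines below compute a rounded mean in two equivalent ways, once with nested function calls and once with the pipe. The vector moose holds the moose counts from the first three rows of the data; the names moose, nested, and piped are made up for this illustration.

```r
library(dplyr)   # loading dplyr makes the %>% operator available

# Moose counts from the first three rows of the data
moose <- c(538, 564, 572)

# Nested calls are read from the inside out
nested <- round(mean(moose), digits = 1)

# Piped calls are read from left to right: take moose, then average, then round
piped <- moose %>% mean() %>% round(digits = 1)

identical(nested, piped)
```

Both versions give the same answer; the piped version simply lists the steps in the order they are carried out.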
R Dialects: Like most spoken languages, there are many different dialects of R. In this class, you will learn a simple dialect of R (using the mosaic package) aimed at teaching new R users introductory statistics. There are other dialects of R that can be more powerful, but they are more challenging for beginners to learn. My goal is to help you to have a positive experience learning R, so that you stick with it after this class rather than revert to alternatives like Excel that are more limiting.
- Write code to produce a scatterplot showing how moose numbers have changed on Isle Royale over time. Add the code to your lab report (lab1.qmd). Also, include text (outside of the code chunk!) that briefly describes how the moose population has changed over time.
Now, suppose we want to plot the moose-to-wolf ratio over time. Let’s look at moose and wolf numbers in the first year of the data set.
isleRoyale[1,]
  Year Wolf Moose
1 1959   20   538
To compute the moose-to-wolf ratio, we could use the fact that R is really just a big calculator. We can type expressions like:
538/20
[1] 26.9
to see the ratio in 1959. We could repeat this once for each year, but there is a faster way. We can compute the ratios for all years at once using the mutate function:
mutate(isleRoyale, moose.per.wolf=Moose/Wolf)
If you type this code, you will see that the ratios (number of moose / number of wolves) have been added as a column to the data set. Take a look at a few of these numbers and verify that they are correct. R does not, by default, actually update our data set to include this new variable. To do this, we would need to make sure to update our isleRoyale object as follows:
isleRoyale<-mutate(isleRoyale, moose.per.wolf=Moose/Wolf)
We can then make a plot of these ratios over time using:
gf_point(moose.per.wolf~Year, data=isleRoyale) %>% gf_line()
Summary Statistics
We can calculate many different summary statistics using built in functions in R. Let’s determine the average number of wolves in the data set by typing:
mean(~Wolf, data=isleRoyale)
There are several other functions in the mosaic package that share a similar syntax: goal(~variable, data=dataset) (these include min, max, sd, favstats among others).
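If you are curious about what the formula syntax is doing, the sketch below computes the same kinds of summaries in base R, where the variable is pulled out of the data frame with $ instead of being named in a formula. It uses the moose counts from the first three rows of the data and is offered only as a cross-check; the labs will continue to use the mosaic syntax. The name small is made up for this illustration.

```r
# The first three rows of the data, rebuilt as a small data frame
small <- data.frame(Year  = c(1959, 1960, 1961),
                    Wolf  = c(20, 22, 22),
                    Moose = c(538, 564, 572))

# With mosaic loaded, the formula version would be: mean(~Moose, data = small)
# The base R version pulls the column out with $ and passes it to the function
mean(small$Moose)     # average of the three moose counts
median(small$Moose)   # middle value of the three moose counts
sd(small$Moose)       # sample standard deviation
```

In both styles, the function name states the goal and the remaining pieces tell R which variable in which data set to summarize.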
- What are the maximum and minimum numbers of wolves that have been counted on Isle Royale? Continue to add your code to lab1.qmd as you go. As long as you add the code to your lab1.qmd file, there is no need to copy and paste results (they will automatically show up when you render your document!).
In addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than (>), less than (<), and equal to (==). For example, we can ask R to determine whether each wolf count is greater than or less than the mean wolf count using:
mean.W<-mean(~Wolf, data=isleRoyale)
mutate(isleRoyale, wolves.gt.mean=Wolf > mean.W)
Here, we first created an object called mean.W to store the mean number of wolves. We then created a new variable named wolves.gt.mean (“wolves greater than mean”) containing values of either TRUE or FALSE depending on whether the wolf count in each year was greater than or less than the mean count, respectively.
Variable names: instead of wolves.gt.mean you could use any variable name that suits you, provided:
- you use only letters and numbers and the two punctuation marks . and _
- you do not use spaces anywhere in the name
- the first character in the name is NOT a number or underscore
The variable wolves.gt.mean contains a different kind of data than we have considered so far. In the isleRoyale data frame, our values are numerical (the year, the number of wolves and moose). Here, we’ve asked R to create logical data, data where the values are either TRUE or FALSE. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with many of them.
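Here is a minimal sketch of how logical data behave, run on the first three rows of the data (the name small is made up for this illustration). It also shows again that the output of mutate has to be assigned to an object if we want to keep the new variable.

```r
library(dplyr)

# The first three rows of the data, rebuilt as a small data frame
small <- data.frame(Year  = c(1959, 1960, 1961),
                    Wolf  = c(20, 22, 22),
                    Moose = c(538, 564, 572))

# Mean wolf count for these three rows only
mean.W <- mean(small$Wolf)

# Without assignment, the new column is printed but not kept
mutate(small, wolves.gt.mean = Wolf > mean.W)
names(small)   # still only Year, Wolf, Moose

# Assigning the result back to small keeps the new column
small <- mutate(small, wolves.gt.mean = Wolf > mean.W)

class(small$wolves.gt.mean)   # the new column holds logical data
sum(small$wolves.gt.mean)     # TRUE counts as 1, so this counts years above the mean
```

Because TRUE is treated as 1 and FALSE as 0 in calculations, sum gives the number of years above the mean and mean would give the proportion of such years.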
We could also accomplish this task with 1 line of code (make sure to balance your parentheses when typing this expression!):
mutate(isleRoyale, wolves.gt.mean=Wolf > mean(~Wolf, data=isleRoyale))
Escape!: it is easy to make coding mistakes in which you forget to add a comma, close a set of parentheses, or include a second quotation mark. In these cases, you may see a + sign in the R console. This tells you that R is waiting for additional input. To get unstuck, press the “Esc” (escape) key!
Subsetting data with the filter function
There are many situations where we will want to select only those cases that meet a certain condition. For example, we could create a new data set that only includes years with greater than average wolf numbers using:
# Create a new logical variable = TRUE if wolves are > mean, save the resulting data set
isleRoyale<-mutate(isleRoyale, wolves.gt.mean=Wolf>mean.W)
# Subset data to only include records where Wolf > mean
large.wolf.years = filter(isleRoyale, wolves.gt.mean==TRUE)
Or, again with 1 line of code:
large.wolf.years = filter(isleRoyale, Wolf > mean(~Wolf, data=isleRoyale))
This creates a data set large.wolf.years containing only those cases with wolf counts that were greater than the mean. OK, time for a coding challenge!
- Determine the year with the largest wolf count. Hint 1: think max rather than mean. Hint 2: use double equals (==) to ask for values “equal to”.
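If you get stuck, it may help to see filter at work on a tiny example. The sketch below uses the first three rows of the data and an arbitrary condition on Year (chosen only for illustration; the names small, later.years, and later.years.base are also made up). It also shows that bracket indexing with the same condition picks out the same rows.

```r
library(dplyr)

# The first three rows of the data, rebuilt as a small data frame
small <- data.frame(Year  = c(1959, 1960, 1961),
                    Wolf  = c(20, 22, 22),
                    Moose = c(538, 564, 572))

# Keep only the rows where the condition is TRUE
later.years <- filter(small, Year > 1959)
later.years

# Bracket indexing with the same condition selects the same rows
later.years.base <- small[small$Year > 1959, ]

nrow(later.years)   # number of rows that met the condition
```

The condition inside filter can be any comparison that returns TRUE or FALSE for each row, including one that uses == together with a summary function.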
We can also produce summary statistics for different groups using the mean function (if we have loaded the mosaic library). For example, we can compare moose counts during years where wolf numbers are \(>\) or \(\le\) the overall mean wolf count using:
mean(Moose~ wolves.gt.mean, data=isleRoyale)
This provides another example of the common syntax used in the mosaic package: goal(y~x, data=dataset), allowing us to summarize the values of y for different groups, x. In this case, we get the mean number of moose for cases in which wolves.gt.mean = TRUE (wolf numbers were greater than the overall mean in the data set) or FALSE (wolf numbers were less than or equal to the overall mean in the data set). This leads us to one last challenge for today!
- Calculate the minimum number of wolves separately for the years before 1996 and the years after 1996.
Last steps
Make sure you have entered all of your R code in the lab1.qmd file. When you are done, click the Render button. You should see a lab1.pdf file show up in the lower right corner of RStudio. This is the file you should download and submit to Canvas. To download, you can export this file to your downloads folder on your computer by clicking on the gear in the lower right screen, then selecting “export” from the menu options.
The introduction to this lab was adapted from an OpenIntro lab written by Andrew Bray and Mine Çetinkaya-Rundel which in turn was adapted from a lab written by Mark Hansen of UCLA Statistics. The OpenIntro lab and also a css file which I used to partially format this document can be found here: https://github.com/mine-cetinkaya-rundel/sta101_f15
Some content was also adapted from a previous lab created by Dr. Kari Lock-Morgan (which I can no longer find or access).
In addition to changing some of the text, I have used a different data set and modified the coding exercises.
The lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.