Lab 3: Two variable summaries, Sampling Distributions

Lab 3: Learning Objectives

Today, we will use data collected in class, along with one or more data sets from the Lock5Data library, to:

become more fluent in R and its functions for exploratory data analysis
explore how resistant various summary statistics (mean, median, sd) are to outliers
use computational methods to understand the concept of a sampling distribution

Summarizing Two Quantitative Variables

gf_point(y~x, data=) for a scatterplot
gf_point(y~x, data=) %>% gf_line() for scatterplot with lines
gf_point(y~x, data=) %>% gf_lm() for scatterplot with regression line
cor(y~x, data=) to calculate the correlation between 2 quantitative variables.
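
As a minimal sketch of these commands in action, the example below uses mtcars, a data set that ships with R. It is only a stand-in (an assumption for illustration) for whatever data you are exploring; the variables mpg and wt play the roles of y and x.

```r
library(mosaic)  # provides the formula interface for cor() and loads ggformula

# mtcars ships with R; it stands in here for your own data set
gf_point(mpg ~ wt, data = mtcars) %>% gf_lm()   # scatterplot with a fitted line

# Correlation between fuel efficiency (mpg) and weight (wt)
r <- cor(mpg ~ wt, data = mtcars)
r
```

The sign of r gives the direction of the relationship, and its distance from 0 gives a sense of how tightly the points follow a line.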

Quantitative and categorical variables

gf_histogram(~y|x, data=) for side-by-side histograms of y, one panel per group (x)
gf_boxplot(y~x, data=) for side-by-side boxplots
gf_density(~y, fill=~x, data=) for overlaid density plots, one per group (x)

For specific examples, see the handout slides from Ch 2.
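
A short sketch of the same three plots, assuming the built-in iris data set as a stand-in (it is not one of the lab's data sets). Here Sepal.Length is the quantitative variable and Species is the grouping variable.

```r
library(mosaic)

# iris ships with R: Sepal.Length is quantitative, Species is categorical
gf_histogram(~Sepal.Length | Species, data = iris)        # one panel per species
gf_boxplot(Sepal.Length ~ Species, data = iris)           # side-by-side boxplots
gf_density(~Sepal.Length, fill = ~Species, data = iris)   # overlaid densities

# Group-wise summaries to accompany the plots
group_means <- mean(Sepal.Length ~ Species, data = iris)
group_means
favstats(Sepal.Length ~ Species, data = iris)
```

The formula y ~ x reads as "y broken down by x", which is why the same pattern works for both the plots and the summary statistics.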

Sampling Distributions

Sampling distribution = the distribution of a sample statistic (e.g., sample mean, sample proportion, correlation coefficient, etc) computed using different samples of the same size from the same population.
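
To make the definition concrete, here is a small base-R sketch. The population is simulated and purely hypothetical, and replicate() is used only as one possible way to repeat the sampling step.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 5000 values (skewed on purpose)
population <- rexp(5000, rate = 1/5)

# Draw 1000 samples of size 10 and keep the mean of each sample
sample_means <- replicate(1000, mean(sample(population, size = 10)))

hist(sample_means, main = "Simulated sampling distribution of the mean")
mean(population)     # population mean
mean(sample_means)   # center of the sampling distribution
sd(sample_means)     # spread of the sampling distribution
```

Each of the 1000 values in sample_means is one sample statistic; their histogram approximates the sampling distribution for samples of size 10 from this population.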

Lab 3: Exercises

Let's load our usual libraries:

library(mosaic)
library(knitr)
library(abd)

For each exercise, include any relevant output (tables, summary statistics, plots) in your answer. Doing this is easy! Use the template provided, lab3.qmd. Place any relevant R code in a code chunk, and click Render.

The data we collected from this year (as well as in past years) containing the time students in FW4001 were able to stand on one foot (on tiptoes) and their height category are contained in the file TippyAllYYYY.csv (where YYYY refers to the current year). Use these data to complete the following two exercises.

Use the read.csv function to read in these data. Is there a relationship between height and amount of time students can stand on one foot? Justify your answer and include supporting information (e.g., summary statistics and one or more visualizations of the data).

Remember, all of your code needs to be included in a code chunk and all text needs to occur outside of a code chunk. No need to copy results into lab3.qmd (these will automatically show up when you render your file).

Are there any outliers in the data set? If there are outliers, describe the Time variable using various summary statistics from class (for both the center and spread of the distribution), first with the outlier(s) included, then with the outlier(s) removed. Comment on the changes. Are the statistics you chose resistant to outliers? Note: you can use the favstats(~x, data=) function to quickly calculate a variety of summary statistics. Also, remember you can use something like newdat<-filter(olddata, variable.name < outlier.value) to create a data set with outliers removed.
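
The favstats and filter steps might look like the sketch below. The data frame toy and its values are made up for illustration and are not the class data; the cutoff of 300 is likewise hypothetical.

```r
library(mosaic)

# Made-up balance times (seconds); the value 300 plays the role of an outlier
toy <- data.frame(Time = c(12, 15, 18, 20, 22, 25, 300))

favstats(~Time, data = toy)   # min, quartiles, max, mean, sd, n, missing

# Drop the extreme value; dplyr::filter is spelled out to avoid any clash with stats::filter
toy_trim <- dplyr::filter(toy, Time < 300)
favstats(~Time, data = toy_trim)

# Compare how much each measure of center moves when the outlier is removed
c(mean_all = mean(toy$Time), mean_trim = mean(toy_trim$Time))
c(median_all = median(toy$Time), median_trim = median(toy_trim$Time))
```

In this toy example the median shifts by one second while the mean drops by roughly 40 seconds, which is the kind of comparison the exercise asks you to comment on.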

Note: if you use boxplot(s) to determine outliers, you may find that your answer depends on whether you visualize side-by-side boxplots (i.e., to see if there are any “outliers” for different height categories when considered separately) or if you pool all of your data (to look for extreme observations in the population as a whole). You might consider your answer to exercise 1 to guide your approach here.

Remember the pulse experiment from class? The data we collected are in a file called pulseallYYYY.csv (where again, YYYY refers to the current year). Use the read.csv function to read in these data, then complete the next two exercises.

Is there a relationship between treatment (exercise Yes/No) and pulse rate? Describe the relationship using at least one summary statistic (broken down by group), at least one visualization, and a description in your own words.
If you feel up for it, you might also try to visually explore whether there are year-to-year differences in pulse rates after accounting for which treatment group each student belonged to. Alternatively, explore year-to-year variation in students’ abilities to stand on one foot. Try, for example, gf_boxplot(pulse~treatment|year, data=pulseall).

Let's also load the Lock5Data library.

library(Lock5Data)

For the two questions below, you will need to explore one or more data sets in the Lock5Data library. To look at the different data sets in this library, click on the Packages tab in the lower right panel of RStudio and scroll down to find Lock5Data. A few of the better/richer data sets, containing both categorical and quantitative variables, are: StudentSurvey, EmployedACS, ExerciseHours, NutritionStudy. You can also find descriptions of the data sets in the back of your book (Appendix B). To access any of these data sets, type data(datasetname), and make sure to include this code in your .qmd lab report file. Before the data set will show up in your environment (top right of RStudio), you will have to explore it in some way (e.g., try typing head(datasetname) to view the first few rows of the data frame).

Remember to fill in your answers to the exercises, below, in lab3.qmd.

Note for Question 2: to calculate the correlation between two variables in R, we use the cor function (e.g., cor(y~x, data=)). Also, remember, we can create a scatterplot using the gf_point function (e.g., gf_point(y~x, data=)).

Choose two categorical variables from a Lock5Data data set of interest to you, and describe their relationship. Include summary statistic(s), at least one visualization, and a description in your own words.
Choose two quantitative variables from a Lock5Data data set that interests you, and describe their relationship. Is there a relationship? Is it negative or positive? Is it linear or non-linear? Is it weak or strong? Use the sample correlation statistic and the scatterplot to justify your answer.
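
For the two-categorical-variable question, one possible approach is a two-way table plus a bar chart. The sketch below assumes mosaic's tally function and uses the built-in mtcars data as a stand-in, with cyl and am recoded as categories; the object name mt is made up for this example.

```r
library(mosaic)

# mtcars ships with R; cyl (cylinders) and am (transmission) are treated as categories here
mt <- transform(mtcars,
                cyl = factor(cyl),
                am  = factor(am, labels = c("automatic", "manual")))

counts <- tally(~cyl + am, data = mt)          # two-way table of counts
counts
tally(am ~ cyl, data = mt, format = "proportion")   # transmission split within each cyl group

# Side-by-side bars: one cluster per cyl value, colored by transmission
gf_bar(~cyl, fill = ~am, data = mt, position = "dodge")
```

Comparing the conditional proportions across groups is usually more informative than comparing raw counts when the groups differ in size.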

Random numbers: setting the seed

Before we explore the concept of a sampling distribution, let’s “set a seed” for the random number generator in R so that we get the same answer each time we render the lab document. We do this with the set.seed() function, which accepts any number as its argument. Think of R as reading, from left to right, a long list of random numbers printed on a page: set.seed() tells R where on the page to begin reading.

set.seed(241990)
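
A quick way to convince yourself that the seed controls the results is to draw the same random sample twice. This is a minimal sketch; the object names first and second are made up for illustration.

```r
set.seed(241990)
first <- sample(1:100, size = 5)

set.seed(241990)                    # reset to the same starting point
second <- sample(1:100, size = 5)

identical(first, second)            # TRUE: same seed, same "random" draws

third <- sample(1:100, size = 5)    # no reset, so R keeps reading further along the list
third
```

This is why a rendered document gives the same answers every time, as long as the seed is set before any random sampling happens.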

Sampling Distribution from Darwin’s Words

Let's follow up on the Darwin Origin of Species sampling problem. All of the words in the passage from the Origin of Species are contained in the file Darwin.csv. Let’s read in these data, then generate the mean of a random sample of 10 words using the sample function:

Darwin<-read.csv("Darwin.csv")
mean(~nchar, data=sample(Darwin, size=10))

Note: this syntax for the mean function is specific to the mosaic package and will only work if you have told R that you want to check it out from the library using library(mosaic).

If you type this code into the console, you will get a different answer each time. What if we wanted to explore the sampling distribution of the mean (i.e., look at the distribution of sample means across many different samples of size 10) - how could we do this? Well, we could repeatedly type mean(~nchar, data=sample(Darwin, size=10)) and then collect all our answers, but the process would be really tedious. Or, we could learn to do some simple programming in R.

One command that will be very useful in the coming weeks is do, which allows you to “do” any command (or series of commands) multiple times, without having to type the command hundreds of times. Simply typing do(20)* in front of any command will do it 20 times. Try repeating the above command 20 times with the function do: you should get 20 different numbers, each representing the mean of a different sample.

do(20)*{mean(~nchar, data=sample(Darwin, size=10))}

We can store these values in an object called w.means as follows:

w.means<-do(20)*{mean(~nchar, data=sample(Darwin, size=10))}

The results are stored in a data.frame called w.means, containing a single variable called result (to see this, use the head function).

head(w.means)
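
If you do not have Darwin.csv at hand, the structure of what do returns can be explored with a small made-up data frame. In this sketch, toy_words and toy.means are hypothetical names, and the word lengths are invented.

```r
library(mosaic)

# Made-up stand-in for Darwin.csv: one row per word, with its length in nchar
toy_words <- data.frame(nchar = c(2, 3, 3, 4, 4, 5, 5, 6, 7, 9, 10, 11))

# Repeat "sample 5 words, compute the mean length" 20 times
toy.means <- do(20) * {mean(~nchar, data = sample(toy_words, size = 5))}

is.data.frame(toy.means)   # the repetitions come back as a data frame
nrow(toy.means)            # one row per repetition
names(toy.means)           # name of the column holding the sample means
```

Checking names() is a quick way to confirm which variable name to use in later plotting or summary commands.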

As an aside: I chose “w.means” to stand for “word means”, but you could use any name provided that:

you use only letters and numbers and the two punctuation marks . and _.
you do not use spaces anywhere in the name
the first character in the name is NOT a number or underscore

Also, remember that capital letters are treated as distinct from lower-case letters. The objects Wolf and wolf are different.
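
The naming rules can be checked directly at the console. The names below are made up for illustration; the invalid ones are left commented out because R would refuse to parse them.

```r
word.means  <- 1    # letters and a period: fine
word_means2 <- 2    # an underscore, and a digit after the first character: fine

Wolf <- "capitalized"
wolf <- "lower-case"
identical(Wolf, wolf)   # FALSE: R treats these as two different objects

# Each of the following would be a syntax error if uncommented:
# 2means <- 3     (starts with a number)
# _means <- 3     (starts with an underscore)
# w means <- 3    (contains a space)
```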

Remember to fill in your answers to the exercises, below, in lab3.qmd.

Instead of taking 20 samples of size 10, take 1000 samples and compute the mean for each (using the function do). Create a histogram depicting the sampling distribution of the sample means. You can add the true mean (\(\mu\) = 4.94) using: gf_histogram(~x, data=) %>% gf_vline(xintercept=~4.94). What is the standard error of \(\bar{x}\)?

Remember to replace x with the correct variable name. If you forgot the variable name, click on the object you created to store your sample means in the upper right corner of RStudio!

Repeat these steps using a larger sample size of 35 words (note: we collected approximately 35 means in FW4001 the first day of class using a haphazard sampling technique). What is the standard error of \(\bar{x}\) when samples of size 35 are taken?
What do you think would happen to the sampling distribution if we took samples of size 50? Why?
Your guesses (and those of students from past years) from the first day of class (using non-random sampling) are contained in a file called DarwinYYYY.csv (where again, YYYY refers to the current year). Read in these data and calculate the mean. Compare this value to the sampling distribution in step 3. How likely would it be to see a value this extreme if you had taken a random sample?

A first draft of this lab was adapted from a lab created by Dr. Kari Lock-Morgan (which I can no longer find or access). In addition to changing much of the text, I have used a different data set and modified the coding exercises.

The lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.