Sampling words from Darwin’s Origin of Species

Author

John Fieberg

Published

November 11, 2025

Load libraries

library(mosaic)
library(dplyr)
library(googledrive)
library(googlesheets4)

Read in data

Data are read in from Google Sheets (code not shown).
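Because the import code is not shown, the following is only a minimal sketch of how a data frame with the columns used in this document (Timestamp, Blue, Total, phat, year) could be assembled. It assumes that phat is Blue / Total and that year is extracted from Timestamp; neither is confirmed by the source. The `read_sheet()` call appears only as a comment with a placeholder URL, and the stand-in rows are the two observations printed in the data-cleaning output of this document. Base R is used so the sketch runs on its own.

```r
# Hypothetical import step (not run here); "<sheet-url>" is a placeholder:
# mandm_raw <- googlesheets4::read_sheet("<sheet-url>")

# Stand-in for the imported responses: the two rows printed in the
# data-cleaning output of this document.
mandm_raw <- data.frame(
  Timestamp = as.POSIXct(c("2023-09-21 11:52:09", "2023-09-21 05:56:11"),
                         tz = "UTC"),
  Blue  = c(17, 24),
  Total = c(4, 57)
)

# Assumed derivations: sample proportion of blue M&M's and class year
mandm_sketch <- transform(
  mandm_raw,
  phat = Blue / Total,
  year = as.integer(format(Timestamp, "%Y"))
)
mandm_sketch
```

Under these assumptions the derived values match the printed rows (phat of 4.25 and about 0.421, year 2023).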

Data cleaning:

There was one observation with p̂ > 1 (17 blue out of a total of 4), most likely because the counts were entered in the wrong columns. I will drop that observation, though one could instead assume that the Total and Blue columns should be swapped for that case.

  filter(mandm, phat > 1)
            Timestamp Blue Total phat year
1 2023-09-21 11:52:09   17     4 4.25 2023
  mandm <- filter(mandm, phat <= 1.0) # drop the observation with phat > 1
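The alternative mentioned above, swapping the two counts instead of dropping the row, could look like the following sketch. It is not what this analysis does. `fix_swapped()` is a hypothetical helper, and the rule "Blue greater than Total means the columns were swapped" is an assumption.

```r
# Hypothetical helper: where Blue exceeds Total, assume the two counts
# were entered in each other's column and swap them back.
fix_swapped <- function(d) {
  swapped <- !is.na(d$Blue) & !is.na(d$Total) & d$Blue > d$Total
  blue_old <- d$Blue
  d$Blue[swapped]  <- d$Total[swapped]
  d$Total[swapped] <- blue_old[swapped]
  d$phat <- d$Blue / d$Total   # recompute the sample proportion
  d
}

# The flagged observation from above: 17 entered as Blue, 4 as Total
flagged <- data.frame(Blue = 17, Total = 4, phat = 4.25)
fix_swapped(flagged)   # Blue = 4, Total = 17, phat = 4/17
```

Swapping keeps the observation in the data set but relies on a guess about how the error arose, which is why dropping the row is the more conservative choice.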

There was one observation with 57 M&M’s, which likely represents someone who was sick or excused from attending and purchased a full-sized bag.

  filter(mandm, Total > 20) # show bags with more than 20 M&M's
            Timestamp Blue Total      phat year
1 2023-09-21 05:56:11   24    57 0.4210526 2023
  mandm <- filter(mandm, Total <= 20) # drop the full-sized bag

Visualize the sampling distributions

First, the proportion of M&M’s in each bag that are blue, with a vertical reference line at 0.16.

  gf_histogram(~phat | year, data = mandm, xlab = expression(hat(p))) %>%
    gf_vline(xintercept = ~0.16) +
    theme(text = element_text(size = 20)) # change font size of all text
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
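For comparison with the observed histograms, a theoretical sampling distribution can be simulated. The sketch below assumes that each M&M is blue independently with probability 0.16 (the value marked by the vertical line) and uses a hypothetical bag size of 17. Both values are illustrative assumptions, not results from these data.

```r
set.seed(1)          # for reproducibility of the simulation
p_true <- 0.16       # assumed proportion of blue M&M's
n_bag  <- 17         # hypothetical number of M&M's per bag
n_sims <- 10000      # number of simulated bags

# Each simulated bag yields a count of blue M&M's; divide by bag size
phat_sim <- rbinom(n_sims, size = n_bag, prob = p_true) / n_bag

# Theoretical standard error of phat under the binomial model
se_theory <- sqrt(p_true * (1 - p_true) / n_bag)

c(mean_sim = mean(phat_sim), sd_sim = sd(phat_sim), se_theory = se_theory)
```

If the binomial model is reasonable, the simulated mean should sit near 0.16 and the simulated standard deviation near the theoretical standard error. Smaller bags would give a wider distribution.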

Then, the number of M&M’s in each bag.

  gf_dotplot(~Total|year, data=mandm, binwidth=1, dotsize=0.15)  +
   theme(text=element_text(size=20)) #change font size of all text