Sampling words from Darwin’s Origin of Species

Author

John Fieberg

Published

November 11, 2025

Load libraries

library(mosaic)
library(dplyr)
library(googledrive)
library(googlesheets4)
library(flextable) 
set_flextable_defaults(fonts_ignore=TRUE)

Read in new data from 2025

darwin<-read_sheet("https://docs.google.com/spreadsheets/d/1tonGXIoPRYu_rkHZV3UC9c_FmSN7A0r1keWPQuzPujE/edit?usp=sharing" )
darwin<-darwin[-1,]

Rename some of the variables to make them easier to work with

names(darwin)[3]<-"mean.char"  
names(darwin)[2]<-"Sampling.method"

Read in the passage from Darwin so we can calculate the true mean number of characters

oas<-read.csv("data/Darwin.csv")

True population mean - this is what we are trying to estimate!

(mean.nchar<-mean(~nchar, data=oas))

[1] 4.936652

Distribution of sample means from students’ samples

gf_histogram(~mean.char, data=darwin, xlab="Mean Number of Characters") %>% 
  gf_vline(xintercept=~4.93, col="red") %>% 
  gf_vline(xintercept=~mean(~mean.char, data=darwin, na.rm = TRUE),
           col = "blue")

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Histogram of sample means (i.e., mean number of characters chosen in the 10 words)

Mean (of the means) for the class

mean(~mean.char, data=darwin, na.rm = TRUE)

[1] 5.990244

We see that, on average, the mean number of characters per word in the students’ samples is higher than the population mean (4.93 characters). Bias = the difference between the blue and red lines.

(bias<-mean(~mean.char, data=darwin, na.rm = TRUE) - 4.93)

[1] 1.060244

# Methods used by students
methods <- darwin %>% select(Sampling.method) %>% data.frame() 
flextable::flextable(methods, cwidth=60)%>%
  width(width=6)

Sampling.method
first and last word of the sentence
I tried to choose randomly, one or a few words for each line of text (8 lines of text)
I chose words as randomly as possible, a few from each line.
First, I excluded words like "to", "the", and "and" which are used for connection and have a high repetition rate. Then, I selected random words of different lengths from various parts of the text. This ensures the representativeness of the sample because they are taken from different sections of the text.
As there are ten lines, I choose one word from each, the word should not be too long or to short.
The quote is 221 words, so I took every 22nd word to get an equal range.
I use my apple pencil to highlight ten words randomly without looking.
I randomly circled 10 words, following no pattern
I counted the letters in 10 (what I thought to be) random words
Closed my eyes and randomly chose 10 words
Went down in a straight line starting at the top row and chose a word in the middle about every other row
Choose one word from each line as there are 10. And the word should be randomly choosed controlled by my brain. The word should not be the longest one nither the shortest one.
I chose a selection with a high variability of word lengths
I "randomly" selected words in various sections of the text, although my eyes likely drew me to words that were in the middle in length.
randomly
I looked for words that were used most frequently
I whisked my cursor around the passage and stopped at a random word. I then selected the 9 other words that followed, in my case "variability may be partly connected with excess of food. It".
I looked for words that were repeated at least once in the passage and generally appeared to reflect the average length of words around them.
I tried to pick one "random" word from each sentence.
I picked words that occurred frequently (that, they, is) or that seemed to have the middle value of letters compared to the entire text (species, parent).
I chose my 10 words by reading the passage and trying to summarize it in the most simple way possible.
I chose my ten words by reading the passage and identifying words that came up often throughout the entirety of the passage. I also tried to find some variability in the number of characters per selected word to replicate the variability in the passage.
I chose my words based off of the overall themes of the passage: focusing on the differences and rapid evolution of our domesticated animals and plants that humans have controlled for thousands of years. I also chose words based on the overall theme of On the Origin of Species, using words like nature, diversity and so on to highlight the broad theme.
I used a random number generator to select a row 1-9 (10 excluded because it was one word). Then I used another rng to select a word in a row (I estimated 20 words per row, so the rng chose between 1-20)
I assigned each word a number 1-221 then generated 10 random numbers using a number generator to select 10 words
I choose my 10 words by seeing what stood out most and helped summarize the main idea of the passage. If I were to put all these words into a few sentences I think it could summarize the whole passage and I made sure to avoid filler words.
The first word of each line after the page break. I figured the first word of each sentence could be biased, so hopefully the page break is random enough to get a representative sample of words throughout full sentences. 69/10=6.9
I skimmed the article and wrote down the words that reoccurred the most in my eyes
I chose my words by looking at their length and trying to get a variety of different length words in my sample group. I also tried to look at the overall passage and see if there were more shorter words used or more longer words used. There were more shorter length words used so I have more short words in my sample.
I noticed that there were not many 5-7 letter words, but many 1-4 letter words and a ton that are 7+ letters. So, First I chose short, common words that show up a lot like "the", "all", "we", and "is", slightly longer common words like "that" and "from," and then longer common words like "variability" and "diversity".
I used R to generate 10 random numbers from a sample size of 221 (the number of words in this passage). I then used the random numbers (121 167 137 75 85 169 124 136 133 104 ) to identify the randomly selected words in a word document.
Randomly selected words of varying lengths throughout the text
I chose words that came up several times in the sample passage and also words that reflected the subject matter and I thought would come up throughout the greater population.
I chose my words by selecting words l felt summarized the main topics and ideas expressed in this passage
chose the smallest and biggest words within each sentence. Added the total number of letters and divided by 10 words.
I choose my words by having some that are longer (e.g. 10-12 characters long), some that are mid length (5-8 characters long) and then some that are shorter (e.g. 1-3 characters long)
driven, generally, differ, There, parent, species, oldest, partly, varied, plants I chose these because I think he used a couple bigger works but he also used a lot of words will 3 letters or less.
Random selection of filler words (us, they, and) & words relevant to Darwin's research (plants, animals, diversity)
I chose my words based on what where the key points and themes of the passage
Rolled a dice to decide which every other word to sample from the beginning of the sentence, until I reached 10 words. Then counted up the letters and found the average.
I realized that any visual method (i.e., closing my eyes and pointing to a spot in the text, randomly squiggling the cursor) would be biased towards large words because they take up more space, and I assume we were meant to be random sampling. So, I estimated there was at least 100 words, entered sample(100,10) into R, and got this list: 64 66 85 68 52 8 100 21 97 57. I counted to find the corresponding words: same, the, if, vast, which, been, and, conclude, been, conditions. Then counted letters: 4, 3, 2, 4, 5, 4, 3, 8, 4, 10.
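
The last response above suggests that pointing at the text favors long words because they take up more space. A small simulation can illustrate that idea. This is only a sketch in base R: `word_len` is a made-up vector of word lengths (not the Darwin passage), and the "pointing" model, in which the chance of picking a word is proportional to its length, is an assumption rather than a description of what any student actually did.

```r
# Hypothetical word lengths (NOT the Darwin passage), for illustration only
set.seed(1)
word_len <- c(rep(2, 30), rep(3, 40), rep(4, 30), rep(6, 20), rep(9, 10), rep(11, 5))
pop_mean <- mean(word_len)

# Simple random sampling: every word is equally likely to be chosen
srs_means <- replicate(5000, mean(sample(word_len, 10)))

# "Pointing" model (an assumption): selection probability proportional to length
pointing_means <- replicate(5000, mean(sample(word_len, 10, prob = word_len)))

# Compare the population mean with the average estimate under each scheme
c(population = pop_mean,
  random = mean(srs_means),
  pointing = mean(pointing_means))
```

Under these assumptions, the random samples average close to the population mean, while the length-weighted samples average noticeably higher, which is the same direction of bias seen in the class data.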

Add on a “year” column to this year’s data

darwin <- darwin %>% mutate(year = "2025")

Compare to data from 2020, 2022, 2023, and 2024

Read in data from past years

darwinold<-read.csv("data/Darwin2024.csv")

For some reason, the columns are arranged differently in the old and new data sets. We can rearrange the columns using the select function

darwin <- darwin %>% select(Timestamp, mean.char, Sampling.method, year)
darwinold <- darwinold %>% select(Timestamp, mean.char, Sampling.method, year)
darwinall <- rbind(darwin, darwinold)
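
As a side note on this step: for data frames, `rbind()` matches columns by name, so the `select()` calls matter mainly for dropping any extra columns and giving both data sets the same column order. A toy sketch in base R, where `new`, `old`, and `keep` are hypothetical stand-ins and not the real class data:

```r
# Hypothetical stand-ins for this year's and past years' data
new <- data.frame(mean.char = c(5.2, 6.1), year = "2025", extra = c("a", "b"))
old <- data.frame(year = c("2023", "2024"), mean.char = c(5.9, 6.3))

# Keep the same columns, in the same order, in both data frames
keep <- c("mean.char", "year")
combined <- rbind(new[, keep], old[, keep])
combined
```

Without the column selection, `rbind(new, old)` would stop with an error here because the two data frames have different numbers of columns.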

Create a multi-panel plot to illustrate the results from each year

gf_histogram(~mean.char | year, data=darwinall, xlab="Mean Number of Characters") %>% 
  gf_vline(xintercept=~4.93)

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

Histogram of sample means (i.e., mean number of characters chosen in the 10 words) for past years.

mean(~mean.char | year, data=darwinall, na.rm = TRUE)

    2020     2022     2023     2024     2025 
5.854107 5.607442 5.995349 6.246154 5.990244

(mean.nchar<-mean(~nchar, data=oas))

[1] 4.936652

Random sampling

With random sampling, however, we should get estimates that are, on average, equal to the population mean. Let’s explore this by taking 10,000 random samples of 10 words and computing the mean number of characters per word in each sample!

randomsamps<-do(10000)*{
  samp.char<-sample(oas, 10)
  mean(~nchar, data=samp.char)
  # Alternative way to accomplish the same thing in 1 line of code
  #mean(~nchar, data=sample(oas, 10))
}

gf_dhistogram(~result, data=randomsamps, xlab="Mean Number of Characters",
          main="Random Sampling") %>% gf_vline(xintercept=~4.93)

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Histogram of sample means when taking random samples of size 10.

mean(~result, data=randomsamps)

[1] 4.92993

Larger samples -> less variable estimates!

Let’s look at what sorts of estimates we would have gotten if we had sampled 50 words instead of 10. We find that our estimates would again be centered around the population mean, but there would be less variability in our estimates from sample to sample.

randomsamps<-do(10000)*{
  samp.char<-sample(oas, 50)
  mean(~nchar, data=samp.char)
  # Alternative way to accomplish the same thing in 1 line of code
  # mean(~nchar, data=sample(oas, 50))
}

gf_dhistogram(~result, data=randomsamps, xlab="Mean Number of Characters",
              main="Random Sampling") %>% gf_vline(xintercept=~4.93)

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Histogram of sample means when taking random samples of size 50.

mean(~result, data=randomsamps)

[1] 4.939114
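
The drop in variability with larger samples can be compared with a standard formula. For simple random sampling without replacement from a population of N values with standard deviation S, the standard deviation of the sample mean is S / sqrt(n) * sqrt(1 - n / N). The sketch below, in base R, checks this against simulation. It assumes samples are drawn without replacement and uses a simulated population `pop` of 221 word lengths (221 being the passage length reported in the student responses); the values themselves are hypothetical, not the Darwin passage, and `se_theory` and `se_sim` are helper names introduced only for this illustration.

```r
set.seed(2)
# Hypothetical population of 221 word lengths (simulated, NOT the Darwin passage)
pop <- sample(1:12, 221, replace = TRUE)
N <- length(pop)

# Theoretical SD of the sample mean when sampling n values without replacement
se_theory <- function(n) sd(pop) / sqrt(n) * sqrt(1 - n / N)

# Simulated SD of the sample mean from 5000 random samples of size n
se_sim <- function(n) sd(replicate(5000, mean(sample(pop, n))))

# Side-by-side comparison for the two sample sizes used above
data.frame(n = c(10, 50),
           theory = c(se_theory(10), se_theory(50)),
           simulated = c(se_sim(10), se_sim(50)))
```

Both columns should shrink by more than half when moving from 10 to 50 words, which is consistent with the narrower histogram for samples of size 50.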

Write out the data for a future lab

write.csv(darwinall, file="data/Darwin2025.csv", row.names = FALSE)