Sampling Functions#
The package mosaic features classic sampling functions for R. However, in this environment we use for class, mosaic is difficult to load. Still, we can import the code for useful functions directly and run them in R. This page provides the code to copy into an R notebook along with examples to guide usage.
Three mosaic functions that will be useful for us include:
rflip()
rspin()
sample.data.frame()
The rflip() Function#
When we want to simulate coin flips, the rflip function generates and organizes the output. The function parameters are as follows:
n – the number of coins to toss
prob – probability of heads on each toss
quiet – a logical. If
TRUE
, less verbose output is used.verbose – a logical. If
TRUE
, more verbose output is used.summarize – if
TRUE
, return a summary (as a data frame).
Some examples:
rflip(10)
rflip(10, prob = 1/6, quiet = TRUE)
rflip(10, prob = 1/6, summarize = TRUE)
The code to create the function rflip() is below.
rflip <- function(n=1, prob=.5, quiet=FALSE, verbose = !quiet, summarize = FALSE,
summarise = summarize) {
if ( ( prob > 1 && is.integer(prob) ) ) {
# swap n and prob
temp <- prob
prob <- n
n <- temp
}
if (summarise) {
heads <- rbinom(1, n, prob)
return(data.frame(n = n, heads = heads, tails = n - heads, prob = prob))
} else {
r <- rbinom(n,1,prob)
result <- c('T','H')[ 1 + r ]
heads <- sum(r)
attr(heads,"n") <- n
attr(heads,"prob") <- prob
attr(heads,"sequence") <- result
attr(heads,"verbose") <- verbose
class(heads) <- 'cointoss'
return(heads)
}
}
Let’s flip \(20\) coins and observe the different output configurations we can access.
rflip(20)
[1] 11
attr(,"n")
[1] 20
attr(,"prob")
[1] 0.5
attr(,"sequence")
[1] "H" "T" "H" "H" "H" "H" "T" "H" "H" "T" "T" "T" "H" "H" "T" "T" "H" "T" "H"
[20] "T"
attr(,"verbose")
[1] TRUE
attr(,"class")
[1] "cointoss"
Notice that the “number of successes” is the first output and that the outputs are in a dataframe that we can subset normally. Thus, we can use square brackets [] to access that value:
rflip(20)[1]
What if the probability of success is different than 50%? We can use the prob = option to set the correct value.
rflip(20, prob = 1/6)
[1] 4
attr(,"n")
[1] 20
attr(,"prob")
[1] 0.1666667
attr(,"sequence")
[1] "T" "T" "T" "T" "T" "T" "T" "T" "H" "H" "T" "T" "T" "T" "T" "H" "H" "T" "T"
[20] "T"
attr(,"verbose")
[1] TRUE
attr(,"class")
[1] "cointoss"
The function rflip() will organize the results attractively for us in an dataframe if we set the option summarize = TRUE.
rflip(20, prob = 1/6, summarize = TRUE)
n | heads | tails | prob |
---|---|---|---|
20 | 0 | 20 | 0.1666667 |
The rspin() Function#
We can simulate spinning a spinner with rspin() using the following input parameters.
n number of spins of spinner
probs – a vector of probabilities. If the sum is not 1, the probabilities will be rescaled.
labels – a character vector of labels for the categories
Some examples:
rspin(20, prob=c(1,2,3), labels=c(“Red”, “Blue”, “Green”))
rspin(30, prob=c(1,2,3,4), labels=c(“Red”, “Blue”, “Green”, “Purple”))
rspin <- function(n, probs, labels=1:length(probs)) {
if (any(probs < 0))
stop("All probs must be non-negative.")
probs <- probs/sum(probs)
res <- as.data.frame(t(rmultinom(1, n, probs)))
names(res) <- labels
res
}
Two straightforward examples should suffice to demonstrate how the function works.
rspin(20, prob=c(1,2,3), labels=c("Red", "Blue", "Green"))
Red | Blue | Green |
---|---|---|
3 | 7 | 10 |
rspin(30, prob=c(1,2,3,4), labels=c("Red", "Blue", "Green", "Purple"))
Red | Blue | Green | Purple |
---|---|---|---|
4 | 4 | 8 | 14 |
Genetics Example: Flowers#
Suppose that, based upon Mendel’s laws as expressed in the Punnett square, we have a hybrid where we expect the purple to white flower ratio to be \(3 : 1\). Let’s use rspin() to simulate growing 400 of the plants and counting the frequency of purple and white flowers.
rspin(400, prob=c(3,1), labels=c("Purple", "White"))
Purple | White |
---|---|
308 | 92 |
Genetics Example: Peas#
Mendel chose to work with common, garden-variety pea plants for his experiments because they grow quickly and are easily raised. The plants have several visible characteristics that vary by proportions predicted by genetics, and we will focus on two of them:
Seeds can be round or wrinkled
Seeds can have yellow or green cotyledons. Cotyledons refer to the tiny leaves inside the seeds.
In Mendel’s experiment, he determined that the expected value for proportions were as given in the chart below:
Phenotype | Expected Proportion |
---|---|
Round Yellow | 9/16 |
Round Green | 3/16 |
Wrinkled Yellow | 3/16 |
Wrinkled Green | 1/16 |
Let’s use rspin() to simulate growing 2000 of the plants and determining with what frequencies these attributes occur.
rspin(400, prob=c(9,3,3,1), labels=c("Round Yellow", "Round Green", "Wrinkled Yellow", "Wrinkled Green"))
Round Yellow | Round Green | Wrinkled Yellow | Wrinkled Green |
---|---|---|---|
234 | 70 | 72 | 24 |
The sample.data.frame() Function#
We often wish to generate a random sample of rows from a dataframe, and sample.date.frame helps us to do so quickly.
x – dataframe to sample from
size – sample size to draw
groups – a vector (or variable in a data frame) specifying groups to sample within.
orig.ids – a logical; should original ids be included in returned data frame?
\dots – additional arguments passed to base::sample().
shuffled – a vector of column names. These variables are reshuffled individually (within groups if
groups
is specified), breaking associations among these columns.
Some examples:
sample.data.frame <- function(x, size, replace = FALSE, prob = NULL, groups=NULL,
orig.ids = TRUE, fixed = names(x), shuffled = c(),
invisibly.return = NULL, ...) {
if( missing(size) ) size = nrow(x)
if( is.null(invisibly.return) ) invisibly.return = size>50
shuffled <- intersect(shuffled, names(x))
fixed <- setdiff(intersect(fixed, names(x)), shuffled)
n <- nrow(x)
ids <- 1:n
groups <- eval( substitute(groups), x )
newids <- sample(n, size, replace=replace, prob=prob, ...)
origids <- ids[newids]
result <- x[newids, , drop=FALSE]
idsString <- as.character(origids)
for (column in shuffled) {
cids <- sample(newids, groups=groups[newids])
result[,column] <- x[cids,column]
idsString <- paste(idsString, ".", cids, sep="")
}
result <- result[ , union(fixed,shuffled), drop=FALSE]
if (orig.ids) result$orig.id <- idsString
if (invisibly.return) { return(invisible(result)) } else {return(result)}
}
Let’s load some data to do some examples.
p <- read.csv('https://faculty.ung.edu/rsinn/data/personality.csv')
head(p,3)
Age | Yr | Sex | G21 | Corps | Res | Greek | VarsAth | Honor | GPA | ... | Perf | OCD | Play | Extro | Narc | HSAF | HSSE | HSAG | HSSD | PHS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
21 | 2 | M | Y | Y | 1 | N | N | N | 3.23 | ... | 105 | 10 | 142 | 8 | 11 | 41 | 40 | 26 | 27 | SE |
20 | 3 | F | N | N | 2 | Y | N | Y | 3.95 | ... | 105 | 3 | 172 | 16 | 11 | 46 | 52 | 26 | 33 | SE |
22 | 3 | M | Y | N | 2 | N | N | N | 3.06 | ... | 73 | 1 | 134 | 15 | 11 | 48 | 42 | 44 | 29 | AG |
The example below shows how to draw a random sample of size \(n = 25\) from the personality data frame. The row ID numbers have been included so you can see which rows were selected. Rerun the command, and you will see that a new sample with different rows will be drawn.
sample.data.frame(p, 25)
Age | Yr | Sex | G21 | Corps | Res | Greek | VarsAth | Honor | GPA | ... | OCD | Play | Extro | Narc | HSAF | HSSE | HSAG | HSSD | PHS | orig.id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
129 | 20 | 2 | F | N | N | 2 | Y | N | N | 3.25 | ... | 10 | 165 | 11 | 0 | 44 | 34 | 27 | 25 | AF | 129 |
42 | 20 | 2 | F | N | Y | 1 | Y | N | N | 3.02 | ... | 10 | 147 | 10 | 6 | 33 | 21 | 31 | 33 | SD | 42 |
71 | 19 | 3 | F | N | N | 1 | Y | N | N | 3.15 | ... | 9 | 131 | 9 | 4 | 48 | 42 | 27 | 22 | AF | 71 |
28 | 19 | 1 | F | N | N | 1 | N | N | N | 3.18 | ... | 9 | 170 | 11 | 7 | 52 | 47 | 23 | 15 | SE | 28 |
74 | 19 | 2 | F | N | N | 1 | N | Y | N | 3.29 | ... | 8 | 112 | 8 | 4 | 27 | 36 | 28 | 35 | SD | 74 |
22 | 20 | 2 | M | N | N | 3 | N | N | N | 3.30 | ... | 6 | 149 | 10 | 8 | 51 | 47 | 38 | 27 | AG | 22 |
62 | 19 | 1 | F | N | N | 1 | N | Y | N | 3.40 | ... | 5 | 130 | 13 | 5 | 44 | 33 | 31 | 27 | AG | 62 |
90 | 19 | 1 | M | N | N | 1 | N | N | N | 2.80 | ... | 9 | 125 | 2 | 3 | 37 | 34 | 29 | 39 | SD | 90 |
124 | 30 | 3 | F | Y | N | 3 | N | N | N | 2.79 | ... | 4 | 143 | 11 | 1 | 44 | 40 | 25 | 41 | SD | 124 |
108 | 19 | 1 | F | N | N | 1 | N | Y | N | 3.35 | ... | 4 | 136 | 10 | 2 | 49 | 40 | 14 | 10 | AF | 108 |
88 | 19 | 2 | F | N | N | 1 | N | N | N | 3.75 | ... | 11 | 141 | 14 | 3 | 44 | 35 | 19 | 17 | AF | 88 |
99 | 18 | 1 | F | N | N | 1 | N | N | N | 3.84 | ... | 13 | 158 | 13 | 2 | 49 | 46 | 34 | 23 | SE | 99 |
60 | 21 | 3 | F | Y | N | 2 | Y | N | N | 4.00 | ... | 5 | 143 | 10 | 5 | 49 | 50 | 25 | 25 | SE | 60 |
112 | 20 | 3 | F | N | N | 3 | N | N | N | 2.50 | ... | 13 | 140 | 3 | 1 | 38 | 28 | 25 | 24 | AG | 112 |
29 | 21 | 3 | M | Y | N | 2 | N | Y | N | 3.40 | ... | 8 | 112 | 13 | 7 | 48 | 34 | 43 | 40 | AG | 29 |
81 | 21 | 2 | M | Y | N | 1 | N | Y | N | 2.90 | ... | 3 | 96 | 12 | 4 | 32 | 27 | 28 | 30 | SD | 81 |
54 | 22 | 3 | M | Y | N | 1 | N | N | N | 3.26 | ... | 13 | 125 | 3 | 5 | 44 | 55 | 22 | 20 | SE | 54 |
52 | 20 | 3 | M | N | Y | 1 | Y | N | N | 3.34 | ... | 4 | 104 | 11 | 6 | 35 | 32 | 30 | 32 | SD | 52 |
70 | 22 | 4 | F | Y | N | 2 | N | N | N | 3.78 | ... | 11 | 117 | 4 | 4 | 43 | 37 | 28 | 34 | SD | 70 |
113 | 22 | 2 | F | Y | N | 2 | N | N | N | 3.49 | ... | 11 | 141 | 9 | 1 | 44 | 38 | 23 | 31 | SD | 113 |
104 | 20 | 2 | F | N | N | 1 | N | N | N | 3.80 | ... | 7 | 122 | 2 | 2 | 51 | 28 | 13 | 33 | AF | 104 |
19 | 20 | 3 | M | N | N | 2 | N | Y | Y | 3.68 | ... | 11 | 133 | 8 | 8 | 30 | 32 | 34 | 34 | AG | 19 |
6 | 22 | 3 | F | Y | N | 2 | Y | N | N | 2.63 | ... | 20 | 133 | 10 | 9 | 40 | 27 | 31 | 28 | AG | 6 |
38 | 19 | 3 | M | N | Y | 1 | Y | Y | N | 4.00 | ... | 11 | 127 | 4 | 6 | 39 | 40 | 29 | 30 | SE | 38 |
91 | 21 | 3 | F | Y | N | 2 | Y | N | N | 3.00 | ... | 8 | 142 | 3 | 3 | 43 | 34 | 40 | 47 | SD | 91 |
The example below shows how to draw a random sample of size \(n = 25\) from the narcissism column of the personality data frame.
sample.data.frame(p['Narc'], 25, orig.ids = F)
Narc | |
---|---|
46 | 6 |
23 | 8 |
57 | 5 |
27 | 7 |
88 | 3 |
78 | 4 |
37 | 6 |
66 | 4 |
67 | 4 |
87 | 3 |
61 | 5 |
115 | 1 |
107 | 2 |
31 | 7 |
100 | 2 |
60 | 5 |
42 | 6 |
51 | 6 |
17 | 8 |
105 | 2 |
79 | 4 |
118 | 1 |
128 | 0 |
94 | 3 |
33 | 7 |