Sampling Functions#

The package mosaic features classic sampling functions for R. However, in this environment we use for class, mosaic is difficult to load. Still, we can import the code for useful functions directly and run them in R. This page provides the code to copy into an R notebook along with examples to guide usage.

Three mosaic functions that will be useful for us include:

  1. rflip()

  2. rspin()

  3. sample.data.frame()

The rflip() Function#

When we want to simulate coin flips, the rflip function generates and organizes the output. The function parameters are as follows:

  • n – the number of coins to toss

  • prob – probability of heads on each toss

  • quiet – a logical. If TRUE, less verbose output is used.

  • verbose – a logical. If TRUE, more verbose output is used.

  • summarize – if TRUE, return a summary (as a data frame).

Some examples:

  • rflip(10)

  • rflip(10, prob = 1/6, quiet = TRUE)

  • rflip(10, prob = 1/6, summarize = TRUE)

The code to create the function rflip() is below.

rflip <- function(n=1, prob=.5, quiet=FALSE, verbose = !quiet, summarize = FALSE, 
                  summarise = summarize) {
	if ( ( prob > 1 && is.integer(prob) ) ) {  
		# swap n and prob
		temp <- prob
		prob <- n
		n <- temp
	}
	if (summarise) {
	  heads <- rbinom(1, n, prob)
	  return(data.frame(n = n, heads = heads, tails = n - heads, prob = prob))
	} else {
	  r <- rbinom(n,1,prob)
	  result <- c('T','H')[ 1 + r ]
	  heads <- sum(r)
	  attr(heads,"n") <- n
	  attr(heads,"prob") <- prob 
	  attr(heads,"sequence") <- result
	  attr(heads,"verbose") <- verbose
	  class(heads) <- 'cointoss'
	  return(heads)
	}
}

Let’s flip \(20\) coins and observe the different output configurations we can access.

rflip(20)
[1] 11
attr(,"n")
[1] 20
attr(,"prob")
[1] 0.5
attr(,"sequence")
 [1] "H" "T" "H" "H" "H" "H" "T" "H" "H" "T" "T" "T" "H" "H" "T" "T" "H" "T" "H"
[20] "T"
attr(,"verbose")
[1] TRUE
attr(,"class")
[1] "cointoss"

Notice that the “number of successes” is the first output and that the outputs are in a dataframe that we can subset normally. Thus, we can use square brackets [] to access that value:

rflip(20)[1]
9

What if the probability of success is different than 50%? We can use the prob = option to set the correct value.

rflip(20, prob = 1/6)
[1] 4
attr(,"n")
[1] 20
attr(,"prob")
[1] 0.1666667
attr(,"sequence")
 [1] "T" "T" "T" "T" "T" "T" "T" "T" "H" "H" "T" "T" "T" "T" "T" "H" "H" "T" "T"
[20] "T"
attr(,"verbose")
[1] TRUE
attr(,"class")
[1] "cointoss"

The function rflip() will organize the results attractively for us in an dataframe if we set the option summarize = TRUE.

rflip(20, prob = 1/6, summarize = TRUE)
nheadstailsprob
20 0 20 0.1666667

The rspin() Function#

We can simulate spinning a spinner with rspin() using the following input parameters.

  • n number of spins of spinner

  • probs – a vector of probabilities. If the sum is not 1, the probabilities will be rescaled.

  • labels – a character vector of labels for the categories

Some examples:

  • rspin(20, prob=c(1,2,3), labels=c(“Red”, “Blue”, “Green”))

  • rspin(30, prob=c(1,2,3,4), labels=c(“Red”, “Blue”, “Green”, “Purple”))

rspin <- function(n, probs, labels=1:length(probs)) {
  if (any(probs < 0))
    stop("All probs must be non-negative.")
  
  probs <- probs/sum(probs)
  res <- as.data.frame(t(rmultinom(1, n, probs)))
  names(res) <- labels
  res
}

Two straightforward examples should suffice to demonstrate how the function works.

rspin(20, prob=c(1,2,3), labels=c("Red", "Blue", "Green"))
RedBlueGreen
3 7 10
rspin(30, prob=c(1,2,3,4), labels=c("Red", "Blue", "Green", "Purple"))
RedBlueGreenPurple
4 4 8 14

Genetics Example: Flowers#

Suppose that, based upon Mendel’s laws as expressed in the Punnett square, we have a hybrid where we expect the purple to white flower ratio to be \(3 : 1\). Let’s use rspin() to simulate growing 400 of the plants and counting the frequency of purple and white flowers.

rspin(400, prob=c(3,1), labels=c("Purple", "White"))
PurpleWhite
30892

Genetics Example: Peas#

Mendel chose to work with common, garden-variety pea plants for his experiments because they grow quickly and are easily raised. The plants have several visible characteristics that vary by proportions predicted by genetics, and we will focus on two of them:

  • Seeds can be round or wrinkled

  • Seeds can have yellow or green cotyledons. Cotyledons refer to the tiny leaves inside the seeds.

In Mendel’s experiment, he determined that the expected value for proportions were as given in the chart below:

Phenotype Expected Proportion
Round Yellow 9/16
Round Green 3/16
Wrinkled Yellow 3/16
Wrinkled Green 1/16

Let’s use rspin() to simulate growing 2000 of the plants and determining with what frequencies these attributes occur.

rspin(400, prob=c(9,3,3,1), labels=c("Round Yellow", "Round Green", "Wrinkled Yellow", "Wrinkled Green"))
Round YellowRound GreenWrinkled YellowWrinkled Green
23470 72 24

The sample.data.frame() Function#

We often wish to generate a random sample of rows from a dataframe, and sample.date.frame helps us to do so quickly.

  • x – dataframe to sample from

  • size – sample size to draw

  • groups – a vector (or variable in a data frame) specifying groups to sample within.

  • orig.ids – a logical; should original ids be included in returned data frame?

  • \dots – additional arguments passed to base::sample().

  • shuffled – a vector of column names. These variables are reshuffled individually (within groups if groups is specified), breaking associations among these columns.

Some examples:

sample.data.frame <- function(x, size, replace = FALSE, prob = NULL, groups=NULL, 
                              orig.ids = TRUE, fixed = names(x), shuffled = c(),
                              invisibly.return = NULL, ...) {
  if( missing(size) ) size = nrow(x)
  if( is.null(invisibly.return) ) invisibly.return = size>50 
  shuffled <- intersect(shuffled, names(x))
  fixed <- setdiff(intersect(fixed, names(x)), shuffled)
  n <- nrow(x)
  ids <- 1:n
  groups <- eval( substitute(groups), x )
  newids <- sample(n, size, replace=replace, prob=prob, ...)
  origids <- ids[newids]
  result <- x[newids, , drop=FALSE]
  
  idsString <- as.character(origids)
  
  for (column in shuffled) {
    cids <- sample(newids, groups=groups[newids])
    result[,column] <- x[cids,column]
    idsString <- paste(idsString, ".", cids, sep="")
  }
  
  result <-  result[ , union(fixed,shuffled), drop=FALSE]
  if (orig.ids) result$orig.id <- idsString
  
  
  if (invisibly.return) { return(invisible(result)) } else {return(result)}
}

Let’s load some data to do some examples.

p <- read.csv('https://faculty.ung.edu/rsinn/data/personality.csv')
head(p,3)
AgeYrSexG21CorpsResGreekVarsAthHonorGPA...PerfOCDPlayExtroNarcHSAFHSSEHSAGHSSDPHS
21 2 M Y Y 1 N N N 3.23... 105 10 142 8 11 41 40 26 27 SE
20 3 F N N 2 Y N Y 3.95... 105 3 172 16 11 46 52 26 33 SE
22 3 M Y N 2 N N N 3.06... 73 1 134 15 11 48 42 44 29 AG

The example below shows how to draw a random sample of size \(n = 25\) from the personality data frame. The row ID numbers have been included so you can see which rows were selected. Rerun the command, and you will see that a new sample with different rows will be drawn.

sample.data.frame(p, 25)
AgeYrSexG21CorpsResGreekVarsAthHonorGPA...OCDPlayExtroNarcHSAFHSSEHSAGHSSDPHSorig.id
12920 2 F N N 2 Y N N 3.25... 10 165 11 0 44 34 27 25 AF 129
4220 2 F N Y 1 Y N N 3.02... 10 147 10 6 33 21 31 33 SD 42
7119 3 F N N 1 Y N N 3.15... 9 131 9 4 48 42 27 22 AF 71
2819 1 F N N 1 N N N 3.18... 9 170 11 7 52 47 23 15 SE 28
7419 2 F N N 1 N Y N 3.29... 8 112 8 4 27 36 28 35 SD 74
2220 2 M N N 3 N N N 3.30... 6 149 10 8 51 47 38 27 AG 22
6219 1 F N N 1 N Y N 3.40... 5 130 13 5 44 33 31 27 AG 62
9019 1 M N N 1 N N N 2.80... 9 125 2 3 37 34 29 39 SD 90
12430 3 F Y N 3 N N N 2.79... 4 143 11 1 44 40 25 41 SD 124
10819 1 F N N 1 N Y N 3.35... 4 136 10 2 49 40 14 10 AF 108
8819 2 F N N 1 N N N 3.75... 11 141 14 3 44 35 19 17 AF 88
9918 1 F N N 1 N N N 3.84... 13 158 13 2 49 46 34 23 SE 99
6021 3 F Y N 2 Y N N 4.00... 5 143 10 5 49 50 25 25 SE 60
11220 3 F N N 3 N N N 2.50... 13 140 3 1 38 28 25 24 AG 112
2921 3 M Y N 2 N Y N 3.40... 8 112 13 7 48 34 43 40 AG 29
8121 2 M Y N 1 N Y N 2.90... 3 96 12 4 32 27 28 30 SD 81
5422 3 M Y N 1 N N N 3.26... 13 125 3 5 44 55 22 20 SE 54
5220 3 M N Y 1 Y N N 3.34... 4 104 11 6 35 32 30 32 SD 52
7022 4 F Y N 2 N N N 3.78... 11 117 4 4 43 37 28 34 SD 70
11322 2 F Y N 2 N N N 3.49... 11 141 9 1 44 38 23 31 SD 113
10420 2 F N N 1 N N N 3.80... 7 122 2 2 51 28 13 33 AF 104
1920 3 M N N 2 N Y Y 3.68... 11 133 8 8 30 32 34 34 AG 19
622 3 F Y N 2 Y N N 2.63... 20 133 10 9 40 27 31 28 AG 6
3819 3 M N Y 1 Y Y N 4.00... 11 127 4 6 39 40 29 30 SE 38
9121 3 F Y N 2 Y N N 3.00... 8 142 3 3 43 34 40 47 SD 91

The example below shows how to draw a random sample of size \(n = 25\) from the narcissism column of the personality data frame.

sample.data.frame(p['Narc'], 25, orig.ids = F)
Narc
466
238
575
277
883
784
376
664
674
873
615
1151
1072
317
1002
605
426
516
178
1052
794
1181
1280
943
337