Sampling Functions#

The mosaic package provides classic sampling functions for R. However, in the environment we use for class, mosaic is difficult to load. We can still copy the code for the most useful functions and run it directly in R. This page provides the code to paste into an R notebook, along with examples to guide usage.

Three mosaic functions that will be useful for us include:

  1. rflip()

  2. rspin()

  3. sample.data.frame()

The rflip() Function#

When we want to simulate coin flips, the rflip function generates and organizes the output. The function parameters are as follows:

  • n – the number of coins to toss

  • prob – probability of heads on each toss

  • quiet – a logical. If TRUE, less verbose output is used.

  • verbose – a logical. If TRUE, more verbose output is used.

  • summarize – if TRUE, return a summary (as a data frame).

Some examples:

  • rflip(10)

  • rflip(10, prob = 1/6, quiet = TRUE)

  • rflip(10, prob = 1/6, summarize = TRUE)

The code to create the function rflip() is below.

rflip <- function(n=1, prob=.5, quiet=FALSE, verbose = !quiet, summarize = FALSE, 
                  summarise = summarize) {
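	# If the arguments arrive in reverse order (a whole-number prob above 1), swap them.
	# is.integer() would miss ordinary numbers such as 10, which R stores as doubles.
	if ( prob > 1 && prob == round(prob) ) {  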
		# swap n and prob
		temp <- prob
		prob <- n
		n <- temp
	}
	if (summarise) {
	  heads <- rbinom(1, n, prob)
	  return(data.frame(n = n, heads = heads, tails = n - heads, prob = prob))
	} else {
	  r <- rbinom(n,1,prob)
	  result <- c('T','H')[ 1 + r ]
	  heads <- sum(r)
	  attr(heads,"n") <- n
	  attr(heads,"prob") <- prob 
	  attr(heads,"sequence") <- result
	  attr(heads,"verbose") <- verbose
	  class(heads) <- 'cointoss'
	  return(heads)
	}
}

Let’s flip \(20\) coins and observe the different output configurations we can access.

rflip(20)
[1] 10
attr(,"n")
[1] 20
attr(,"prob")
[1] 0.5
attr(,"sequence")
 [1] "T" "T" "H" "H" "H" "H" "H" "T" "T" "T" "H" "T" "H" "T" "H" "T" "H" "T" "T"
[20] "H"
attr(,"verbose")
[1] TRUE
attr(,"class")
[1] "cointoss"

Notice that the number of heads (the “number of successes”) is the first value printed. The result is not a data frame; it is a single number of class cointoss that carries the remaining details as attributes. Square brackets [] extract the bare count:

rflip(20)[1]
11
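Because each call to rflip() flips a fresh set of coins, the count above (11) differs from the earlier run (10). To work with a single set of flips, it helps to store the result first. The sketch below does not call rflip() itself; it rebuilds the same kind of object with base R, mirroring the non-summary branch of the code above, so that it runs on its own. The names flips, n_heads and tosses and the seed value are illustrative choices.

```r
set.seed(2024)                      # arbitrary seed, only to make the run repeatable

# Mirror the non-summary branch of rflip(): 20 Bernoulli draws, 1 = heads
r <- rbinom(20, 1, 0.5)
flips <- sum(r)
attr(flips, "sequence") <- c("T", "H")[1 + r]
class(flips) <- "cointoss"

n_heads <- flips[1]                 # square brackets return the bare count
tosses  <- attr(flips, "sequence")  # the individual H/T outcomes

n_heads
table(tosses)                       # tally of heads and tails
```

The same two accessors, [1] and attr(x, "sequence"), should work on a stored result such as flips <- rflip(20).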

What if the probability of success is not 50%? We can use the prob argument to set the correct value.

rflip(20, prob = 1/6)
[1] 6
attr(,"n")
[1] 20
attr(,"prob")
[1] 0.1666667
attr(,"sequence")
 [1] "T" "H" "H" "H" "H" "T" "H" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T"
[20] "H"
attr(,"verbose")
[1] TRUE
attr(,"class")
[1] "cointoss"

The function rflip() will return the results as a compact one-row data frame if we set the option summarize = TRUE.

rflip(20, prob = 1/6, summarize = TRUE)
 n heads tails      prob
20     1    19 0.1666667
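Since the summary is an ordinary one-row data frame, several runs can be stacked and the proportion of heads computed column-wise. The sketch below is an illustration rather than part of mosaic: flip_summary() is a hypothetical helper that reproduces only the summarize = TRUE branch of the code above, so the block runs by itself.

```r
# Hypothetical helper: the summarize = TRUE branch of rflip(), copied in isolation
flip_summary <- function(n, prob = 0.5) {
  heads <- rbinom(1, n, prob)
  data.frame(n = n, heads = heads, tails = n - heads, prob = prob)
}

set.seed(7)  # arbitrary seed for repeatability

# Five independent runs of 20 tosses, stacked into one data frame
runs <- do.call(rbind, replicate(5, flip_summary(20, prob = 1/6), simplify = FALSE))
runs$prop_heads <- runs$heads / runs$n   # observed proportion of heads per run
runs
```

With prob = 1/6, the values in prop_heads should scatter around 0.167, though any single run of 20 tosses can land well away from it.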

The rspin() Function#

We can simulate spinning a spinner with rspin() using the following input parameters.

  • n – the number of spins of the spinner

  • probs – a vector of probabilities. If the sum is not 1, the probabilities will be rescaled.

  • labels – a character vector of labels for the categories

Some examples:

  • rspin(20, prob=c(1,2,3), labels=c("Red", "Blue", "Green"))

  • rspin(30, prob=c(1,2,3,4), labels=c("Red", "Blue", "Green", "Purple"))

The code to create the function rspin() is below.

rspin <- function(n, probs, labels=1:length(probs)) {
  if (any(probs < 0))
    stop("All probs must be non-negative.")
  
  probs <- probs/sum(probs)
  res <- as.data.frame(t(rmultinom(1, n, probs)))
  names(res) <- labels
  res
}

Two straightforward examples should suffice to demonstrate how the function works.

rspin(20, prob=c(1,2,3), labels=c("Red", "Blue", "Green"))
Red Blue Green
  6    4    10
rspin(30, prob=c(1,2,3,4), labels=c("Red", "Blue", "Green", "Purple"))
Red Blue Green Purple
  3    4    10     13
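The weights passed to rspin() do not need to sum to 1 because the function divides them by their total before sampling. The lines below sketch that rescaling step, and the expected counts it implies, for the first example; the variable names are illustrative.

```r
weights <- c(Red = 1, Blue = 2, Green = 3)

probs <- weights / sum(weights)   # rescale so the probabilities sum to 1
probs                             # 1/6, 1/3 and 1/2

expected <- 20 * probs            # expected counts in 20 spins
expected

set.seed(3)                       # arbitrary seed for repeatability
counts <- rmultinom(1, 20, probs)[, 1]   # one simulated set of 20 spins
counts
sum(counts)                       # the counts always total the number of spins
```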

Genetics Example: Flowers#

Suppose that, based upon Mendel’s laws as expressed in the Punnett square, we have a hybrid where we expect the purple to white flower ratio to be \(3 : 1\). Let’s use rspin() to simulate growing 400 of the plants and counting the frequency of purple and white flowers.

rspin(400, prob=c(3,1), labels=c("Purple", "White"))
Purple White
   290   110

Genetics Example: Peas#

Mendel chose to work with common, garden-variety pea plants for his experiments because they grow quickly and are easily raised. The plants have several visible characteristics that vary by proportions predicted by genetics, and we will focus on two of them:

  • Seeds can be round or wrinkled

  • Seeds can have yellow or green cotyledons. Cotyledons refer to the tiny leaves inside the seeds.

From his experiments, Mendel determined that the expected proportions were as given in the table below:

Phenotype          Expected Proportion
Round Yellow       9/16
Round Green        3/16
Wrinkled Yellow    3/16
Wrinkled Green     1/16

Let’s use rspin() to simulate growing 400 of the plants and determine how frequently each combination of attributes occurs.

rspin(400, prob=c(9,3,3,1), labels=c("Round Yellow", "Round Green", "Wrinkled Yellow", "Wrinkled Green"))
Round Yellow  Round Green  Wrinkled Yellow  Wrinkled Green
         223           81               73              23
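To judge how close a simulated run is to Mendel's ratios, we can compare the simulated counts with the expected counts, which are the sample size times each expected proportion. The sketch below does this for the 400 simulated plants shown above; the observed counts are typed in from that output, and the object names are illustrative.

```r
ratio <- c("Round Yellow" = 9, "Round Green" = 3,
           "Wrinkled Yellow" = 3, "Wrinkled Green" = 1)

expected <- 400 * ratio / sum(ratio)   # 225, 75, 75, 25
observed <- c(223, 81, 73, 23)         # counts from the simulated run above

comparison <- data.frame(expected = expected,
                         observed = observed,
                         difference = observed - expected)
comparison
```

The differences always sum to zero because both columns total 400.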

The sample.data.frame() Function#

We often wish to draw a random sample of rows from a data frame, and sample.data.frame() helps us do so quickly. Its main parameters are as follows:

  • x – dataframe to sample from

  • size – sample size to draw

  • groups – a vector (or variable in a data frame) specifying groups to sample within.

  • orig.ids – a logical; should original ids be included in returned data frame?

  • ... – additional arguments passed to base::sample().

  • shuffled – a vector of column names. These variables are reshuffled individually (within groups if groups is specified), breaking associations among these columns.

The code to create the function sample.data.frame() is below.

sample.data.frame <- function(x, size, replace = FALSE, prob = NULL, groups=NULL, 
                              orig.ids = TRUE, fixed = names(x), shuffled = c(),
                              invisibly.return = NULL, ...) {
  if( missing(size) ) size = nrow(x)
  if( is.null(invisibly.return) ) invisibly.return = size>50 
  shuffled <- intersect(shuffled, names(x))
  fixed <- setdiff(intersect(fixed, names(x)), shuffled)
  n <- nrow(x)
  ids <- 1:n
  groups <- eval( substitute(groups), x )
  newids <- sample(n, size, replace=replace, prob=prob, ...)
  origids <- ids[newids]
  result <- x[newids, , drop=FALSE]
  
  idsString <- as.character(origids)
  
  for (column in shuffled) {
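    # base::sample() has no `groups` argument (mosaic's own sample() does), so
    # permute the sampled row ids with base R, within groups when groups is given.
    pos <- seq_along(newids)
    if (!is.null(groups)) {
      pos <- ave(pos, groups[newids], FUN = function(i) i[sample.int(length(i))])
    } else {
      pos <- sample.int(length(newids))
    }
    cids <- newids[pos]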
    result[,column] <- x[cids,column]
    idsString <- paste(idsString, ".", cids, sep="")
  }
  
  result <-  result[ , union(fixed,shuffled), drop=FALSE]
  if (orig.ids) result$orig.id <- idsString
  
  
  if (invisibly.return) { return(invisible(result)) } else {return(result)}
}
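The shuffled argument is easiest to understand on a tiny data set. The sketch below does not call sample.data.frame(); it uses base::sample() on a made-up data frame (toy, with hypothetical columns id, hours and score) to show what reshuffling one column does: every value in that column is kept, but the values no longer line up with the rows they came from.

```r
# Made-up data in which score rises with hours
toy <- data.frame(id = 1:6,
                  hours = c(1, 2, 3, 4, 5, 6),
                  score = c(52, 58, 63, 71, 77, 85))

set.seed(11)                                 # arbitrary seed for repeatability
shuffled_toy <- toy
shuffled_toy$score <- sample(toy$score)      # permute one column, leave the rest fixed

cor(toy$hours, toy$score)                    # strong association in the original
cor(shuffled_toy$hours, shuffled_toy$score)  # association after shuffling
all(sort(shuffled_toy$score) == sort(toy$score))  # the same six values are still present
```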

Let’s load some data to work through a few examples.

p <- read.csv('https://faculty.ung.edu/rsinn/data/personality.csv')
head(p,3)
Age Yr Sex G21 Corps Res Greek VarsAth Honor GPA ... Perf OCD Play Extro Narc HSAF HSSE HSAG HSSD PHS
21 2 M Y Y 1 N N N 3.23 ... 105 10 142 8 11 41 40 26 27 SE
20 3 F N N 2 Y N Y 3.95 ... 105 3 172 16 11 46 52 26 33 SE
22 3 M Y N 2 N N N 3.06 ... 73 1 134 15 11 48 42 44 29 AG

The example below shows how to draw a random sample of size \(n = 25\) from the personality data frame. The row ID numbers have been included so you can see which rows were selected. Rerun the command, and you will see that a new sample with different rows will be drawn.

sample.data.frame(p, 25)
    Age Yr Sex G21 Corps Res Greek VarsAth Honor GPA ... OCD Play Extro Narc HSAF HSSE HSAG HSSD PHS orig.id
 76 21 4 F Y N 2 Y N N 3.40 ... 7 159 9 4 54 46 25 48 SD 76
112 20 3 F N N 3 N N N 2.50 ... 13 140 3 1 38 28 25 24 AG 112
 78 21 4 M Y Y 1 Y N Y 3.18 ... 7 120 5 4 28 36 27 34 SD 78
  1 21 2 M Y Y 1 N N N 3.23 ... 10 142 8 11 41 40 26 27 SE 1
 52 20 3 M N Y 1 Y N N 3.34 ... 4 104 11 6 35 32 30 32 SD 52
 82 20 2 M N N 1 N Y N 3.40 ... 0 146 15 4 46 44 40 36 AG 82
106 18 2 F N N 3 N N N 3.67 ... 6 122 10 2 43 36 28 22 AF 106
 13 20 2 M N N 1 N N Y 3.86 ... 15 130 10 8 43 36 36 32 AG 13
124 30 3 F Y N 3 N N N 2.79 ... 4 143 11 1 44 40 25 41 SD 124
 54 22 3 M Y N 1 N N N 3.26 ... 13 125 3 5 44 55 22 20 SE 54
 46 22 3 F Y N 2 N N N 3.02 ... 7 174 13 6 51 39 25 18 AF 46
 47 20 2 F N N 1 N N N 2.90 ... 7 170 12 6 53 42 37 26 AF 47
 95 20 2 F N N 1 N N N 2.51 ... 5 147 6 3 47 19 19 21 AF 95
110 22 4 F Y N 2 N N N 3.87 ... 17 129 6 1 44 45 22 30 SE 110
 53 20 3 F N N 1 Y N N 2.92 ... 2 120 9 6 50 37 28 23 AF 53
 39 20 2 F N Y 1 Y N Y 3.16 ... 11 166 7 6 56 55 24 17 SE 39
114 50 4 F Y N 3 N N N 2.15 ... 11 118 13 1 24 34 15 20 SE 114
 34 18 1 F N N 3 N N N 3.92 ... 19 132 17 6 42 40 18 26 SE 34
 18 20 3 M N N 2 Y N N 2.66 ... 12 111 13 8 32 32 32 32 AG 18
 80 19 3 F N N 2 Y N N 4.00 ... 3 121 8 4 34 36 30 33 SD 80
 99 18 1 F N N 1 N N N 3.84 ... 13 158 13 2 49 46 34 23 SE 99
 98 18 1 F N N 3 N N N 3.70 ... 16 131 6 2 33 32 29 33 SD 98
 42 20 2 F N Y 1 Y N N 3.02 ... 10 147 10 6 33 21 31 33 SD 42
 16 23 4 M Y Y 1 N N N 2.49 ... 13 123 7 8 50 44 39 25 AG 16
 61 17 1 M N N 3 N N N 3.72 ... 5 130 14 5 48 47 32 24 SE 61
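At its core, the function samples row numbers and then subsets the data frame with them. The sketch below shows that idea in base R on the built-in mtcars data set, which stands in for the personality data so that the block runs without a download. It is a simplified illustration, not the full function, and the names rows and smp are illustrative.

```r
set.seed(5)                          # arbitrary seed for repeatability

n <- nrow(mtcars)                    # mtcars ships with R and has 32 rows
rows <- sample(n, 8)                 # 8 distinct row numbers, drawn without replacement
smp <- mtcars[rows, , drop = FALSE]  # keep those rows and all columns
smp$orig.id <- as.character(rows)    # record where each sampled row came from

smp[, c("mpg", "cyl", "orig.id")]
```

In the full function above, invisibly.return defaults to size > 50, so a sample of more than 50 rows is returned without being printed. Assign the result to a name or wrap the call in print() to see it.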

The example below shows how to draw a random sample of size \(n = 25\) from the narcissism column of the personality data frame.

sample.data.frame(p['Narc'], 25, orig.ids = FALSE)
Narc
124 1
13 8
117 1
129 0
34 6
70 4
8 9
109 2
9 9
92 3
69 4
21 8
22 8
122 1
65 4
123 1
76 4
80 4
94 3
116 1
60 5
88 3
5 10
3 11
50 6