# Sampling Functions

The **mosaic** package provides classic sampling functions for R. However, mosaic is difficult to load in the environment we use for class. Instead, we can copy the source code for the functions we need and run it directly in R. This page provides that code, ready to paste into an R notebook, along with examples to guide usage.

Three mosaic functions will be useful for us:

1. rflip()
2. rspin()
3. sample.data.frame()

## The **rflip()** Function

When we want to simulate coin flips, the **rflip** function generates and organizes the output. The function parameters are as follows:

- **n** -- the number of coins to toss
- **prob** -- probability of heads on each toss
- **quiet** -- a logical.  If `TRUE`, less verbose output is used.
- **verbose** -- a logical.  If `TRUE`, more verbose output is used.
- **summarize** -- if `TRUE`, return a summary (as a data frame).

Some examples:
- rflip(10)
- rflip(10, prob = 1/6, quiet = TRUE)
- rflip(10, prob = 1/6, summarize = TRUE)

The code to create the function **rflip()** is below.

In [1]:
rflip <- function(n=1, prob=.5, quiet=FALSE, verbose = !quiet, summarize = FALSE, 
                  summarise = summarize) {
	if ( ( prob > 1 && is.integer(prob) ) ) {  
		# swap n and prob
		temp <- prob
		prob <- n
		n <- temp
	}
	if (summarise) {
	  heads <- rbinom(1, n, prob)
	  return(data.frame(n = n, heads = heads, tails = n - heads, prob = prob))
	} else {
	  r <- rbinom(n,1,prob)
	  result <- c('T','H')[ 1 + r ]
	  heads <- sum(r)
	  attr(heads,"n") <- n
	  attr(heads,"prob") <- prob 
	  attr(heads,"sequence") <- result
	  attr(heads,"verbose") <- verbose
	  class(heads) <- 'cointoss'
	  return(heads)
	}
}

Let's flip $20$ coins and observe the different output configurations we can access.

In [2]:
rflip(20)

[1] 11
attr(,"n")
[1] 20
attr(,"prob")
[1] 0.5
attr(,"sequence")
 [1] "H" "T" "H" "H" "H" "T" "H" "T" "H" "T" "H" "H" "H" "T" "T" "T" "H" "H" "T"
[20] "T"
attr(,"verbose")
[1] TRUE
attr(,"class")
[1] "cointoss"

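Notice that the number of heads (the "number of successes") is the first output. The result is not a dataframe: it is a single number of class `cointoss`, with the other outputs attached to it as attributes. We can use square brackets [] to extract just that value, which drops the attributes: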

In [3]:
rflip(20)[1]
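
Because `rflip()` returns a count with attributes attached, `attr()` can pull out the individual pieces. The sketch below is a minimal illustration, not part of mosaic: it builds a stand-in object by hand (the name `toss` is hypothetical) with the same attributes that `rflip()` sets in the code above, so it runs on its own.

```r
set.seed(1)                            # fix the seed so the sketch is repeatable
r <- rbinom(20, 1, 0.5)                # 20 Bernoulli draws: 1 = heads, 0 = tails

# Hand-built stand-in for an rflip(20) result (hypothetical name `toss`)
toss <- structure(sum(r), n = 20, prob = 0.5,
                  sequence = c("T", "H")[1 + r], class = "cointoss")

heads <- toss[1]                       # square brackets keep the value, drop the attributes
flips <- attr(toss, "sequence")        # the individual tosses, as "H"/"T"
table(flips)                           # tally of heads and tails
```

With a real result, the same calls would be applied to an object saved from `rflip()`, for example `x <- rflip(20)` followed by `attr(x, "sequence")`.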

What if the probability of success is different from 50%? We can use the *prob =* option to set the correct value.

In [4]:
rflip(20, prob = 1/6)

[1] 2
attr(,"n")
[1] 20
attr(,"prob")
[1] 0.1666667
attr(,"sequence")
 [1] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "H" "T" "T" "T" "T" "T" "T" "T" "T"
[20] "H"
attr(,"verbose")
[1] TRUE
attr(,"class")
[1] "cointoss"

The function **rflip()** will organize the results for us in a data frame if we set the option *summarize = TRUE*.

In [5]:
rflip(20, prob = 1/6, summarize = TRUE)

n,heads,tails,prob
20,2,18,0.1666667
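
The summary form is convenient when the same experiment is repeated many times. In the function code, the summary branch comes down to a single `rbinom()` call, so a repeated experiment can be sketched with base R alone. This is a minimal illustration under assumptions: the 1000 repetitions and the name `sims` are choices made here, not something the mosaic code prescribes.

```r
set.seed(42)                           # fix the seed so the sketch is repeatable
n <- 20                                # coins per experiment
prob <- 1/6                            # probability of heads on each toss

# 1000 repetitions of "toss 20 coins and count the heads"
heads <- rbinom(1000, n, prob)
sims <- data.frame(n = n, heads = heads, tails = n - heads, prob = prob)

head(sims, 3)                          # same columns as the summarize = TRUE output
mean(sims$heads)                       # should land near n * prob
```

Each row of `sims` has the same layout as one call to `rflip(20, prob = 1/6, summarize = TRUE)`.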


## The **rspin()** Function

We can simulate spinning a spinner with **rspin()** using the following input parameters.

- **n** -- the number of spins of the spinner
- **probs** -- a vector of probabilities.  If the sum is not 1, the probabilities will be rescaled.
- **labels** -- a character vector of labels for the categories

Some examples (the argument name `prob` works because R's partial argument matching resolves it to `probs`):

- rspin(20, prob=c(1,2,3), labels=c("Red", "Blue", "Green"))
- rspin(30, prob=c(1,2,3,4), labels=c("Red", "Blue", "Green", "Purple"))

In [6]:
rspin <- function(n, probs, labels=1:length(probs)) {
  if (any(probs < 0))
    stop("All probs must be non-negative.")
  
  probs <- probs/sum(probs)
  res <- as.data.frame(t(rmultinom(1, n, probs)))
  names(res) <- labels
  res
}

Two straightforward examples should suffice to demonstrate how the function works.

In [7]:
rspin(20, prob=c(1,2,3), labels=c("Red", "Blue", "Green"))

Red,Blue,Green
4,4,12


In [8]:
rspin(30, prob=c(1,2,3,4), labels=c("Red", "Blue", "Green", "Purple"))

Red,Blue,Green,Purple
3,3,12,12
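
The weights passed to **rspin()** do not need to sum to 1 because the function rescales them before drawing. The sketch below walks through that rescaling step by step with base R, using the same calls that appear in the function body. The names `weights`, `counts`, and `expected` are hypothetical and used only for illustration.

```r
set.seed(7)                            # fix the seed so the sketch is repeatable
weights <- c(1, 2, 3)                  # relative sizes of the spinner regions
labels <- c("Red", "Blue", "Green")

probs <- weights / sum(weights)        # rescaled to 1/6, 2/6, 3/6
counts <- as.data.frame(t(rmultinom(1, 20, probs)))  # one draw of 20 spins
names(counts) <- labels

expected <- 20 * probs                 # long-run average count for each color
counts
expected
```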


### Genetics Example: Flowers

Suppose that, based upon Mendel's laws as expressed in the Punnett square, we have a hybrid where we expect the **purple** to **white** flower ratio to be $3 : 1$. Let's use **rspin()** to simulate growing 400 of the plants and counting the frequency of purple and white flowers.

In [9]:
rspin(400, prob=c(3,1), labels=c("Purple", "White"))

Purple,White
311,89


### Genetics Example: Peas

Mendel chose to work with common, garden-variety pea plants for his experiments because they grow quickly and are easily raised. The plants have several visible characteristics that vary by proportions predicted by genetics, and we will focus on two of them:

- Seeds can be round or wrinkled
- Seeds can have yellow or green cotyledons. Cotyledons refer to the tiny leaves inside the seeds.

In his experiment, Mendel determined that the expected proportions were as given in the table below:

<html><table style="width:45%">
  <tr>
    <th>Phenotype</th>
    <th>Expected Proportion</th>
  </tr>
  <tr>
    <td>Round Yellow</td>
    <td align="center">9/16</td>
  </tr>
  <tr>
    <td>Round Green</td>
    <td align="center">3/16</td>
  </tr>
  <tr>
    <td>Wrinkled Yellow</td>
    <td align="center">3/16</td>
  </tr>
  <tr>
    <td>Wrinkled Green</td>
    <td align="center">1/16</td>
  </tr>
</table></html>

Let's use **rspin()** to simulate growing 400 of the plants and determining with what frequencies these attributes occur.

In [10]:
rspin(400, prob=c(9,3,3,1), labels=c("Round Yellow", "Round Green", "Wrinkled Yellow", "Wrinkled Green"))

Round Yellow,Round Green,Wrinkled Yellow,Wrinkled Green
229,76,67,28
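
To judge how close a simulated result is to Mendel's proportions, it helps to convert the $9 : 3 : 3 : 1$ ratio into expected counts. The sketch below does that arithmetic for the simulated counts shown above. The names `observed`, `expected`, and `comparison` are hypothetical, and a fresh run of **rspin()** would give different observed counts.

```r
# Counts copied from the simulation output above
observed <- c("Round Yellow" = 229, "Round Green" = 76,
              "Wrinkled Yellow" = 67, "Wrinkled Green" = 28)

ratio <- c(9, 3, 3, 1)                           # Mendel's 9 : 3 : 3 : 1 ratio
expected <- sum(observed) * ratio / sum(ratio)   # 225, 75, 75, 25 for 400 plants

comparison <- data.frame(observed, expected,
                         difference = observed - expected)
comparison
```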


## The **sample.data.frame()** Function

We often wish to generate a random sample of rows from a dataframe, and **sample.data.frame()** helps us to do so quickly. The function parameters are as follows:

- **x** -- dataframe to sample from
- **size** -- sample size to draw
- **groups** -- a vector (or variable in a data frame) specifying groups to sample within.
- **orig.ids** -- a logical; should original ids be included in returned data frame?
- **...** -- additional arguments passed to *base::sample()*.
- **shuffled** -- a vector of column names. These variables are reshuffled individually (within groups if `groups` is specified), breaking associations among these columns.
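
To see what "breaking associations" means for a shuffled column, the sketch below permutes one column of a data frame while leaving another fixed. It is a minimal base R illustration, assuming the built-in `mtcars` data set as a stand-in; the names `d` and `shuffled_d` are hypothetical.

```r
set.seed(99)                           # fix the seed so the sketch is repeatable
d <- mtcars[, c("mpg", "cyl")]         # built-in data set, used as a stand-in

shuffled_d <- d
shuffled_d$mpg <- sample(d$mpg)        # permute one column, leave the other alone

cor(d$mpg, d$cyl)                      # association in the original rows
cor(shuffled_d$mpg, shuffled_d$cyl)    # association after shuffling mpg
```

The shuffled column holds exactly the same values as before, only in a different order, so any relationship it had with the other column is lost.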

The code to create the function **sample.data.frame()** is below.

In [11]:
sample.data.frame <- function(x, size, replace = FALSE, prob = NULL, groups=NULL, 
                              orig.ids = TRUE, fixed = names(x), shuffled = c(),
                              invisibly.return = NULL, ...) {
  if( missing(size) ) size = nrow(x)
  if( is.null(invisibly.return) ) invisibly.return = size>50 
  shuffled <- intersect(shuffled, names(x))
  fixed <- setdiff(intersect(fixed, names(x)), shuffled)
  n <- nrow(x)
  ids <- 1:n
  groups <- eval( substitute(groups), x )
  newids <- sample(n, size, replace=replace, prob=prob, ...)
  origids <- ids[newids]
  result <- x[newids, , drop=FALSE]
  
  idsString <- as.character(origids)
  
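  for (column in shuffled) {
    # base::sample() has no `groups` argument (mosaic's sample() does), so
    # permute the selected ids directly, within groups when `groups` is given
    pos <- seq_along(newids)
    perm <- if (is.null(groups)) sample.int(length(pos)) else
      ave(pos, groups[newids], FUN = function(i) i[sample.int(length(i))])
    cids <- newids[perm]
    result[,column] <- x[cids,column]
    idsString <- paste(idsString, ".", cids, sep="")
  }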
  
  result <-  result[ , union(fixed,shuffled), drop=FALSE]
  if (orig.ids) result$orig.id <- idsString
  
  
  if (invisibly.return) { return(invisible(result)) } else {return(result)}
}

Let's load some data to do some examples.

In [12]:
p <- read.csv('https://faculty.ung.edu/rsinn/data/personality.csv')
head(p,3)

Age,Yr,Sex,G21,Corps,Res,Greek,VarsAth,Honor,GPA,...,Perf,OCD,Play,Extro,Narc,HSAF,HSSE,HSAG,HSSD,PHS
21,2,M,Y,Y,1,N,N,N,3.23,...,105,10,142,8,11,41,40,26,27,SE
20,3,F,N,N,2,Y,N,Y,3.95,...,105,3,172,16,11,46,52,26,33,SE
22,3,M,Y,N,2,N,N,N,3.06,...,73,1,134,15,11,48,42,44,29,AG


**The example below shows how to draw a random sample of size $n = 25$ from the personality data frame.** The row ID numbers have been included so you can see which rows were selected. Rerun the command, and you will see that a new sample with different rows will be drawn.

In [13]:
sample.data.frame(p, 25)

Unnamed: 0,Age,Yr,Sex,G21,Corps,Res,Greek,VarsAth,Honor,GPA,...,OCD,Play,Extro,Narc,HSAF,HSSE,HSAG,HSSD,PHS,orig.id
68,19,1,F,N,N,1,N,Y,Y,4.0,...,12,130,11,4,50,46,13,13,SE,68
58,20,3,M,N,Y,1,N,N,N,2.62,...,9,150,15,5,51,55,37,31,SE,58
66,19,2,F,N,N,3,Y,N,N,3.6,...,14,122,12,4,38,30,30,29,AG,66
60,21,3,F,Y,N,2,Y,N,N,4.0,...,5,143,10,5,49,50,25,25,SE,60
114,50,4,F,Y,N,3,N,N,N,2.15,...,11,118,13,1,24,34,15,20,SE,114
14,20,3,M,N,Y,1,Y,N,N,3.3,...,15,118,7,8,41,38,32,25,AG,14
107,19,1,M,N,N,1,N,N,N,4.0,...,5,104,3,2,42,27,19,36,SD,107
88,19,2,F,N,N,1,N,N,N,3.75,...,11,141,14,3,44,35,19,17,AF,88
80,19,3,F,N,N,2,Y,N,N,4.0,...,3,121,8,4,34,36,30,33,SD,80
78,21,4,M,Y,Y,1,Y,N,Y,3.18,...,7,120,5,4,28,36,27,34,SD,78


**The example below shows how to draw a random sample of size $n = 25$ from the narcissism column of the personality data frame.** 

In [14]:
sample.data.frame(p['Narc'], 25, orig.ids = FALSE)

Unnamed: 0,Narc
13,8
125,0
52,6
69,4
19,8
114,1
124,1
41,6
58,5
80,4
