Data Analysis: Key Ideas#

Two key questions for exploratory data analysis in statistics are:

  1. Shape?

  2. Outliers/skew?

For shape, we would like to know whether the distribution appears to be uniform, exponential, binomial, bell-shaped, or a fit with another common pattern. A uniform distribution is one where all outcomes are equally likely. We have also looked at distributions like the binomial, hypergeometric, geometric and negative binomial distributions. The most significant shape for introductory statistics is the normal (or bell-shaped) distribution, since parametric statistical tests like the \(z\)-test, the \(t\)-test, ANOVA, and regression are built on assumptions of normality.

The second question we need to ask: do skew and outliers exist? We need to know because skewed data and data with outliers often must be treated with great care and possibly with different tools. The first steps of data analysis are to generate numeric descriptions of the data (descriptive statistics) and typical graphical displays of the data, like histograms and stem plots.

We will develop essential skills below including detecting outliers and creating the graphics needed to assess shape and skew.

Data#

Most often, we investigate single-variable numeric data. In R, that data is typically stored as a vector. Mathematically, we have the outcomes \(x_i\) in the set \(X\):

\[X = \{x_1, x_2, \ldots, x_n\}\]

This data is stored in R as a column in a data frame or as a data vector. The vector is preferred when graphics or descriptives are generated.
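As a small illustration of the two storage options, the sketch below builds a hypothetical data frame (the names `sessions`, `day`, and `winnings` are made up for this example) and pulls one column out as a vector with the `$` operator.

```r
# Hypothetical data frame: one row per session (names are illustrative only)
sessions <- data.frame(day      = c("Mon", "Tue", "Wed"),
                       winnings = c(28, 11, 18))

# Extract a single column as a plain numeric vector
x <- sessions$winnings

is.vector(x)   # the column comes out as a vector
mean(x)        # descriptives work directly on the vector
```

Once the column is extracted, it can be passed to functions like mean(), sd(), or hist() in the same way as a vector created with c().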

Parameters and Statistics#

In statistics, we use two different sets of symbols to refer to the mean and standard deviation:

\[\begin{split} \begin{array}{c|cc} &\text{Population}&\text{Sample}\\ \hline \text{AVG}&\mu&\bar{x}\\ \text{SD}&\sigma& s\\ \end{array} \end{split}\]

The population parameters \(\mu\) and \(\sigma\) are rarely known. Much of statistics is about estimating these parameters using the sample statistics \(\bar{x}\) and \(s\) respectively. For example, a poker player’s distribution of winnings (per 100 hands) is approximately normal, but the true win rate is not known. Winning poker players often suffer long cold streaks. Over time, things average out. We don’t know Mandy’s average win rate, \(\mu\), but we can take a sample of recent sessions and estimate it with \(\bar{x}\).
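The idea of estimating unknown parameters can be sketched with a simulation. The values of `mu` and `sigma` below are hypothetical, chosen only for illustration; in practice they are exactly what we do not know. The sketch draws 20 sessions from an assumed normal population and computes the sample statistics that estimate them.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

mu    <- 5    # hypothetical "true" win rate (unknown in practice)
sigma <- 30   # hypothetical "true" standard deviation

# Draw a sample of 20 sessions from the assumed normal population
sample_sessions <- rnorm(20, mean = mu, sd = sigma)

x_bar <- mean(sample_sessions)  # sample mean, our estimate of mu
s     <- sd(sample_sessions)    # sample standard deviation, our estimate of sigma

cat('x-bar =', round(x_bar, 1), '  s =', round(s, 1), '\n')
```

With only 20 sessions, \(\bar{x}\) can land some distance from \(\mu\); rerunning with a different seed or a larger sample size shows how much the estimate moves.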

Example 1

Suppose that Mandy’s winnings from her most recent 20 cash poker sessions are given in the table below. Let’s find the descriptive statistics and plot some standard statistical graphics using this data.

28 11 18 35 36 6 -38 14 -19 43
-14 -30 -16 -25 0 40 16 -79 3 11

The command is shown below using the concatenate function c().

W <- c( 28, 11, 18, 35, 36, 6, -38, 14, -19, 43, -14, -30, -16, -25, 0, 40, 16, -79, 3, 11)

Descriptive Statistics#

We will use the code block we developed in the Descriptive Statistics section, with the vector changed to \(W\) and the title changed to WINNINGS.

cat('The standard descriptives for WINNINGS\n   Mean = ', round(mean(W),1),
    '\n   Standard Deviation = ', round(sd(W),2),
    '\n   Sample Size = ', length(W),
    '\n\nThe 5-number summary for WINNINGS')
summary(W)
The standard descriptives for WINNINGS
   Mean =  2 
   Standard Deviation =  30.57 
   Sample Size =  20 

The 5-number summary for WINNINGS
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -79.00  -16.75    8.50    2.00   20.50   43.00 

The most interesting point of analysis we can see already is that the mean is a good bit lower than the median. This can and often does indicate skew and a potential for outliers.

Shape: Histograms and Density Plots#

The classical way of inspecting shape is the grand old histogram.

Tip

Recall that the breaks = option allows us to increase or decrease the number of bars by controlling the bin width, i.e. the length of the equally-spaced intervals along the \(x\)-axis.

hist(W,
     main = "Histogram: Mandy's Winnings",
     xlab = 'Winnings in BB/100')
_images/e5878feb516cf3173f12453d5e4833c89ace9ec68e78e4959687f133773b5ecf.png
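To see what the breaks = option does, the sketch below passes an explicit vector of break points built with seq(). The particular range (-80 to 60) and bin widths (20 and 10) are choices made for this illustration. Setting plot = FALSE returns the bar counts instead of drawing the histogram, which makes the effect of the bin width easy to compare.

```r
W <- c(28, 11, 18, 35, 36, 6, -38, 14, -19, 43,
       -14, -30, -16, -25, 0, 40, 16, -79, 3, 11)

# Bins of width 20 versus width 10, both spanning -80 to 60
wide   <- hist(W, breaks = seq(-80, 60, by = 20), plot = FALSE)
narrow <- hist(W, breaks = seq(-80, 60, by = 10), plot = FALSE)

wide$counts     # counts per bar with 7 wide bins
narrow$counts   # counts per bar with 14 narrow bins
```

Dropping plot = FALSE draws the histogram with the chosen bins. Narrower bins show more detail but leave more empty or single-count bars in a data set this small.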

Tip

The \(x\)-axis is labeled in units of Big Blinds per 100 Hands which, in poker parlance, describes the winnings in terms of the expense of the game.

plot(density(W), main = "Density Plot: Mandy's Winnings",
     xlab = 'Winnings in BB/100')
_images/e92ffbf6e779baad3b3b990b9e44a1b8178da5af36d466352e89627dca7370cf.png

Analysis#

The histogram and density plot confirm the skew to the left that the descriptive statistics hinted at: the mean was strikingly lower than the median, which made us suspicious of skew and outliers. The distribution appears to be roughly bell-shaped but skewed left. The tail to the left comprises 4 bars of the histogram while the tail to the right comprises only 2. The direction of skew, if any, is the direction of the longest tail in the histogram or density plot.

Outliers and Skew#

We have two ways to check for outliers in basic statistics:

  1. Numerically using mean and standard deviation.

  2. Graphically using the box plot and 5-number summary.

Checking for Outliers Numerically#

Any data point 2 or more standard deviations away from the mean may be considered an outlier in a small data set where \(n \leq 200\).

We can use the \(\leq\) and \(\geq\) operators (written <= and >= in R) along with the sum() function to determine whether these conditions exist for data points in a vector. The value

\[\bar x - 2s\]

is the lower bound for outliers. Any data point equal to or below this value will be considered an outlier.

Notice how the \(\leq\) operator works when applied to the data vector: it tests each element in turn and returns TRUE or FALSE for each one.

W <= mean(W) - 2 * sd(W)
  1. FALSE
  2. FALSE
  3. FALSE
  4. FALSE
  5. FALSE
  6. FALSE
  7. FALSE
  8. FALSE
  9. FALSE
  10. FALSE
  11. FALSE
  12. FALSE
  13. FALSE
  14. FALSE
  15. FALSE
  16. FALSE
  17. FALSE
  18. TRUE
  19. FALSE
  20. FALSE

R treats TRUE as 1 and FALSE as 0 in arithmetic, so we can sum up the TRUE/FALSE vector to find the number of TRUE values in it.

sum(W <= mean(W) - 2 * sd(W))
1

Thus, we have one outlier to the left, that is, one data point that is 2 or more standard deviations below average. For outliers to the right, we check the following:

sum(W >= mean(W) + 2 * sd(W))
0

We therefore find no outliers to the right. We do not have any data points 2 standard deviations or more above the average.
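Counting outliers tells us how many there are but not which values they are. A minimal sketch of one way to list them, using the same two bounds, follows; the names `lower` and `upper` are introduced here for convenience.

```r
W <- c(28, 11, 18, 35, 36, 6, -38, 14, -19, 43,
       -14, -30, -16, -25, 0, 40, 16, -79, 3, 11)

lower <- mean(W) - 2 * sd(W)   # lower bound for outliers
upper <- mean(W) + 2 * sd(W)   # upper bound for outliers

round(c(lower, upper), 2)      # the two cutoff values

# Logical indexing keeps only the flagged values
W[W <= lower | W >= upper]

# which() reports their positions in the vector
which(W <= lower | W >= upper)
```

The flagged value is -79, found at position 18, which matches the single TRUE in the logical vector shown earlier.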

Checking for Outliers Graphically#

The easiest way to check for outliers is to create a box plot. While the box plot is a picture of the 5-number summary, most statistical software and graphing calculators also identify outliers while doing so.

boxplot(W)
_images/f00a998080bae20274cc592f9a8035045303c14216e318ac3b83518e7cd6c411.png

The box plot has identified a single outlier to the left and none to the right. The formula that is used to determine the cutoff values (called fences) is based on the values in the 5-number summary.

  • Lower Fence = Q1 - 1.5 * IQR

  • Upper Fence = Q3 + 1.5 * IQR

where IQR = Q3 - Q1 is the “interquartile range.”
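The fence formulas can be worked out by hand for Mandy's data. The sketch below uses the quartiles reported by summary() (via quantile()); the variable names are introduced here for illustration. Note that R's boxplot() computes its quartiles slightly differently (as hinges), so its fences can differ a little from these values, although both approaches flag the same point for this data set.

```r
W <- c(28, 11, 18, 35, 36, 6, -38, 14, -19, 43,
       -14, -30, -16, -25, 0, 40, 16, -79, 3, 11)

q1  <- unname(quantile(W, 0.25))   # first quartile, -16.75
q3  <- unname(quantile(W, 0.75))   # third quartile, 20.5
iqr <- q3 - q1                     # interquartile range, 37.25

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower_fence, upper_fence)

# Data points outside the fences
W[W < lower_fence | W > upper_fence]

# The outliers that the box plot itself would mark
boxplot.stats(W)$out
```

Only -79 falls below the lower fence, and nothing exceeds the upper fence, in agreement with the box plot.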

Skew#

Outliers tend to cause skew. If we have several outliers to the left and none to the right, we generally can see skew to the left in the histogram or density plot. If we have outliers to the right but none to the left, we generally have skew to the right.

  • Skewed right usually means the mean is greater than the median.

  • Skewed left usually means the mean is less than the median.

This is how we guessed above that outliers and skew might be present. The above two statements can be reversed:

  • If the mean is significantly greater than the median, we expect skew to the right and the majority of outliers to the right.

  • If the mean is significantly less than the median, we expect skew to the left and the majority of outliers to the left.

When is the mean significantly greater or less than the median?

Robb’s Rule of Thumb states that when the mean and median differ by a tenth of a standard deviation or more, we should expect skew and outliers.
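Applied to Mandy's winnings, the rule of thumb can be checked in a few lines. This is a sketch; the names `gap` and `threshold` are introduced here for illustration.

```r
W <- c(28, 11, 18, 35, 36, 6, -38, 14, -19, 43,
       -14, -30, -16, -25, 0, 40, 16, -79, 3, 11)

gap       <- abs(mean(W) - median(W))   # distance between mean and median
threshold <- sd(W) / 10                 # a tenth of a standard deviation

c(gap = gap, threshold = round(threshold, 2))

gap >= threshold   # TRUE means we should expect skew and outliers
```

Here the gap is 6.5 while the threshold is about 3.06, so the rule flags the data. Since the mean sits below the median, we expect skew to the left.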

Pivot Tables#

We also have a quick way to summarize two categorical variables in a 2-way table or pivot table:

pers <- read.csv('https://faculty.ung.edu/rsinn/data/personality.csv')
xtabs(~Sex + AccDate, data = pers)
   AccDate
Sex  N  Y
  F 28 46
  M 28 27

Tip

Note the statistical formula that is used above:

\[\text{~ Sex + AccDate}\]

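The leading ~ with nothing on its left makes this a one-sided formula, which tells xtabs() to count the rows that fall into each combination of the variables listed on the right. The \(+\) operator separates those cross-classifying variables, which should be categorical or qualitative, not numeric.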
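The same formula style can be tried on a small self-contained example. The data frame `toy` below is hypothetical (it is not the personality data set) and simply reuses the column names Sex and AccDate. The sketch also shows two common follow-ups: adding totals with addmargins() and converting counts to row proportions with prop.table().

```r
# Hypothetical toy data, made up for this illustration
toy <- data.frame(Sex     = c("F", "F", "F", "M", "M", "M", "F", "M"),
                  AccDate = c("Y", "N", "Y", "N", "Y", "N", "Y", "N"))

# 2-way table of counts: rows are Sex, columns are AccDate
tab <- xtabs(~ Sex + AccDate, data = toy)
tab

addmargins(tab)               # add row and column totals
prop.table(tab, margin = 1)   # proportions within each row (each Sex)
```

Row proportions make groups of different sizes easier to compare than raw counts.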