Exploratory Data Analysis#

The 2 key questions as we conduct an exploratory data analysis are as follows:

  • What is the shape of the distribution?

  • Do outliers or skew exist?

The most significant shape of distributions for introductory statistics is the normal distribution since parametric statistical tests like the

  • \(z\)-test,

  • \(t\)-test, and

  • ANOVA

are built on assumptions of normality. We also need to know whether or not skew and outliers exist because we often must treat skewed data and outliers with care and possibly different tools.

Types of Data#

Data come in two varieties: quantitative (numeric) or qualitative (categorical). Poker player Pete plays Heads Up tournaments with a 20 dollar buy-in (HU20). These two-player tournaments take less than half an hour, and he either loses 21 dollars or wins 19 dollars. (The online casino takes a cut of each tournament buy-in, called the rake.) Pete’s HU20 win-loss record is categorical data, a series of W’s and L’s.

Pete’s friend Mandy, a math major with a data science minor at North Georgia, enjoys 6-max NL10, a small stakes cash game. Mandy’s list of winnings from her past twenty cash game sessions is numeric data.

For numeric data sets, we have a sample of \(n\) data points:

\[X = \{x_1,x_2,\ldots,x_n\}\]
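
As a small sketch of the distinction, the R code below stores a categorical record as a factor and a numeric record as a numeric vector. Both vectors are invented for illustration and are not Pete's or Mandy's actual results.

```r
# Hypothetical win-loss record (categorical data); values are invented
record <- factor(c("W", "L", "W", "W", "L", "W"), levels = c("L", "W"))
table(record)        # counting is the natural summary for categories

# Hypothetical session winnings in dollars (numeric data); also invented
winnings <- c(12.5, -4, 7.25, -10, 3)
mean(winnings)       # arithmetic is meaningful for numeric data

is.factor(record)    # TRUE
is.numeric(winnings) # TRUE
```

Categorical data are summarized with counts or proportions, while numeric data support means, standard deviations, and the other descriptives used in the rest of this section.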

Parameters and Statistics#

We will continue to emphasize the difference between parameters and statistics:

\[\begin{split} \begin{array}{c|cc} &\text{Population}&\text{Sample}\\ \hline \text{AVG}&\mu&\bar{x}\\ \text{SD}&\sigma& s\\ \end{array} \end{split}\]

Parameters#

The population parameters of mean \(\mu\) and standard deviation \(\sigma\) are typically the values of greatest interest, yet they are rarely known. Consider a study at UNG Dahlonega where we wish to know the Perfectionism levels for undergraduate students. We know little about the overall population parameters but can collect sample data from, say, 200 students to launch our investigation.

Statistics#

The sample statistics of mean \(\bar x\) and standard deviation \(s\) are known and thus can be used to estimate the population parameters. The accuracy of the estimates will depend upon two things, among others:

  1. Shape of the data

  2. Sample size

The shape of the distribution the data are drawn from matters because we handle different distributions in different ways. Larger samples lead to improved accuracy.
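
The effect of sample size can be seen in a short simulation. The sketch below assumes a normal population with mean 50 and standard deviation 10 (values chosen only for illustration) and measures how far the sample mean typically lands from \(\mu\) for three sample sizes.

```r
set.seed(1)   # arbitrary seed so the sketch is reproducible

# Assumed population for illustration only: normal with mean 50 and SD 10
mu <- 50
sigma <- 10

# For each sample size, draw 2000 samples and record the spread of the sample means
sample_sizes <- c(10, 100, 1000)
typical_error <- sapply(sample_sizes, function(n) {
  xbars <- replicate(2000, mean(rnorm(n, mean = mu, sd = sigma)))
  sd(xbars)   # how much x-bar varies from sample to sample
})

data.frame(n = sample_sizes, typical_error = round(typical_error, 2))
```

The typical error shrinks as \(n\) grows, which is the sense in which larger samples give more accurate estimates of \(\mu\).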

Example 1

Recall our data from Example 1 which were winnings and losses from Mandy’s 20 most recent poker sessions.

28 11 18 35 36 6 -38 14 -19 43
-14 -30 -16 -25 0 40 16 -79 3 11

Using our code from an earlier section, we have the following:

W <- c( 28, 11, 18, 35, 36, 6, -38, 14, -19, 43, -14, -30, -16, -25, 0, 40, 16, -79, 3, 11)
cat('The standard descriptives for Winnings \n   Mean = ', round(mean(W),2),
    '\n   Standard Deviation = ', round(sd(W),2),
    '\n   Sample Size = ', length(W),
    '\n\nThe 5-number summary for Winnings')
summary(W)
The standard descriptives for Winnings 
   Mean =  2 
   Standard Deviation =  30.57 
   Sample Size =  20 

The 5-number summary for Winnings
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -79.00  -16.75    8.50    2.00   20.50   43.00 

Analysis of Descriptives#

The most interesting feature of the descriptive statistics is that the mean and median are quite different. Since the median is \(8.5\) and the mean is \(2\), their difference is \(6.5\), which is about \(\frac{1}{5}\) of a standard deviation.

In this example, because the mean is noticeably less than the median, we anticipate skew to the left and outliers, if present, to be on the left. This brings up the second detail one should notice: the Five Number Summary shows a much longer tail to the left. The lowest quarter of the data spans the interval \((-79,-16.75)\), or more than \(60\) units. The highest quarter spans only \((20.5,43)\), or \(22.5\) units. This indicates likely skew to the left, since the lowest quarter covers a much wider range and hence has more room for outliers.
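
The tail comparison can be checked directly in R. This is a minimal sketch using `quantile()`, which computes quartiles the same way `summary()` does; the names `lower_span` and `upper_span` are introduced here for illustration.

```r
W <- c(28, 11, 18, 35, 36, 6, -38, 14, -19, 43,
       -14, -30, -16, -25, 0, 40, 16, -79, 3, 11)

q <- quantile(W)   # Min, Q1, Median, Q3, Max (named "0%", "25%", ..., "100%")

lower_span <- unname(q["25%"] - q["0%"])     # width of the lowest quarter
upper_span <- unname(q["100%"] - q["75%"])   # width of the highest quarter

lower_span   # 62.25
upper_span   # 22.5

# Gap between mean and median, measured in standard deviations
(mean(W) - median(W)) / sd(W)
```

The lowest quarter is nearly three times as wide as the highest quarter, and the negative gap shows the mean sitting below the median.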

Standardized Scores#

Standardized scores or \(z\)-scores are calculated from a data point \(x\) as follows:

\[z = \frac{x-\mu}{\sigma}\]

where \(\mu\) is the population mean and \(\sigma\) is the population standard deviation.

Example 2#

SAT math section scores have the \(N(500,100)\) distribution meaning the population average is 500 and the standard deviation is 100. If Brea earns a 630 on her math SAT section, what is her standardized score?

\[\begin{split}\begin{align}z &= \frac{630-500}{100}\\&=\frac{130}{100}\\&=1.3\end{align}\end{split}\]

The interpretation? Brea’s raw score is 1.3 standard deviations above average. A negative \(z\)-score indicates a score that is below average.

A \(z\)-score indicates both distance and direction from the mean, and its units are standard deviations. Note that, when we don’t know \(\mu\) or \(\sigma\), or both, we substitute the sample statistics:

\[z = \frac{x-\bar{x}}{s}\]
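
Both versions of the formula are one-liners in R. The sketch below reproduces Brea's score and then standardizes Mandy's winnings from Example 1 with the sample statistics; the variable names `z_brea` and `z_W` are chosen here for illustration.

```r
# Population version: Brea's SAT math score under N(500, 100)
z_brea <- (630 - 500) / 100
z_brea   # 1.3

# Sample version: standardize every data point with x-bar and s
W <- c(28, 11, 18, 35, 36, 6, -38, 14, -19, 43,
       -14, -30, -16, -25, 0, 40, 16, -79, 3, 11)
z_W <- (W - mean(W)) / sd(W)

round(z_W, 2)   # each winning expressed in standard deviations from the mean
```

After standardizing, the \(z\)-scores have mean 0 and standard deviation 1, so the most extreme sessions are easy to spot by the size of their \(z\)-scores.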

Standard Deviation and Outliers#

The standard deviation can be thought of as a distance metric specific to the data set. Let’s take a moment to discover why. Given the data set \(X=\{1,2,3,6\}\), we have \(n=4\) and \(\bar x=3\). For any data point, say, \(x_1=1\), we can compute the directional distance or deviation from the mean:

\[d_i = x_i - \bar x\]

For this example, we have

\[d_1 = 1 -3 = -2\]

For data points where \(x_i<\bar x\) (below average), the deviation is negative, i.e. \(d_i<0\). Positive deviations indicate data points that are above average. If we treat the data set \(X\) as a column vector \(\vec x\) and calculate the deviation for each component, we have the deviation vector:

\[\begin{split}\vec{d_x} = \left(\begin{array}{c}x_1 - \bar x \\x_2 - \bar x \\ \vdots \\ x_n - \bar x \\ \end{array}\right)\end{split}\]

Standard Deviation#

The idea for the calculation of a standard distance (or deviation) begins with an application of the Euclidean distance metric to \(\vec{d_x}\). The magnitude of the deviation vector is given by:

\[\begin{split}\begin{align*} \left\|\vec{d_x}\right\| &= \sqrt{(-2)^2+(-1)^2+(0)^2+(3)^2} \\ &=\sqrt{14}\\ &\approx 3.742 \end{align*}\end{split}\]
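
The text above stops at the magnitude of the deviation vector. As a sketch of the remaining step, assuming the usual sample formula \(s = \sqrt{\sum d_i^2/(n-1)}\), the code below divides the magnitude by \(\sqrt{n-1}\) and compares the result with R's built-in `sd()`.

```r
X <- c(1, 2, 3, 6)
n <- length(X)

d <- X - mean(X)              # deviation vector: -2 -1 0 3
magnitude <- sqrt(sum(d^2))   # Euclidean length, sqrt(14)

# Scaling by sqrt(n - 1) turns the total length into a typical per-point distance
s <- magnitude / sqrt(n - 1)
s
sd(X)                         # R's sample standard deviation agrees
```

Here \(s = \sqrt{14/3} \approx 2.16\), so a typical data point in \(X\) sits a little more than 2 units from the mean.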

Outliers#

We have two different ways of checking for outliers:

  1. Based upon mean and standard deviation.

  2. Based upon the 5-Number Summary: Q1, Q3, and IQR.

Outliers Based upon Mean and Standard Deviation#

The basic rule is that any data point more than 2 standard deviations from the mean counts as an outlier in a small data set, where small indicates a sample of \(n\leq 200\). For larger sample sizes, a cutoff of 3 standard deviations from the mean may be more appropriate. We create cutoff points above and below the mean as follows:

  • Upper Cutoff: \(\bar x + 2s\)

  • Lower Cutoff: \(\bar x - 2s\)

Any data point above the Upper Cutoff will be counted as an outlier. Any data point below the Lower Cutoff will be counted as an outlier.
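
Applied to Mandy's winnings from Example 1, the rule can be sketched as follows. The helper name `sd_outliers` and its argument `k` are introduced here for illustration.

```r
# Hypothetical helper: return the points more than k standard deviations from the mean
sd_outliers <- function(x, k = 2) {
  upper_cutoff <- mean(x) + k * sd(x)
  lower_cutoff <- mean(x) - k * sd(x)
  x[x < lower_cutoff | x > upper_cutoff]
}

W <- c(28, 11, 18, 35, 36, 6, -38, 14, -19, 43,
       -14, -30, -16, -25, 0, 40, 16, -79, 3, 11)

c(lower = mean(W) - 2 * sd(W), upper = mean(W) + 2 * sd(W))   # about -59.15 and 63.15
sd_outliers(W)   # -79
```

Only the session with a loss of 79 falls outside the cutoffs, and it lies on the left, matching the skew anticipated from the descriptives.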

Outliers Based upon 5-Number Summary#

The box plot checks for outliers using fences which are calculated as follows:

  • Upper Fence: Q3 + 1.5 * IQR

  • Lower Fence: Q1 - 1.5 * IQR

where the interquartile range (IQR) is given by:

IQR = Q3 - Q1

Again, any data points that lie outside or beyond the fences are designated as outliers. The only difference between the two methods is the formula used to calculate the cutoff points.
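
The fence calculation can be sketched in the same style. The helper name `fence_outliers` is hypothetical, and it uses the quartiles reported by `quantile()` and `summary()`. Note that R's `boxplot()` computes its quartiles as hinges via `fivenum()`, which can differ slightly from these values in small samples.

```r
# Hypothetical helper: return the points outside the 1.5 * IQR fences
fence_outliers <- function(x) {
  q1 <- unname(quantile(x, 0.25))
  q3 <- unname(quantile(x, 0.75))
  iqr <- q3 - q1
  lower_fence <- q1 - 1.5 * iqr
  upper_fence <- q3 + 1.5 * iqr
  x[x < lower_fence | x > upper_fence]
}

W <- c(28, 11, 18, 35, 36, 6, -38, 14, -19, 43,
       -14, -30, -16, -25, 0, 40, 16, -79, 3, 11)

fence_outliers(W)   # -79
```

For Mandy's winnings the fences are \(-72.625\) and \(76.375\), so this method flags the same single session as the mean and standard deviation method.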

Analysis of Shape#

Describing the Distribution of a Quantitative Variable#

  • Shape: Symmetry or Skew

  • Shape: Uniform, Peaks - unimodal, bimodal, others

  • Center: Mean and Median

  • Spread: Range, IQR, standard deviation

  • Outliers

Run the cell below to see examples of different shape characteristics.

set.seed(844)
setA <- rnorm(500,21,2)
setB <- rchisq(500,df=800)
setC <- rchisq(500,df=6)
setD <- -1*setC+30
setE <- runif(5000,min=1,max=6)
setF <- c(rnorm(500,21,2),rnorm(500,32,2))

par(mfrow=c(3,2))
hist(setA,main="Approximately Normal", xlab="Data Set A")
hist(setB,main="Unimodal and Roughly Symmetric", xlab="Data Set B")
hist(setC,main="Unimodal and Skewed RIGHT", xlab="Data Set C")
hist(setD,main="Unimodal and Skewed LEFT", xlab="Data Set D")
hist(setE,main="Approximately Uniform", xlab="Data Set E")
hist(setF,main="Bimodal", xlab="Data Set F")
[Figure: 3-by-2 grid of histograms for Data Sets A through F, titled as in the code above]

Skew and Outliers#

If a distribution has skew, then the majority of outliers in that data set are likely to lie in the same direction. Why do we care?

  1. Outliers are called influential data points because they affect the mean.

  2. The median is not affected much by outliers.

Thus, in the presence of skew, we can see where the majority of these influential data points are likely to live.

Example

Due to the influential data points, i.e. outliers, the following are true:

  • If the distribution is skewed left, the mean is likely to be less than the median.

  • If the distribution is skewed right, the mean is likely to be greater than the median.
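
A tiny example makes the pull on the mean concrete. The data below are invented for illustration: one set with a long right tail and its mirror image with a long left tail.

```r
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 5, 9, 30)   # invented data, long right tail
left_skewed  <- 31 - right_skewed                  # mirror image, long left tail

c(mean = mean(right_skewed), median = median(right_skewed))   # mean 6.2, median 3
c(mean = mean(left_skewed),  median = median(left_skewed))    # mean 24.8, median 28

# Dropping the most extreme point moves the mean a lot and the median not at all
trimmed <- right_skewed[right_skewed != 30]
c(mean = mean(trimmed), median = median(trimmed))
```

In both sets the mean is dragged toward the long tail while the median stays near the bulk of the data.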

Example 3#

Perform exploratory data analysis for Life Expectancy in 2015 using the World Health Organization country-by-country comparison data.

life <- read.csv('https://faculty.ung.edu/rsinn/data/lifeexpectancy.csv')
life <- subset(life, Year == 2015)
head(life,7)
    Country              Year  Status      LifeExpectancy  AdultMortality  InfantDeaths  Alcohol  PercentageExpenditure  HepatitisB  Measles  ...  Polio  TotalExpenditure  Diphtheria  HIV.AIDS  GDP         Population  Thinness_.1.19_years  Thinness_5.9_years  Income  SchoolingYrs
1   Afghanistan          2015  Developing  65.0            263             62            0.01     71.27962               65          1154     ...  6      8.16              65          0.1       584.2592    33736494    17.2                  17.3                0.479   10.1
17  Albania              2015  Developing  77.8            74              0             4.60     364.97523              99          0        ...  99     6.00              99          0.1       3954.2278   28873       1.2                   1.3                 0.762   14.2
33  Algeria              2015  Developing  75.6            19              21            NA       0.00000                95          63       ...  95     NA                95          0.1       4132.7629   39871528    6.0                   5.8                 0.743   14.4
49  Angola               2015  Developing  52.4            335             66            NA       0.00000                64          118      ...  7      NA                64          1.9       3695.7937   2785935     8.3                   8.2                 0.531   11.4
65  Antigua and Barbuda  2015  Developing  76.4            13              0             NA       0.00000                99          0        ...  86     NA                99          0.2       13566.9541  NA          3.3                   3.3                 0.784   13.9
81  Argentina            2015  Developing  76.3            116             8             NA       0.00000                94          0        ...  93     NA                94          0.1       13467.1236  43417765    1.0                   0.9                 0.826   17.3
97  Armenia              2015  Developing  74.8            118             1             NA       0.00000                94          33       ...  96     NA                94          0.1       369.6548    291695      2.1                   2.2                 0.741   12.7
le <- life[ , 'LifeExpectancy']
head(le,4)
  1. 65
  2. 77.8
  3. 75.6
  4. 52.4

Example 3: Shape#

Let’s consider shape by constructing a histogram and a density plot.

hist(le, breaks = 10, main = 'Histogram: Life Expectancy in 2015', xlab = "Years")
[Figure: histogram of life expectancy in years]
plot(density(le), main = 'Density Plot: Life Expectancy in 2015', xlab = "Years")
[Figure: density plot of life expectancy in years]

The distribution is unimodal and roughly bell-shaped. Given the longer tail to the left, we are seeing some skew to the left and anticipate outliers in that direction.

Example 3: Outliers#

Let’s create a boxplot to investigate skew and outliers.

boxplot(le, main = 'Life Expectancy', ylab = 'Years')
[Figure: boxplot of life expectancy in years]

We see no outliers in the boxplot, but we do see a slightly longer tail to the left. Checking numerically with the mean and standard deviation method, we find:

upper_cutoff = mean(le) + 2 * sd(le)
lower_cutoff = mean(le) - 2 * sd(le)

upper_cutoff
lower_cutoff
87.864352186001
55.3695275954198
sum(le < lower_cutoff)
sum(le > upper_cutoff)
8
1

Using the boxplot method for checking for outliers, we find:

  • Upper Fence = Q3 + 1.5 * IQR

  • Lower Fence = Q1 - 1.5 * IQR

summary(le)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  51.00   65.75   73.90   71.62   76.95   88.00 
IQR = 76.95 - 65.75
upper_fence = 76.95 + 1.5 * IQR
lower_fence = 65.75 - 1.5 * IQR

upper_fence
lower_fence
93.75
48.95
sum(le < lower_fence)
sum(le > upper_fence)
0
0

Notice that the two methods differ. While the box plot shows no outliers, the mean and standard deviation method identifies 8 outliers to the left with only 1 to the right. The methods do differ at times, especially when the data points in question are marginal. If a data point is 4 standard deviations from the mean, both methods will identify it as an outlier. When close to the cutoff points, some data points will be classified differently by the two methods.

Given several marginal outliers to the left and 1 to the right, we see the skew to the left.

Example 3: Descriptive Statistics#

cat('The standard descriptives for Life Expectancy \n   Mean = ', round(mean(le),1),
    '\n   Standard Deviation = ', round(sd(le),2),
    '\n   Sample Size = ', length(le),
    '\n\nThe 5-number summary for Life Expectancy')
summary(le)
The standard descriptives for Life Expectancy 
   Mean =  71.6 
   Standard Deviation =  8.12 
   Sample Size =  183 

The 5-number summary for Life Expectancy
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  51.00   65.75   73.90   71.62   76.95   88.00 

Applying “Robb’s Rule of Thumb,” notice that the mean and median differ significantly (i.e. by more than \(\frac{s}{10}\)) since

\[\text{mean} - \text{median} = 71.62 - 73.9 = -2.28\]

The absolute value of the difference is greater than a tenth of the standard deviation:

\[\frac{s}{10} = \frac{8.12}{10} = 0.812\]

Robb’s Rule of Thumb says that, in this instance, we would expect skew and outliers. This agrees with the work above showing a skew to the left and several outliers to the left.
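
The rule is easy to wrap in a small function. The sketch below is one possible encoding, with the hypothetical name `robbs_rule`; it compares the mean-median gap with \(\frac{s}{10}\) and reads the likely direction of skew from the sign of the gap. It is applied to Mandy's winnings from Example 1 so that it runs without downloading the life expectancy data.

```r
# Hypothetical helper: flag likely skew when |mean - median| exceeds s / 10
robbs_rule <- function(x) {
  gap <- mean(x) - median(x)
  threshold <- sd(x) / 10
  list(
    gap = gap,
    threshold = threshold,
    flagged = abs(gap) > threshold,
    # mean below median suggests left skew; mean above median suggests right skew
    direction = if (gap < 0) "left" else if (gap > 0) "right" else "none"
  )
}

W <- c(28, 11, 18, 35, 36, 6, -38, 14, -19, 43,
       -14, -30, -16, -25, 0, 40, 16, -79, 3, 11)

robbs_rule(W)
```

For the winnings data the gap is \(-6.5\) against a threshold of about \(3.06\), so the rule flags likely skew to the left. Calling `robbs_rule(le)` on the life expectancy vector would carry out the same check as the calculation above.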