# Statistical Formulas
These formulas are similar to those found in many statistics textbooks and provide guidance on how to perform basic statistical hypothesis testing and other common statistical tasks.
## Probability
The following rules of probability will be used throughout the course:
For any event \(A\), \(0\leq P(A) \leq 1\)
The sample space \(S\) has probability \(P(S) =1\)
For disjoint event sets \(A\) and \(B\), \(P(A \text{ or } B) = P(A) + P(B)\)
In general, for event sets \(A\) and \(B\), \(P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)\)
\(P(A \text{ does not occur}) =1-P(A)\)
For a discrete probability density function \(p(x)\):
\(0\leq p(x_i) \leq 1 \hspace{.25cm}\text{for all}\hspace{.25cm} 1\leq i \leq n\)
\(\sum p(x_i)=1\)
For any continuous probability density function \(f(x)\):
\(f(x) \geq 0 \hspace{.25cm}\text{for all}\hspace{.25cm} x\in(-\infty,+\infty)\)
\(\displaystyle\int\limits_{-\infty}^{+\infty} f(x)dx=1\)
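These rules are easy to verify numerically. A minimal sketch in Python (assuming NumPy and SciPy are available; the discrete distribution below is invented for illustration):

```python
import numpy as np
from scipy import integrate, stats

# Discrete case: a hypothetical pmf p(x_i) over four outcomes
p = np.array([0.1, 0.2, 0.3, 0.4])
assert np.all((0 <= p) & (p <= 1))   # each p(x_i) lies in [0, 1]
assert np.isclose(p.sum(), 1.0)      # probabilities sum to 1

# Complement rule: P(A does not occur) = 1 - P(A)
P_A = p[0] + p[1]                    # suppose event A = {x_1, x_2}
print("P(A does not occur) =", 1 - P_A)

# Continuous case: the standard Normal density integrates to 1
area, _ = integrate.quad(stats.norm.pdf, -np.inf, np.inf)
print("Total area under f(x):", area)  # ~1.0
```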
## Exploring Data: Distributions and Descriptives
Look for overall pattern (shape, center, spread) and deviations (outliers).
Mean:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Standard deviation:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Median: Arrange all observations from smallest to largest. The median \(M\) is located at position \(\frac{n+1}{2}\) in this ordered list. We also use the notation \(\tilde{x}\) for the median.
Quartiles: The first quartile \(Q1\) is the median of the observations whose position in the ordered list is to the left of the location of the overall median. The third quartile \(Q3\) is the median of the observations to the right of the location of the overall median.
Standardized value of \(x\):
$$z = \frac{x-\mu}{\sigma}$$
### Five Number Summary
| Statistic | Symbol |
|---|---|
| Minimum | min |
| 1st Quartile | \(Q1\) |
| Median | med |
| 3rd Quartile | \(Q3\) |
| Maximum | max |
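All of these descriptive statistics can be computed directly. A minimal sketch in Python (assuming NumPy; the data values are invented for illustration):

```python
import numpy as np

x = np.array([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])  # illustrative data

xbar = x.mean()                       # mean
s = x.std(ddof=1)                     # sample standard deviation (n - 1 divisor)
M = np.median(x)                      # median
q1, q3 = np.percentile(x, [25, 75])   # NumPy interpolates, so these can differ
                                      # slightly from the median-of-halves rule
z = (x - xbar) / s                    # standardized values

print("Five-number summary:", x.min(), q1, M, q3, x.max())
```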
## Exploring Data: Relationships
Look for overall pattern (form, direction, strength) and deviations (outliers, influential observations).
Correlation (conceptual form using \(z\)-scores):
$$r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i} z_{y_i} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$
Least-squares regression line:
$$\hat{y} = a + bx \quad \text{with slope } b = r\frac{s_y}{s_x} \text{ and intercept } a = \bar{y} - b\bar{x}$$
The slope \(b\) of the regression line (or line of best fit) is the ratio of the standard deviation of \(y\) to the standard deviation of \(x\), multiplied by the strength (correlation \(r\)) of the relationship between the \(x\)- and \(y\)-variables.
Residuals: for any data point \((x_i, y_i)\), the residual is
$$e_i = y_i - \hat{y}_i = y_i - a - bx_i,$$
which gives the vertical distance between the actual \(y\)-value and the line of best fit. The "line of best fit" is the "least-squares line" that minimizes the sum of the squared residuals, i.e., it minimizes the total squared vertical distance between the actual data and the fitted line (a result derived using calculus).
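The correlation, slope, intercept, and residuals can all be computed from the formulas above. A minimal sketch in Python (assuming NumPy; data invented for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # illustrative data

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r = (zx * zy).sum() / (n - 1)            # correlation from z-scores

b = r * y.std(ddof=1) / x.std(ddof=1)    # slope: b = r * s_y / s_x
a = y.mean() - b * x.mean()              # intercept: a = ybar - b * xbar
residuals = y - (a + b * x)              # y_i - yhat_i

print(f"r = {r:.4f}, line: yhat = {a:.3f} + {b:.3f} x")
print("sum of squared residuals:", (residuals ** 2).sum())
```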
## Sampling Distributions
\(\bar{x}\) has mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\)
\(\bar{x}\) has a Normal distribution if the population distribution is Normal.
Central limit theorem: \(\bar{x}\) is approximately Normal when \(n\) is large, regardless of the shape of the population distribution.
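A short simulation makes the central limit theorem concrete. A sketch in Python (assuming NumPy; the exponential population and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# A decidedly non-Normal population: exponential with mu = 1, sigma = 1
n, reps = 50, 10_000
xbars = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print("mean of xbar:", xbars.mean())       # close to mu = 1
print("sd of xbar:  ", xbars.std(ddof=1))  # close to sigma/sqrt(n) ~ 0.141
# A histogram of xbars would look approximately Normal.
```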
## Basics of Inference
\(z\) confidence interval for a population mean (\(\sigma\) known, SRS from a Normal population):
$$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$$
Sample size for desired margin of error \(m\):
$$n = \left(\frac{z^* \sigma}{m}\right)^2$$
\(z\) test statistic for \(H_0 : \mu = \mu_0\) (\(\sigma\) known, SRS from a Normal population; rarely used in modern practice):
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
with \(p\)-values from \(N(0,1)\).
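A minimal sketch of these three calculations in Python (assuming SciPy; the summary numbers are invented for illustration):

```python
import numpy as np
from scipy import stats

xbar, sigma, n = 27.5, 4.0, 36           # illustrative values; sigma known
z_star = stats.norm.ppf(0.975)           # z* for 95% confidence (~1.96)

# Confidence interval: xbar +/- z* sigma / sqrt(n)
moe = z_star * sigma / np.sqrt(n)
print("95% CI:", (xbar - moe, xbar + moe))

# Sample size for desired margin of error m (always round up)
m = 1.0
print("n needed:", int(np.ceil((z_star * sigma / m) ** 2)))

# Two-sided z test of H0: mu = mu0
mu0 = 26.0
z = (xbar - mu0) / (sigma / np.sqrt(n))
print("z =", z, " p-value =", 2 * stats.norm.sf(abs(z)))
```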
## Inference About Means
The \(t\) confidence interval for a population mean (SRS from a Normal population):
$$\bar{x} \pm t^* \frac{s}{\sqrt{n}}$$
with \(t^*\) from the \(t\)-distribution with degrees of freedom \(n-1\).
\(t\) test statistic for \(H_0: \mu = \mu_0\) (SRS from a Normal population):
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
with \(p\)-values from the \(t\)-distribution with degrees of freedom \(n-1\).
Matched pairs: To compare the responses to the two treatments, apply the one-sample \(t\) procedures to the observed differences.
Two-sample confidence interval for \(\mu_1-\mu_2\) (independent SRSs from Normal populations):
$$(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
with conservative \(t^*\) from \(t\) with \(df=\min(n_1-1,n_2-1)\), or use software.
Two-sample \(t\) test statistic for \(H_0 : \mu_1 = \mu_2\) (independent SRSs from Normal populations):
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
with conservative \(p\)-values from \(t\) with \(df=\min(n_1-1,n_2-1)\), or use software.
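SciPy implements all of these \(t\) procedures (and computes the software degrees of freedom for the two-sample test). A sketch, with invented data:

```python
import numpy as np
from scipy import stats

x1 = np.array([5.1, 4.9, 6.0, 5.6, 5.3, 5.8])  # illustrative samples
x2 = np.array([4.2, 4.8, 4.5, 5.0, 4.4, 4.6])

# One-sample t confidence interval: xbar +/- t* s / sqrt(n)
n = len(x1)
t_star = stats.t.ppf(0.975, df=n - 1)
moe = t_star * x1.std(ddof=1) / np.sqrt(n)
print("95% CI for mu:", (x1.mean() - moe, x1.mean() + moe))

# One-sample t test of H0: mu = 5
print(stats.ttest_1samp(x1, popmean=5.0))

# Two-sample t test; equal_var=False uses the software (Welch) df
print(stats.ttest_ind(x1, x2, equal_var=False))

# Matched pairs: a one-sample t test on the differences
print(stats.ttest_rel(x1, x2))  # same as ttest_1samp(x1 - x2, 0)
```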
## Inference About Proportions
Sampling distribution of a sample proportion: when the population and the sample size are both large and \(p\) is not close to \(0\) or \(1\), \(\hat{p}\) is approximately \(N\left(p,\sqrt{p(1-p)/n}\right)\).
Large-sample \(z\) confidence interval for \(p\):
$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
with \(z^*\) from \(N(0,1)\). The Plus Four Method greatly improves accuracy: use the same formula after adding four imaginary observations, two successes and two failures.
The \(z\) test statistic for \(H_0: p = p_0\) (large SRS):
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$
with \(p\)-values from \(N(0,1)\).
Sample size for desired margin of error \(m\):
$$n = \left(\frac{z^*}{m}\right)^2 p^*(1-p^*)$$
where \(p^*\) is a guessed value for \(p\), or \(p^* = 0.5\).
Large-sample \(z\) confidence interval for \(p_1 - p_2\):
$$(\hat{p}_1 - \hat{p}_2) \pm z^* \text{SE}$$
with \(z^*\) from \(N(0,1)\) and standard error:
$$\text{SE} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
The Plus Four Method greatly improves accuracy: use the same formulas after adding one success and one failure to each sample.
Two-sample \(z\) test statistic for \(H_0 : p_1 = p_2\) (large independent SRSs):
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where \(\hat{p}\) is the pooled (overall) proportion of successes.
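These proportion formulas are short enough to compute by hand. A sketch in Python (assuming SciPy for the Normal critical values and \(p\)-values; the counts are invented):

```python
import numpy as np
from scipy import stats

z_star = stats.norm.ppf(0.975)        # 95% critical value

# Large-sample CI for p
x, n = 52, 120                        # illustrative counts
p_hat = x / n
moe = z_star * np.sqrt(p_hat * (1 - p_hat) / n)
print("CI for p:", (p_hat - moe, p_hat + moe))

# Plus Four version: add two successes and two failures
p4 = (x + 2) / (n + 4)
moe4 = z_star * np.sqrt(p4 * (1 - p4) / (n + 4))
print("plus-four CI:", (p4 - moe4, p4 + moe4))

# Two-sample z test of H0: p1 = p2 using the pooled proportion
x1, n1, x2, n2 = 52, 120, 41, 130
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
print("z =", z, " p-value =", 2 * stats.norm.sf(abs(z)))
```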
## The \(\chi^2\) Test
Calculating the cells of the expected matrix:
$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}}$$
\(\chi^2\) test statistic for testing whether the row and column variables in an \(r\times c\) table are unrelated (expected cell counts not too small):
$$\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$
with \(p\)-values from the \(\chi^2\) distribution with \((r-1)(c-1)\) degrees of freedom.
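SciPy performs the whole \(\chi^2\) calculation, including the expected matrix and the \((r-1)(c-1)\) degrees of freedom, in one call. A sketch with an invented table:

```python
import numpy as np
from scipy import stats

# Illustrative 2 x 3 table of observed counts
observed = np.array([[30, 45, 25],
                     [20, 35, 45]])

chi2, p, df, expected = stats.chi2_contingency(observed, correction=False)
print("expected counts:\n", expected)  # row total * column total / table total
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p:.4f}")  # df = (2-1)(3-1) = 2
```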
## Inference for Regression
Conditions for regression inference: \(n\) observations on \(x\) and \(y\). The response \(y\) for any fixed \(x\) has a Normal distribution with mean given by the true regression line \(y=\alpha+\beta x\) and standard deviation \(\sigma\). Parameters are \(\alpha, \beta, \sigma\).
Estimate \(\alpha\) by the intercept \(a\) and \(\beta\) by the slope \(b\) of the least-squares line.
Estimate \(\sigma\) by the regression standard error:
$$s = \sqrt{\frac{1}{n-2}\sum \text{residual}^2}$$
A \(t\) confidence interval for the regression slope \(\beta\) can be calculated by hand as follows:
$$b \pm t^* SE_b$$
where
$$SE_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$$
and \(t^*\) comes from the \(t\)-distribution with \(n-2\) degrees of freedom.
However, best practice strongly indicates using software to calculate all standard errors in regression.
Testing for no correlation, \(H_0: \rho = 0\):
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
with \(p\)-values from the \(t\)-distribution with \(n-2\) degrees of freedom,
where \(\rho\) is the parameter that \(r\) estimates.
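In keeping with the advice to use software, scipy.stats.linregress returns the slope, intercept, correlation, standard error of the slope, and the \(p\)-value for \(H_0:\rho=0\) (equivalently, \(H_0:\beta=0\)). A sketch with invented data:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])  # illustrative data

res = stats.linregress(x, y)         # slope b, intercept a, r, p-value, SE_b
n = len(x)
t_star = stats.t.ppf(0.975, df=n - 2)
print("95% CI for beta:", (res.slope - t_star * res.stderr,
                           res.slope + t_star * res.stderr))
print("test of H0: rho = 0 gives p =", res.pvalue)
```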
## ANOVA
ANOVA tests whether \(k\) populations have the same mean, based on independent SRSs from \(k\) Normal populations. The \(p\)-values come from the \(F\) distribution with \(k-1\) and \(N-k\) degrees of freedom, where \(N\) is the total number of observations in all samples.
Describe the data using the \(k\) sample means \((\bar{x}_i)\) and standard deviations \((s_i)\) and side-by-side graphs of the samples. The overall sample size is \(N=n_1+n_2+\ldots+n_k\), and the grand mean \((\bar{x})\) is the arithmetic average of all \(N\) observations.
The \(F\) test statistic is given by
$$F = \frac{MSB}{MSW}$$
where MSB is the between-group mean sum of squares (or MS Factor on TI calculators):
$$MSB = \frac{n_1(\bar{x}_1-\bar{x})^2 + n_2(\bar{x}_2-\bar{x})^2 + \cdots + n_k(\bar{x}_k-\bar{x})^2}{k-1}$$
and MSW is the within-group mean sum of squares (or MSE, the mean sum of squares for error, on TI):
$$MSW = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2}{N-k}$$
However, with some algebra, this reduces to a more computationally friendly version:
$$MSW = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_k-1)s_k^2}{N-k}$$
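Both the by-hand MSB/MSW computation and the one-call software route are easy to check against each other. A sketch in Python (assuming SciPy; the group data are invented):

```python
import numpy as np
from scipy import stats

g1 = np.array([6.1, 5.8, 6.4, 6.0])  # illustrative groups
g2 = np.array([5.2, 5.5, 5.0, 5.4])
g3 = np.array([6.8, 7.1, 6.5, 7.0])
groups = [g1, g2, g3]

k = len(groups)
N = sum(len(g) for g in groups)
grand = np.concatenate(groups).mean()  # grand mean of all N observations

msb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (k - 1)
msw = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (N - k)
F = msb / msw
print("F =", F, " p =", stats.f.sf(F, k - 1, N - k))

print(stats.f_oneway(g1, g2, g3))      # should match the above
```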
## ANOVA Post Hoc Testing
After finding a significant difference between the sample means, we employ a *post hoc* test to ferret out which group means differ significantly. Tukey’s HSD (Honestly Significant Difference) is the most common. However, Tukey’s HSD is liberal and tends to err on the side of finding significant differences. Scheffé’s or Dunnett’s test may be preferable where a more conservative or more flexible approach is appropriate.
Tukey’s HSD must be computed for each individual pair of group means and depends upon the harmonic mean (see below). The HSD value is given by:
$$HSD = q^* \sqrt{\frac{MSW}{n_{ij}}}$$
where \(q^*\) is a value from the Studentized Range Statistic (found in a table based upon \(\alpha\) and degrees of freedom).
We flag as significantly different any pair of groups \(i\) and \(j\) such that
$$|\bar{x}_i - \bar{x}_j| \geq HSD$$
### Harmonic Mean
The harmonic mean \(H\) is the reciprocal of the arithmetic mean of the reciprocals. For 2 real numbers \(r\) and \(s\) we have
$$H = \frac{2}{\frac{1}{r} + \frac{1}{s}}$$
which simplifies to
$$H = \frac{2rs}{r+s}$$
In the HSD calculation, \(n_{ij}\) refers to the harmonic mean of group sizes \(n_i\) and \(n_j\):
$$n_{ij} = \frac{2}{\frac{1}{n_i} + \frac{1}{n_j}} = \frac{2 n_i n_j}{n_i + n_j}$$
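A sketch of the full HSD procedure in Python, using the harmonic mean for unequal group sizes. This assumes a recent SciPy, whose scipy.stats.studentized_range replaces the printed \(q^*\) table; the group data are invented:

```python
import numpy as np
from scipy import stats

g1 = np.array([6.1, 5.8, 6.4, 6.0])
g2 = np.array([5.2, 5.5, 5.0, 5.4])
g3 = np.array([6.8, 7.1, 6.5, 7.0, 6.9])  # unequal size on purpose
groups = [g1, g2, g3]

k = len(groups)
N = sum(len(g) for g in groups)
msw = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (N - k)

q_star = stats.studentized_range.ppf(0.95, k, N - k)  # q* at alpha = 0.05

for i in range(k):
    for j in range(i + 1, k):
        ni, nj = len(groups[i]), len(groups[j])
        n_ij = 2 * ni * nj / (ni + nj)                # harmonic mean of sizes
        hsd = q_star * np.sqrt(msw / n_ij)
        diff = abs(groups[i].mean() - groups[j].mean())
        flag = "significant" if diff >= hsd else "not significant"
        print(f"groups {i+1} vs {j+1}: |diff| = {diff:.3f}, HSD = {hsd:.3f} -> {flag}")
```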