The \(\chi^2\) Test of Independence#
Just as ANOVA extends \(t\) procedures to cases with more than two samples of numeric data, \(\chi^2\) methods are the mathematical extension of \(z\)-proportion procedures for categorical data.
Example: Using R Calculations#
The table below shows, for a certain university, the number of students who are still undecided about their majors compared to the number who have already chosen a major.
| | Freshman | Sophomore | Junior |
|---|---|---|---|
| Have Chosen a Major | 114 | 168 | 198 |
| Have not Chosen a Major | 212 | 171 | 92 |
Hypotheses#
The hypothesis setup in its most general form is as follows:
\(H_0 : \text{Variables are Independent}\)
\(H_a : \text{Variables are Dependent}\)
We often name the variables explicitly to better indicate what is being studied. In this case, the hypotheses are as follows:
\(H_0 : \text{The proportion of students who have chosen a major is }\textbf{independent }\text{of year in school}\)
\(H_a : \text{The proportion of students who have chosen a major is }\textbf{dependent }\text{upon year in school}\)
Observed Data Matrix#
We create the observed data below:
obs <- matrix(c(114,212,168,171,198,92),ncol=3)
obs
114 | 168 | 198 |
212 | 171 | 92 |
We add column titles and row titles as follows:
colnames(obs) <- c('Freshmen', 'Sophomore', 'Junior')
rownames(obs) <- c('Have Chosen', 'Have NOT Chosen')
obs
| | Freshmen | Sophomore | Junior |
|---|---|---|---|
| Have Chosen | 114 | 168 | 198 |
| Have NOT Chosen | 212 | 171 | 92 |
Conduct the Test#
chisq.test(obs)
Pearson's Chi-squared test
data: obs
X-squared = 68.207, df = 2, p-value = 1.545e-15
Reporting Out#
Because \(p = 1.545\times 10^{-15} < 0.05 = \alpha\), we reject the null hypothesis. We thus have evidence that the percentage of students who have chosen a major depends upon year in school.
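The decision above can also be read directly from the object that `chisq.test()` returns, rather than from the printed summary. The following is a minimal sketch; the names `res` and `alpha` are chosen here for illustration and do not come from the original notes.

```r
# Observed counts, entered column by column as earlier in this section
obs <- matrix(c(114, 212, 168, 171, 198, 92), ncol = 3)
colnames(obs) <- c('Freshmen', 'Sophomore', 'Junior')
rownames(obs) <- c('Have Chosen', 'Have NOT Chosen')

alpha <- 0.05            # significance level used in this example
res <- chisq.test(obs)   # the result is a list of class "htest"

res$statistic            # the X-squared test statistic
res$parameter            # degrees of freedom
res$p.value              # p-value
res$expected             # expected counts under independence
res$p.value < alpha      # TRUE means we reject the null hypothesis
```

The `res$expected` component is handy for comparing against an expected matrix worked out by hand.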
Example: Using Tables and Formulas#
We have the observed data matrix above and now need to calculate the expected matrix. From the formula sheet, each cell of the expected matrix is calculated as follows:
\(\displaystyle E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}\)
Expected Matrix#
Starting with the observed data matrix:
obs
| | Freshmen | Sophomore | Junior |
|---|---|---|---|
| Have Chosen | 114 | 168 | 198 |
| Have NOT Chosen | 212 | 171 | 92 |
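We calculate the top-left cell (\(TL\)) of the expected matrix as follows:
\(\displaystyle TL = \frac{(114+168+198) \times (114+212)}{955} \approx 163.85\)
The top-middle cell (\(TM\)), which uses the first row total and the second column total, is as follows:
\(\displaystyle TM = \frac{(114+168+198) \times (168+171)}{955} \approx 170.39\)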
Proceeding in the same way for the four remaining cells, we have the following exp matrix:
exp <- matrix(c(163.85,162.15,170.39,168.61,145.76,144.24),ncol=3)
colnames(exp) <- c('Freshmen', 'Sophomore', 'Junior')
rownames(exp) <- c('Have Chosen', 'Have NOT Chosen')
exp
| | Freshmen | Sophomore | Junior |
|---|---|---|---|
| Have Chosen | 163.85 | 170.39 | 145.76 |
| Have NOT Chosen | 162.15 | 168.61 | 144.24 |
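The hand-calculated expected counts can be cross-checked in R by forming every row total times column total product at once and dividing by the grand total. This is only a sketch: the names `row_totals`, `col_totals`, `n`, and `expected` are introduced here for illustration (`expected` is used instead of `exp` to avoid masking R's built-in `exp()` function).

```r
# Observed counts, entered column by column as earlier in this section
obs <- matrix(c(114, 212, 168, 171, 198, 92), ncol = 3)
colnames(obs) <- c('Freshmen', 'Sophomore', 'Junior')
rownames(obs) <- c('Have Chosen', 'Have NOT Chosen')

row_totals <- rowSums(obs)   # 480 and 475
col_totals <- colSums(obs)   # 326, 339 and 290
n <- sum(obs)                # grand total, 955

# outer() builds the 2 x 3 matrix of (row total) x (column total) products
expected <- outer(row_totals, col_totals) / n
round(expected, 2)
```

Rounded to two decimal places, the result should agree with the six values typed into the exp matrix above.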
Test Statistic \(\chi^2\)#
To calculate the \(\chi^2\) test statistic, the formula sheet provides the following:
\(\displaystyle \chi^2 = \sum \frac{(O - E)^2}{E}\)
where
O : Observed Cell Count
E : Expected Cell Count
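Hence:
\(\displaystyle \chi^2 = \frac{(114-163.85)^2}{163.85} + \frac{(212-162.15)^2}{162.15} + \frac{(168-170.39)^2}{170.39} + \frac{(171-168.61)^2}{168.61} + \frac{(198-145.76)^2}{145.76} + \frac{(92-144.24)^2}{144.24}\)
which gives:
\(\displaystyle \chi^2\approx 68.2\)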
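The sum of \((O - E)^2 / E\) over all six cells can be checked in R with element-wise matrix arithmetic. This is a minimal sketch using the rounded expected counts from this section; the names `exp_counts` and `chi_sq` are illustrative and not part of the original notes.

```r
# Observed counts and the rounded, hand-calculated expected counts
obs <- matrix(c(114, 212, 168, 171, 198, 92), ncol = 3)
exp_counts <- matrix(c(163.85, 162.15, 170.39, 168.61, 145.76, 144.24), ncol = 3)

# (O - E)^2 / E is computed cell by cell, then summed over all cells
chi_sq <- sum((obs - exp_counts)^2 / exp_counts)
round(chi_sq, 1)
```

Because the expected counts were rounded to two decimal places, the result can differ slightly in later digits from the value reported by `chisq.test()`.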
Cutoff Value from Table#
To find \({\chi^2}^*\) in the class’s \(\chi^2\) table, note that we have
\(\displaystyle df = (r-1)(c-1) = (2-1)(3-1) = 2\)
where \(r\) and \(c\) are the number of rows and the number of columns, respectively, in the observed and expected matrices. Both matrices should have identical shape. In the row where \(df = 2\) and the column where \(\alpha = 0.05\), we find that:
\(\displaystyle {\chi^2}^* = 5.991\)
Reporting Out#
Since \(\chi^2 = 68.2 > 5.991 = {\chi^2}^*\), we reject the null hypothesis. We thus have evidence for the alternative hypothesis, which indicates that the proportion of students who have chosen a major depends upon year in school.
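The table lookup can be cross-checked in R with `qchisq()`, which returns the quantile of the \(\chi^2\) distribution for a given cumulative probability. The sketch below assumes the same \(\alpha = 0.05\) and the 2 × 3 table from this example; the names `alpha`, `df`, `critical`, and `chi_sq` are illustrative.

```r
alpha <- 0.05
df <- (2 - 1) * (3 - 1)                  # (rows - 1) * (columns - 1)

# Upper-tail cutoff: the value with area alpha to its right
critical <- qchisq(1 - alpha, df = df)
round(critical, 3)

chi_sq <- 68.2                           # hand-calculated test statistic
chi_sq > critical                        # TRUE means we reject the null hypothesis

# Upper-tail p-value for the same statistic
pchisq(chi_sq, df = df, lower.tail = FALSE)
```

Comparing the statistic with the cutoff and comparing the p-value with \(\alpha\) are equivalent ways of reaching the same decision.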