The \(\chi^2\) Test of Independence#

Just as ANOVA is the straightforward extension of \(t\) procedures to cases where we have more than two samples of numeric data, \(\chi^2\) methods are the mathematical extension of \(z\)-proportion procedures for categorical data.

Example: Using R Calculations#

The table below shows, for a certain university, the number of students who have already chosen a major compared with the number who are still undecided, broken down by year in school.

                          Freshman   Sophomore   Junior
Have Chosen a Major          114        168        198
Have not Chosen a Major      212        171         92

Hypotheses#

The hypothesis setup in its most general form is as follows:

  • \(H_0 : \text{Variables are Independent}\)

  • \(H_a : \text{Variables are Dependent}\)

We often name the variables explicitly to better indicate what is being studied. In this case, the hypotheses would be as follows:

  • \(H_0 : \text{The proportion of students who have chosen a major is }\textbf{independent }\text{of year in school}\)

  • \(H_a : \text{The proportion of students who have chosen a major is }\textbf{dependent }\text{upon year in school}\)
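One way to see what independence would mean here is to compare, within each year, the proportion of students who have chosen a major; under \(H_0\) these proportions would be roughly equal across years. A minimal sketch, assuming the counts from the table above (the names `counts` and `col_props` are introduced only for this illustration):

```r
# Counts from the table above; `counts` is a name used only in this sketch
counts <- matrix(c(114, 212, 168, 171, 198, 92), ncol = 3,
                 dimnames = list(c("Have Chosen", "Have NOT Chosen"),
                                 c("Freshman", "Sophomore", "Junior")))

# margin = 2 divides each cell by its column total,
# giving the share of each row within each year
col_props <- prop.table(counts, margin = 2)
round(col_props, 3)
```

In these data the share of students who have chosen a major rises from about 35% of freshmen to about 68% of juniors; the test below assesses whether differences of this size are plausible under independence.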

Observed Data Matrix#

We create the observed data below:

obs <- matrix(c(114,212,168,171,198,92),ncol=3)
obs
114 168 198
212 171  92

We add column titles and row titles as follows:

colnames(obs) <- c('Freshmen', 'Sophomore', 'Junior')
rownames(obs) <- c('Have Chosen', 'Have NOT Chosen')
obs
                Freshmen Sophomore Junior
Have Chosen          114       168    198
Have NOT Chosen      212       171     92

Conduct the Test#

chisq.test(obs)
	Pearson's Chi-squared test

data:  obs
X-squared = 68.207, df = 2, p-value = 1.545e-15
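The object returned by `chisq.test` can also be stored and inspected piece by piece, which is convenient when individual values are needed for reporting. A short sketch, assuming the same counts (the name `res` is chosen here only for illustration):

```r
# Same observed counts as above
obs <- matrix(c(114, 212, 168, 171, 198, 92), ncol = 3)

# Store the test result instead of only printing it
res <- chisq.test(obs)

res$statistic   # the X-squared test statistic
res$parameter   # degrees of freedom
res$p.value     # p-value
res$expected    # expected counts under independence
```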

Reporting Out#

Because \(p = 1.545\times 10^{-15} < 0.05 = \alpha\), we reject the null. We thus have evidence that the percentage of students who have chosen their majors depends upon which year in school they are in.

Example: Using Tables and Formulas#

We have the observed data matrix above. We need to calculate the expected matrix. For this, we will need a formula to work with. From the formula sheet, we have the following for calculating cells of the expected matrix:

\[\text{expected count} = \frac{\text{row total}\times \text{column total}}{\text{table total}}\]

Expected Matrix#

Starting with the observed data matrix:

obs
                Freshmen Sophomore Junior
Have Chosen          114       168    198
Have NOT Chosen      212       171     92

We calculate the expected matrix with the top-left cell (\(TL\)) as follows:

\[\begin{split}\begin{align}TL &= \frac{(114+168+198) \times (114+212)}{955}\\&= \frac{(480) \times (326)}{955}\\&= \frac{156480}{955}\\&=163.85\end{align}\end{split}\]

The top-middle cell (\(TM\)) is as follows:

\[\begin{split}\begin{align}TM &= \frac{(114+168+198) \times (168+171)}{955}\\&= \frac{(480) \times (339)}{955}\\&=170.39\end{align}\end{split}\]

Proceeding in the same way for the four remaining cells, we obtain the following expected matrix, stored as exp:

obs
                Freshmen Sophomore Junior
Have Chosen          114       168    198
Have NOT Chosen      212       171     92
exp <- matrix(c(163.85,162.15,170.39,168.61,145.76,144.24),ncol=3)
colnames(exp) <- c('Freshmen', 'Sophomore', 'Junior')
rownames(exp) <- c('Have Chosen', 'Have NOT Chosen')
exp
                Freshmen Sophomore Junior
Have Chosen       163.85    170.39 145.76
Have NOT Chosen   162.15    168.61 144.24
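The six hand-computed cells can be checked by applying the expected-count formula to every cell at once. The sketch below is one way to do this in base R, assuming the same observed counts; the names `row_totals`, `col_totals`, `table_total`, and `expected` are illustrative and not part of the original notes.

```r
# Observed counts, filled column by column
obs <- matrix(c(114, 212, 168, 171, 198, 92), ncol = 3)

row_totals  <- rowSums(obs)   # 480, 475
col_totals  <- colSums(obs)   # 326, 339, 290
table_total <- sum(obs)       # 955

# outer() forms every (row total) x (column total) product,
# so dividing by the table total gives all expected counts
expected <- outer(row_totals, col_totals) / table_total
round(expected, 2)
```

Using a name such as `expected` rather than `exp` also avoids masking R's built-in `exp()` function.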

Test Statistic \(\chi^2\)#

To calculate the \(\chi^2\) test statistic, we refer to the formula sheet, which provides the following:

\[\chi^2 = \sum \frac{(O-E)^2}{E}\]

where

  • O : Observed Cell Count

  • E : Expected Cell Count

Hence:

\[\begin{split}\begin{align}\chi^2 &= \frac{(114-163.85)^2}{163.85}+\frac{(212-162.15)^2}{162.15}+\frac{(168-170.39)^2}{170.39}+\frac{(171-168.61)^2}{168.61}\\&+\frac{(198-145.76)^2}{145.76}+\frac{(92-144.24)^2}{144.24}\\&= \frac{2485.0}{163.85}+\frac{2485.0}{162.15}+\frac{5.7}{170.39}+\frac{5.7}{168.61}+\frac{2729.0}{145.76}+\frac{2729.0}{144.24}\\&= 15.17+15.33+0.03+0.03+18.72+18.92\end{align}\end{split}\]

which gives:

\(\displaystyle \chi^2\approx 68.2\)
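The same sum can be reproduced in R as a check on the arithmetic. This is a sketch under the assumption that the expected counts are recomputed without rounding, so the result may differ from the hand calculation in the later decimal places; `contributions` and `chi_sq` are names introduced here for illustration.

```r
# Observed counts and unrounded expected counts
obs <- matrix(c(114, 212, 168, 171, 198, 92), ncol = 3)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Each cell's contribution (O - E)^2 / E
contributions <- (obs - expected)^2 / expected
round(contributions, 2)

# The test statistic is the sum over all six cells
chi_sq <- sum(contributions)
chi_sq
```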

Cutoff Value from Table#

To find \({\chi^2}^*\) in the class’s \(\chi^2\) table, note that we have

\[df = (r-1)(c-1)=(2-1)(3-1)=1\times 2=2\]

where \(r\) and \(c\) are the number of rows and the number of columns, respectively, in the observed and expected matrices. Both matrices should have identical shape. In the row where \(df = 2\) and the column where \(\alpha = 0.05\), we find that:

\[{\chi^2}^* = 5.991\]
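The table lookup can be cross-checked in R with the base functions `qchisq` and `pchisq`. The following is a minimal sketch, assuming \(\alpha = 0.05\) and the 2-by-3 table from this example; the variable names `alpha`, `dof`, `crit`, and `p_val` are illustrative only.

```r
alpha <- 0.05
dof   <- (2 - 1) * (3 - 1)   # (rows - 1) * (columns - 1)

# Cutoff: the value with area alpha to its right
crit <- qchisq(1 - alpha, df = dof)
round(crit, 3)

# Upper-tail p-value for the statistic reported by chisq.test
p_val <- pchisq(68.207, df = dof, lower.tail = FALSE)
p_val
```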

Reporting Out#

Since \(\chi^2 = 68.2 > 5.991 = {\chi^2}^*\), we reject the null hypothesis. We thus have evidence for the alternative, which indicates that the proportion of students who have chosen their major depends upon the year in school.