{
"cells": [
{
"cell_type": "markdown",
"id": "8290d04d-fb31-4224-b584-0b305767e5f7",
"metadata": {},
"source": [
"# The $\\chi^2$ Test of Independence\n",
"\n",
"Just as ANOVA extends $t$ procedures to cases with more than two samples of numeric data, $\\chi^2$ methods extend $z$-proportion procedures to categorical data with more than two groups or categories."
]
},
{
"cell_type": "markdown",
"id": "2b51c8cb-9cbd-4f89-878d-82b40c9ffa76",
"metadata": {},
"source": [
"## Example: Using R Calculations\n",
"\n",
"The table below shows, for a certain university, the number of students in each class year who have already chosen a major and the number who are still undecided."
]
},
{
"cell_type": "markdown",
"id": "07afd179-ed7f-4996-b411-149c9e3d920c",
"metadata": {},
"source": [
"| | Freshman | Sophomore | Junior |\n",
"|---|---|---|---|\n",
"| Have Chosen a Major | 114 | 168 | 198 |\n",
"| Have Not Chosen a Major | 212 | 171 | 92 |"
]
},
{
"cell_type": "markdown",
"id": "1f0efd76-6323-48ce-8952-81c367da5b45",
"metadata": {},
"source": [
"### Hypotheses\n",
"\n",
"The hypothesis setup in its most general form is as follows:\n",
"\n",
"- $H_0 : \\text{Variables are Independent}$\n",
"- $H_a : \\text{Variables are Dependent}$\n",
"\n",
"We usually name the variables explicitly to indicate what is being studied. In this case, the hypotheses are:\n",
"\n",
"- $H_0 : \\text{The proportion of students who have chosen a major is }\\textbf{independent }\\text{of year in school}$\n",
"- $H_a : \\text{The proportion of students who have chosen a major is }\\textbf{dependent }\\text{upon year in school}$"
]
},
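{
"cell_type": "markdown",
"id": "9e092df2-0718-4799-9d9b-5c220937f8b9",
"metadata": {},
"source": [
"### Observed Data Matrix\n",
"\n",
"We enter the observed counts as a matrix. Because `matrix()` fills column by column, the counts are listed one column (class year) at a time:"
]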
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c4007310-3e00-4e01-9b3c-0136932ad6da",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<tbody>\n",
"\t<tr><td>114</td><td>168</td><td>198</td></tr>\n",
"\t<tr><td>212</td><td>171</td><td>92</td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"\\begin{tabular}{lll}\n",
"\t 114 & 168 & 198\\\\\n",
"\t 212 & 171 & 92\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| 114 | 168 | 198 |\n",
"| 212 | 171 | 92 |\n",
"\n"
],
"text/plain": [
" [,1] [,2] [,3]\n",
"[1,] 114 168 198 \n",
"[2,] 212 171 92 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"obs <- matrix(c(114,212,168,171,198,92),ncol=3)\n",
"obs"
]
},
{
"cell_type": "markdown",
"id": "2fe4b6d1-4f70-4fd8-8e85-c8060af69745",
"metadata": {},
"source": [
"We add column titles and row titles as follows:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "fed6f4de-2943-4fc0-9fc3-69cd6514a2bd",
"metadata": {},
"outputs": [
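{
"data": {
"text/html": [
"<table>\n",
"<thead><tr><th></th><th scope=col>Freshmen</th><th scope=col>Sophomore</th><th scope=col>Junior</th></tr></thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>Have Chosen</th><td>114</td><td>168</td><td>198</td></tr>\n",
"\t<tr><th scope=row>Have NOT Chosen</th><td>212</td><td>171</td><td>92</td></tr>\n",
"</tbody>\n",
"</table>\n"
],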
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" & Freshmen & Sophomore & Junior\\\\\n",
"\\hline\n",
"\tHave Chosen & 114 & 168 & 198\\\\\n",
"\tHave NOT Chosen & 212 & 171 & 92\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| | Freshmen | Sophomore | Junior |\n",
"|---|---|---|---|\n",
"| Have Chosen | 114 | 168 | 198 |\n",
"| Have NOT Chosen | 212 | 171 | 92 |\n",
"\n"
],
"text/plain": [
" Freshmen Sophomore Junior\n",
"Have Chosen 114 168 198 \n",
"Have NOT Chosen 212 171 92 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"colnames(obs) <- c('Freshmen', 'Sophomore', 'Junior')\n",
"rownames(obs) <- c('Have Chosen', 'Have NOT Chosen')\n",
"obs"
]
},
{
"cell_type": "markdown",
"id": "3323d371-1c2f-41f0-8d83-3fcdebab3e52",
"metadata": {},
"source": [
"### Conduct the Test"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6a6d89a9-2dd8-4004-8eff-cb1d75b5846b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"\tPearson's Chi-squared test\n",
"\n",
"data: obs\n",
"X-squared = 68.207, df = 2, p-value = 1.545e-15\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"chisq.test(obs)"
]
},
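{
"cell_type": "markdown",
"id": "sketch-htest-components-md",
"metadata": {},
"source": [
"The printed summary is convenient, but the individual pieces can also be pulled out of the object that `chisq.test` returns. The cell below is a minimal sketch of that idea; the names `obs_check` and `result` are introduced here for illustration and are not part of the original example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-htest-components-code",
"metadata": {},
"outputs": [],
"source": [
"# Same counts as above, re-entered so this cell runs on its own\n",
"obs_check <- matrix(c(114, 212, 168, 171, 198, 92), ncol = 3)\n",
"\n",
"# Store the test result instead of only printing it\n",
"result <- chisq.test(obs_check)\n",
"\n",
"result$statistic  # the chi-squared test statistic\n",
"result$parameter  # degrees of freedom\n",
"result$p.value    # p-value\n",
"result$expected   # expected counts under independence"
]
},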
{
"cell_type": "markdown",
"id": "2e2d110c-48ff-415f-a51e-07d049b2520c",
"metadata": {},
"source": [
"### Reporting Out\n",
"\n",
"Because $p = 1.545\\times 10^{-15} < 0.05 = \\alpha$, we reject the null. We thus have evidence that the percentage of students who have chosen their majors depends upon which year in school they are."
]
},
{
"cell_type": "markdown",
"id": "97c272bc-8af9-4971-acce-4d18beb9d1e9",
"metadata": {},
"source": [
"## Example: Using Tables and Formulas\n",
"\n",
"We have the observed data matrix above. We need to calculate the expected matrix. For this, we will need a formula to work with. From the [formula sheet](https://faculty.ung.edu/rsinn/3350/StatsFormulas.pdf), we have the following for **calculating cells of the expected matrix**:\n",
"\n",
"$$\\text{expected count} = \\frac{\\text{row total}\\times \\text{column total}}{\\text{table total}}$$\n",
"\n",
"### Expected Matrix\n",
"\n",
"Starting with the observed data matrix:$$$$"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "eabf920e-c378-4c4d-be84-3e7401d57efa",
"metadata": {},
"outputs": [
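{
"data": {
"text/html": [
"<table>\n",
"<thead><tr><th></th><th scope=col>Freshmen</th><th scope=col>Sophomore</th><th scope=col>Junior</th></tr></thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>Have Chosen</th><td>114</td><td>168</td><td>198</td></tr>\n",
"\t<tr><th scope=row>Have NOT Chosen</th><td>212</td><td>171</td><td>92</td></tr>\n",
"</tbody>\n",
"</table>\n"
],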
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" & Freshmen & Sophomore & Junior\\\\\n",
"\\hline\n",
"\tHave Chosen & 114 & 168 & 198\\\\\n",
"\tHave NOT Chosen & 212 & 171 & 92\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| | Freshmen | Sophomore | Junior |\n",
"|---|---|---|---|\n",
"| Have Chosen | 114 | 168 | 198 |\n",
"| Have NOT Chosen | 212 | 171 | 92 |\n",
"\n"
],
"text/plain": [
" Freshmen Sophomore Junior\n",
"Have Chosen 114 168 198 \n",
"Have NOT Chosen 212 171 92 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"obs"
]
},
{
"cell_type": "markdown",
"id": "eb3661c3-aa90-4c25-8e16-7959d664af28",
"metadata": {},
"source": [
"We calculate the expected matrix with the top-left cell ($TL$) as follows:\n",
"\n",
"$$\\begin{align}TL &= \\frac{(114+168+198) \\times (114+212)}{955}\\\\&= \\frac{(480) \\times (326)}{955}\\\\&= \\frac{156480}{955}\\\\&=163.85\\end{align}$$\n",
"\n",
"The bottom-left ($BL$) is as follows:\n",
"$$\\begin{align}BL &= \\frac{(114+168+198) \\times (168+171)}{955}\\\\&=170.39\\end{align}$$"
]
},
{
"cell_type": "markdown",
"id": "f136df0e-06b3-421e-961c-95d2de55a78e",
"metadata": {},
"source": [
"Proceeding in the same for four more times, we have the following exp matrix:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "be6b7446-a9ba-49e1-81c3-adececdd3fd0",
"metadata": {},
"outputs": [
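{
"data": {
"text/html": [
"<table>\n",
"<thead><tr><th></th><th scope=col>Freshmen</th><th scope=col>Sophomore</th><th scope=col>Junior</th></tr></thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>Have Chosen</th><td>163.85</td><td>170.39</td><td>145.76</td></tr>\n",
"\t<tr><th scope=row>Have NOT Chosen</th><td>162.15</td><td>168.61</td><td>144.24</td></tr>\n",
"</tbody>\n",
"</table>\n"
],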
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" & Freshmen & Sophomore & Junior\\\\\n",
"\\hline\n",
"\tHave Chosen & 163.85 & 170.39 & 145.76\\\\\n",
"\tHave NOT Chosen & 162.15 & 168.61 & 144.24\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| | Freshmen | Sophomore | Junior |\n",
"|---|---|---|---|\n",
"| Have Chosen | 163.85 | 170.39 | 145.76 |\n",
"| Have NOT Chosen | 162.15 | 168.61 | 144.24 |\n",
"\n"
],
"text/plain": [
" Freshmen Sophomore Junior\n",
"Have Chosen 163.85 170.39 145.76\n",
"Have NOT Chosen 162.15 168.61 144.24"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"exp <- matrix(c(163.85,162.15,170.39,168.61,145.76,144.24),ncol=3)\n",
"colnames(exp) <- c('Freshmen', 'Sophomore', 'Junior')\n",
"rownames(exp) <- c('Have Chosen', 'Have NOT Chosen')\n",
"exp"
]
},
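{
"cell_type": "markdown",
"id": "sketch-expected-matrix-md",
"metadata": {},
"source": [
"As a cross-check on the hand calculation, the full expected matrix can be computed in one step from the row and column totals. This is a sketch rather than the method used on the formula sheet; `obs_check` and `expected` are names assumed here for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-expected-matrix-code",
"metadata": {},
"outputs": [],
"source": [
"# Same counts as above, re-entered so this cell runs on its own\n",
"obs_check <- matrix(c(114, 212, 168, 171, 198, 92), ncol = 3)\n",
"\n",
"# (row total x column total) / table total, for every cell at once\n",
"expected <- outer(rowSums(obs_check), colSums(obs_check)) / sum(obs_check)\n",
"\n",
"# Rounded to two decimals for comparison with the hand-entered matrix;\n",
"# chisq.test(obs_check)$expected holds the same unrounded values\n",
"round(expected, 2)"
]
},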
{
"cell_type": "markdown",
"id": "9217a2ac-0f9d-4950-962d-4c8d2135da71",
"metadata": {},
"source": [
"### Test Statistic $\\chi^2$\n",
"\n",
"To calcuate the $\\chi^2$ test statistic, referring to the formula sheet provides the following:\n",
"\n",
"$$\\chi^2 = \\sum \\frac{(O−E)^2}{E}$$\n",
"\n",
"where\n",
"- O : Observed Cell Count\n",
"- E : Expected Cell Count\n",
"\n",
"Hence:"
]
},
{
"cell_type": "markdown",
"id": "8fadb1fc-23d8-4908-9e4c-72c1c6039f2d",
"metadata": {},
"source": [
"$$\\begin{align}\\chi^2 &= \\frac{(114-163.85)^2}{163.85}+\\frac{(212-162.15)^2}{162.15}+\\frac{(168-170.39)^2}{170.39}+\\frac{(171-168.61)^2}{168.61}\\\\&+\\frac{(198-145.76)^2}{145.76}+\\frac{(92-144.24)^2}{144.24}\\\\&= \\frac{2485.0}{163.85}+\\frac{2485.0}{162.15}+\\frac{5.7}{170.39}+\\frac{5.7}{168.61}+\\frac{2729.0}{145.76}+\\frac{2729.0}{144.24}\\\\&= 15.17+15.33+0.03+0.03+18.72+18.92\\end{align}$$\n",
"\n",
"which gives:\n",
"\n",
"$\\displaystyle x^2\\approx 68.2$"
]
},
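{
"cell_type": "markdown",
"id": "sketch-chisq-statistic-md",
"metadata": {},
"source": [
"The same sum can be evaluated in R directly from the two matrices. The cell below is a minimal sketch, assuming the rounded expected counts from the expected matrix; the names `obs_check`, `exp_check`, and `chi_sq` are introduced only for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-chisq-statistic-code",
"metadata": {},
"outputs": [],
"source": [
"# Observed and (rounded) expected counts, entered column by column\n",
"obs_check <- matrix(c(114, 212, 168, 171, 198, 92), ncol = 3)\n",
"exp_check <- matrix(c(163.85, 162.15, 170.39, 168.61, 145.76, 144.24), ncol = 3)\n",
"\n",
"# Sum of (O - E)^2 / E over all six cells\n",
"chi_sq <- sum((obs_check - exp_check)^2 / exp_check)\n",
"\n",
"# Should agree with the hand result of about 68.2\n",
"round(chi_sq, 1)"
]
},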
{
"cell_type": "markdown",
"id": "0d213b5d-18ca-4f66-a60e-8adb28e922a5",
"metadata": {},
"source": [
"### Cutoff Value from Table\n",
"\n",
"To find ${\\chi^2}^*$ in the [class's $\\chi^2$ table](https://faculty.ung.edu/rsinn/3350/Table_ChiSquared.pdf), note that we have\n",
"\n",
"$$df = (r-1)(c-1)=2\\times 1=2$$\n",
"\n",
"where $r$ and $c$ are the numbers of rows and number of columns respectively in the observed and expected matrices. Both matrices should have identical shape. In the row where $df = 2$ and the column where $\\alpha = 0.05$, we find that:\n",
"\n",
"$${\\chi^2}^* = 5.991$$"
]
},
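{
"cell_type": "markdown",
"id": "sketch-critical-value-md",
"metadata": {},
"source": [
"The table lookup can also be reproduced with R's built-in $\\chi^2$ distribution functions. This is a sketch under the values $\\alpha = 0.05$ and $df = 2$ used above; the variable names `alpha`, `deg_free`, and `crit` are introduced here for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-critical-value-code",
"metadata": {},
"outputs": [],
"source": [
"alpha <- 0.05\n",
"deg_free <- 2\n",
"\n",
"# Critical value: the point with area alpha to its right\n",
"crit <- qchisq(1 - alpha, df = deg_free)\n",
"round(crit, 3)\n",
"\n",
"# Right-tail area beyond the test statistic reported by chisq.test (the p-value)\n",
"pchisq(68.207, df = deg_free, lower.tail = FALSE)"
]
},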
{
"cell_type": "markdown",
"id": "cf32c537-5b01-462d-abe8-80aee3d72e08",
"metadata": {},
"source": [
"### Reporting Out\n",
"\n",
"Since $\\chi^2 = 68.2 > 5.991 = {\\chi^2}^2$, we reject the null hypothesis. We thus have evidence for the alternative which indicates that the proportion of students who have chosen their major depends upon the year in school."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}