A categorical (qualitative) variable takes values in a finite set of categories (labels), e.g. blood type (A, B, AB, O) or letter grade (A, B, C, D, F).
Data for categorical variables are summarized by counts (frequencies).
Suppose a variable has $k$ categories. We observe counts $$ O_1, O_2, \dots, O_k, $$ with total sample size $$ n = \sum_{i=1}^k O_i. $$
A hypothesis test compares a null hypothesis $H_0$ (the status-quo claim) against an alternative hypothesis $H_1$.
Given a test statistic $T$, the p-value is the probability, computed under $H_0$, of obtaining a value of $T$ at least as extreme as the one observed.
Decision: reject $H_0$ if the p-value is less than the significance level $\alpha$; otherwise, do not reject $H_0$.
Chi-square tests are built from the idea: compare observed counts to expected counts under $H_0$.
Let $E_i$ be the expected count in category $i$ under $H_0$.
A natural measure of discrepancy is $$ \sum_{i=1}^k (O_i - E_i)^2, $$ but this depends on the scale of $E_i$. So we standardize by dividing by $E_i$: $$ \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}. $$
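As a quick illustration, the statistic can be computed directly from this formula. A minimal Python/NumPy sketch (the counts here are made up for illustration):

```python
import numpy as np

def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O_i - E_i)^2 / E_i."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

# Hypothetical counts for k = 4 categories with n = 100 and equal
# expected counts E_i = 25 under H0.
stat = chi_square_stat([18, 25, 30, 27], [25, 25, 25, 25])
```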
Under mild conditions (large enough expected counts), and assuming $H_0$ is true, $$ \chi^2 \ \approx\ \chi^2(\text{df}), $$ a chi-square distribution with appropriate degrees of freedom.
Why “approx”? Because the chi-square distribution is an asymptotic (large-sample) result: when the expected counts are large, the standardized deviations $(O_i - E_i)/\sqrt{E_i}$ are approximately standard normal, and a sum of squared standard normals is chi-square distributed.
If $Z_1,\dots,Z_\nu$ are independent standard normals, $Z_j \sim \mathcal{N}(0,1)$, then $$ Q = \sum_{j=1}^{\nu} Z_j^2 $$ follows a chi-square distribution with $\nu$ degrees of freedom: $$ Q \sim \chi^2(\nu). $$
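This definition is easy to check by simulation. A minimal NumPy sketch (the sample size and seed are arbitrary choices); the sample mean and variance of $Q$ should land near $\nu = 5$ and $2\nu = 10$:

```python
import numpy as np

rng = np.random.default_rng(0)
nu = 5                                    # degrees of freedom
z = rng.standard_normal((100_000, nu))    # independent N(0,1) draws
q = (z ** 2).sum(axis=1)                  # each row: Q = sum of nu squared normals

# For a chi-square(nu) variable: E[Q] = nu, Var(Q) = 2*nu.
print(round(q.mean(), 2), round(q.var(), 2))
```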
Properties:
- $Q \ge 0$.
- $\mathbb{E}[Q] = \nu$ and $\mathrm{Var}(Q) = 2\nu$.
- The density is right-skewed; as $\nu$ grows, the distribution becomes approximately normal.
Test whether one categorical variable follows a specified distribution.
Suppose there are $k$ categories. Under $H_0$ we assume probabilities $$ p_1, p_2, \dots, p_k,\quad p_i \ge 0,\quad \sum_{i=1}^k p_i = 1. $$
If we observe $n$ independent outcomes, the count vector $(O_1,\dots,O_k)$ follows a multinomial distribution under $H_0$: $$ (O_1,\dots,O_k) \sim \mathrm{Multinomial}\left(n; p_1,\dots,p_k\right). $$
Under $H_0$, expected counts are: $$ E_i = n p_i,\quad i=1,\dots,k. $$
If $p_1,\dots,p_k$ are fully specified (no parameters estimated from the data), then $$ \text{df} = k - 1. $$
If the model contains $m$ unknown parameters estimated from the data, then: $$ \text{df} = k - 1 - m. $$
Explanation of $k-1$: counts sum to $n$, so only $k-1$ counts are free.
Estimating parameters uses up additional constraints, reducing df further.
Compute $\chi^2_{\text{obs}}$ from the sample. Under $H_0$: $$ \chi^2_{\text{obs}} \approx \chi^2(\text{df}). $$
Reject $H_0$ if p-value $< \alpha$.
Example (fair die): $k = 6$ categories, $p_i = 1/6$ for each face, df $= k - 1 = 5$.
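For the die example, the whole goodness-of-fit test is one call in SciPy. A sketch with hypothetical roll counts (the data are invented for illustration):

```python
from scipy import stats

# Hypothetical counts from n = 120 rolls of a die (illustrative data).
observed = [14, 21, 25, 17, 22, 21]
expected = [120 / 6] * 6            # E_i = n * p_i = 20 under H0

stat, p_value = stats.chisquare(observed, f_exp=expected)
# df = k - 1 = 5; reject H0 if p_value < alpha
```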
Test whether two categorical variables are independent.
Example: is breastfeeding duration associated with ASD status? (This appears as an exercise below.)
Let variable $A$ have $r$ categories and variable $B$ have $c$ categories.
Observed counts $O_{ij}$ arranged in an $r \times c$ table.
Row sums: $$ O_{i\cdot} = \sum_{j=1}^c O_{ij} $$ Column sums: $$ O_{\cdot j} = \sum_{i=1}^r O_{ij} $$ Total: $$ n = \sum_{i=1}^r \sum_{j=1}^c O_{ij}. $$
Null hypothesis (no association / independence): $$ H_0: A \text{ and } B \text{ are independent.} $$ Formally, for all $i,j$: $$ P(A=i, B=j) = P(A=i)P(B=j). $$
Alternative hypothesis: $$ H_1: A \text{ and } B \text{ are not independent (associated).} $$
Under independence, $$ P(A=i, B=j) = P(A=i)P(B=j). $$ Estimate $P(A=i)$ and $P(B=j)$ by sample proportions: $$ \widehat{P}(A=i) = \frac{O_{i\cdot}}{n}, \quad \widehat{P}(B=j) = \frac{O_{\cdot j}}{n}. $$
Thus the expected count in cell $(i,j)$ is: $$ E_{ij} = n \cdot \widehat{P}(A=i)\widehat{P}(B=j) = n \cdot \frac{O_{i\cdot}}{n}\cdot \frac{O_{\cdot j}}{n} = \frac{O_{i\cdot} O_{\cdot j}}{n}. $$
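This formula vectorizes neatly: the full matrix of expected counts is the outer product of the row and column sums, divided by $n$. A NumPy sketch on a made-up $2 \times 3$ table:

```python
import numpy as np

# Hypothetical 2 x 3 table of observed counts.
O = np.array([[20, 30, 50],
              [30, 20, 50]])

n = O.sum()                  # total sample size
row = O.sum(axis=1)          # row sums O_{i.}
col = O.sum(axis=0)          # column sums O_{.j}

E = np.outer(row, col) / n   # E_{ij} = O_{i.} * O_{.j} / n
```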
Why? An $r\times c$ table has $rc$ cells, but the $r$ row sums and $c$ column sums are fixed by the estimation, giving $r + c$ constraints, one of which is redundant (both sets of totals sum to $n$). That leaves $r + c - 1$ independent constraints.
Hence free cells: $$ rc - (r+c-1) = (r-1)(c-1). $$
Under $H_0$: $$ \chi^2_{\text{obs}} \approx \chi^2((r-1)(c-1)). $$ p-value: $$ \text{p-value} = P\left(\chi^2(\text{df}) \ge \chi^2_{\text{obs}}\right). $$
Reject $H_0$ if p-value $< \alpha$.
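The entire procedure (expected counts, statistic, degrees of freedom, p-value) is bundled in SciPy's `chi2_contingency`. A sketch on a hypothetical $2 \times 3$ table:

```python
from scipy import stats
import numpy as np

# Hypothetical 2 x 3 contingency table of observed counts.
O = np.array([[20, 30, 50],
              [30, 20, 50]])

stat, p_value, df, expected = stats.chi2_contingency(O)
# df = (2 - 1) * (3 - 1) = 2; reject H0 if p_value < alpha
```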
Each observation (person, trial, unit) should contribute to exactly one cell and be independent of others.
Common rule of thumb: every expected count satisfies $E_{ij} \ge 5$.
More nuanced guideline (Cochran): no expected count is below 1, and at most 20% of the expected counts are below 5.
If violated: combine sparse categories, or use an exact test (e.g. Fisher's exact test for $2 \times 2$ tables).
The test of homogeneity is very similar to the test of independence, but the framing differs: for homogeneity, a separate sample is drawn from each of several populations and we test whether the distribution of a categorical variable is the same across populations; for independence, a single sample is cross-classified by two variables. The statistic, expected counts, and degrees of freedom are computed identically.
(If needed: analyze standardized residuals to see which cells contribute most.)
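Standardized (Pearson) residuals $(O_{ij} - E_{ij})/\sqrt{E_{ij}}$ are easy to compute by hand. A NumPy sketch on a hypothetical table; cells with residuals far from 0 (roughly $|r_{ij}| > 2$) contribute most to the statistic:

```python
import numpy as np

# Hypothetical 2 x 3 table of observed counts.
O = np.array([[20, 30, 50],
              [30, 20, 50]], dtype=float)

# Expected counts under independence: E_{ij} = O_{i.} * O_{.j} / n.
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

# Pearson (standardized) residuals for each cell.
residuals = (O - E) / np.sqrt(E)
```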
An instructor claims that the grade distribution of their students is different from the department’s grade distribution.
The department-wide grade distribution for introductory statistics courses is:
A random sample of 250 introductory statistics students taught by this instructor produced the following grades:
Using a 5% level of significance, test the instructor’s claim that their students’ grade distribution differs from the department’s distribution.
A research company is investigating whether the proportion of consumers who purchase a cereal is different depending on shelf placement.
They consider four shelf locations: bottom, middle, top, and end (end-of-aisle display).
Test whether there is a preference among the four shelf placements. Use the p-value method with significance level $\alpha = 0.05$.
The observed counts are:
| Shelf Placement | Bottom | Middle | Top | End |
|---|---|---|---|---|
| Observed | 45 | 67 | 55 | 73 |
Is there a relationship between autism spectrum disorder (ASD) and breastfeeding?
To investigate this question, a researcher asked mothers of ASD and non-ASD children to report the length of time they breastfed their children.
Does the data provide enough evidence to conclude that breastfeeding and ASD are independent?
Conduct the test at the 1% significance level.
The observed data are summarized in the contingency table below.
| ASD \ Breastfeeding duration | None | Less than 2 months | 2 to 6 months | Over 6 months | Total |
|---|---|---|---|---|---|
| Yes | 241 | 198 | 164 | 215 | 818 |
| No | 20 | 25 | 27 | 44 | 116 |
| Total | 261 | 223 | 191 | 259 | 934 |
(Source: Schultz, Klonoff-Cohen, Wingard, Askhoomoff, Macera, Ji & Bacher, 2006.)
The sample data below show the number of small, medium, and large companies that do and do not provide dental insurance.
Test whether there is a relationship between dental insurance coverage and company size. Use $\alpha = 0.05$.
The observed data are:
| Dental Insurance | Small | Medium | Large |
|---|---|---|---|
| Yes | 21 | 25 | 19 |
| No | 46 | 39 | 10 |