A categorical (qualitative) variable takes values in a finite set of categories (labels), e.g. blood type (A, B, AB, O) or letter grade (A, B, C, D, F).
Data for categorical variables are summarized by counts (frequencies).
Suppose a variable has $k$ categories. We observe counts $$ O_1, O_2, \dots, O_k, $$ with total sample size $$ n = \sum_{i=1}^k O_i. $$
A hypothesis test compares a null hypothesis $H_0$ (the status-quo claim) against an alternative hypothesis $H_1$.
Given a test statistic $T$, the p-value is the probability, computed under $H_0$, of obtaining a value of $T$ at least as extreme as the one observed.
Decision: reject $H_0$ if the p-value is less than the significance level $\alpha$; otherwise, do not reject $H_0$.
Chi-square tests are built from the idea: compare observed counts to expected counts under $H_0$.
Let $E_i$ be the expected count in category $i$ under $H_0$.
A natural measure of discrepancy is $$ \sum_{i=1}^k (O_i - E_i)^2, $$ but this depends on the scale of $E_i$. So we standardize by dividing by $E_i$: $$ \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}. $$
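As a quick illustration, the statistic can be computed directly from this formula. A minimal Python/NumPy sketch (the counts here are made up for illustration):

```python
import numpy as np

def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O_i - E_i)^2 / E_i."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

# Hypothetical counts for k = 4 categories with n = 100 and equal
# expected counts E_i = 25 under H0.
stat = chi_square_stat([18, 25, 30, 27], [25, 25, 25, 25])
```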
Under mild conditions (large enough expected counts), and assuming $H_0$ is true, $$ \chi^2 \ \approx\ \chi^2(\text{df}), $$ a chi-square distribution with appropriate degrees of freedom.
Why “approx”? Because the chi-square distribution is an asymptotic (large-sample) result: when the expected counts are large, the standardized deviations $(O_i - E_i)/\sqrt{E_i}$ are approximately standard normal, and a sum of squared standard normals is chi-square distributed.
If $Z_1,\dots,Z_\nu$ are independent standard normals, $Z_j \sim \mathcal{N}(0,1)$, then $$ Q = \sum_{j=1}^{\nu} Z_j^2 $$ follows a chi-square distribution with $\nu$ degrees of freedom: $$ Q \sim \chi^2(\nu). $$
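This definition is easy to check by simulation. A minimal NumPy sketch (the sample size and seed are arbitrary choices); the sample mean and variance of $Q$ should land near $\nu = 5$ and $2\nu = 10$:

```python
import numpy as np

rng = np.random.default_rng(0)
nu = 5                                    # degrees of freedom
z = rng.standard_normal((100_000, nu))    # independent N(0,1) draws
q = (z ** 2).sum(axis=1)                  # each row: Q = sum of nu squared normals

# For a chi-square(nu) variable: E[Q] = nu, Var(Q) = 2*nu.
print(round(q.mean(), 2), round(q.var(), 2))
```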
Properties:
- $Q \ge 0$.
- $\mathbb{E}[Q] = \nu$ and $\mathrm{Var}(Q) = 2\nu$.
- The density is right-skewed; as $\nu$ grows, the distribution becomes approximately normal.
Test whether one categorical variable follows a specified distribution.
Suppose there are $k$ categories. Under $H_0$ we assume probabilities $$ p_1, p_2, \dots, p_k,\quad p_i \ge 0,\quad \sum_{i=1}^k p_i = 1. $$
If we observe $n$ independent outcomes, the count vector $(O_1,\dots,O_k)$ follows a multinomial distribution under $H_0$: $$ (O_1,\dots,O_k) \sim \mathrm{Multinomial}\left(n; p_1,\dots,p_k\right). $$
Under $H_0$, expected counts are: $$ E_i = n p_i,\quad i=1,\dots,k. $$
If $p_1,\dots,p_k$ are fully specified (no parameters estimated from the data), then $$ \text{df} = k - 1. $$
If the model contains $m$ unknown parameters estimated from the data, then: $$ \text{df} = k - 1 - m. $$
Explanation of $k-1$: counts sum to $n$, so only $k-1$ counts are free.
Estimating parameters uses up additional constraints, reducing df further.
Compute $\chi^2_{\text{obs}}$ from the sample. Under $H_0$: $$ \chi^2_{\text{obs}} \approx \chi^2(\text{df}). $$
Reject $H_0$ if p-value $< \alpha$.
Example (fair die): $k = 6$ categories, $p_i = 1/6$ for each face, df $= k - 1 = 5$.
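For the die example, the whole goodness-of-fit test is one call in SciPy. A sketch with hypothetical roll counts (the data are invented for illustration):

```python
from scipy import stats

# Hypothetical counts from n = 120 rolls of a die (illustrative data).
observed = [14, 21, 25, 17, 22, 21]
expected = [120 / 6] * 6            # E_i = n * p_i = 20 under H0

stat, p_value = stats.chisquare(observed, f_exp=expected)
# df = k - 1 = 5; reject H0 if p_value < alpha
```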
Test whether two categorical variables are independent.
Example: is breastfeeding duration associated with ASD status? (This appears as an exercise below.)
Let variable $A$ have $r$ categories and variable $B$ have $c$ categories.
Observed counts $O_{ij}$ arranged in an $r \times c$ table.
Row sums: $$ O_{i\cdot} = \sum_{j=1}^c O_{ij} $$ Column sums: $$ O_{\cdot j} = \sum_{i=1}^r O_{ij} $$ Total: $$ n = \sum_{i=1}^r \sum_{j=1}^c O_{ij}. $$
Null hypothesis (no association / independence): $$ H_0: A \text{ and } B \text{ are independent.} $$ Formally, for all $i,j$: $$ P(A=i, B=j) = P(A=i)P(B=j). $$
Alternative hypothesis: $$ H_1: A \text{ and } B \text{ are not independent (associated).} $$
Under independence, $$ P(A=i, B=j) = P(A=i)P(B=j). $$ Estimate $P(A=i)$ and $P(B=j)$ by sample proportions: $$ \widehat{P}(A=i) = \frac{O_{i\cdot}}{n}, \quad \widehat{P}(B=j) = \frac{O_{\cdot j}}{n}. $$
Thus the expected count in cell $(i,j)$ is: $$ E_{ij} = n \cdot \widehat{P}(A=i)\widehat{P}(B=j) = n \cdot \frac{O_{i\cdot}}{n}\cdot \frac{O_{\cdot j}}{n} = \frac{O_{i\cdot} O_{\cdot j}}{n}. $$
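This formula vectorizes neatly: the full matrix of expected counts is the outer product of the row and column sums, divided by $n$. A NumPy sketch on a made-up $2 \times 3$ table:

```python
import numpy as np

# Hypothetical 2 x 3 table of observed counts.
O = np.array([[20, 30, 50],
              [30, 20, 50]])

n = O.sum()                  # total sample size
row = O.sum(axis=1)          # row sums O_{i.}
col = O.sum(axis=0)          # column sums O_{.j}

E = np.outer(row, col) / n   # E_{ij} = O_{i.} * O_{.j} / n
```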
Why? An $r\times c$ table has $rc$ cells, but the $r$ row sums and $c$ column sums are fixed by the estimation, giving $r + c$ constraints, one of which is redundant (both sets of totals sum to $n$). That leaves $r + c - 1$ independent constraints.
Hence free cells: $$ rc - (r+c-1) = (r-1)(c-1). $$
Under $H_0$: $$ \chi^2_{\text{obs}} \approx \chi^2((r-1)(c-1)). $$ p-value: $$ \text{p-value} = P\left(\chi^2(\text{df}) \ge \chi^2_{\text{obs}}\right). $$
Reject $H_0$ if p-value $< \alpha$.
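The entire procedure (expected counts, statistic, degrees of freedom, p-value) is bundled in SciPy's `chi2_contingency`. A sketch on a hypothetical $2 \times 3$ table:

```python
from scipy import stats
import numpy as np

# Hypothetical 2 x 3 contingency table of observed counts.
O = np.array([[20, 30, 50],
              [30, 20, 50]])

stat, p_value, df, expected = stats.chi2_contingency(O)
# df = (2 - 1) * (3 - 1) = 2; reject H0 if p_value < alpha
```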
Each observation (person, trial, unit) should contribute to exactly one cell and be independent of others.
Common rule of thumb: every expected count satisfies $E_{ij} \ge 5$.
More nuanced guideline (Cochran): no expected count is below 1, and at most 20% of the expected counts are below 5.
If violated: combine sparse categories, or use an exact test (e.g. Fisher's exact test for $2 \times 2$ tables).
The test of homogeneity is very similar to the test of independence, but the framing differs: for homogeneity, a separate sample is drawn from each of several populations and we test whether the distribution of a categorical variable is the same across populations; for independence, a single sample is cross-classified by two variables. The statistic, expected counts, and degrees of freedom are computed identically.
(If needed: analyze standardized residuals to see which cells contribute most.)
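Standardized (Pearson) residuals $(O_{ij} - E_{ij})/\sqrt{E_{ij}}$ are easy to compute by hand. A NumPy sketch on a hypothetical table; cells with residuals far from 0 (roughly $|r_{ij}| > 2$) contribute most to the statistic:

```python
import numpy as np

# Hypothetical 2 x 3 table of observed counts.
O = np.array([[20, 30, 50],
              [30, 20, 50]], dtype=float)

# Expected counts under independence: E_{ij} = O_{i.} * O_{.j} / n.
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

# Pearson (standardized) residuals for each cell.
residuals = (O - E) / np.sqrt(E)
```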
An instructor claims that the grade distribution of their students is different from the department’s grade distribution.
The department-wide grade distribution for introductory statistics courses is:
A random sample of 250 introductory statistics students taught by this instructor produced the following grades:
Using a 5% level of significance, test the instructor’s claim that their students’ grade distribution differs from the department’s distribution.
A research company is investigating whether the proportion of consumers who purchase a cereal is different depending on shelf placement.
They consider four shelf locations: bottom, middle, top, and end (end-of-aisle display).
Test whether there is a preference among the four shelf placements. Use the p-value method with significance level $\alpha = 0.05$.
The observed counts are:
| Shelf Placement | Bottom | Middle | Top | End |
|---|---|---|---|---|
| Observed | 45 | 67 | 55 | 73 |
Is there a relationship between autism spectrum disorder (ASD) and breastfeeding?
To investigate this question, a researcher asked mothers of ASD and non-ASD children to report the length of time they breastfed their children.
Does the data provide enough evidence to conclude that breastfeeding and ASD are independent?
Conduct the test at the 1% significance level.
The observed data are summarized in the contingency table below.
| ASD \ Breastfeeding duration | None | Less than 2 months | 2 to 6 months | Over 6 months | Total |
|---|---|---|---|---|---|
| Yes | 241 | 198 | 164 | 215 | 818 |
| No | 20 | 25 | 27 | 44 | 116 |
| Total | 261 | 223 | 191 | 259 | 934 |
(Source: Schultz, Klonoff-Cohen, Wingard, Askhoomoff, Macera, Ji & Bacher, 2006.)
The sample data below show the number of small, medium, and large companies that do and do not provide dental insurance.
Test whether there is a relationship between dental insurance coverage and company size. Use $\alpha = 0.05$.
The observed data are:
| Dental Insurance | Small | Medium | Large |
|---|---|---|---|
| Yes | 21 | 25 | 19 |
| No | 46 | 39 | 10 |