Seminar 4

Chi-Square Tests for Categorical Data

Hypothesis Testing: Goodness-of-Fit and Test of Independence (No Association)¶


1. Categorical data and frequency tables¶

1.1 Categorical variable¶

A categorical (qualitative) variable takes values in a finite set of categories (labels), e.g.

  • blood type: A, B, AB, O
  • device type: Mac, Windows, Linux
  • satisfaction: low, medium, high

Data for categorical variables are summarized by counts (frequencies).

1.2 Observed counts¶

Suppose a variable has $k$ categories. We observe counts $$ O_1, O_2, \dots, O_k, $$ with total sample size $$ n = \sum_{i=1}^k O_i. $$
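For example, raw observations can be tallied into the counts $O_i$ with `collections.Counter` (the data below are made up for illustration):

```python
from collections import Counter

# Hypothetical raw observations of a categorical variable (blood type)
observations = ["A", "O", "B", "O", "A", "AB", "O", "A", "O", "B"]

counts = Counter(observations)   # observed counts O_i per category
n = sum(counts.values())         # total sample size n

print(counts)   # e.g. Counter({'O': 4, 'A': 3, 'B': 2, 'AB': 1})
print(n)        # 10
```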


2. Hypothesis testing recap¶

A hypothesis test compares:

  • Null hypothesis $H_0$ (default model)
  • Alternative hypothesis $H_1$ (departure from the model)

Given a test statistic $T$:

  • compute the observed value $T_{\text{obs}}$
  • find the distribution of $T$ under $H_0$
  • compute a p-value or compare to a critical value
  • decide to reject or fail to reject $H_0$

2.1 Significance level, p-value¶

  • $\alpha$ = significance level (commonly $0.05$)
  • p-value = probability (under $H_0$) of observing a statistic at least as extreme as $T_{\text{obs}}$

Decision:

  • reject $H_0$ if p-value $< \alpha$

3. Why chi-square tests work (core theory)¶

Chi-square tests are built from the idea: compare observed counts to expected counts under $H_0$.

Let $E_i$ be the expected count in category $i$ under $H_0$.

A natural measure of discrepancy is $$ \sum_{i=1}^k (O_i - E_i)^2, $$ but this depends on the scale of $E_i$. So we standardize by dividing by $E_i$: $$ \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}. $$
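The formula above translates directly into a few lines of NumPy (the observed and expected counts here are made up for illustration):

```python
import numpy as np

# Illustrative observed and expected counts (made-up numbers, same total)
O = np.array([25, 30, 45])
E = np.array([30, 30, 40])

# chi-square statistic: sum of (O_i - E_i)^2 / E_i
chi2_stat = ((O - E) ** 2 / E).sum()
print(round(chi2_stat, 4))   # 1.4583
```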

3.1 Asymptotic chi-square distribution (informal but essential)¶

Under mild conditions (large enough expected counts), and assuming $H_0$ is true, $$ \chi^2 \ \approx\ \chi^2(\text{df}), $$ a chi-square distribution with appropriate degrees of freedom.

Why “approx”? Because the chi-square distribution is an asymptotic (large-sample) result based on:

  • multinomial sampling for counts
  • central limit theorem / normal approximation
  • quadratic form convergence

3.2 Chi-square distribution definition¶

If $Z_1,\dots,Z_\nu$ are independent standard normals, $Z_j \sim \mathcal{N}(0,1)$, then $$ Q = \sum_{j=1}^{\nu} Z_j^2 $$ follows a chi-square distribution with $\nu$ degrees of freedom: $$ Q \sim \chi^2(\nu). $$

Properties:

  • support: $[0,\infty)$
  • mean: $E[Q] = \nu$
  • variance: $\mathrm{Var}(Q) = 2\nu$
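These properties can be verified numerically; a small sketch (the choice $\nu = 7$ is arbitrary):

```python
import numpy as np
from scipy import stats

nu = 7  # degrees of freedom (arbitrary choice for illustration)

# Theoretical mean and variance from scipy: nu and 2*nu
mean, var = stats.chi2.stats(nu, moments="mv")
print(mean, var)  # 7.0 14.0

# Empirical check: Q = sum of nu squared standard normals
rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, nu))
q = (z ** 2).sum(axis=1)
print(round(q.mean(), 1))  # close to nu
```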

4. Chi-square Goodness-of-Fit (GoF) test¶

4.1 Goal¶

Test whether one categorical variable follows a specified distribution.

4.2 Setup (multinomial model)¶

Suppose there are $k$ categories. Under $H_0$ we assume probabilities $$ p_1, p_2, \dots, p_k,\quad p_i \ge 0,\quad \sum_{i=1}^k p_i = 1. $$

If we observe $n$ independent outcomes, the count vector $(O_1,\dots,O_k)$ follows a multinomial distribution under $H_0$: $$ (O_1,\dots,O_k) \sim \mathrm{Multinomial}\left(n; p_1,\dots,p_k\right). $$

4.3 Hypotheses¶

  • Null hypothesis: $$ H_0: \text{The true category probabilities equal } (p_1,\dots,p_k). $$
  • Alternative hypothesis: $$ H_1: \text{The true probabilities differ from } (p_1,\dots,p_k). $$

4.4 Expected counts¶

Under $H_0$, expected counts are: $$ E_i = n p_i,\quad i=1,\dots,k. $$

4.5 Test statistic¶

$$ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}. $$

4.6 Degrees of freedom (GoF)¶

If $p_1,\dots,p_k$ are fully specified (no parameters estimated from the data), then $$ \text{df} = k - 1. $$

If the model contains $m$ unknown parameters estimated from the data, then: $$ \text{df} = k - 1 - m. $$

Explanation of $k-1$: counts sum to $n$, so only $k-1$ counts are free.
Estimating parameters uses up additional constraints, reducing df further.

4.7 Decision rule and p-value¶

Compute $\chi^2_{\text{obs}}$ from the sample. Under $H_0$: $$ \chi^2_{\text{obs}} \approx \chi^2(\text{df}). $$

  • p-value: $$ \text{p-value} = P\left(\chi^2(\text{df}) \ge \chi^2_{\text{obs}}\right). $$

Reject $H_0$ if p-value $< \alpha$.

4.8 Typical example: fair die¶

$k=6$ categories, $p_i = 1/6$, df $= 5$.
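A sketch of the die test with `scipy.stats.chisquare`; the roll counts below are made up for illustration:

```python
from scipy import stats

# Hypothetical counts from 120 rolls of a die (made-up data)
observed = [18, 22, 16, 25, 19, 20]   # sums to 120
# Under H0 (fair die) each face has expected count 120/6 = 20,
# which is scipy's default (uniform) when f_exp is omitted.
stat, p = stats.chisquare(observed)

print(round(stat, 2))   # 2.5
print(p > 0.05)         # True -> fail to reject H0 at the 5% level
```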


5. Chi-square Test of Independence (No Association)¶

5.1 Goal¶

Test whether two categorical variables are independent.

Example:

  • gender (M/F) and preference (A/B/C)
  • treatment group (control/drug) and outcome (success/failure)

5.2 Contingency table¶

Let variable $A$ have $r$ categories and variable $B$ have $c$ categories.

Observed counts $O_{ij}$ arranged in an $r \times c$ table.

Row sums: $$ O_{i\cdot} = \sum_{j=1}^c O_{ij} $$ Column sums: $$ O_{\cdot j} = \sum_{i=1}^r O_{ij} $$ Total: $$ n = \sum_{i=1}^r \sum_{j=1}^c O_{ij}. $$

5.3 Hypotheses¶

  • Null hypothesis (no association / independence): $$ H_0: A \text{ and } B \text{ are independent.} $$ Formally, for all $i,j$: $$ P(A=i, B=j) = P(A=i)P(B=j). $$

  • Alternative hypothesis: $$ H_1: A \text{ and } B \text{ are not independent (associated).} $$

5.4 Expected counts under independence¶

Under independence, $$ P(A=i, B=j) = P(A=i)P(B=j). $$ Estimate $P(A=i)$ and $P(B=j)$ by sample proportions: $$ \widehat{P}(A=i) = \frac{O_{i\cdot}}{n}, \quad \widehat{P}(B=j) = \frac{O_{\cdot j}}{n}. $$

Thus the expected count in cell $(i,j)$ is: $$ E_{ij} = n \cdot \widehat{P}(A=i)\widehat{P}(B=j) = n \cdot \frac{O_{i\cdot}}{n}\cdot \frac{O_{\cdot j}}{n} = \frac{O_{i\cdot} O_{\cdot j}}{n}. $$
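The row-sum/column-sum formula maps directly onto an outer product; a small sketch with a made-up $2\times 2$ table:

```python
import numpy as np

# Illustrative 2x2 table of observed counts (made-up data)
O = np.array([[30, 10],
              [20, 40]])

row = O.sum(axis=1)   # row totals O_{i.}
col = O.sum(axis=0)   # column totals O_{.j}
n = O.sum()           # grand total

E = np.outer(row, col) / n   # E_ij = O_{i.} * O_{.j} / n
print(E)   # [[20. 20.]
           #  [30. 30.]]
```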

5.5 Test statistic¶

$$ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}. $$

5.6 Degrees of freedom (independence test)¶

$$ \text{df} = (r-1)(c-1). $$

Why?
An $r\times c$ table has $rc$ cells, but:

  • row totals impose $r$ constraints
  • column totals impose $c$ constraints
  • but both sets of totals fix the same grand total $n$, so one constraint is redundant: only $r+c-1$ constraints are independent

Hence free cells: $$ rc - (r+c-1) = (r-1)(c-1). $$

5.7 p-value¶

Under $H_0$: $$ \chi^2_{\text{obs}} \approx \chi^2((r-1)(c-1)). $$ p-value: $$ \text{p-value} = P\left(\chi^2(\text{df}) \ge \chi^2_{\text{obs}}\right). $$

Reject $H_0$ if p-value $< \alpha$.
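The whole procedure is packaged in `scipy.stats.chi2_contingency`; a sketch with made-up counts (SciPy applies the Yates correction only when df $= 1$, so none is used for this $2\times 3$ table):

```python
from scipy import stats

# Illustrative 2x3 contingency table (made-up data)
table = [[20, 30, 25],
         [30, 20, 25]]

# Returns statistic, p-value, degrees of freedom, and expected counts
chi2, p, dof, expected = stats.chi2_contingency(table)

print(round(chi2, 2), dof)   # 4.0 2
print(round(p, 4))           # 0.1353  (= exp(-chi2/2) when df = 2)
print(p < 0.05)              # False -> fail to reject independence
```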


6. Assumptions and practical rules¶

6.1 Independence of observations¶

Each observation (person, trial, unit) should contribute to exactly one cell and be independent of others.

6.2 Expected counts should not be too small¶

Common rule of thumb:

  • all expected counts $E_{ij} \ge 5$

More nuanced guideline:

  • no more than 20% of cells with $E_{ij}<5$
  • none with $E_{ij}<1$

If violated:

  • merge rare categories
  • use Fisher’s exact test for $2\times 2$ (small sample)
  • consider exact / Monte Carlo methods
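For a small $2\times 2$ table, `scipy.stats.fisher_exact` computes an exact p-value; the classic "lady tasting tea" counts illustrate it:

```python
from scipy import stats

# Classic "lady tasting tea" style 2x2 table with small counts
table = [[3, 1],
         [1, 3]]

oddsratio, p = stats.fisher_exact(table, alternative="two-sided")
print(oddsratio)     # 9.0
print(round(p, 4))   # 0.4857 -> no evidence against independence
```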

7. Relationship to other chi-square tests (for context)¶

7.1 Test of homogeneity¶

Very similar to independence, but framing differs:

  • independence: one sample, two variables
  • homogeneity: several samples (one per population), one categorical variable; compare its distribution across the samples

Mathematically, both tests use the same $\chi^2$ statistic and the same degrees of freedom.

8. Summary¶

8.1 Goodness-of-Fit (GoF)¶

  • one categorical variable
  • compare observed counts $O_i$ to expected $E_i = np_i$
  • test statistic: $$ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} $$
  • df: $k-1-m$, where $m$ is the number of parameters estimated from the data ($m = 0$ when the distribution is fully specified)

8.2 Independence (No Association)¶

  • two categorical variables
  • expected: $$ E_{ij} = \frac{O_{i\cdot} O_{\cdot j}}{n} $$
  • test statistic: $$ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
  • df: $(r-1)(c-1)$

9. What chi-square tests do not tell you¶

  • causality
  • direction of association
  • which cells drive the association (without post-hoc residual analysis)

(If needed: analyze standardized residuals to see which cells contribute most.)
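A common post-hoc tool is the Pearson (standardized) residual $(O_{ij}-E_{ij})/\sqrt{E_{ij}}$; cells with absolute residuals around 2 or more contribute most to the statistic. A sketch with made-up counts:

```python
import numpy as np

# Illustrative 2x2 table (made-up data)
O = np.array([[30, 10],
              [20, 40]])
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

# Pearson residuals: cells with |residual| >~ 2 stand out
residuals = (O - E) / np.sqrt(E)
print(np.round(residuals, 2))
# [[ 2.24 -2.24]
#  [-1.83  1.83]]
```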


Example: Chi-square Goodness-of-Fit (GoF) test

An instructor claims that the grade distribution of their students is different from the department’s grade distribution.

The department-wide grade distribution for introductory statistics courses is:

  • A: 35%
  • B: 23%
  • C: 25%
  • D: 10%
  • F: 7%

A random sample of 250 introductory statistics students taught by this instructor produced the following grades:

  • A: 80
  • B: 50
  • C: 58
  • D: 38
  • F: 24

Using a 5% level of significance, test the instructor’s claim that their students’ grade distribution differs from the department’s distribution.
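One convenient way to carry out the computation is `scipy.stats.chisquare`, using the counts and department proportions given above:

```python
from scipy import stats

observed = [80, 50, 58, 38, 24]       # A, B, C, D, F
p = [0.35, 0.23, 0.25, 0.10, 0.07]    # department-wide proportions
n = sum(observed)                     # 250
expected = [n * pi for pi in p]       # [87.5, 57.5, 62.5, 25.0, 17.5]

stat, pval = stats.chisquare(observed, f_exp=expected)   # df = 5 - 1 = 4

print(round(stat, 2))   # 11.12
print(pval < 0.05)      # True -> reject H0: the distributions differ
```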

Example 2

Problem: Chi-Square Goodness-of-Fit Test (Shelf Placement Preference)¶

A research company is investigating whether the proportion of consumers who purchase a cereal differs by shelf placement.

They consider four shelf locations:

  • Bottom Shelf
  • Middle Shelf
  • Top Shelf
  • Aisle End Shelf

Test whether there is a preference among the four shelf placements. Use the p-value method with significance level $\alpha = 0.05$.

The observed counts are:

Shelf Placement   Bottom   Middle   Top   End
Observed              45       67    55    73
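Under $H_0$ (no preference) each shelf has probability $1/4$, so with $n = 240$ every expected count is $60$. A sketch of the test; note the result is borderline, since $\chi^2 = 7.8$ sits just below the 5% critical value $\chi^2_{0.05,3} \approx 7.815$:

```python
from scipy import stats

observed = [45, 67, 55, 73]   # Bottom, Middle, Top, End (n = 240)
# H0: no preference, p_i = 1/4 -> expected count 60 per shelf
# (uniform is scipy's default when f_exp is omitted); df = 4 - 1 = 3
stat, p = stats.chisquare(observed)

print(round(stat, 2))   # 7.8
print(round(p, 3))      # 0.05 (just above alpha = 0.05)
print(p < 0.05)         # False -> fail to reject H0
```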

Example 3 : Chi-square Test of Independence (No Association)

Problem: Chi-Square Test of Independence (ASD and Breastfeeding)¶

Is there a relationship between autism spectrum disorder (ASD) and breastfeeding?

To investigate this question, a researcher asked mothers of ASD and non-ASD children to report the length of time they breastfed their children.

Does the data provide enough evidence to conclude that breastfeeding and ASD are independent?
Conduct the test at the 1% significance level.

The observed data are summarized in the contingency table below.

ASD     None   Less than 2 months   2 to 6 months   Over 6 months   Total
Yes      241                  198             164             215     818
No        20                   25              27              44     116
Total    261                  223             191             259     934

(Source: Schultz, Klonoff-Cohen, Wingard, Askhoomoff, Macera, Ji & Bacher, 2006.)
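A sketch of the computation with `scipy.stats.chi2_contingency`, passing only the observed cells (the Total row and column are margins that SciPy recomputes itself):

```python
from scipy import stats

# Observed counts by breastfeeding duration: none, <2 mo, 2-6 mo, >6 mo
table = [[241, 198, 164, 215],   # ASD: yes
         [ 20,  25,  27,  44]]   # ASD: no

chi2, p, dof, expected = stats.chi2_contingency(table)   # df = (2-1)(4-1) = 3

print(round(chi2, 2), dof)   # 11.22 3
print(p < 0.01)              # False -> fail to reject independence at the 1% level
```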

Example 4

Problem: Chi-Square Test of Independence (Dental Insurance and Company Size)¶

The sample data below show how many small, medium, and large companies do and do not provide dental insurance.

Test whether there is a relationship between dental insurance coverage and company size. Use $\alpha = 0.05$.

The observed data are:

Dental Insurance   Small   Medium   Large
Yes                   21       25      19
No                    46       39      10
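As in the previous example, `scipy.stats.chi2_contingency` handles the expected counts, statistic, and p-value in one call:

```python
from scipy import stats

table = [[21, 25, 19],   # dental insurance: yes (Small, Medium, Large)
         [46, 39, 10]]   # dental insurance: no

chi2, p, dof, expected = stats.chi2_contingency(table)   # df = (2-1)(3-1) = 2

print(round(chi2, 2), dof)   # 9.91 2
print(p < 0.05)              # True -> reject H0: coverage is associated with size
```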