In many applications we want to compare more than two population means.
A naive approach would be to perform many pairwise $t$-tests.
This is incorrect, because it inflates the Type I error rate.
ANOVA provides a single global test for comparing multiple means.
Are all group means equal, or does at least one group differ?
ANOVA tests equality of means, not variances (despite the name).
Suppose we have $k$ groups.
Group $i$ has observations: $$ X_{i1}, X_{i2}, \dots, X_{in_i}, \quad i = 1,\dots,k. $$
Total sample size: $$ N = \sum_{i=1}^{k} n_i. $$
The one-way ANOVA model is: $$ X_{ij} = \mu_i + \varepsilon_{ij}, $$ where $\mu_i$ is the (unknown) mean of group $i$ and $\varepsilon_{ij}$ is a random error term.
Assumptions on errors: $$ \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2), $$ independently for all $i,j$.
Equivalently: $$ X_{ij} \sim \mathcal{N}(\mu_i, \sigma^2). $$
Important: all $k$ groups are assumed to share the same error variance $\sigma^2$; only the means $\mu_i$ may differ.
ANOVA is based on variance decomposition.
Total variability in the data can be split into two components: variability *between* the groups and variability *within* the groups.
If group means are truly equal, between-group variability should be small relative to within-group variability.
Group means: $$ \bar{X}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij} $$
Grand mean: $$ \bar{X} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} X_{ij} $$
Measures total variability: $$ \text{SST} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(X_{ij} - \bar{X})^2 $$
Measures variability due to differences between group means: $$ \text{SSB} = \sum_{i=1}^{k} n_i(\bar{X}_i - \bar{X})^2 $$
Measures variability within groups: $$ \text{SSW} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(X_{ij} - \bar{X}_i)^2 $$
This decomposition is exact (not approximate): $$ \text{SST} = \text{SSB} + \text{SSW}. $$
The degrees of freedom decompose the same way: $$ (N - 1) = (k - 1) + (N - k) $$
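The exactness of the decomposition is easy to check numerically. A minimal pure-Python sketch (the group values below are made up purely for illustration):

```python
# Verify SST = SSB + SSW on small illustrative data (values are made up).
groups = [[3.0, 5.0, 4.0], [6.0, 8.0, 7.0, 7.0], [2.0, 3.0, 4.0]]

N = sum(len(g) for g in groups)                      # total sample size
grand_mean = sum(x for g in groups for x in g) / N   # grand mean over all data
group_means = [sum(g) / len(g) for g in groups]

sst = sum((x - grand_mean) ** 2 for g in groups for x in g)
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

print(sst, ssb + ssw)  # the two numbers agree exactly (up to float rounding)
```

Note that the identity holds for any data set, not just under the model assumptions; it is an algebraic fact.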
To compare variances, sums of squares are normalized by their degrees of freedom: $$ \text{MSB} = \frac{\text{SSB}}{k-1}, \qquad \text{MSW} = \frac{\text{SSW}}{N-k}. $$
Interpretation: MSB measures how far the group means spread around the grand mean, while MSW estimates the common error variance $\sigma^2$.
The ANOVA test statistic is: $$ F_{\text{obs}} = \frac{\text{MSB}}{\text{MSW}} $$
Under $H_0$: $$ F_{\text{obs}} \sim F(k-1, N-k) $$
Key theoretical result: under $H_0$, $\text{SSB}/\sigma^2 \sim \chi^2(k-1)$ and $\text{SSW}/\sigma^2 \sim \chi^2(N-k)$, and the two quantities are independent.
Therefore: $$ \frac{\text{SSB}/(k-1)}{\text{SSW}/(N-k)} \sim F(k-1, N-k) $$
Given significance level $\alpha$:
Reject $H_0$ if: $$ F_{\text{obs}} > F_{1-\alpha}(k-1, N-k) $$
Equivalently, reject if: $$ \text{p-value} < \alpha $$
| Source | Sum of Squares | df | Mean Square | F |
|---|---|---|---|---|
| Between groups | SSB | k − 1 | MSB = SSB/(k − 1) | MSB/MSW |
| Within groups | SSW | N − k | MSW = SSW/(N − k) | |
| Total | SST | N − 1 | | |

Interpreting the size of $F_{\text{obs}}$:

| $F_{\text{obs}}$ | Interpretation |
|---|---|
| $\approx 1$ | Probably equal |
| $\gg 1$ | Almost surely different |
| Somewhat larger than 1 | Ambiguous |
A researcher claims that there is a difference in the average age of assistant professors, associate professors, and full professors at her university.
Faculty members are selected randomly, and their ages are recorded.
Assume that faculty ages are normally distributed.
Test the researcher’s claim at the $\alpha = 0.01$ significance level.
The observed data are:
| Rank | Ages |
|---|---|
| Assistant Professor | 28, 32, 36, 42, 50, 33, 38 |
| Associate Professor | 44, 61, 52, 54, 62, 45, 46 |
| Professor | 54, 56, 55, 65, 52, 50, 46 |
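A plain-Python sketch of the computation for this example. The critical value $F_{0.99}(2, 18) \approx 6.01$ is read from an $F$-table; treat it as an assumed table value rather than something computed here:

```python
# One-way ANOVA for the faculty-age data at alpha = 0.01.
groups = [
    [28, 32, 36, 42, 50, 33, 38],  # assistant professors
    [44, 61, 52, 54, 62, 45, 46],  # associate professors
    [54, 56, 55, 65, 52, 50, 46],  # full professors
]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = sum(x for g in groups for x in g) / N
means = [sum(g) / len(g) for g in groups]          # 37.0, 52.0, 54.0

ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
msb = ssb / (k - 1)        # between-group mean square, df = k - 1 = 2
msw = ssw / (N - k)        # within-group mean square,  df = N - k = 18
f_obs = msb / msw

F_CRIT = 6.01              # tabulated F_{0.99}(2, 18); assumed, not computed
reject = f_obs > F_CRIT
print(f_obs, reject)       # F_obs ≈ 12.62 > 6.01, so reject H0 at alpha = 0.01
```

Since $F_{\text{obs}} \approx 12.62$ far exceeds the critical value, the data support the researcher's claim that mean ages differ across ranks.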
In one-way ANOVA we test the global null hypothesis
$H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$
If ANOVA rejects $H_0$, we only know that at least one mean differs, but:
❌ ANOVA does not tell us which groups differ.
To identify where the differences lie, we perform post hoc multiple comparison tests.
Suppose we have $k$ groups.
The number of pairwise comparisons is $$ m = \binom{k}{2} = \frac{k(k-1)}{2}. $$
If we test each comparison at level $\alpha = 0.05$, then the probability of making at least one Type I error increases rapidly.
One should never use multiple two-sample t-tests when comparing more than two groups.
Doing so inflates the Type I error rate.
Assume we perform hypothesis tests at significance level $\alpha = 0.05$.
For one test, the probability of a Type I error is simply $\alpha = 0.05$.
Suppose we perform $m$ independent comparisons.
Probability of no Type I errors:
$(1 - \alpha)^m$
Probability of at least one Type I error (Family-Wise Error Rate):
$\boxed{\text{FWER} = 1 - (1 - \alpha)^m}$
This probability increases rapidly as $m$ grows.
Let $\alpha = 0.05$ and $m = 2$.
Probability of no Type I error:
$(1 - 0.05)^2 = 0.9025$
Probability of at least one Type I error:
$1 - 0.9025 = 0.0975$
So the Type I error rate is almost doubled.
For $k = 5$ groups:
$m = \binom{5}{2} = 10$
Probability of at least one Type I error:
$1 - (1 - 0.05)^{10} \approx 0.401$
➡️ 40% chance of falsely detecting a difference!
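Both figures follow directly from the FWER formula; a one-line helper reproduces them:

```python
def fwer(m: int, alpha: float = 0.05) -> float:
    """Family-wise error rate for m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m

print(fwer(2))   # ≈ 0.0975: two comparisons nearly double the nominal 0.05 rate
print(fwer(10))  # ≈ 0.401: all 10 pairs among k = 5 groups
```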
Even if all group means are truly equal, using multiple two-sample t-tests makes it likely that at least one comparison rejects purely by chance.
ANOVA avoids this by performing a single global test:
$H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$
Only after the ANOVA rejects $H_0$ do we proceed to post hoc tests that explicitly control the family-wise error rate.
Performing multiple two-sample t-tests inflates the Type I error rate, with
$\text{FWER} = 1 - (1 - \alpha)^m$,
which is why ANOVA followed by post hoc tests must be used instead.
| Method | Controls FWER | Assumptions | Notes |
|---|---|---|---|
| Bonferroni | Yes | Minimal | Conservative |
| Holm–Bonferroni | Yes | Minimal | Less conservative |
| Tukey HSD | Yes | Equal variances | Most common after ANOVA |
| Scheffé | Yes | Very general | Very conservative |
| Fisher LSD | No | Equal variances | Only valid if ANOVA significant |
Bonferroni is based on a simple inequality:
$\mathbb{P}\left(\bigcup_{i=1}^m A_i\right) \le \sum_{i=1}^m \mathbb{P}(A_i)$
To ensure:
$\text{FWER} \le \alpha$
we test each hypothesis at level:
$\boxed{\alpha_{\text{Bonf}} = \frac{\alpha}{m}}$
Let $m = \binom{k}{2}$ pairwise comparisons.
For each pair $(i,j)$:
$H_0^{(ij)}:\ \mu_i = \mu_j$
vs
$H_1^{(ij)}:\ \mu_i \neq \mu_j$
Each pair is typically compared with a two-sample $t$-test.
Using the pooled within-group variance estimate from ANOVA (Mean Square Error):
$\text{MSE} = \text{MSW}$
the Bonferroni test statistic is
$$ t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}} $$
with degrees of freedom $$ df = N - k. $$
If $m = \binom{k}{2}$ pairwise comparisons are performed, the Bonferroni-adjusted significance level is
$\alpha_{\text{Bonf}} = \frac{\alpha}{m}$
Reject $H_0^{(ij)}$ if either of the following equivalent conditions holds:
$|t_{ij}| > t_{1-\alpha_{\text{Bonf}}/2,\,df}$
or
$p_{ij} < \alpha_{\text{Bonf}}$
Alternatively, define the adjusted p-value
$p^{\text{Bonf}}_{ij} = \min(m \cdot p_{ij},\ 1)$
Reject $H_0^{(ij)}$ if
$p^{\text{Bonf}}_{ij} < \alpha$
If the Bonferroni-adjusted test rejects $H_0^{(ij)}$, we conclude that the mean responses of groups $i$ and $j$ differ, while maintaining family-wise error rate control at level $\alpha$.
✔ Very simple
✔ Works with any test statistic
✔ No distributional assumptions beyond the base test
✔ Valid for unbalanced designs
❌ Conservative, especially when $m$ is large
❌ Reduced power (more Type II errors)
Bonferroni is appropriate when the number of comparisons is small, when the comparisons are not all pairwise, or when the base tests make minimal assumptions.
| Aspect | Bonferroni | Tukey |
|---|---|---|
| Power | Lower | Higher |
| FWER control | Guaranteed | Guaranteed |
| Assumes equal variances | No | Yes |
| Uses ANOVA MSE | Optional | Yes |
| Typical use | General | Standard ANOVA |
Key sentence:
Bonferroni correction controls the family-wise error rate by testing each comparison at level $\alpha/m$.
Recall that one-way ANOVA tests the global hypothesis
$H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$
If ANOVA rejects $H_0$, we conclude that at least one mean differs, but we still do not know which pairs of means differ.
👉 Tukey’s HSD is a post hoc multiple comparison procedure designed specifically for all pairwise comparisons after ANOVA.
For every pair of groups $(i,j)$, Tukey’s HSD tests
$H_0^{(ij)}:\ \mu_i = \mu_j$
vs
$H_1^{(ij)}:\ \mu_i \neq \mu_j$
while controlling the family-wise error rate (FWER) at level $\alpha$.
Tukey’s HSD uses the studentized range distribution, which accounts for the fact that we are simultaneously comparing the largest and smallest of $k$ sample means, not just one pre-chosen pair.
Instead of adjusting $\alpha$ (like Bonferroni), Tukey adjusts the critical value.
Tukey’s HSD relies on the same assumptions as one-way ANOVA:
If variances are unequal, Tukey’s HSD may not be valid.
Let $\text{MSE} = \text{MSW}$ be the pooled variance estimate from the ANOVA table, with $df = N - k$ degrees of freedom, and let $n_i$ denote the group sample sizes.
For groups with equal sample sizes $n$:
$\text{SE} = \sqrt{\frac{MSE}{n}}$
For unequal sample sizes (Tukey–Kramer):
$\text{SE}_{ij} = \sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$
The Tukey test compares the absolute mean difference to a critical threshold:
$|\bar x_i - \bar x_j|$
Reject $H_0^{(ij)}$ if:
$|\bar x_i - \bar x_j| > q_{1-\alpha,\,k,\,df}\cdot \text{SE}_{ij}$
where $q_{1-\alpha,\,k,\,df}$ is the $1-\alpha$ quantile of the studentized range distribution with $k$ groups and $df = N - k$ degrees of freedom.
The studentized range statistic is:
$q = \frac{\max(\bar X_1,\dots,\bar X_k) - \min(\bar X_1,\dots,\bar X_k)}{S}$
where $S = \sqrt{\text{MSE}/n}$ estimates the standard deviation of a group mean.
This distribution explicitly accounts for multiple comparisons among means.
Tukey’s HSD guarantees:
$\mathbb{P}(\text{at least one Type I error}) \le \alpha$
for all pairwise mean comparisons.
For equal group sizes this control is exact; with unequal sizes, the Tukey–Kramer version is slightly conservative (FWER at most $\alpha$).
For each pair $(i,j)$, Tukey’s method produces simultaneous confidence intervals:
$(\bar x_i - \bar x_j) \ \pm\ q_{1-\alpha,k,df}\cdot \text{SE}_{ij}$
All intervals jointly have coverage probability at least $1-\alpha$.
If an interval does not contain 0, the corresponding means differ significantly.
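Continuing the faculty-age example from earlier, a hand-computation sketch of Tukey's HSD. The critical value $q_{0.95}(3, 18) \approx 3.61$ is read from a studentized-range table and is an assumption here, as is MSW ≈ 47.89 computed from those data:

```python
import math

# Tukey's HSD for the faculty-age data (k = 3 groups, n = 7 each).
means = {"assistant": 37.0, "associate": 52.0, "full": 54.0}
msw, n, q_crit = 47.889, 7, 3.61   # MSW from the ANOVA; q is an assumed table value

se = math.sqrt(msw / n)            # standard error of a single group mean
hsd = q_crit * se                  # "honest significant difference" threshold

pairs = [("assistant", "associate"), ("assistant", "full"), ("associate", "full")]
significant = {(a, b) for a, b in pairs if abs(means[a] - means[b]) > hsd}
print(round(hsd, 2), significant)
# |37 - 52| = 15 and |37 - 54| = 17 exceed the threshold; |52 - 54| = 2 does not
```

So assistant professors differ from both other ranks, while associate and full professors are statistically indistinguishable at this level.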
| Aspect | Tukey HSD | Bonferroni |
|---|---|---|
| Designed for pairwise means | Yes | No (general) |
| Uses ANOVA MSE | Yes | Optional |
| Equal variance assumption | Yes | No |
| Power | Higher | Lower |
| FWER control | Exact | Upper bound |
| Conservativeness | Moderate | Often very conservative |
Use Tukey’s HSD when:
✔ ANOVA is significant
✔ You want all pairwise comparisons
✔ Variances are approximately equal
✔ You want higher power than Bonferroni
Avoid Tukey’s HSD when variances differ substantially.
If Tukey’s HSD finds, for example, that the pairs (A, B) and (A, C) differ significantly while (B, C) does not, then we conclude:
Group A differs from both B and C, while B and C are statistically indistinguishable.
❌ Tukey’s HSD can be used without ANOVA
✔ It can, but it is intended as a post hoc method
❌ Tukey tests variances
✔ Tukey compares means, not variances
❌ Tukey is always better than Bonferroni
✔ Only when assumptions hold
Key sentence:
Tukey’s HSD controls the family-wise error rate by using the studentized range distribution to compare all pairwise mean differences simultaneously.
Three fuel injection systems are tested for efficiency, and the following coded data are obtained:
| System 1 | System 2 | System 3 |
|---|---|---|
| 48 | 60 | 57 |
| 56 | 56 | 55 |
| 46 | 53 | 52 |
| 45 | 60 | 50 |
| 50 | 51 | 51 |
Do the data support the hypothesis that the three fuel injection systems offer equivalent levels of efficiency?
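A computational sketch of the test. The exercise does not fix a significance level, so $\alpha = 0.05$ is assumed here, and the critical value $F_{0.95}(2, 12) \approx 3.89$ is taken from a table:

```python
# One-way ANOVA for the fuel-injection data (k = 3 systems, n = 5 each).
groups = [
    [48, 56, 46, 45, 50],  # System 1
    [60, 56, 53, 60, 51],  # System 2
    [57, 55, 52, 50, 51],  # System 3
]

k, N = len(groups), sum(len(g) for g in groups)
grand_mean = sum(x for g in groups for x in g) / N
means = [sum(g) / len(g) for g in groups]          # 49.0, 56.0, 53.0

msb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)) / (k - 1)
msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (N - k)
f_obs = msb / msw

F_CRIT = 3.89              # tabulated F_{0.95}(2, 12); an assumed table value
print(f_obs, f_obs > F_CRIT)
# F_obs ≈ 4.20 > 3.89: reject H0 at alpha = 0.05, the systems are not equivalent
```

At $\alpha = 0.05$ the data do not support equivalent efficiency; note that at $\alpha = 0.01$ the conclusion could differ, since $F_{\text{obs}}$ is only moderately above 1.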