In many applications we want to compare more than two population means.
A naive approach would be to perform many pairwise $t$-tests.
This is incorrect, because it inflates the Type I error rate.
ANOVA provides a single global test for comparing multiple means.
Are all group means equal, or does at least one group differ?
ANOVA tests equality of means, not variances (despite the name).
Suppose we have $k$ groups.
Group $i$ has observations: $$ X_{i1}, X_{i2}, \dots, X_{in_i}, \quad i = 1,\dots,k. $$
Total sample size: $$ N = \sum_{i=1}^{k} n_i. $$
The one-way ANOVA model is: $$ X_{ij} = \mu_i + \varepsilon_{ij}, $$ where $\mu_i$ is the (unknown) mean of group $i$ and $\varepsilon_{ij}$ is a random error term.
Assumptions on errors: $$ \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2), $$ independently for all $i,j$.
Equivalently: $$ X_{ij} \sim \mathcal{N}(\mu_i, \sigma^2). $$
Important: all $k$ groups are assumed to share the same error variance $\sigma^2$; only the means $\mu_i$ may differ.
ANOVA is based on variance decomposition.
Total variability in the data can be split into two components: variability *between* the groups and variability *within* the groups.
If group means are truly equal, between-group variability should be small relative to within-group variability.
Group means: $$ \bar{X}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij} $$
Grand mean: $$ \bar{X} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} X_{ij} $$
Measures total variability: $$ \text{SST} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(X_{ij} - \bar{X})^2 $$
Measures variability due to differences between group means: $$ \text{SSB} = \sum_{i=1}^{k} n_i(\bar{X}_i - \bar{X})^2 $$
Measures variability within groups: $$ \text{SSW} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(X_{ij} - \bar{X}_i)^2 $$
This decomposition is exact (not approximate): $$ \text{SST} = \text{SSB} + \text{SSW}. $$
The degrees of freedom decompose the same way: $$ (N - 1) = (k - 1) + (N - k) $$
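The exactness of the decomposition is easy to check numerically. A minimal pure-Python sketch (the group values below are made up purely for illustration):

```python
# Verify SST = SSB + SSW on small illustrative data (values are made up).
groups = [[3.0, 5.0, 4.0], [6.0, 8.0, 7.0, 7.0], [2.0, 3.0, 4.0]]

N = sum(len(g) for g in groups)                      # total sample size
grand_mean = sum(x for g in groups for x in g) / N   # grand mean over all data
group_means = [sum(g) / len(g) for g in groups]

sst = sum((x - grand_mean) ** 2 for g in groups for x in g)
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

print(sst, ssb + ssw)  # the two numbers agree exactly (up to float rounding)
```

Note that the identity holds for any data set, not just under the model assumptions; it is an algebraic fact.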
To compare variances, sums of squares are normalized by their degrees of freedom: $$ \text{MSB} = \frac{\text{SSB}}{k-1}, \qquad \text{MSW} = \frac{\text{SSW}}{N-k}. $$
Interpretation: MSB measures how far the group means spread around the grand mean, while MSW estimates the common error variance $\sigma^2$.
The ANOVA test statistic is: $$ F_{\text{obs}} = \frac{\text{MSB}}{\text{MSW}} $$
Under $H_0$: $$ F_{\text{obs}} \sim F(k-1, N-k) $$
Key theoretical result: under $H_0$, $\text{SSB}/\sigma^2 \sim \chi^2(k-1)$ and $\text{SSW}/\sigma^2 \sim \chi^2(N-k)$, and the two quantities are independent.
Therefore: $$ \frac{\text{SSB}/(k-1)}{\text{SSW}/(N-k)} \sim F(k-1, N-k) $$
Given significance level $\alpha$:
Reject $H_0$ if: $$ F_{\text{obs}} > F_{1-\alpha}(k-1, N-k) $$
Equivalently, reject if: $$ \text{p-value} < \alpha $$
| Source | Sum of Squares | df | Mean Square | F |
|---|---|---|---|---|
| Between groups | SSB | k − 1 | MSB = SSB/(k − 1) | MSB/MSW |
| Within groups | SSW | N − k | MSW = SSW/(N − k) | |
| Total | SST | N − 1 | | |

Interpreting the size of $F_{\text{obs}}$:

| $F_{\text{obs}}$ | Interpretation |
|---|---|
| $\approx 1$ | Probably equal |
| $\gg 1$ | Almost surely different |
| Somewhat larger than 1 | Ambiguous |
A researcher claims that there is a difference in the average age of assistant professors, associate professors, and full professors at her university.
Faculty members are selected randomly, and their ages are recorded.
Assume that faculty ages are normally distributed.
Test the researcher’s claim at the $\alpha = 0.01$ significance level.
The observed data are:
| Rank | Ages |
|---|---|
| Assistant Professor | 28, 32, 36, 42, 50, 33, 38 |
| Associate Professor | 44, 61, 52, 54, 62, 45, 46 |
| Professor | 54, 56, 55, 65, 52, 50, 46 |
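A plain-Python sketch of the computation for this example. The critical value $F_{0.99}(2, 18) \approx 6.01$ is read from an $F$-table; treat it as an assumed table value rather than something computed here:

```python
# One-way ANOVA for the faculty-age data at alpha = 0.01.
groups = [
    [28, 32, 36, 42, 50, 33, 38],  # assistant professors
    [44, 61, 52, 54, 62, 45, 46],  # associate professors
    [54, 56, 55, 65, 52, 50, 46],  # full professors
]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = sum(x for g in groups for x in g) / N
means = [sum(g) / len(g) for g in groups]          # 37.0, 52.0, 54.0

ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
msb = ssb / (k - 1)        # between-group mean square, df = k - 1 = 2
msw = ssw / (N - k)        # within-group mean square,  df = N - k = 18
f_obs = msb / msw

F_CRIT = 6.01              # tabulated F_{0.99}(2, 18); assumed, not computed
reject = f_obs > F_CRIT
print(f_obs, reject)       # F_obs ≈ 12.62 > 6.01, so reject H0 at alpha = 0.01
```

Since $F_{\text{obs}} \approx 12.62$ far exceeds the critical value, the data support the researcher's claim that mean ages differ across ranks.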
In one-way ANOVA we test the global null hypothesis
$H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$
If ANOVA rejects $H_0$, we only know that at least one mean differs, but:
❌ ANOVA does not tell us which groups differ.
To identify where the differences lie, we perform post hoc multiple comparison tests.
Suppose we have $k$ groups.
The number of pairwise comparisons is $$ m = \binom{k}{2} = \frac{k(k-1)}{2}. $$
If we test each comparison at level $\alpha = 0.05$, then the probability of making at least one Type I error increases rapidly.
One should never use multiple two-sample t-tests when comparing more than two groups.
Doing so inflates the Type I error rate.
Assume we perform hypothesis tests at significance level $\alpha = 0.05$.
For one test, the probability of a Type I error is simply $\alpha = 0.05$.
Suppose we perform $m$ independent comparisons.
Probability of no Type I errors:
$(1 - \alpha)^m$
Probability of at least one Type I error (Family-Wise Error Rate):
$\boxed{\text{FWER} = 1 - (1 - \alpha)^m}$
This probability increases rapidly as $m$ grows.
Let $\alpha = 0.05$ and $m = 2$.
Probability of no Type I error:
$(1 - 0.05)^2 = 0.9025$
Probability of at least one Type I error:
$1 - 0.9025 = 0.0975$
So the Type I error rate is almost doubled.
For $k = 5$ groups:
$m = \binom{5}{2} = 10$
Probability of at least one Type I error:
$1 - (1 - 0.05)^{10} \approx 0.401$
➡️ 40% chance of falsely detecting a difference!
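Both figures follow directly from the FWER formula; a one-line helper reproduces them:

```python
def fwer(m: int, alpha: float = 0.05) -> float:
    """Family-wise error rate for m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m

print(fwer(2))   # ≈ 0.0975: two comparisons nearly double the nominal 0.05 rate
print(fwer(10))  # ≈ 0.401: all 10 pairs among k = 5 groups
```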
Even if all group means are truly equal, using multiple two-sample t-tests makes it likely that at least one comparison rejects purely by chance.
ANOVA avoids this by performing a single global test:
$H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$
Only after the ANOVA rejects $H_0$ do we proceed to post hoc tests that explicitly control the family-wise error rate.
Performing multiple two-sample t-tests inflates the Type I error rate, with
$\text{FWER} = 1 - (1 - \alpha)^m$,
which is why ANOVA followed by post hoc tests must be used instead.
| Method | Controls FWER | Assumptions | Notes |
|---|---|---|---|
| Bonferroni | Yes | Minimal | Conservative |
| Holm–Bonferroni | Yes | Minimal | Less conservative |
| Tukey HSD | Yes | Equal variances | Most common after ANOVA |
| Scheffé | Yes | Very general | Very conservative |
| Fisher LSD | No | Equal variances | Only valid if ANOVA significant |
Bonferroni is based on a simple inequality:
$\mathbb{P}\left(\bigcup_{i=1}^m A_i\right) \le \sum_{i=1}^m \mathbb{P}(A_i)$
To ensure:
$\text{FWER} \le \alpha$
we test each hypothesis at level:
$\boxed{\alpha_{\text{Bonf}} = \frac{\alpha}{m}}$
Let $m = \binom{k}{2}$ pairwise comparisons.
For each pair $(i,j)$:
$H_0^{(ij)}:\ \mu_i = \mu_j$
vs
$H_1^{(ij)}:\ \mu_i \neq \mu_j$
Each pair is typically compared with a two-sample $t$-test.
Using the pooled within-group variance estimate from ANOVA (Mean Square Error):
$\text{MSE} = \text{MSW}$
the Bonferroni test statistic is
$$ t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}} $$
with degrees of freedom $$ df = N - k. $$
If $m = \binom{k}{2}$ pairwise comparisons are performed, the Bonferroni-adjusted significance level is
$\alpha_{\text{Bonf}} = \frac{\alpha}{m}$
Reject $H_0^{(ij)}$ if either of the following equivalent conditions holds:
$|t_{ij}| > t_{1-\alpha_{\text{Bonf}}/2,\,df}$
or
$p_{ij} < \alpha_{\text{Bonf}}$
Alternatively, define the adjusted p-value
$p^{\text{Bonf}}_{ij} = \min(m \cdot p_{ij},\ 1)$
Reject $H_0^{(ij)}$ if
$p^{\text{Bonf}}_{ij} < \alpha$
If the Bonferroni-adjusted test rejects $H_0^{(ij)}$, we conclude that the mean responses of groups $i$ and $j$ differ, while maintaining family-wise error rate control at level $\alpha$.
✔ Very simple
✔ Works with any test statistic
✔ No distributional assumptions beyond the base test
✔ Valid for unbalanced designs
❌ Conservative, especially when $m$ is large
❌ Reduced power (more Type II errors)
Bonferroni is appropriate when the number of comparisons is small, when the comparisons are not all pairwise, or when the base tests make minimal assumptions.
| Aspect | Bonferroni | Tukey |
|---|---|---|
| Power | Lower | Higher |
| FWER control | Guaranteed | Guaranteed |
| Assumes equal variances | No | Yes |
| Uses ANOVA MSE | Optional | Yes |
| Typical use | General | Standard ANOVA |
Key sentence:
Bonferroni correction controls the family-wise error rate by testing each comparison at level $\alpha/m$.
Recall that one-way ANOVA tests the global hypothesis
$H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$
If ANOVA rejects $H_0$, we conclude that at least one mean differs, but we still do not know which pairs of means differ.
👉 Tukey’s HSD is a post hoc multiple comparison procedure designed specifically for all pairwise comparisons after ANOVA.
For every pair of groups $(i,j)$, Tukey’s HSD tests
$H_0^{(ij)}:\ \mu_i = \mu_j$
vs
$H_1^{(ij)}:\ \mu_i \neq \mu_j$
while controlling the family-wise error rate (FWER) at level $\alpha$.
Tukey’s HSD uses the studentized range distribution, which accounts for the fact that we are simultaneously comparing the largest and smallest of $k$ sample means, not just one pre-chosen pair.
Instead of adjusting $\alpha$ (like Bonferroni), Tukey adjusts the critical value.
Tukey’s HSD relies on the same assumptions as one-way ANOVA:
If variances are unequal, Tukey’s HSD may not be valid.
Let $\text{MSE} = \text{MSW}$ be the pooled variance estimate from the ANOVA table, with $df = N - k$ degrees of freedom, and let $n_i$ denote the group sample sizes.
For groups with equal sample sizes $n$:
$\text{SE} = \sqrt{\frac{MSE}{n}}$
For unequal sample sizes (Tukey–Kramer):
$\text{SE}_{ij} = \sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$
The Tukey test compares the absolute mean difference to a critical threshold:
$|\bar x_i - \bar x_j|$
Reject $H_0^{(ij)}$ if:
$|\bar x_i - \bar x_j| > q_{1-\alpha,\,k,\,df}\cdot \text{SE}_{ij}$
where $q_{1-\alpha,\,k,\,df}$ is the $1-\alpha$ quantile of the studentized range distribution with $k$ groups and $df = N - k$ degrees of freedom.
The studentized range statistic is:
$q = \frac{\max(\bar X_1,\dots,\bar X_k) - \min(\bar X_1,\dots,\bar X_k)}{S}$
where $S = \sqrt{\text{MSE}/n}$ estimates the standard deviation of a group mean.
This distribution explicitly accounts for multiple comparisons among means.
Tukey’s HSD guarantees:
$\mathbb{P}(\text{at least one Type I error}) \le \alpha$
for all pairwise mean comparisons.
For equal group sizes this control is exact; with unequal sizes, the Tukey–Kramer version is slightly conservative (FWER at most $\alpha$).
For each pair $(i,j)$, Tukey’s method produces simultaneous confidence intervals:
$(\bar x_i - \bar x_j) \ \pm\ q_{1-\alpha,k,df}\cdot \text{SE}_{ij}$
All intervals jointly have coverage probability at least $1-\alpha$.
If an interval does not contain 0, the corresponding means differ significantly.
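Continuing the faculty-age example from earlier, a hand-computation sketch of Tukey's HSD. The critical value $q_{0.95}(3, 18) \approx 3.61$ is read from a studentized-range table and is an assumption here, as is MSW ≈ 47.89 computed from those data:

```python
import math

# Tukey's HSD for the faculty-age data (k = 3 groups, n = 7 each).
means = {"assistant": 37.0, "associate": 52.0, "full": 54.0}
msw, n, q_crit = 47.889, 7, 3.61   # MSW from the ANOVA; q is an assumed table value

se = math.sqrt(msw / n)            # standard error of a single group mean
hsd = q_crit * se                  # "honest significant difference" threshold

pairs = [("assistant", "associate"), ("assistant", "full"), ("associate", "full")]
significant = {(a, b) for a, b in pairs if abs(means[a] - means[b]) > hsd}
print(round(hsd, 2), significant)
# |37 - 52| = 15 and |37 - 54| = 17 exceed the threshold; |52 - 54| = 2 does not
```

So assistant professors differ from both other ranks, while associate and full professors are statistically indistinguishable at this level.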
| Aspect | Tukey HSD | Bonferroni |
|---|---|---|
| Designed for pairwise means | Yes | No (general) |
| Uses ANOVA MSE | Yes | Optional |
| Equal variance assumption | Yes | No |
| Power | Higher | Lower |
| FWER control | Exact | Upper bound |
| Conservativeness | Moderate | Often very conservative |
Use Tukey’s HSD when:
✔ ANOVA is significant
✔ You want all pairwise comparisons
✔ Variances are approximately equal
✔ You want higher power than Bonferroni
Avoid Tukey’s HSD when variances differ substantially.
If Tukey’s HSD finds, for example, that the pairs (A, B) and (A, C) differ significantly while (B, C) does not, then we conclude:
Group A differs from both B and C, while B and C are statistically indistinguishable.
❌ Tukey’s HSD can be used without ANOVA
✔ It can, but it is intended as a post hoc method
❌ Tukey tests variances
✔ Tukey compares means, not variances
❌ Tukey is always better than Bonferroni
✔ Only when assumptions hold
Key sentence:
Tukey’s HSD controls the family-wise error rate by using the studentized range distribution to compare all pairwise mean differences simultaneously.
Three fuel injection systems are tested for efficiency, and the following coded data are obtained:
| System 1 | System 2 | System 3 |
|---|---|---|
| 48 | 60 | 57 |
| 56 | 56 | 55 |
| 46 | 53 | 52 |
| 45 | 60 | 50 |
| 50 | 51 | 51 |
Do the data support the hypothesis that the three fuel injection systems offer equivalent levels of efficiency?
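A computational sketch of the test. The exercise does not fix a significance level, so $\alpha = 0.05$ is assumed here, and the critical value $F_{0.95}(2, 12) \approx 3.89$ is taken from a table:

```python
# One-way ANOVA for the fuel-injection data (k = 3 systems, n = 5 each).
groups = [
    [48, 56, 46, 45, 50],  # System 1
    [60, 56, 53, 60, 51],  # System 2
    [57, 55, 52, 50, 51],  # System 3
]

k, N = len(groups), sum(len(g) for g in groups)
grand_mean = sum(x for g in groups for x in g) / N
means = [sum(g) / len(g) for g in groups]          # 49.0, 56.0, 53.0

msb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)) / (k - 1)
msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (N - k)
f_obs = msb / msw

F_CRIT = 3.89              # tabulated F_{0.95}(2, 12); an assumed table value
print(f_obs, f_obs > F_CRIT)
# F_obs ≈ 4.20 > 3.89: reject H0 at alpha = 0.05, the systems are not equivalent
```

At $\alpha = 0.05$ the data do not support equivalent efficiency; note that at $\alpha = 0.01$ the conclusion could differ, since $F_{\text{obs}}$ is only moderately above 1.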