Seminar 5

Analysis of Variance (ANOVA)

One-Way ANOVA: Theory, Assumptions, and Interpretation¶


1. Motivation¶

In many applications we want to compare more than two population means.

Examples:

  • mean income across several regions
  • average exam score across multiple teaching methods
  • mean response time for different algorithms
  • effect of different treatments in an experiment

A naive approach would be to perform many pairwise $t$-tests.
This is incorrect, because it inflates the Type I error rate.

ANOVA provides a single global test for comparing multiple means.


2. Statistical Question¶

Are all group means equal, or does at least one group differ?

ANOVA tests equality of means, not variances (despite the name).


3. One-Way ANOVA Model¶

3.1 Data structure¶

Suppose we have $k$ groups.

Group $i$ has observations: $$ X_{i1}, X_{i2}, \dots, X_{in_i}, \quad i = 1,\dots,k. $$

Total sample size: $$ N = \sum_{i=1}^{k} n_i. $$


3.2 Model assumption¶

The one-way ANOVA model is: $$ X_{ij} = \mu_i + \varepsilon_{ij}, $$ where:

  • $\mu_i$ = mean of group $i$
  • $\varepsilon_{ij}$ = random error

Assumptions on errors: $$ \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2), $$ independently for all $i,j$.

Equivalently: $$ X_{ij} \sim \mathcal{N}(\mu_i, \sigma^2). $$
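The model is easy to make concrete by simulating from it. A minimal sketch, where the group means, $\sigma$, and group size are illustrative values chosen here (not from the seminar):

```python
import numpy as np

# Simulate X_ij = mu_i + eps_ij with eps_ij ~ N(0, sigma^2), independent.
# mu, sigma, and n below are hypothetical illustration values.
rng = np.random.default_rng(42)
mu = [50.0, 55.0, 60.0]    # group means mu_i for k = 3 groups
sigma = 4.0                # common error standard deviation
n = 7                      # observations per group

groups = [mu_i + sigma * rng.standard_normal(n) for mu_i in mu]

for mu_i, g in zip(mu, groups):
    print(f"true mean {mu_i:5.1f}   sample mean {g.mean():6.2f}")
```

Each sample mean should land within a couple of standard errors ($\sigma/\sqrt{n}\approx 1.5$) of its true mean.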


4. Hypotheses¶

Null hypothesis¶

$$ H_0: \mu_1 = \mu_2 = \dots = \mu_k $$

Alternative hypothesis¶

$$ H_1: \text{At least one } \mu_i \text{ differs} $$

Important:

  • ANOVA does not tell which means differ
  • it only tests whether any difference exists

5. Key Idea Behind ANOVA¶

ANOVA is based on variance decomposition.

Total variability in the data can be split into:

  1. variability between groups
  2. variability within groups

If group means are truly equal, between-group variability should be small relative to within-group variability.


6. Sample Means and Grand Mean¶

Group means: $$ \bar{X}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij} $$

Grand mean: $$ \bar{X} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} X_{ij} $$


7. Decomposition of Sums of Squares (Core Theory)¶

7.1 Total Sum of Squares (SST)¶

Measures total variability: $$ \text{SST} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(X_{ij} - \bar{X})^2 $$


7.2 Between-Group Sum of Squares (SSB)¶

Measures variability due to differences between group means: $$ \text{SSB} = \sum_{i=1}^{k} n_i(\bar{X}_i - \bar{X})^2 $$


7.3 Within-Group Sum of Squares (SSW)¶

Measures variability within groups: $$ \text{SSW} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(X_{ij} - \bar{X}_i)^2 $$


7.4 Fundamental identity¶

$$ \text{SST} = \text{SSB} + \text{SSW} $$

This decomposition is exact (not approximate).
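The identity can be verified numerically. A quick check on a small made-up dataset:

```python
# Quick numerical check of SST = SSB + SSW on a small made-up dataset.
groups = [[3.0, 5.0, 4.0], [6.0, 8.0, 7.0], [9.0, 11.0, 10.0]]

N = sum(len(g) for g in groups)
grand = sum(x for g in groups for x in g) / N          # grand mean = 7.0
means = [sum(g) / len(g) for g in groups]              # group means 4, 7, 10

sst = sum((x - grand) ** 2 for g in groups for x in g)
ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

print(sst, ssb, ssw)             # 60.0 54.0 6.0
print(sst - (ssb + ssw))         # 0.0 (exact, up to floating point)
```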


8. Degrees of Freedom¶

Total¶

$$ \text{df}_{\text{total}} = N - 1 $$

Between groups¶

$$ \text{df}_{\text{between}} = k - 1 $$

Within groups¶

$$ \text{df}_{\text{within}} = N - k $$

And: $$ (N - 1) = (k - 1) + (N - k) $$


9. Mean Squares¶

To compare variances, sums of squares are normalized by degrees of freedom.

Mean square between¶

$$ \text{MSB} = \frac{\text{SSB}}{k - 1} $$

Mean square within¶

$$ \text{MSW} = \frac{\text{SSW}}{N - k} $$

Interpretation:

  • MSW is an unbiased estimator of the common variance $\sigma^2$, whether or not $H_0$ holds
  • MSB estimates $\sigma^2$ plus an extra term driven by differences among the group means; only under $H_0$ does it too estimate $\sigma^2$

10. The F Statistic¶

The ANOVA test statistic is: $$ F_{\text{obs}} = \frac{\text{MSB}}{\text{MSW}} $$

Under $H_0$: $$ F_{\text{obs}} \sim F(k-1, N-k) $$


11. Why the F Distribution Appears¶

Key theoretical results (under $H_0$):

  • $\text{SSB}/\sigma^2 \sim \chi^2(k-1)$
  • $\text{SSW}/\sigma^2 \sim \chi^2(N-k)$ (this one holds whether or not $H_0$ is true)
  • SSB and SSW are independent

Therefore: $$ \frac{(\text{SSB}/(k-1))}{(\text{SSW}/(N-k))} \sim F(k-1, N-k) $$
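This can be checked by simulation: generate data with all group means equal and compare the empirical rejection rate of the F-test with the nominal level. A sketch (group sizes and seed are arbitrary):

```python
import numpy as np
from scipy.stats import f

# Monte Carlo check: under H0 the F statistic follows F(k-1, N-k).
rng = np.random.default_rng(0)
k, n = 3, 7
N = k * n

def f_stat(data):
    means = data.mean(axis=1)
    grand = data.mean()
    ssb = n * ((means - grand) ** 2).sum()
    ssw = ((data - means[:, None]) ** 2).sum()
    return (ssb / (k - 1)) / (ssw / (N - k))

# All groups drawn from the same N(0, 1): H0 is true by construction.
sims = np.array([f_stat(rng.standard_normal((k, n))) for _ in range(20000)])

f_crit = f.ppf(0.95, k - 1, N - k)
print(f"empirical P(F > F_crit) = {(sims > f_crit).mean():.3f}")  # close to 0.05
```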


12. Decision Rule¶

Given significance level $\alpha$:

  • Reject $H_0$ if: $$ F_{\text{obs}} > F_{1-\alpha}(k-1, N-k) $$

  • Equivalently, reject if: $$ \text{p-value} < \alpha $$
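Both routes can be evaluated directly with `scipy.stats.f`; by construction they always agree. The numbers below are hypothetical:

```python
from scipy.stats import f

# Decision rule in code: critical-value route vs p-value route.
alpha, k, N = 0.05, 3, 21
F_obs = 4.5                            # hypothetical observed statistic

F_crit = f.ppf(1 - alpha, k - 1, N - k)
p_value = f.sf(F_obs, k - 1, N - k)    # sf = 1 - cdf (numerically safer tail)

print(f"F_crit = {F_crit:.4f}, p = {p_value:.4f}")
print("reject:", F_obs > F_crit, p_value < alpha)   # the two flags always match
```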


13. ANOVA Table¶

ANOVA Table (One-Way ANOVA)

| Source | Sum of Squares | df | Mean Square | F |
|---|---|---|---|---|
| Between groups | SSB | $k - 1$ | MSB = SSB/(k − 1) | MSB/MSW |
| Within groups | SSW | $N - k$ | MSW = SSW/(N − k) | |
| Total | SST | $N - 1$ | | |

14. Assumptions of One-Way ANOVA¶

  1. Independence of observations
  2. Normality within each group
  3. Homogeneity of variances: $$ \sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2 $$

ANOVA is:

  • fairly robust to mild non-normality
  • not robust to strong variance heterogeneity (especially with unbalanced $n_i$)

15. What ANOVA Does and Does Not Do¶

ANOVA tests:¶

  • existence of any difference among means

ANOVA does not:¶

  • identify which groups differ
  • quantify effect size (by default)
  • establish causality

16. Relationship to t-Test¶

Special case:

  • One-way ANOVA with $k=2$ groups

Then: $$ F = t^2 $$

ANOVA generalizes the two-sample $t$-test.
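The identity $F = t^2$ is easy to confirm with `scipy.stats` on any two-group dataset (the data below are made up):

```python
from scipy.stats import f_oneway, ttest_ind

# With k = 2 groups, the one-way ANOVA F equals the squared pooled t statistic.
a = [5.1, 4.9, 6.0, 5.5, 5.8]
b = [6.2, 6.8, 6.0, 7.1, 6.5]

F, p_f = f_oneway(a, b)
t, p_t = ttest_ind(a, b, equal_var=True)   # pooled two-sample t-test

print(f"F = {F:.6f}, t^2 = {t**2:.6f}")    # identical
print(f"p (ANOVA) = {p_f:.6f}, p (t-test) = {p_t:.6f}")   # also identical
```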


17. Practical Remarks¶

  • ANOVA answers "is there any effect?"
  • Always combine with:
    • diagnostic plots
    • effect sizes
    • post-hoc tests
  • For unequal variances, consider:
    • Welch ANOVA

18. Summary¶

  • ANOVA compares multiple means simultaneously
  • Based on variance decomposition
  • Test statistic: $$ F = \frac{\text{MSB}}{\text{MSW}} $$
  • Distribution: $$ F \sim F(k-1, N-k) $$
  • Requires independence, normality, equal variances

Graphical intuition for ANOVA¶

Before introducing the formal F-test, it is useful to develop a geometric and visual intuition.

In all examples below:

  • each column represents one group,
  • black squares are individual observations,
  • circles indicate group means.

The question ANOVA answers is not whether the means look different, but whether the between-group variability is large relative to the within-group variability.


[Figure: three example panels — "Probably equal", "Almost surely different", "Ambiguous" — each showing individual observations (squares) and group means (circles).]

Implementing One-way ANOVA¶

In [36]:
from scipy.stats import f


def one_way_anova(
    groups,
    alpha=0.05
):
    """
    One-way ANOVA.

    H0: All group means are equal
        μ1 = μ2 = ... = μk
    H1: At least one group mean differs

    Input
    -----
    groups : list of lists (or arrays)
        groups[i] contains observations from group i
        Need at least 2 groups, each with at least 2 observations

    alpha : significance level

    Test statistic:
        F = MS_between / MS_within

    Degrees of freedom:
        df_between = k - 1
        df_within  = N - k

    Uses BOTH:
      (1) p-value method
      (2) critical region method
    """

    # ---------- Validate input ----------
    if groups is None or len(groups) < 2:
        raise ValueError("At least two groups are required.")

    k = len(groups)
    n_i = []
    means = []

    for g in groups:
        if len(g) < 2:
            raise ValueError("Each group must have at least 2 observations.")
        n_i.append(len(g))
        means.append(sum(g) / len(g))

    N = sum(n_i)

    # ---------- Grand mean ----------
    grand_mean = sum(
        means[i] * n_i[i] for i in range(k)
    ) / N

    # ---------- Sum of Squares ----------
    # Between groups
    ss_between = sum(
        n_i[i] * (means[i] - grand_mean) ** 2
        for i in range(k)
    )

    # Within groups
    ss_within = 0.0
    for i in range(k):
        ss_within += sum(
            (x - means[i]) ** 2 for x in groups[i]
        )

    # Total (optional check)
    ss_total = ss_between + ss_within

    # ---------- Degrees of freedom ----------
    df_between = k - 1
    df_within = N - k

    if df_within <= 0:
        raise ValueError("Not enough data to compute within-group variance.")

    # ---------- Mean Squares ----------
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within

    # ---------- F statistic ----------
    F_obs = ms_between / ms_within

    # ---------- p-value method ----------
    p_value = 1 - f.cdf(F_obs, df_between, df_within)
    reject_by_pvalue = p_value < alpha

    # ---------- Critical region method ----------
    F_crit = f.ppf(1 - alpha, df_between, df_within)
    reject_by_critical = F_obs > F_crit
    critical_region = f"F > {F_crit:.4f}"

    # ---------- Return results ----------
    return {
        "inputs": {
            "k": k,
            "group_sizes": n_i,
            "alpha": alpha
        },
        "means": {
            "group_means": means,
            "grand_mean": grand_mean
        },
        "anova_table": {
            "SS_between": ss_between,
            "df_between": df_between,
            "MS_between": ms_between,
            "SS_within": ss_within,
            "df_within": df_within,
            "MS_within": ms_within,
            "SS_total": ss_total,
            "df_total": N - 1
        },
        "statistic": {
            "F_obs": F_obs
        },
        "p_value_method": {
            "p_value": p_value,
            "reject_H0": reject_by_pvalue
        },
        "critical_region_method": {
            "critical_region": critical_region,
            "F_crit": F_crit,
            "reject_H0": reject_by_critical
        }
    }

Problem: One-Way ANOVA (Faculty Ages by Rank)¶

A researcher claims that there is a difference in the average age of assistant professors, associate professors, and full professors at her university.

Faculty members are selected randomly, and their ages are recorded.
Assume that faculty ages are normally distributed.

Test the researcher’s claim at the $\alpha = 0.01$ significance level.

The observed data are:

| Rank | Ages |
|---|---|
| Assistant Professor | 28, 32, 36, 42, 50, 33, 38 |
| Associate Professor | 44, 61, 52, 54, 62, 45, 46 |
| Professor | 54, 56, 55, 65, 52, 50, 46 |
In [43]:
# Faculty ages by rank: Assistant, Associate, and Full Professors
group_A = [28, 32, 36, 42, 50, 33, 38]
group_B = [44, 61, 52, 54, 62, 45, 46]
group_C = [54, 56, 55, 65, 52, 50, 46]

groups = [group_A, group_B, group_C]

anova_res = one_way_anova(groups=groups, alpha=0.01)
anova_res
Out[43]:
{'inputs': {'k': 3, 'group_sizes': [7, 7, 7], 'alpha': 0.01},
 'means': {'group_means': [37.0, 52.0, 54.0],
  'grand_mean': 47.666666666666664},
 'anova_table': {'SS_between': 1208.6666666666667,
  'df_between': 2,
  'MS_between': 604.3333333333334,
  'SS_within': 862.0,
  'df_within': 18,
  'MS_within': 47.888888888888886,
  'SS_total': 2070.666666666667,
  'df_total': 20},
 'statistic': {'F_obs': 12.619489559164736},
 'p_value_method': {'p_value': 0.00037546863291471055, 'reject_H0': True},
 'critical_region_method': {'critical_region': 'F > 6.0129',
  'F_crit': 6.012904834800529,
  'reject_H0': True}}
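As a sanity check, the same data can be run through SciPy's reference implementation, `scipy.stats.f_oneway`; it should reproduce $F_{\text{obs}} \approx 12.6195$ and the p-value from the dictionary above:

```python
from scipy.stats import f_oneway

# Cross-check against SciPy's reference implementation (same data as above).
group_A = [28, 32, 36, 42, 50, 33, 38]
group_B = [44, 61, 52, 54, 62, 45, 46]
group_C = [54, 56, 55, 65, 52, 50, 46]

F_scipy, p_scipy = f_oneway(group_A, group_B, group_C)
print(f"scipy: F = {F_scipy:.6f}, p = {p_scipy:.6g}")
```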
In [44]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4.5))
plt.boxplot(groups, labels=["Group A", "Group B", "Group C"])
plt.title("One-way ANOVA data: boxplots by group")
plt.xlabel("Group")
plt.ylabel("Value")
plt.grid(True, alpha=0.3)
plt.show()
In [45]:
import numpy as np
import matplotlib.pyplot as plt

# We assume these already exist from previous cells:
# group_A, group_B, group_C

groups = {
    "A": group_A,
    "B": group_B,
    "C": group_C
}

rng = np.random.default_rng(0)   # reproducible jitter
jitter = 0.08                    # horizontal spread

plt.figure(figsize=(8, 4.5))

labels = list(groups.keys())
x_pos = np.arange(len(labels))

for i, label in enumerate(labels):
    y = np.array(groups[label], dtype=float)
    x = i + rng.uniform(-jitter, jitter, size=len(y))

    # individual observations
    plt.scatter(x, y, s=35, marker="s", alpha=0.9)

    # group mean
    plt.scatter(i, y.mean(), s=120, marker="o", zorder=3)

plt.xticks(x_pos, labels)
plt.ylabel("Response")
plt.title("Almost surely different")
plt.grid(True, alpha=0.25)
plt.tight_layout()
plt.show()
In [48]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f

df1 = anova_res["anova_table"]["df_between"]
df2 = anova_res["anova_table"]["df_within"]
alpha = anova_res["inputs"]["alpha"]
F_obs = anova_res["statistic"]["F_obs"]

F_crit = f.ppf(1 - alpha, df1, df2)

# x-grid (go far enough to include F_obs and the critical value)
x_max = max(F_obs, F_crit) + 3
x = np.linspace(0, x_max, 1200)
y = f.pdf(x, df1, df2)

plt.figure(figsize=(9, 4.5))
plt.plot(x, y, label=f"F density (df1={df1}, df2={df2})")

# Shade critical region (right tail)
mask = x >= F_crit
plt.fill_between(x[mask], y[mask], alpha=0.3, label="Critical region")

# Mark critical value and observed statistic
plt.axvline(F_crit, linestyle="--", label=f"F_crit = {F_crit:.3f}")
plt.axvline(F_obs, linestyle="-.", label=f"F_obs = {F_obs:.3f}")

plt.title(f"F-test in one-way ANOVA (reject H0 = {F_obs > F_crit})")
plt.xlabel("F")
plt.ylabel("density")
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

Post Hoc Tests After One-Way ANOVA

Why Post Hoc Tests Are Needed¶

In one-way ANOVA we test the global null hypothesis

$H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$

If ANOVA rejects $H_0$, we only know that at least one mean differs, but:

❌ ANOVA does not tell us which groups differ.

To identify where the differences lie, we perform post hoc multiple comparison tests.


The Multiple Comparisons Problem¶

Suppose we have $k$ groups.

  • Number of pairwise comparisons:

$m = \binom{k}{2} = \frac{k(k-1)}{2}$

If we test each comparison at level $\alpha = 0.05$, then the probability of making at least one Type I error increases rapidly.

Why We Should NOT Use Multiple Two-Sample t-Tests¶

One should never use multiple two-sample t-tests when comparing more than two groups.
Doing so inflates the Type I error rate.


Inflation of Type I Error¶

Assume we perform hypothesis tests at significance level $\alpha = 0.05$.

For one test:

  • Probability of not making a Type I error: $1 - \alpha = 0.95$
  • Probability of a Type I error: $\alpha = 0.05$

What Happens With Multiple Comparisons?¶

Suppose we perform $m$ independent comparisons.

  • Probability of no Type I errors:

    $(1 - \alpha)^m$

  • Probability of at least one Type I error (Family-Wise Error Rate):

    $\boxed{\text{FWER} = 1 - (1 - \alpha)^m}$

This probability increases rapidly as $m$ grows.
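A two-line loop makes the growth concrete (it reproduces the $m=2$ and $m=10$ examples below):

```python
# FWER = 1 - (1 - alpha)^m for an increasing number of comparisons m.
alpha = 0.05
for m in (1, 2, 3, 5, 10, 20, 50):
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:3d}   FWER = {fwer:.4f}")
```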


Concrete Examples¶

Example 1: Two Comparisons¶

Let $\alpha = 0.05$ and $m = 2$.

  • Probability of no Type I error:

    $(1 - 0.05)^2 = 0.9025$

  • Probability of at least one Type I error:

    $1 - 0.9025 = 0.0975$

So the Type I error rate is almost doubled.


Example 2: Five Groups¶

For $k = 5$ groups:

$m = \binom{5}{2} = 10$

  • Probability of at least one Type I error:

    $1 - (1 - 0.05)^{10} \approx 0.401$

➡️ 40% chance of falsely detecting a difference!


Interpretation¶

Even if all group means are truly equal, using multiple two-sample t-tests:

  • makes false discoveries very likely
  • produces misleading scientific conclusions
  • invalidates reported p-values

Why ANOVA Fixes This¶

  • One-way ANOVA performs a single global test
  • Controls the Type I error at level $\alpha$
  • Tests:

    $H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$

Only after rejecting ANOVA do we proceed to post hoc tests that explicitly control the family-wise error rate.


Key Takeaway (Exam-Ready Sentence)¶

Performing multiple two-sample t-tests inflates the Type I error rate, with
$\text{FWER} = 1 - (1 - \alpha)^m$,
which is why ANOVA followed by post hoc tests must be used instead.

Types of Post Hoc Tests (Big Picture)¶

| Method | Controls FWER | Assumptions | Notes |
|---|---|---|---|
| Bonferroni | Yes | Minimal | Conservative |
| Holm–Bonferroni | Yes | Minimal | Less conservative |
| Tukey HSD | Yes | Equal variances | Most common after ANOVA |
| Scheffé | Yes | Very general | Very conservative |
| Fisher LSD | No | Equal variances | Only valid if ANOVA significant |

Bonferroni Correction (Core Idea)¶

Bonferroni is based on a simple inequality:

$\mathbb{P}\left(\bigcup_{i=1}^m A_i\right) \le \sum_{i=1}^m \mathbb{P}(A_i)$

To ensure:

$\text{FWER} \le \alpha$

we test each hypothesis at level:

$\boxed{\alpha_{\text{Bonf}} = \frac{\alpha}{m}}$


Bonferroni Post Hoc Test (Step by Step)

Let $m = \binom{k}{2}$ pairwise comparisons.

Step 1: Form pairwise hypotheses¶

For each pair $(i,j)$:

$H_0^{(ij)}:\ \mu_i = \mu_j$

vs

$H_1^{(ij)}:\ \mu_i \neq \mu_j$


Step 2: Compute test statistics¶

Typically use two-sample t-tests:

  • Pooled variance if equal variances assumed
  • Welch t-test otherwise

Using the pooled within-group variance estimate from ANOVA (Mean Square Error):

$\text{MSE} = \text{MSW}$

the Bonferroni test statistic is

$t_{ij} = \dfrac{\bar{x}_i - \bar{x}_j}{\sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$

where:

  • $\bar{x}_i,\bar{x}_j$ are the sample means of groups $i$ and $j$
  • $n_i,n_j$ are the corresponding sample sizes
  • $\text{MSE}$ is taken from the ANOVA table

Degrees of Freedom¶

$df = N - k$

where:

  • $N$ is the total sample size
  • $k$ is the number of groups

Bonferroni Adjustment¶

If $m = \binom{k}{2}$ pairwise comparisons are performed, the Bonferroni-adjusted significance level is

$\alpha_{\text{Bonf}} = \frac{\alpha}{m}$


Decision Rule (Two-Sided)¶

Reject $H_0^{(ij)}$ if either of the following equivalent conditions holds:

$|t_{ij}| > t_{1-\alpha_{\text{Bonf}}/2,\,df}$

or

$p_{ij} < \alpha_{\text{Bonf}}$


Equivalent p-Value Formulation¶

Alternatively, define the adjusted p-value

$p^{\text{Bonf}}_{ij} = \min(m \cdot p_{ij},\ 1)$

Reject $H_0^{(ij)}$ if

$p^{\text{Bonf}}_{ij} < \alpha$
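A minimal helper for the adjusted-p-value form (the raw p-values below are hypothetical):

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: p_adj = min(m * p, 1)."""
    m = len(p_values)
    return [min(m * p, 1.0) for p in p_values]

# Hypothetical raw p-values from m = 3 pairwise tests.
raw = [0.004, 0.020, 0.400]
adj = bonferroni_adjust(raw)
print([f"{p:.3f}" for p in adj])   # ['0.012', '0.060', '1.000']
```

Each adjusted value is then compared directly to $\alpha$.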


Interpretation¶

If the Bonferroni-adjusted test rejects $H_0^{(ij)}$, we conclude that the mean responses of groups $i$ and $j$ differ, while maintaining family-wise error rate control at level $\alpha$.


Properties of Bonferroni¶

Advantages¶

✔ Very simple
✔ Works with any test statistic
✔ No distributional assumptions beyond the base test
✔ Valid for unbalanced designs

Disadvantages¶

❌ Conservative, especially when $m$ is large
❌ Reduced power (more Type II errors)


When to Use Bonferroni¶

Bonferroni is appropriate when:

  • Number of comparisons is small
  • Strong control of Type I error is required
  • Assumptions for Tukey HSD are doubtful
  • You want a safe default method

Comparison with Tukey HSD¶

| Aspect | Bonferroni | Tukey |
|---|---|---|
| Power | Lower | Higher |
| FWER control | Guaranteed | Guaranteed |
| Assumes equal variances | No | Yes |
| Uses ANOVA MSE | Optional | Yes |
| Typical use | General | Standard ANOVA |

Summary (Exam-Ready)¶

  • ANOVA answers whether differences exist
  • Post hoc tests answer where differences exist
  • Bonferroni controls FWER by splitting $\alpha$
  • Simple, robust, but conservative
  • Often a baseline method to compare with Tukey HSD

Key sentence:
Bonferroni correction controls the family-wise error rate by testing each comparison at level $\alpha/m$.

Tukey’s HSD (Honestly Significant Difference) Test

Context: Post Hoc Testing After ANOVA¶

Recall that one-way ANOVA tests the global hypothesis

$H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$

If ANOVA rejects $H_0$, we conclude that at least one mean differs, but we still do not know which pairs of means differ.

👉 Tukey’s HSD is a post hoc multiple comparison procedure designed specifically for all pairwise comparisons after ANOVA.


What Tukey’s HSD Tests¶

For every pair of groups $(i,j)$, Tukey’s HSD tests

$H_0^{(ij)}:\ \mu_i = \mu_j$

vs

$H_1^{(ij)}:\ \mu_i \neq \mu_j$

while controlling the family-wise error rate (FWER) at level $\alpha$.


Key Idea Behind Tukey’s HSD¶

Tukey’s HSD uses the studentized range distribution, which accounts for the fact that:

  • we are comparing many means simultaneously
  • the maximum difference among sample means is more variable than a single difference

Instead of adjusting $\alpha$ (like Bonferroni), Tukey adjusts the critical value.


Assumptions of Tukey’s HSD¶

Tukey’s HSD relies on the same assumptions as one-way ANOVA:

  1. Independence of observations
  2. Normality within each group
  3. Equal population variances
  4. Balanced or approximately balanced design (robust if mildly unbalanced)

If variances are unequal, Tukey’s HSD may not be valid.


Test Statistic¶

Let:

  • $\bar x_i$ = sample mean of group $i$
  • $n_i$ = sample size of group $i$
  • $MSE$ = mean square error from ANOVA (note: $\text{MSE} = \text{MSW} = \text{SSW}/(N-k)$, i.e. the within-group sum of squares divided by its degrees of freedom, not SSW itself)
  • $k$ = number of groups

For groups with equal sample sizes $n$:

$\text{SE} = \sqrt{\frac{MSE}{n}}$

For unequal sample sizes (Tukey–Kramer):

$\text{SE}_{ij} = \sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$


Tukey HSD Test Statistic¶

The Tukey test compares the absolute mean difference to a critical threshold:

$|\bar x_i - \bar x_j|$

Reject $H_0^{(ij)}$ if:

$|\bar x_i - \bar x_j| > q_{1-\alpha,k,df}\cdot \text{SE}_{ij}$

where:

  • $q_{1-\alpha,k,df}$ is the $1-\alpha$ quantile (upper $\alpha$ critical value) of the studentized range distribution
  • $df = N - k$ (within-group degrees of freedom)

Studentized Range Distribution¶

The studentized range statistic is:

$q = \frac{\max(\bar X_1,\dots,\bar X_k) - \min(\bar X_1,\dots,\bar X_k)}{S}$

where $S = \sqrt{\text{MSE}/n}$ estimates the standard deviation of a group mean (for equal group sizes $n$).

This distribution explicitly accounts for multiple comparisons among means.


Family-Wise Error Control¶

Tukey’s HSD guarantees:

$\mathbb{P}(\text{at least one Type I error}) \le \alpha$

for all pairwise mean comparisons.

This is exact control, not an approximation.


Tukey HSD Confidence Intervals¶

For each pair $(i,j)$, Tukey’s method produces simultaneous confidence intervals:

$(\bar x_i - \bar x_j) \ \pm\ q_{1-\alpha,k,df}\cdot \text{SE}_{ij}$

All intervals jointly have coverage probability at least $1-\alpha$.

If an interval does not contain 0, the corresponding means differ significantly.


Comparison with Bonferroni¶

| Aspect | Tukey HSD | Bonferroni |
|---|---|---|
| Designed for pairwise means | Yes | No (general) |
| Uses ANOVA MSE | Yes | Optional |
| Equal variance assumption | Yes | No |
| Power | Higher | Lower |
| FWER control | Exact | Upper bound |
| Conservativeness | Moderate | Often very conservative |

When to Use Tukey’s HSD¶

Use Tukey’s HSD when:

✔ ANOVA is significant
✔ You want all pairwise comparisons
✔ Variances are approximately equal
✔ You want higher power than Bonferroni

Avoid Tukey’s HSD when variances differ substantially.


Interpretation Example¶

If Tukey’s HSD finds that:

  • Group A vs B: significant
  • Group A vs C: significant
  • Group B vs C: not significant

then we conclude:

Group A differs from both B and C, while B and C are statistically indistinguishable.


Common Misconceptions¶

❌ Tukey’s HSD can be used without ANOVA
✔ It can, but it is intended as a post hoc method

❌ Tukey tests variances
✔ Tukey compares means, not variances

❌ Tukey is always better than Bonferroni
✔ Only when assumptions hold


Exam-Ready Summary¶

  • Tukey’s HSD is a post hoc test for all pairwise mean comparisons
  • Controls family-wise error rate exactly
  • Based on the studentized range distribution
  • More powerful than Bonferroni under equal variances
  • Standard choice after one-way ANOVA

Key sentence:
Tukey’s HSD controls the family-wise error rate by using the studentized range distribution to compare all pairwise mean differences simultaneously.

In [49]:
import math
import itertools
import pandas as pd
from scipy.stats import t, studentized_range


def _anova_mse_and_df_within(groups):
    """
    Compute MSE (= MSW) and df_within from raw groups.
    groups: list of arrays/lists
    """
    k = len(groups)
    n_i = [len(g) for g in groups]
    if k < 2 or any(n < 2 for n in n_i):
        raise ValueError("Need at least 2 groups and each group must have n>=2.")
    N = sum(n_i)

    means = [sum(g) / len(g) for g in groups]
    ss_within = 0.0
    for i, g in enumerate(groups):
        ss_within += sum((x - means[i]) ** 2 for x in g)

    df_within = N - k
    mse = ss_within / df_within
    return mse, df_within, means, n_i


def bonferroni_posthoc(groups, labels=None, alpha=0.05):
    """
    Bonferroni post hoc for all pairwise mean comparisons after one-way ANOVA.
    Uses ANOVA MSE (= MSW) as common variance estimator.
    """
    if labels is None:
        labels = [f"G{i+1}" for i in range(len(groups))]
    if len(labels) != len(groups):
        raise ValueError("labels must have same length as groups.")

    mse, dfw, means, n_i = _anova_mse_and_df_within(groups)
    k = len(groups)
    m = k * (k - 1) // 2  # number of pairwise comparisons
    alpha_bonf = alpha / m

    rows = []
    for i, j in itertools.combinations(range(k), 2):
        diff = means[i] - means[j]
        se_ij = math.sqrt(mse * (1 / n_i[i] + 1 / n_i[j]))
        t_obs = diff / se_ij

        # two-sided p-value for t
        p_raw = 2 * (1 - t.cdf(abs(t_obs), dfw))
        p_adj = min(m * p_raw, 1.0)

        reject_adj = p_adj < alpha  # equivalent to p_raw < alpha/m
        reject_alpha_div_m = p_raw < alpha_bonf

        rows.append({
            "pair": f"{labels[i]} - {labels[j]}",
            "mean_diff": diff,
            "t_obs": t_obs,
            "df": dfw,
            "p_raw": p_raw,
            "p_adj_bonf": p_adj,
            "alpha/m": alpha_bonf,
            "reject (p_adj<alpha)": reject_adj,
            "reject (p_raw<alpha/m)": reject_alpha_div_m
        })

    df = pd.DataFrame(rows)
    return {
        "method": "Bonferroni (pairwise t using ANOVA MSE)",
        "alpha": alpha,
        "m": m,
        "alpha_bonf": alpha_bonf,
        "MSE": mse,
        "df_within": dfw,
        "table": df.sort_values("p_adj_bonf").reset_index(drop=True)
    }


def tukey_hsd_posthoc(groups, labels=None, alpha=0.05):
    """
    Tukey HSD (Tukey–Kramer for possibly unequal n) for all pairwise comparisons.
    Uses studentized range distribution.

    Reject H0 for pair (i,j) if:
        |xbar_i - xbar_j| > q_{1-alpha,k,df} * sqrt(MSE/2 * (1/ni + 1/nj))
    """
    if labels is None:
        labels = [f"G{i+1}" for i in range(len(groups))]
    if len(labels) != len(groups):
        raise ValueError("labels must have same length as groups.")

    mse, dfw, means, n_i = _anova_mse_and_df_within(groups)
    k = len(groups)

    # Tukey critical value for studentized range
    q_crit = studentized_range.ppf(1 - alpha, k, dfw)

    rows = []
    for i, j in itertools.combinations(range(k), 2):
        diff = means[i] - means[j]
        se_ij = math.sqrt(mse / 2 * (1 / n_i[i] + 1 / n_i[j]))
        q_obs = abs(diff) / se_ij

        # Tukey p-value (right tail)
        p_value = 1 - studentized_range.cdf(q_obs, k, dfw)

        # Tukey-Kramer HSD threshold
        hsd = q_crit * se_ij
        reject = abs(diff) > hsd

        rows.append({
            "pair": f"{labels[i]} - {labels[j]}",
            "mean_diff": diff,
            "q_obs": q_obs,
            "df_within": dfw,
            "k": k,
            "q_crit": q_crit,
            "HSD": hsd,
            "p_value": p_value,
            "reject_H0": reject
        })

    df = pd.DataFrame(rows)
    return {
        "method": "Tukey HSD (Tukey–Kramer)",
        "alpha": alpha,
        "k": k,
        "MSE": mse,
        "df_within": dfw,
        "q_crit": q_crit,
        "table": df.sort_values("p_value").reset_index(drop=True)
    }
In [54]:
# --- EXECUTION CELL (uses data from previous ANOVA cells) ---
# Assumes you already have: group_A, group_B, group_C defined above

import pandas as pd
from scipy.stats import t, studentized_range

groups = [group_A, group_B, group_C]
labels = ["A", "B", "C"]

alpha = 0.01

tukey_res = tukey_hsd_posthoc(groups=groups, labels=labels, alpha=alpha)
bonf_res  = bonferroni_posthoc(groups=groups, labels=labels, alpha=alpha)

# -------- Add global critical values (nice for hand-checking) --------
k = tukey_res["k"]
dfw = tukey_res["df_within"]
m = bonf_res["m"]

alpha_bonf = bonf_res["alpha_bonf"]          # = alpha / m
q_crit = tukey_res["q_crit"]                 # Tukey critical q
t_crit_bonf = t.ppf(1 - alpha_bonf/2, dfw)   # Bonferroni critical t (two-sided)

print("=== GLOBAL CRITICAL VALUES ===")
print(f"alpha = {alpha}")
print(f"k = {k}, df_within = {dfw}, number of pairs m = {m}")
print(f"Bonferroni alpha/m = {alpha_bonf:.8f}")
print(f"Bonferroni t_crit (two-sided) = {t_crit_bonf:.6f}")
print(f"Tukey q_crit = {q_crit:.6f}")

# -------- Add per-pair critical thresholds into the tables --------
# Tukey per-pair HSD threshold is already in column 'HSD'
# Add a clearer column name too:
tukey_table = tukey_res["table"].copy()
tukey_table["critical_rule"] = "Reject if |mean_diff| > HSD"

# Bonferroni: add per-pair half-width threshold: t_crit * SE_ij
bonf_table = bonf_res["table"].copy()
bonf_table["t_crit_bonf"] = t_crit_bonf
bonf_table["crit_abs_diff"] = t_crit_bonf * (
    # SE_ij = |mean_diff| / |t_obs|  (stable unless t_obs=0)
    (bonf_table["mean_diff"].abs() / bonf_table["t_obs"].abs()).replace([float("inf")], 0.0)
)
bonf_table["critical_rule"] = "Reject if |mean_diff| > t_crit * SE"

# Display tables
display(tukey_table.style.format({
    "mean_diff": "{:.4f}",
    "q_obs": "{:.4f}",
    "q_crit": "{:.4f}",
    "HSD": "{:.4f}",
    "p_value": "{:.6g}"
}))

display(bonf_table.style.format({
    "mean_diff": "{:.4f}",
    "t_obs": "{:.4f}",
    "p_raw": "{:.6g}",
    "p_adj_bonf": "{:.6g}",
    "alpha/m": "{:.6g}",
    "t_crit_bonf": "{:.4f}",
    "crit_abs_diff": "{:.4f}"
}))
=== GLOBAL CRITICAL VALUES ===
alpha = 0.01
k = 3, df_within = 18, number of pairs m = 3
Bonferroni alpha/m = 0.00333333
Bonferroni t_crit (two-sided) = 3.380362
Tukey q_crit = 4.703370
  pair mean_diff q_obs df_within k q_crit HSD p_value reject_H0 critical_rule
0 A - C -17.0000 6.4995 18 3 4.7034 12.3021 0.000625717 True Reject if |mean_diff| > HSD
1 A - B -15.0000 5.7349 18 3 4.7034 12.3021 0.00204056 True Reject if |mean_diff| > HSD
2 B - C -2.0000 0.7646 18 3 4.7034 12.3021 0.852447 False Reject if |mean_diff| > HSD
  pair mean_diff t_obs df p_raw p_adj_bonf alpha/m reject (p_adj<alpha) reject (p_raw<alpha/m) t_crit_bonf crit_abs_diff critical_rule
0 A - C -17.0000 -4.5958 18 0.000224288 0.000672863 0.00333333 True True 3.3804 12.5039 Reject if |mean_diff| > t_crit * SE
1 A - B -15.0000 -4.0552 18 0.000742764 0.00222829 0.00333333 True True 3.3804 12.5039 Reject if |mean_diff| > t_crit * SE
2 B - C -2.0000 -0.5407 18 0.595351 1 0.00333333 False False 3.3804 12.5039 Reject if |mean_diff| > t_crit * SE
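Recent SciPy versions also ship a reference implementation, `scipy.stats.tukey_hsd`; its pairwise p-values should match the Tukey table above (A–B ≈ 0.00204, A–C ≈ 0.000626, B–C ≈ 0.852):

```python
from scipy.stats import tukey_hsd  # requires a reasonably recent SciPy

# Same faculty-age data as in the ANOVA example above.
group_A = [28, 32, 36, 42, 50, 33, 38]
group_B = [44, 61, 52, 54, 62, 45, 46]
group_C = [54, 56, 55, 65, 52, 50, 46]

res = tukey_hsd(group_A, group_B, group_C)
print(res)   # matrix of pairwise mean differences and p-values
```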

Example 2¶

Problem¶

Three fuel injection systems are tested for efficiency, and the following coded data are obtained:

| System 1 | System 2 | System 3 |
|---|---|---|
| 48 | 60 | 57 |
| 56 | 56 | 55 |
| 46 | 53 | 52 |
| 45 | 60 | 50 |
| 50 | 51 | 51 |

Question¶

Do the data support the hypothesis that the three fuel injection systems offer equivalent levels of efficiency?

Answer¶

Hypotheses¶

The appropriate hypotheses are

$H_0:\ \mu_1 = \mu_2 = \mu_3$

$H_1:$ At least one mean is different from the others.


Sample Means¶

The observed data are:

| System 1 | System 2 | System 3 |
|---|---|---|
| 48 | 60 | 57 |
| 56 | 56 | 55 |
| 46 | 53 | 52 |
| 45 | 60 | 50 |
| 50 | 51 | 51 |

The sample means are:

$\bar X_1 = 49,\quad \bar X_2 = 56,\quad \bar X_3 = 53$


Mean of the Sample Means¶

The mean of the sample means is

$\bar{\bar X} = \frac{49 + 56 + 53}{3} = 52.67$


Variation Between Samples¶

The variation present between samples is

$S^2_{Tr} = \frac{1}{k - 1} \sum_{i=1}^{k} (\bar X_i - \bar{\bar X})^2, \qquad k = 3 \text{ samples}$

$= \frac{1}{3 - 1}\left[(49 - 52.67)^2 + (56 - 52.67)^2 + (53 - 52.67)^2\right]$

$= 12.33$


Variation Within Samples¶

System 1¶

$\sum (X - \bar X_1)^2 = (48 - 49)^2 + (56 - 49)^2 + (46 - 49)^2 + (45 - 49)^2 + (50 - 49)^2 = 76$


System 2¶

$\sum (X - \bar X_2)^2 = (60 - 56)^2 + (56 - 56)^2 + (53 - 56)^2 + (60 - 56)^2 + (51 - 56)^2 = 66$


System 3¶

$\sum (X - \bar X_3)^2 = (57 - 53)^2 + (55 - 53)^2 + (52 - 53)^2 + (50 - 53)^2 + (51 - 53)^2 = 34$


Pooled Within-Sample Variation¶

Hence,

$S^2_E = \frac{\sum (X - \bar X_1)^2 + \sum (X - \bar X_2)^2 + \sum (X - \bar X_3)^2} {(n_1 - 1) + (n_2 - 1) + (n_3 - 1)}$

$= \frac{76 + 66 + 34}{4 + 4 + 4} = 14.67$


Test Statistic¶

The value of the $F$ statistic is given by

$F = \frac{n S^2_{Tr}}{S^2_E} = \frac{5 \times 12.33}{14.67} = 4.20$


Degrees of Freedom¶

  • Degrees of freedom for $S^2_{Tr}$:

    $\text{No. of samples} - 1 = 3 - 1 = 2$

  • Degrees of freedom for $S^2_E$:

    $\text{No. of samples} \times (\text{sample size} - 1) = 3 \times 4 = 12$


Decision¶

The critical value at the $5\%$ level of significance from the $F$ tables is

$F_{(2,12)} = 3.89$

Since

$4.20 > 3.89$

we conclude that there is sufficient evidence to reject $H_0$, and therefore the fuel injection systems are not of equivalent efficiency.
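The hand computation can be verified in a few lines; the result should match $F \approx 4.20$ and the tabulated critical value $3.89$:

```python
from scipy.stats import f_oneway, f

# Verify the hand computation for the fuel injection data.
s1 = [48, 56, 46, 45, 50]
s2 = [60, 56, 53, 60, 51]
s3 = [57, 55, 52, 50, 51]

F_obs, p = f_oneway(s1, s2, s3)
F_crit = f.ppf(0.95, 2, 12)

print(f"F = {F_obs:.3f} (hand: 4.20), F_crit = {F_crit:.3f} (tables: 3.89), p = {p:.4f}")
```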

Bonferroni Post Hoc Test (after One-Way ANOVA)¶

We perform Bonferroni pairwise comparisons between the three fuel injection systems, using the ANOVA within-sample variance estimate.

From the ANOVA solution:

  • Group means: $\bar X_1 = 49,\ \bar X_2 = 56,\ \bar X_3 = 53$
  • Within-samples variance estimate (from the worked solution): $S_E^2 = 14.67$
  • Total sample size: $N = 15$, number of groups: $k=3$
  • Degrees of freedom for error: $df = N-k = 12$
  • Each group has size $n_1=n_2=n_3=5$

Step 1. Number of comparisons and Bonferroni level¶

Number of pairwise comparisons:

$m = \binom{3}{2} = 3$

Bonferroni-adjusted significance level:

$\alpha_{\text{Bonf}} = \frac{\alpha}{m} = \frac{0.05}{3} = 0.01667$

For two-sided tests, the critical value uses $\alpha_{\text{Bonf}}/2$.


Step 2. Standard error for pairwise mean differences¶

Using the ANOVA MSE (here $S_E^2$), the standard error for comparing means $i$ and $j$ is

$\text{SE}_{ij} = \sqrt{S_E^2\left(\frac{1}{n_i}+\frac{1}{n_j}\right)}$

Since $n_i=n_j=5$:

$\text{SE} = \sqrt{14.67\left(\frac{1}{5}+\frac{1}{5}\right)} = \sqrt{14.67\cdot 0.4} = \sqrt{5.868} = 2.422$


Step 3. Test statistic and critical region¶

For each pair $(i,j)$ we test $H_0^{(ij)}:\mu_i=\mu_j$ using

$t_{ij} = \frac{\bar X_i - \bar X_j}{\text{SE}}$

Bonferroni critical value (two-sided) is:

$t_{\text{crit}} = t_{1-\alpha_{\text{Bonf}}/2,\ df} = t_{1-0.01667/2,\ 12} = t_{0.991667,\ 12} = 2.779$

So the critical difference in means is:

$|\bar X_i-\bar X_j| > t_{\text{crit}}\cdot \text{SE} = 2.779\cdot 2.422 = 6.732$


Step 4. Compute all pairwise comparisons¶

Pairwise differences¶

  • System 2 vs System 1: $56-49 = 7$
  • System 3 vs System 1: $53-49 = 4$
  • System 2 vs System 3: $56-53 = 3$

Test statistics¶

  • $t_{21} = \dfrac{7}{2.422} = 2.890$
  • $t_{31} = \dfrac{4}{2.422} = 1.651$
  • $t_{23} = \dfrac{3}{2.422} = 1.238$

Two-sided p-values (df = 12)¶

  • $p_{21} = 0.01358$
  • $p_{31} = 0.12459$
  • $p_{23} = 0.23923$

Bonferroni-adjusted p-values: $p^{\text{Bonf}}_{ij}=\min(3p_{ij},1)$

  • $p^{\text{Bonf}}_{21} = 0.04075$
  • $p^{\text{Bonf}}_{31} = 0.37378$
  • $p^{\text{Bonf}}_{23} = 0.71770$

Summary Table¶

| Comparison | Mean diff | SE | $t$ | Raw p-value | Bonferroni p-value | Reject at $\alpha=0.05$ (Bonferroni)? |
|---|---|---|---|---|---|---|
| System 2 − System 1 | 7 | 2.422 | 2.890 | 0.01358 | 0.04075 | Yes |
| System 3 − System 1 | 4 | 2.422 | 1.651 | 0.12459 | 0.37378 | No |
| System 2 − System 3 | 3 | 2.422 | 1.238 | 0.23923 | 0.71770 | No |

(Equivalently, using the critical-difference rule $|\bar X_i-\bar X_j|>6.732$, only the difference 7 is significant.)


Conclusion¶

At overall significance level $\alpha=0.05$ with Bonferroni correction:

  • System 2 differs significantly from System 1.
  • System 3 does not differ significantly from System 1.
  • System 2 does not differ significantly from System 3.

So, the evidence (under Bonferroni control of family-wise error) points to System 2 having higher mean efficiency than System 1, while System 3 is not conclusively different from either.
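The hand-computed Bonferroni quantities above can be reproduced with `scipy.stats.t` (using the exact $S_E^2 = 176/12$ rather than the rounded 14.67):

```python
from math import sqrt
from scipy.stats import t

# Verify the Bonferroni worked example for the fuel injection data.
mse = 176 / 12                    # exact S_E^2 from the ANOVA (≈ 14.67)
df, n, m, alpha = 12, 5, 3, 0.05  # error df, group size, comparisons, level

se = sqrt(mse * (1 / n + 1 / n))
t_crit = t.ppf(1 - (alpha / m) / 2, df)

print(f"SE = {se:.3f} (hand: 2.422), t_crit = {t_crit:.3f} (hand: 2.779)")
for diff in (7, 4, 3):
    t_obs = diff / se
    p_raw = 2 * t.sf(t_obs, df)
    print(f"diff = {diff}: t = {t_obs:.3f}, Bonferroni p = {min(m * p_raw, 1):.5f}")
```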