In many applications we want to compare more than two population means.
Examples: comparing average ages across faculty ranks, or efficiency levels across several fuel injection systems (both worked out later in these notes).
A naive approach would be to perform many pairwise $t$-tests.
This is incorrect, because it inflates the Type I error rate.
ANOVA provides a single global test for comparing multiple means.
Are all group means equal, or does at least one group differ?
ANOVA tests equality of means, not variances (despite the name).
Suppose we have $k$ groups.
Group $i$ has observations: $$ X_{i1}, X_{i2}, \dots, X_{in_i}, \quad i = 1,\dots,k. $$
Total sample size: $$ N = \sum_{i=1}^{k} n_i. $$
The one-way ANOVA model is: $$ X_{ij} = \mu_i + \varepsilon_{ij}, $$ where $\mu_i$ is the mean of group $i$ and $\varepsilon_{ij}$ is a random error term.
Assumptions on errors: $$ \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2), $$ independently for all $i,j$.
Equivalently: $$ X_{ij} \sim \mathcal{N}(\mu_i, \sigma^2). $$
Important: all groups share the same variance $\sigma^2$ (homoscedasticity); only the means $\mu_i$ may differ across groups.
ANOVA is based on variance decomposition.
Total variability in the data can be split into variability *between* groups (differences among the group means) and variability *within* groups (scatter of observations around their own group mean).
If group means are truly equal, between-group variability should be small relative to within-group variability.
Group means: $$ \bar{X}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij} $$
Grand mean: $$ \bar{X} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} X_{ij} $$
Measures total variability: $$ \text{SST} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(X_{ij} - \bar{X})^2 $$
Measures variability due to differences between group means: $$ \text{SSB} = \sum_{i=1}^{k} n_i(\bar{X}_i - \bar{X})^2 $$
Measures variability within groups: $$ \text{SSW} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(X_{ij} - \bar{X}_i)^2 $$
The decomposition $$ \text{SST} = \text{SSB} + \text{SSW} $$ is exact (not approximate).
The degrees of freedom decompose correspondingly: $$ (N - 1) = (k - 1) + (N - k) $$
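The identity can be verified numerically; a minimal sketch with made-up data (any numbers work):

```python
# Verify SST = SSB + SSW numerically on small made-up data.
groups = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0, 11.0], [4.0, 5.0, 6.0]]
n_i = [len(g) for g in groups]
N = sum(n_i)
means = [sum(g) / len(g) for g in groups]
grand = sum(x for g in groups for x in g) / N

sst = sum((x - grand) ** 2 for g in groups for x in g)
ssb = sum(n * (m - grand) ** 2 for n, m in zip(n_i, means))
ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

print(sst, ssb + ssw)  # equal up to floating-point rounding
```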
To compare variabilities on a common scale, the sums of squares are normalized by their degrees of freedom, giving mean squares: $$ \text{MSB} = \frac{\text{SSB}}{k-1}, \qquad \text{MSW} = \frac{\text{SSW}}{N-k}. $$
Interpretation: MSW estimates $\sigma^2$ whether or not $H_0$ holds, while MSB estimates $\sigma^2$ only under $H_0$ and is inflated when the means differ.
The ANOVA test statistic is: $$ F_{\text{obs}} = \frac{\text{MSB}}{\text{MSW}} $$
Under $H_0$: $$ F_{\text{obs}} \sim F(k-1, N-k) $$
Key theoretical result: under $H_0$, $\text{SSB}/\sigma^2 \sim \chi^2(k-1)$ and $\text{SSW}/\sigma^2 \sim \chi^2(N-k)$, and the two are independent.
Therefore: $$ \frac{(\text{SSB}/(k-1))}{(\text{SSW}/(N-k))} \sim F(k-1, N-k) $$
Given significance level $\alpha$:
Reject $H_0$ if: $$ F_{\text{obs}} > F_{1-\alpha}(k-1, N-k) $$
Equivalently, reject if: $$ \text{p-value} < \alpha $$
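Both decision rules can be sketched with `scipy.stats.f`; the numbers here ($k=3$, $N=21$, $\alpha=0.05$, observed $F_{\text{obs}}=4.5$) are purely illustrative:

```python
# Decision rules for the F test (illustrative numbers, not real data).
from scipy.stats import f

k, N, alpha, F_obs = 3, 21, 0.05, 4.5
df1, df2 = k - 1, N - k

F_crit = f.ppf(1 - alpha, df1, df2)   # critical value F_{1-alpha}(k-1, N-k)
p_value = f.sf(F_obs, df1, df2)       # P(F >= F_obs) under H0

print(f"F_crit = {F_crit:.4f}, p = {p_value:.4f}")
print("reject H0:", F_obs > F_crit, "|", p_value < alpha)  # the two rules always agree
```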
| Source | Sum of Squares | df | Mean Square | F |
|---|---|---|---|---|
| Between groups | SSB | k − 1 | MSB = SSB/(k − 1) | MSB/MSW |
| Within groups | SSW | N − k | MSW = SSW/(N − k) | |
| Total | SST | N − 1 | | |
(Three illustrative figure panels: group means *probably equal*, *almost surely different*, and *ambiguous*.)
import math
from scipy.stats import f

def one_way_anova(
    groups,
    alpha=0.05
):
    """
    One-way ANOVA.

    H0: All group means are equal
        μ1 = μ2 = ... = μk
    H1: At least one group mean differs

    Input
    -----
    groups : list of lists (or arrays)
        groups[i] contains observations from group i
        Need at least 2 groups, each with at least 2 observations
    alpha : significance level

    Test statistic:
        F = MS_between / MS_within
    Degrees of freedom:
        df_between = k - 1
        df_within = N - k

    Uses BOTH:
        (1) p-value method
        (2) critical region method
    """
    # ---------- Validate input ----------
    if groups is None or len(groups) < 2:
        raise ValueError("At least two groups are required.")
    k = len(groups)
    n_i = []
    means = []
    for g in groups:
        if len(g) < 2:
            raise ValueError("Each group must have at least 2 observations.")
        n_i.append(len(g))
        means.append(sum(g) / len(g))
    N = sum(n_i)
    # ---------- Grand mean ----------
    grand_mean = sum(
        means[i] * n_i[i] for i in range(k)
    ) / N
    # ---------- Sum of Squares ----------
    # Between groups
    ss_between = sum(
        n_i[i] * (means[i] - grand_mean) ** 2
        for i in range(k)
    )
    # Within groups
    ss_within = 0.0
    for i in range(k):
        ss_within += sum(
            (x - means[i]) ** 2 for x in groups[i]
        )
    # Total (optional check: SST = SSB + SSW)
    ss_total = ss_between + ss_within
    # ---------- Degrees of freedom ----------
    df_between = k - 1
    df_within = N - k
    if df_within <= 0:
        raise ValueError("Not enough data to compute within-group variance.")
    # ---------- Mean Squares ----------
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    # ---------- F statistic ----------
    F_obs = ms_between / ms_within
    # ---------- p-value method ----------
    p_value = 1 - f.cdf(F_obs, df_between, df_within)
    reject_by_pvalue = p_value < alpha
    # ---------- Critical region method ----------
    F_crit = f.ppf(1 - alpha, df_between, df_within)
    reject_by_critical = F_obs > F_crit
    critical_region = f"F > {F_crit:.4f}"
    # ---------- Return results ----------
    return {
        "inputs": {
            "k": k,
            "group_sizes": n_i,
            "alpha": alpha
        },
        "means": {
            "group_means": means,
            "grand_mean": grand_mean
        },
        "anova_table": {
            "SS_between": ss_between,
            "df_between": df_between,
            "MS_between": ms_between,
            "SS_within": ss_within,
            "df_within": df_within,
            "MS_within": ms_within,
            "SS_total": ss_total,
            "df_total": N - 1
        },
        "statistic": {
            "F_obs": F_obs
        },
        "p_value_method": {
            "p_value": p_value,
            "reject_H0": reject_by_pvalue
        },
        "critical_region_method": {
            "critical_region": critical_region,
            "F_crit": F_crit,
            "reject_H0": reject_by_critical
        }
    }
A researcher claims that there is a difference in the average age of assistant professors, associate professors, and full professors at her university.
Faculty members are selected randomly, and their ages are recorded.
Assume that faculty ages are normally distributed.
Test the researcher’s claim at the $\alpha = 0.01$ significance level.
The observed data are:
| Rank | Ages |
|---|---|
| Assistant Professor | 28, 32, 36, 42, 50, 33, 38 |
| Associate Professor | 44, 61, 52, 54, 62, 45, 46 |
| Professor | 54, 56, 55, 65, 52, 50, 46 |
# Faculty ages by rank: assistant (A), associate (B), full professors (C)
group_A = [28, 32, 36, 42, 50, 33, 38]
group_B = [44, 61, 52, 54, 62, 45, 46]
group_C = [54, 56, 55, 65, 52, 50, 46]
groups = [group_A, group_B, group_C]
anova_res = one_way_anova(groups=groups, alpha=0.01)
anova_res
{'inputs': {'k': 3, 'group_sizes': [7, 7, 7], 'alpha': 0.01},
'means': {'group_means': [37.0, 52.0, 54.0],
'grand_mean': 47.666666666666664},
'anova_table': {'SS_between': 1208.6666666666667,
'df_between': 2,
'MS_between': 604.3333333333334,
'SS_within': 862.0,
'df_within': 18,
'MS_within': 47.888888888888886,
'SS_total': 2070.666666666667,
'df_total': 20},
'statistic': {'F_obs': 12.619489559164736},
'p_value_method': {'p_value': 0.00037546863291471055, 'reject_H0': True},
'critical_region_method': {'critical_region': 'F > 6.0129',
'F_crit': 6.012904834800529,
'reject_H0': True}}
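As a cross-check, `scipy.stats.f_oneway` computes the same one-way ANOVA statistic and p-value (the data are repeated here so the cell is self-contained):

```python
# Cross-check the hand-rolled ANOVA against scipy.stats.f_oneway.
from scipy.stats import f_oneway

assistant = [28, 32, 36, 42, 50, 33, 38]
associate = [44, 61, 52, 54, 62, 45, 46]
professor = [54, 56, 55, 65, 52, 50, 46]

F_scipy, p_scipy = f_oneway(assistant, associate, professor)
print(F_scipy, p_scipy)  # matches F_obs ≈ 12.6195 and p ≈ 0.000375 above
```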
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 4.5))
plt.boxplot(groups, labels=["Group A", "Group B", "Group C"])
plt.title("One-way ANOVA data: boxplots by group")
plt.xlabel("Group")
plt.ylabel("Value")
plt.grid(True, alpha=0.3)
plt.show()
import numpy as np
import matplotlib.pyplot as plt

# We assume these already exist from previous cells:
#   group_A, group_B, group_C
groups = {
    "A": group_A,
    "B": group_B,
    "C": group_C
}
rng = np.random.default_rng(0)  # reproducible jitter
jitter = 0.08                   # horizontal spread
plt.figure(figsize=(8, 4.5))
labels = list(groups.keys())
x_pos = np.arange(len(labels))
for i, label in enumerate(labels):
    y = np.array(groups[label], dtype=float)
    x = i + rng.uniform(-jitter, jitter, size=len(y))
    # individual observations
    plt.scatter(x, y, s=35, marker="s", alpha=0.9)
    # group mean
    plt.scatter(i, y.mean(), s=120, marker="o", zorder=3)
plt.xticks(x_pos, labels)
plt.ylabel("Response")
plt.title("Almost surely different")
plt.grid(True, alpha=0.25)
plt.tight_layout()
plt.show()
import numpy as np
from scipy.stats import f
import matplotlib.pyplot as plt
df1 = anova_res["anova_table"]["df_between"]
df2 = anova_res["anova_table"]["df_within"]
alpha = anova_res["inputs"]["alpha"]
F_obs = anova_res["statistic"]["F_obs"]
F_crit = f.ppf(1 - alpha, df1, df2)
# x-grid (go far enough to include F_obs and the critical value)
x_max = max(F_obs, F_crit) + 3
x = np.linspace(0, x_max, 1200)
y = f.pdf(x, df1, df2)
plt.figure(figsize=(9, 4.5))
plt.plot(x, y, label=f"F density (df1={df1}, df2={df2})")
# Shade critical region (right tail)
mask = x >= F_crit
plt.fill_between(x[mask], y[mask], alpha=0.3, label="Critical region")
# Mark critical value and observed statistic
plt.axvline(F_crit, linestyle="--", label=f"F_crit = {F_crit:.3f}")
plt.axvline(F_obs, linestyle="-.", label=f"F_obs = {F_obs:.3f}")
plt.title(f"F-test in one-way ANOVA (reject H0 = {F_obs > F_crit})")
plt.xlabel("F")
plt.ylabel("density")
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()
In one-way ANOVA we test the global null hypothesis
$H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$
If ANOVA rejects $H_0$, we only know that at least one mean differs, but:
❌ ANOVA does not tell us which groups differ.
To identify where the differences lie, we perform post hoc multiple comparison tests.
Suppose we have $k$ groups.
The number of pairwise comparisons is $m = \binom{k}{2} = \frac{k(k-1)}{2}$.
If we test each comparison at level $\alpha = 0.05$, then the probability of making at least one Type I error increases rapidly.
One should never use multiple two-sample t-tests when comparing more than two groups.
Doing so inflates the Type I error rate.
Assume we perform hypothesis tests at significance level $\alpha = 0.05$.
For one test, the probability of a Type I error is $\alpha = 0.05$, so the probability of no Type I error is $1 - \alpha = 0.95$.
Suppose we perform $m$ independent comparisons.
Probability of no Type I errors:
$(1 - \alpha)^m$
Probability of at least one Type I error (Family-Wise Error Rate):
$\boxed{\text{FWER} = 1 - (1 - \alpha)^m}$
This probability increases rapidly as $m$ grows.
Let $\alpha = 0.05$ and $m = 2$.
Probability of no Type I error:
$(1 - 0.05)^2 = 0.9025$
Probability of at least one Type I error:
$1 - 0.9025 = 0.0975$
So the Type I error rate is almost doubled.
For $k = 5$ groups:
$m = \binom{5}{2} = 10$
Probability of at least one Type I error:
$1 - (1 - 0.05)^{10} \approx 0.401$
➡️ 40% chance of falsely detecting a difference!
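The growth of the FWER can be tabulated directly; the values for $m=2$ and $m=10$ match the ones computed above:

```python
# FWER = 1 - (1 - alpha)^m for a growing number of comparisons m.
alpha = 0.05
fwers = {m: 1 - (1 - alpha) ** m for m in [1, 2, 3, 10, 45]}  # 45 = C(10, 2)
for m, fw in fwers.items():
    print(f"m = {m:2d}  ->  FWER = {fw:.4f}")
```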
Even if all group means are truly equal, multiple two-sample $t$-tests are likely to flag at least one pair as "significantly" different purely by chance.
ANOVA instead performs a single global test:
$H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$
Only after this global test rejects $H_0$ do we proceed to post hoc tests that explicitly control the family-wise error rate.
Performing multiple two-sample t-tests inflates the Type I error rate, with
$\text{FWER} = 1 - (1 - \alpha)^m$,
which is why ANOVA followed by post hoc tests must be used instead.
| Method | Controls FWER | Assumptions | Notes |
|---|---|---|---|
| Bonferroni | Yes | Minimal | Conservative |
| Holm–Bonferroni | Yes | Minimal | Less conservative |
| Tukey HSD | Yes | Equal variances | Most common after ANOVA |
| Scheffé | Yes | Very general | Very conservative |
| Fisher LSD | No | Equal variances | Only valid if ANOVA significant |
Bonferroni is based on a simple inequality:
$\mathbb{P}\left(\bigcup_{i=1}^m A_i\right) \le \sum_{i=1}^m \mathbb{P}(A_i)$
To ensure:
$\text{FWER} \le \alpha$
we test each hypothesis at level:
$\boxed{\alpha_{\text{Bonf}} = \frac{\alpha}{m}}$
Let $m = \binom{k}{2}$ pairwise comparisons.
For each pair $(i,j)$:
$H_0^{(ij)}:\ \mu_i = \mu_j$
vs
$H_1^{(ij)}:\ \mu_i \neq \mu_j$
Typically use two-sample t-tests:
Using the pooled within-group variance estimate from ANOVA (Mean Square Error):
$\text{MSE} = \text{MSW}$
the Bonferroni test statistic is
$t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$
with degrees of freedom $df = N - k$ (the ANOVA within-group degrees of freedom).
If $m = \binom{k}{2}$ pairwise comparisons are performed, the Bonferroni-adjusted significance level is
$\alpha_{\text{Bonf}} = \frac{\alpha}{m}$
Reject $H_0^{(ij)}$ if either of the following equivalent conditions holds:
$|t_{ij}| > t_{1-\alpha_{\text{Bonf}}/2,\,df}$
or
$p_{ij} < \alpha_{\text{Bonf}}$
Alternatively, define the adjusted p-value
$p^{\text{Bonf}}_{ij} = \min(m \cdot p_{ij},\ 1)$
Reject $H_0^{(ij)}$ if
$p^{\text{Bonf}}_{ij} < \alpha$
If the Bonferroni-adjusted test rejects $H_0^{(ij)}$, we conclude that the mean responses of groups $i$ and $j$ differ, while maintaining family-wise error rate control at level $\alpha$.
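A tiny sketch of the adjusted-p-value rule with $m = 3$ comparisons and made-up raw p-values:

```python
# Bonferroni adjusted p-values: p_adj = min(m * p_raw, 1).
m, alpha = 3, 0.05
p_raw = [0.004, 0.020, 0.700]             # made-up raw p-values
p_adj = [min(m * p, 1.0) for p in p_raw]
reject = [p < alpha for p in p_adj]
print(p_adj)    # roughly [0.012, 0.06, 1.0]
print(reject)   # only the first comparison survives the correction
```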
✔ Very simple
✔ Works with any test statistic
✔ No distributional assumptions beyond the base test
✔ Valid for unbalanced designs
❌ Conservative, especially when $m$ is large
❌ Reduced power (more Type II errors)
Bonferroni is appropriate when the number of comparisons $m$ is small, the comparisons are specified in advance, or the underlying tests are of different types.
| Aspect | Bonferroni | Tukey |
|---|---|---|
| Power | Lower | Higher |
| FWER control | Guaranteed | Guaranteed |
| Assumes equal variances | No | Yes |
| Uses ANOVA MSE | Optional | Yes |
| Typical use | General | Standard ANOVA |
Key sentence:
Bonferroni correction controls the family-wise error rate by testing each comparison at level $\alpha/m$.
Recall that one-way ANOVA tests the global hypothesis
$H_0:\ \mu_1 = \mu_2 = \dots = \mu_k$
If ANOVA rejects $H_0$, we conclude that at least one mean differs, but we still do not know which pairs of means differ.
👉 Tukey’s HSD is a post hoc multiple comparison procedure designed specifically for all pairwise comparisons after ANOVA.
For every pair of groups $(i,j)$, Tukey’s HSD tests
$H_0^{(ij)}:\ \mu_i = \mu_j$
vs
$H_1^{(ij)}:\ \mu_i \neq \mu_j$
while controlling the family-wise error rate (FWER) at level $\alpha$.
Tukey’s HSD uses the studentized range distribution, which accounts for the fact that we are simultaneously comparing all $k$ sample means, including the largest against the smallest.
Instead of adjusting $\alpha$ (like Bonferroni), Tukey adjusts the critical value.
Tukey’s HSD relies on the same assumptions as one-way ANOVA: normality, independence, and equal variances across groups.
If variances are unequal, Tukey’s HSD may not be valid.
Let $\text{MSE}$ denote the ANOVA mean square error with $df = N - k$ degrees of freedom, and let $n_i$ be the sample size of group $i$.
For groups with equal sample sizes $n$:
$\text{SE} = \sqrt{\frac{MSE}{n}}$
For unequal sample sizes (Tukey–Kramer):
$\text{SE}_{ij} = \sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$
The Tukey test compares the absolute mean difference to a critical threshold:
$|\bar x_i - \bar x_j|$
Reject $H_0^{(ij)}$ if:
$|\bar x_i - \bar x_j| > q_{1-\alpha,k,df}\cdot \text{SE}_{ij}$
where $q_{1-\alpha,k,df}$ is the $(1-\alpha)$ quantile of the studentized range distribution with $k$ groups and $df = N - k$ degrees of freedom.
The studentized range statistic is:
$q = \frac{\max(\bar X_1,\dots,\bar X_k) - \min(\bar X_1,\dots,\bar X_k)}{S}$
where $S = \sqrt{\text{MSE}/n}$ is the estimated standard error of a group mean (for equal group sizes $n$).
This distribution explicitly accounts for multiple comparisons among means.
Tukey’s HSD guarantees:
$\mathbb{P}(\text{at least one Type I error}) \le \alpha$
for all pairwise mean comparisons.
For equal group sizes this control is exact, not an approximation (the Tukey–Kramer extension to unequal sizes is slightly conservative).
For each pair $(i,j)$, Tukey’s method produces simultaneous confidence intervals:
$(\bar x_i - \bar x_j) \ \pm\ q_{1-\alpha,k,df}\cdot \text{SE}_{ij}$
All intervals jointly have coverage probability at least $1-\alpha$.
If an interval does not contain 0, the corresponding means differ significantly.
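A sketch of these simultaneous intervals using `scipy.stats.studentized_range`, plugging in the faculty-age ANOVA quantities from earlier ($k=3$ groups of $n=7$, $df=18$, $\text{MSE}\approx 47.89$, $\alpha=0.01$):

```python
# Tukey simultaneous CIs: (xbar_i - xbar_j) ± q_{1-alpha,k,df} * SE_ij.
import math
from scipy.stats import studentized_range

k, n, dfw, alpha = 3, 7, 18, 0.01
mse = 47.888888888888886                      # MSW from the faculty-age ANOVA
means = {"A": 37.0, "B": 52.0, "C": 54.0}     # group means from that ANOVA

q = studentized_range.ppf(1 - alpha, k, dfw)  # q_{1-alpha, k, df}
se = math.sqrt(mse / 2 * (1 / n + 1 / n))     # Tukey–Kramer SE (equal n here)
half_width = q * se                           # = HSD threshold

significant = {}
for a, b in [("A", "B"), ("A", "C"), ("B", "C")]:
    d = means[a] - means[b]
    lo, hi = d - half_width, d + half_width
    sig = not (lo <= 0.0 <= hi)               # CI excludes 0 -> significant
    significant[f"{a}-{b}"] = sig
    print(f"{a}-{b}: diff = {d:6.1f}, CI = ({lo:8.2f}, {hi:8.2f}), significant = {sig}")
```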
| Aspect | Tukey HSD | Bonferroni |
|---|---|---|
| Designed for pairwise means | Yes | No (general) |
| Uses ANOVA MSE | Yes | Optional |
| Equal variance assumption | Yes | No |
| Power | Higher | Lower |
| FWER control | Exact | Upper bound |
| Conservativeness | Moderate | Often very conservative |
Use Tukey’s HSD when:
✔ ANOVA is significant
✔ You want all pairwise comparisons
✔ Variances are approximately equal
✔ You want higher power than Bonferroni
Avoid Tukey’s HSD when variances differ substantially.
If Tukey’s HSD finds that the A–B and A–C comparisons are significant while the B–C comparison is not, then we conclude:
Group A differs from both B and C, while B and C are statistically indistinguishable.
❌ Tukey’s HSD can be used without ANOVA
✔ It can, but it is intended as a post hoc method
❌ Tukey tests variances
✔ Tukey compares means, not variances
❌ Tukey is always better than Bonferroni
✔ Only when assumptions hold
Key sentence:
Tukey’s HSD controls the family-wise error rate by using the studentized range distribution to compare all pairwise mean differences simultaneously.
import math
import itertools
import numpy as np
import pandas as pd
from scipy.stats import t, studentized_range

def _anova_mse_and_df_within(groups):
    """
    Compute MSE (= MSW) and df_within from raw groups.

    groups: list of arrays/lists
    """
    k = len(groups)
    n_i = [len(g) for g in groups]
    if k < 2 or any(n < 2 for n in n_i):
        raise ValueError("Need at least 2 groups and each group must have n>=2.")
    N = sum(n_i)
    means = [sum(g) / len(g) for g in groups]
    ss_within = 0.0
    for i, g in enumerate(groups):
        ss_within += sum((x - means[i]) ** 2 for x in g)
    df_within = N - k
    mse = ss_within / df_within
    return mse, df_within, means, n_i

def bonferroni_posthoc(groups, labels=None, alpha=0.05):
    """
    Bonferroni post hoc for all pairwise mean comparisons after one-way ANOVA.
    Uses ANOVA MSE (= MSW) as common variance estimator.
    """
    if labels is None:
        labels = [f"G{i+1}" for i in range(len(groups))]
    if len(labels) != len(groups):
        raise ValueError("labels must have same length as groups.")
    mse, dfw, means, n_i = _anova_mse_and_df_within(groups)
    k = len(groups)
    m = k * (k - 1) // 2  # number of pairwise comparisons
    alpha_bonf = alpha / m
    rows = []
    for i, j in itertools.combinations(range(k), 2):
        diff = means[i] - means[j]
        se_ij = math.sqrt(mse * (1 / n_i[i] + 1 / n_i[j]))
        t_obs = diff / se_ij
        # two-sided p-value for t
        p_raw = 2 * (1 - t.cdf(abs(t_obs), dfw))
        p_adj = min(m * p_raw, 1.0)
        reject_adj = p_adj < alpha  # equivalent to p_raw < alpha/m
        reject_alpha_div_m = p_raw < alpha_bonf
        rows.append({
            "pair": f"{labels[i]} - {labels[j]}",
            "mean_diff": diff,
            "t_obs": t_obs,
            "df": dfw,
            "p_raw": p_raw,
            "p_adj_bonf": p_adj,
            "alpha/m": alpha_bonf,
            "reject (p_adj<alpha)": reject_adj,
            "reject (p_raw<alpha/m)": reject_alpha_div_m
        })
    df = pd.DataFrame(rows)
    return {
        "method": "Bonferroni (pairwise t using ANOVA MSE)",
        "alpha": alpha,
        "m": m,
        "alpha_bonf": alpha_bonf,
        "MSE": mse,
        "df_within": dfw,
        "table": df.sort_values("p_adj_bonf").reset_index(drop=True)
    }

def tukey_hsd_posthoc(groups, labels=None, alpha=0.05):
    """
    Tukey HSD (Tukey–Kramer for possibly unequal n) for all pairwise comparisons.
    Uses studentized range distribution.

    Reject H0 for pair (i,j) if:
        |xbar_i - xbar_j| > q_{1-alpha,k,df} * sqrt(MSE/2 * (1/ni + 1/nj))
    """
    if labels is None:
        labels = [f"G{i+1}" for i in range(len(groups))]
    if len(labels) != len(groups):
        raise ValueError("labels must have same length as groups.")
    mse, dfw, means, n_i = _anova_mse_and_df_within(groups)
    k = len(groups)
    # Tukey critical value for studentized range
    q_crit = studentized_range.ppf(1 - alpha, k, dfw)
    rows = []
    for i, j in itertools.combinations(range(k), 2):
        diff = means[i] - means[j]
        se_ij = math.sqrt(mse / 2 * (1 / n_i[i] + 1 / n_i[j]))
        q_obs = abs(diff) / se_ij
        # Tukey p-value (right tail)
        p_value = 1 - studentized_range.cdf(q_obs, k, dfw)
        # Tukey-Kramer HSD threshold
        hsd = q_crit * se_ij
        reject = abs(diff) > hsd
        rows.append({
            "pair": f"{labels[i]} - {labels[j]}",
            "mean_diff": diff,
            "q_obs": q_obs,
            "df_within": dfw,
            "k": k,
            "q_crit": q_crit,
            "HSD": hsd,
            "p_value": p_value,
            "reject_H0": reject
        })
    df = pd.DataFrame(rows)
    return {
        "method": "Tukey HSD (Tukey–Kramer)",
        "alpha": alpha,
        "k": k,
        "MSE": mse,
        "df_within": dfw,
        "q_crit": q_crit,
        "table": df.sort_values("p_value").reset_index(drop=True)
    }
# --- EXECUTION CELL (uses data from previous ANOVA cells) ---
# Assumes you already have: group_A, group_B, group_C defined above
import pandas as pd
from scipy.stats import t, studentized_range

groups = [group_A, group_B, group_C]
labels = ["A", "B", "C"]
alpha = 0.01

tukey_res = tukey_hsd_posthoc(groups=groups, labels=labels, alpha=alpha)
bonf_res = bonferroni_posthoc(groups=groups, labels=labels, alpha=alpha)

# -------- Add global critical values (nice for hand-checking) --------
k = tukey_res["k"]
dfw = tukey_res["df_within"]
m = bonf_res["m"]
alpha_bonf = bonf_res["alpha_bonf"]           # = alpha / m
q_crit = tukey_res["q_crit"]                  # Tukey critical q
t_crit_bonf = t.ppf(1 - alpha_bonf / 2, dfw)  # Bonferroni critical t (two-sided)

print("=== GLOBAL CRITICAL VALUES ===")
print(f"alpha = {alpha}")
print(f"k = {k}, df_within = {dfw}, number of pairs m = {m}")
print(f"Bonferroni alpha/m = {alpha_bonf:.8f}")
print(f"Bonferroni t_crit (two-sided) = {t_crit_bonf:.6f}")
print(f"Tukey q_crit = {q_crit:.6f}")

# -------- Add per-pair critical thresholds into the tables --------
# Tukey per-pair HSD threshold is already in column 'HSD';
# add a clearer decision-rule column too:
tukey_table = tukey_res["table"].copy()
tukey_table["critical_rule"] = "Reject if |mean_diff| > HSD"

# Bonferroni: add per-pair half-width threshold: t_crit * SE_ij
bonf_table = bonf_res["table"].copy()
bonf_table["t_crit_bonf"] = t_crit_bonf
bonf_table["crit_abs_diff"] = t_crit_bonf * (
    # SE_ij = |mean_diff| / |t_obs| (stable unless t_obs = 0)
    (bonf_table["mean_diff"].abs() / bonf_table["t_obs"].abs()).replace([float("inf")], 0.0)
)
bonf_table["critical_rule"] = "Reject if |mean_diff| > t_crit * SE"

# Display tables
display(tukey_table.style.format({
    "mean_diff": "{:.4f}",
    "q_obs": "{:.4f}",
    "q_crit": "{:.4f}",
    "HSD": "{:.4f}",
    "p_value": "{:.6g}"
}))
display(bonf_table.style.format({
    "mean_diff": "{:.4f}",
    "t_obs": "{:.4f}",
    "p_raw": "{:.6g}",
    "p_adj_bonf": "{:.6g}",
    "alpha/m": "{:.6g}",
    "t_crit_bonf": "{:.4f}",
    "crit_abs_diff": "{:.4f}"
}))
=== GLOBAL CRITICAL VALUES ===
alpha = 0.01
k = 3, df_within = 18, number of pairs m = 3
Bonferroni alpha/m = 0.00333333
Bonferroni t_crit (two-sided) = 3.380362
Tukey q_crit = 4.703370
| | pair | mean_diff | q_obs | df_within | k | q_crit | HSD | p_value | reject_H0 | critical_rule |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A - C | -17.0000 | 6.4995 | 18 | 3 | 4.7034 | 12.3021 | 0.000625717 | True | Reject if \|mean_diff\| > HSD |
| 1 | A - B | -15.0000 | 5.7349 | 18 | 3 | 4.7034 | 12.3021 | 0.00204056 | True | Reject if \|mean_diff\| > HSD |
| 2 | B - C | -2.0000 | 0.7646 | 18 | 3 | 4.7034 | 12.3021 | 0.852447 | False | Reject if \|mean_diff\| > HSD |
| | pair | mean_diff | t_obs | df | p_raw | p_adj_bonf | alpha/m | reject (p_adj<alpha) | reject (p_raw<alpha/m) | t_crit_bonf | crit_abs_diff | critical_rule |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A - C | -17.0000 | -4.5958 | 18 | 0.000224288 | 0.000672863 | 0.00333333 | True | True | 3.3804 | 12.5039 | Reject if \|mean_diff\| > t_crit * SE |
| 1 | A - B | -15.0000 | -4.0552 | 18 | 0.000742764 | 0.00222829 | 0.00333333 | True | True | 3.3804 | 12.5039 | Reject if \|mean_diff\| > t_crit * SE |
| 2 | B - C | -2.0000 | -0.5407 | 18 | 0.595351 | 1 | 0.00333333 | False | False | 3.3804 | 12.5039 | Reject if \|mean_diff\| > t_crit * SE |
Three fuel injection systems are tested for efficiency, and the following coded data are obtained:
| System 1 | System 2 | System 3 |
|---|---|---|
| 48 | 60 | 57 |
| 56 | 56 | 55 |
| 46 | 53 | 52 |
| 45 | 60 | 50 |
| 50 | 51 | 51 |
Do the data support the hypothesis that the three fuel injection systems offer equivalent levels of efficiency?
The appropriate hypotheses are
$H_0:\ \mu_1 = \mu_2 = \mu_3$
$H_1:$ At least one mean is different from the others.
The observed data are:
| System 1 | System 2 | System 3 |
|---|---|---|
| 48 | 60 | 57 |
| 56 | 56 | 55 |
| 46 | 53 | 52 |
| 45 | 60 | 50 |
| 50 | 51 | 51 |
The sample means are:
$\bar X_1 = 49,\quad \bar X_2 = 56,\quad \bar X_3 = 53$
The mean of the sample means is
$\bar{\bar X} = \frac{49 + 56 + 53}{3} = 52.67$
The variation present between samples is
$S^2_{Tr} = \frac{1}{k - 1} \sum_{i=1}^{k} (\bar X_i - \bar{\bar X})^2$, where $k = 3$ is the number of samples.
$= \frac{1}{3 - 1}\left[(49 - 52.67)^2 + (56 - 52.67)^2 + (53 - 52.67)^2\right]$
$= 12.33$
$\sum (X - \bar X_1)^2 = (48 - 49)^2 + (56 - 49)^2 + (46 - 49)^2 + (45 - 49)^2 + (50 - 49)^2 = 76$
$\sum (X - \bar X_2)^2 = (60 - 56)^2 + (56 - 56)^2 + (53 - 56)^2 + (60 - 56)^2 + (51 - 56)^2 = 66$
$\sum (X - \bar X_3)^2 = (57 - 53)^2 + (55 - 53)^2 + (52 - 53)^2 + (50 - 53)^2 + (51 - 53)^2 = 34$
Hence,
$S^2_E = \frac{\sum (X - \bar X_1)^2 + \sum (X - \bar X_2)^2 + \sum (X - \bar X_3)^2} {(n_1 - 1) + (n_2 - 1) + (n_3 - 1)}$
$= \frac{76 + 66 + 34}{4 + 4 + 4} = 14.67$
The value of the $F$ statistic is given by
$F = \frac{n S^2_{Tr}}{S^2_E} = \frac{5 \times 12.33}{14.67} = 4.20$
Degrees of freedom for $S^2_{Tr}$:
$\text{No. of samples} - 1 = 3 - 1 = 2$
Degrees of freedom for $S^2_E$:
$\text{No. of samples} \times (\text{sample size} - 1) = 3 \times 4 = 12$
The critical value at the $5\%$ level of significance from the $F$ tables is
$F_{0.95}(2,12) = 3.89$
Since
$4.20 > 3.89$
we conclude that there is sufficient evidence to reject $H_0$, and therefore the fuel injection systems are not of equivalent efficiency.
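The arithmetic above can be checked with scipy on the same coded data; `f_oneway` reproduces $F \approx 4.20$ and `f.ppf` the critical value $3.89$:

```python
# Check the fuel-injection ANOVA: F statistic and 5% critical value.
from scipy.stats import f, f_oneway

system1 = [48, 56, 46, 45, 50]
system2 = [60, 56, 53, 60, 51]
system3 = [57, 55, 52, 50, 51]

F_obs, p = f_oneway(system1, system2, system3)
F_crit = f.ppf(0.95, 2, 12)

print(f"F = {F_obs:.3f}, p = {p:.4f}, F_crit = {F_crit:.3f}")
print("reject H0:", F_obs > F_crit)
```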
We perform Bonferroni pairwise comparisons between the three fuel injection systems, using the ANOVA within-sample variance estimate.
From the ANOVA solution: $S_E^2 = 14.67$ (the MSE) with $df = 12$, sample means $\bar X_1 = 49$, $\bar X_2 = 56$, $\bar X_3 = 53$, and $n_i = 5$ per system.
Number of pairwise comparisons:
$m = \binom{3}{2} = 3$
Bonferroni-adjusted significance level:
$\alpha_{\text{Bonf}} = \frac{\alpha}{m} = \frac{0.05}{3} = 0.01667$
For two-sided tests, the critical value uses $\alpha_{\text{Bonf}}/2$.
Using the ANOVA MSE (here $S_E^2$), the standard error for comparing means $i$ and $j$ is
$\text{SE}_{ij} = \sqrt{S_E^2\left(\frac{1}{n_i}+\frac{1}{n_j}\right)}$
Since $n_i=n_j=5$:
$\text{SE} = \sqrt{14.67\left(\frac{1}{5}+\frac{1}{5}\right)} = \sqrt{14.67\cdot 0.4} = \sqrt{5.868} = 2.422$
For each pair $(i,j)$ we test $H_0^{(ij)}:\mu_i=\mu_j$ using
$t_{ij} = \frac{\bar X_i - \bar X_j}{\text{SE}}$
Bonferroni critical value (two-sided) is:
$t_{\text{crit}} = t_{1-\alpha_{\text{Bonf}}/2,\ df} = t_{1-0.01667/2,\ 12} = t_{0.991667,\ 12} = 2.779$
So the critical difference in means is:
$|\bar X_i-\bar X_j| > t_{\text{crit}}\cdot \text{SE} = 2.779\cdot 2.422 = 6.732$
Bonferroni-adjusted p-values: $p^{\text{Bonf}}_{ij}=\min(3p_{ij},1)$
| Comparison | Mean diff | SE | $t$ | Raw p-value | Bonferroni p-value | Reject at $\alpha=0.05$ (Bonferroni)? |
|---|---|---|---|---|---|---|
| System 2 − System 1 | 7 | 2.422 | 2.890 | 0.01358 | 0.04075 | Yes |
| System 3 − System 1 | 4 | 2.422 | 1.651 | 0.12459 | 0.37378 | No |
| System 2 − System 3 | 3 | 2.422 | 1.238 | 0.23923 | 0.71770 | No |
(Equivalently, using the critical-difference rule $|\bar X_i-\bar X_j|>6.732$, only the difference 7 is significant.)
At overall significance level $\alpha=0.05$ with Bonferroni correction, only the System 2 vs System 1 comparison is significant.
So, the evidence (under Bonferroni control of family-wise error) points to System 2 having higher mean efficiency than System 1, while System 3 is not conclusively different from either.
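The table above can be reproduced with scipy (the exact $S_E^2 = 176/12$ is used to avoid rounding):

```python
# Reproduce the Bonferroni comparisons for the fuel-injection example.
import math
from scipy.stats import t

mse, dfw, n, m, alpha = 176 / 12, 12, 5, 3, 0.05
means = {"S1": 49.0, "S2": 56.0, "S3": 53.0}
se = math.sqrt(mse * (1 / n + 1 / n))  # ≈ 2.422

results = {}
for a, b in [("S2", "S1"), ("S3", "S1"), ("S2", "S3")]:
    t_obs = (means[a] - means[b]) / se
    p_raw = 2 * t.sf(abs(t_obs), dfw)       # two-sided raw p-value
    p_bonf = min(m * p_raw, 1.0)            # Bonferroni adjustment
    results[f"{a}-{b}"] = p_bonf < alpha
    print(f"{a}-{b}: t = {t_obs:.3f}, p_raw = {p_raw:.5f}, "
          f"p_bonf = {p_bonf:.5f}, reject = {p_bonf < alpha}")
```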