The Mann–Whitney U test is a non-parametric test for comparing two independent samples.
It assesses whether one distribution tends to produce larger values than the other and is a robust alternative to the two-sample t-test.
Let $$ X_1,\dots,X_{n_1} \sim F, \qquad Y_1,\dots,Y_{n_2} \sim G, $$ where all observations are independent.
The goal is to compare the distributions $F$ and $G$ without assuming normality.
Null hypothesis $$ H_0: F = G $$
Alternative hypothesis $$ H_1: F \neq G $$ (or one-sided variants: $F$ stochastically dominates $G$ or vice versa)
⚠️ Important: this is not a test of equality of means in general.
Null hypothesis (Mann–Whitney U test).
Let $X$ and $Y$ be independent random variables representing observations from the two groups. The Mann–Whitney test is based on the null hypothesis $$ H_0:\; \mathbb P(X<Y) + \tfrac12\,\mathbb P(X=Y) = \tfrac12. $$
The Mann–Whitney statistics can be written as $$ U_X = R_X - \frac{n_1(n_1+1)}{2}, \qquad U_Y = R_Y - \frac{n_2(n_2+1)}{2}, $$ where $$ R_X = \sum_{i=1}^{n_1} R(X_i), \qquad R_Y = \sum_{j=1}^{n_2} R(Y_j) $$ are the rank sums of the $X$ and $Y$ samples, respectively, and $R(\cdot)$ denotes the rank in the pooled sample. Note that $U_X + U_Y = n_1 n_2$.
The test statistic used in the Mann–Whitney test is $$ U = \min(U_X, U_Y). $$
This symmetrization ensures invariance under relabeling of the two samples.
Notice that tables of critical values are usually given for the two-tailed test. In other words, we fail to reject $H_0$ when the observed $U$ statistic is greater than the critical value from the table.
Equivalently, $$ U_X = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \mathbf{1}\{X_i > Y_j\}, $$ with ties handled via midranks (each tied pair contributing $\tfrac12$) in practice.
This representation is central for the theoretical interpretation of the test.
The rank-sum statistic $R_X$ and $U_X$ are affinely related: $$ R_X = U_X + \frac{n_1(n_1+1)}{2}. $$
All formulations $(R_X, U_X, U)$ induce identical tests and p-values, differing only by centering and symmetrization.
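As a quick check of these identities, the sketch below (using two small tie-free samples, invented purely for illustration) computes $R_X$, $U_X$, and $U$ from pooled ranks and verifies both the affine relation and $U_X + U_Y = n_1 n_2$:

```python
import numpy as np
from scipy.stats import rankdata

# Two small tie-free samples (hypothetical values, for illustration only)
x = np.array([1.2, 3.4, 5.6])
y = np.array([2.1, 4.3, 6.5, 7.0])
n1, n2 = len(x), len(y)

ranks = rankdata(np.concatenate([x, y]))   # pooled ranks 1..n1+n2
Rx, Ry = ranks[:n1].sum(), ranks[n1:].sum()

Ux = Rx - n1 * (n1 + 1) / 2    # rank-sum form
Uy = Ry - n2 * (n2 + 1) / 2
U = min(Ux, Uy)                # symmetrized statistic

# Pair-counting form: number of (X_i, Y_j) pairs with X_i > Y_j
Ux_pairs = sum(xi > yj for xi in x for yj in y)

print(Ux, Uy, U, Ux_pairs)     # the two routes to U_X agree
```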
Under $H_0$, every assignment of ranks to the $X$ sample is equally likely. Thus, $U_X$ has an exact permutation distribution depending only on $(n_1,n_2)$.
Formally: $$ \mathbb{P}(U_X = u) = \frac{\#\{\text{rank allocations yielding } u\}}{\binom{n_1+n_2}{n_1}}. $$
This distribution is discrete, symmetric about $n_1 n_2/2$, and free of the underlying distribution $F$.
Consider two samples of sizes $n_1 = 2$ and $n_2 = 3$.
Under the null hypothesis $H_0$, the two samples come from the same continuous distribution.
Pool all observations and assign ranks $1,2,3,4,5$.
Under $H_0$, every allocation of ranks to the two samples is equally likely.
Total number of allocations: $$ \binom{n_1+n_2}{n_1} = \binom{5}{2} = 10. $$
Each allocation has probability $1/10$.
Let $R_X$ be the sum of the ranks assigned to sample $X$.
For $n_1 = 2$, the Mann–Whitney statistic is $$ U_X = R_X - \frac{n_1(n_1+1)}{2} = R_X - 3. $$
| Ranks assigned to $X$ | $R_X$ | $U_X$ |
|---|---|---|
| $\{1,2\}$ | 3 | 0 |
| $\{1,3\}$ | 4 | 1 |
| $\{1,4\}$ | 5 | 2 |
| $\{1,5\}$ | 6 | 3 |
| $\{2,3\}$ | 5 | 2 |
| $\{2,4\}$ | 6 | 3 |
| $\{2,5\}$ | 7 | 4 |
| $\{3,4\}$ | 7 | 4 |
| $\{3,5\}$ | 8 | 5 |
| $\{4,5\}$ | 9 | 6 |
By counting how many allocations produce each value of $U_X$, we obtain:
| $u$ | Count | $\mathbb{P}(U_X = u)$ |
|---|---|---|
| 0 | 1 | 0.1 |
| 1 | 1 | 0.1 |
| 2 | 2 | 0.2 |
| 3 | 2 | 0.2 |
| 4 | 2 | 0.2 |
| 5 | 1 | 0.1 |
| 6 | 1 | 0.1 |
Formally, $$ \mathbb{P}(U_X = u) = \frac{\#\{\text{rank allocations yielding } u\}}{\binom{5}{2}}. $$
Here, $$ n_1 n_2 = 2 \cdot 3 = 6, $$ so the distribution of $U_X$ is symmetric about $$ \frac{n_1 n_2}{2} = 3. $$
Indeed, $$ \mathbb{P}(U_X = 0) = \mathbb{P}(U_X = 6), \quad \mathbb{P}(U_X = 1) = \mathbb{P}(U_X = 5), \quad \mathbb{P}(U_X = 2) = \mathbb{P}(U_X = 4). $$
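The table and the resulting null probabilities can be reproduced by brute-force enumeration of all rank allocations; a minimal sketch for $n_1=2$, $n_2=3$:

```python
from itertools import combinations
from math import comb

n1, n2 = 2, 3
N = n1 + n2

# Enumerate all allocations of ranks {1,...,5} to sample X
counts = {}
for ranks_x in combinations(range(1, N + 1), n1):
    Ux = sum(ranks_x) - n1 * (n1 + 1) // 2   # U_X = R_X - 3
    counts[Ux] = counts.get(Ux, 0) + 1

total = comb(N, n1)                           # 10 equally likely allocations
pmf = {u: c / total for u, c in sorted(counts.items())}
print(pmf)
```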
Moreover, $$ U_Y = n_1 n_2 - U_X = 6 - U_X, $$ so the two Mann–Whitney statistics are complementary for each allocation.
This example shows explicitly that under $H_0$ the distribution of $U_X$ is discrete, symmetric about $n_1 n_2/2$, and determined entirely by $(n_1,n_2)$.
The minimum and maximum possible values of $U_X$ are: $$ U_{X,\min} = 0, \qquad U_{X,\max} = n_1 n_2. $$
Thus: $$ U_X \in \{0,1,\dots,n_1 n_2\}. $$
Each value corresponds to the number of $(X_i,Y_j)$ pairs such that $X_i > Y_j$.
Under $H_0$: $$ \mathbb{E}[U_X] = \frac{n_1 n_2}{2}, $$ $$ \mathrm{Var}(U_X) = \frac{n_1 n_2 (n_1+n_2+1)}{12}. $$
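These moment formulas can be checked against the exact permutation distribution for any small $(n_1, n_2)$; a sketch using exhaustive enumeration (the choice $n_1=3$, $n_2=4$ here is arbitrary):

```python
from itertools import combinations
from math import comb

n1, n2 = 3, 4
N = n1 + n2

# Exact permutation distribution of U_X via enumeration of rank allocations
us = [sum(r) - n1 * (n1 + 1) // 2 for r in combinations(range(1, N + 1), n1)]
total = comb(N, n1)

mean_U = sum(us) / total
var_U = sum(u * u for u in us) / total - mean_U ** 2

# Closed-form moments under H0
mean_formula = n1 * n2 / 2
var_formula = n1 * n2 * (N + 1) / 12

print(mean_U, var_U)
```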
The statistic $U=\min(U_X,U_Y)$ has the same null distribution by symmetry.
Define the population parameter $$ \theta = \mathbb{P}(X > Y) + \tfrac12\, \mathbb{P}(X = Y). $$
Then: $$ \mathbb{E}\!\left[\frac{U_X}{n_1 n_2}\right] = \theta. $$
Under $H_0: F = G$, we have $$ \theta = \tfrac12. $$
Thus, the Mann–Whitney test is a test of $$ H_0:\; \theta = \tfrac12, $$ corresponding to absence of stochastic dominance.
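The plug-in estimate $\hat\theta = U_X/(n_1 n_2)$ can be computed two equivalent ways, by pair counting or through ranks; a sketch with hypothetical data containing one tie (midranks give tied pairs weight $\tfrac12$):

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical samples with a tie (4.0 appears in both groups)
x = np.array([2.0, 4.0, 7.0])
y = np.array([1.0, 4.0, 5.0, 8.0])
n1, n2 = len(x), len(y)

# Rank route: U_X = R_X - n1(n1+1)/2 with midranks
ranks = rankdata(np.concatenate([x, y]), method="average")
Ux = ranks[:n1].sum() - n1 * (n1 + 1) / 2

# Pair-counting route: 1 for X_i > Y_j, 1/2 for X_i = Y_j
Ux_pairs = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)

theta_hat = Ux / (n1 * n2)   # estimates P(X > Y) + (1/2) P(X = Y)
print(Ux, theta_hat)
```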
The statistic $U_X$ is a U-statistic with kernel $$ h(x,y) = \mathbf{1}\{x > y\}. $$
By Hoeffding’s theory of U-statistics:
As $n_1,n_2 \to \infty$: $$ \frac{U_X - \mathbb{E}[U_X]}{\sqrt{\mathrm{Var}(U_X)}} \;\xrightarrow{d}\; N(0,1). $$
This follows from the Hoeffding decomposition and the CLT for U-statistics.
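A quick Monte Carlo check of this normal limit (sample sizes $n_1 = n_2 = 30$ and the seed are arbitrary choices): simulate $U_X$ under $H_0$ by random rank allocation and standardize with the exact moments.

```python
import numpy as np

rng = np.random.default_rng(0)
n1 = n2 = 30
N = n1 + n2
mu = n1 * n2 / 2
sigma = np.sqrt(n1 * n2 * (N + 1) / 12)

# Simulate U_X under H0 by randomly allocating ranks 1..N to sample X
z = np.empty(20_000)
for b in range(20_000):
    perm = rng.permutation(N) + 1              # a random rank allocation
    Ux = perm[:n1].sum() - n1 * (n1 + 1) / 2
    z[b] = (Ux - mu) / sigma

print(z.mean(), z.std())   # should be close to 0 and 1
```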
It is possible that two or more observations take the same value. In this case, the Mann–Whitney U statistic can still be computed by allocating half of each tie to sample $X$ and half to sample $Y$ (equivalently, by using mean ranks).
However, when ties are present, the normal approximation to the distribution of $U$ must be used with a correction to the standard deviation. The adjusted standard deviation of $U$ is
$$ \sigma_U = \sqrt{ \frac{n_1 n_2}{N(N-1)} \left( \frac{N^3 - N}{12} - \sum_{j=1}^{g} \frac{t_j^3 - t_j}{12} \right) }, $$
where $N = n_1 + n_2$, $g$ is the number of tied groups, and $t_j$ is the number of observations in the $j$-th tied group.
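A direct transcription of this adjusted standard deviation (the data are hypothetical; when no ties are present the expression reduces to the usual $\sqrt{n_1 n_2 (N+1)/12}$):

```python
import numpy as np

def sigma_u_tie_corrected(x, y):
    """Tie-corrected SD of U: sqrt(n1*n2/(N(N-1)) * ((N^3-N)/12 - sum(t^3-t)/12))."""
    pooled = np.concatenate([x, y])
    n1, n2 = len(x), len(y)
    N = n1 + n2
    _, t = np.unique(pooled, return_counts=True)   # tie-group sizes t_j
    tie_sum = np.sum(t**3 - t) / 12.0
    return np.sqrt(n1 * n2 / (N * (N - 1)) * ((N**3 - N) / 12.0 - tie_sum))

x = np.array([1.0, 2.0, 2.0, 5.0])   # hypothetical sample with a tie at 2.0
y = np.array([2.0, 3.0, 4.0])
n1, n2, N = len(x), len(y), 7

sigma_no_ties = np.sqrt(n1 * n2 * (N + 1) / 12.0)  # uncorrected formula
print(sigma_u_tie_corrected(x, y), sigma_no_ties)  # ties shrink the SD
```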
Under the general alternative $F \neq G$, rejection does not necessarily correspond to a pure location shift.
The Mann–Whitney test is an exact, distribution-free U-statistic test of stochastic dominance whose null distribution arises from random rank allocations and converges asymptotically to a normal distribution.
The test counts how often observations from one sample are smaller than those from the other.
Student employees are a major part of employment on most college campuses. Two major departments that participate in student hiring are listed below, along with the number of hours worked by students during a month.
At the 0.05 level of significance, is there sufficient evidence to conclude that there is a difference in the average number of hours worked between the two departments?
Data:
Athletics:
20, 24, 17, 12, 18, 22, 25, 30, 15, 19
Library:
35, 28, 24, 20, 25, 18, 22, 26, 31, 21, 19
# Mann–Whitney U test (ties handled correctly):
# - ALWAYS uses mean ranks for ties
# - If there are NO ties: uses true exact (U-table equivalent) critical values + exact p-value
# - If there ARE ties: U-table critical values are NOT valid, so we use a permutation null
# (with mean ranks) to get:
# * permutation p-value (Monte Carlo by default)
# * permutation critical values (table-like), and decision
# - Also reports normal approximation with tie correction (+ optional continuity correction)
import math
import numpy as np
from scipy.stats import rankdata, norm
# ---------- Exact distribution WITHOUT ties (U-table equivalent) ----------
def _u_distribution_counts_no_ties(n1: int, n2: int):
"""
Exact distribution of U (for sample 1) under H0 for given (n1,n2),
assuming continuous data (no ties), i.e., classic U-table setting.
Returns counts_u for U=0..n1*n2 and total = C(n1+n2, n1)
"""
N = n1 + n2
max_w = n1 * (2 * N - n1 + 1) // 2 # max rank-sum W
dp = [np.zeros(max_w + 1, dtype=np.int64) for _ in range(n1 + 1)]
dp[0][0] = 1
for r in range(1, N + 1):
kmax = min(n1, r)
for k in range(kmax, 0, -1):
dp[k][r:max_w + 1] += dp[k - 1][0:max_w + 1 - r]
counts_w = dp[n1]
total = int(math.comb(N, n1))
shift = n1 * (n1 + 1) // 2
max_u = n1 * n2
counts_u = counts_w[shift:shift + max_u + 1].copy()
return counts_u, total
def _exact_p_and_crit_no_ties(U_obs_int: int, n1: int, n2: int, alpha=0.05, alternative="two-sided"):
counts_u, total = _u_distribution_counts_no_ties(n1, n2)
probs = counts_u / total
max_u = n1 * n2
cdf = np.cumsum(probs)
sf = np.cumsum(probs[::-1])[::-1] # P(U >= u)
if alternative == "less":
p = float(cdf[U_obs_int])
crit_low = int(np.max(np.where(cdf <= alpha)[0])) if np.any(cdf <= alpha) else -1
return p, (crit_low, None)
if alternative == "greater":
p = float(sf[U_obs_int])
crit_high = int(np.min(np.where(sf <= alpha)[0])) if np.any(sf <= alpha) else max_u + 1
return p, (None, crit_high)
if alternative == "two-sided":
a2 = alpha / 2.0
crit_low = int(np.max(np.where(cdf <= a2)[0])) if np.any(cdf <= a2) else -1
crit_high = int(np.min(np.where(sf <= a2)[0])) if np.any(sf <= a2) else max_u + 1
p_one = min(cdf[U_obs_int], sf[U_obs_int])
p = min(1.0, 2.0 * float(p_one))
return float(p), (crit_low, crit_high)
raise ValueError("alternative must be one of: 'two-sided', 'less', 'greater'")
# ---------- Approximation (normal) with tie correction ----------
def _normal_approx(U1: float, pooled: np.ndarray, alternative="two-sided", continuity=True, n1=None, n2=None):
if n1 is None or n2 is None:
raise ValueError("n1 and n2 must be provided")
N = n1 + n2
# tie correction factor: 1 - sum(t^3 - t)/(N^3 - N)
_, counts = np.unique(pooled, return_counts=True)
tie_term = np.sum(counts**3 - counts)
tie_correction = 1.0 - tie_term / (N**3 - N) if N > 1 else 1.0
mu = n1 * n2 / 2.0
var = (n1 * n2 * (N + 1) / 12.0) * tie_correction
sigma = math.sqrt(var) if var > 0 else 0.0
if sigma == 0:
return 0.0, 1.0, mu, sigma, tie_correction
cc = 0.0
if continuity:
if alternative == "less":
cc = -0.5
elif alternative == "greater":
cc = +0.5
else:
cc = 0.5 * np.sign(U1 - mu)
z = (U1 - mu - cc) / sigma
if alternative == "less":
p = norm.cdf(z)
elif alternative == "greater":
p = 1.0 - norm.cdf(z)
elif alternative == "two-sided":
p = 2.0 * min(norm.cdf(z), 1.0 - norm.cdf(z))
else:
raise ValueError("alternative must be one of: 'two-sided', 'less', 'greater'")
return float(z), float(p), float(mu), float(sigma), float(tie_correction)
def _approx_critical(mu, sigma, alpha=0.05, alternative="two-sided"):
if sigma == 0:
return (None, None)
if alternative == "less":
zc = norm.ppf(alpha)
return (mu + zc * sigma, None)
if alternative == "greater":
zc = norm.ppf(1 - alpha)
return (None, mu + zc * sigma)
if alternative == "two-sided":
zc = norm.ppf(1 - alpha / 2.0)
return (mu - zc * sigma, mu + zc * sigma)
raise ValueError("alternative must be one of: 'two-sided', 'less', 'greater'")
# ---------- Permutation null (ties OK because we keep midranks fixed) ----------
def _perm_null_U1_values(pooled: np.ndarray, n1: int, n_perm: int, rng: np.random.Generator):
"""
Monte Carlo permutation null for U1 using FIXED midranks of the observed pooled data.
This is the right way to handle ties: ranks are mean ranks and remain attached to values,
and the null shuffles group labels.
"""
ranks = rankdata(pooled, method="average")
N = len(pooled)
idx = np.arange(N)
# precompute constant
shift = n1 * (n1 + 1) / 2.0
U1_samples = np.empty(n_perm, dtype=float)
for b in range(n_perm):
rng.shuffle(idx)
Rx = ranks[idx[:n1]].sum()
U1_samples[b] = Rx - shift
return U1_samples
def _perm_p_and_crit(U1_obs: float, pooled: np.ndarray, n1: int, n2: int, alpha=0.05,
alternative="two-sided", n_perm=200_000, seed=0):
rng = np.random.default_rng(seed)
U1_null = _perm_null_U1_values(pooled, n1, n_perm, rng)
# p-values via Monte Carlo tail probabilities
if alternative == "less":
p = np.mean(U1_null <= U1_obs)
crit_low = np.quantile(U1_null, alpha, method="lower")
return float(p), (float(crit_low), None), U1_null
if alternative == "greater":
p = np.mean(U1_null >= U1_obs)
crit_high = np.quantile(U1_null, 1 - alpha, method="higher")
return float(p), (None, float(crit_high)), U1_null
if alternative == "two-sided":
mu = np.mean(U1_null)
# two-sided around center using min-tail approach
p_left = np.mean(U1_null <= U1_obs)
p_right = np.mean(U1_null >= U1_obs)
p = min(1.0, 2.0 * min(p_left, p_right))
crit_low = np.quantile(U1_null, alpha / 2.0, method="lower")
crit_high = np.quantile(U1_null, 1 - alpha / 2.0, method="higher")
return float(p), (float(crit_low), float(crit_high)), U1_null
raise ValueError("alternative must be one of: 'two-sided', 'less', 'greater'")
# ---------- Main function ----------
def mann_whitney_u_test(x, y, alpha=0.05, alternative="two-sided",
continuity=True, n_perm=200_000, seed=0):
"""
Mann–Whitney U Test with correct tie handling:
- mean ranks always
- exact/table only if NO ties
- permutation-based (Monte Carlo) "exact-like" p-value + critical values if ties exist
- normal approx with tie correction always
Returns a dict with results + decisions.
"""
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)
n1, n2 = len(x), len(y)
pooled = np.concatenate([x, y])
ranks = rankdata(pooled, method="average")
Rx = ranks[:n1].sum()
Ry = ranks[n1:].sum()
U1 = Rx - n1 * (n1 + 1) / 2.0
U2 = Ry - n2 * (n2 + 1) / 2.0 # equals n1*n2 - U1 when no ties; with midranks still holds numerically
# detect ties in pooled data
_, counts = np.unique(pooled, return_counts=True)
has_ties = np.any(counts >= 2)
# Normal approximation (tie-corrected)
z, p_approx, mu, sigma, tie_corr = _normal_approx(
U1, pooled, alternative=alternative, continuity=continuity, n1=n1, n2=n2
)
crit_low_a, crit_high_a = _approx_critical(mu, sigma, alpha=alpha, alternative=alternative)
if alternative == "less":
reject_approx = (U1 <= crit_low_a) if crit_low_a is not None else False
elif alternative == "greater":
reject_approx = (U1 >= crit_high_a) if crit_high_a is not None else False
else:
reject_approx = ((crit_low_a is not None and U1 <= crit_low_a) or
(crit_high_a is not None and U1 >= crit_high_a))
# Exact / permutation branch for "table-like" criticals
if not has_ties:
# Classic U-table setting (true exact)
U1_int = int(round(U1))
p_exact, (crit_low_e, crit_high_e) = _exact_p_and_crit_no_ties(
U1_int, n1, n2, alpha=alpha, alternative=alternative
)
if alternative == "less":
reject_exact = (U1_int <= crit_low_e)
elif alternative == "greater":
reject_exact = (U1_int >= crit_high_e)
else:
reject_exact = (U1_int <= crit_low_e) or (U1_int >= crit_high_e)
exact_block = {
"mode": "exact_no_ties (U-table equivalent)",
"p_value": float(p_exact),
"critical_low": crit_low_e,
"critical_high": crit_high_e,
"reject_H0": bool(reject_exact),
"note": None
}
else:
# Ties present -> use permutation null on fixed mean ranks
p_perm, (crit_low_p, crit_high_p), _ = _perm_p_and_crit(
U1, pooled, n1, n2, alpha=alpha, alternative=alternative, n_perm=n_perm, seed=seed
)
if alternative == "less":
reject_perm = (U1 <= crit_low_p)
elif alternative == "greater":
reject_perm = (U1 >= crit_high_p)
else:
reject_perm = (U1 <= crit_low_p) or (U1 >= crit_high_p)
exact_block = {
"mode": f"permutation_with_ties (Monte Carlo, n_perm={n_perm})",
"p_value": float(p_perm),
"critical_low": crit_low_p,
"critical_high": crit_high_p,
"reject_H0": bool(reject_perm),
"note": "U-table exact critical values assume no ties. Using permutation null with mean ranks instead."
}
return {
"n1": n1, "n2": n2,
"alpha": alpha, "alternative": alternative,
"has_ties": bool(has_ties),
"U1": float(U1), "U2": float(U2),
"mean_rank_sum_x": float(Rx),
"mean_rank_sum_y": float(Ry),
"exact_or_perm": exact_block,
"approx": {
"z": float(z),
"p_value": float(p_approx),
"critical_low": float(crit_low_a) if crit_low_a is not None else None,
"critical_high": float(crit_high_a) if crit_high_a is not None else None,
"reject_H0": bool(reject_approx),
"continuity_correction": bool(continuity),
"tie_correction_factor": float(tie_corr),
"mu_U": float(mu),
"sigma_U": float(sigma)
}
}
# Cell for Problem 1: Athletics vs Library (two-sided difference)
athletics = [20, 24, 17, 12, 18, 22, 25, 30, 15, 19]
library = [35, 28, 24, 20, 25, 18, 22, 26, 31, 21, 19]
res = mann_whitney_u_test(
athletics, library,
alpha=0.05,
alternative="two-sided",
continuity=True,
n_perm=200_000,
seed=1
)
print("Mann–Whitney U Test: Athletics vs Library")
print("----------------------------------------")
print(f"n1 = {res['n1']} (Athletics), n2 = {res['n2']} (Library)")
print(f"U1 (Athletics) = {res['U1']:.6g}, U2 (Library) = {res['U2']:.6g}")
print(f"Has ties in pooled data? {res['has_ties']}")
print("\nExact / Permutation (table-like criticals) results")
ex = res["exact_or_perm"]
print(f" mode = {ex['mode']}")
print(f" p-value = {ex['p_value']:.6g}")
print(f" critical_low = {ex['critical_low']}, critical_high = {ex['critical_high']}")
print(f" decision = {'REJECT H0' if ex['reject_H0'] else 'FAIL TO REJECT H0'}")
if ex["note"]:
print(f" NOTE: {ex['note']}")
print("\nNormal approximation (tie-corrected) results")
ap = res["approx"]
print(f" z = {ap['z']:.6g}")
print(f" p-value = {ap['p_value']:.6g}")
print(f" critical_low = {ap['critical_low']}, critical_high = {ap['critical_high']}")
print(f" tie correction factor = {ap['tie_correction_factor']:.6g}")
print(f" decision = {'REJECT H0' if ap['reject_H0'] else 'FAIL TO REJECT H0'}")
Mann–Whitney U Test: Athletics vs Library
----------------------------------------
n1 = 10 (Athletics), n2 = 11 (Library)
U1 (Athletics) = 30, U2 (Library) = 80
Has ties in pooled data? True

Exact / Permutation (table-like criticals) results
  mode = permutation_with_ties (Monte Carlo, n_perm=200000)
  p-value = 0.08106
  critical_low = 27.5, critical_high = 82.5
  decision = FAIL TO REJECT H0
  NOTE: U-table exact critical values assume no ties. Using permutation null with mean ranks instead.

Normal approximation (tie-corrected) results
  z = -1.72861
  p-value = 0.0838791
  critical_low = 27.220944824684082, critical_high = 82.77905517531592
  tie correction factor = 0.996104
  decision = FAIL TO REJECT H0
In one-way ANOVA we test the global null hypothesis $$ H_0:\quad \mu_1=\mu_2=\cdots=\mu_k . $$
If this hypothesis is rejected, a natural next question is:
Which means differ, and by how much?
Using ordinary (single-parameter) confidence intervals for many comparisons leads to inflated Type I error, because several intervals are examined simultaneously.
Goal: Construct confidence intervals that hold simultaneously for a family of parameters with overall confidence level $1-\alpha$.
We consider the classical one-way ANOVA model $$ X_{ij} = \mu_i + \varepsilon_{ij}, \qquad i=1,\dots,k,\quad j=1,\dots,n_i, $$ where
Define: $$ \bar X_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}, \qquad N=\sum_{i=1}^k n_i . $$
The Mean Square Error (MSE) is $$ \text{MSE} = \frac{1}{N-k} \sum_{i=1}^{k}\sum_{j=1}^{n_i} (X_{ij}-\bar X_i)^2 , $$ with $\nu=N-k$ degrees of freedom.
Let $\theta_1,\dots,\theta_m$ be parameters of interest (e.g. mean differences).
Intervals $I_1,\dots,I_m$ are simultaneous confidence intervals with level $1-\alpha$ if $$ \mathbb P\big(\theta_1\in I_1,\dots,\theta_m\in I_m\big)\ge 1-\alpha . $$
This is stronger than marginal coverage $$ \mathbb P(\theta_\ell\in I_\ell)\ge 1-\alpha \quad \text{for each } \ell . $$
If we construct $m$ independent ordinary $1-\alpha$ confidence intervals, then $$ \mathbb P(\text{all correct}) = (1-\alpha)^m , $$ which can be far below $1-\alpha$ for large $m$.
Simultaneous methods control the family-wise error rate (FWER): $$ \mathbb P(\text{at least one false statement}) \le \alpha . $$
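The numbers are easy to check: with $m = 10$ independent 95% intervals the joint coverage drops to about 0.60, while Bonferroni-adjusted $1-\alpha/m$ intervals restore it above $1-\alpha$. A small sketch of this arithmetic:

```python
m, alpha = 10, 0.05

# Joint coverage of m independent unadjusted (1 - alpha) intervals
naive = (1 - alpha) ** m                # well below 1 - alpha

# Bonferroni: each interval at level 1 - alpha/m
bonf_indep = (1 - alpha / m) ** m       # joint coverage under independence
bonf_bound = 1 - m * (alpha / m)        # union-bound guarantee: >= 1 - alpha

print(round(naive, 3), round(bonf_indep, 3), bonf_bound)
```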
For any events $A_1,\dots,A_m$, $$ \mathbb P\Big(\bigcup_{\ell=1}^m A_\ell\Big) \le \sum_{\ell=1}^m \mathbb P(A_\ell). $$
This bound is distribution-free and does not require independence.
Suppose $\hat\theta_\ell$ estimates $\theta_\ell$ and $$ \frac{\hat\theta_\ell-\theta_\ell} {\widehat{\mathrm{SE}}(\hat\theta_\ell)} \sim t_\nu . $$
Define intervals $$ I_\ell:\quad \hat\theta_\ell \pm t_{1-\alpha/(2m),\nu} \,\widehat{\mathrm{SE}}(\hat\theta_\ell), \qquad \ell=1,\dots,m . $$
Then $$ \mathbb P\big(\theta_1\in I_1,\dots,\theta_m\in I_m\big) \ge 1-\alpha . $$
For comparisons $\mu_i-\mu_j$, $$ \widehat{\mu_i-\mu_j}=\bar X_i-\bar X_j , $$ with standard error $$ \widehat{\mathrm{SE}} = \sqrt{\text{MSE}\,\Big(\frac{1}{n_i}+\frac{1}{n_j}\Big)} . $$
If $m=\binom{k}{2}$, the Bonferroni confidence interval is $$ (\bar X_i-\bar X_j) \pm t_{1-\alpha/(2m),\nu} \sqrt{\text{MSE} \Big(\frac{1}{n_i}+\frac{1}{n_j}\Big)} . $$
Let $Z_1,\dots,Z_k \sim N(0,1)$ i.i.d. The studentized range is $$ Q = \frac{\max_i Z_i - \min_i Z_i}{S}, $$ where $S^2$ is a variance estimator independent of the $Z_i$ with $\nu S^2 \sim \chi^2_\nu$.
Its distribution depends only on the number of groups $k$ and the degrees of freedom $\nu$.
Quantiles are denoted $q_{1-\alpha}(k,\nu)$.
Assume equal sample sizes $$ n_1=\cdots=n_k=n . $$
Then $$ \bar X_i-\bar X_j = (\mu_i-\mu_j) + \sigma\sqrt{\frac{2}{n}}\,Z_{ij}, \qquad Z_{ij}\sim N(0,1), $$ and $$ \sqrt{\frac{\text{MSE}}{n}} $$ estimates $\sigma/\sqrt{n}$.
The Tukey HSD confidence interval is $$ (\bar X_i-\bar X_j) \pm q_{1-\alpha}(k,\nu) \sqrt{\frac{\text{MSE}}{n}} . $$
These intervals are simultaneous for all $\binom{k}{2}$ pairwise differences.
When group sizes differ, the Tukey–Kramer interval is $$ (\bar X_i-\bar X_j) \pm q_{1-\alpha}(k,\nu) \sqrt{ \frac{\text{MSE}}{2} \Big(\frac{1}{n_i}+\frac{1}{n_j}\Big) } . $$
This reduces to Tukey HSD when $n_i=n$.
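For the faculty-age example later in these notes ($k=3$ groups of $n=7$, $\nu=18$, $\text{MSE}\approx 47.89$), the Bonferroni and Tukey half-widths can be compared directly. This sketch assumes SciPy ≥ 1.7 for `scipy.stats.studentized_range`:

```python
import numpy as np
from scipy.stats import t, studentized_range

k, n, nu = 3, 7, 18          # groups, per-group size, error df (faculty-age example)
MSE, alpha = 47.8889, 0.01
m = k * (k - 1) // 2         # number of pairwise comparisons

# Tukey HSD half-width: q_{1-alpha}(k, nu) * sqrt(MSE / n)
q = studentized_range.ppf(1 - alpha, k, nu)
hw_tukey = q * np.sqrt(MSE / n)

# Bonferroni half-width: t_{1-alpha/(2m), nu} * sqrt(MSE * (1/n + 1/n))
tc = t.ppf(1 - alpha / (2 * m), nu)
hw_bonf = tc * np.sqrt(MSE * (2 / n))

print(hw_tukey, hw_bonf)     # Tukey is narrower for all-pairs comparisons
```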
ANOVA F-test asks:
Is there at least one difference among means?
Simultaneous confidence intervals ask:
Which differences exist, and how large are they?
Key facts:
| Aspect | Bonferroni | Tukey (HSD / Kramer) |
|---|---|---|
| Comparisons | Arbitrary | All pairwise |
| Error control | Always valid | Exact under ANOVA |
| Interval width | Often wider | Usually narrower |
| Planning | Pre-specified | Exploratory |
| Variance assumption | None | Equal variances |
A researcher claims that there is a difference in the average age of assistant professors, associate professors, and full professors at her university.
Faculty members are selected randomly, and their ages are recorded.
Assume that faculty ages are normally distributed.
Test the researcher’s claim at the $\alpha = 0.01$ significance level.
The observed data are:
| Rank | Ages |
|---|---|
| Assistant Professor | 28, 32, 36, 42, 50, 33, 38 |
| Associate Professor | 44, 61, 52, 54, 62, 45, 46 |
| Professor | 54, 56, 55, 65, 52, 50, 46 |
import numpy as np
import pandas as pd
from itertools import combinations
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
def one_way_anova(groups):
"""
Perform one-way ANOVA and return ANOVA table components.
Parameters
----------
groups : dict
keys = group names
values = lists or arrays of observations
Returns
-------
dict with:
- F_stat
- p_value
- MSE
- df_error
- group_means
- group_sizes
"""
group_names = list(groups.keys())
data = [np.asarray(groups[g]) for g in group_names]
# ANOVA F-test
F_stat, p_value = stats.f_oneway(*data)
# pooled variance (MSE)
N = sum(len(x) for x in data)
k = len(data)
df_error = N - k
SSE = sum(((x - x.mean())**2).sum() for x in data)
MSE = SSE / df_error
group_means = {g: np.mean(groups[g]) for g in group_names}
group_sizes = {g: len(groups[g]) for g in group_names}
return {
"F_stat": F_stat,
"p_value": p_value,
"MSE": MSE,
"df_error": df_error,
"group_means": group_means,
"group_sizes": group_sizes
}
def bonferroni_ci(groups, alpha=0.05):
"""
Bonferroni simultaneous confidence intervals
for all pairwise mean differences.
"""
anova = one_way_anova(groups)
means = anova["group_means"]
sizes = anova["group_sizes"]
MSE = anova["MSE"]
df = anova["df_error"]
pairs = list(combinations(means.keys(), 2))
m = len(pairs)
t_crit = stats.t.ppf(1 - alpha/(2*m), df)
rows = []
for g1, g2 in pairs:
diff = means[g1] - means[g2]
se = np.sqrt(MSE * (1/sizes[g1] + 1/sizes[g2]))
ci_low = diff - t_crit * se
ci_high = diff + t_crit * se
rows.append([g1, g2, diff, ci_low, ci_high])
return pd.DataFrame(
rows,
columns=["Group 1", "Group 2", "Mean diff", "CI lower", "CI upper"]
)
def tukey_ci(groups, alpha=0.05):
"""
Tukey (HSD / Tukey–Kramer) simultaneous confidence intervals
for all pairwise mean differences.
"""
values = []
labels = []
for g, obs in groups.items():
values.extend(obs)
labels.extend([g] * len(obs))
values = np.asarray(values)
labels = np.asarray(labels)
tukey = pairwise_tukeyhsd(
endog=values,
groups=labels,
alpha=alpha
)
return pd.DataFrame(
tukey._results_table.data[1:],
columns=tukey._results_table.data[0]
)
# Data: Faculty ages by rank
groups = {
"Assistant Professor": [28, 32, 36, 42, 50, 33, 38],
"Associate Professor": [44, 61, 52, 54, 62, 45, 46],
"Professor": [54, 56, 55, 65, 52, 50, 46]
}
alpha = 0.01
# --- One-way ANOVA ---
anova_results = one_way_anova(groups)
print("One-way ANOVA")
print("-------------")
print(f"F statistic = {anova_results['F_stat']:.4f}")
print(f"p-value = {anova_results['p_value']:.6f}")
print(f"MSE = {anova_results['MSE']:.4f}")
print(f"df_error = {anova_results['df_error']}")
# --- Bonferroni simultaneous CIs ---
print("\nBonferroni simultaneous confidence intervals")
print("-------------------------------------------")
bonf_ci = bonferroni_ci(groups, alpha=alpha)
display(bonf_ci)
# --- Tukey simultaneous CIs ---
print("\nTukey (HSD / Tukey–Kramer) simultaneous confidence intervals")
print("------------------------------------------------------------")
tukey_results = tukey_ci(groups, alpha=alpha)
display(tukey_results)
One-way ANOVA
-------------
F statistic = 12.6195
p-value = 0.000375
MSE = 47.8889
df_error = 18

Bonferroni simultaneous confidence intervals
-------------------------------------------
| | Group 1 | Group 2 | Mean diff | CI lower | CI upper |
|---|---|---|---|---|---|
| 0 | Assistant Professor | Associate Professor | -15.0 | -27.503932 | -2.496068 |
| 1 | Assistant Professor | Professor | -17.0 | -29.503932 | -4.496068 |
| 2 | Associate Professor | Professor | -2.0 | -14.503932 | 10.503932 |
Tukey (HSD / Tukey–Kramer) simultaneous confidence intervals
------------------------------------------------------------
| | group1 | group2 | meandiff | p-adj | lower | upper | reject |
|---|---|---|---|---|---|---|---|
| 0 | Assistant Professor | Associate Professor | 15.0 | 0.0020 | 2.6979 | 27.3021 | True |
| 1 | Assistant Professor | Professor | 17.0 | 0.0006 | 4.6979 | 29.3021 | True |
| 2 | Associate Professor | Professor | 2.0 | 0.8524 | -10.3021 | 14.3021 | False |