Seminar 8

Mann–Whitney U Test (Wilcoxon Rank-Sum Test)

The Mann–Whitney U test is a non-parametric test for comparing two independent samples.
It assesses whether one distribution tends to produce larger values than the other and is a robust alternative to the two-sample t-test.


1. Problem setup¶

Let $$ X_1,\dots,X_{n_1} \sim F, \qquad Y_1,\dots,Y_{n_2} \sim G, $$ where all observations are independent.

The goal is to compare the distributions $F$ and $G$ without assuming normality.


2. Hypotheses¶

  • Null hypothesis $$ H_0: F = G $$

  • Alternative hypothesis $$ H_1: F \neq G $$ (or one-sided variants: $F$ stochastically dominates $G$ or vice versa)

⚠️ Important: this is not a test of equality of means in general.

Null hypothesis (Mann–Whitney U test).

Let $X$ and $Y$ be independent random variables representing observations from the two groups. The Mann–Whitney test is based on the null hypothesis $$ H_0:\; \mathbb P(X<Y) + \tfrac12\,\mathbb P(X=Y) = \tfrac12. $$

This states that a randomly chosen observation from one group is equally likely to be smaller or larger than a randomly chosen observation from the other group, with ties counted as half in each direction. Equivalently, under $H_0$ there is no systematic tendency for one distribution to produce larger values than the other. In the continuous case, where $\mathbb P(X=Y)=0$, this reduces to $\mathbb P(X<Y)=1/2$. The null hypothesis therefore concerns stochastic ordering of the two distributions, not equality of medians, except under additional assumptions such as a pure location shift.
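The probability appearing in this null hypothesis can be estimated directly from data by averaging pairwise comparisons. A minimal sketch (the helper name `theta_hat` is ours, not part of any library):

```python
import numpy as np

def theta_hat(x, y):
    """Plug-in estimate of theta = P(X < Y) + 0.5 * P(X = Y),
    averaging the comparison over all (i, j) pairs."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return float(np.mean((x < y) + 0.5 * (x == y)))

# Under H0 (identical distributions) theta equals 1/2; here X tends to
# fall below Y, so the estimate exceeds 1/2.
print(theta_hat([1, 2, 3], [2, 3, 4]))
```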

3. Assumptions¶

  • Independence within and between samples
  • Continuous distributions (no ties, for exact theory)
  • Identical distributional shapes across groups (needed for the location-shift interpretation)

4. Test statistic¶

Rank-based form¶

  1. Pool all observations $X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2}$
  2. Rank them from smallest to largest (average ranks in case of ties)

The Mann–Whitney statistics can be written as $$ U_X = R_X - \frac{n_1(n_1+1)}{2}, \qquad U_Y = R_Y - \frac{n_2(n_2+1)}{2}, $$ where $$ R_X = \sum_{i=1}^{n_1} R(X_i), \qquad R_Y = \sum_{j=1}^{n_2} R(Y_j) $$ are the rank sums of the $X$ and $Y$ samples, respectively. Since $R_X + R_Y = \frac{(n_1+n_2)(n_1+n_2+1)}{2}$, the two statistics satisfy $U_X + U_Y = n_1 n_2$.

The test statistic used in the Mann–Whitney test is $$ U = \min(U_X, U_Y). $$

This symmetrization ensures invariance under relabeling of the two samples.

Notice that standard tables of critical values are given for the two-tailed test: we reject $H_0$ when the statistic $U = \min(U_X, U_Y)$ is less than or equal to the tabulated critical value, and fail to reject $H_0$ when $U$ is greater than the critical value.
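The rank-based recipe can be sketched in a few lines, using SciPy's `rankdata` for midranks and the convention $U_X = R_X - n_1(n_1+1)/2$ used throughout this seminar (the helper name `u_statistics` is illustrative):

```python
import numpy as np
from scipy.stats import rankdata

def u_statistics(x, y):
    """Compute U_X, U_Y and U = min(U_X, U_Y) from midranks of the pooled sample."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    ranks = rankdata(np.concatenate([x, y]))  # average ranks in case of ties
    R_x, R_y = ranks[:n1].sum(), ranks[n1:].sum()
    U_x = R_x - n1 * (n1 + 1) / 2
    U_y = R_y - n2 * (n2 + 1) / 2
    return U_x, U_y, min(U_x, U_y)

U_x, U_y, U = u_statistics([1.2, 3.4, 5.6], [2.1, 4.3, 6.5, 7.0])
print(U_x, U_y, U)  # note U_x + U_y == n1 * n2
```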


Pairwise-comparison form (theoretical form)¶

Equivalently, $$ U_X = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \mathbf{1}\{X_i > Y_j\}, $$ with ties handled via midranks in practice (each tied pair contributing $\tfrac12$).

This representation is central for the theoretical interpretation of the test.
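A quick numerical check that the pairwise double sum agrees with the rank-based formula $U_X = R_X - n_1(n_1+1)/2$ used in the worked example:

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([1.2, 3.4, 5.6])
y = np.array([2.1, 4.3, 6.5, 7.0])
n1 = len(x)

# Rank-based form
ranks = rankdata(np.concatenate([x, y]))
U_rank = ranks[:n1].sum() - n1 * (n1 + 1) / 2

# Pairwise-comparison form (ties would count as 1/2 each)
U_pair = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)

print(U_rank, U_pair)  # the two forms coincide
```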


5. Relationship between statistics¶

The rank-sum statistic $R_X$ and $U_X$ are affinely related: $$ R_X = U_X + \frac{n_1(n_1+1)}{2}. $$

All formulations $(R_X, U_X, U)$ induce identical tests and p-values, differing only by centering and symmetrization.


6. Exact null distribution (finite sample)¶

Under $H_0$:

  • All $n_1+n_2$ ranks are fixed
  • Every allocation of ranks to the $X$ and $Y$ samples is equally likely

Thus, $U_X$ has an exact permutation distribution depending only on $(n_1,n_2)$.

Formally: $$ \mathbb{P}(U_X = u) = \frac{\#\{\text{rank allocations yielding } u\}}{\binom{n_1+n_2}{n_1}}. $$

This distribution is:

  • discrete
  • distribution-free
  • symmetric about $\frac{n_1 n_2}{2}$

Exact null distribution: numerical example (Mann–Whitney U)¶

Consider two samples:

  • Sample $X$ with size $n_1 = 2$
  • Sample $Y$ with size $n_2 = 3$

Under the null hypothesis $H_0$, the two samples come from the same continuous distribution.


Step 1. Fixed ranks under $H_0$¶

Pool all observations and assign ranks
$1,2,3,4,5$.

Under $H_0$:

  • The ranks themselves are fixed
  • Every allocation of $n_1 = 2$ ranks to sample $X$ is equally likely

Total number of allocations: $$ \binom{n_1+n_2}{n_1} = \binom{5}{2} = 10. $$

Each allocation has probability $1/10$.


Step 2. Definition of the statistic¶

Let $R_X$ be the sum of the ranks assigned to sample $X$.

The Mann–Whitney statistic is defined as $$ U_X = R_X - \frac{n_1(n_1+1)}{2} = R_X - 3. $$


Step 3. Enumerate all rank allocations¶

For $n_1 = 2$, the Mann–Whitney statistic is $$ U_X = R_X - 3. $$

Ranks assigned to $X$    $R_X$    $U_X$
$\{1,2\}$                  3        0
$\{1,3\}$                  4        1
$\{1,4\}$                  5        2
$\{1,5\}$                  6        3
$\{2,3\}$                  5        2
$\{2,4\}$                  6        3
$\{2,5\}$                  7        4
$\{3,4\}$                  7        4
$\{3,5\}$                  8        5
$\{4,5\}$                  9        6

Step 4. Exact null distribution of $U_X$¶

By counting how many allocations produce each value of $U_X$, we obtain:

$u$    Count    $\mathbb{P}(U_X = u)$
0      1        0.1
1      1        0.1
2      2        0.2
3      2        0.2
4      2        0.2
5      1        0.1
6      1        0.1

Formally, $$ \mathbb{P}(U_X = u) = \frac{\#\{\text{rank allocations yielding } u\}}{\binom{5}{2}}. $$
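This enumeration is easy to reproduce programmatically; a short sketch using `itertools.combinations`:

```python
from itertools import combinations
from fractions import Fraction

n1, n2 = 2, 3
N = n1 + n2

# Under H0, every choice of n1 ranks for sample X is equally likely.
counts = {}
for ranks_x in combinations(range(1, N + 1), n1):
    U = sum(ranks_x) - n1 * (n1 + 1) // 2
    counts[U] = counts.get(U, 0) + 1

total = sum(counts.values())  # C(5, 2) = 10
dist = {u: Fraction(c, total) for u, c in sorted(counts.items())}
print(dist)
```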


Step 5. Symmetry¶

Here, $$ n_1 n_2 = 2 \cdot 3 = 6, $$ so the distribution of $U_X$ is symmetric about $$ \frac{n_1 n_2}{2} = 3. $$

Indeed, $$ \mathbb{P}(U_X = 0) = \mathbb{P}(U_X = 6), \quad \mathbb{P}(U_X = 1) = \mathbb{P}(U_X = 5), \quad \mathbb{P}(U_X = 2) = \mathbb{P}(U_X = 4). $$

Moreover, $$ U_Y = n_1 n_2 - U_X = 6 - U_X, $$ so the two Mann–Whitney statistics are complementary for each allocation.


Conclusion¶

This example shows explicitly that under $H_0$:

  • $U_X$ has a finite-sample exact permutation distribution
  • The distribution depends only on $(n_1, n_2)$
  • No assumptions on the underlying population distribution are required

7. Support of the distribution¶

The minimum and maximum possible values of $U_X$ are: $$ U_{X,\min} = 0, \qquad U_{X,\max} = n_1 n_2. $$

Thus: $$ U_X \in \{0,1,\dots,n_1 n_2\}. $$

Each value corresponds to the number of $(X_i,Y_j)$ pairs such that $X_i > Y_j$.


8. Mean and variance under $H_0$¶

Under $H_0$: $$ \mathbb{E}[U_X] = \frac{n_1 n_2}{2}, $$ $$ \mathrm{Var}(U_X) = \frac{n_1 n_2 (n_1+n_2+1)}{12}. $$

The statistic $U=\min(U_X,U_Y)$ has the same null distribution by symmetry.
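For the $n_1=2$, $n_2=3$ example above, both moments can be checked by brute-force enumeration:

```python
from itertools import combinations

n1, n2 = 2, 3
N = n1 + n2

# All equally likely values of U_X under H0
us = [sum(c) - n1 * (n1 + 1) // 2 for c in combinations(range(1, N + 1), n1)]
mean = sum(us) / len(us)
var = sum((u - mean) ** 2 for u in us) / len(us)

print(mean, var)  # n1*n2/2 = 3.0 and n1*n2*(N+1)/12 = 3.0
```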


9. Why the test works (core theoretical reason)¶

Define the population parameter $$ \theta = \mathbb{P}(X > Y) + \tfrac12\, \mathbb{P}(X = Y). $$

Then: $$ \mathbb{E}\!\left[\frac{U_X}{n_1 n_2}\right] = \theta. $$

Under $H_0: F = G$, we have $$ \theta = \tfrac12. $$

Thus, the Mann–Whitney test is a test of $$ H_0:\; \theta = \tfrac12, $$ corresponding to absence of stochastic dominance.


10. U-statistic structure¶

The statistic $U_X$ is a U-statistic with kernel $$ h(x,y) = \mathbf{1}\{x > y\}. $$

By Hoeffding’s theory of U-statistics:

  • $U_X$ is unbiased for $\theta$
  • $U_X$ is consistent
  • $U_X$ is asymptotically normal

11. Asymptotic null distribution (CLT)¶

As $n_1,n_2 \to \infty$: $$ \frac{U_X - \mathbb{E}[U_X]}{\sqrt{\mathrm{Var}(U_X)}} \;\xrightarrow{d}\; N(0,1). $$

This follows from the Hoeffding decomposition and the CLT for U-statistics.
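A seeded Monte Carlo sketch of this convergence: shuffling ranks under $H_0$ and standardizing $U_X$ should give values close to a standard normal (the sample sizes and iteration count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 30, 40
N = n1 + n2
ranks = np.arange(1, N + 1, dtype=float)

mu = n1 * n2 / 2
sigma = np.sqrt(n1 * n2 * (N + 1) / 12)

z = np.empty(20_000)
for b in range(len(z)):
    rng.shuffle(ranks)                         # random rank allocation under H0
    U = ranks[:n1].sum() - n1 * (n1 + 1) / 2   # U_X for this allocation
    z[b] = (U - mu) / sigma

print(round(z.mean(), 3), round(z.std(), 3))   # close to 0 and 1
```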

Dealing with Ties in the Mann–Whitney U Test¶

It is possible that two or more observations take the same value. The Mann–Whitney statistic is still well defined: each tied observation receives the average of the ranks it would otherwise occupy (midranks), which is equivalent to counting each tied $(X_i, Y_j)$ pair as half in each direction.

However, when ties are present, the normal approximation to the distribution of $U$ must be used with a correction to the standard deviation. The adjusted standard deviation of $U$ is

$$ \sigma_U = \sqrt{ \frac{n_x n_y}{N (N - 1)} \left[ \frac{N^3 - N}{12} - \sum_{j=1}^{g} \frac{t_j^3 - t_j}{12} \right] }, $$

where

  • $N = n_x + n_y$,
  • $g$ is the number of groups of tied observations,
  • $t_j$ is the number of tied ranks in group $j$.
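The adjusted standard deviation can be computed directly from the pooled data; a sketch (the helper name `sigma_u_ties` is ours):

```python
import math
from collections import Counter

def sigma_u_ties(x, y):
    """Tie-corrected standard deviation of U for the normal approximation."""
    N = len(x) + len(y)
    # t_j = multiplicity of each distinct value in the pooled sample
    tie_sum = sum(t**3 - t for t in Counter(list(x) + list(y)).values())
    return math.sqrt(len(x) * len(y) / (N * (N - 1))
                     * ((N**3 - N) / 12 - tie_sum / 12))

# With no ties this reduces to sqrt(n_x * n_y * (N + 1) / 12):
print(sigma_u_ties([1, 2, 3, 4], [5, 6, 7]), math.sqrt(4 * 3 * 8 / 12))
```

Ties only shrink the standard deviation, so ignoring the correction makes the normal approximation conservative in the denominator.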

12. Interpretation¶

  • Tests stochastic dominance
  • Detects location shifts when distributional shapes coincide
  • Robust to non-normality
  • Sensitive to changes in distribution shape

13. When the test may mislead¶

  • Strongly different shapes
  • Heteroscedasticity
  • Many ties (requires correction)
  • Dependence between observations

In such cases, rejection does not necessarily correspond to a pure location shift.


14. Relation to the t-test¶

  • If $F$ and $G$ are normal with equal variances, the t-test is more powerful
  • Under heavy tails or outliers, Mann–Whitney is more robust
  • Mann–Whitney does not estimate a mean difference

15. One-sentence summary (exam-perfect)¶

The Mann–Whitney test is an exact, distribution-free U-statistic test of stochastic dominance whose null distribution arises from random rank allocations and converges asymptotically to a normal distribution.


16. One-line intuition¶

The test counts how often observations from one sample are smaller than those from the other.


Problem: Comparison of Student Work Hours¶

Student employees are a major part of employment on most college campuses. Two major departments that participate in student hiring are listed below, along with the number of hours worked by students during a month.

At the 0.05 level of significance, is there sufficient evidence to conclude that there is a difference in the average number of hours worked between the two departments?

Data:

  • Athletics:
    20, 24, 17, 12, 18, 22, 25, 30, 15, 19

  • Library:
    35, 28, 24, 20, 25, 18, 22, 26, 31, 21, 19

In [6]:
# Mann–Whitney U test (ties handled correctly):
# - ALWAYS uses mean ranks for ties
# - If there are NO ties: uses true exact (U-table equivalent) critical values + exact p-value
# - If there ARE ties: U-table critical values are NOT valid, so we use a permutation null
#   (with mean ranks) to get:
#     * permutation p-value (Monte Carlo by default)
#     * permutation critical values (table-like), and decision
# - Also reports normal approximation with tie correction (+ optional continuity correction)

import math
import numpy as np
from scipy.stats import rankdata, norm


# ---------- Exact distribution WITHOUT ties (U-table equivalent) ----------

def _u_distribution_counts_no_ties(n1: int, n2: int):
    """
    Exact distribution of U (for sample 1) under H0 for given (n1,n2),
    assuming continuous data (no ties), i.e., classic U-table setting.

    Returns counts_u for U=0..n1*n2 and total = C(n1+n2, n1)
    """
    N = n1 + n2
    max_w = n1 * (2 * N - n1 + 1) // 2  # max rank-sum W

    dp = [np.zeros(max_w + 1, dtype=np.int64) for _ in range(n1 + 1)]
    dp[0][0] = 1

    for r in range(1, N + 1):
        kmax = min(n1, r)
        for k in range(kmax, 0, -1):
            dp[k][r:max_w + 1] += dp[k - 1][0:max_w + 1 - r]

    counts_w = dp[n1]
    total = int(math.comb(N, n1))

    shift = n1 * (n1 + 1) // 2
    max_u = n1 * n2
    counts_u = counts_w[shift:shift + max_u + 1].copy()

    return counts_u, total


def _exact_p_and_crit_no_ties(U_obs_int: int, n1: int, n2: int, alpha=0.05, alternative="two-sided"):
    counts_u, total = _u_distribution_counts_no_ties(n1, n2)
    probs = counts_u / total
    max_u = n1 * n2

    cdf = np.cumsum(probs)
    sf = np.cumsum(probs[::-1])[::-1]  # P(U >= u)

    if alternative == "less":
        p = float(cdf[U_obs_int])
        crit_low = int(np.max(np.where(cdf <= alpha)[0])) if np.any(cdf <= alpha) else -1
        return p, (crit_low, None)

    if alternative == "greater":
        p = float(sf[U_obs_int])
        crit_high = int(np.min(np.where(sf <= alpha)[0])) if np.any(sf <= alpha) else max_u + 1
        return p, (None, crit_high)

    if alternative == "two-sided":
        a2 = alpha / 2.0
        crit_low = int(np.max(np.where(cdf <= a2)[0])) if np.any(cdf <= a2) else -1
        crit_high = int(np.min(np.where(sf <= a2)[0])) if np.any(sf <= a2) else max_u + 1
        p_one = min(cdf[U_obs_int], sf[U_obs_int])
        p = min(1.0, 2.0 * float(p_one))
        return float(p), (crit_low, crit_high)

    raise ValueError("alternative must be one of: 'two-sided', 'less', 'greater'")


# ---------- Approximation (normal) with tie correction ----------

def _normal_approx(U1: float, pooled: np.ndarray, alternative="two-sided", continuity=True, n1=None, n2=None):
    if n1 is None or n2 is None:
        raise ValueError("n1 and n2 must be provided")

    N = n1 + n2

    # tie correction factor: 1 - sum(t^3 - t)/(N^3 - N)
    _, counts = np.unique(pooled, return_counts=True)
    tie_term = np.sum(counts**3 - counts)
    tie_correction = 1.0 - tie_term / (N**3 - N) if N > 1 else 1.0

    mu = n1 * n2 / 2.0
    var = (n1 * n2 * (N + 1) / 12.0) * tie_correction
    sigma = math.sqrt(var) if var > 0 else 0.0

    if sigma == 0:
        return 0.0, 1.0, mu, sigma, tie_correction

    cc = 0.0
    if continuity:
        if alternative == "less":
            cc = -0.5
        elif alternative == "greater":
            cc = +0.5
        else:
            cc = 0.5 * np.sign(U1 - mu)

    z = (U1 - mu - cc) / sigma

    if alternative == "less":
        p = norm.cdf(z)
    elif alternative == "greater":
        p = 1.0 - norm.cdf(z)
    elif alternative == "two-sided":
        p = 2.0 * min(norm.cdf(z), 1.0 - norm.cdf(z))
    else:
        raise ValueError("alternative must be one of: 'two-sided', 'less', 'greater'")

    return float(z), float(p), float(mu), float(sigma), float(tie_correction)


def _approx_critical(mu, sigma, alpha=0.05, alternative="two-sided"):
    if sigma == 0:
        return (None, None)

    if alternative == "less":
        zc = norm.ppf(alpha)
        return (mu + zc * sigma, None)

    if alternative == "greater":
        zc = norm.ppf(1 - alpha)
        return (None, mu + zc * sigma)

    if alternative == "two-sided":
        zc = norm.ppf(1 - alpha / 2.0)
        return (mu - zc * sigma, mu + zc * sigma)

    raise ValueError("alternative must be one of: 'two-sided', 'less', 'greater'")


# ---------- Permutation null (ties OK because we keep midranks fixed) ----------

def _perm_null_U1_values(pooled: np.ndarray, n1: int, n_perm: int, rng: np.random.Generator):
    """
    Monte Carlo permutation null for U1 using FIXED midranks of the observed pooled data.
    This is the right way to handle ties: ranks are mean ranks and remain attached to values,
    and the null shuffles group labels.
    """
    ranks = rankdata(pooled, method="average")
    N = len(pooled)
    idx = np.arange(N)

    # precompute constant
    shift = n1 * (n1 + 1) / 2.0

    U1_samples = np.empty(n_perm, dtype=float)
    for b in range(n_perm):
        rng.shuffle(idx)
        Rx = ranks[idx[:n1]].sum()
        U1_samples[b] = Rx - shift
    return U1_samples


def _perm_p_and_crit(U1_obs: float, pooled: np.ndarray, n1: int, n2: int, alpha=0.05,
                     alternative="two-sided", n_perm=200_000, seed=0):
    rng = np.random.default_rng(seed)
    U1_null = _perm_null_U1_values(pooled, n1, n_perm, rng)

    # p-values via Monte Carlo tail probabilities
    if alternative == "less":
        p = np.mean(U1_null <= U1_obs)
        crit_low = np.quantile(U1_null, alpha, method="lower")
        return float(p), (float(crit_low), None), U1_null

    if alternative == "greater":
        p = np.mean(U1_null >= U1_obs)
        crit_high = np.quantile(U1_null, 1 - alpha, method="higher")
        return float(p), (None, float(crit_high)), U1_null

    if alternative == "two-sided":
        # two-sided p-value via the min-tail approach
        p_left = np.mean(U1_null <= U1_obs)
        p_right = np.mean(U1_null >= U1_obs)
        p = min(1.0, 2.0 * min(p_left, p_right))

        crit_low = np.quantile(U1_null, alpha / 2.0, method="lower")
        crit_high = np.quantile(U1_null, 1 - alpha / 2.0, method="higher")
        return float(p), (float(crit_low), float(crit_high)), U1_null

    raise ValueError("alternative must be one of: 'two-sided', 'less', 'greater'")


# ---------- Main function ----------

def mann_whitney_u_test(x, y, alpha=0.05, alternative="two-sided",
                        continuity=True, n_perm=200_000, seed=0):
    """
    Mann–Whitney U Test with correct tie handling:
      - mean ranks always
      - exact/table only if NO ties
      - permutation-based (Monte Carlo) "exact-like" p-value + critical values if ties exist
      - normal approx with tie correction always

    Returns a dict with results + decisions.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)

    pooled = np.concatenate([x, y])
    ranks = rankdata(pooled, method="average")
    Rx = ranks[:n1].sum()
    Ry = ranks[n1:].sum()

    U1 = Rx - n1 * (n1 + 1) / 2.0
    U2 = Ry - n2 * (n2 + 1) / 2.0   # U1 + U2 == n1*n2 always: rank sums total N(N+1)/2, even with midranks

    # detect ties in pooled data
    _, counts = np.unique(pooled, return_counts=True)
    has_ties = np.any(counts >= 2)

    # Normal approximation (tie-corrected)
    z, p_approx, mu, sigma, tie_corr = _normal_approx(
        U1, pooled, alternative=alternative, continuity=continuity, n1=n1, n2=n2
    )
    crit_low_a, crit_high_a = _approx_critical(mu, sigma, alpha=alpha, alternative=alternative)

    if alternative == "less":
        reject_approx = (U1 <= crit_low_a) if crit_low_a is not None else False
    elif alternative == "greater":
        reject_approx = (U1 >= crit_high_a) if crit_high_a is not None else False
    else:
        reject_approx = ((crit_low_a is not None and U1 <= crit_low_a) or
                         (crit_high_a is not None and U1 >= crit_high_a))

    # Exact / permutation branch for "table-like" criticals
    if not has_ties:
        # Classic U-table setting (true exact)
        U1_int = int(round(U1))
        p_exact, (crit_low_e, crit_high_e) = _exact_p_and_crit_no_ties(
            U1_int, n1, n2, alpha=alpha, alternative=alternative
        )
        if alternative == "less":
            reject_exact = (U1_int <= crit_low_e)
        elif alternative == "greater":
            reject_exact = (U1_int >= crit_high_e)
        else:
            reject_exact = (U1_int <= crit_low_e) or (U1_int >= crit_high_e)

        exact_block = {
            "mode": "exact_no_ties (U-table equivalent)",
            "p_value": float(p_exact),
            "critical_low": crit_low_e,
            "critical_high": crit_high_e,
            "reject_H0": bool(reject_exact),
            "note": None
        }
    else:
        # Ties present -> use permutation null on fixed mean ranks
        p_perm, (crit_low_p, crit_high_p), _ = _perm_p_and_crit(
            U1, pooled, n1, n2, alpha=alpha, alternative=alternative, n_perm=n_perm, seed=seed
        )
        if alternative == "less":
            reject_perm = (U1 <= crit_low_p)
        elif alternative == "greater":
            reject_perm = (U1 >= crit_high_p)
        else:
            reject_perm = (U1 <= crit_low_p) or (U1 >= crit_high_p)

        exact_block = {
            "mode": f"permutation_with_ties (Monte Carlo, n_perm={n_perm})",
            "p_value": float(p_perm),
            "critical_low": crit_low_p,
            "critical_high": crit_high_p,
            "reject_H0": bool(reject_perm),
            "note": "U-table exact critical values assume no ties. Using permutation null with mean ranks instead."
        }

    return {
        "n1": n1, "n2": n2,
        "alpha": alpha, "alternative": alternative,
        "has_ties": bool(has_ties),
        "U1": float(U1), "U2": float(U2),
        "mean_rank_sum_x": float(Rx),
        "mean_rank_sum_y": float(Ry),
        "exact_or_perm": exact_block,
        "approx": {
            "z": float(z),
            "p_value": float(p_approx),
            "critical_low": float(crit_low_a) if crit_low_a is not None else None,
            "critical_high": float(crit_high_a) if crit_high_a is not None else None,
            "reject_H0": bool(reject_approx),
            "continuity_correction": bool(continuity),
            "tie_correction_factor": float(tie_corr),
            "mu_U": float(mu),
            "sigma_U": float(sigma)
        }
    }
In [7]:
# Cell for Problem 1: Athletics vs Library (two-sided difference)

athletics = [20, 24, 17, 12, 18, 22, 25, 30, 15, 19]
library   = [35, 28, 24, 20, 25, 18, 22, 26, 31, 21, 19]

res = mann_whitney_u_test(
    athletics, library,
    alpha=0.05,
    alternative="two-sided",
    continuity=True,
    n_perm=200_000,
    seed=1
)

print("Mann–Whitney U Test: Athletics vs Library")
print("----------------------------------------")
print(f"n1 = {res['n1']} (Athletics), n2 = {res['n2']} (Library)")
print(f"U1 (Athletics) = {res['U1']:.6g}, U2 (Library) = {res['U2']:.6g}")
print(f"Has ties in pooled data? {res['has_ties']}")

print("\nExact / Permutation (table-like criticals) results")
ex = res["exact_or_perm"]
print(f"  mode        = {ex['mode']}")
print(f"  p-value     = {ex['p_value']:.6g}")
print(f"  critical_low  = {ex['critical_low']}, critical_high = {ex['critical_high']}")
print(f"  decision    = {'REJECT H0' if ex['reject_H0'] else 'FAIL TO REJECT H0'}")
if ex["note"]:
    print(f"  NOTE: {ex['note']}")

print("\nNormal approximation (tie-corrected) results")
ap = res["approx"]
print(f"  z           = {ap['z']:.6g}")
print(f"  p-value     = {ap['p_value']:.6g}")
print(f"  critical_low  = {ap['critical_low']}, critical_high = {ap['critical_high']}")
print(f"  tie correction factor = {ap['tie_correction_factor']:.6g}")
print(f"  decision    = {'REJECT H0' if ap['reject_H0'] else 'FAIL TO REJECT H0'}")
Mann–Whitney U Test: Athletics vs Library
----------------------------------------
n1 = 10 (Athletics), n2 = 11 (Library)
U1 (Athletics) = 30, U2 (Library) = 80
Has ties in pooled data? True

Exact / Permutation (table-like criticals) results
  mode        = permutation_with_ties (Monte Carlo, n_perm=200000)
  p-value     = 0.08106
  critical_low  = 27.5, critical_high = 82.5
  decision    = FAIL TO REJECT H0
  NOTE: U-table exact critical values assume no ties. Using permutation null with mean ranks instead.

Normal approximation (tie-corrected) results
  z           = -1.72861
  p-value     = 0.0838791
  critical_low  = 27.220944824684082, critical_high = 82.77905517531592
  tie correction factor = 0.996104
  decision    = FAIL TO REJECT H0

Simultaneous Confidence Intervals in One-Way ANOVA - Bonferroni and Tukey (HSD / Tukey–Kramer) Methods


1. Motivation¶

In one-way ANOVA we test the global null hypothesis $$ H_0:\quad \mu_1=\mu_2=\cdots=\mu_k . $$

If this hypothesis is rejected, a natural next question is:

Which means differ, and by how much?

Using ordinary (single-parameter) confidence intervals for many comparisons leads to inflated Type I error, because several intervals are examined simultaneously.

Goal: Construct confidence intervals that hold simultaneously for a family of parameters with overall confidence level $1-\alpha$.


2. Model and notation¶

We consider the classical one-way ANOVA model $$ X_{ij} = \mu_i + \varepsilon_{ij}, \qquad i=1,\dots,k,\quad j=1,\dots,n_i, $$ where

  • $\varepsilon_{ij} \stackrel{\text{i.i.d.}}{\sim} N(0,\sigma^2)$,
  • samples are independent,
  • group variances are equal.

Define: $$ \bar X_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}, \qquad N=\sum_{i=1}^k n_i . $$

The Mean Square Error (MSE) is $$ \text{MSE} = \frac{1}{N-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (X_{ij} - \bar X_i)^2 , $$ with $\nu=N-k$ degrees of freedom.


3. What does “simultaneous” mean?¶

Let $\theta_1,\dots,\theta_m$ be parameters of interest (e.g. mean differences).

Intervals $I_1,\dots,I_m$ are simultaneous confidence intervals with level $1-\alpha$ if $$ \mathbb P\big(\theta_1\in I_1,\dots,\theta_m\in I_m\big)\ge 1-\alpha . $$

This is stronger than marginal coverage $$ \mathbb P(\theta_\ell\in I_\ell)\ge 1-\alpha \quad \text{for each } \ell . $$


4. The multiple comparison problem¶

If we construct $m$ ordinary $1-\alpha$ confidence intervals separately, the probability that all of them are simultaneously correct can fall well below $1-\alpha$; when the intervals are independent it equals $$ \mathbb P(\text{all correct}) = (1-\alpha)^m , $$ which is very small for large $m$.

Simultaneous methods control the family-wise error rate (FWER): $$ \mathbb P(\text{at least one false statement}) \le \alpha . $$
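A quick computation of how fast the unadjusted error rate grows, assuming independent intervals:

```python
# Probability that at least one of m independent 95% intervals fails to cover
alpha = 0.05
for m in (1, 5, 20):
    print(m, 1 - (1 - alpha) ** m)
```

With $m = 20$ independent comparisons, the chance of at least one false statement is already about 64%.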


5. Bonferroni simultaneous confidence intervals¶

5.1 Bonferroni inequality¶

For any events $A_1,\dots,A_m$, $$ \mathbb P\Big(\bigcup_{\ell=1}^m A_\ell\Big) \le \sum_{\ell=1}^m \mathbb P(A_\ell). $$

This bound is distribution-free and does not require independence.


5.2 Bonferroni confidence intervals¶

Suppose $\hat\theta_\ell$ estimates $\theta_\ell$ and $$ \frac{\hat\theta_\ell-\theta_\ell} {\widehat{\mathrm{SE}}(\hat\theta_\ell)} \sim t_\nu . $$

Define intervals $$ I_\ell:\quad \hat\theta_\ell \pm t_{1-\alpha/(2m),\nu} \,\widehat{\mathrm{SE}}(\hat\theta_\ell), \qquad \ell=1,\dots,m . $$

Then $$ \mathbb P\big(\theta_1\in I_1,\dots,\theta_m\in I_m\big) \ge 1-\alpha . $$


5.3 Bonferroni CIs for pairwise mean differences¶

For comparisons $\mu_i-\mu_j$, $$ \widehat{\mu_i-\mu_j}=\bar X_i-\bar X_j , $$ with standard error $$ \widehat{\mathrm{SE}}(\bar X_i-\bar X_j) = \sqrt{\text{MSE} \Big(\frac{1}{n_i}+\frac{1}{n_j}\Big)} . $$

If $m=\binom{k}{2}$, the Bonferroni confidence interval is $$ (\bar X_i-\bar X_j) \pm t_{1-\alpha/(2m),\nu} \sqrt{\text{MSE} \Big(\frac{1}{n_i}+\frac{1}{n_j}\Big)} . $$
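A sketch comparing the Bonferroni-adjusted critical value with the unadjusted one (the values $k=3$, $\nu=18$ are chosen to mimic a three-group design):

```python
from scipy import stats

k, nu = 3, 18           # e.g. three groups, N - k = 18 error df
m = k * (k - 1) // 2    # number of pairwise comparisons
alpha = 0.05

t_plain = stats.t.ppf(1 - alpha / 2, nu)        # single-comparison critical value
t_bonf = stats.t.ppf(1 - alpha / (2 * m), nu)   # Bonferroni-adjusted critical value
print(t_plain, t_bonf)  # Bonferroni intervals are wider
```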


5.4 Properties of Bonferroni intervals¶

  • Valid for any collection of contrasts
  • No independence assumption required
  • Often conservative for large $m$
  • Most effective for few, pre-planned comparisons

6. Tukey’s method (HSD / Tukey–Kramer)¶

6.1 Studentized range distribution¶

Let $Z_1,\dots,Z_k \sim N(0,1)$ i.i.d. The studentized range is $$ Q = \frac{\max_i Z_i - \min_i Z_i}{S}, $$ where $S^2$ is an estimator of the variance, independent of the $Z_i$, with $\nu S^2 \sim \chi^2_\nu$.

Its distribution depends on:

  • $k$: number of groups,
  • $\nu$: error degrees of freedom.

Quantiles are denoted $q_{1-\alpha}(k,\nu)$.
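These quantiles are available in SciPy (assuming SciPy ≥ 1.7, which provides `scipy.stats.studentized_range`); a sketch, including the classical identity $q_{1-\alpha}(2,\nu) = \sqrt{2}\, t_{1-\alpha/2,\nu}$:

```python
from math import sqrt
from scipy import stats

k, nu = 3, 18
q = stats.studentized_range.ppf(0.95, k, nu)   # q_{0.95}(3, 18)
print(q)

# Sanity check: for k = 2 the studentized range reduces to sqrt(2) * |t|
q2 = stats.studentized_range.ppf(0.95, 2, nu)
print(q2, sqrt(2) * stats.t.ppf(0.975, nu))
```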


6.2 Tukey HSD (balanced design)¶

Assume equal sample sizes $$ n_1=\cdots=n_k=n . $$

Then $$ \bar X_i-\bar X_j = (\mu_i-\mu_j) + \sigma\sqrt{\frac{2}{n}}\,Z_{ij}, \qquad Z_{ij}\sim N(0,1), $$ and $$ \sqrt{\frac{\text{MSE}}{n}} $$ estimates $\sigma/\sqrt{n}$.

The Tukey HSD confidence interval is $$ (\bar X_i-\bar X_j) \pm q_{1-\alpha}(k,\nu) \sqrt{\frac{\text{MSE}}{n}} . $$

These intervals are simultaneous for all $\binom{k}{2}$ pairwise differences.


6.3 Tukey–Kramer method (unbalanced design)¶

When group sizes differ, the Tukey–Kramer interval is $$ (\bar X_i-\bar X_j) \pm q_{1-\alpha}(k,\nu) \sqrt{ \frac{\text{MSE}}{2} \Big(\frac{1}{n_i}+\frac{1}{n_j}\Big) } . $$

This reduces to Tukey HSD when $n_i=n$.
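A sketch of the two interval half-widths (the function names are ours), confirming the reduction in the balanced case:

```python
import math

def tukey_kramer_halfwidth(q, mse, ni, nj):
    """Half-width of the Tukey–Kramer interval for mu_i - mu_j."""
    return q * math.sqrt(mse / 2 * (1 / ni + 1 / nj))

def tukey_hsd_halfwidth(q, mse, n):
    """Half-width of the balanced Tukey HSD interval."""
    return q * math.sqrt(mse / n)

# With equal group sizes the two coincide (q = 4.0, MSE = 47.9 are illustrative):
print(tukey_kramer_halfwidth(4.0, 47.9, 7, 7), tukey_hsd_halfwidth(4.0, 47.9, 7))
```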


6.4 Properties of Tukey intervals¶

  • Exact FWER control for all pairwise comparisons
  • Shorter than Bonferroni for many groups
  • Requires normality and homoscedasticity
  • Not valid for arbitrary contrasts

7. Relationship to the ANOVA F-test¶

  • ANOVA F-test asks:

    Is there at least one difference among means?

  • Simultaneous confidence intervals ask:

    Which differences exist, and how large are they?

Key facts:

  • A pairwise difference is declared significant iff its simultaneous CI excludes $0$
  • The F-test and Tukey's procedure usually agree, but they are not exactly equivalent: a significant F-test does not guarantee that some Tukey interval excludes $0$, and vice versa

8. Bonferroni vs Tukey: comparison¶

Aspect                 Bonferroni        Tukey (HSD / Kramer)
Comparisons            Arbitrary         All pairwise
Error control          Always valid      Exact under ANOVA
Interval width         Often wider       Usually narrower
Planning               Pre-specified     Exploratory
Variance assumption    None              Equal variances

9. Practical guidance¶

  • Few planned contrasts $\rightarrow$ Bonferroni
  • All pairwise comparisons $\rightarrow$ Tukey
  • Many groups, exploratory analysis $\rightarrow$ Tukey
  • Teaching multiple testing theory $\rightarrow$ Bonferroni first

10. Summary¶

  • Simultaneous confidence intervals control family-wise error
  • Bonferroni is general, simple, and conservative
  • Tukey exploits ANOVA structure for efficient pairwise inference
  • Both methods extend naturally from one-way ANOVA


Problem: One-Way ANOVA (Faculty Ages by Rank)¶

A researcher claims that there is a difference in the average age of assistant professors, associate professors, and full professors at her university.

Faculty members are selected randomly, and their ages are recorded.
Assume that faculty ages are normally distributed.

Test the researcher’s claim at the $\alpha = 0.01$ significance level.

The observed data are:

Rank                   Ages
Assistant Professor    28, 32, 36, 42, 50, 33, 38
Associate Professor    44, 61, 52, 54, 62, 45, 46
Professor              54, 56, 55, 65, 52, 50, 46
In [13]:
import numpy as np
import pandas as pd
from itertools import combinations
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd


def one_way_anova(groups):
    """
    Perform one-way ANOVA and return ANOVA table components.

    Parameters
    ----------
    groups : dict
        keys   = group names
        values = lists or arrays of observations

    Returns
    -------
    dict with:
        - F_stat
        - p_value
        - MSE
        - df_error
        - group_means
        - group_sizes
    """
    group_names = list(groups.keys())
    data = [np.asarray(groups[g]) for g in group_names]

    # ANOVA F-test
    F_stat, p_value = stats.f_oneway(*data)

    # pooled variance (MSE)
    N = sum(len(x) for x in data)
    k = len(data)
    df_error = N - k

    SSE = sum(((x - x.mean())**2).sum() for x in data)
    MSE = SSE / df_error

    group_means = {g: np.mean(groups[g]) for g in group_names}
    group_sizes = {g: len(groups[g]) for g in group_names}

    return {
        "F_stat": F_stat,
        "p_value": p_value,
        "MSE": MSE,
        "df_error": df_error,
        "group_means": group_means,
        "group_sizes": group_sizes
    }


def bonferroni_ci(groups, alpha=0.05):
    """
    Bonferroni simultaneous confidence intervals
    for all pairwise mean differences.
    """
    anova = one_way_anova(groups)
    means = anova["group_means"]
    sizes = anova["group_sizes"]
    MSE   = anova["MSE"]
    df    = anova["df_error"]

    pairs = list(combinations(means.keys(), 2))
    m = len(pairs)

    t_crit = stats.t.ppf(1 - alpha/(2*m), df)

    rows = []
    for g1, g2 in pairs:
        diff = means[g1] - means[g2]
        se = np.sqrt(MSE * (1/sizes[g1] + 1/sizes[g2]))
        ci_low  = diff - t_crit * se
        ci_high = diff + t_crit * se

        rows.append([g1, g2, diff, ci_low, ci_high])

    return pd.DataFrame(
        rows,
        columns=["Group 1", "Group 2", "Mean diff", "CI lower", "CI upper"]
    )


def tukey_ci(groups, alpha=0.05):
    """
    Tukey (HSD / Tukey–Kramer) simultaneous confidence intervals
    for all pairwise mean differences.
    """
    values = []
    labels = []

    for g, obs in groups.items():
        values.extend(obs)
        labels.extend([g] * len(obs))

    values = np.asarray(values)
    labels = np.asarray(labels)

    tukey = pairwise_tukeyhsd(
        endog=values,
        groups=labels,
        alpha=alpha
    )

    return pd.DataFrame(
        tukey._results_table.data[1:],
        columns=tukey._results_table.data[0]
    )
In [14]:
# Data: Faculty ages by rank
groups = {
    "Assistant Professor": [28, 32, 36, 42, 50, 33, 38],
    "Associate Professor": [44, 61, 52, 54, 62, 45, 46],
    "Professor": [54, 56, 55, 65, 52, 50, 46]
}

alpha = 0.01

# --- One-way ANOVA ---
anova_results = one_way_anova(groups)

print("One-way ANOVA")
print("-------------")
print(f"F statistic = {anova_results['F_stat']:.4f}")
print(f"p-value     = {anova_results['p_value']:.6f}")
print(f"MSE         = {anova_results['MSE']:.4f}")
print(f"df_error    = {anova_results['df_error']}")

# --- Bonferroni simultaneous CIs ---
print("\nBonferroni simultaneous confidence intervals")
print("-------------------------------------------")
bonf_ci = bonferroni_ci(groups, alpha=alpha)
display(bonf_ci)

# --- Tukey simultaneous CIs ---
print("\nTukey (HSD / Tukey–Kramer) simultaneous confidence intervals")
print("------------------------------------------------------------")
tukey_results = tukey_ci(groups, alpha=alpha)
display(tukey_results)
One-way ANOVA
-------------
F statistic = 12.6195
p-value     = 0.000375
MSE         = 47.8889
df_error    = 18

Bonferroni simultaneous confidence intervals
-------------------------------------------
Group 1 Group 2 Mean diff CI lower CI upper
0 Assistant Professor Associate Professor -15.0 -27.503932 -2.496068
1 Assistant Professor Professor -17.0 -29.503932 -4.496068
2 Associate Professor Professor -2.0 -14.503932 10.503932
Tukey (HSD / Tukey–Kramer) simultaneous confidence intervals
------------------------------------------------------------
group1 group2 meandiff p-adj lower upper reject
0 Assistant Professor Associate Professor 15.0 0.0020 2.6979 27.3021 True
1 Assistant Professor Professor 17.0 0.0006 4.6979 29.3021 True
2 Associate Professor Professor 2.0 0.8524 -10.3021 14.3021 False