The Mann–Whitney U test is a non-parametric test for comparing two independent samples.
It assesses whether one distribution tends to produce larger values than the other and is a robust alternative to the two-sample t-test.
Let $$ X_1,\dots,X_{n_1} \sim F, \qquad Y_1,\dots,Y_{n_2} \sim G, $$ where all observations are independent.
The goal is to compare the distributions $F$ and $G$ without assuming normality.
Null hypothesis $$ H_0: F = G $$
Alternative hypothesis $$ H_1: F \neq G $$ (or one-sided variants: $F$ stochastically dominates $G$ or vice versa)
⚠️ Important: this is not a test of equality of means in general.
Null hypothesis (Mann–Whitney U test).
Let $X$ and $Y$ be independent random variables representing observations from the two groups. The Mann–Whitney test is based on the null hypothesis $$ H_0:\; \mathbb P(X<Y) + \tfrac12\,\mathbb P(X=Y) = \tfrac12. $$
The Mann–Whitney statistics can be written as $$ U_X = R_X - \frac{n_1(n_1+1)}{2}, \qquad U_Y = R_Y - \frac{n_2(n_2+1)}{2}, $$ where $$ R_X = \sum_{i=1}^{n_1} R(X_i), \qquad R_Y = \sum_{j=1}^{n_2} R(Y_j) $$ are the rank sums of the $X$ and $Y$ samples, respectively. Note that $U_X + U_Y = n_1 n_2$.
The test statistic used in the Mann–Whitney test is $$ U = \min(U_X, U_Y). $$
This symmetrization ensures invariance under relabeling of the two samples.
Notice that tables of critical values are usually given for the two-tailed test: we reject $H_0$ when the observed $U$ is less than or equal to the tabulated critical value, and fail to reject when $U$ exceeds it.
Equivalently, $$ U_X = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \mathbf{1}\{X_i > Y_j\}, $$ with ties handled via midranks in practice.
This representation is central for the theoretical interpretation of the test.
The rank-sum statistic $R_X$ and $U_X$ are affinely related: $$ R_X = U_X + \frac{n_1(n_1+1)}{2}. $$
All formulations $(R_X, U_X, U)$ induce identical tests and p-values, differing only by centering and symmetrization.
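These formulas can be sketched in a few lines; the helper below is illustrative (its name is not from any library) and assumes `scipy.stats.rankdata` is available for the midranking step:

```python
import numpy as np
from scipy.stats import rankdata

def mann_whitney_stats(x, y):
    """Return (U_X, U_Y, U) computed from pooled midranks."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    ranks = rankdata(np.concatenate([x, y]))   # midranks handle ties
    R_x = ranks[:n1].sum()                     # rank sum of the X sample
    U_x = R_x - n1 * (n1 + 1) / 2              # affine relation to the rank sum
    U_y = n1 * n2 - U_x                        # U_X + U_Y = n1 * n2
    return U_x, U_y, min(U_x, U_y)
```

For the two-sided test one then compares $U = \min(U_X, U_Y)$ with the tabulated critical value.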
Under $H_0$, all $\binom{n_1+n_2}{n_1}$ allocations of ranks to the $X$ sample are equally likely. Thus, $U_X$ has an exact permutation distribution depending only on $(n_1,n_2)$.
Formally: $$ \mathbb{P}(U_X = u) = \frac{\#\{\text{rank allocations yielding } u\}}{\binom{n_1+n_2}{n_1}}. $$
This distribution is discrete, symmetric about $n_1 n_2/2$, and free of the underlying distribution $F$.
Consider two samples, of sizes $n_1 = 2$ and $n_2 = 3$.
Under the null hypothesis $H_0$, the two samples come from the same continuous distribution.
Pool all observations and assign ranks
$1,2,3,4,5$.
Under $H_0$, every way of choosing which ranks go to sample $X$ is equally likely.
Total number of allocations: $$ \binom{n_1+n_2}{n_1} = \binom{5}{2} = 10. $$
Each allocation has probability $1/10$.
Let $R_X$ be the sum of the ranks assigned to sample $X$.
The Mann–Whitney statistic is defined as $$ U_X = R_X - \frac{n_1(n_1+1)}{2} = R_X - 3. $$
| Ranks assigned to $X$ | $R_X$ | $U_X$ |
|---|---|---|
| $\{1,2\}$ | 3 | 0 |
| $\{1,3\}$ | 4 | 1 |
| $\{1,4\}$ | 5 | 2 |
| $\{1,5\}$ | 6 | 3 |
| $\{2,3\}$ | 5 | 2 |
| $\{2,4\}$ | 6 | 3 |
| $\{2,5\}$ | 7 | 4 |
| $\{3,4\}$ | 7 | 4 |
| $\{3,5\}$ | 8 | 5 |
| $\{4,5\}$ | 9 | 6 |
By counting how many allocations produce each value of $U_X$, we obtain:
| $u$ | Count | $\mathbb{P}(U_X = u)$ |
|---|---|---|
| 0 | 1 | 0.1 |
| 1 | 1 | 0.1 |
| 2 | 2 | 0.2 |
| 3 | 2 | 0.2 |
| 4 | 2 | 0.2 |
| 5 | 1 | 0.1 |
| 6 | 1 | 0.1 |
Formally, $$ \mathbb{P}(U_X = u) = \frac{\#\{\text{rank allocations yielding } u\}}{\binom{5}{2}}. $$
Here, $$ n_1 n_2 = 2 \cdot 3 = 6, $$ so the distribution of $U_X$ is symmetric about $$ \frac{n_1 n_2}{2} = 3. $$
Indeed, $$ \mathbb{P}(U_X = 0) = \mathbb{P}(U_X = 6), \quad \mathbb{P}(U_X = 1) = \mathbb{P}(U_X = 5), \quad \mathbb{P}(U_X = 2) = \mathbb{P}(U_X = 4). $$
Moreover, $$ U_Y = n_1 n_2 - U_X = 6 - U_X, $$ so the two Mann–Whitney statistics are complementary for each allocation.
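The enumeration above is easy to reproduce with the standard library; this short sketch rebuilds the probability table from the ten rank allocations:

```python
from itertools import combinations
from collections import Counter

# Each of the C(5,2) = 10 allocations of ranks {1,...,5} to sample X is
# equally likely under H0; U_X = R_X - 3 for n1 = 2.
counts = Counter(sum(rx) - 3 for rx in combinations(range(1, 6), 2))
null_dist = {u: counts[u] / 10 for u in sorted(counts)}
```

The resulting `null_dist` matches the table, including the symmetry $\mathbb{P}(U_X = u) = \mathbb{P}(U_X = 6 - u)$.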
This example shows explicitly that under $H_0$ the distribution of $U_X$ is discrete, symmetric, and determined entirely by $(n_1, n_2)$.
The minimum and maximum possible values of $U_X$ are: $$ U_{X,\min} = 0, \qquad U_{X,\max} = n_1 n_2. $$
Thus: $$ U_X \in \{0,1,\dots,n_1 n_2\}. $$
Each value corresponds to the number of $(X_i,Y_j)$ pairs such that $X_i > Y_j$.
Under $H_0$: $$ \mathbb{E}[U_X] = \frac{n_1 n_2}{2}, $$ $$ \mathrm{Var}(U_X) = \frac{n_1 n_2 (n_1+n_2+1)}{12}. $$
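These moment formulas can be checked against the exact permutation distribution by brute-force enumeration; the sketch below uses illustrative sizes $n_1 = 4$, $n_2 = 5$:

```python
from itertools import combinations

def exact_U_values(n1, n2):
    """U_X over all equally likely allocations of ranks 1..n1+n2 to sample X."""
    N = n1 + n2
    offset = n1 * (n1 + 1) / 2
    return [sum(c) - offset for c in combinations(range(1, N + 1), n1)]

n1, n2 = 4, 5
vals = exact_U_values(n1, n2)
mean_U = sum(vals) / len(vals)                            # expect n1*n2/2 = 10
var_U = sum((v - mean_U) ** 2 for v in vals) / len(vals)  # expect n1*n2*(n1+n2+1)/12
```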
By the symmetry of $U_X$ about $n_1 n_2/2$, the null distribution of the statistic $U=\min(U_X,U_Y)$ follows directly.
Define the population parameter $$ \theta = \mathbb{P}(X > Y) + \tfrac12\, \mathbb{P}(X = Y). $$
Then: $$ \mathbb{E}\!\left[\frac{U_X}{n_1 n_2}\right] = \theta. $$
Under $H_0: F = G$, we have $$ \theta = \tfrac12. $$
Thus, the Mann–Whitney test is a test of $$ H_0:\; \theta = \tfrac12, $$ corresponding to absence of stochastic dominance.
The statistic $U_X$ is a two-sample U-statistic with kernel $$ h(x,y) = \mathbf{1}\{x > y\}. $$
By Hoeffding’s theory of U-statistics:
As $n_1,n_2 \to \infty$: $$ \frac{U_X - \mathbb{E}[U_X]}{\sqrt{\mathrm{Var}(U_X)}} \;\xrightarrow{d}\; N(0,1). $$
This follows from the Hoeffding decomposition and the CLT for U-statistics.
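A quick simulation illustrates the normal limit: under $H_0$ the standardized statistic has mean $\approx 0$ and variance $\approx 1$. This sketch uses only the standard library; the sample sizes, seed, and replication count are arbitrary choices:

```python
import math, random

random.seed(1)
n1, n2 = 30, 40
mu = n1 * n2 / 2
sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

def u_x(x, y):
    """U_X = R_X - n1(n1+1)/2 for continuous data (ties have probability 0)."""
    pooled = sorted(x + y)
    rank = {v: r + 1 for r, v in enumerate(pooled)}
    return sum(rank[v] for v in x) - n1 * (n1 + 1) / 2

z = []
for _ in range(2000):
    x = [random.random() for _ in range(n1)]
    y = [random.random() for _ in range(n2)]
    z.append((u_x(x, y) - mu) / sigma)

z_mean = sum(z) / len(z)
z_var = sum(t * t for t in z) / len(z) - z_mean ** 2
```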
It is possible that two or more observations take the same value. In this case, the Mann–Whitney U statistic can still be computed by allocating half of each tie to sample $X$ and half to sample $Y$ (equivalently, by using mean ranks).
However, when ties are present, the normal approximation to the distribution of $U$ must be used with a correction to the standard deviation. The adjusted standard deviation of $U$ is
$$ \sigma_U = \sqrt{ \frac{n_1 n_2}{N (N - 1)} \left[ \frac{N^3 - N}{12} - \sum_{j=1}^{g} \frac{t_j^3 - t_j}{12} \right] }, $$
where $N = n_1 + n_2$, $g$ is the number of tied groups, and $t_j$ is the number of observations in the $j$-th tied group.
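The adjusted standard deviation transcribes directly into code; as a sanity check, with no ties it reduces to $\sqrt{n_1 n_2 (N+1)/12}$ (a sketch using only the standard library):

```python
import math
from collections import Counter

def tie_corrected_sd(x, y):
    """Adjusted standard deviation of U under the normal approximation."""
    n1, n2 = len(x), len(y)
    N = n1 + n2
    # t^3 - t vanishes for t = 1, so only genuine tied groups contribute
    ties = sum(t**3 - t for t in Counter(list(x) + list(y)).values() if t > 1)
    return math.sqrt(n1 * n2 / (N * (N - 1)) * ((N**3 - N) / 12 - ties / 12))
```

Ties can only shrink the standard deviation, so ignoring the correction makes the test conservative in the tails.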
When $F$ and $G$ differ in shape as well as location, rejection does not necessarily correspond to a pure location shift.
The Mann–Whitney test is an exact, distribution-free U-statistic test of stochastic dominance whose null distribution arises from random rank allocations and converges asymptotically to a normal distribution.
The test counts how often observations from one sample are smaller than those from the other.
In one-way ANOVA we test the global null hypothesis $$ H_0:\quad \mu_1=\mu_2=\cdots=\mu_k . $$
If this hypothesis is rejected, a natural next question is:
Which means differ, and by how much?
Using ordinary (single-parameter) confidence intervals for many comparisons leads to inflated Type I error, because several intervals are examined simultaneously.
Goal: Construct confidence intervals that hold simultaneously for a family of parameters with overall confidence level $1-\alpha$.
We consider the classical one-way ANOVA model $$ X_{ij} = \mu_i + \varepsilon_{ij}, \qquad i=1,\dots,k,\quad j=1,\dots,n_i, $$ where $\mu_i$ is the unknown mean of group $i$ and the errors $\varepsilon_{ij}$ are i.i.d. $N(0,\sigma^2)$.
Define: $$ \bar X_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}, \qquad N=\sum_{i=1}^k n_i . $$
The Mean Square Error (MSE) is $$ \text{MSE} = \frac{1}{N-k} \sum_{i=1}^{k}\sum_{j=1}^{n_i} (X_{ij}-\bar X_i)^2 , $$ with $\nu=N-k$ degrees of freedom.
Let $\theta_1,\dots,\theta_m$ be parameters of interest (e.g. mean differences).
Intervals $I_1,\dots,I_m$ are simultaneous confidence intervals with level $1-\alpha$ if $$ \mathbb P\big(\theta_1\in I_1,\dots,\theta_m\in I_m\big)\ge 1-\alpha . $$
This is stronger than marginal coverage $$ \mathbb P(\theta_\ell\in I_\ell)\ge 1-\alpha \quad \text{for each } \ell . $$
If we construct $m$ ordinary $1-\alpha$ confidence intervals and they happen to be independent, then $$ \mathbb P(\text{all correct}) = (1-\alpha)^m , $$ which can be far below $1-\alpha$ for large $m$.
Simultaneous methods control the family-wise error rate (FWER): $$ \mathbb P(\text{at least one false statement}) \le \alpha . $$
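The gap between naive and adjusted joint coverage is easy to tabulate in the independent-intervals case; $\alpha = 0.05$ here is only illustrative:

```python
alpha = 0.05
for m in (1, 5, 10, 20):
    naive = (1 - alpha) ** m        # joint coverage of unadjusted intervals
    bonf = (1 - alpha / m) ** m     # each interval at level 1 - alpha/m
    print(f"m={m:2d}  naive={naive:.4f}  adjusted={bonf:.4f}")
```

Already at $m = 10$ the naive joint coverage drops below 60%, while the adjusted intervals stay above $1-\alpha$.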
For any events $A_1,\dots,A_m$, $$ \mathbb P\Big(\bigcup_{\ell=1}^m A_\ell\Big) \le \sum_{\ell=1}^m \mathbb P(A_\ell). $$
This bound is distribution-free and does not require independence.
Suppose $\hat\theta_\ell$ estimates $\theta_\ell$ and $$ \frac{\hat\theta_\ell-\theta_\ell} {\widehat{\mathrm{SE}}(\hat\theta_\ell)} \sim t_\nu . $$
Define intervals $$ I_\ell:\quad \hat\theta_\ell \pm t_{1-\alpha/(2m),\nu} \,\widehat{\mathrm{SE}}(\hat\theta_\ell), \qquad \ell=1,\dots,m . $$
Then $$ \mathbb P\big(\theta_1\in I_1,\dots,\theta_m\in I_m\big) \ge 1-\alpha , $$ since each interval fails with probability $\alpha/m$ and the union bound caps the total failure probability at $\alpha$.
For comparisons $\mu_i-\mu_j$, $$ \widehat{\mu_i-\mu_j}=\bar X_i-\bar X_j , $$ with standard error $$
\sqrt{\text{MSE} \Big(\frac{1}{n_i}+\frac{1}{n_j}\Big)} . $$
If $m=\binom{k}{2}$, the Bonferroni confidence interval is $$ (\bar X_i-\bar X_j) \pm t_{1-\alpha/(2m),\nu} \sqrt{\text{MSE} \Big(\frac{1}{n_i}+\frac{1}{n_j}\Big)} . $$
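A minimal sketch of these intervals, assuming `scipy` is available for the $t$ quantile; the function name and return format are illustrative, not from any library:

```python
import numpy as np
from scipy import stats

def bonferroni_pairwise(groups, alpha=0.05):
    """Bonferroni simultaneous CIs for all pairwise mean differences."""
    k = len(groups)
    n = [len(g) for g in groups]
    means = [float(np.mean(g)) for g in groups]
    nu = sum(n) - k
    mse = sum(((np.asarray(g, float) - m) ** 2).sum()
              for g, m in zip(groups, means)) / nu
    m_comp = k * (k - 1) // 2                     # m = C(k, 2) comparisons
    tcrit = stats.t.ppf(1 - alpha / (2 * m_comp), nu)
    ci = {}
    for i in range(k):
        for j in range(i + 1, k):
            hw = tcrit * np.sqrt(mse * (1 / n[i] + 1 / n[j]))
            d = means[i] - means[j]
            ci[(i, j)] = (d - hw, d + hw)
    return ci
```

An interval that excludes 0 flags the corresponding pair of means as different, with family-wise error at most $\alpha$.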
Let $Z_1,\dots,Z_k \sim N(0,1)$ i.i.d. The studentized range is $$ Q = \frac{\max_i Z_i - \min_i Z_i}{S}, $$ where $S^2$ is an independent variance estimator with $\nu S^2 \sim \chi^2_\nu$.
Its distribution depends on the number of groups $k$ and the degrees of freedom $\nu$; quantiles are denoted $q_{1-\alpha}(k,\nu)$.
Assume equal sample sizes $$ n_1=\cdots=n_k=n . $$
Then $$ \bar X_i - \bar X_j = (\mu_i-\mu_j) + \sigma\sqrt{\frac{2}{n}}\,Z_{ij}, \qquad Z_{ij}\sim N(0,1), $$ and $$ \sqrt{\frac{\text{MSE}}{n}} $$ estimates $\sigma/\sqrt{n}$.
The Tukey HSD confidence interval is $$ (\bar X_i-\bar X_j) \pm q_{1-\alpha}(k,\nu) \sqrt{\frac{\text{MSE}}{n}} . $$
These intervals are simultaneous for all $\binom{k}{2}$ pairwise differences.
When group sizes differ, the Tukey–Kramer interval is $$ (\bar X_i-\bar X_j) \pm q_{1-\alpha}(k,\nu) \sqrt{ \frac{\text{MSE}}{2} \Big(\frac{1}{n_i}+\frac{1}{n_j}\Big) } . $$
This reduces to Tukey HSD when $n_i=n$.
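A sketch of the Tukey–Kramer intervals, using `scipy.stats.studentized_range` for the quantile $q_{1-\alpha}(k,\nu)$; as above, the function name and return format are illustrative:

```python
import numpy as np
from scipy import stats

def tukey_kramer(groups, alpha=0.05):
    """Tukey-Kramer simultaneous CIs for all pairwise mean differences."""
    k = len(groups)
    n = [len(g) for g in groups]
    means = [float(np.mean(g)) for g in groups]
    nu = sum(n) - k
    mse = sum(((np.asarray(g, float) - m) ** 2).sum()
              for g, m in zip(groups, means)) / nu
    q = stats.studentized_range.ppf(1 - alpha, k, nu)  # q_{1-alpha}(k, nu)
    ci = {}
    for i in range(k):
        for j in range(i + 1, k):
            hw = q * np.sqrt(mse / 2 * (1 / n[i] + 1 / n[j]))
            d = means[i] - means[j]
            ci[(i, j)] = (d - hw, d + hw)
    return ci
```

With equal group sizes the half-width reduces to $q_{1-\alpha}(k,\nu)\sqrt{\text{MSE}/n}$, the HSD form above.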
ANOVA F-test asks:
Is there at least one difference among means?
Simultaneous confidence intervals ask:
Which differences exist, and how large are they?
Key facts:
| Aspect | Bonferroni | Tukey (HSD / Kramer) |
|---|---|---|
| Comparisons | Arbitrary | All pairwise |
| Error control | Always valid | Exact under ANOVA |
| Interval width | Often wider | Usually narrower |
| Planning | Pre-specified | Exploratory |
| Variance assumption | None | Equal variances |
A researcher claims that there is a difference in the average age of assistant professors, associate professors, and full professors at her university.
Faculty members are selected randomly, and their ages are recorded.
Assume that faculty ages are normally distributed.
Test the researcher’s claim at the $\alpha = 0.01$ significance level.
The observed data are:
| Academic rank | Ages |
|---|---|
| Assistant Professor | 28, 32, 36, 42, 50, 33, 38 |
| Associate Professor | 44, 61, 52, 54, 62, 45, 46 |
| Professor | 54, 56, 55, 65, 52, 50, 46 |
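Under the stated normality assumption this is a one-way ANOVA of $k = 3$ groups; a sketch using `scipy.stats.f_oneway`:

```python
from scipy import stats

assistant = [28, 32, 36, 42, 50, 33, 38]
associate = [44, 61, 52, 54, 62, 45, 46]
professor = [54, 56, 55, 65, 52, 50, 46]

# One-way ANOVA of H0: equal mean ages across the three ranks
F, p = stats.f_oneway(assistant, associate, professor)
reject = p < 0.01  # reject H0 at the 1% level
```

Here $F$ comes out near 12.6 on $(2, 18)$ degrees of freedom, well past the 1% critical value $F_{0.99}(2,18) \approx 6.0$, so the researcher's claim of a difference in mean ages is supported at $\alpha = 0.01$.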