Correlation measures the strength and direction of association between two variables.
We study three main coefficients: Pearson, Spearman, and Kendall.
Let $(X, Y)$ be two random variables.
A correlation coefficient $\rho$ is a number in $[-1,1]$: values near $1$ indicate strong positive association, values near $-1$ strong negative association, and values near $0$ weak association.
The Pearson correlation coefficient is defined as:
$$ \rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}, $$

where $\mathrm{Cov}(X,Y)$ is the covariance of $X$ and $Y$, and $\sigma_X$, $\sigma_Y$ are their standard deviations.
Given data $(x_1,y_1), \dots, (x_n,y_n)$:
$$ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} $$

Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a sample.
Denote by $R_i$ the rank of $X_i$ among $X_1,\dots,X_n$, and by $S_i$ the rank of $Y_i$ among $Y_1,\dots,Y_n$.
Then the Spearman rank correlation coefficient is defined as:
$$ \rho_s = \frac{\sum_{i=1}^n (R_i - \bar{R})(S_i - \bar{S})} {\sqrt{\sum_{i=1}^n (R_i - \bar{R})^2 \sum_{i=1}^n (S_i - \bar{S})^2}} $$

where $\bar{R}$ and $\bar{S}$ are the mean ranks.
This is simply the Pearson correlation applied to the ranks:
$$ \rho_s = \mathrm{Corr}(R(X), R(Y)) $$

If there are no ties (all ranks are distinct), then:
$$ \rho_s = 1 - \frac{6}{n^3 - n} \sum_{i=1}^n (R_i - S_i)^2 $$

When there are no ties, the ranks satisfy:

$$ R_i, S_i \in \{1, 2, \dots, n\} $$

So:

$$ \bar{R} = \bar{S} = \frac{n+1}{2}. $$

We compute:

$$ \sum_{i=1}^n (R_i - \bar{R})^2 = \sum_{i=1}^n \left(R_i - \frac{n+1}{2}\right)^2. $$

Using the known formula

$$ \sum_{i=1}^n R_i^2 = \frac{n(n+1)(2n+1)}{6}, $$

we obtain:

$$ \sum_{i=1}^n (R_i - \bar{R})^2 = \frac{n(n^2 - 1)}{12}. $$

The same holds for $S_i$.
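The formula $\sum_{i=1}^n (R_i - \bar{R})^2 = n(n^2-1)/12$ is easy to sanity-check numerically; a minimal sketch:

```python
import numpy as np

# Without ties, the ranks are a permutation of 1..n, so the sum of squared
# deviations from the mean rank (n+1)/2 should equal n(n^2 - 1)/12.
for n in (5, 10, 100):
    ranks = np.arange(1, n + 1)
    lhs = np.sum((ranks - (n + 1) / 2) ** 2)
    rhs = n * (n**2 - 1) / 12
    assert np.isclose(lhs, rhs)
```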
Consider:

$$ \sum_{i=1}^n (R_i - \bar{R})(S_i - \bar{S}) $$

Use the identity (valid here because $\bar{R} = \bar{S}$, so $R_i - S_i = (R_i - \bar{R}) - (S_i - \bar{S})$):

$$ (R_i - S_i)^2 = (R_i - \bar{R})^2 + (S_i - \bar{S})^2 - 2(R_i - \bar{R})(S_i - \bar{S}) $$

Summing over $i$:

$$ \sum (R_i - S_i)^2 = \sum (R_i - \bar{R})^2 + \sum (S_i - \bar{S})^2 - 2 \sum (R_i - \bar{R})(S_i - \bar{S}) $$

Since both sums of squared deviations are equal,

$$ \sum (R_i - \bar{R})^2 = \sum (S_i - \bar{S})^2 = \frac{n(n^2 - 1)}{12}, $$

we get:

$$ \sum (R_i - \bar{R})(S_i - \bar{S}) = \frac{1}{2} \left[ 2 \cdot \frac{n(n^2 - 1)}{12} - \sum (R_i - S_i)^2 \right] = \frac{n(n^2 - 1)}{12} - \frac{1}{2} \sum (R_i - S_i)^2. $$

Recall:

$$ \rho_s = \frac{\sum (R_i - \bar{R})(S_i - \bar{S})} {\sqrt{\sum (R_i - \bar{R})^2 \sum (S_i - \bar{S})^2}} $$

The denominator is:

$$ \sqrt{ \left(\frac{n(n^2 - 1)}{12}\right)^2 } = \frac{n(n^2 - 1)}{12} $$

So:

$$ \rho_s = \frac{ \frac{n(n^2 - 1)}{12} - \frac{1}{2} \sum (R_i - S_i)^2 } { \frac{n(n^2 - 1)}{12} } = 1 - \frac{6}{n(n^2 - 1)} \sum (R_i - S_i)^2, $$

or equivalently:

$$ \rho_s = 1 - \frac{6}{n^3 - n} \sum_{i=1}^n (R_i - S_i)^2. $$

If ties exist, this simplified formula is no longer exact.
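The no-ties shortcut can be checked against the direct definition (Pearson correlation of the ranks); a small sketch using `scipy.stats.rankdata`:

```python
import numpy as np
from scipy.stats import rankdata, pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=30)          # continuous draws: ties occur with probability zero
y = x + rng.normal(size=30)

R, S = rankdata(x), rankdata(y)
n = len(x)

# Shortcut formula (valid without ties)
rho_shortcut = 1 - 6 * np.sum((R - S) ** 2) / (n**3 - n)

# Direct definition: Pearson correlation applied to the ranks
rho_direct = pearsonr(R, S)[0]

assert np.isclose(rho_shortcut, rho_direct)
assert np.isclose(rho_shortcut, spearmanr(x, y)[0])
```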
In that case, use the definition directly (with average ranks assigned to tied values):

$$ \rho_s = \mathrm{Corr}(R(X), R(Y)) $$

Kendall's tau is based on comparing pairs of observations. For each pair $(i,j)$ with $i<j$, the pair is called *concordant* if $(x_i - x_j)(y_i - y_j) > 0$ and *discordant* if $(x_i - x_j)(y_i - y_j) < 0$. Then

$$ \tau = \frac{C - D}{\binom{n}{2}}, $$

where $C$ and $D$ are the numbers of concordant and discordant pairs, respectively.
| Property | Pearson | Spearman | Kendall |
|---|---|---|---|
| Measures | Linear dependence | Monotonic dependence | Pairwise agreement |
| Uses | Raw values | Ranks | Pair comparisons |
| Sensitive to outliers | Yes | Less | Very low |
| Captures nonlinear | No | Yes (monotonic) | Yes (monotonic) |
| Interpretation | Covariance-based | Rank correlation | Probability of concordance |
| Efficiency (normal data) | Highest | Medium | Lower |
| Robustness | Low | Medium | High |
Correlation does NOT imply causation.
Also: $$ \rho = 0 \nRightarrow X \text{ and } Y \text{ are independent} $$
Example: $Y = X^2$ with symmetric $X$ gives zero Pearson correlation but strong dependence.
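This example is easy to verify numerically; a minimal sketch with a symmetric grid and the deterministic relationship $y = x^2$:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.linspace(-3, 3, 201)      # symmetric around 0
y = x**2                         # perfect (deterministic) dependence

# Cov(X, Y) = E[X^3] - E[X] E[X^2] = 0 by symmetry,
# so the Pearson correlation vanishes despite perfect dependence.
r = pearsonr(x, y)[0]
assert abs(r) < 1e-8
```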
In this section we illustrate the differences between the Pearson, Spearman, and Kendall coefficients on simulated data. The goal of these plots is to build intuition for when each coefficient succeeds or fails at detecting dependence.
```python
import numpy as np
import pandas as pd
import plotly.express as px
from scipy.stats import pearsonr, spearmanr, kendalltau

def correlation_summary(x, y):
    """Return the Pearson, Spearman, and Kendall coefficients for x and y."""
    pearson = pearsonr(x, y)[0]
    spearman = spearmanr(x, y)[0]
    kendall = kendalltau(x, y)[0]
    return pearson, spearman, kendall
```
Here all three coefficients should be large and positive. Pearson performs especially well because the relationship is linear.
```python
np.random.seed(42)
n = 100
x = np.linspace(0, 10, n)
y = 2 * x + np.random.normal(scale=2, size=n)

pearson, spearman, kendall = correlation_summary(x, y)

df = pd.DataFrame({"x": x, "y": y})
fig = px.scatter(
    df, x="x", y="y", trendline="ols",
    title=f"Linear Relationship<br>Pearson={pearson:.3f}, Spearman={spearman:.3f}, Kendall={kendall:.3f}"
)
fig.show()
```
Here the relationship is increasing, but not linear. So Spearman and Kendall fully capture the monotonic trend, while Pearson understates the strength of the association.
```python
np.random.seed(42)
x = np.linspace(0.1, 10, n)
y = np.log(x) + np.random.normal(scale=0.08, size=n)

pearson, spearman, kendall = correlation_summary(x, y)

df = pd.DataFrame({"x": x, "y": y})
fig = px.scatter(
    df, x="x", y="y",
    title=f"Monotonic but Nonlinear Relationship<br>Pearson={pearson:.3f}, Spearman={spearman:.3f}, Kendall={kendall:.3f}"
)
fig.show()
```
This is a key example.
Take something like $y = x^2 + \text{noise}$ with $x$ symmetric around $0$. Then there is a clear dependence, but it is not monotonic.
In such a case all three coefficients are close to zero: all of them can fail to detect dependence when the dependence is neither linear nor monotonic.
```python
np.random.seed(42)
x = np.linspace(-3, 3, n)
y = x**2 + np.random.normal(scale=0.8, size=n)

pearson, spearman, kendall = correlation_summary(x, y)

df = pd.DataFrame({"x": x, "y": y})
fig = px.scatter(
    df, x="x", y="y",
    title=f"Non-monotonic Relationship<br>Pearson={pearson:.3f}, Spearman={spearman:.3f}, Kendall={kendall:.3f}"
)
fig.show()
```
Pearson is very sensitive to outliers. Spearman and Kendall are typically much more stable.
```python
np.random.seed(42)
x = np.linspace(0, 10, n)
y = x + np.random.normal(scale=1.0, size=n)

# add one extreme outlier
x_out = np.append(x, [10])
y_out = np.append(y, [40])

pearson, spearman, kendall = correlation_summary(x_out, y_out)

df = pd.DataFrame({"x": x_out, "y": y_out})
fig = px.scatter(
    df, x="x", y="y", trendline="ols",
    title=f"Linear Relationship with Outlier<br>Pearson={pearson:.3f}, Spearman={spearman:.3f}, Kendall={kendall:.3f}"
)
fig.show()
```
This final plot compares the three coefficients across several common dependence structures.
```python
np.random.seed(42)
n = 200
datasets = {}

# Linear
x1 = np.linspace(0, 10, n)
y1 = 3 * x1 + np.random.normal(scale=3, size=n)
datasets["Linear"] = (x1, y1)

# Monotonic nonlinear
x2 = np.linspace(0.1, 10, n)
y2 = np.sqrt(x2) + np.random.normal(scale=0.12, size=n)
datasets["Monotonic nonlinear"] = (x2, y2)

# Non-monotonic
x3 = np.linspace(-3, 3, n)
y3 = x3**2 + np.random.normal(scale=0.7, size=n)
datasets["Non-monotonic"] = (x3, y3)

# Linear with outlier
x4 = np.linspace(0, 10, n)
y4 = x4 + np.random.normal(scale=1.0, size=n)
x4 = np.append(x4, 10)
y4 = np.append(y4, 40)
datasets["With outlier"] = (x4, y4)

rows = []
for name, (xv, yv) in datasets.items():
    p, s, k = correlation_summary(xv, yv)
    rows.append({"Dataset": name, "Pearson": p, "Spearman": s, "Kendall": k})

corr_df = pd.DataFrame(rows)
corr_long = corr_df.melt(id_vars="Dataset", var_name="Coefficient", value_name="Value")

fig = px.bar(
    corr_long,
    x="Dataset",
    y="Value",
    color="Coefficient",
    barmode="group",
    title="Comparison of Pearson, Spearman, and Kendall Across Different Dependence Structures"
)
fig.show()
```
We now study how to test whether a correlation is statistically significant.
In all cases, the goal is to test:
$$ H_0: \text{no association} \quad \text{vs} \quad H_1: \text{association exists} $$

Depending on the coefficient, this translates into different mathematical hypotheses.
We test:
$$ H_0: \rho = 0 \quad \text{vs} \quad H_1: \rho \ne 0, $$

where $\rho$ is the population Pearson correlation.
Given sample correlation $r$, define:
$$ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} $$

Under $H_0$ (assuming approximate bivariate normality), $t$ follows a Student's $t$-distribution with $n-2$ degrees of freedom.
Reject $H_0$ if:
$$ |t| > t_{n-2,\, 1-\alpha/2}, $$

or equivalently, if the p-value is small.
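The test above is easy to compute by hand; a sketch on simulated data, checked against `scipy.stats.pearsonr` (whose two-sided p-value is mathematically equivalent to the $t$-based one):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)       # test statistic
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)      # two-sided p-value

p_scipy = stats.pearsonr(x, y)[1]
assert np.isclose(p_manual, p_scipy)
```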
For small $n$, the distribution of $\rho_s$ can be computed exactly (via permutations).
For large $n$, we use:
$$ t = \frac{\rho_s \sqrt{n - 2}}{\sqrt{1 - \rho_s^2}} $$

and approximate:

$$ t \approx t_{n-2} $$

Another approximation:

$$ \sqrt{n-1} \, \rho_s \approx \mathcal{N}(0,1) $$

Since Spearman is Pearson on ranks,

$$ \rho_s = \mathrm{Corr}(R(X), R(Y)), $$

we are effectively testing linear correlation between ranks.
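The $t$ approximation above can be sketched directly and compared with `scipy.stats.spearmanr`, whose default p-value is based on the same $t$-distribution approximation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 60
x = rng.normal(size=n)
y = np.exp(x) + rng.normal(scale=0.5, size=n)    # monotonic signal plus noise

rho_s, p_scipy = stats.spearmanr(x, y)

# t approximation for the Spearman test
t = rho_s * np.sqrt(n - 2) / np.sqrt(1 - rho_s**2)
p_t = 2 * stats.t.sf(abs(t), df=n - 2)

assert np.isclose(p_t, p_scipy)
```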
Under $H_0$, for large $n$:
$$ \tau \approx \mathcal{N}(0, \sigma^2), \qquad \sigma^2 = \frac{2(2n+5)}{9n(n-1)}. $$

Then the standardized statistic is

$$ Z = \frac{\tau}{\sigma} \sim \mathcal{N}(0,1). $$

Reject $H_0$ if:

$$ |Z| > z_{1-\alpha/2}. $$

Recall:

$$ \tau = P(\text{concordant}) - P(\text{discordant}), $$

so testing $\tau = 0$ means:

$$ P(\text{concordant}) = P(\text{discordant}) $$

| Feature | Pearson Test | Spearman Test | Kendall Test |
|---|---|---|---|
| Null hypothesis | $\rho=0$ | $\rho_s=0$ | $\tau=0$ |
| Distribution | $t_{n-2}$ | approx $t$ or normal | normal |
| Assumptions | Normality | None | None |
| Measures | Linear dependence | Monotonic dependence | Pairwise concordance |
| Robustness | Low | Medium | High |
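The normal approximation for Kendall's test can be sketched as follows. Note that `scipy.stats.kendalltau` may use an exact method for small samples or a tie-corrected variance, so only rough agreement with its p-value is asserted here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 60
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)

tau, p_scipy = stats.kendalltau(x, y)

# Asymptotic standard deviation of tau under H0 (no ties)
sigma = np.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
z = tau / sigma
p_normal = 2 * stats.norm.sf(abs(z))

# Both p-values should lead to the same conclusion; exact equality is not expected.
assert abs(p_normal - p_scipy) < 0.05
```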
Under normality, the Pearson test is the most efficient of the three.
Failure to reject $H_0$ does NOT imply independence.
It only means that the data do not provide sufficient evidence of association.
The GDP growth rates of Russia for the years 2006–2012
(in percent relative to 2005) are:

$$ 108.2,\ 117.4,\ 123.5,\ 113.9,\ 119.0,\ 124.1,\ 128.4,\ 108.5,\ 118.0,\ 124.2,\ 114.5,\ 119.6,\ 124.6,\ 128.7. $$

The corresponding indicators for Belarus are:

$$ 107,\ 116,\ 118,\ 101,\ 105,\ 111,\ 111,\ 108,\ 117,\ 121,\ 103,\ 108,\ 114,\ 115. $$

Test the hypothesis that these two samples are independent.
Use three association measures: the Pearson, Spearman, and Kendall correlation coefficients.
Interpret the results and state whether there is evidence against independence.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr, kendalltau

# Data
russia = np.array([108.2, 117.4, 123.5, 113.9, 119.0, 124.1, 128.4,
                   108.5, 118.0, 124.2, 114.5, 119.6, 124.6, 128.7])
belarus = np.array([107, 116, 118, 101, 105, 111, 111,
                    108, 117, 121, 103, 108, 114, 115])

# Compute coefficients and p-values
pearson_corr, pearson_p = pearsonr(russia, belarus)
spearman_corr, spearman_p = spearmanr(russia, belarus)
kendall_corr, kendall_p = kendalltau(russia, belarus)

print("Pearson correlation:")
print(f"  coefficient = {pearson_corr:.6f}")
print(f"  p-value = {pearson_p:.6f}\n")

print("Spearman correlation:")
print(f"  coefficient = {spearman_corr:.6f}")
print(f"  p-value = {spearman_p:.6f}\n")

print("Kendall correlation:")
print(f"  coefficient = {kendall_corr:.6f}")
print(f"  p-value = {kendall_p:.6f}")

# Scatter plot
plt.figure(figsize=(7, 5))
plt.scatter(russia, belarus)
plt.xlabel("Russia GDP growth rate")
plt.ylabel("Belarus GDP growth rate")
plt.title("Scatter Plot of GDP Growth Rates")
plt.grid(True)
plt.show()
```
```
Pearson correlation:
  coefficient = 0.554708
  p-value = 0.039516

Spearman correlation:
  coefficient = 0.537446
  p-value = 0.047474

Kendall correlation:
  coefficient = 0.411136
  p-value = 0.042188
```
We are given two samples of equal size:
$$ X = \text{GDP growth rates of Russia}, \qquad Y = \text{GDP growth rates of Belarus}, $$

with sample size $n = 14$. We want to test whether the two samples are independent.
If two variables are independent, then in particular there should be no systematic association between them.
To investigate this, we compute three different coefficients: Pearson's $r$, Spearman's $\rho_s$, and Kendall's $\tau$.

They measure different kinds of association: Pearson detects linear dependence, Spearman monotonic dependence, and Kendall agreement in pairwise orderings.
If all three coefficients are significantly positive or negative, this is evidence against independence.
Russia:
$$ x = (108.2,\,117.4,\,123.5,\,113.9,\,119.0,\,124.1,\,128.4,\,108.5,\,118.0,\,124.2,\,114.5,\,119.6,\,124.6,\,128.7) $$

Belarus:

$$ y = (107,\,116,\,118,\,101,\,105,\,111,\,111,\,108,\,117,\,121,\,103,\,108,\,114,\,115) $$

The sample Pearson correlation coefficient is

$$ r = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n (y_i-\bar{y})^2}}. $$

It measures the strength of the linear relationship between the variables.
For Pearson correlation we test:
$$ H_0: \rho = 0 \qquad \text{vs} \qquad H_1: \rho \ne 0. $$

Under independence, we must have $\rho=0$, so this is a natural test.
If the joint distribution is approximately bivariate normal, then under $H_0$:
$$ T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}. $$

Here $n=14$, so the number of degrees of freedom is $n-2=12$. After computing $r$, we substitute it into this formula and obtain the test statistic.
If the corresponding p-value is small, we reject $H_0$.
Let $R_i$ be the rank of $x_i$ among $x_1,\dots,x_n$, and let $S_i$ be the rank of $y_i$ among $y_1,\dots,y_n$.
The Spearman correlation coefficient is defined by
$$ \rho_s = \frac{\sum_{i=1}^n (R_i-\bar{R})(S_i-\bar{S})} {\sqrt{\sum_{i=1}^n (R_i-\bar{R})^2}\sqrt{\sum_{i=1}^n (S_i-\bar{S})^2}}. $$

So Spearman correlation is simply Pearson correlation applied to the ranks.
Spearman correlation measures whether the relationship is monotone: it equals $1$ for a perfectly increasing relationship and $-1$ for a perfectly decreasing one. It is less sensitive to outliers than Pearson correlation.
We test
$$ H_0: \rho_s = 0 \qquad \text{vs} \qquad H_1: \rho_s \ne 0. $$

For small samples one may use exact permutation distributions; in practice we often use the p-value returned by statistical software.
For every pair $(i,j)$ with $i<j$, compare the relative order of $x_i,x_j$ and $y_i,y_j$.
A pair is *concordant* if $(x_i - x_j)(y_i - y_j) > 0$ and *discordant* if $(x_i - x_j)(y_i - y_j) < 0$.
Kendall's tau is
$$ \tau = \frac{C-D}{\binom{n}{2}}, $$

where $C$ is the number of concordant pairs and $D$ is the number of discordant pairs.
With ties, the corrected version of Kendall's tau is used automatically in software.
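The pair-counting definition can be implemented directly. A sketch below computes tau-a (no tie correction), which agrees with `scipy.stats.kendalltau` on tie-free data:

```python
import numpy as np
from itertools import combinations
from scipy.stats import kendalltau

def kendall_tau_a(x, y):
    """Count concordant and discordant pairs directly (no tie correction)."""
    C = D = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            C += 1
        elif s < 0:
            D += 1
    n = len(x)
    return (C - D) / (n * (n - 1) / 2)

rng = np.random.default_rng(4)
x = rng.normal(size=25)          # continuous draws: no ties
y = x + rng.normal(size=25)
assert np.isclose(kendall_tau_a(x, y), kendalltau(x, y)[0])
```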
Kendall's tau measures the tendency of the two variables to move in the same order.
It has the probabilistic meaning
$$ \tau = P(\text{concordance}) - P(\text{discordance}), $$

at least in the ideal no-tie case.
We test
$$ H_0: \tau = 0 \qquad \text{vs} \qquad H_1: \tau \ne 0. $$

Again, the p-value can be computed directly using software.
After computing the three coefficients, we obtain $r \approx 0.555$, $\rho_s \approx 0.537$, and $\tau \approx 0.411$.
These values are all positive and reasonably large.
So all three methods suggest a positive association between the GDP growth rates of Russia and Belarus in this dataset.
Using the corresponding hypothesis tests, the p-values are small (below standard significance levels such as $0.05$), so we reject the null hypothesis of no association.
Hence, the data provide evidence that the two samples are not independent.
The GDP growth rates of Russia and Belarus show a statistically significant positive association.
Therefore, based on Pearson, Spearman, and Kendall correlation analysis, we reject the hypothesis of independence of the two samples.
Strictly speaking, rejecting $H_0: \rho = 0$ (or $\rho_s = 0$, $\tau = 0$) rules out only zero correlation, not every conceivable form of dependence; but any significant correlation is already incompatible with independence.
Thus, in this problem, the observed positive correlations support the conclusion that the samples are not independent.