The coefficient of determination measures the proportion of variability in the response variable $Y$ explained by the regression model.
We start from the decomposition:
$$ Y_i = \hat{Y}_i + e_i, $$

where $\hat{Y}_i$ is the fitted value and $e_i = Y_i - \hat{Y}_i$ is the residual. This leads to the decomposition of the total variation,

$$ SST = SSR + SSE, \qquad SST = \sum (Y_i - \bar{Y})^2, \quad SSR = \sum (\hat{Y}_i - \bar{Y})^2, \quad SSE = \sum e_i^2, $$

and the coefficient of determination is defined as

$$ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}. $$
In simple regression: $$ R^2 = r^2 $$
where $r$ is the Pearson correlation coefficient.
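Before moving on, here is a quick numerical check, on made-up illustrative data, that $R^2$ computed from the sums of squares agrees with the squared Pearson correlation:

import numpy as np

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Squared Pearson correlation
r = np.corrcoef(x, y)[0, 1]

# R^2 from the least squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
R2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)

print(f"r^2 = {r**2:.6f}")
print(f"R^2 = {R2:.6f}")  # agrees with r^2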
import numpy as np
import matplotlib.pyplot as plt
def regression_r2_analysis(x, y, title="Regression Example"):
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    n = len(x)
    x_bar = np.mean(x)
    y_bar = np.mean(y)
    # OLS estimators
    Sxx = np.sum((x - x_bar)**2)
    Sxy = np.sum((x - x_bar)*(y - y_bar))
    b1 = Sxy / Sxx
    b0 = y_bar - b1 * x_bar
    # fitted values
    y_hat = b0 + b1 * x
    # sums of squares
    SST = np.sum((y - y_bar)**2)
    SSE = np.sum((y - y_hat)**2)
    SSR = np.sum((y_hat - y_bar)**2)
    R2 = SSR / SST
    # print results
    print(title)
    print("-" * len(title))
    print(f"Intercept b0 = {b0:.4f}")
    print(f"Slope b1 = {b1:.4f}")
    print(f"SST = {SST:.4f}")
    print(f"SSE = {SSE:.4f}")
    print(f"SSR = {SSR:.4f}")
    print(f"R^2 = {R2:.4f}")
    print(f"Check: 1 - SSE/SST = {1 - SSE/SST:.4f}")
    # plot
    x_line = np.linspace(np.min(x), np.max(x), 200)
    y_line = b0 + b1 * x_line
    plt.figure(figsize=(7,5))
    plt.scatter(x, y, label="Data")
    plt.plot(x_line, y_line, label="Least Squares Line")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.title(f"{title}\n$R^2 = {R2:.4f}$")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    return {
        "b0": b0,
        "b1": b1,
        "SST": SST,
        "SSE": SSE,
        "SSR": SSR,
        "R2": R2,
    }
In this first example, the points lie very close to a straight line. Therefore, the residual sum of squares $SSE$ should be small, and so the coefficient of determination $R^2$ should be close to $1$.
This means that the regression line explains most of the variability in the response variable.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
y1 = [2.1, 4.2, 6.1, 8.3, 9.9, 12.2, 13.8, 16.1]
res1 = regression_r2_analysis(x1, y1, title="Example 1: Strong Linear Relationship")
Example 1: Strong Linear Relationship
-------------------------------------
Intercept b0 = 0.2000
Slope b1 = 1.9750
SST = 163.9888
SSE = 0.1625
SSR = 163.8263
R^2 = 0.9990
Check: 1 - SSE/SST = 0.9990
In the next example, the data do not follow a clear straight-line pattern. The least squares line can still be computed, but it explains only a small proportion of the variability in $Y$.
Thus, we expect $R^2$ to be relatively small, possibly close to $0$.
x3 = [1, 2, 3, 4, 5, 6, 7, 8]
y3 = [5.2, 7.8, 4.9, 6.1, 5.7, 7.0, 5.4, 6.3]
res3 = regression_r2_analysis(x3, y3, title="Example 3: Weak Linear Relationship")
Example 3: Weak Linear Relationship
-----------------------------------
Intercept b0 = 5.9643
Slope b1 = 0.0190
SST = 6.6200
SSE = 6.6048
SSR = 0.0152
R^2 = 0.0023
Check: 1 - SSE/SST = 0.0023
We want to understand why:
$$ \hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right), \quad \text{where } S_{xx} = \sum (x_i - \bar{x})^2. $$

We consider the simple linear regression model:
$$ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, $$

where the $x_i$ are fixed constants and the errors $\varepsilon_i$ are independent $N(0, \sigma^2)$ random variables.
Thus,
$$ Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2). $$The least squares estimator for the slope is:
$$ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x}) Y_i}{S_{xx}}, \quad S_{xx} = \sum (x_i - \bar{x})^2. $$

This form is valid because the slope estimator can be written in either of the following equivalent forms:
$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} $$or
$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i-\bar{x})y_i}{\sum_{i=1}^n (x_i-\bar{x})^2}. $$These are equivalent because
$$ \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^n (x_i-\bar{x})y_i - \bar{y}\sum_{i=1}^n (x_i-\bar{x}), $$and since
$$ \sum_{i=1}^n (x_i-\bar{x}) = 0, $$we obtain
$$ \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^n (x_i-\bar{x})y_i. $$
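As a quick numerical check of this identity, on made-up data both forms give the same slope:

import numpy as np

# Made-up data; both slope formulas should agree
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.4, 3.1, 3.9, 4.2])
xc = x - x.mean()

b1_centered = np.sum(xc * (y - y.mean())) / np.sum(xc**2)
b1_uncentered = np.sum(xc * y) / np.sum(xc**2)

print(b1_centered, b1_uncentered)  # identical up to floating point error

Returning to the derivation, substitute $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$: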
$$ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(\beta_0 + \beta_1 x_i + \varepsilon_i)}{S_{xx}}. $$Expanding:
$$ \hat{\beta}_1 = \frac{ \beta_0 \sum (x_i - \bar{x}) + \beta_1 \sum (x_i - \bar{x}) x_i + \sum (x_i - \bar{x}) \varepsilon_i }{S_{xx}}. $$We use the identities:
$$ \sum (x_i - \bar{x}) = 0, \qquad \sum (x_i - \bar{x}) x_i = S_{xx}. $$

Thus:
$$ \hat{\beta}_1 = \frac{ 0 + \beta_1 S_{xx} + \sum (x_i - \bar{x}) \varepsilon_i }{S_{xx}}. $$We obtain:
$$ \hat{\beta}_1 = \beta_1 + \frac{\sum (x_i - \bar{x}) \varepsilon_i}{S_{xx}}. $$This is the key representation.
Since $E[\varepsilon_i] = 0$:
$$ E[\hat{\beta}_1] = \beta_1. $$Thus, $\hat{\beta}_1$ is unbiased.
Using independence of the errors:
$$ Var(\hat{\beta}_1) = \frac{1}{S_{xx}^2} \sum (x_i - \bar{x})^2 Var(\varepsilon_i). $$Since $Var(\varepsilon_i) = \sigma^2$:
$$ Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}^2} \sum (x_i - \bar{x})^2 = \frac{\sigma^2}{S_{xx}}. $$From the representation:
$$ \hat{\beta}_1 = \beta_1 + \frac{\sum (x_i - \bar{x}) \varepsilon_i}{S_{xx}}, $$we see that $\hat{\beta}_1$ is a linear combination of the errors $\varepsilon_i$.
Since the errors $\varepsilon_1, \dots, \varepsilon_n$ are independent normal random variables, any linear combination of them is also normal.
Thus:
$$ \sum (x_i - \bar{x}) \varepsilon_i \sim N\left(0, \sigma^2 S_{xx}\right). $$Dividing by $S_{xx}$:
$$ \hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right). $$

The sum of squared errors is:
$$ SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2. $$This measures the unexplained variation in the data.
We define the estimator of $\sigma^2$ as:
$$ s^2 = \frac{SSE}{n-2}. $$A fundamental result:
$$ E[SSE] = (n-2)\sigma^2. $$Therefore:
$$ E[s^2] = \sigma^2. $$So $s^2$ is an unbiased estimator of $\sigma^2$.
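The following simulation is a minimal sketch, with arbitrarily chosen true parameters, checking these facts empirically: across many simulated datasets, the average of $\hat{\beta}_1$ should be close to $\beta_1$, its variance close to $\sigma^2/S_{xx}$, and the average of $s^2$ close to $\sigma^2$.

import numpy as np

rng = np.random.default_rng(0)

# Arbitrarily chosen true parameters for the simulation
beta0, beta1, sigma = 1.0, 2.0, 0.5
x = np.linspace(0, 10, 20)
n = len(x)
x_bar = np.mean(x)
Sxx = np.sum((x - x_bar)**2)

b1_draws, s2_draws = [], []
for _ in range(20000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - x_bar) * y) / Sxx          # slope estimate
    b0 = y.mean() - b1 * x_bar
    SSE = np.sum((y - b0 - b1 * x)**2)
    b1_draws.append(b1)
    s2_draws.append(SSE / (n - 2))

print(f"mean of b1 estimates: {np.mean(b1_draws):.4f} (theory: {beta1})")
print(f"variance of b1 estimates: {np.var(b1_draws):.6f} (theory: {sigma**2/Sxx:.6f})")
print(f"mean of s^2: {np.mean(s2_draws):.4f} (theory: {sigma**2:.4f})")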
From previous derivation:
$$ \hat{\beta}_1 = \beta_1 + \frac{\sum (x_i - \bar{x})\varepsilon_i}{S_{xx}}. $$Thus:
$$ Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}, \quad S_{xx} = \sum (x_i - \bar{x})^2. $$Since $\sigma^2$ is unknown, we replace it with $s^2$:
$$ \widehat{Var}(\hat{\beta}_1) = \frac{s^2}{S_{xx}}. $$This is a plug-in estimator.
The standard error is the square root of the estimated variance:
$$ SE(\hat{\beta}_1) = \sqrt{\widehat{Var}(\hat{\beta}_1)} = \sqrt{\frac{s^2}{S_{xx}}}. $$We know:
$$ \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2/S_{xx}}} \sim N(0,1). $$Also, from theory:
$$ \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}. $$

A key theorem in regression: $\hat{\beta}_1$ and $SSE$ are independent. This is a non-trivial result and follows from orthogonality of projections in the linear model.
We replace $\sigma^2$ by $s^2$:
$$ T = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{s^2/S_{xx}}}. $$Rewrite:
$$ T = \frac{ \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2/S_{xx}}} }{ \sqrt{\frac{SSE}{(n-2)\sigma^2}} }. $$

We now have a standard normal variable in the numerator and the square root of an independent $\chi^2_{n-2}$ variable divided by its degrees of freedom in the denominator, which is exactly the definition of a $t$-distributed random variable.

Therefore:
$$ T \sim t_{n-2}. $$

To test $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$ at level $\alpha$, we compute $T$ under $H_0$ and reject $H_0$ if:
$$ |T| > t_{\alpha/2, n-2} $$
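A minimal sketch of this test in code (the function name and data are our own illustrations, not from any particular library):

import numpy as np
from scipy.stats import t

def slope_t_test(x, y):
    # t-test of H0: beta1 = 0 in simple linear regression
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    x_bar = np.mean(x)
    Sxx = np.sum((x - x_bar)**2)
    b1 = np.sum((x - x_bar) * y) / Sxx
    b0 = y.mean() - b1 * x_bar
    SSE = np.sum((y - b0 - b1 * x)**2)
    s2 = SSE / (n - 2)
    T = b1 / np.sqrt(s2 / Sxx)                  # T statistic under H0: beta1 = 0
    p_value = 2 * (1 - t.cdf(abs(T), df=n - 2))
    return T, p_value

T_stat, p = slope_t_test([1, 2, 3, 4, 5, 6], [2.0, 2.9, 4.1, 4.8, 6.2, 6.9])
print(f"T = {T_stat:.4f}, p-value = {p:.4f}")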
We want to understand why:
$$ Var(\hat{Y}_0) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right), \quad S_{xx} = \sum (x_i - \bar{x})^2, $$

where $\hat{Y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$ is the estimated mean response at $x_0$. Recall:
$$ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}. $$Substitute into $\hat{Y}_0$:
$$ \hat{Y}_0 = \bar{Y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x_0 = \bar{Y} + \hat{\beta}_1 (x_0 - \bar{x}). $$We now compute:
$$ Var(\hat{Y}_0) = Var\left(\bar{Y} + \hat{\beta}_1 (x_0 - \bar{x})\right). $$Using variance properties:
$$ Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab \, Cov(X,Y), $$we get:
$$ Var(\hat{Y}_0) = Var(\bar{Y}) + (x_0 - \bar{x})^2 Var(\hat{\beta}_1) + 2(x_0 - \bar{x}) Cov(\bar{Y}, \hat{\beta}_1). $$Since $Y_i$ are independent with variance $\sigma^2$:
$$ Var(\bar{Y}) = \frac{\sigma^2}{n}. $$From previous results:
$$ Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}. $$We compute:
$$ Cov(\bar{Y}, \hat{\beta}_1). $$Recall:
$$ \bar{Y} = \frac{1}{n}\sum Y_i, \quad \hat{\beta}_1 = \frac{\sum (x_i - \bar{x}) Y_i}{S_{xx}}. $$Thus:
$$ Cov(\bar{Y}, \hat{\beta}_1) = \frac{1}{n S_{xx}} \sum_{i}\sum_{j} (x_j - \bar{x}) \, Cov(Y_i, Y_j) = \frac{1}{n S_{xx}} \sum_{i} (x_i - \bar{x}) \, Var(Y_i), $$

since $Cov(Y_i, Y_j) = 0$ for $i \neq j$ by independence. Since $Var(Y_i)=\sigma^2$:
$$ Cov(\bar{Y}, \hat{\beta}_1) = \frac{\sigma^2}{n S_{xx}} \sum (x_i - \bar{x}). $$But:
$$ \sum (x_i - \bar{x}) = 0, $$so:
$$ Cov(\bar{Y}, \hat{\beta}_1) = 0. $$Putting everything together:
$$ Var(\hat{Y}_0) = \frac{\sigma^2}{n} + (x_0 - \bar{x})^2 \frac{\sigma^2}{S_{xx}}. $$Factor out $\sigma^2$:
$$ \boxed{ Var(\hat{Y}_0) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right). } $$
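As a quick empirical check of this formula, the following simulation (true parameters chosen arbitrarily) compares the simulated variance of $\hat{Y}_0$ with the theoretical value:

import numpy as np

rng = np.random.default_rng(1)

# Arbitrarily chosen true parameters
beta0, beta1, sigma = 1.0, 2.0, 0.5
x = np.linspace(0, 10, 15)
n = len(x)
x_bar = np.mean(x)
Sxx = np.sum((x - x_bar)**2)
x0 = 8.0  # point at which we estimate the mean response

y0_hats = []
for _ in range(50000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - x_bar) * y) / Sxx
    b0 = y.mean() - b1 * x_bar
    y0_hats.append(b0 + b1 * x0)

theory = sigma**2 * (1/n + (x0 - x_bar)**2 / Sxx)
print(f"simulated Var(Y0_hat):   {np.var(y0_hats):.6f}")
print(f"theoretical Var(Y0_hat): {theory:.6f}")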
Taking the square root gives the standard error:
$$ SE(\hat{Y}_0) = \sqrt{ \sigma^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) }. $$Since $\sigma^2$ is unknown, we estimate it by
$$ s^2=\frac{SSE}{n-2}. $$Thus the estimated standard error is
$$ \widehat{SE}(\hat{Y}_0) = \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) }. $$We now standardize:
$$ \frac{ \hat{Y}_0 - (\beta_0+\beta_1 x_0) }{ \sqrt{ \sigma^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } \sim N(0,1). $$Also,
$$ \frac{SSE}{\sigma^2}\sim \chi^2_{n-2}, $$and this quantity is independent of $\hat{Y}_0$.
Therefore,
$$ \frac{ \hat{Y}_0 - (\beta_0+\beta_1 x_0) }{ \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } \sim t_{n-2}. $$Hence
$$ P\left( -t_{\alpha/2,n-2} \le \frac{ \hat{Y}_0 - (\beta_0+\beta_1 x_0) }{ \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } \le t_{\alpha/2,n-2} \right)=1-\alpha. $$Multiplying through by the standard error and rearranging gives
$$ P\left( \hat{Y}_0 - t_{\alpha/2,n-2} \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } \le \beta_0+\beta_1x_0 \le \hat{Y}_0 + t_{\alpha/2,n-2} \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } \right) = 1-\alpha. $$Therefore, the $(1-\alpha)$ confidence interval for the mean response is
$$ \boxed{ \hat{Y}_0 \pm t_{\alpha/2,n-2} \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } $$

Now suppose we want to predict not the mean, but a new observation at $x_0$.
Let
$$ Y_{\text{new}} = \beta_0+\beta_1x_0+\varepsilon_{\text{new}}, $$where
$$ \varepsilon_{\text{new}} \sim N(0,\sigma^2) $$and $\varepsilon_{\text{new}}$ is independent of the original sample.
We want an interval for $Y_{\text{new}}$.
Consider
$$ Y_{\text{new}}-\hat{Y}_0. $$Write it as
$$ Y_{\text{new}}-\hat{Y}_0 = (\beta_0+\beta_1x_0+\varepsilon_{\text{new}})-\hat{Y}_0. $$Equivalently,
$$ Y_{\text{new}}-\hat{Y}_0 = \big((\beta_0+\beta_1x_0)-\hat{Y}_0\big) + \varepsilon_{\text{new}}. $$The first term comes from estimating the regression line; the second is the irreducible random noise of the future observation.
Since $\varepsilon_{\text{new}}$ is independent of $\hat{Y}_0$,
$$ Var(Y_{\text{new}}-\hat{Y}_0) = Var\big((\beta_0+\beta_1x_0)-\hat{Y}_0\big) + Var(\varepsilon_{\text{new}}). $$Because subtracting a constant does not change variance,
$$ Var\big((\beta_0+\beta_1x_0)-\hat{Y}_0\big) = Var(\hat{Y}_0). $$So
$$ Var(Y_{\text{new}}-\hat{Y}_0) = Var(\hat{Y}_0)+\sigma^2. $$Using the previous formula for $Var(\hat{Y}_0)$,
$$ Var(Y_{\text{new}}-\hat{Y}_0) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) +\sigma^2. $$Hence
$$ Var(Y_{\text{new}}-\hat{Y}_0) = \sigma^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right). $$Therefore,
$$ \boxed{ Var(Y_{\text{new}}-\hat{Y}_0) = \sigma^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } $$and the estimated standard error is
$$ \widehat{SE}(Y_{\text{new}}-\hat{Y}_0) = \sqrt{ s^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) }. $$Using normality and independence,
$$ \frac{ Y_{\text{new}}-\hat{Y}_0 }{ \sqrt{ s^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } \sim t_{n-2}. $$Therefore,
$$ P\left( -t_{\alpha/2,n-2} \le \frac{ Y_{\text{new}}-\hat{Y}_0 }{ \sqrt{ s^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } \le t_{\alpha/2,n-2} \right) = 1-\alpha. $$Rearranging yields the prediction interval.
Thus, the $(1-\alpha)$ prediction interval for a new observation at $x_0$ is
$$ \boxed{ \hat{Y}_0 \pm t_{\alpha/2,n-2} \sqrt{ s^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } $$Compare the two formulas:
The prediction interval contains an additional constant term, the $1$ inside the parentheses. This extra $1$ appears because a future observation has its own random error term $\varepsilon_{\text{new}}$ with variance $\sigma^2$.
So the prediction interval must account for both the uncertainty in the estimated regression line and the intrinsic variability of the new observation. That is why prediction intervals are always wider than the corresponding confidence intervals.
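As a sanity check, the following minimal simulation (with arbitrarily chosen true parameters) estimates the empirical coverage of both intervals; each should be close to the nominal $95\%$:

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)

# Arbitrarily chosen true parameters
beta0, beta1, sigma, alpha = 1.0, 2.0, 0.5, 0.05
x = np.linspace(0, 10, 15)
n = len(x)
x_bar = np.mean(x)
Sxx = np.sum((x - x_bar)**2)
x0 = 7.0
mean0 = beta0 + beta1 * x0                       # true mean response at x0
t_crit = t.ppf(1 - alpha/2, df=n - 2)

reps = 20000
ci_hits = pi_hits = 0
for _ in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - x_bar) * y) / Sxx
    b0 = y.mean() - b1 * x_bar
    s2 = np.sum((y - b0 - b1 * x)**2) / (n - 2)
    y0_hat = b0 + b1 * x0
    half_ci = t_crit * np.sqrt(s2 * (1/n + (x0 - x_bar)**2 / Sxx))
    half_pi = t_crit * np.sqrt(s2 * (1 + 1/n + (x0 - x_bar)**2 / Sxx))
    y_new = mean0 + rng.normal(0, sigma)         # independent future observation
    ci_hits += abs(y0_hat - mean0) <= half_ci    # does the CI cover the mean?
    pi_hits += abs(y_new - y0_hat) <= half_pi    # does the PI cover the new value?

print(f"CI coverage: {ci_hits/reps:.4f} (nominal {1-alpha})")
print(f"PI coverage: {pi_hits/reps:.4f} (nominal {1-alpha})")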
So far, we have constructed intervals at a single point $x_0$: a confidence interval for the mean response and a prediction interval for a new observation.

Now we want something stronger: intervals that hold for all values of $x$ simultaneously.

These are called simultaneous confidence bands; the classical construction is the Working--Hotelling band.
At a fixed $x_0$:
$$ \hat{Y}_0 \pm t_{\alpha/2,n-2} \cdot SE(\hat{Y}_0) $$👉 Valid only for one specific $x_0$
We want:
$$ P\left( \text{the entire function } \beta_0 + \beta_1 x \text{ lies inside the band for all } x \right) = 1-\alpha $$We want a band for:
$$ E[Y \mid x] = \beta_0 + \beta_1 x. $$Instead of controlling error at a single point, we control:
$$ \sup_x \left| \frac{\hat{Y}(x) - (\beta_0 + \beta_1 x)} {SE(\hat{Y}(x))} \right| $$A classical result gives the band:
$$ \boxed{ \hat{Y}(x) \pm \sqrt{2 F_{\alpha,2,n-2}} \cdot \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) } } $$

where $F_{\alpha,2,n-2}$ is the upper $\alpha$ quantile of the $F$ distribution with $2$ and $n-2$ degrees of freedom.
With probability $1-\alpha$:
$$ \beta_0 + \beta_1 x \text{ lies inside the band for all } x. $$Pointwise CI:
$$ \hat{Y}(x) \pm t_{\alpha/2,n-2} \cdot SE(\hat{Y}(x)) $$Band:
$$ \hat{Y}(x) \pm \sqrt{2F_{\alpha,2,n-2}} \cdot SE(\hat{Y}(x)) $$👉 Bands are wider than pointwise intervals
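A one-line comparison (for an arbitrary choice of $n$ and $\alpha$) confirms that the simultaneous multiplier exceeds the pointwise one:

from scipy.stats import t, f

# Multiplier comparison for an arbitrary n and alpha
n, alpha = 10, 0.05
t_mult = t.ppf(1 - alpha/2, df=n - 2)                    # pointwise multiplier
W_mult = (2 * f.ppf(1 - alpha, dfn=2, dfd=n - 2))**0.5   # Working-Hotelling multiplier
print(f"pointwise t multiplier: {t_mult:.4f}")
print(f"Working-Hotelling W:    {W_mult:.4f}")  # always the larger of the two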
Now we consider future observations:
$$ Y = \beta_0 + \beta_1 x + \varepsilon. $$Confidence band:
“With probability $1-\alpha$, the entire true regression line lies inside the band”
Prediction band:
“With probability $1-\alpha$, all future observations lie inside the band (simultaneously)”
In this example, we fit a least squares line to a small dataset, compute pointwise intervals at a chosen point $x_0$, and plot the simultaneous confidence and prediction bands.
We use the formulas
$$ \hat{Y}(x) = \hat{\beta}_0 + \hat{\beta}_1 x, \qquad s^2 = \frac{SSE}{n-2} $$

$$ SE_{\text{mean}}(x) = \sqrt{ s^2\left( \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) }, \qquad SE_{\text{pred}}(x) = \sqrt{ s^2\left( 1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) } $$

Pointwise intervals:
$$ \hat{Y}(x_0) \pm t_{\alpha/2,n-2} \, SE_{\text{mean}}(x_0), \qquad \hat{Y}(x_0) \pm t_{\alpha/2,n-2} \, SE_{\text{pred}}(x_0) $$

Working--Hotelling confidence band:
$$ \hat{Y}(x) \pm W \, SE_{\text{mean}}(x), \qquad W = \sqrt{2F_{\alpha,2,n-2}} $$Prediction band (using the same simultaneous multiplier here for illustration):
$$ \hat{Y}(x) \pm W \, SE_{\text{pred}}(x) $$

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t, f
# Sample data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 2.9, 3.8, 5.1, 5.0, 6.4, 7.1, 7.9, 9.2, 9.8], dtype=float)
n = len(x)
x_bar = np.mean(x)
y_bar = np.mean(y)
# OLS estimates
Sxx = np.sum((x - x_bar)**2)
Sxy = np.sum((x - x_bar)*(y - y_bar))
b1 = Sxy / Sxx
b0 = y_bar - b1 * x_bar
# Fitted values and residuals
y_hat = b0 + b1 * x
residuals = y - y_hat
# SSE and variance estimate
SSE = np.sum(residuals**2)
s2 = SSE / (n - 2)
s = np.sqrt(s2)
print(f"b0 = {b0:.4f}")
print(f"b1 = {b1:.4f}")
print(f"SSE = {SSE:.4f}")
print(f"s^2 = {s2:.4f}")
print(f"s = {s:.4f}")
b0 = 1.3000
b1 = 0.8455
SSE = 0.6145
s^2 = 0.0768
s = 0.2772
# Choose the point x0 where we want pointwise intervals
x0 = 4.5
# Predicted mean at x0
y0_hat = b0 + b1 * x0
# Standard errors at x0
SE_mean_x0 = np.sqrt(s2 * (1/n + (x0 - x_bar)**2 / Sxx))
SE_pred_x0 = np.sqrt(s2 * (1 + 1/n + (x0 - x_bar)**2 / Sxx))
# Quantiles
alpha = 0.001
t_crit = t.ppf(1 - alpha/2, df=n-2)
W = np.sqrt(2 * f.ppf(1 - alpha, dfn=2, dfd=n-2))
# Pointwise confidence interval for mean response at x0
ci_mean_lower = y0_hat - t_crit * SE_mean_x0
ci_mean_upper = y0_hat + t_crit * SE_mean_x0
# Pointwise prediction interval for a new observation at x0
pi_lower = y0_hat - t_crit * SE_pred_x0
pi_upper = y0_hat + t_crit * SE_pred_x0
print(f"x0 = {x0}")
print(f"Predicted value y_hat(x0) = {y0_hat:.4f}")
print()
print("Pointwise confidence interval for mean response:")
print(f"({ci_mean_lower:.4f}, {ci_mean_upper:.4f})")
print()
print("Pointwise prediction interval:")
print(f"({pi_lower:.4f}, {pi_upper:.4f})")
print()
print(f"t critical value = {t_crit:.4f}")
print(f"Working-Hotelling multiplier W = {W:.4f}")
x0 = 4.5
Predicted value y_hat(x0) = 5.1045

Pointwise confidence interval for mean response:
(4.6367, 5.5724)

Pointwise prediction interval:
(3.6310, 6.5780)

t critical value = 5.0413
Working-Hotelling multiplier W = 6.0817
# Grid of x-values for plotting the fitted line and bands
x_grid = np.linspace(np.min(x), np.max(x), 400)
y_grid = b0 + b1 * x_grid
# Standard errors on the grid
SE_mean_grid = np.sqrt(s2 * (1/n + (x_grid - x_bar)**2 / Sxx))
SE_pred_grid = np.sqrt(s2 * (1 + 1/n + (x_grid - x_bar)**2 / Sxx))
# Pointwise confidence band (if you want to compare)
ci_mean_grid_lower = y_grid - t_crit * SE_mean_grid
ci_mean_grid_upper = y_grid + t_crit * SE_mean_grid
# Confidence band (Working-Hotelling)
conf_band_lower = y_grid - W * SE_mean_grid
conf_band_upper = y_grid + W * SE_mean_grid
# Prediction band (using same simultaneous multiplier W for illustration)
pred_band_lower = y_grid - W * SE_pred_grid
pred_band_upper = y_grid + W * SE_pred_grid
plt.figure(figsize=(10, 7))
# Scatter plot of data
plt.scatter(x, y, label="Data")
# Least squares line
plt.plot(x_grid, y_grid, linewidth=2, label="LS regression line")
# Confidence band
plt.fill_between(
    x_grid, conf_band_lower, conf_band_upper,
    alpha=0.20, label="Confidence band"
)
# Prediction band
plt.fill_between(
    x_grid, pred_band_lower, pred_band_upper,
    alpha=0.12, label="Prediction band"
)
# Vertical line at x0
plt.axvline(x=x0, linestyle="--", linewidth=1.5, label=r"$x_0$")
# Mark predicted point on regression line
plt.scatter([x0], [y0_hat], s=80, zorder=5, label=r"$\hat{Y}_0$")
# Draw pointwise confidence interval at x0
plt.plot([x0, x0], [ci_mean_lower, ci_mean_upper], linewidth=4, label="Pointwise CI at $x_0$")
# Draw pointwise prediction interval at x0
plt.plot([x0, x0], [pi_lower, pi_upper], linewidth=2, label="Pointwise PI at $x_0$")
# Optional horizontal markers for interval endpoints
plt.plot([x0-0.12, x0+0.12], [ci_mean_lower, ci_mean_lower], linewidth=2)
plt.plot([x0-0.12, x0+0.12], [ci_mean_upper, ci_mean_upper], linewidth=2)
plt.plot([x0-0.18, x0+0.18], [pi_lower, pi_lower], linewidth=2)
plt.plot([x0-0.18, x0+0.18], [pi_upper, pi_upper], linewidth=2)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Simple Linear Regression with Pointwise Intervals and Bands")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
We analyze a dataset of $n = 58$ hospitals.
Response variable: $$ y = \text{infection risk (percentage)} $$
Predictor variable: $$ x = \text{average length of stay (days)} $$
We will: load and clean the data, visualize it, fit the least squares regression by hand, build the ANOVA and coefficient tables, and construct confidence/prediction intervals and bands.
import pandas as pd
data = pd.read_csv(
    "hospital.txt",
    sep=r"\s+",
    header=None
)
# If there are more than 2 columns, keep only first two
data = data.iloc[:, :2]
data.columns = ["Stay", "InfectionRisk"]
# Convert to numeric explicitly (very important)
data["Stay"] = pd.to_numeric(data["Stay"], errors="coerce")
data["InfectionRisk"] = pd.to_numeric(data["InfectionRisk"], errors="coerce")
# Drop bad rows if any
data = data.dropna()
print(data.head())
print(data.shape)
   Stay  InfectionRisk
1   5.0          11.20
2  10.0           8.84
3  11.0          11.07
4  13.0          12.78
5  18.0          11.62
(58, 2)
print(data.dtypes)
print(data.describe())
Stay float64
InfectionRisk float64
dtype: object
Stay InfectionRisk
count 58.000000 58.00000
mean 52.706897 10.04931
std 30.625645 1.44107
min 2.000000 7.39000
25% 28.250000 8.90500
50% 53.000000 9.93000
75% 77.750000 11.06000
max 109.000000 13.95000
import matplotlib.pyplot as plt
x = data["Stay"].values
y = data["InfectionRisk"].values
plt.figure(figsize=(8,5))
plt.scatter(x, y, s=50)
plt.xlabel("Average Length of Stay (days)", fontsize=12)
plt.ylabel("Infection Risk (%)", fontsize=12)
plt.title("Hospital Infection Data", fontsize=14)
plt.grid(True, linestyle="--", alpha=0.4)
plt.tight_layout()
plt.show()
# Compute the regression quantities by hand
import numpy as np
n = len(x)
x_bar = np.mean(x)
y_bar = np.mean(y)
Sxx = np.sum((x - x_bar)**2)
Sxy = np.sum((x - x_bar) * (y - y_bar))
b1 = Sxy / Sxx
b0 = y_bar - b1 * x_bar
y_hat = b0 + b1 * x
residuals = y - y_hat
SSE = np.sum(residuals**2)
SSR = np.sum((y_hat - y_bar)**2)
SST = np.sum((y - y_bar)**2)
MSR = SSR / 1
MSE = SSE / (n - 2)
F_stat = MSR / MSE
R2 = SSR / SST
s = np.sqrt(MSE)
print(f"n = {n}")
print(f"b0 = {b0:.4f}")
print(f"b1 = {b1:.4f}")
print(f"SST = {SST:.4f}")
print(f"SSR = {SSR:.4f}")
print(f"SSE = {SSE:.4f}")
print(f"MSE = {MSE:.4f}")
print(f"s = {s:.5f}")
print(f"R^2 = {R2:.4f} = {100*R2:.2f}%")
print(f"F = {F_stat:.4f}")
n = 58
b0 = 10.2033
b1 = -0.0029
SST = 118.3710
SSR = 0.4563
SSE = 117.9147
MSE = 2.1056
s = 1.45108
R^2 = 0.0039 = 0.39%
F = 0.2167
x_grid = np.linspace(np.min(x), np.max(x), 400)
y_grid = b0 + b1 * x_grid
plt.figure(figsize=(8, 5))
plt.scatter(x, y, s=50, label="Data")
plt.plot(x_grid, y_grid, linewidth=2, label="Least squares line")
plt.xlabel("Average Length of Stay (days)", fontsize=12)
plt.ylabel("Infection Risk (%)", fontsize=12)
plt.title("Hospital Infection Data with LS Regression Line", fontsize=14)
plt.grid(True, linestyle="--", alpha=0.4)
plt.legend()
plt.tight_layout()
plt.show()
from scipy.stats import t, f
SE_b1 = np.sqrt(MSE / Sxx)
SE_b0 = np.sqrt(MSE * (1/n + x_bar**2 / Sxx))
t_b1 = b1 / SE_b1
t_b0 = b0 / SE_b0
p_b1 = 2 * (1 - t.cdf(abs(t_b1), df=n-2))
p_b0 = 2 * (1 - t.cdf(abs(t_b0), df=n-2))
anova_table = {
    "Source": ["Regression", "Error", "Total"],
    "DF": [1, n-2, n-1],
    "SS": [SSR, SSE, SST],
    "MS": [MSR, MSE, np.nan],
    "F": [F_stat, np.nan, np.nan],
    "P-value": [1 - f.cdf(F_stat, 1, n-2), np.nan, np.nan]
}
coef_table = {
    "Term": ["Intercept", "Stay"],
    "Coefficient": [b0, b1],
    "SE Coef": [SE_b0, SE_b1],
    "t-value": [t_b0, t_b1],
    "P-value": [p_b0, p_b1]
}
import pandas as pd
print("ANOVA table")
display(pd.DataFrame(anova_table))
print("Coefficient table")
display(pd.DataFrame(coef_table))
ANOVA table
|   | Source | DF | SS | MS | F | P-value |
|---|---|---|---|---|---|---|
| 0 | Regression | 1 | 0.456263 | 0.456263 | 0.216688 | 0.64338 |
| 1 | Error | 56 | 117.914710 | 2.105620 | NaN | NaN |
| 2 | Total | 57 | 118.370972 | NaN | NaN | NaN |
Coefficient table
|   | Term | Coefficient | SE Coef | t-value | P-value |
|---|---|---|---|---|---|
| 0 | Intercept | 10.203286 | 0.381729 | 26.729144 | 0.00000 |
| 1 | Stay | -0.002921 | 0.006276 | -0.465498 | 0.64338 |
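As an optional cross-check (assuming the statsmodels package is available), the same estimates and tests can be reproduced with its OLS routine:

import statsmodels.api as sm

# Cross-check of the hand computations above
X = sm.add_constant(x)         # design matrix with an intercept column
ols_fit = sm.OLS(y, X).fit()
print(ols_fit.summary())       # coefficient table, t-tests, R^2, F statistic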
# Choose any point of interest
x0 = 50
y0_hat = b0 + b1 * x0
SE_mean_x0 = np.sqrt(MSE * (1/n + (x0 - x_bar)**2 / Sxx))
SE_pred_x0 = np.sqrt(MSE * (1 + 1/n + (x0 - x_bar)**2 / Sxx))
alpha = 0.05
t_crit = t.ppf(1 - alpha/2, df=n-2)
ci_mean_lower = y0_hat - t_crit * SE_mean_x0
ci_mean_upper = y0_hat + t_crit * SE_mean_x0
pi_lower = y0_hat - t_crit * SE_pred_x0
pi_upper = y0_hat + t_crit * SE_pred_x0
print(f"x0 = {x0}")
print(f"Predicted mean response at x0: {y0_hat:.4f}")
print()
print(f"95% CI for the mean response at x0: ({ci_mean_lower:.4f}, {ci_mean_upper:.4f})")
print(f"95% PI for a new observation at x0: ({pi_lower:.4f}, {pi_upper:.4f})")
x0 = 50
Predicted mean response at x0: 10.0572

95% CI for the mean response at x0: (9.6740, 10.4404)
95% PI for a new observation at x0: (7.1252, 12.9892)
# Confidence band and prediction band
alpha = 0.05
# Pointwise multiplier
t_crit = t.ppf(1 - alpha/2, df=n-2)
# Working-Hotelling multiplier for simultaneous confidence band
W = np.sqrt(2 * f.ppf(1 - alpha, dfn=2, dfd=n-2))
SE_mean_grid = np.sqrt(MSE * (1/n + (x_grid - x_bar)**2 / Sxx))
SE_pred_grid = np.sqrt(MSE * (1 + 1/n + (x_grid - x_bar)**2 / Sxx))
# Pointwise confidence interval curves
pointwise_ci_lower = y_grid - t_crit * SE_mean_grid
pointwise_ci_upper = y_grid + t_crit * SE_mean_grid
# Confidence band
conf_band_lower = y_grid - W * SE_mean_grid
conf_band_upper = y_grid + W * SE_mean_grid
# Prediction band
pred_band_lower = y_grid - W * SE_pred_grid
pred_band_upper = y_grid + W * SE_pred_grid
plt.figure(figsize=(10, 6))
plt.scatter(x, y, s=45, label="Data")
plt.plot(x_grid, y_grid, linewidth=2, label="LS regression line")
plt.fill_between(
    x_grid, conf_band_lower, conf_band_upper,
    alpha=0.20, label="Confidence band"
)
plt.fill_between(
    x_grid, pred_band_lower, pred_band_upper,
    alpha=0.10, label="Prediction band"
)
plt.axvline(x=x0, linestyle="--", linewidth=1.5, label=r"$x_0$")
plt.scatter([x0], [y0_hat], s=80, zorder=5, label=r"$\hat{Y}_0$")
# Pointwise CI at x0
plt.plot([x0, x0], [ci_mean_lower, ci_mean_upper], linewidth=4, label="Pointwise CI at $x_0$")
# Pointwise PI at x0
plt.plot([x0, x0], [pi_lower, pi_upper], linewidth=2, label="Pointwise PI at $x_0$")
plt.xlabel("Average Length of Stay (days)", fontsize=12)
plt.ylabel("Infection Risk (%)", fontsize=12)
plt.title("Hospital Infection Data: Regression, Intervals, and Bands", fontsize=14)
plt.grid(True, linestyle="--", alpha=0.4)
plt.legend()
plt.tight_layout()
plt.show()
We fix a value $x_0$.
Now there are two different targets we could care about: the mean response at $x_0$, and a single new observation at $x_0$.
The true mean response at $x_0$ is:
$$ E[Y \mid x_0] = \beta_0 + \beta_1 x_0. $$👉 This is the average outcome of $Y$ for all units with $x = x_0$.
The confidence interval answers the question: “Where is the true regression line at $x_0$?”
So the CI is for:
$$ E[Y \mid x_0] $$NOT for an individual observation.
Now consider a new observation:
$$ Y_{\text{new}} = \beta_0 + \beta_1 x_0 + \varepsilon. $$👉 This includes random noise.
The prediction interval answers the question: “Where will a new individual value fall?”
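To make the distinction concrete, here is a tiny sketch with made-up true parameters: the mean response at $x_0$ is a single fixed number (the target of the CI), while new observations scatter around it (the target of the PI):

import numpy as np

rng = np.random.default_rng(3)

# Made-up true parameters, for illustration only
beta0, beta1, sigma = 2.0, 0.5, 1.0
x0 = 6.0

mean_response = beta0 + beta1 * x0                       # fixed target of the CI
new_obs = mean_response + rng.normal(0, sigma, size=5)   # random targets of the PI

print(f"E[Y | x0] = {mean_response:.2f} (a fixed number)")
print("possible new observations at x0:", np.round(new_obs, 2))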