Seminar 11

1. Coefficient of Determination ($R^2$)

Definition¶

The coefficient of determination measures the proportion of variability in the response variable $Y$ explained by the regression model.

We start from the decomposition:

$$ Y_i = \hat{Y}_i + e_i $$

where:

  • $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
  • $e_i = Y_i - \hat{Y}_i$

Total Variability Decomposition¶

$$ \underbrace{\sum_{i=1}^n (Y_i - \bar{Y})^2}_{SST} = \underbrace{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}_{SSR} + \underbrace{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}_{SSE} $$
  • $SST$: Total Sum of Squares
  • $SSR$: Regression Sum of Squares
  • $SSE$: Error Sum of Squares

Definition of $R^2$¶

$$ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} $$

Alternative Expression¶

$$ R^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} $$

Interpretation¶

  • $R^2 = 1$: perfect fit
  • $R^2 = 0$: no explanatory power

In simple regression: $$ R^2 = r^2 $$

where $r$ is the Pearson correlation coefficient.
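A quick numerical sanity check of this identity (a minimal sketch with simulated data, not one of the seminar examples):

In [ ]:
# Quick check (simulated data) that R^2 = r^2 in simple linear regression.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=x.size)

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient

# OLS fit and R^2 via the sums of squares
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
R2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)

print(f"r^2 = {r**2:.6f}")
print(f"R^2 = {R2:.6f}")  # agrees with r^2 up to floating-point error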


In [2]:
import numpy as np
import matplotlib.pyplot as plt

def regression_r2_analysis(x, y, title="Regression Example"):
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    
    n = len(x)
    x_bar = np.mean(x)
    y_bar = np.mean(y)
    
    # OLS estimators
    Sxx = np.sum((x - x_bar)**2)
    Sxy = np.sum((x - x_bar)*(y - y_bar))
    
    b1 = Sxy / Sxx
    b0 = y_bar - b1 * x_bar
    
    # fitted values
    y_hat = b0 + b1 * x
    
    # sums of squares
    SST = np.sum((y - y_bar)**2)
    SSE = np.sum((y - y_hat)**2)
    SSR = np.sum((y_hat - y_bar)**2)
    
    R2 = SSR / SST
    
    # print results
    print(title)
    print("-" * len(title))
    print(f"Intercept b0 = {b0:.4f}")
    print(f"Slope b1     = {b1:.4f}")
    print(f"SST          = {SST:.4f}")
    print(f"SSE          = {SSE:.4f}")
    print(f"SSR          = {SSR:.4f}")
    print(f"R^2          = {R2:.4f}")
    print(f"Check: 1 - SSE/SST = {1 - SSE/SST:.4f}")
    
    # plot
    x_line = np.linspace(np.min(x), np.max(x), 200)
    y_line = b0 + b1 * x_line
    
    plt.figure(figsize=(7,5))
    plt.scatter(x, y, label="Data")
    plt.plot(x_line, y_line, label="Least Squares Line")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.title(f"{title}\n$R^2 = {R2:.4f}$")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    
    return {
        "b0": b0,
        "b1": b1,
        "SST": SST,
        "SSE": SSE,
        "SSR": SSR,
        "R2": R2
    }

Example 1: Strong Linear Relationship¶

In this first example, the points lie very close to a straight line. Therefore, the residual sum of squares $SSE$ should be small, and so the coefficient of determination $R^2$ should be close to $1$.

This means that the regression line explains most of the variability in the response variable.

In [3]:
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
y1 = [2.1, 4.2, 6.1, 8.3, 9.9, 12.2, 13.8, 16.1]

res1 = regression_r2_analysis(x1, y1, title="Example 1: Strong Linear Relationship")
Example 1: Strong Linear Relationship
-------------------------------------
Intercept b0 = 0.2000
Slope b1     = 1.9750
SST          = 163.9888
SSE          = 0.1625
SSR          = 163.8263
R^2          = 0.9990
Check: 1 - SSE/SST = 0.9990

Example 2: Weak Linear Relationship¶

In the second example, the data do not follow a clear straight-line pattern. The fitted regression line may still exist, but it explains only a small proportion of the variability in $Y$.

Thus, we expect $R^2$ to be relatively small, possibly close to $0$.

In [6]:
x2 = [1, 2, 3, 4, 5, 6, 7, 8]
y2 = [5.2, 7.8, 4.9, 6.1, 5.7, 7.0, 5.4, 6.3]

res2 = regression_r2_analysis(x2, y2, title="Example 2: Weak Linear Relationship")
Example 2: Weak Linear Relationship
-----------------------------------
Intercept b0 = 5.9643
Slope b1     = 0.0190
SST          = 6.6200
SSE          = 6.6048
SSR          = 0.0152
R^2          = 0.0023
Check: 1 - SSE/SST = 0.0023

2. Confidence Interval for the Slope $\beta_1$

📌 Distribution of the OLS Slope Estimator $\hat{\beta}_1$¶

We want to understand why:

$$ \hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right), \quad \text{where } S_{xx} = \sum (x_i - \bar{x})^2 $$

1. Model Assumptions¶

We consider the simple linear regression model:

$$ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, $$

where:

  • $\varepsilon_i \sim N(0, \sigma^2)$,
  • $\varepsilon_i$ are independent.

Thus,

$$ Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2). $$

2. Expression for the OLS Estimator¶

The least squares estimator for the slope is:

$$ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x}) Y_i}{S_{xx}}, \quad S_{xx} = \sum (x_i - \bar{x})^2. $$

Why this form is valid: the slope estimator can be written in either of the following equivalent forms:

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} $$

or

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i-\bar{x})y_i}{\sum_{i=1}^n (x_i-\bar{x})^2}. $$

These are equivalent because

$$ \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^n (x_i-\bar{x})y_i - \bar{y}\sum_{i=1}^n (x_i-\bar{x}), $$

and since

$$ \sum_{i=1}^n (x_i-\bar{x}) = 0, $$

we obtain

$$ \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^n (x_i-\bar{x})y_i. $$

3. Substituting the Model¶

Substitute $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$:

$$ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(\beta_0 + \beta_1 x_i + \varepsilon_i)}{S_{xx}}. $$

Expanding:

$$ \hat{\beta}_1 = \frac{ \beta_0 \sum (x_i - \bar{x}) + \beta_1 \sum (x_i - \bar{x}) x_i + \sum (x_i - \bar{x}) \varepsilon_i }{S_{xx}}. $$

4. Simplifications¶

We use the identities:

$$ \sum (x_i - \bar{x}) = 0, $$

$$ \sum (x_i - \bar{x}) x_i = S_{xx}. $$

Thus:

$$ \hat{\beta}_1 = \frac{ 0 + \beta_1 S_{xx} + \sum (x_i - \bar{x}) \varepsilon_i }{S_{xx}}. $$

5. Final Representation¶

We obtain:

$$ \hat{\beta}_1 = \beta_1 + \frac{\sum (x_i - \bar{x}) \varepsilon_i}{S_{xx}}. $$

This is the key representation.


6. Expectation¶

Since $E[\varepsilon_i] = 0$:

$$ E[\hat{\beta}_1] = \beta_1. $$

Thus, $\hat{\beta}_1$ is unbiased.


7. Variance¶

Using independence of the errors:

$$ Var(\hat{\beta}_1) = \frac{1}{S_{xx}^2} \sum (x_i - \bar{x})^2 Var(\varepsilon_i). $$

Since $Var(\varepsilon_i) = \sigma^2$:

$$ Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}^2} \sum (x_i - \bar{x})^2 = \frac{\sigma^2}{S_{xx}}. $$

8. Normality¶

From the representation:

$$ \hat{\beta}_1 = \beta_1 + \frac{\sum (x_i - \bar{x}) \varepsilon_i}{S_{xx}}, $$

we see that $\hat{\beta}_1$ is a linear combination of the errors $\varepsilon_i$.

Since:

  • $\varepsilon_i \sim N(0, \sigma^2)$,
  • they are independent,

any linear combination of them is also normal.

Thus:

$$ \sum (x_i - \bar{x}) \varepsilon_i \sim N\left(0, \sigma^2 S_{xx}\right). $$

Dividing by $S_{xx}$:

$$ \hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right). $$

9. Final Result¶

$$ \boxed{ \hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right) } $$

10. Interpretation¶

  • Larger spread in $x$ (large $S_{xx}$) $\Rightarrow$ more precise estimate
  • Larger noise $\sigma^2$ $\Rightarrow$ higher uncertainty
  • $\hat{\beta}_1$ equals $\beta_1$ plus a weighted sum of the noise terms

⚠️ Important Remark¶

  • Exact normality holds only if errors are normal
  • Without normality:
$$ \hat{\beta}_1 \text{ is still unbiased but only approximately normal (CLT)} $$
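As an illustration of this result, a Monte Carlo sketch with simulated data (the model parameters below are arbitrary illustrative choices) comparing the sampling distribution of $\hat{\beta}_1$ with $N(\beta_1, \sigma^2/S_{xx})$:

In [ ]:
# Monte Carlo sketch: simulate b1 many times and compare its empirical
# mean and variance with the theoretical values.
import numpy as np

rng = np.random.default_rng(42)

beta0, beta1, sigma = 1.0, 2.0, 1.5
x = np.linspace(0, 10, 20)
Sxx = np.sum((x - x.mean())**2)

B = 10_000
b1_draws = np.empty(B)
for b in range(B):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=x.size)
    b1_draws[b] = np.sum((x - x.mean()) * (y - y.mean())) / Sxx

print(f"mean of b1 draws: {b1_draws.mean():.4f}   (theory: {beta1})")
print(f"var  of b1 draws: {b1_draws.var():.5f}  (theory: {sigma**2 / Sxx:.5f})")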

2. Residual Sum of Squares (SSE)¶

The sum of squared errors is:

$$ SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2. $$

This measures the unexplained variation in the data.


3. Estimation of $\sigma^2$¶

We define the estimator of $\sigma^2$ as:

$$ s^2 = \frac{SSE}{n-2}. $$

4. Unbiasedness of $s^2$¶

A fundamental result:

$$ E[SSE] = (n-2)\sigma^2. $$

Therefore:

$$ E[s^2] = \sigma^2. $$

So $s^2$ is an unbiased estimator of $\sigma^2$.
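A small simulation check of this unbiasedness claim (illustrative parameters, not seminar data):

In [ ]:
# Simulation check that E[s^2] = sigma^2 with s^2 = SSE / (n - 2).
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, sigma = 1.0, 2.0, 1.5
x = np.linspace(0, 10, 20)
n = x.size

s2_draws = np.empty(10_000)
for b in range(s2_draws.size):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()
    s2_draws[b] = np.sum((y - (b0 + b1 * x))**2) / (n - 2)

print(f"average s^2: {s2_draws.mean():.4f}   (sigma^2 = {sigma**2})")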


5. Variance of $\hat{\beta}_1$¶

From previous derivation:

$$ \hat{\beta}_1 = \beta_1 + \frac{\sum (x_i - \bar{x})\varepsilon_i}{S_{xx}}. $$

Thus:

$$ Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}, \quad S_{xx} = \sum (x_i - \bar{x})^2. $$

6. Estimation of the Variance¶

Since $\sigma^2$ is unknown, we replace it with $s^2$:

$$ \widehat{Var}(\hat{\beta}_1) = \frac{s^2}{S_{xx}}. $$

This is a plug-in estimator.


7. Standard Error¶

The standard error is the square root of the estimated variance:

$$ SE(\hat{\beta}_1) = \sqrt{\widehat{Var}(\hat{\beta}_1)} = \sqrt{\frac{s^2}{S_{xx}}}. $$

8. Distribution of the Standardized Estimator¶

We know:

$$ \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2/S_{xx}}} \sim N(0,1). $$

Also, from theory:

$$ \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}. $$

9. Independence (crucial result)¶

A key theorem in regression:

  • $\hat{\beta}_1$ and $SSE$ are independent

This is a non-trivial result and follows from orthogonality of projections in the linear model.


10. Construction of the $t$-statistic¶

We replace $\sigma^2$ by $s^2$:

$$ T = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{s^2/S_{xx}}}. $$

Rewrite:

$$ T = \frac{ \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2/S_{xx}}} }{ \sqrt{\frac{SSE}{(n-2)\sigma^2}} }. $$

11. Distribution¶

We now have:

  • Numerator $\sim N(0,1)$
  • Denominator $\sim \sqrt{\chi^2_{n-2}/(n-2)}$
  • Independent

Therefore:

$$ T \sim t_{n-2}. $$

12. Final Result¶

$$ \boxed{ T = \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{s^2/S_{xx}}} \sim t_{n-2} } $$

13. Confidence Interval¶

$$ \boxed{ \hat{\beta}_1 \pm t_{\alpha/2, n-2} \cdot SE(\hat{\beta}_1) } $$
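A minimal sketch of this interval in code, reusing the data from Example 1 above; the 95% level is an illustrative choice:

In [ ]:
# Confidence interval for the slope on the Example 1 data.
import numpy as np
from scipy.stats import t

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 4.2, 6.1, 8.3, 9.9, 12.2, 13.8, 16.1])

n = len(x)
Sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

SSE = np.sum((y - (b0 + b1 * x))**2)
s2 = SSE / (n - 2)              # unbiased estimator of sigma^2
SE_b1 = np.sqrt(s2 / Sxx)       # standard error of the slope

alpha = 0.05
t_crit = t.ppf(1 - alpha / 2, df=n - 2)
print(f"95% CI for beta1: ({b1 - t_crit * SE_b1:.4f}, {b1 + t_crit * SE_b1:.4f})")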

3. Confidence Interval for the Intercept $\beta_0$


Estimator¶

$$ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x} $$

Variance¶

$$ Var(\hat{\beta}_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) $$

Estimated Variance¶

$$ \widehat{Var}(\hat{\beta}_0) = s^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) $$

Standard Error¶

$$ SE(\hat{\beta}_0) = \sqrt{ s^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) } $$

Confidence Interval¶

$$ \boxed{ \hat{\beta}_0 \pm t_{\alpha/2, n-2} \cdot SE(\hat{\beta}_0) } $$
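The intercept interval has the same structure; continuing directly from the slope sketch above:

In [ ]:
# Confidence interval for the intercept, continuing the sketch above.
SE_b0 = np.sqrt(s2 * (1 / n + x.mean()**2 / Sxx))
print(f"95% CI for beta0: ({b0 - t_crit * SE_b0:.4f}, {b0 + t_crit * SE_b0:.4f})")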

4. Hypothesis Testing for Slope and Intercept


Test for Slope¶

Hypotheses¶

$$ H_0: \beta_1 = \beta_1^0 \quad \text{vs} \quad H_1: \beta_1 \neq \beta_1^0 $$

Test Statistic¶

$$ T = \frac{\hat{\beta}_1 - \beta_1^0}{SE(\hat{\beta}_1)} \sim t_{n-2} $$

Decision Rule¶

Reject $H_0$ if:

$$ |T| > t_{\alpha/2, n-2} $$

Test for Intercept¶

Hypotheses¶

$$ H_0: \beta_0 = \beta_0^0 \quad \text{vs} \quad H_1: \beta_0 \neq \beta_0^0 $$

Test Statistic¶

$$ T = \frac{\hat{\beta}_0 - \beta_0^0}{SE(\hat{\beta}_0)} \sim t_{n-2} $$
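A minimal sketch of both tests, continuing from the confidence-interval sketches above and taking the common null values $\beta_1^0 = \beta_0^0 = 0$ (an illustrative choice):

In [ ]:
# t-tests for H0: beta1 = 0 and H0: beta0 = 0, continuing the sketch above.
from scipy.stats import t

T_b1 = (b1 - 0) / SE_b1
T_b0 = (b0 - 0) / SE_b0
p_b1 = 2 * (1 - t.cdf(abs(T_b1), df=n - 2))
p_b0 = 2 * (1 - t.cdf(abs(T_b0), df=n - 2))

print(f"slope:     T = {T_b1:.4f}, p-value = {p_b1:.4g}")
print(f"intercept: T = {T_b0:.4f}, p-value = {p_b0:.4g}")
print(f"reject H0: beta1 = 0 at the 5% level? {abs(T_b1) > t_crit}")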

5. Prediction Intervals in Linear Regression


Mean Response¶

$$ \hat{Y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 $$

Variance (Mean Response)¶

$$ Var(\hat{Y}_0) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) $$

📌 Variance of the Mean Response $\hat{Y}_0$¶

We want to understand why:

$$ Var(\hat{Y}_0) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right), \quad S_{xx} = \sum (x_i - \bar{x})^2. $$

1. Rewrite $\hat{Y}_0$¶

Recall:

$$ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}. $$

Substitute into $\hat{Y}_0$:

$$ \hat{Y}_0 = \bar{Y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x_0 = \bar{Y} + \hat{\beta}_1 (x_0 - \bar{x}). $$

2. Variance Decomposition¶

We now compute:

$$ Var(\hat{Y}_0) = Var\left(\bar{Y} + \hat{\beta}_1 (x_0 - \bar{x})\right). $$

Using variance properties:

$$ Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab \, Cov(X,Y), $$

we get:

$$ Var(\hat{Y}_0) = Var(\bar{Y}) + (x_0 - \bar{x})^2 Var(\hat{\beta}_1) + 2(x_0 - \bar{x}) Cov(\bar{Y}, \hat{\beta}_1). $$

3. Compute Each Term¶

(i) Variance of $\bar{Y}$¶

Since $Y_i$ are independent with variance $\sigma^2$:

$$ Var(\bar{Y}) = \frac{\sigma^2}{n}. $$

(ii) Variance of $\hat{\beta}_1$¶

From previous results:

$$ Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}. $$

(iii) Covariance term¶

We compute:

$$ Cov(\bar{Y}, \hat{\beta}_1). $$

Recall:

$$ \bar{Y} = \frac{1}{n}\sum Y_i, \quad \hat{\beta}_1 = \frac{\sum (x_i - \bar{x}) Y_i}{S_{xx}}. $$

Thus, using bilinearity of covariance and independence of the $Y_i$ (only the $i = j$ terms contribute):

$$ Cov(\bar{Y}, \hat{\beta}_1) = \frac{1}{n S_{xx}} \sum (x_i - \bar{x}) \, Var(Y_i). $$

Since $Var(Y_i)=\sigma^2$:

$$ Cov(\bar{Y}, \hat{\beta}_1) = \frac{\sigma^2}{n S_{xx}} \sum (x_i - \bar{x}). $$

But:

$$ \sum (x_i - \bar{x}) = 0, $$

so:

$$ Cov(\bar{Y}, \hat{\beta}_1) = 0. $$

4. Final Formula¶

Putting everything together:

$$ Var(\hat{Y}_0) = \frac{\sigma^2}{n} + (x_0 - \bar{x})^2 \frac{\sigma^2}{S_{xx}}. $$

Factor out $\sigma^2$:

$$ \boxed{ Var(\hat{Y}_0) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right). } $$

3. Standard error¶

Taking the square root,

$$ SE(\hat{Y}_0) = \sqrt{ \sigma^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) }. $$

Since $\sigma^2$ is unknown, we estimate it by

$$ s^2=\frac{SSE}{n-2}. $$

Thus the estimated standard error is

$$ \widehat{SE}(\hat{Y}_0) = \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) }. $$

4. Distribution for the mean response estimator¶

We now standardize:

$$ \frac{ \hat{Y}_0 - (\beta_0+\beta_1 x_0) }{ \sqrt{ \sigma^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } \sim N(0,1). $$

Also,

$$ \frac{SSE}{\sigma^2}\sim \chi^2_{n-2}, $$

and this quantity is independent of $\hat{Y}_0$.

Therefore,

$$ \frac{ \hat{Y}_0 - (\beta_0+\beta_1 x_0) }{ \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } \sim t_{n-2}. $$

5. Confidence interval for the mean response¶

Hence

$$ P\left( -t_{\alpha/2,n-2} \le \frac{ \hat{Y}_0 - (\beta_0+\beta_1 x_0) }{ \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } \le t_{\alpha/2,n-2} \right)=1-\alpha. $$

Multiplying through by the standard error and rearranging gives

$$ P\left( \hat{Y}_0 - t_{\alpha/2,n-2} \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } \le \beta_0+\beta_1x_0 \le \hat{Y}_0 + t_{\alpha/2,n-2} \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } \right) = 1-\alpha. $$

Therefore, the $(1-\alpha)$ confidence interval for the mean response is

$$ \boxed{ \hat{Y}_0 \pm t_{\alpha/2,n-2} \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } $$

6. Prediction of a new observation¶

Now suppose we want to predict not the mean, but a new observation at $x_0$.

Let

$$ Y_{\text{new}} = \beta_0+\beta_1x_0+\varepsilon_{\text{new}}, $$

where

$$ \varepsilon_{\text{new}} \sim N(0,\sigma^2) $$

and $\varepsilon_{\text{new}}$ is independent of the original sample.

We want an interval for $Y_{\text{new}}$.


7. Prediction error¶

Consider

$$ Y_{\text{new}}-\hat{Y}_0. $$

Write it as

$$ Y_{\text{new}}-\hat{Y}_0 = (\beta_0+\beta_1x_0+\varepsilon_{\text{new}})-\hat{Y}_0. $$

Equivalently,

$$ Y_{\text{new}}-\hat{Y}_0 = \big((\beta_0+\beta_1x_0)-\hat{Y}_0\big) + \varepsilon_{\text{new}}. $$

The first term comes from estimating the regression line; the second is the irreducible random noise of the future observation.


8. Variance of the prediction error¶

Since $\varepsilon_{\text{new}}$ is independent of $\hat{Y}_0$,

$$ Var(Y_{\text{new}}-\hat{Y}_0) = Var\big((\beta_0+\beta_1x_0)-\hat{Y}_0\big) + Var(\varepsilon_{\text{new}}). $$

Because subtracting a constant does not change variance,

$$ Var\big((\beta_0+\beta_1x_0)-\hat{Y}_0\big) = Var(\hat{Y}_0). $$

So

$$ Var(Y_{\text{new}}-\hat{Y}_0) = Var(\hat{Y}_0)+\sigma^2. $$

Using the previous formula for $Var(\hat{Y}_0)$,

$$ Var(Y_{\text{new}}-\hat{Y}_0) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) +\sigma^2. $$

Hence

$$ Var(Y_{\text{new}}-\hat{Y}_0) = \sigma^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right). $$

Therefore,

$$ \boxed{ Var(Y_{\text{new}}-\hat{Y}_0) = \sigma^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } $$

and the estimated standard error is

$$ \widehat{SE}(Y_{\text{new}}-\hat{Y}_0) = \sqrt{ s^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) }. $$

9. $t$-statistic for prediction¶

Using normality and independence,

$$ \frac{ Y_{\text{new}}-\hat{Y}_0 }{ \sqrt{ s^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } \sim t_{n-2}. $$

Therefore,

$$ P\left( -t_{\alpha/2,n-2} \le \frac{ Y_{\text{new}}-\hat{Y}_0 }{ \sqrt{ s^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } \le t_{\alpha/2,n-2} \right) = 1-\alpha. $$

Rearranging yields the prediction interval.


10. Prediction interval¶

Thus, the $(1-\alpha)$ prediction interval for a new observation at $x_0$ is

$$ \boxed{ \hat{Y}_0 \pm t_{\alpha/2,n-2} \sqrt{ s^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } } $$

11. Why is the prediction interval wider?¶

Compare the two formulas:

Confidence interval for the mean response¶

$$ \hat{Y}_0 \pm t_{\alpha/2,n-2} \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } $$

Prediction interval¶

$$ \hat{Y}_0 \pm t_{\alpha/2,n-2} \sqrt{ s^2 \left( 1+\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}} \right) } $$

The prediction interval has an additional term: the constant $1$ inside the parentheses.

This extra $1$ appears because a future observation has its own random error term $\varepsilon_{\text{new}}$ with variance $\sigma^2$.

So:

  • the confidence interval for the mean response reflects uncertainty in estimating the regression line,
  • the prediction interval reflects both:
    • uncertainty in estimating the line,
    • random noise of a new observation.

That is why prediction intervals are always wider.



📊 Confidence Bands and Prediction Bands in Linear Regression¶


1. Motivation¶

So far, we have constructed intervals at a single point $x_0$:

  • Confidence interval for the mean response
  • Prediction interval for a new observation

Now we want something stronger:

intervals that hold for all values of $x$ simultaneously

These are called:

  • Confidence bands
  • Prediction bands

2. From Intervals to Bands¶

Pointwise interval (what we had before)¶

At a fixed $x_0$:

$$ \hat{Y}_0 \pm t_{\alpha/2,n-2} \cdot SE(\hat{Y}_0) $$

👉 Valid only for one specific $x_0$


Band (new concept)¶

We want:

$$ P\left( \text{the entire function } \beta_0 + \beta_1 x \text{ lies inside the band for all } x \right) = 1-\alpha $$

3. Confidence Band for the Mean Response¶

We want a band for:

$$ E[Y \mid x] = \beta_0 + \beta_1 x. $$

Key idea¶

Instead of controlling error at a single point, we control:

$$ \sup_x \left| \frac{\hat{Y}(x) - (\beta_0 + \beta_1 x)} {SE(\hat{Y}(x))} \right| $$

4. Working–Hotelling Confidence Band¶

A classical result gives the band:

$$ \boxed{ \hat{Y}(x) \pm \sqrt{2 F_{\alpha,2,n-2}} \cdot \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) } } $$

where:

  • $F_{\alpha,2,n-2}$ is the upper-$\alpha$ quantile of the $F$ distribution with $2$ and $n-2$ degrees of freedom
  • $s^2 = \frac{SSE}{n-2}$

Interpretation¶

With probability $1-\alpha$:

$$ \beta_0 + \beta_1 x \text{ lies inside the band for all } x. $$

5. Comparison with Pointwise CI¶

Pointwise CI:

$$ \hat{Y}(x) \pm t_{\alpha/2,n-2} \cdot SE(\hat{Y}(x)) $$

Band:

$$ \hat{Y}(x) \pm \sqrt{2F_{\alpha,2,n-2}} \cdot SE(\hat{Y}(x)) $$

Important¶

$$ \sqrt{2F_{\alpha,2,n-2}} > t_{\alpha/2,n-2} $$

👉 Bands are wider than pointwise intervals
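This inequality can be checked numerically for a few sample sizes (an illustrative sketch at $\alpha = 0.05$):

In [ ]:
# Compare the Working-Hotelling multiplier W with the pointwise t multiplier.
import numpy as np
from scipy.stats import t, f

alpha = 0.05
for n in [10, 30, 100]:
    t_crit = t.ppf(1 - alpha / 2, df=n - 2)
    W = np.sqrt(2 * f.ppf(1 - alpha, dfn=2, dfd=n - 2))
    print(f"n = {n:3d}: t = {t_crit:.4f}, W = {W:.4f}")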


6. Prediction Band¶

Now we consider future observations:

$$ Y = \beta_0 + \beta_1 x + \varepsilon. $$

Prediction variance¶

$$ Var(Y - \hat{Y}(x)) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) $$

Prediction Band¶

$$ \boxed{ \hat{Y}(x) \pm \sqrt{2 F_{\alpha,2,n-2}} \cdot \sqrt{ s^2 \left( 1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) } } $$

7. Summary of All Intervals¶

Pointwise Confidence Interval¶

$$ \hat{Y}(x) \pm t_{\alpha/2,n-2} \cdot \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) } $$

Confidence Band¶

$$ \hat{Y}(x) \pm \sqrt{2F_{\alpha,2,n-2}} \cdot \sqrt{ s^2 \left( \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) } $$

Prediction Interval¶

$$ \hat{Y}(x) \pm t_{\alpha/2,n-2} \cdot \sqrt{ s^2 \left( 1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) } $$

Prediction Band¶

$$ \hat{Y}(x) \pm \sqrt{2F_{\alpha,2,n-2}} \cdot \sqrt{ s^2 \left( 1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) } $$

8. Key Insights¶

🔹 Confidence vs Prediction¶

  • Confidence band → uncertainty of regression line
  • Prediction band → uncertainty of data points

🔹 Pointwise vs Simultaneous¶

  • Pointwise: valid at one $x$
  • Band: valid for all $x$ simultaneously

🔹 Width comparison¶

$$ \text{Prediction band} > \text{Confidence band} > \text{Pointwise CI} $$

9. Geometry (important intuition)¶

  • Bands are narrowest at $x = \bar{x}$
  • Bands widen as $|x - \bar{x}|$ increases
  • Shape: hyperbolic envelope around the line

10. Final Interpretation¶

  • Confidence band:

    “With probability $1-\alpha$, the entire true regression line lies inside the band”

  • Prediction band:

    “With probability $1-\alpha$, all future observations lie inside the band (simultaneously)”


⚠️ Important Remark¶

  • Bands are much more conservative than pointwise intervals
  • Useful when making conclusions over a range of $x$ values


Example: Pointwise Confidence Interval, Prediction Interval, Confidence Band, and Prediction Band¶

In this example, we:

  1. fit a simple linear regression model,
  2. compute the least squares regression line,
  3. choose a point $x_0$,
  4. compute:
    • the pointwise confidence interval for the mean response at $x_0$,
    • the pointwise prediction interval for a new observation at $x_0$,
  5. plot:
    • the data,
    • the fitted regression line,
    • the vertical line at $x_0$,
    • the confidence and prediction intervals at $x_0$,
    • the confidence band,
    • the prediction band.

We use the formulas

$$ \hat{Y}(x) = \hat{\beta}_0 + \hat{\beta}_1 x, $$

$$ s^2 = \frac{SSE}{n-2}, $$

$$ SE_{\text{mean}}(x) = \sqrt{ s^2\left( \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) }, $$

$$ SE_{\text{pred}}(x) = \sqrt{ s^2\left( 1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} \right) }. $$

Pointwise intervals:

$$ \hat{Y}(x_0) \pm t_{\alpha/2,n-2} \, SE_{\text{mean}}(x_0), $$

$$ \hat{Y}(x_0) \pm t_{\alpha/2,n-2} \, SE_{\text{pred}}(x_0). $$

Working–Hotelling confidence band:

$$ \hat{Y}(x) \pm W \, SE_{\text{mean}}(x), \qquad W = \sqrt{2F_{\alpha,2,n-2}} $$

Prediction band (using the same simultaneous multiplier here for illustration):

$$ \hat{Y}(x) \pm W \, SE_{\text{pred}}(x) $$
In [7]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t, f
In [8]:
# Sample data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 2.9, 3.8, 5.1, 5.0, 6.4, 7.1, 7.9, 9.2, 9.8], dtype=float)

n = len(x)
x_bar = np.mean(x)
y_bar = np.mean(y)

# OLS estimates
Sxx = np.sum((x - x_bar)**2)
Sxy = np.sum((x - x_bar)*(y - y_bar))

b1 = Sxy / Sxx
b0 = y_bar - b1 * x_bar

# Fitted values and residuals
y_hat = b0 + b1 * x
residuals = y - y_hat

# SSE and variance estimate
SSE = np.sum(residuals**2)
s2 = SSE / (n - 2)
s = np.sqrt(s2)

print(f"b0 = {b0:.4f}")
print(f"b1 = {b1:.4f}")
print(f"SSE = {SSE:.4f}")
print(f"s^2 = {s2:.4f}")
print(f"s = {s:.4f}")
b0 = 1.3000
b1 = 0.8455
SSE = 0.6145
s^2 = 0.0768
s = 0.2772
In [27]:
# Choose the point x0 where we want pointwise intervals
x0 = 4.5

# Predicted mean at x0
y0_hat = b0 + b1 * x0

# Standard errors at x0
SE_mean_x0 = np.sqrt(s2 * (1/n + (x0 - x_bar)**2 / Sxx))
SE_pred_x0 = np.sqrt(s2 * (1 + 1/n + (x0 - x_bar)**2 / Sxx))

# Quantiles
alpha = 0.001
t_crit = t.ppf(1 - alpha/2, df=n-2)
W = np.sqrt(2 * f.ppf(1 - alpha, dfn=2, dfd=n-2))

# Pointwise confidence interval for mean response at x0
ci_mean_lower = y0_hat - t_crit * SE_mean_x0
ci_mean_upper = y0_hat + t_crit * SE_mean_x0

# Pointwise prediction interval for a new observation at x0
pi_lower = y0_hat - t_crit * SE_pred_x0
pi_upper = y0_hat + t_crit * SE_pred_x0

print(f"x0 = {x0}")
print(f"Predicted value y_hat(x0) = {y0_hat:.4f}")
print()
print("Pointwise confidence interval for mean response:")
print(f"({ci_mean_lower:.4f}, {ci_mean_upper:.4f})")
print()
print("Pointwise prediction interval:")
print(f"({pi_lower:.4f}, {pi_upper:.4f})")
print()
print(f"t critical value = {t_crit:.4f}")
print(f"Working-Hotelling multiplier W = {W:.4f}")
x0 = 4.5
Predicted value y_hat(x0) = 5.1045

Pointwise confidence interval for mean response:
(4.6367, 5.5724)

Pointwise prediction interval:
(3.6310, 6.5780)

t critical value = 5.0413
Working-Hotelling multiplier W = 6.0817
In [28]:
# Grid of x-values for plotting the fitted line and bands
x_grid = np.linspace(np.min(x), np.max(x), 400)
y_grid = b0 + b1 * x_grid

# Standard errors on the grid
SE_mean_grid = np.sqrt(s2 * (1/n + (x_grid - x_bar)**2 / Sxx))
SE_pred_grid = np.sqrt(s2 * (1 + 1/n + (x_grid - x_bar)**2 / Sxx))

# Pointwise confidence band (if you want to compare)
ci_mean_grid_lower = y_grid - t_crit * SE_mean_grid
ci_mean_grid_upper = y_grid + t_crit * SE_mean_grid

# Confidence band (Working-Hotelling)
conf_band_lower = y_grid - W * SE_mean_grid
conf_band_upper = y_grid + W * SE_mean_grid

# Prediction band (using same simultaneous multiplier W for illustration)
pred_band_lower = y_grid - W * SE_pred_grid
pred_band_upper = y_grid + W * SE_pred_grid
In [29]:
plt.figure(figsize=(10, 7))

# Scatter plot of data
plt.scatter(x, y, label="Data")

# Least squares line
plt.plot(x_grid, y_grid, linewidth=2, label="LS regression line")

# Confidence band
plt.fill_between(
    x_grid, conf_band_lower, conf_band_upper,
    alpha=0.20, label="Confidence band"
)

# Prediction band
plt.fill_between(
    x_grid, pred_band_lower, pred_band_upper,
    alpha=0.12, label="Prediction band"
)

# Vertical line at x0
plt.axvline(x=x0, linestyle="--", linewidth=1.5, label=r"$x_0$")

# Mark predicted point on regression line
plt.scatter([x0], [y0_hat], s=80, zorder=5, label=r"$\hat{Y}_0$")

# Draw pointwise confidence interval at x0
plt.plot([x0, x0], [ci_mean_lower, ci_mean_upper], linewidth=4, label="Pointwise CI at $x_0$")

# Draw pointwise prediction interval at x0
plt.plot([x0, x0], [pi_lower, pi_upper], linewidth=2, label="Pointwise PI at $x_0$")

# Optional horizontal markers for interval endpoints
plt.plot([x0-0.12, x0+0.12], [ci_mean_lower, ci_mean_lower], linewidth=2)
plt.plot([x0-0.12, x0+0.12], [ci_mean_upper, ci_mean_upper], linewidth=2)

plt.plot([x0-0.18, x0+0.18], [pi_lower, pi_lower], linewidth=2)
plt.plot([x0-0.18, x0+0.18], [pi_upper, pi_upper], linewidth=2)

plt.xlabel("x")
plt.ylabel("y")
plt.title("Simple Linear Regression with Pointwise Intervals and Bands")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

🏥 Hospital Infection Data¶

We analyze a dataset of $n = 58$ hospitals.

  • Response variable: $$ y = \text{infection risk (percentage)} $$

  • Predictor variable: $$ x = \text{average length of stay (days)} $$

We will:

  1. Load and explore the dataset
  2. Fit a simple linear regression model
  3. Compute:
    • coefficients
    • $R^2$
    • ANOVA quantities
  4. Construct:
    • confidence intervals
    • prediction intervals
  5. Visualize:
    • regression line
    • confidence band
    • prediction band

Preview of the dataset¶

In [36]:
import pandas as pd

data = pd.read_csv(
    "hospital.txt",
    sep=r"\s+",
    header=None
)

# If there are more than 2 columns, keep only first two
data = data.iloc[:, :2]

data.columns = ["Stay", "InfectionRisk"]

# Convert to numeric explicitly (very important)
data["Stay"] = pd.to_numeric(data["Stay"], errors="coerce")
data["InfectionRisk"] = pd.to_numeric(data["InfectionRisk"], errors="coerce")

# Drop bad rows if any
data = data.dropna()

print(data.head())
print(data.shape)
   Stay  InfectionRisk
1   5.0          11.20
2  10.0           8.84
3  11.0          11.07
4  13.0          12.78
5  18.0          11.62
(58, 2)
In [37]:
print(data.dtypes)
print(data.describe())
Stay             float64
InfectionRisk    float64
dtype: object
             Stay  InfectionRisk
count   58.000000       58.00000
mean    52.706897       10.04931
std     30.625645        1.44107
min      2.000000        7.39000
25%     28.250000        8.90500
50%     53.000000        9.93000
75%     77.750000       11.06000
max    109.000000       13.95000
In [38]:
import matplotlib.pyplot as plt

x = data["Stay"].values
y = data["InfectionRisk"].values

plt.figure(figsize=(8,5))

plt.scatter(x, y, s=50)

plt.xlabel("Average Length of Stay (days)", fontsize=12)
plt.ylabel("Infection Risk (%)", fontsize=12)
plt.title("Hospital Infection Data", fontsize=14)

plt.grid(True, linestyle="--", alpha=0.4)

plt.tight_layout()
plt.show()
In [42]:
#compute the regression quantities by hand

import numpy as np

n = len(x)
x_bar = np.mean(x)
y_bar = np.mean(y)

Sxx = np.sum((x - x_bar)**2)
Sxy = np.sum((x - x_bar) * (y - y_bar))

b1 = Sxy / Sxx
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x
residuals = y - y_hat

SSE = np.sum(residuals**2)
SSR = np.sum((y_hat - y_bar)**2)
SST = np.sum((y - y_bar)**2)

MSR = SSR / 1
MSE = SSE / (n - 2)
F_stat = MSR / MSE

R2 = SSR / SST
s = np.sqrt(MSE)

print(f"n   = {n}")
print(f"b0  = {b0:.4f}")
print(f"b1  = {b1:.4f}")
print(f"SST = {SST:.4f}")
print(f"SSR = {SSR:.4f}")
print(f"SSE = {SSE:.4f}")
print(f"MSE = {MSE:.4f}")
print(f"s   = {s:.5f}")
print(f"R^2 = {R2:.4f} = {100*R2:.2f}%")
print(f"F   = {F_stat:.4f}")
n   = 58
b0  = 10.2033
b1  = -0.0029
SST = 118.3710
SSR = 0.4563
SSE = 117.9147
MSE = 2.1056
s   = 1.45108
R^2 = 0.0039 = 0.39%
F   = 0.2167
In [43]:
x_grid = np.linspace(np.min(x), np.max(x), 400)
y_grid = b0 + b1 * x_grid

plt.figure(figsize=(8, 5))
plt.scatter(x, y, s=50, label="Data")
plt.plot(x_grid, y_grid, linewidth=2, label="Least squares line")
plt.xlabel("Average Length of Stay (days)", fontsize=12)
plt.ylabel("Infection Risk (%)", fontsize=12)
plt.title("Hospital Infection Data with LS Regression Line", fontsize=14)
plt.grid(True, linestyle="--", alpha=0.4)
plt.legend()
plt.tight_layout()
plt.show()
In [44]:
from scipy.stats import t, f

SE_b1 = np.sqrt(MSE / Sxx)
SE_b0 = np.sqrt(MSE * (1/n + x_bar**2 / Sxx))

t_b1 = b1 / SE_b1
t_b0 = b0 / SE_b0

p_b1 = 2 * (1 - t.cdf(abs(t_b1), df=n-2))
p_b0 = 2 * (1 - t.cdf(abs(t_b0), df=n-2))

anova_table = {
    "Source": ["Regression", "Error", "Total"],
    "DF": [1, n-2, n-1],
    "SS": [SSR, SSE, SST],
    "MS": [MSR, MSE, np.nan],
    "F": [F_stat, np.nan, np.nan],
    "P-value": [1 - f.cdf(F_stat, 1, n-2), np.nan, np.nan]
}

coef_table = {
    "Term": ["Intercept", "Stay"],
    "Coefficient": [b0, b1],
    "SE Coef": [SE_b0, SE_b1],
    "t-value": [t_b0, t_b1],
    "P-value": [p_b0, p_b1]
}

import pandas as pd

print("ANOVA table")
display(pd.DataFrame(anova_table))

print("Coefficient table")
display(pd.DataFrame(coef_table))
ANOVA table
       Source  DF          SS        MS         F  P-value
0  Regression   1    0.456263  0.456263  0.216688  0.64338
1       Error  56  117.914710  2.105620       NaN      NaN
2       Total  57  118.370972       NaN       NaN      NaN
Coefficient table
        Term  Coefficient   SE Coef    t-value  P-value
0  Intercept    10.203286  0.381729  26.729144  0.00000
1       Stay    -0.002921  0.006276  -0.465498  0.64338
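As an optional cross-check of the hand computations (assuming the statsmodels package is available; it is not used elsewhere in this seminar):

In [ ]:
# Cross-check with statsmodels (assumed installed): same coefficients,
# standard errors, t-tests, R^2, and F statistic as computed by hand above.
import statsmodels.api as sm

X = sm.add_constant(x)        # design matrix with an intercept column
model = sm.OLS(y, X).fit()
print(model.summary())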
In [48]:
# Choose any point of interest
x0 = 50

y0_hat = b0 + b1 * x0

SE_mean_x0 = np.sqrt(MSE * (1/n + (x0 - x_bar)**2 / Sxx))
SE_pred_x0 = np.sqrt(MSE * (1 + 1/n + (x0 - x_bar)**2 / Sxx))

alpha = 0.05
t_crit = t.ppf(1 - alpha/2, df=n-2)

ci_mean_lower = y0_hat - t_crit * SE_mean_x0
ci_mean_upper = y0_hat + t_crit * SE_mean_x0

pi_lower = y0_hat - t_crit * SE_pred_x0
pi_upper = y0_hat + t_crit * SE_pred_x0

print(f"x0 = {x0}")
print(f"Predicted mean response at x0: {y0_hat:.4f}")
print()
print(f"95% CI for the mean response at x0: ({ci_mean_lower:.4f}, {ci_mean_upper:.4f})")
print(f"95% PI for a new observation at x0: ({pi_lower:.4f}, {pi_upper:.4f})")
x0 = 50
Predicted mean response at x0: 10.0572

95% CI for the mean response at x0: (9.6740, 10.4404)
95% PI for a new observation at x0: (7.1252, 12.9892)
In [49]:
#confidence band and prediction band

alpha = 0.05

# Pointwise multiplier
t_crit = t.ppf(1 - alpha/2, df=n-2)

# Working-Hotelling multiplier for simultaneous confidence band
W = np.sqrt(2 * f.ppf(1 - alpha, dfn=2, dfd=n-2))

SE_mean_grid = np.sqrt(MSE * (1/n + (x_grid - x_bar)**2 / Sxx))
SE_pred_grid = np.sqrt(MSE * (1 + 1/n + (x_grid - x_bar)**2 / Sxx))

# Pointwise confidence interval curves
pointwise_ci_lower = y_grid - t_crit * SE_mean_grid
pointwise_ci_upper = y_grid + t_crit * SE_mean_grid

# Confidence band
conf_band_lower = y_grid - W * SE_mean_grid
conf_band_upper = y_grid + W * SE_mean_grid

# Prediction band
pred_band_lower = y_grid - W * SE_pred_grid
pred_band_upper = y_grid + W * SE_pred_grid
In [50]:
plt.figure(figsize=(10, 6))

plt.scatter(x, y, s=45, label="Data")
plt.plot(x_grid, y_grid, linewidth=2, label="LS regression line")

plt.fill_between(
    x_grid, conf_band_lower, conf_band_upper,
    alpha=0.20, label="Confidence band"
)

plt.fill_between(
    x_grid, pred_band_lower, pred_band_upper,
    alpha=0.10, label="Prediction band"
)

plt.axvline(x=x0, linestyle="--", linewidth=1.5, label=r"$x_0$")
plt.scatter([x0], [y0_hat], s=80, zorder=5, label=r"$\hat{Y}_0$")

# Pointwise CI at x0
plt.plot([x0, x0], [ci_mean_lower, ci_mean_upper], linewidth=4, label="Pointwise CI at $x_0$")

# Pointwise PI at x0
plt.plot([x0, x0], [pi_lower, pi_upper], linewidth=2, label="Pointwise PI at $x_0$")

plt.xlabel("Average Length of Stay (days)", fontsize=12)
plt.ylabel("Infection Risk (%)", fontsize=12)
plt.title("Hospital Infection Data: Regression, Intervals, and Bands", fontsize=14)
plt.grid(True, linestyle="--", alpha=0.4)
plt.legend()
plt.tight_layout()
plt.show()

EXTRA INTUITION!


📌 What exactly is the Confidence Interval for the Mean Response?¶

We fix a value $x_0$.

Now there are two different random objects we could care about:


1. Mean response (what CI is about)¶

The true mean response at $x_0$ is:

$$ E[Y \mid x_0] = \beta_0 + \beta_1 x_0. $$

👉 This is the average outcome of $Y$ for all units with $x = x_0$.


Interpretation¶

  • If we could repeat the experiment many times at the same $x_0$,
  • then the average of those outcomes would be:
$$ \beta_0 + \beta_1 x_0. $$

Confidence interval answers:¶

“Where is the true regression line at $x_0$?”

So the CI is for:

$$ E[Y \mid x_0] $$

NOT for an individual observation.


2. Prediction (what PI is about)¶

Now consider a new observation:

$$ Y_{\text{new}} = \beta_0 + \beta_1 x_0 + \varepsilon. $$

👉 This includes random noise.


Prediction interval answers:¶

“Where will a new individual value fall?”
