Seminar 9

Linear Regression¶

Simple and Multiple Linear Regression¶


1. Motivation¶

In many problems in data science, we want to understand how a response variable depends on one or more predictor variables.

Examples:

  • Predicting house prices from area, location, and age.
  • Predicting sales from advertising budget.
  • Predicting risk from financial indicators.
  • Predicting demand from price and macroeconomic variables.

Linear regression models the relationship

$$ Y = f(X_1,\dots,X_p) + \varepsilon $$

using a linear function of the predictors.


2. The Linear Regression Model¶

The general multiple linear regression model is

$$ Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i, \quad i=1,\dots,n $$

where

  • $Y_i$ — response variable
  • $X_{ij}$ — predictors (features)
  • $\beta_j$ — unknown parameters
  • $\varepsilon_i$ — random noise

Assumptions:

$$ \varepsilon_i \sim N(0,\sigma^2), \qquad \text{i.i.d.} $$

Thus

$$ \mathbb{E}[Y_i | X_i] = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} $$

and

$$ \operatorname{Var}(Y_i | X_i) = \sigma^2 $$

3. Simple Linear Regression¶

The simplest case has one predictor.

$$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i $$

Interpretation:

  • $\beta_0$ — intercept
  • $\beta_1$ — slope

The expected value of $Y$ is

$$ \mathbb{E}[Y|X] = \beta_0 + \beta_1 X $$

4. Least Squares Estimation¶

We observe data

$$ (X_1,Y_1),\dots,(X_n,Y_n) $$

Define the residuals

$$ e_i = Y_i - (\beta_0 + \beta_1 X_i) $$

The least squares method minimizes

$$ S(\beta_0,\beta_1) = \sum_{i=1}^n e_i^2 $$

that is

$$ S(\beta_0,\beta_1) = \sum_{i=1}^n \left(Y_i-\beta_0-\beta_1 X_i\right)^2 $$

5. Closed-form solution¶

The least squares estimators are

$$ \hat{\beta}_1 = \frac{\sum (X_i-\bar X)(Y_i-\bar Y)} {\sum (X_i-\bar X)^2} $$

and

$$ \hat{\beta}_0 = \bar Y - \hat{\beta}_1 \bar X $$

The fitted regression line is

$$ \hat Y = \hat{\beta}_0 + \hat{\beta}_1 X $$
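These closed-form estimators are easy to verify numerically. A minimal sketch with made-up toy data (the values are illustrative only), cross-checked against NumPy's `np.polyfit`:

```python
import numpy as np

# Toy data (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimators
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Cross-check against NumPy's degree-1 polynomial fit
b1_np, b0_np = np.polyfit(x, y, deg=1)
assert np.isclose(b1, b1_np) and np.isclose(b0, b0_np)
```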

6. Interpretation of the slope¶

The slope coefficient measures the expected change in $Y$ when $X$ increases by one unit. At the population level it equals

$$ \beta_1 = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} $$

If $\beta_1 > 0$:

  • positive relationship

If $\beta_1 < 0$:

  • negative relationship

7. Residuals¶

The residuals are

$$ e_i = Y_i - \hat Y_i $$

Important properties:

$$ \sum_{i=1}^n e_i = 0, \qquad \sum_{i=1}^n X_i e_i = 0 $$

These orthogonality conditions are central to least squares.


8. Variance of the estimator¶

The noise variance is estimated by

$$ \hat{\sigma}^2 = \frac{1}{n-2} \sum e_i^2 $$

The variance of the slope estimator is

$$ \operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum (X_i-\bar X)^2} $$

Estimated standard error:

$$ SE(\hat{\beta}_1) = \sqrt{ \frac{\hat{\sigma}^2}{\sum (X_i-\bar X)^2} } $$
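A short sketch computing $\hat\sigma^2$ and $SE(\hat{\beta}_1)$ directly from the formulas above, on illustrative toy data:

```python
import numpy as np

# Illustrative toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

x_bar = x.mean()
Sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x_bar

# Residuals and the unbiased noise variance estimate (n - 2 degrees of freedom)
e = y - (b0 + b1 * x)
sigma2_hat = np.sum(e ** 2) / (n - 2)

# Estimated standard error of the slope
se_b1 = np.sqrt(sigma2_hat / Sxx)
```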

9. Hypothesis Testing¶

We often test

$$ H_0:\beta_1 = 0 $$

Test statistic:

$$ t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} $$

Under the model assumptions

$$ t \sim t_{n-2} $$

10. Confidence Interval¶

A confidence interval for $\beta_1$ is

$$ \hat{\beta}_1 \pm t_{1-\alpha/2,n-2} SE(\hat{\beta}_1) $$
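The t test and the confidence interval can be put together in a few lines; a sketch using `scipy.stats` on illustrative toy data:

```python
import numpy as np
from scipy import stats

# Illustrative toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

x_bar = x.mean()
Sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x_bar

e = y - (b0 + b1 * x)
se_b1 = np.sqrt(np.sum(e ** 2) / (n - 2) / Sxx)

# t statistic for H0: beta1 = 0, with n - 2 degrees of freedom
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# 95% confidence interval for beta1
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
```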

11. Coefficient of Determination ($R^2$)¶

Define the following sums of squares.

Total variability:

$$ SST = \sum (Y_i-\bar Y)^2 $$

Explained variability:

$$ SSR = \sum (\hat Y_i-\bar Y)^2 $$

Residual variability:

$$ SSE = \sum (Y_i-\hat Y_i)^2 $$

Relationship:

$$ SST = SSR + SSE $$

The coefficient of determination is

$$ R^2 = \frac{SSR}{SST} = 1-\frac{SSE}{SST} $$

Interpretation:

  • proportion of variance explained by the model.

12. Multiple Linear Regression¶

In many data science problems we have multiple predictors.

The model becomes

$$ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i $$

13. Matrix Formulation¶

Define

$$ Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} $$

$$ X = \begin{pmatrix} 1 & X_{11} & \dots & X_{1p} \\ 1 & X_{21} & \dots & X_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \dots & X_{np} \end{pmatrix} $$

$$ \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} $$

The model is

$$ Y = X\beta + \varepsilon $$

14. Least Squares in Matrix Form¶

We minimize

$$ S(\beta) = (Y-X\beta)^T (Y-X\beta) $$

Taking derivatives gives the normal equations

$$ X^T X \hat{\beta} = X^T Y $$

If $X^T X$ is invertible:

$$ \hat{\beta} = (X^T X)^{-1} X^T Y $$
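A quick numerical check of the normal equations on simulated data (the design, coefficients, and seed below are arbitrary choices for illustration). `np.linalg.lstsq` is the numerically preferred solver and should agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])   # intercept and two slopes (arbitrary)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```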

15. Fitted Values and Residuals¶

Predicted values:

$$ \hat Y = X \hat{\beta} $$

Residuals:

$$ e = Y - \hat Y $$

Projection interpretation:

$$ \hat Y = P_X Y $$

where

$$ P_X = X (X^T X)^{-1} X^T $$

is the projection matrix onto the column space of $X$.
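The defining properties of a projection matrix are easy to check numerically; a small sketch with an arbitrarily chosen random design:

```python
import numpy as np

rng = np.random.default_rng(1)

# Small arbitrary design matrix with an intercept column
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])

# Hat (projection) matrix onto the column space of X
P = X @ np.linalg.inv(X.T @ X) @ X.T

# P is symmetric and idempotent, and its trace equals the number of columns of X
assert np.allclose(P, P.T)
assert np.allclose(P @ P, P)
assert np.isclose(np.trace(P), X.shape[1])
```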


16. Variance of the estimator¶

Under the model assumptions

$$ \operatorname{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1} $$

Estimate:

$$ \hat{\sigma}^2 = \frac{SSE}{n-p-1} $$
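The covariance formula $\operatorname{Var}(\hat\beta) = \sigma^2 (X^T X)^{-1}$ can be verified by simulation; a sketch in which the sample size, noise level, and seed are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])
sigma = 0.3

XtX_inv = np.linalg.inv(X.T @ X)

# Monte Carlo: redraw the noise many times and re-estimate beta each time
estimates = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + sigma * rng.normal(size=n)))
    for _ in range(2000)
])

emp_cov = np.cov(estimates, rowvar=False)   # empirical covariance of beta_hat
theo_cov = sigma ** 2 * XtX_inv             # sigma^2 (X^T X)^{-1}
```

The two matrices should agree up to Monte Carlo error.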

17. Testing regression coefficients¶

For each coefficient we test

$$ H_0:\beta_j = 0 $$

Test statistic:

$$ t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} $$

with

$$ t \sim t_{n-p-1} $$

18. Global F-test¶

We test

$$ H_0: \beta_1 = \cdots = \beta_p = 0 $$

Statistic:

$$ F = \frac{SSR/p}{SSE/(n-p-1)} $$

Under $H_0$

$$ F \sim F_{p,n-p-1} $$

19. Important assumptions¶

The classical linear regression model assumes:

  1. Linearity
  2. Independent observations
  3. Homoscedasticity: $\operatorname{Var}(\varepsilon_i) = \sigma^2$
  4. Normal errors (for inference)

20. Regression in Data Science¶

In practice, regression is used for:

  • prediction
  • feature importance
  • causal modeling
  • economic forecasting
  • machine learning baselines

Extensions include:

  • ridge regression
  • lasso
  • generalized linear models
  • nonlinear regression

Summary¶

Linear regression provides:

  • interpretable models
  • statistical inference
  • efficient estimation via least squares
  • a geometric projection interpretation

It is one of the fundamental tools in statistics and data science.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

def simple_linear_regression(x, y, show_plot=True):
    """
    Simple linear regression for one predictor.

    Parameters
    ----------
    x : array-like
        Predictor values.
    y : array-like
        Response values.
    show_plot : bool, default=True
        If True, draws the scatter plot and the fitted regression line.

    Returns
    -------
    results : dict
        Dictionary containing:
        - intercept
        - slope
        - y_hat
        - residuals
        - SSE
        - SST
        - SSR
        - R2
    """
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)

    if len(x) != len(y):
        raise ValueError("x and y must have the same length.")

    n = len(x)
    if n < 2:
        raise ValueError("At least two data points are required.")

    x_bar = np.mean(x)
    y_bar = np.mean(y)

    Sxx = np.sum((x - x_bar) ** 2)
    Sxy = np.sum((x - x_bar) * (y - y_bar))

    if Sxx == 0:
        raise ValueError("All x values are equal, so the slope is undefined.")

    # Least squares estimators
    slope = Sxy / Sxx
    intercept = y_bar - slope * x_bar

    # Fitted values and residuals
    y_hat = intercept + slope * x
    residuals = y - y_hat

    # Sums of squares
    SSE = np.sum((y - y_hat) ** 2)
    SST = np.sum((y - y_bar) ** 2)
    SSR = np.sum((y_hat - y_bar) ** 2)
    R2 = SSR / SST if SST != 0 else 1.0

    if show_plot:
        plt.figure(figsize=(8, 5))
        plt.scatter(x, y, s=60, label="Observed data")

        x_line = np.linspace(np.min(x), np.max(x), 200)
        y_line = intercept + slope * x_line
        plt.plot(x_line, y_line, linewidth=2, label="Least squares line")

        plt.xlabel("x")
        plt.ylabel("y")
        plt.title("Scatter plot with fitted regression line")
        plt.grid(True, alpha=0.3)
        plt.legend()
        plt.show()

    print("Simple Linear Regression Results")
    print("--------------------------------")
    print(f"Intercept (b0) = {intercept:.6f}")
    print(f"Slope (b1)     = {slope:.6f}")
    print(f"Fitted line    = y_hat = {intercept:.6f} + {slope:.6f}x")
    print(f"SSE            = {SSE:.6f}")
    print(f"SSR            = {SSR:.6f}")
    print(f"SST            = {SST:.6f}")
    print(f"R^2            = {R2:.6f}")

    return {
        "intercept": intercept,
        "slope": slope,
        "y_hat": y_hat,
        "residuals": residuals,
        "SSE": SSE,
        "SSR": SSR,
        "SST": SST,
        "R2": R2
    }
In [2]:
# Input data
hours_studied = [20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14]
grade_on_exam = [89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76]

# Run simple linear regression
results = simple_linear_regression(hours_studied, grade_on_exam, show_plot=True)
Simple Linear Regression Results
--------------------------------
Intercept (b0) = 26.741987
Slope (b1)     = 3.216346
Fitted line    = y_hat = 26.741987 + 3.216346x
SSE            = 201.386218
SSR            = 430.347115
SST            = 631.733333
R^2            = 0.681216
In [3]:
import numpy as np
from scipy import stats

def regression_f_test(x, y, alpha=0.05):
    """
    F-test for significance of simple linear regression.

    H0: beta1 = 0  (no relationship)
    H1: beta1 != 0

    Returns
    -------
    dictionary with regression and test results
    """

    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)

    n = len(x)

    x_bar = np.mean(x)
    y_bar = np.mean(y)

    Sxx = np.sum((x - x_bar)**2)
    Sxy = np.sum((x - x_bar)*(y - y_bar))

    if Sxx == 0:
        raise ValueError("All x values are equal, so the slope is undefined.")

    # regression coefficients
    b1 = Sxy / Sxx
    b0 = y_bar - b1*x_bar

    # fitted values
    y_hat = b0 + b1*x

    # sums of squares
    SSE = np.sum((y - y_hat)**2)
    SST = np.sum((y - y_bar)**2)
    SSR = SST - SSE

    # degrees of freedom
    df_reg = 1
    df_err = n - 2

    # mean squares
    MSR = SSR / df_reg
    MSE = SSE / df_err

    # F statistic
    F = MSR / MSE

    # p-value (survival function avoids cancellation for very small p-values)
    p_value = stats.f.sf(F, df_reg, df_err)

    # critical value
    F_crit = stats.f.ppf(1 - alpha, df_reg, df_err)

    print("F-Test for Significance of Regression")
    print("------------------------------------")
    print(f"n = {n}")
    print(f"Intercept (b0) = {b0:.6f}")
    print(f"Slope (b1)     = {b1:.6f}")
    print()
    print("ANOVA Table")
    print("------------------------------------")
    print(f"SSR (Regression) = {SSR:.4f}")
    print(f"SSE (Error)      = {SSE:.4f}")
    print(f"SST (Total)      = {SST:.4f}")
    print()
    print(f"MSR = {MSR:.4f}")
    print(f"MSE = {MSE:.4f}")
    print()
    print(f"F statistic = {F:.4f}")
    print(f"F critical  = {F_crit:.4f}")
    print(f"p-value     = {p_value:.6f}")
    print()

    if F > F_crit:
        print("Reject H0: There is a significant linear relationship.")
    else:
        print("Fail to reject H0: No significant relationship detected.")

    return {
        "F": F,
        "p_value": p_value,
        "b0": b0,
        "b1": b1,
        "SSE": SSE,
        "SSR": SSR,
        "SST": SST
    }
In [4]:
# Data
hours_studied = [20,16,20,18,17,16,15,17,15,16,15,17,16,17,14]
grade_on_exam = [89,72,93,84,81,75,70,82,69,83,80,83,81,84,76]

# Run the F-test
result = regression_f_test(hours_studied, grade_on_exam, alpha=0.05)
F-Test for Significance of Regression
------------------------------------
n = 15
Intercept (b0) = 26.741987
Slope (b1)     = 3.216346

ANOVA Table
------------------------------------
SSR (Regression) = 430.3471
SSE (Error)      = 201.3862
SST (Total)      = 631.7333

MSR = 430.3471
MSE = 15.4912

F statistic = 27.7800
F critical  = 4.6672
p-value     = 0.000151

Reject H0: There is a significant linear relationship.
In [5]:
import numpy as np

def coefficient_of_determination(x, y):
    """
    Computes the coefficient of determination (R^2)
    for a simple linear regression model.
    """

    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)

    n = len(x)

    x_bar = np.mean(x)
    y_bar = np.mean(y)

    # Compute regression coefficients
    Sxx = np.sum((x - x_bar)**2)
    Sxy = np.sum((x - x_bar)*(y - y_bar))

    if Sxx == 0:
        raise ValueError("All x values are equal, so the slope is undefined.")

    b1 = Sxy / Sxx
    b0 = y_bar - b1*x_bar
    b0 = y_bar - b1*x_bar

    # Predicted values
    y_hat = b0 + b1*x

    # Sums of squares
    SSE = np.sum((y - y_hat)**2)
    SST = np.sum((y - y_bar)**2)
    SSR = SST - SSE

    # R^2
    R2 = SSR / SST

    print("Coefficient of Determination")
    print("-----------------------------")
    print(f"SST = {SST:.4f}")
    print(f"SSR = {SSR:.4f}")
    print(f"SSE = {SSE:.4f}")
    print()
    print(f"R^2 = {R2:.6f}")
    print()
    print(f"Interpretation: {R2*100:.2f}% of the variability in y")
    print("is explained by the linear regression model.")

    return R2
In [6]:
hours_studied = [20,16,20,18,17,16,15,17,15,16,15,17,16,17,14]
grade_on_exam = [89,72,93,84,81,75,70,82,69,83,80,83,81,84,76]

R2 = coefficient_of_determination(hours_studied, grade_on_exam)
Coefficient of Determination
-----------------------------
SST = 631.7333
SSR = 430.3471
SSE = 201.3862

R^2 = 0.681216

Interpretation: 68.12% of the variability in y
is explained by the linear regression model.