In many problems in data science, we want to understand how a response variable depends on one or more predictor variables.
Examples include predicting a student's exam grade from hours studied, a house price from its size and location, or a patient's blood pressure from age and weight.
Linear regression models the relationship
$$ Y = f(X_1,\dots,X_p) + \varepsilon $$
using a linear function of the predictors.
The general multiple linear regression model is
$$ Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i, \quad i=1,\dots,n $$
where $Y_i$ is the response for observation $i$, $X_{i1},\dots,X_{ip}$ are the predictor values, $\beta_0,\dots,\beta_p$ are unknown coefficients, and $\varepsilon_i$ is a random error term.
Assumptions:
$$ \varepsilon_i \sim N(0,\sigma^2), \qquad \text{i.i.d.} $$
Thus
$$ \mathbb{E}[Y_i | X_i] = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} $$
and
$$ \operatorname{Var}(Y_i | X_i) = \sigma^2 $$
The simplest case has one predictor.
$$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i $$
Interpretation: the expected value of $Y$ is
$$ \mathbb{E}[Y|X] = \beta_0 + \beta_1 X $$
We observe data
$$ (X_1,Y_1),\dots,(X_n,Y_n) $$
Define the residuals
$$ e_i = Y_i - (\beta_0 + \beta_1 X_i) $$
The least squares method minimizes
$$ S(\beta_0,\beta_1) = \sum_{i=1}^n e_i^2 $$
that is
$$ S(\beta_0,\beta_1) = \sum_{i=1}^n \left(Y_i-\beta_0-\beta_1 X_i\right)^2 $$
The least squares estimators are
$$ \hat{\beta}_1 = \frac{\sum (X_i-\bar X)(Y_i-\bar Y)}{\sum (X_i-\bar X)^2} $$
and
$$ \hat{\beta}_0 = \bar Y - \hat{\beta}_1 \bar X $$
The fitted regression line is
$$ \hat Y = \hat{\beta}_0 + \hat{\beta}_1 X $$
The slope coefficient measures the expected change in $Y$ when $X$ increases by one unit.
$$ \beta_1 = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} $$
If $\beta_1 > 0$, $Y$ tends to increase as $X$ increases; if $\beta_1 < 0$, $Y$ tends to decrease as $X$ increases.
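As a quick numerical check (simulated data with illustrative true values $\beta_0 = 3$, $\beta_1 = 1.5$), the least squares slope coincides with the sample covariance of $X$ and $Y$ divided by the sample variance of $X$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate from Y = 3 + 1.5 X + eps (parameter values chosen for illustration)
x = rng.normal(size=200)
y = 3.0 + 1.5 * x + rng.normal(size=200)

# Least squares slope from the centered sums
x_bar, y_bar = x.mean(), y.mean()
slope_ls = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# The same slope as sample covariance over sample variance
slope_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(slope_ls, slope_cov)  # the two values agree up to floating-point error
```

With 200 simulated points, both computations also land close to the true slope 1.5.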
The residuals are
$$ e_i = Y_i - \hat Y_i $$
Important properties:
$$ \sum e_i = 0 $$
$$ \sum X_i e_i = 0 $$
These orthogonality conditions are central to least squares.
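These properties are easy to verify numerically; a minimal sketch on a small made-up dataset:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # made-up data

# Fit by the least squares formulas above
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
e = y - (b0 + b1 * x)  # residuals

# Both orthogonality conditions hold up to floating-point error
print(np.sum(e))       # approximately 0
print(np.sum(x * e))   # approximately 0
```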
The noise variance is estimated by
$$ \hat{\sigma}^2 = \frac{1}{n-2} \sum e_i^2 $$
The variance of the slope estimator is
$$ \operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum (X_i-\bar X)^2} $$
Estimated standard error:
$$ SE(\hat{\beta}_1) = \sqrt{ \frac{\hat{\sigma}^2}{\sum (X_i-\bar X)^2} } $$
We often test
$$ H_0:\beta_1 = 0 $$
Test statistic:
$$ t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} $$
Under the model assumptions
$$ t \sim t_{n-2} $$
A confidence interval for $\beta_1$ is
$$ \hat{\beta}_1 \pm t_{1-\alpha/2,n-2}\, SE(\hat{\beta}_1) $$
Define the sums of squares
Total variability:
$$ SST = \sum (Y_i-\bar Y)^2 $$
Explained variability:
$$ SSR = \sum (\hat Y_i-\bar Y)^2 $$
Residual variability:
$$ SSE = \sum (Y_i-\hat Y_i)^2 $$
Relationship:
$$ SST = SSR + SSE $$
The coefficient of determination is
$$ R^2 = \frac{SSR}{SST} = 1-\frac{SSE}{SST} $$
Interpretation: $R^2$ is the proportion of the variability in $Y$ explained by the regression model; values close to 1 indicate a good linear fit.
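The inference and goodness-of-fit formulas above fit together in a few lines; here is a sketch on a small made-up dataset:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 2.8, 4.5, 4.1, 5.9, 6.8])  # made-up data
n = len(x)

# Least squares fit
x_bar, y_bar = x.mean(), y.mean()
Sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / Sxx
b0 = y_bar - b1 * x_bar
y_hat = b0 + b1 * x
e = y - y_hat

# Noise variance estimate and standard error of the slope
sigma2_hat = np.sum(e ** 2) / (n - 2)
se_b1 = np.sqrt(sigma2_hat / Sxx)

# t test of H0: beta1 = 0 and a 95% confidence interval
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# Sums of squares and R^2 (SST = SSR + SSE up to rounding)
SST = np.sum((y - y_bar) ** 2)
SSE = np.sum(e ** 2)
SSR = np.sum((y_hat - y_bar) ** 2)
R2 = SSR / SST

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"95% CI for beta1: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"R^2 = {R2:.4f}")
```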
In many data science problems we have multiple predictors.
The model becomes
$$ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i $$
Define
$$ Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & X_{11} & \dots & X_{1p} \\ 1 & X_{21} & \dots & X_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \dots & X_{np} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} $$
The model is
$$ Y = X\beta + \varepsilon $$
We minimize
$$ S(\beta) = (Y-X\beta)^T (Y-X\beta) $$
Taking derivatives gives the normal equations
$$ X^T X \hat{\beta} = X^T Y $$
If $X^T X$ is invertible:
$$ \hat{\beta} = (X^T X)^{-1} X^T Y $$
Predicted values:
$$ \hat Y = X \hat{\beta} $$
Residuals:
$$ e = Y - \hat Y $$
Projection interpretation:
$$ \hat Y = P_X Y $$
where
$$ P_X = X (X^T X)^{-1} X^T $$
is the projection matrix onto the column space of $X$.
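A minimal NumPy sketch of these matrix formulas on simulated data (the true coefficients are chosen for illustration; in practice `np.linalg.lstsq` is preferable to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])  # illustrative values
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X^T X beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Projection matrix onto the column space of X
P = X @ np.linalg.inv(X.T @ X) @ X.T
Y_hat = P @ Y

print(beta_hat)                           # close to beta_true
print(np.allclose(P, P.T))                # P is symmetric
print(np.allclose(P @ P, P))              # P is idempotent
print(np.allclose(Y_hat, X @ beta_hat))   # P Y equals X beta_hat
```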
Under the model assumptions
$$ \operatorname{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1} $$
Estimate:
$$ \hat{\sigma}^2 = \frac{SSE}{n-p-1} $$
For each coefficient we test
$$ H_0:\beta_j = 0 $$
Test statistic:
$$ t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} $$
with
$$ t \sim t_{n-p-1} $$
We test the overall hypothesis
$$ H_0: \beta_1 = \cdots = \beta_p = 0 $$
Statistic:
$$ F = \frac{SSR/p}{SSE/(n-p-1)} $$
Under $H_0$
$$ F \sim F_{p,n-p-1} $$
The classical linear regression model assumes:
- Linearity: the mean of $Y$ is a linear function of the predictors.
- Independence: the errors $\varepsilon_i$ are independent.
- Homoscedasticity: the errors have constant variance $\sigma^2$.
- Normality: the errors are normally distributed.
In practice, regression is used for:
- Prediction of new responses from observed predictor values.
- Inference about the strength and direction of relationships.
- Adjusting for additional variables when comparing groups.
Extensions include:
- Polynomial and interaction terms.
- Regularized regression (ridge, lasso).
- Generalized linear models such as logistic regression.
Linear regression provides:
- A simple, interpretable model of the predictor-response relationship.
- Closed-form estimators with well-understood statistical properties.
It is one of the fundamental tools in statistics and data science.
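Before the single-predictor examples that follow, the multiple-regression tests described above can be sketched on simulated data (all parameter values below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 100, 2

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 0.8, 0.0])  # the last coefficient is truly zero
Y = X @ beta_true + rng.normal(size=n)

# Least squares fit
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
e = Y - X @ beta_hat

# Per-coefficient t statistics from Var(beta_hat) = sigma^2 (X^T X)^{-1}
sigma2_hat = e @ e / (n - p - 1)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_stats = beta_hat / se
t_pvalues = 2 * stats.t.sf(np.abs(t_stats), df=n - p - 1)

# Overall F test of H0: beta_1 = ... = beta_p = 0
SSR = np.sum((X @ beta_hat - Y.mean()) ** 2)
SSE = e @ e
F = (SSR / p) / (SSE / (n - p - 1))
F_pvalue = stats.f.sf(F, p, n - p - 1)

for j in range(p + 1):
    print(f"beta_{j}: estimate={beta_hat[j]:.3f}, t={t_stats[j]:.2f}, p={t_pvalues[j]:.4f}")
print(f"F = {F:.2f}, p = {F_pvalue:.6f}")
```

The coefficient that is truly nonzero should come out clearly significant, while the truly zero one typically does not.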
import numpy as np
import matplotlib.pyplot as plt

def simple_linear_regression(x, y, show_plot=True):
    """
    Simple linear regression for one predictor.

    Parameters
    ----------
    x : array-like
        Predictor values.
    y : array-like
        Response values.
    show_plot : bool, default=True
        If True, draws the scatter plot and the fitted regression line.

    Returns
    -------
    results : dict
        Dictionary containing:
        - intercept
        - slope
        - y_hat
        - residuals
        - SSE
        - SST
        - SSR
        - R2
    """
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    if len(x) != len(y):
        raise ValueError("x and y must have the same length.")
    n = len(x)
    if n < 2:
        raise ValueError("At least two data points are required.")
    x_bar = np.mean(x)
    y_bar = np.mean(y)
    Sxx = np.sum((x - x_bar) ** 2)
    Sxy = np.sum((x - x_bar) * (y - y_bar))
    if Sxx == 0:
        raise ValueError("All x values are equal, so the slope is undefined.")
    # Least squares estimators
    slope = Sxy / Sxx
    intercept = y_bar - slope * x_bar
    # Fitted values and residuals
    y_hat = intercept + slope * x
    residuals = y - y_hat
    # Sums of squares
    SSE = np.sum((y - y_hat) ** 2)
    SST = np.sum((y - y_bar) ** 2)
    SSR = np.sum((y_hat - y_bar) ** 2)
    R2 = SSR / SST if SST != 0 else 1.0
    if show_plot:
        plt.figure(figsize=(8, 5))
        plt.scatter(x, y, s=60, label="Observed data")
        x_line = np.linspace(np.min(x), np.max(x), 200)
        y_line = intercept + slope * x_line
        plt.plot(x_line, y_line, linewidth=2, label="Least squares line")
        plt.xlabel("x")
        plt.ylabel("y")
        plt.title("Scatter plot with fitted regression line")
        plt.grid(True, alpha=0.3)
        plt.legend()
        plt.show()
    print("Simple Linear Regression Results")
    print("--------------------------------")
    print(f"Intercept (b0) = {intercept:.6f}")
    print(f"Slope (b1) = {slope:.6f}")
    print(f"Fitted line = y_hat = {intercept:.6f} + {slope:.6f}x")
    print(f"SSE = {SSE:.6f}")
    print(f"SSR = {SSR:.6f}")
    print(f"SST = {SST:.6f}")
    print(f"R^2 = {R2:.6f}")
    return {
        "intercept": intercept,
        "slope": slope,
        "y_hat": y_hat,
        "residuals": residuals,
        "SSE": SSE,
        "SSR": SSR,
        "SST": SST,
        "R2": R2
    }

# Input data
hours_studied = [20, 16, 20, 18, 17, 16, 15, 17, 15, 16, 15, 17, 16, 17, 14]
grade_on_exam = [89, 72, 93, 84, 81, 75, 70, 82, 69, 83, 80, 83, 81, 84, 76]

# Run simple linear regression
results = simple_linear_regression(hours_studied, grade_on_exam, show_plot=True)
Simple Linear Regression Results
--------------------------------
Intercept (b0) = 26.741987
Slope (b1) = 3.216346
Fitted line = y_hat = 26.741987 + 3.216346x
SSE = 201.386218
SSR = 430.347115
SST = 631.733333
R^2 = 0.681216
import numpy as np
from scipy import stats

def regression_f_test(x, y, alpha=0.05):
    """
    F-test for significance of simple linear regression.

    H0: beta1 = 0 (no relationship)
    H1: beta1 != 0

    Returns
    -------
    dictionary with regression and test results
    """
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    n = len(x)
    x_bar = np.mean(x)
    y_bar = np.mean(y)
    Sxx = np.sum((x - x_bar)**2)
    Sxy = np.sum((x - x_bar)*(y - y_bar))
    # regression coefficients
    b1 = Sxy / Sxx
    b0 = y_bar - b1*x_bar
    # fitted values
    y_hat = b0 + b1*x
    # sums of squares
    SSE = np.sum((y - y_hat)**2)
    SST = np.sum((y - y_bar)**2)
    SSR = SST - SSE
    # degrees of freedom
    df_reg = 1
    df_err = n - 2
    # mean squares
    MSR = SSR / df_reg
    MSE = SSE / df_err
    # F statistic
    F = MSR / MSE
    # p-value
    p_value = 1 - stats.f.cdf(F, df_reg, df_err)
    # critical value
    F_crit = stats.f.ppf(1 - alpha, df_reg, df_err)
    print("F-Test for Significance of Regression")
    print("------------------------------------")
    print(f"n = {n}")
    print(f"Intercept (b0) = {b0:.6f}")
    print(f"Slope (b1) = {b1:.6f}")
    print()
    print("ANOVA Table")
    print("------------------------------------")
    print(f"SSR (Regression) = {SSR:.4f}")
    print(f"SSE (Error) = {SSE:.4f}")
    print(f"SST (Total) = {SST:.4f}")
    print()
    print(f"MSR = {MSR:.4f}")
    print(f"MSE = {MSE:.4f}")
    print()
    print(f"F statistic = {F:.4f}")
    print(f"F critical = {F_crit:.4f}")
    print(f"p-value = {p_value:.6f}")
    print()
    if F > F_crit:
        print("Reject H0: There is a significant linear relationship.")
    else:
        print("Fail to reject H0: No significant relationship detected.")
    return {
        "F": F,
        "p_value": p_value,
        "b0": b0,
        "b1": b1,
        "SSE": SSE,
        "SSR": SSR,
        "SST": SST
    }

# Data
hours_studied = [20,16,20,18,17,16,15,17,15,16,15,17,16,17,14]
grade_on_exam = [89,72,93,84,81,75,70,82,69,83,80,83,81,84,76]

# Run the F-test
result = regression_f_test(hours_studied, grade_on_exam, alpha=0.05)
F-Test for Significance of Regression
------------------------------------
n = 15
Intercept (b0) = 26.741987
Slope (b1) = 3.216346

ANOVA Table
------------------------------------
SSR (Regression) = 430.3471
SSE (Error) = 201.3862
SST (Total) = 631.7333

MSR = 430.3471
MSE = 15.4912

F statistic = 27.7800
F critical = 4.6672
p-value = 0.000151

Reject H0: There is a significant linear relationship.
import numpy as np

def coefficient_of_determination(x, y):
    """
    Computes the coefficient of determination (R^2)
    for a simple linear regression model.
    """
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    n = len(x)
    x_bar = np.mean(x)
    y_bar = np.mean(y)
    # Compute regression coefficients
    Sxx = np.sum((x - x_bar)**2)
    Sxy = np.sum((x - x_bar)*(y - y_bar))
    b1 = Sxy / Sxx
    b0 = y_bar - b1*x_bar
    # Predicted values
    y_hat = b0 + b1*x
    # Sums of squares
    SSE = np.sum((y - y_hat)**2)
    SST = np.sum((y - y_bar)**2)
    SSR = SST - SSE
    # R^2
    R2 = SSR / SST
    print("Coefficient of Determination")
    print("-----------------------------")
    print(f"SST = {SST:.4f}")
    print(f"SSR = {SSR:.4f}")
    print(f"SSE = {SSE:.4f}")
    print()
    print(f"R^2 = {R2:.6f}")
    print()
    print(f"Interpretation: {R2*100:.2f}% of the variability in y")
    print("is explained by the linear regression model.")
    return R2

hours_studied = [20,16,20,18,17,16,15,17,15,16,15,17,16,17,14]
grade_on_exam = [89,72,93,84,81,75,70,82,69,83,80,83,81,84,76]

R2 = coefficient_of_determination(hours_studied, grade_on_exam)
Coefficient of Determination
-----------------------------
SST = 631.7333
SSR = 430.3471
SSE = 201.3862

R^2 = 0.681216

Interpretation: 68.12% of the variability in y
is explained by the linear regression model.