We observe data
$$ (x_1,y_1),\dots,(x_n,y_n), $$where
$$ x_i \in \mathbb{R}^p, \qquad y_i \in \{0,1\}. $$The goal is to model the conditional probability
$$ \mathbb{P}(Y=1\mid X=x). $$A naive linear model would be
$$ p(x)=\beta_0+\beta^Tx, $$but linear functions can produce values outside $[0,1]$, so they cannot represent probabilities.
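To see the problem concretely, the following sketch fits an ordinary least-squares line to six binary labels and evaluates it at the observed inputs. The data values are illustrative assumptions, and `np.polyfit` simply stands in for any linear fit.

```python
import numpy as np

# Illustrative data (an assumption): one input variable and binary labels.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

# Least-squares line p(x) = beta0 + beta1 * x.
# np.polyfit returns coefficients from highest degree to lowest.
beta1, beta0 = np.polyfit(x, y, deg=1)
p_linear = beta0 + beta1 * x

print("fitted values:", p_linear)
print("min:", p_linear.min(), "max:", p_linear.max())
```

On this data the fitted value at the smallest input is negative and the fitted value at the largest input exceeds 1, so neither can be read as a probability.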
The logistic (sigmoid) function is
$$ \sigma(z)=\frac{1}{1+e^{-z}}. $$It satisfies
$$ 0<\sigma(z)<1. $$Therefore logistic regression assumes
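$$ \mathbb{P}(Y=1\mid X=x)=\sigma(\theta^Tx), $$where $\theta=(\beta_0,\beta)$ and $x$ is augmented with a leading $1$, so that $\theta^Tx=\beta_0+\beta^Tx$. Equivalently,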
$$ p(x)=\frac{1}{1+e^{-\theta^Tx}}. $$The odds are
$$ \frac{p(x)}{1-p(x)}. $$The log-odds (logit) are
$$ \log\left(\frac{p(x)}{1-p(x)}\right). $$Logistic regression assumes
$$ \log\left(\frac{p(x)}{1-p(x)}\right)=\theta^Tx. $$Let us derive the sigmoid.
Suppose
$$ \log\left(\frac{p}{1-p}\right)=z. $$Exponentiating,
$$ \frac{p}{1-p}=e^z. $$Then
$$ p=e^z(1-p). $$Thus
$$ p=e^z-e^zp. $$Hence
$$ p(1+e^z)=e^z. $$Therefore
$$ p=\frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}. $$The model is
$$ \log\left(\frac{p(x)}{1-p(x)}\right) = \beta_0+\beta_1x_1+\cdots+\beta_px_p. $$If $x_j$ increases by one unit while all other variables remain fixed, then the log-odds increase by $\beta_j$.
Exponentiating:
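$$ \frac{\text{new odds}}{\text{old odds}} = e^{\beta_j}. $$Therefore a one-unit increase in $x_j$ multiplies the odds by $e^{\beta_j}$: the odds rise if $\beta_j>0$, fall if $\beta_j<0$, and are unchanged if $\beta_j=0$.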
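The multiplicative effect on the odds can be checked numerically. The sketch below uses hypothetical coefficients and a hypothetical input point (none of these values come from the text), increases one feature by one unit, and compares the resulting odds ratio with $e^{\beta_j}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients and input, chosen only for illustration.
beta0 = -1.0
beta = np.array([0.8, -0.5])
x_old = np.array([2.0, 1.0])

j = 0  # index of the feature that increases by one unit
x_new = x_old.copy()
x_new[j] += 1.0

odds_old = odds(sigmoid(beta0 + beta @ x_old))
odds_new = odds(sigmoid(beta0 + beta @ x_new))
odds_ratio = odds_new / odds_old

# The two printed numbers should agree up to rounding error.
print(odds_ratio, np.exp(beta[j]))
```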
Assume
$$ Y_i \mid X_i=x_i \sim \operatorname{Bernoulli}(p_i), $$where
$$ p_i=\sigma(\theta^Tx_i). $$Thus
$$ \mathbb{P}(Y_i=y_i\mid X_i=x_i) = p_i^{y_i}(1-p_i)^{1-y_i}. $$Assuming conditional independence,
$$ L(\theta) = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}. $$Substituting
$$ p_i=\sigma(\theta^Tx_i), $$we obtain
$$ L(\theta) = \prod_{i=1}^n \sigma(\theta^Tx_i)^{y_i} \left(1-\sigma(\theta^Tx_i)\right)^{1-y_i}. $$The log-likelihood is
$$ \ell(\theta) = \sum_{i=1}^n \left[ y_i\log p_i + (1-y_i)\log(1-p_i) \right]. $$The maximum likelihood estimator is
$$ \hat{\theta} = \arg\max_\theta \ell(\theta). $$Recall:
$$ p_i=\sigma(z_i), \qquad z_i=\theta^Tx_i. $$Since
$$ \sigma(z)=\frac{e^z}{1+e^z}, $$we have
$$ \log \sigma(z) = z-\log(1+e^z). $$Also,
$$ 1-\sigma(z)=\frac{1}{1+e^z}, $$so
$$ \log(1-\sigma(z)) = -\log(1+e^z). $$Therefore,
$$ \ell(\theta) = \sum_{i=1}^n \left[ y_i\theta^Tx_i - \log(1+e^{\theta^Tx_i}) \right]. $$Let
$$ \sigma(z)=\frac{1}{1+e^{-z}}. $$Differentiate:
$$ \sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2}. $$But
$$ \sigma(z)=\frac{1}{1+e^{-z}}, $$and
$$ 1-\sigma(z)=\frac{e^{-z}}{1+e^{-z}}. $$Therefore,
$$ \sigma'(z)=\sigma(z)(1-\sigma(z)). $$This identity is what gives the gradient and Hessian of the log-likelihood a simple closed form.
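As a sanity check on the derivative formula, the sketch below compares $\sigma(z)(1-\sigma(z))$ with a central finite-difference approximation of $\sigma'(z)$ on a small grid. The grid and the step size `h` are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
h = 1e-6  # finite-difference step (an arbitrary small value)

# Central difference approximation of sigma'(z).
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
# Closed form from the identity sigma'(z) = sigma(z) * (1 - sigma(z)).
analytic = sigmoid(z) * (1.0 - sigmoid(z))

max_abs_error = np.max(np.abs(numeric - analytic))
print("largest discrepancy:", max_abs_error)
```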
We have
$$ \ell(\theta) = \sum_{i=1}^n \left[ y_i\theta^Tx_i - \log(1+e^{\theta^Tx_i}) \right]. $$Differentiate:
$$ \nabla_\theta(y_i\theta^Tx_i)=y_ix_i. $$Also,
$$ \nabla_\theta \log(1+e^{\theta^Tx_i}) = \frac{e^{\theta^Tx_i}}{1+e^{\theta^Tx_i}}x_i. $$But
$$ \frac{e^{\theta^Tx_i}}{1+e^{\theta^Tx_i}} = \sigma(\theta^Tx_i) = p_i. $$Therefore,
$$ \nabla_\theta \ell(\theta) = \sum_{i=1}^n (y_i-p_i)x_i. $$In matrix form:
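$$ \nabla_\theta \ell(\theta) = X^T(y-p), $$where $X$ is the matrix with rows $x_i^T$, $y=(y_1,\dots,y_n)^T$, and $p=(p_1,\dots,p_n)^T$. Defining the negative log-likelihood $J(\theta)=-\ell(\theta)$, we get
$$ \nabla_\theta J(\theta) = X^T(p-y). $$We know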
$$ \nabla_\theta \ell(\theta) = \sum_{i=1}^n(y_i-p_i)x_i. $$Since
$$ \nabla_\theta p_i = p_i(1-p_i)x_i, $$we obtain
$$ \nabla_\theta^2 \ell(\theta) = -\sum_{i=1}^n p_i(1-p_i)x_ix_i^T. $$Define
$$ W = \operatorname{diag}(p_1(1-p_1),\dots,p_n(1-p_n)). $$Then
$$ \nabla^2 \ell(\theta) = -X^TWX. $$Hence
$$ \nabla^2 J(\theta)=X^TWX. $$For any vector $v$,
$$ v^TX^TWXv = (Xv)^TW(Xv). $$Therefore,
$$ v^TX^TWXv = \sum_{i=1}^n p_i(1-p_i)(x_i^Tv)^2 \ge 0. $$Thus
$$ \nabla^2J(\theta)\succeq0. $$Hence $J$ is convex, so every local minimum of $J$ is a global minimum.
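The positive semidefiniteness of $X^TWX$ can also be checked numerically. The sketch below builds the matrix for a randomly generated design matrix and an arbitrary parameter vector (both are assumptions for illustration), inspects its eigenvalues, and verifies the quadratic-form identity for one random vector $v$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: intercept column plus two random features.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta = np.array([0.3, -1.2, 0.7])  # arbitrary parameter vector

p = 1.0 / (1.0 + np.exp(-(X @ theta)))
w = p * (1.0 - p)              # diagonal entries of W
H = X.T @ (w[:, None] * X)     # X^T W X without forming the n x n matrix W

# Eigenvalues of a symmetric matrix; all should be nonnegative.
eigenvalues = np.linalg.eigvalsh(H)

# Check v^T H v = sum_i p_i (1 - p_i) (x_i^T v)^2 for one random v.
v = rng.normal(size=3)
quad_form = v @ H @ v
weighted_sum = np.sum(w * (X @ v) ** 2)

print("eigenvalues:", eigenvalues)
print(quad_form, weighted_sum)
```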
Newton's method:
$$ \theta^{(t+1)} = \theta^{(t)} - \left[ \nabla^2J(\theta^{(t)}) \right]^{-1} \nabla J(\theta^{(t)}). $$Since
$$ \nabla J(\theta)=X^T(p-y), $$and
$$ \nabla^2J(\theta)=X^TWX, $$we get
$$ \theta^{(t+1)} = \theta^{(t)} - (X^TWX)^{-1}X^T(p-y). $$Gradient descent update:
$$ \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla J(\theta^{(t)}). $$Thus
$$ \theta^{(t+1)} = \theta^{(t)} - \alpha X^T(p-y). $$We predict class $1$ when
$$ p(x)\ge\frac12. $$Since the sigmoid is increasing,
$$ p(x)\ge\frac12 \iff \theta^Tx\ge0. $$Thus the decision boundary is
$$ \theta^Tx=0. $$This is a hyperplane.
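With a single input variable the hyperplane reduces to the point $x=-\theta_0/\theta_1$. The sketch below uses a hypothetical parameter vector (not a fitted value from the text) and shows that thresholding $p(x)$ at $1/2$ gives the same classes as checking the sign of $\theta^Tx$.

```python
import numpy as np

theta = np.array([-3.5, 1.0])  # hypothetical (intercept, slope)

def predict_proba(x, theta):
    z = theta[0] + theta[1] * x
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(x, theta):
    # p(x) >= 1/2 exactly when theta^T x >= 0, so the sign of z suffices.
    return (theta[0] + theta[1] * x >= 0).astype(int)

x = np.array([1.0, 3.0, 3.5, 4.0, 6.0])
boundary = -theta[0] / theta[1]
classes = predict_class(x, theta)
classes_from_proba = (predict_proba(x, theta) >= 0.5).astype(int)

print("boundary:", boundary)
print("classes:", classes)
```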
Suppose we want to predict whether a student passes an exam.
Let $x$ denote the number of hours studied, and let $y=1$ if the student passes and $y=0$ otherwise.
We have the following data:
| Student | Hours studied $x$ | Result $y$ |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 3 | 0 |
| 4 | 4 | 1 |
| 5 | 5 | 1 |
| 6 | 6 | 1 |
We want to fit a logistic regression model.
Let
$$ z_i=\beta_0+\beta_1x_i. $$Then
$$ p_i=\frac{1}{1+e^{-z_i}}. $$The partial derivatives of the negative log-likelihood are
$$ \frac{\partial J}{\partial \beta_0} = \sum_{i=1}^n(p_i-y_i), $$and
$$ \frac{\partial J}{\partial \beta_1} = \sum_{i=1}^n(p_i-y_i)x_i. $$
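The two partial derivatives can be verified against central finite differences of $J$ on the data from the table. In the sketch below, the evaluation point `(b0, b1)` and the step `h` are arbitrary choices, and the function names are introduced here only for illustration.

```python
import numpy as np

# Data from the table: hours studied and exam result.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

def J(beta0, beta1):
    # Negative log-likelihood: sum of log(1 + e^z) - y * z.
    z = beta0 + beta1 * x
    return np.sum(np.logaddexp(0.0, z) - y * z)

def partials(beta0, beta1):
    # Closed-form partial derivatives of J.
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    return np.sum(p - y), np.sum((p - y) * x)

b0, b1 = -1.0, 0.5  # arbitrary evaluation point
h = 1e-6            # finite-difference step

num_d0 = (J(b0 + h, b1) - J(b0 - h, b1)) / (2 * h)
num_d1 = (J(b0, b1 + h) - J(b0, b1 - h)) / (2 * h)
ana_d0, ana_d1 = partials(b0, b1)

print(num_d0, ana_d0)
print(num_d1, ana_d1)
```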
# Toy example: gradient descent on f(x) = x^2 with interactive sliders
# (requires ipywidgets in a Jupyter environment)
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider, IntSlider

# Function and derivative
def f(x):
    return x**2

def grad_f(x):
    return 2*x

def gradient_descent_plot(x0=4.0, learning_rate=0.2, steps=10):
    # Run gradient descent and record every iterate
    xs = [x0]
    for _ in range(steps):
        x_new = xs[-1] - learning_rate * grad_f(xs[-1])
        xs.append(x_new)
    xs = np.array(xs)
    ys = f(xs)

    # Plot the function and the iterates
    x_grid = np.linspace(-5, 5, 400)
    plt.figure(figsize=(8, 5))
    plt.plot(x_grid, f(x_grid), label=r"$f(x)=x^2$")
    plt.scatter(xs, ys, s=60, zorder=3, label="Gradient descent steps")

    # Draw an arrow from each iterate to the next
    for i in range(len(xs)-1):
        plt.arrow(
            xs[i], ys[i],
            xs[i+1] - xs[i],
            ys[i+1] - ys[i],
            length_includes_head=True,
            head_width=0.12,
            alpha=0.7
        )

    plt.axvline(0, linestyle="--", alpha=0.5)
    plt.title(
        rf"Gradient descent: $x_0={x0}$, learning rate = {learning_rate}, steps = {steps}"
    )
    plt.xlabel("x")
    plt.ylabel("f(x)")
    plt.grid(True)
    plt.legend()
    plt.show()
    print("Final x:", xs[-1])
    print("Final f(x):", ys[-1])

interact(
    gradient_descent_plot,
    x0=FloatSlider(value=4.0, min=-5.0, max=5.0, step=0.1),
    learning_rate=FloatSlider(value=0.2, min=0.01, max=1.2, step=0.01),
    steps=IntSlider(value=10, min=1, max=50, step=1)
);
import numpy as np
import matplotlib.pyplot as plt

# Data
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

# Add intercept column
X = np.column_stack([np.ones(len(x)), x])
X

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def negative_log_likelihood(beta, X, y):
    z = X @ beta
    # Stable version of log(1 + exp(z)) - y*z
    return np.sum(np.logaddexp(0, z) - y*z)

def gradient(beta, X, y):
    # Gradient of the negative log-likelihood: X^T (p - y)
    z = X @ beta
    p = sigmoid(z)
    return X.T @ (p - y)

# Gradient descent
beta = np.array([0.0, 0.0])
learning_rate = 0.1
n_iterations = 10000
loss_history = []
for k in range(n_iterations):
    grad = gradient(beta, X, y)
    beta = beta - learning_rate * grad
    loss_history.append(negative_log_likelihood(beta, X, y))
beta
array([-25.4061142 , 7.29789363])
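One caveat about these coefficients: in this data set every student with at most 3 hours fails and every student with at least 4 hours passes. The classes are perfectly separated, so the log-likelihood has no finite maximizer and the reported values depend on when the iteration is stopped. The sketch below illustrates this by rerunning gradient descent for several iteration counts. The helper `run_gd`, the smaller learning rate 0.01, and the iteration counts are choices made here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones(len(x)), x])

def nll(beta):
    z = X @ beta
    return np.sum(np.logaddexp(0, z) - y * z)

def run_gd(n_iterations, learning_rate=0.01):
    # Plain gradient descent on the negative log-likelihood, started at zero.
    beta = np.zeros(2)
    for _ in range(n_iterations):
        p = 1 / (1 + np.exp(-(X @ beta)))
        beta = beta - learning_rate * (X.T @ (p - y))
    return beta

losses, norms = [], []
for n in (500, 5_000, 50_000):
    beta_n = run_gd(n)
    losses.append(nll(beta_n))
    norms.append(np.linalg.norm(beta_n))
    print(n, beta_n, "loss:", losses[-1], "boundary:", -beta_n[0] / beta_n[1])
```

The loss keeps decreasing and the coefficients keep changing as the iteration count grows, rather than settling at a fixed vector. The ratio $-\beta_0/\beta_1$ is therefore a more meaningful summary here than the individual coefficients.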
x_grid = np.linspace(0, 7, 300)
X_grid = np.column_stack([np.ones(len(x_grid)), x_grid])
p_grid = sigmoid(X_grid @ beta)

# Decision boundary: p(x) = 1/2 exactly where beta_0 + beta_1 * x = 0
decision_boundary = -beta[0] / beta[1]

plt.figure(figsize=(8, 5))
plt.scatter(x, y, label="Observed data")
plt.plot(x_grid, p_grid, label="Fitted logistic curve")
plt.axhline(0.5, linestyle="--", label="Threshold 0.5")
plt.axvline(decision_boundary, linestyle="--", label="Decision boundary")
plt.xlabel("Hours studied")
plt.ylabel("Probability of passing")
plt.title("Logistic Regression Example")
plt.legend()
plt.grid(True)
plt.show()
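For comparison with gradient descent, the Newton update $\theta^{(t+1)}=\theta^{(t)}-(X^TWX)^{-1}X^T(p-y)$ derived earlier can be sketched in a few lines. This is a minimal sketch, not a production solver: it uses a hypothetical data set in which the two classes overlap (so that a finite minimizer exists), solves the linear system with `np.linalg.solve` instead of forming the inverse, and uses an iteration cap and tolerance chosen arbitrarily.

```python
import numpy as np

# Hypothetical, non-separable data (an assumption for illustration):
# the student with 3 hours passes and the student with 4 hours fails.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)
X = np.column_stack([np.ones(len(x)), x])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta = np.zeros(2)
for iteration in range(25):
    p = sigmoid(X @ beta)
    grad = X.T @ (p - y)                 # gradient of J
    if np.linalg.norm(grad) < 1e-10:     # stop once the gradient vanishes
        break
    w = p * (1.0 - p)                    # diagonal of W
    hessian = X.T @ (w[:, None] * X)     # X^T W X
    beta = beta - np.linalg.solve(hessian, grad)

final_grad_norm = np.linalg.norm(X.T @ (sigmoid(X @ beta) - y))
print("iterations used:", iteration)
print("beta:", beta)
print("gradient norm:", final_grad_norm)
print("decision boundary:", -beta[0] / beta[1])
```

Because this hypothetical data set is symmetric about $x=3.5$, the fitted decision boundary lands at 3.5. On the perfectly separated data used above, the same iteration would not settle, since $W$ tends to zero and $X^TWX$ becomes numerically singular.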