Seminar 12

Logistic Regression


1. The Statistical Problem

We observe data

$$ (x_1,y_1),\dots,(x_n,y_n), $$

where

$$ x_i \in \mathbb{R}^p, \qquad y_i \in \{0,1\}. $$

The goal is to model the conditional probability

$$ \mathbb{P}(Y=1\mid X=x). $$

A naive linear model would be

$$ p(x)=\beta_0+\beta^Tx, $$

but a non-constant linear function is unbounded, so it takes values outside $[0,1]$ for some $x$ and cannot represent a probability.


2. The Logistic Function

The logistic (sigmoid) function is

$$ \sigma(z)=\frac{1}{1+e^{-z}}. $$

It satisfies

$$ 0<\sigma(z)<1. $$

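Logistic regression therefore passes a linear score through the sigmoid. Writing $\theta=(\beta_0,\beta)$ and prepending a constant $1$ to $x$, so that $\theta^Tx=\beta_0+\beta^Tx$, the model assumes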

$$ \mathbb{P}(Y=1\mid X=x)=\sigma(\theta^Tx). $$

Equivalently,

$$ p(x)=\frac{1}{1+e^{-\theta^Tx}}. $$
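
The formula above can be evaluated directly, but `math.exp` overflows for large arguments (for example, `1 / (1 + math.exp(-z))` raises `OverflowError` at `z = -1000`). The sketch below is one possible way to avoid that, assuming scalar inputs; the function name `sigmoid` is chosen here for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form exp(z) / (1 + exp(z)),
    # so that exp is only ever called on a non-positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```

Both branches compute the same function; they differ only in which exponential is evaluated.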

3. Odds and Log-Odds

The odds are

$$ \frac{p(x)}{1-p(x)}. $$

The log-odds (logit) are

$$ \log\left(\frac{p(x)}{1-p(x)}\right). $$

Logistic regression assumes

$$ \log\left(\frac{p(x)}{1-p(x)}\right)=\theta^Tx. $$

Let us derive the sigmoid.

Suppose

$$ \log\left(\frac{p}{1-p}\right)=z. $$

Exponentiating,

$$ \frac{p}{1-p}=e^z. $$

Then

$$ p=e^z(1-p). $$

Thus

$$ p=e^z-e^zp. $$

Hence

$$ p(1+e^z)=e^z. $$

Therefore

$$ p=\frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}. $$
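
The derivation says that the logit and the sigmoid are inverse functions. A minimal numerical check, with the probabilities below chosen arbitrarily for illustration:

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(z: float) -> float:
    """Inverse of the logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.1, 0.5, 0.9):
    z = logit(p)
    # The last column should reproduce p up to rounding error.
    print(p, z, sigmoid(z))
```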

4. Interpretation of the Coefficients

The model is

$$ \log\left(\frac{p(x)}{1-p(x)}\right) = \beta_0+\beta_1x_1+\cdots+\beta_px_p. $$

If $x_j$ increases by one unit while all other variables remain fixed, then the log-odds increase by $\beta_j$.

Exponentiating:

$$ \frac{\text{new odds}}{\text{old odds}} = e^{\beta_j}. $$

Therefore:

  • $e^{\beta_j}$ is the factor by which the odds are multiplied when $x_j$ increases by one unit (the odds ratio).
  • If $\beta_j>0$, increasing $x_j$ increases the probability of class $1$.
  • If $\beta_j<0$, increasing $x_j$ decreases the probability of class $1$.
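
The odds-ratio interpretation can be checked numerically. The sketch below assumes a single feature and hypothetical coefficients `beta0 = -1.0`, `beta1 = 0.7`; these values are not taken from any fitted model.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -1.0, 0.7

def probability(x: float) -> float:
    """P(Y = 1 | X = x) under the one-feature logistic model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(x: float) -> float:
    p = probability(x)
    return p / (1.0 - p)

# Increasing x by one unit multiplies the odds by exp(beta1).
ratio = odds(3.0) / odds(2.0)
print(ratio, math.exp(beta1))

# The change in probability, by contrast, depends on where x starts.
print(probability(1.0) - probability(0.0), probability(6.0) - probability(5.0))
```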

5. Bernoulli Model

Assume

$$ Y_i \mid X_i=x_i \sim \operatorname{Bernoulli}(p_i), $$

where

$$ p_i=\sigma(\theta^Tx_i). $$

Thus

$$ \mathbb{P}(Y_i=y_i\mid X_i=x_i) = p_i^{y_i}(1-p_i)^{1-y_i}. $$

6. Likelihood Function

Assuming the $Y_i$ are conditionally independent given the $x_i$,

$$ L(\theta) = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}. $$

Substituting

$$ p_i=\sigma(\theta^Tx_i), $$

we obtain

$$ L(\theta) = \prod_{i=1}^n \sigma(\theta^Tx_i)^{y_i} \left(1-\sigma(\theta^Tx_i)\right)^{1-y_i}. $$

The log-likelihood is

$$ \ell(\theta) = \sum_{i=1}^n \left[ y_i\log p_i + (1-y_i)\log(1-p_i) \right]. $$

The maximum likelihood estimator is

$$ \hat{\theta} = \arg\max_\theta \ell(\theta). $$

7. Simplifying the Log-Likelihood

Recall:

$$ p_i=\sigma(z_i), \qquad z_i=\theta^Tx_i. $$

Since

$$ \sigma(z)=\frac{e^z}{1+e^z}, $$

we have

$$ \log \sigma(z) = z-\log(1+e^z). $$

Also,

$$ 1-\sigma(z)=\frac{1}{1+e^z}, $$

so

$$ \log(1-\sigma(z)) = -\log(1+e^z). $$

Therefore,

$$ \ell(\theta) = \sum_{i=1}^n \left[ y_i\theta^Tx_i - \log(1+e^{\theta^Tx_i}) \right]. $$
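
As a sanity check, the Bernoulli form and the simplified form of the log-likelihood should agree numerically. The sketch below assumes a small made-up data set and parameter vector; `np.logaddexp(0, z)` is NumPy's stable way to compute $\log(1+e^{z})$.

```python
import numpy as np

# Hypothetical data: the first column of X is the constant 1 (intercept).
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5],
              [1.0, 3.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])
theta = np.array([-1.0, 0.6])

z = X @ theta
p = 1.0 / (1.0 + np.exp(-z))

# Bernoulli form: sum of y*log(p) + (1-y)*log(1-p).
ll_bernoulli = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Simplified form: sum of y*z - log(1 + exp(z)).
ll_simplified = np.sum(y * z - np.logaddexp(0.0, z))

print(ll_bernoulli, ll_simplified)
```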

8. Derivative of the Sigmoid Function

Let

$$ \sigma(z)=\frac{1}{1+e^{-z}}. $$

Differentiate:

$$ \sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2}. $$

But

$$ \sigma(z)=\frac{1}{1+e^{-z}}, $$

and

$$ 1-\sigma(z)=\frac{e^{-z}}{1+e^{-z}}. $$

Therefore,

$$ \sigma'(z)=\sigma(z)(1-\sigma(z)). $$

This identity is used in Section 10 to differentiate $p_i$ when computing the Hessian.
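
The identity $\sigma'(z)=\sigma(z)(1-\sigma(z))$ can be compared against a central finite difference. This is only a numerical illustration; the step `h` and the test points are arbitrary choices.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z: float) -> float:
    """Derivative via the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

h = 1e-6  # finite-difference step (arbitrary small value)
errors = []
for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
    errors.append(abs(numeric - sigmoid_prime(z)))
    print(z, sigmoid_prime(z), numeric)
```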


9. Gradient of the Log-Likelihood

We have

$$ \ell(\theta) = \sum_{i=1}^n \left[ y_i\theta^Tx_i - \log(1+e^{\theta^Tx_i}) \right]. $$

Differentiate:

$$ \nabla_\theta(y_i\theta^Tx_i)=y_ix_i. $$

Also,

$$ \nabla_\theta \log(1+e^{\theta^Tx_i}) = \frac{e^{\theta^Tx_i}}{1+e^{\theta^Tx_i}}x_i. $$

But

$$ \frac{e^{\theta^Tx_i}}{1+e^{\theta^Tx_i}} = \sigma(\theta^Tx_i) = p_i. $$

Therefore,

$$ \nabla_\theta \ell(\theta) = \sum_{i=1}^n (y_i-p_i)x_i. $$

In matrix form, with $X$ the matrix whose rows are $x_i^T$, $y=(y_1,\dots,y_n)^T$ and $p=(p_1,\dots,p_n)^T$:

$$ \nabla_\theta \ell(\theta) = X^T(y-p). $$

Hence, for the negative log-likelihood $J(\theta)=-\ell(\theta)$, which is the function we minimize,

$$ \nabla_\theta J(\theta) = X^T(p-y). $$
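
A common way to gain confidence in a gradient formula is to compare it with finite differences. The sketch below does this for $J(\theta)=-\ell(\theta)$, the negative log-likelihood, on a small made-up data set; the function names `nll` and `nll_gradient` are chosen here for illustration.

```python
import numpy as np

def nll(theta, X, y):
    """Negative log-likelihood J(theta) = sum(log(1 + exp(z)) - y*z)."""
    z = X @ theta
    return np.sum(np.logaddexp(0.0, z) - y * z)

def nll_gradient(theta, X, y):
    """Analytic gradient X^T (p - y)."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y)

# Hypothetical data: the first column of X is the intercept.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])
theta = np.array([-1.0, 0.6])

h = 1e-6  # finite-difference step (arbitrary small value)
numeric = np.array([
    (nll(theta + h * e, X, y) - nll(theta - h * e, X, y)) / (2.0 * h)
    for e in np.eye(len(theta))  # perturb one coordinate at a time
])
analytic = nll_gradient(theta, X, y)
print(numeric, analytic)
```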

10. Hessian Matrix

We know

$$ \nabla_\theta \ell(\theta) = \sum_{i=1}^n(y_i-p_i)x_i. $$

By the chain rule and the identity $\sigma'(z)=\sigma(z)(1-\sigma(z))$,

$$ \nabla_\theta p_i = p_i(1-p_i)x_i, $$

we obtain

$$ \nabla_\theta^2 \ell(\theta) = -\sum_{i=1}^n p_i(1-p_i)x_ix_i^T. $$

Define

$$ W = \operatorname{diag}(p_1(1-p_1),\dots,p_n(1-p_n)). $$

Then

$$ \nabla^2 \ell(\theta) = -X^TWX. $$

Hence

$$ \nabla^2 J(\theta)=X^TWX. $$
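
The matrix $X^TWX$ can be computed without ever forming the $n\times n$ matrix $W$. The sketch below, on hypothetical data, does so and compares the result with the sum of rank-one terms $p_i(1-p_i)x_ix_i^T$.

```python
import numpy as np

# Hypothetical design matrix (first column is the intercept) and parameters.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5],
              [1.0, 3.5]])
theta = np.array([-1.0, 0.6])

p = 1.0 / (1.0 + np.exp(-(X @ theta)))
w = p * (1.0 - p)  # diagonal entries of W

# X^T W X: scale each row of X by its weight, then multiply by X^T.
hessian_J = X.T @ (w[:, None] * X)

# The same matrix as an explicit sum of w_i * x_i x_i^T.
hessian_sum = sum(w_i * np.outer(x_i, x_i) for w_i, x_i in zip(w, X))

print(hessian_J)
print(np.allclose(hessian_J, hessian_sum))
```

Scaling the rows of `X` costs $O(np)$ memory, whereas building `np.diag(w)` would cost $O(n^2)$.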

11. Convexity

For any vector $v$,

$$ v^TX^TWXv = (Xv)^TW(Xv). $$

Therefore,

$$ v^TX^TWXv = \sum_{i=1}^n p_i(1-p_i)(x_i^Tv)^2 \ge 0. $$

Thus

$$ \nabla^2J(\theta)\succeq0. $$

Hence:

  • $J(\theta)$ is convex,
  • $\ell(\theta)$ is concave.
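
Positive semidefiniteness can also be observed numerically: the smallest eigenvalue of $X^TWX$ should never be negative, whatever $\theta$ is. The sketch below uses randomly generated data and parameters (the seed and sizes are arbitrary) and is an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random design matrix with an intercept column and two features.
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])

min_eigenvalues = []
for _ in range(5):
    theta = rng.normal(size=3)  # a random parameter vector
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    hessian_J = X.T @ ((p * (1.0 - p))[:, None] * X)
    # eigvalsh is meant for symmetric matrices and returns real eigenvalues.
    min_eigenvalues.append(np.linalg.eigvalsh(hessian_J).min())

print(min_eigenvalues)
```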

12. Newton's Method

Newton's method:

$$ \theta^{(t+1)} = \theta^{(t)} - \left[ \nabla^2J(\theta^{(t)}) \right]^{-1} \nabla J(\theta^{(t)}). $$

Since

$$ \nabla J(\theta)=X^T(p-y), $$

and

$$ \nabla^2J(\theta)=X^TWX, $$

we get, with $p$ and $W$ evaluated at $\theta^{(t)}$ and assuming $X^TWX$ is invertible,

$$ \theta^{(t+1)} = \theta^{(t)} - (X^TWX)^{-1}X^T(p-y). $$
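
A minimal sketch of this update is given below. It assumes a small hypothetical data set in which the two classes overlap, so that a finite minimizer exists, and it solves the linear system with `np.linalg.solve` instead of forming the inverse explicitly. The tolerance and the iteration cap are arbitrary choices.

```python
import numpy as np

# Hypothetical data with overlapping classes (x = 3 is a 1, x = 4 is a 0).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column plus feature

theta = np.zeros(2)
for iteration in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    grad = X.T @ (p - y)                      # gradient of J
    if np.linalg.norm(grad) < 1e-10:          # stop once the gradient vanishes
        break
    hessian = X.T @ ((p * (1.0 - p))[:, None] * X)   # X^T W X
    # Newton step: solve (X^T W X) d = grad, then move to theta - d.
    theta = theta - np.linalg.solve(hessian, grad)

print("iterations:", iteration)
print("theta:", theta)
```

On well-behaved problems Newton's method typically needs far fewer iterations than gradient descent, at the price of solving a $p\times p$ linear system at every step.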

13. Gradient Descent

Gradient descent with step size (learning rate) $\alpha>0$ uses the update

$$ \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla J(\theta^{(t)}). $$

Thus

$$ \theta^{(t+1)} = \theta^{(t)} - \alpha X^T(p-y). $$

14. Decision Boundary

We predict class $1$ when

$$ p(x)\ge\frac12. $$

Since the sigmoid is increasing,

$$ p(x)\ge\frac12 \iff \theta^Tx\ge0. $$

Thus the decision boundary is

$$ \theta^Tx=0. $$

This is a hyperplane.
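
With a single feature and an intercept, the hyperplane reduces to one point: $\beta_0+\beta_1x=0$ gives $x=-\beta_0/\beta_1$. The sketch below assumes hypothetical coefficients `beta0 = -4.2`, `beta1 = 1.2`, which are not fitted values.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -4.2, 1.2

# The boundary is the x at which the linear score beta0 + beta1 * x is zero.
boundary = -beta0 / beta1

def predict(x: float) -> int:
    """Predict class 1 exactly when the linear score is non-negative."""
    return 1 if beta0 + beta1 * x >= 0 else 0

def probability(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

print("boundary:", boundary)
print("p at boundary:", probability(boundary))
print([predict(x) for x in (1.0, 3.0, 4.0, 6.0)])
```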



Example: Logistic Regression with Full Solution

Problem

Suppose we want to predict whether a student passes an exam.

Let

  • $x$ = number of hours studied,
  • $Y=1$ if the student passes,
  • $Y=0$ if the student fails.

We have the following data:

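| Student | Hours studied $x$ | Result $y$ |
|---------|-------------------|------------|
| 1       | 1                 | 0          |
| 2       | 2                 | 0          |
| 3       | 3                 | 0          |
| 4       | 4                 | 1          |
| 5       | 5                 | 1          |
| 6       | 6                 | 1          |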

We want to fit a logistic regression model.

5. Gradient

Let

$$ z_i=\beta_0+\beta_1x_i. $$

Then

$$ p_i=\frac{1}{1+e^{-z_i}}. $$

The partial derivatives of the negative log-likelihood are

$$ \frac{\partial J}{\partial \beta_0} = \sum_{i=1}^n(p_i-y_i), $$

and

$$ \frac{\partial J}{\partial \beta_1} = \sum_{i=1}^n(p_i-y_i)x_i. $$
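
As a quick check of these formulas on the data above, one can evaluate them at the point $\beta_0=\beta_1=0$ (a starting point chosen here for illustration). There every $z_i=0$, so $p_i=\tfrac12$ for all six students, and

$$ \frac{\partial J}{\partial \beta_0} = \sum_{i=1}^6\left(\tfrac12-y_i\right) = 3-3 = 0, $$

$$ \frac{\partial J}{\partial \beta_1} = \sum_{i=1}^6\left(\tfrac12-y_i\right)x_i = \tfrac12(1+2+3+4+5+6)-(4+5+6) = 10.5-15 = -4.5. $$

A single gradient descent step with step size $\alpha$ from this point therefore leaves $\beta_0$ at $0$ and moves $\beta_1$ to $4.5\alpha$: the slope increases, as expected when more hours go with passing.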
In [6]:
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider, IntSlider

# Function and derivative
def f(x):
    return x**2

def grad_f(x):
    return 2*x

def gradient_descent_plot(x0=4.0, learning_rate=0.2, steps=10):
    xs = [x0]
    
    for _ in range(steps):
        x_new = xs[-1] - learning_rate * grad_f(xs[-1])
        xs.append(x_new)
    
    xs = np.array(xs)
    ys = f(xs)
    
    x_grid = np.linspace(-5, 5, 400)
    
    plt.figure(figsize=(8, 5))
    plt.plot(x_grid, f(x_grid), label=r"$f(x)=x^2$")
    plt.scatter(xs, ys, s=60, zorder=3, label="Gradient descent steps")
    
    for i in range(len(xs)-1):
        plt.arrow(
            xs[i], ys[i],
            xs[i+1] - xs[i],
            ys[i+1] - ys[i],
            length_includes_head=True,
            head_width=0.12,
            alpha=0.7
        )
    
    plt.axvline(0, linestyle="--", alpha=0.5)
    plt.title(
        rf"Gradient descent: $x_0={x0}$, learning rate = {learning_rate}, steps = {steps}"
    )
    plt.xlabel("x")
    plt.ylabel("f(x)")
    plt.grid(True)
    plt.legend()
    plt.show()
    
    print("Final x:", xs[-1])
    print("Final f(x):", ys[-1])

interact(
    gradient_descent_plot,
    x0=FloatSlider(value=4.0, min=-5.0, max=5.0, step=0.1),
    learning_rate=FloatSlider(value=0.2, min=0.01, max=1.2, step=0.01),
    steps=IntSlider(value=10, min=1, max=50, step=1)
);
In [4]:
import numpy as np
import matplotlib.pyplot as plt

# Data
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

# Add intercept column
X = np.column_stack([np.ones(len(x)), x])

X

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def negative_log_likelihood(beta, X, y):
    z = X @ beta
    
    # Stable version of log(1 + exp(z)) - y*z
    return np.sum(np.logaddexp(0, z) - y*z)

def gradient(beta, X, y):
    z = X @ beta
    p = sigmoid(z)
    return X.T @ (p - y)



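# Gradient descent
# Note: this data set is perfectly separable (y = 1 exactly when x >= 4),
# so the loss has no finite minimizer. The beta returned below therefore
# depends on n_iterations and is not a maximum likelihood estimate.
beta = np.array([0.0, 0.0])
learning_rate = 0.1
n_iterations = 10000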

loss_history = []

for k in range(n_iterations):
    grad = gradient(beta, X, y)
    beta = beta - learning_rate * grad
    loss_history.append(negative_log_likelihood(beta, X, y))

beta
Out[4]:
array([-25.4061142 ,   7.29789363])
In [7]:
x_grid = np.linspace(0, 7, 300)
X_grid = np.column_stack([np.ones(len(x_grid)), x_grid])
p_grid = sigmoid(X_grid @ beta)

# Decision boundary: the x at which beta_0 + beta_1 * x = 0, i.e. p(x) = 1/2
decision_boundary = -beta[0] / beta[1]

plt.figure(figsize=(8, 5))
plt.scatter(x, y, label="Observed data")
plt.plot(x_grid, p_grid, label="Fitted logistic curve")
plt.axhline(0.5, linestyle="--", label="Threshold 0.5")
plt.axvline(decision_boundary, linestyle="--", label="Decision boundary")
plt.xlabel("Hours studied")
plt.ylabel("Probability of passing")
plt.title("Logistic Regression Example")
plt.legend()
plt.grid(True)
plt.show()