Poisson regression is used when the response variable is a count:
The response variable satisfies
$$ Y_i \in \{0,1,2,\dots\}. $$Linear regression is not appropriate because it may predict negative values.
Poisson regression solves this by modeling the mean as a positive quantity.
A random variable $Y$ has Poisson distribution with parameter $\lambda>0$ if
$$ P(Y=y)=\frac{e^{-\lambda}\lambda^y}{y!}, \qquad y=0,1,2,\dots $$We write
$$ Y\sim \operatorname{Poisson}(\lambda). $$The mean and variance are
$$ \mathbb{E}[Y]=\lambda, $$and
$$ \operatorname{Var}(Y)=\lambda. $$Thus, for a Poisson random variable,
$$ \mathbb{E}[Y]=\operatorname{Var}(Y). $$This is an important property.
Suppose we observe independent data
$$ (y_i,x_i), \qquad i=1,\dots,n, $$where
$$ x_i= \begin{pmatrix} 1\\ x_{i1}\\ \vdots\\ x_{ip} \end{pmatrix} \in \mathbb{R}^{p+1}. $$We assume
$$ Y_i\sim \operatorname{Poisson}(\lambda_i), $$where the mean depends on the predictors.
The problem is that we need
$$ \lambda_i>0. $$Therefore, instead of modeling $\lambda_i$ directly as a linear function, we model its logarithm:
$$ \log(\lambda_i)=x_i^T\beta. $$Equivalently,
$$ \lambda_i=e^{x_i^T\beta}. $$This automatically guarantees
$$ \lambda_i>0. $$If we used
$$ \lambda_i=x_i^T\beta, $$then the model could produce negative predicted counts.
That is impossible because
$$ \lambda_i=\mathbb{E}[Y_i]>0. $$The log link solves this:
$$ \log(\lambda_i)=x_i^T\beta. $$Since the exponential function is always positive,
$$ \lambda_i=e^{x_i^T\beta}>0. $$Thus Poisson regression is a generalized linear model with:
| Component | Poisson Regression |
|---|---|
| Response distribution | Poisson |
| Mean | $\lambda_i$ |
| Link function | $\log(\lambda_i)$ |
| Linear predictor | $x_i^T\beta$ |
Consider the model
$$ \log(\lambda_i)=\beta_0+\beta_1x_i. $$Then
$$ \lambda_i=e^{\beta_0+\beta_1x_i}. $$Now increase $x_i$ by one unit. The new mean is
$$ \lambda(x_i+1)=e^{\beta_0+\beta_1(x_i+1)}. $$Therefore,
$$ \lambda(x_i+1) = e^{\beta_0+\beta_1x_i+\beta_1} = e^{\beta_1}e^{\beta_0+\beta_1x_i}. $$So
$$ \lambda(x_i+1)=e^{\beta_1}\lambda(x_i). $$Hence,
$$ \frac{\lambda(x_i+1)}{\lambda(x_i)}=e^{\beta_1}. $$Therefore, a one-unit increase in $x_i$ multiplies the expected count by
$$ e^{\beta_1}. $$So:
For example, if
$$ \beta_1=0.2, $$then
$$ e^{0.2}\approx 1.221. $$So a one-unit increase in $x$ increases the expected count by approximately $22.1\%$.
Since
$$ Y_i\sim \operatorname{Poisson}(\lambda_i), $$we have
$$ P(Y_i=y_i) = \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}. $$Assuming independence, the likelihood function is
$$ L(\beta) = \prod_{i=1}^n \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}. $$Since
$$ \lambda_i=e^{x_i^T\beta}, $$we get
$$ L(\beta) = \prod_{i=1}^n \frac{ e^{-e^{x_i^T\beta}} \left(e^{x_i^T\beta}\right)^{y_i} }{y_i!}. $$Using
$$ \left(e^{x_i^T\beta}\right)^{y_i} = e^{y_i x_i^T\beta}, $$we obtain
$$ L(\beta) = \prod_{i=1}^n \frac{ e^{-e^{x_i^T\beta}} e^{y_i x_i^T\beta} }{y_i!}. $$Taking logarithms,
$$ \ell(\beta)=\log L(\beta). $$Therefore,
$$ \ell(\beta) = \sum_{i=1}^n \log \left( \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!} \right). $$So
$$ \ell(\beta) = \sum_{i=1}^n \left( -\lambda_i+y_i\log(\lambda_i)-\log(y_i!) \right). $$Since
$$ \lambda_i=e^{x_i^T\beta} $$and
$$ \log(\lambda_i)=x_i^T\beta, $$we get
$$ \ell(\beta) = \sum_{i=1}^n \left( -e^{x_i^T\beta} + y_i x_i^T\beta - \log(y_i!) \right). $$Thus,
$$ \boxed{ \ell(\beta) = \sum_{i=1}^n \left[ y_i x_i^T\beta - e^{x_i^T\beta} - \log(y_i!) \right]. } $$Since $\log(y_i!)$ does not depend on $\beta$, for optimization we may ignore it.
Thus we maximize
$$ \tilde{\ell}(\beta) = \sum_{i=1}^n \left[ y_i x_i^T\beta - e^{x_i^T\beta} \right]. $$The score function is the gradient of the log-likelihood:
$$ U(\beta)=\nabla_\beta \ell(\beta). $$We have
$$ \ell(\beta) = \sum_{i=1}^n \left[ y_i x_i^T\beta - e^{x_i^T\beta} - \log(y_i!) \right]. $$Differentiate term by term.
First,
$$ \nabla_\beta (y_i x_i^T\beta)=y_i x_i. $$Second,
$$ \nabla_\beta e^{x_i^T\beta} = e^{x_i^T\beta}x_i. $$Therefore,
$$ \nabla_\beta \left[ -y_i? \right] $$More precisely,
$$ \nabla_\beta \left[ y_i x_i^T\beta - e^{x_i^T\beta} \right] = y_i x_i - e^{x_i^T\beta}x_i. $$Hence,
$$ U(\beta) = \sum_{i=1}^n x_i \left( y_i-e^{x_i^T\beta} \right). $$Since
$$ \lambda_i=e^{x_i^T\beta}, $$we get
$$ \boxed{ U(\beta) = \sum_{i=1}^n x_i(y_i-\lambda_i). } $$In matrix form, let
$$ y= \begin{pmatrix} y_1\\ \vdots\\ y_n \end{pmatrix}, \qquad \lambda= \begin{pmatrix} \lambda_1\\ \vdots\\ \lambda_n \end{pmatrix}. $$Then
$$ \boxed{ U(\beta)=X^T(y-\lambda). } $$The maximum likelihood estimator satisfies
$$ X^T(y-\lambda)=0. $$That is,
$$ X^T y = X^T \lambda. $$But since
$$ \lambda_i=e^{x_i^T\beta}, $$this is a nonlinear system.
Therefore, there is generally no closed-form solution.
The Hessian matrix is
$$ H(\beta)=\nabla_\beta^2 \ell(\beta). $$From the score function,
$$ U(\beta)=\sum_{i=1}^n x_i(y_i-\lambda_i). $$Since $y_i$ is constant with respect to $\beta$,
$$ \nabla_\beta(y_i-\lambda_i) = -\nabla_\beta \lambda_i. $$But
$$ \lambda_i=e^{x_i^T\beta}, $$so
$$ \nabla_\beta \lambda_i = \lambda_i x_i. $$Therefore,
$$ \nabla_\beta \left[ x_i(y_i-\lambda_i) \right] = -\lambda_i x_i x_i^T. $$Hence,
$$ \boxed{ H(\beta) = -\sum_{i=1}^n \lambda_i x_i x_i^T. } $$In matrix form,
$$ \boxed{ H(\beta) = -X^T W X, } $$where
$$ W= \operatorname{diag}(\lambda_1,\dots,\lambda_n). $$Because $W$ is positive diagonal, $X^TWX$ is positive semidefinite.
Therefore,
$$ H(\beta) $$is negative semidefinite.
So the log-likelihood is concave.
This is good: any local maximum is a global maximum.
The Fisher information matrix is
$$ I(\beta) = -\mathbb{E}[H(\beta)]. $$For Poisson regression,
$$ H(\beta)=-X^T W X, $$where
$$ W=\operatorname{diag}(\lambda_1,\dots,\lambda_n). $$Since $\lambda_i$ is determined by $x_i$ and $\beta$, we get
$$ \boxed{ I(\beta)=X^T W X. } $$This matrix is used to approximate the variance of the MLE:
$$ \widehat{\operatorname{Var}}(\hat{\beta}) = (X^T\widehat{W}X)^{-1}, $$where
$$ \widehat{W} = \operatorname{diag}(\hat{\lambda}_1,\dots,\hat{\lambda}_n). $$Therefore,
$$ \operatorname{SE}(\hat{\beta}_j) = \sqrt{ \left[ (X^T\widehat{W}X)^{-1} \right]_{jj} }. $$To maximize $\ell(\beta)$, Newton's method uses
$$ \beta^{(new)} = \beta^{(old)} - H(\beta^{(old)})^{-1}U(\beta^{(old)}). $$For Poisson regression,
$$ U(\beta)=X^T(y-\lambda), $$and
$$ H(\beta)=-X^TWX. $$Therefore,
$$ \beta^{(new)} = \beta^{(old)} + (X^TWX)^{-1}X^T(y-\lambda). $$Thus,
$$ \boxed{ \beta^{(new)} = \beta^{(old)} + (X^TWX)^{-1}X^T(y-\lambda). } $$This is the Newton update for Poisson regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# -----------------------------
# 1. Generate example data
# -----------------------------
np.random.seed(42)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
beta_true = np.array([0.5, 0.25])
X = np.column_stack([np.ones(len(x)), x])
lambda_true = np.exp(X @ beta_true)
y = np.random.poisson(lambda_true)
data = pd.DataFrame({
"x": x,
"y": y,
"true_lambda": lambda_true
})
data
| x | y | true_lambda | |
|---|---|---|---|
| 0 | 1.0 | 4 | 2.117000 |
| 1 | 2.0 | 1 | 2.718282 |
| 2 | 3.0 | 3 | 3.490343 |
| 3 | 4.0 | 4 | 4.481689 |
| 4 | 5.0 | 5 | 5.754603 |
| 5 | 6.0 | 7 | 7.389056 |
| 6 | 7.0 | 9 | 9.487736 |
| 7 | 8.0 | 7 | 12.182494 |
| 8 | 9.0 | 13 | 15.642632 |
| 9 | 10.0 | 17 | 20.085537 |
# -----------------------------
# 2. Newton method for Poisson regression
# -----------------------------
def poisson_log_likelihood(beta, X, y):
eta = X @ beta
lam = np.exp(eta)
return np.sum(y * eta - lam) # ignoring constant -log(y!)
def poisson_score(beta, X, y):
eta = X @ beta
lam = np.exp(eta)
return X.T @ (y - lam)
def poisson_hessian(beta, X, y):
eta = X @ beta
lam = np.exp(eta)
W = np.diag(lam)
return -X.T @ W @ X
def fit_poisson_newton(X, y, max_iter=100, tol=1e-8):
beta = np.zeros(X.shape[1])
history = []
for iteration in range(max_iter):
loglik = poisson_log_likelihood(beta, X, y)
score = poisson_score(beta, X, y)
hessian = poisson_hessian(beta, X, y)
# Newton update:
# beta_new = beta - H^{-1} score
step = np.linalg.solve(hessian, score)
beta_new = beta - step
history.append({
"iteration": iteration,
"log_likelihood": loglik,
"beta_0": beta[0],
"beta_1": beta[1],
"step_norm": np.linalg.norm(step)
})
if np.linalg.norm(beta_new - beta) < tol:
beta = beta_new
break
beta = beta_new
return beta, pd.DataFrame(history)
# -----------------------------
# 3. Fit the model
# -----------------------------
beta_hat, history = fit_poisson_newton(X, y)
print("Estimated beta:")
print(beta_hat)
print("\nTrue beta:")
print(beta_true)
history
Estimated beta: [0.52441211 0.22269885] True beta: [0.5 0.25]
| iteration | log_likelihood | beta_0 | beta_1 | step_norm | |
|---|---|---|---|---|---|
| 0 | 0 | -10.000000 | 0.000000 | 0.000000 | 2.412129e+00 |
| 1 | 1 | -347617.856995 | -1.933333 | 1.442424 | 9.878277e-01 |
| 2 | 2 | -127603.034150 | -2.921160 | 1.441189 | 9.669958e-01 |
| 3 | 3 | -46707.887194 | -3.888150 | 1.437839 | 9.109014e-01 |
| 4 | 4 | -16991.219226 | -4.799007 | 1.428791 | 7.623428e-01 |
| 5 | 5 | -6099.977203 | -5.560967 | 1.404628 | 3.884367e-01 |
| 6 | 6 | -2128.994247 | -5.944324 | 1.342018 | 4.964289e-01 |
| 7 | 7 | -692.458283 | -5.471004 | 1.192320 | 1.881206e+00 |
| 8 | 8 | -172.079028 | -3.612416 | 0.901478 | 2.460713e+00 |
| 9 | 9 | 14.217286 | -1.176526 | 0.552844 | 1.180433e+00 |
| 10 | 10 | 68.817154 | -0.013981 | 0.348122 | 4.406504e-01 |
| 11 | 11 | 78.411193 | 0.415875 | 0.251186 | 1.051626e-01 |
| 12 | 12 | 78.946529 | 0.517624 | 0.224609 | 7.016584e-03 |
| 13 | 13 | 78.949068 | 0.524378 | 0.222709 | 3.519188e-05 |
| 14 | 14 | 78.949068 | 0.524412 | 0.222699 | 9.317839e-10 |
# -----------------------------
# 4. Interpretation
# -----------------------------
beta_0_hat, beta_1_hat = beta_hat
multiplicative_effect = np.exp(beta_1_hat)
print(f"Estimated beta_0 = {beta_0_hat:.4f}")
print(f"Estimated beta_1 = {beta_1_hat:.4f}")
print(f"\nEach one-unit increase in x multiplies the expected count by:")
print(f"exp(beta_1) = {multiplicative_effect:.4f}")
print(f"\nThat is approximately a {(multiplicative_effect - 1) * 100:.2f}% increase.")
Estimated beta_0 = 0.5244 Estimated beta_1 = 0.2227 Each one-unit increase in x multiplies the expected count by: exp(beta_1) = 1.2494 That is approximately a 24.94% increase.
# -----------------------------
# 5. Predicted values
# -----------------------------
lambda_hat = np.exp(X @ beta_hat)
data["predicted_lambda"] = lambda_hat
data
| x | y | true_lambda | predicted_lambda | |
|---|---|---|---|---|
| 0 | 1.0 | 4 | 2.117000 | 2.110893 |
| 1 | 2.0 | 1 | 2.718282 | 2.637443 |
| 2 | 3.0 | 3 | 3.490343 | 3.295338 |
| 3 | 4.0 | 4 | 4.481689 | 4.117341 |
| 4 | 5.0 | 5 | 5.754603 | 5.144388 |
| 5 | 6.0 | 7 | 7.389056 | 6.427626 |
| 6 | 7.0 | 9 | 9.487736 | 8.030960 |
| 7 | 8.0 | 7 | 12.182494 | 10.034237 |
| 8 | 9.0 | 13 | 15.642632 | 12.537219 |
| 9 | 10.0 | 17 | 20.085537 | 15.664557 |
# -----------------------------
# 6. Final plot
# -----------------------------
x_grid = np.linspace(0, 11, 200)
X_grid = np.column_stack([np.ones(len(x_grid)), x_grid])
lambda_grid_hat = np.exp(X_grid @ beta_hat)
lambda_grid_true = np.exp(X_grid @ beta_true)
plt.figure(figsize=(9, 6))
plt.scatter(x, y, label="Observed counts")
plt.plot(x_grid, lambda_grid_hat, label="Fitted Poisson regression")
plt.plot(x_grid, lambda_grid_true, linestyle="--", label="True mean function")
plt.xlabel("x")
plt.ylabel("Expected count")
plt.title("Poisson Regression Fitted by Newton's Method")
plt.legend()
plt.grid(True)
plt.show()
In classical linear regression we assume
$$ Y_i = \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip} + \varepsilon_i, $$where
$$ \varepsilon_i \sim N(0,\sigma^2). $$Equivalently,
$$ Y_i \sim N(\mu_i,\sigma^2), $$with
$$ \mu_i = \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip}. $$This model works very well for continuous responses.
However, many real-world problems do not satisfy these assumptions.
Examples:
| Problem | Response Type |
|---|---|
| Spam detection | Binary |
| Number of website clicks | Counts |
| Disease occurrence | Binary |
| Number of accidents | Counts |
| Customer purchase | Binary |
Suppose we use linear regression for probabilities.
Example:
$$ \hat{Y} = -0.3 + 0.8x. $$For some values of $x$,
$$ \hat{Y} < 0 $$or
$$ \hat{Y} > 1. $$But probabilities must satisfy
$$ 0 \leq p \leq 1. $$In linear regression:
$$ \operatorname{Var}(Y_i)=\sigma^2. $$But for Bernoulli random variables:
$$ \operatorname{Var}(Y_i)=p_i(1-p_i). $$The variance depends on the mean.
Counts are not normally distributed.
For example:
$$ Y_i \sim \text{Poisson}(\lambda_i). $$Generalized Linear Models extend linear regression by allowing:
A GLM consists of three components.
The response variable follows a distribution from the exponential family.
Examples:
| Model | Distribution |
|---|---|
| Linear Regression | Normal |
| Logistic Regression | Bernoulli |
| Poisson Regression | Poisson |
We define the linear predictor
$$ \eta_i = \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip}. $$Matrix form:
$$ \eta = X\beta. $$where
$$ X= \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p}\\ 1 & x_{21} & \cdots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}. $$We connect the mean response to the linear predictor using a function:
$$ g(\mu_i)=\eta_i. $$Equivalently,
$$ g(\mu_i)=x_i^T\beta. $$The function $g$ is called the link function.
Suppose
$$ Y_i \sim N(\mu_i,\sigma^2). $$We choose the identity link:
$$ g(\mu)=\mu. $$Then
$$ \mu_i=\eta_i=x_i^T\beta. $$Therefore,
$$ Y_i = x_i^T\beta + \varepsilon_i. $$Thus ordinary least squares regression is a GLM.
Many GLMs come from the exponential family.
A distribution belongs to the exponential family if it can be written as
$$ f(y;\theta) = h(y) \exp \left( \eta(\theta)T(y)-A(\theta) \right). $$Examples:
This common structure allows a unified theory.
Each exponential-family distribution has a natural link function.
| Distribution | Canonical Link |
|---|---|
| Normal | Identity |
| Bernoulli | Logit |
| Poisson | Log |
| Gamma | Inverse |
Canonical links simplify optimization and theory.
A GLM consists of:
Key examples:
| Model | Distribution | Link |
|---|---|---|
| Linear Regression | Normal | Identity |
| Logistic Regression | Bernoulli | Logit |
| Poisson Regression | Poisson | Log |
GLMs unify many classical statistical methods into a single framework.
Normality tests often detect deviations because:
Examples:
| Problem | Effect |
|---|---|
| Long right tail | Positive skewness |
| Heavy tails | High kurtosis |
| Outliers | High kurtosis |
| Asymmetric distribution | Nonzero skewness |
QQ-plots visually reveal:
Produces S-shaped asymmetry.
Extreme points deviate strongly at ends.
Many classical methods assume:
Heavy skewness or kurtosis can affect:
Small deviations from:
do NOT necessarily invalidate statistical procedures.
For large samples:
What matters is:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis, norm
np.random.seed(42)
# Generate normal data
x_normal = np.random.normal(size=1000)
# Generate skewed data
x_skewed = np.random.exponential(size=1000)
# Generate heavy-tailed data
x_heavy = np.random.standard_t(df=3, size=1000)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(x_normal, bins=30, density=True)
axes[0].set_title("Approximately Normal")
axes[1].hist(x_skewed, bins=30, density=True)
axes[1].set_title("Right-Skewed")
axes[2].hist(x_heavy, bins=30, density=True)
axes[2].set_title("Heavy-Tailed")
plt.show()
datasets = {
"Normal": x_normal,
"Skewed": x_skewed,
"Heavy-tailed": x_heavy
}
for name, data in datasets.items():
s = skew(data)
k = kurtosis(data) # excess kurtosis
print(f"{name}:")
print(f" Skewness = {s:.4f}")
print(f" Excess Kurtosis = {k:.4f}")
print()
import scipy.stats as stats
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
stats.probplot(x_normal, dist="norm", plot=axes[0])
axes[0].set_title("Normal QQ-Plot")
stats.probplot(x_skewed, dist="norm", plot=axes[1])
axes[1].set_title("Skewed QQ-Plot")
stats.probplot(x_heavy, dist="norm", plot=axes[2])
axes[2].set_title("Heavy-Tailed QQ-Plot")
plt.show()
Skewness measures asymmetry:
$$ \gamma_1 = \frac{ \mathbb{E}[(X-\mu)^3] }{ \sigma^3 }. $$Kurtosis measures tail heaviness:
$$ \gamma_2 = \frac{ \mathbb{E}[(X-\mu)^4] }{ \sigma^4 }. $$For the normal distribution:
$$ \text{Skewness}=0, $$and
$$ \text{Excess Kurtosis}=0. $$Therefore:
These quantities are fundamental for: