Seminar 13

In [ ]:
 

Poisson Regression¶

1. Motivation¶

Poisson regression is used when the response variable is a count:

  • number of clicks,
  • number of accidents,
  • number of emails,
  • number of customers,
  • number of calls,
  • number of failures,
  • number of events in a fixed time interval.

The response variable satisfies

$$ Y_i \in \{0,1,2,\dots\}. $$

Linear regression is not appropriate because it may predict negative values.

Poisson regression solves this by modeling the mean as a positive quantity.


2. Poisson Distribution¶

A random variable $Y$ has Poisson distribution with parameter $\lambda>0$ if

$$ P(Y=y)=\frac{e^{-\lambda}\lambda^y}{y!}, \qquad y=0,1,2,\dots $$

We write

$$ Y\sim \operatorname{Poisson}(\lambda). $$

The mean and variance are

$$ \mathbb{E}[Y]=\lambda, $$

and

$$ \operatorname{Var}(Y)=\lambda. $$

Thus, for a Poisson random variable,

$$ \mathbb{E}[Y]=\operatorname{Var}(Y). $$

This is an important property.


3. The Poisson Regression Model¶

Suppose we observe independent data

$$ (y_i,x_i), \qquad i=1,\dots,n, $$

where

$$ x_i= \begin{pmatrix} 1\\ x_{i1}\\ \vdots\\ x_{ip} \end{pmatrix} \in \mathbb{R}^{p+1}. $$

We assume

$$ Y_i\sim \operatorname{Poisson}(\lambda_i), $$

where the mean depends on the predictors.

The problem is that we need

$$ \lambda_i>0. $$

Therefore, instead of modeling $\lambda_i$ directly as a linear function, we model its logarithm:

$$ \log(\lambda_i)=x_i^T\beta. $$

Equivalently,

$$ \lambda_i=e^{x_i^T\beta}. $$

This automatically guarantees

$$ \lambda_i>0. $$

4. Why Use the Log Link?¶

If we used

$$ \lambda_i=x_i^T\beta, $$

then the model could produce negative predicted counts.

That is impossible because

$$ \lambda_i=\mathbb{E}[Y_i]>0. $$

The log link solves this:

$$ \log(\lambda_i)=x_i^T\beta. $$

Since the exponential function is always positive,

$$ \lambda_i=e^{x_i^T\beta}>0. $$

Thus Poisson regression is a generalized linear model with:

Component Poisson Regression
Response distribution Poisson
Mean $\lambda_i$
Link function $\log(\lambda_i)$
Linear predictor $x_i^T\beta$

5. Interpretation of the Coefficients¶

Consider the model

$$ \log(\lambda_i)=\beta_0+\beta_1x_i. $$

Then

$$ \lambda_i=e^{\beta_0+\beta_1x_i}. $$

Now increase $x_i$ by one unit. The new mean is

$$ \lambda(x_i+1)=e^{\beta_0+\beta_1(x_i+1)}. $$

Therefore,

$$ \lambda(x_i+1) = e^{\beta_0+\beta_1x_i+\beta_1} = e^{\beta_1}e^{\beta_0+\beta_1x_i}. $$

So

$$ \lambda(x_i+1)=e^{\beta_1}\lambda(x_i). $$

Hence,

$$ \frac{\lambda(x_i+1)}{\lambda(x_i)}=e^{\beta_1}. $$

Therefore, a one-unit increase in $x_i$ multiplies the expected count by

$$ e^{\beta_1}. $$

So:

  • if $\beta_1>0$, the expected count increases;
  • if $\beta_1<0$, the expected count decreases;
  • if $\beta_1=0$, the predictor has no effect.

For example, if

$$ \beta_1=0.2, $$

then

$$ e^{0.2}\approx 1.221. $$

So a one-unit increase in $x$ increases the expected count by approximately $22.1\%$.


6. Likelihood Function¶

Since

$$ Y_i\sim \operatorname{Poisson}(\lambda_i), $$

we have

$$ P(Y_i=y_i) = \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}. $$

Assuming independence, the likelihood function is

$$ L(\beta) = \prod_{i=1}^n \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}. $$

Since

$$ \lambda_i=e^{x_i^T\beta}, $$

we get

$$ L(\beta) = \prod_{i=1}^n \frac{ e^{-e^{x_i^T\beta}} \left(e^{x_i^T\beta}\right)^{y_i} }{y_i!}. $$

Using

$$ \left(e^{x_i^T\beta}\right)^{y_i} = e^{y_i x_i^T\beta}, $$

we obtain

$$ L(\beta) = \prod_{i=1}^n \frac{ e^{-e^{x_i^T\beta}} e^{y_i x_i^T\beta} }{y_i!}. $$

7. Log-Likelihood Function¶

Taking logarithms,

$$ \ell(\beta)=\log L(\beta). $$

Therefore,

$$ \ell(\beta) = \sum_{i=1}^n \log \left( \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!} \right). $$

So

$$ \ell(\beta) = \sum_{i=1}^n \left( -\lambda_i+y_i\log(\lambda_i)-\log(y_i!) \right). $$

Since

$$ \lambda_i=e^{x_i^T\beta} $$

and

$$ \log(\lambda_i)=x_i^T\beta, $$

we get

$$ \ell(\beta) = \sum_{i=1}^n \left( -e^{x_i^T\beta} + y_i x_i^T\beta - \log(y_i!) \right). $$

Thus,

$$ \boxed{ \ell(\beta) = \sum_{i=1}^n \left[ y_i x_i^T\beta - e^{x_i^T\beta} - \log(y_i!) \right]. } $$

Since $\log(y_i!)$ does not depend on $\beta$, for optimization we may ignore it.

Thus we maximize

$$ \tilde{\ell}(\beta) = \sum_{i=1}^n \left[ y_i x_i^T\beta - e^{x_i^T\beta} \right]. $$

8. Score Function¶

The score function is the gradient of the log-likelihood:

$$ U(\beta)=\nabla_\beta \ell(\beta). $$

We have

$$ \ell(\beta) = \sum_{i=1}^n \left[ y_i x_i^T\beta - e^{x_i^T\beta} - \log(y_i!) \right]. $$

Differentiate term by term.

First,

$$ \nabla_\beta (y_i x_i^T\beta)=y_i x_i. $$

Second,

$$ \nabla_\beta e^{x_i^T\beta} = e^{x_i^T\beta}x_i. $$

Therefore,

$$ \nabla_\beta \left[ -y_i? \right] $$

More precisely,

$$ \nabla_\beta \left[ y_i x_i^T\beta - e^{x_i^T\beta} \right] = y_i x_i - e^{x_i^T\beta}x_i. $$

Hence,

$$ U(\beta) = \sum_{i=1}^n x_i \left( y_i-e^{x_i^T\beta} \right). $$

Since

$$ \lambda_i=e^{x_i^T\beta}, $$

we get

$$ \boxed{ U(\beta) = \sum_{i=1}^n x_i(y_i-\lambda_i). } $$

In matrix form, let

$$ y= \begin{pmatrix} y_1\\ \vdots\\ y_n \end{pmatrix}, \qquad \lambda= \begin{pmatrix} \lambda_1\\ \vdots\\ \lambda_n \end{pmatrix}. $$

Then

$$ \boxed{ U(\beta)=X^T(y-\lambda). } $$

The maximum likelihood estimator satisfies

$$ X^T(y-\lambda)=0. $$

That is,

$$ X^T y = X^T \lambda. $$

But since

$$ \lambda_i=e^{x_i^T\beta}, $$

this is a nonlinear system.

Therefore, there is generally no closed-form solution.


9. Hessian Matrix¶

The Hessian matrix is

$$ H(\beta)=\nabla_\beta^2 \ell(\beta). $$

From the score function,

$$ U(\beta)=\sum_{i=1}^n x_i(y_i-\lambda_i). $$

Since $y_i$ is constant with respect to $\beta$,

$$ \nabla_\beta(y_i-\lambda_i) = -\nabla_\beta \lambda_i. $$

But

$$ \lambda_i=e^{x_i^T\beta}, $$

so

$$ \nabla_\beta \lambda_i = \lambda_i x_i. $$

Therefore,

$$ \nabla_\beta \left[ x_i(y_i-\lambda_i) \right] = -\lambda_i x_i x_i^T. $$

Hence,

$$ \boxed{ H(\beta) = -\sum_{i=1}^n \lambda_i x_i x_i^T. } $$

In matrix form,

$$ \boxed{ H(\beta) = -X^T W X, } $$

where

$$ W= \operatorname{diag}(\lambda_1,\dots,\lambda_n). $$

Because $W$ is positive diagonal, $X^TWX$ is positive semidefinite.

Therefore,

$$ H(\beta) $$

is negative semidefinite.

So the log-likelihood is concave.

This is good: any local maximum is a global maximum.


10. Fisher Information¶

The Fisher information matrix is

$$ I(\beta) = -\mathbb{E}[H(\beta)]. $$

For Poisson regression,

$$ H(\beta)=-X^T W X, $$

where

$$ W=\operatorname{diag}(\lambda_1,\dots,\lambda_n). $$

Since $\lambda_i$ is determined by $x_i$ and $\beta$, we get

$$ \boxed{ I(\beta)=X^T W X. } $$

This matrix is used to approximate the variance of the MLE:

$$ \widehat{\operatorname{Var}}(\hat{\beta}) = (X^T\widehat{W}X)^{-1}, $$

where

$$ \widehat{W} = \operatorname{diag}(\hat{\lambda}_1,\dots,\hat{\lambda}_n). $$

Therefore,

$$ \operatorname{SE}(\hat{\beta}_j) = \sqrt{ \left[ (X^T\widehat{W}X)^{-1} \right]_{jj} }. $$

11. Newton-Raphson Method¶

To maximize $\ell(\beta)$, Newton's method uses

$$ \beta^{(new)} = \beta^{(old)} - H(\beta^{(old)})^{-1}U(\beta^{(old)}). $$

For Poisson regression,

$$ U(\beta)=X^T(y-\lambda), $$

and

$$ H(\beta)=-X^TWX. $$

Therefore,

$$ \beta^{(new)} = \beta^{(old)} + (X^TWX)^{-1}X^T(y-\lambda). $$

Thus,

$$ \boxed{ \beta^{(new)} = \beta^{(old)} + (X^TWX)^{-1}X^T(y-\lambda). } $$

This is the Newton update for Poisson regression.


In [ ]:
 
In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# -----------------------------
# 1. Generate example data
# -----------------------------

np.random.seed(42)

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)

beta_true = np.array([0.5, 0.25])

X = np.column_stack([np.ones(len(x)), x])

lambda_true = np.exp(X @ beta_true)

y = np.random.poisson(lambda_true)

data = pd.DataFrame({
    "x": x,
    "y": y,
    "true_lambda": lambda_true
})

data
Out[6]:
x y true_lambda
0 1.0 4 2.117000
1 2.0 1 2.718282
2 3.0 3 3.490343
3 4.0 4 4.481689
4 5.0 5 5.754603
5 6.0 7 7.389056
6 7.0 9 9.487736
7 8.0 7 12.182494
8 9.0 13 15.642632
9 10.0 17 20.085537
In [7]:
# -----------------------------
# 2. Newton method for Poisson regression
# -----------------------------

def poisson_log_likelihood(beta, X, y):
    eta = X @ beta
    lam = np.exp(eta)
    return np.sum(y * eta - lam)   # ignoring constant -log(y!)

def poisson_score(beta, X, y):
    eta = X @ beta
    lam = np.exp(eta)
    return X.T @ (y - lam)

def poisson_hessian(beta, X, y):
    eta = X @ beta
    lam = np.exp(eta)
    W = np.diag(lam)
    return -X.T @ W @ X

def fit_poisson_newton(X, y, max_iter=100, tol=1e-8):
    beta = np.zeros(X.shape[1])
    history = []

    for iteration in range(max_iter):
        loglik = poisson_log_likelihood(beta, X, y)
        score = poisson_score(beta, X, y)
        hessian = poisson_hessian(beta, X, y)

        # Newton update:
        # beta_new = beta - H^{-1} score
        step = np.linalg.solve(hessian, score)
        beta_new = beta - step

        history.append({
            "iteration": iteration,
            "log_likelihood": loglik,
            "beta_0": beta[0],
            "beta_1": beta[1],
            "step_norm": np.linalg.norm(step)
        })

        if np.linalg.norm(beta_new - beta) < tol:
            beta = beta_new
            break

        beta = beta_new

    return beta, pd.DataFrame(history)
In [8]:
# -----------------------------
# 3. Fit the model
# -----------------------------

beta_hat, history = fit_poisson_newton(X, y)

print("Estimated beta:")
print(beta_hat)

print("\nTrue beta:")
print(beta_true)

history
Estimated beta:
[0.52441211 0.22269885]

True beta:
[0.5  0.25]
Out[8]:
iteration log_likelihood beta_0 beta_1 step_norm
0 0 -10.000000 0.000000 0.000000 2.412129e+00
1 1 -347617.856995 -1.933333 1.442424 9.878277e-01
2 2 -127603.034150 -2.921160 1.441189 9.669958e-01
3 3 -46707.887194 -3.888150 1.437839 9.109014e-01
4 4 -16991.219226 -4.799007 1.428791 7.623428e-01
5 5 -6099.977203 -5.560967 1.404628 3.884367e-01
6 6 -2128.994247 -5.944324 1.342018 4.964289e-01
7 7 -692.458283 -5.471004 1.192320 1.881206e+00
8 8 -172.079028 -3.612416 0.901478 2.460713e+00
9 9 14.217286 -1.176526 0.552844 1.180433e+00
10 10 68.817154 -0.013981 0.348122 4.406504e-01
11 11 78.411193 0.415875 0.251186 1.051626e-01
12 12 78.946529 0.517624 0.224609 7.016584e-03
13 13 78.949068 0.524378 0.222709 3.519188e-05
14 14 78.949068 0.524412 0.222699 9.317839e-10
In [9]:
# -----------------------------
# 4. Interpretation
# -----------------------------

beta_0_hat, beta_1_hat = beta_hat

multiplicative_effect = np.exp(beta_1_hat)

print(f"Estimated beta_0 = {beta_0_hat:.4f}")
print(f"Estimated beta_1 = {beta_1_hat:.4f}")

print(f"\nEach one-unit increase in x multiplies the expected count by:")
print(f"exp(beta_1) = {multiplicative_effect:.4f}")

print(f"\nThat is approximately a {(multiplicative_effect - 1) * 100:.2f}% increase.")
Estimated beta_0 = 0.5244
Estimated beta_1 = 0.2227

Each one-unit increase in x multiplies the expected count by:
exp(beta_1) = 1.2494

That is approximately a 24.94% increase.
In [10]:
# -----------------------------
# 5. Predicted values
# -----------------------------

lambda_hat = np.exp(X @ beta_hat)

data["predicted_lambda"] = lambda_hat

data
Out[10]:
x y true_lambda predicted_lambda
0 1.0 4 2.117000 2.110893
1 2.0 1 2.718282 2.637443
2 3.0 3 3.490343 3.295338
3 4.0 4 4.481689 4.117341
4 5.0 5 5.754603 5.144388
5 6.0 7 7.389056 6.427626
6 7.0 9 9.487736 8.030960
7 8.0 7 12.182494 10.034237
8 9.0 13 15.642632 12.537219
9 10.0 17 20.085537 15.664557
In [11]:
# -----------------------------
# 6. Final plot
# -----------------------------

x_grid = np.linspace(0, 11, 200)

X_grid = np.column_stack([np.ones(len(x_grid)), x_grid])

lambda_grid_hat = np.exp(X_grid @ beta_hat)
lambda_grid_true = np.exp(X_grid @ beta_true)

plt.figure(figsize=(9, 6))

plt.scatter(x, y, label="Observed counts")

plt.plot(x_grid, lambda_grid_hat, label="Fitted Poisson regression")

plt.plot(x_grid, lambda_grid_true, linestyle="--", label="True mean function")

plt.xlabel("x")
plt.ylabel("Expected count")
plt.title("Poisson Regression Fitted by Newton's Method")
plt.legend()
plt.grid(True)
plt.show()
In [ ]:
 
In [ ]:
 
In [ ]:
 

Generalized Linear Models (GLMs)¶

1. Motivation¶

In classical linear regression we assume

$$ Y_i = \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip} + \varepsilon_i, $$

where

$$ \varepsilon_i \sim N(0,\sigma^2). $$

Equivalently,

$$ Y_i \sim N(\mu_i,\sigma^2), $$

with

$$ \mu_i = \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip}. $$

This model works very well for continuous responses.

However, many real-world problems do not satisfy these assumptions.

Examples:

Problem Response Type
Spam detection Binary
Number of website clicks Counts
Disease occurrence Binary
Number of accidents Counts
Customer purchase Binary

Problem 1: Predictions may be impossible¶

Suppose we use linear regression for probabilities.

Example:

$$ \hat{Y} = -0.3 + 0.8x. $$

For some values of $x$,

$$ \hat{Y} < 0 $$

or

$$ \hat{Y} > 1. $$

But probabilities must satisfy

$$ 0 \leq p \leq 1. $$

Problem 2: Variance is often not constant¶

In linear regression:

$$ \operatorname{Var}(Y_i)=\sigma^2. $$

But for Bernoulli random variables:

$$ \operatorname{Var}(Y_i)=p_i(1-p_i). $$

The variance depends on the mean.


Problem 3: Non-normal responses¶

Counts are not normally distributed.

For example:

$$ Y_i \sim \text{Poisson}(\lambda_i). $$

2. The Idea of GLMs¶

Generalized Linear Models extend linear regression by allowing:

  1. Different response distributions,
  2. Different relationships between the mean and predictors.

3. Structure of a GLM¶

A GLM consists of three components.


3.1 Random Component¶

The response variable follows a distribution from the exponential family.

Examples:

Model Distribution
Linear Regression Normal
Logistic Regression Bernoulli
Poisson Regression Poisson

3.2 Systematic Component¶

We define the linear predictor

$$ \eta_i = \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip}. $$

Matrix form:

$$ \eta = X\beta. $$

where

$$ X= \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p}\\ 1 & x_{21} & \cdots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}. $$

3.3 Link Function¶

We connect the mean response to the linear predictor using a function:

$$ g(\mu_i)=\eta_i. $$

Equivalently,

$$ g(\mu_i)=x_i^T\beta. $$

The function $g$ is called the link function.


4. Linear Regression as a GLM¶

Suppose

$$ Y_i \sim N(\mu_i,\sigma^2). $$

We choose the identity link:

$$ g(\mu)=\mu. $$

Then

$$ \mu_i=\eta_i=x_i^T\beta. $$

Therefore,

$$ Y_i = x_i^T\beta + \varepsilon_i. $$

Thus ordinary least squares regression is a GLM.

8. Exponential Family¶

Many GLMs come from the exponential family.

A distribution belongs to the exponential family if it can be written as

$$ f(y;\theta) = h(y) \exp \left( \eta(\theta)T(y)-A(\theta) \right). $$

Examples:

  • Normal,
  • Bernoulli,
  • Binomial,
  • Poisson,
  • Gamma.

This common structure allows a unified theory.


9. Canonical Link Functions¶

Each exponential-family distribution has a natural link function.

Distribution Canonical Link
Normal Identity
Bernoulli Logit
Poisson Log
Gamma Inverse

Canonical links simplify optimization and theory.


10. Summary¶

A GLM consists of:

  1. A distribution from the exponential family,
  2. A linear predictor,
  3. A link function.

Key examples:

Model Distribution Link
Linear Regression Normal Identity
Logistic Regression Bernoulli Logit
Poisson Regression Poisson Log

GLMs unify many classical statistical methods into a single framework.

In [ ]:
 
In [ ]:
 

6.¶

7. Why Skewness and Kurtosis Matter¶

Normality tests often detect deviations because:

  • skewness differs from zero,
  • kurtosis differs from three.

Examples:

Problem Effect
Long right tail Positive skewness
Heavy tails High kurtosis
Outliers High kurtosis
Asymmetric distribution Nonzero skewness

8. Connection with QQ-Plots¶

QQ-plots visually reveal:

  • skewness,
  • kurtosis.

Skewness in QQ-plot¶

Produces S-shaped asymmetry.


Heavy tails¶

Extreme points deviate strongly at ends.


9. Why Normality Matters¶

Many classical methods assume:

  • approximately normal residuals,
  • finite variance,
  • moderate tails.

Heavy skewness or kurtosis can affect:

  • confidence intervals,
  • hypothesis tests,
  • p-values,
  • regression inference.

10. Important Practical Advice¶

Small deviations from:

  • zero skewness,
  • zero excess kurtosis,

do NOT necessarily invalidate statistical procedures.

For large samples:

  • CLT often helps,
  • many methods remain robust.

What matters is:

  • severe asymmetry,
  • extreme outliers,
  • very heavy tails.

11. Python Example¶

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis, norm

np.random.seed(42)

# Generate normal data
x_normal = np.random.normal(size=1000)

# Generate skewed data
x_skewed = np.random.exponential(size=1000)

# Generate heavy-tailed data
x_heavy = np.random.standard_t(df=3, size=1000)

12. Plotting the Distributions¶

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].hist(x_normal, bins=30, density=True)
axes[0].set_title("Approximately Normal")

axes[1].hist(x_skewed, bins=30, density=True)
axes[1].set_title("Right-Skewed")

axes[2].hist(x_heavy, bins=30, density=True)
axes[2].set_title("Heavy-Tailed")

plt.show()

13. Computing Skewness and Kurtosis¶

datasets = {
    "Normal": x_normal,
    "Skewed": x_skewed,
    "Heavy-tailed": x_heavy
}

for name, data in datasets.items():
    s = skew(data)
    k = kurtosis(data)   # excess kurtosis

    print(f"{name}:")
    print(f"  Skewness = {s:.4f}")
    print(f"  Excess Kurtosis = {k:.4f}")
    print()

14. QQ-Plots¶

import scipy.stats as stats

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

stats.probplot(x_normal, dist="norm", plot=axes[0])
axes[0].set_title("Normal QQ-Plot")

stats.probplot(x_skewed, dist="norm", plot=axes[1])
axes[1].set_title("Skewed QQ-Plot")

stats.probplot(x_heavy, dist="norm", plot=axes[2])
axes[2].set_title("Heavy-Tailed QQ-Plot")

plt.show()

15. Interpretation of the Plots¶


Normal data¶

  • skewness near 0,
  • excess kurtosis near 0,
  • QQ-plot approximately linear.

Skewed data¶

  • large positive skewness,
  • asymmetric QQ-plot.

Heavy-tailed data¶

  • large positive kurtosis,
  • QQ-plot deviates strongly at tails.

16. Summary¶

Skewness measures asymmetry:

$$ \gamma_1 = \frac{ \mathbb{E}[(X-\mu)^3] }{ \sigma^3 }. $$

Kurtosis measures tail heaviness:

$$ \gamma_2 = \frac{ \mathbb{E}[(X-\mu)^4] }{ \sigma^4 }. $$

For the normal distribution:

$$ \text{Skewness}=0, $$

and

$$ \text{Excess Kurtosis}=0. $$

Therefore:

  • skewness detects asymmetry,
  • kurtosis detects heavy tails and outliers.

These quantities are fundamental for:

  • checking normality,
  • understanding distributions,
  • diagnosing regression residuals,
  • validating statistical assumptions.
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: