Seminar 2

Statistical Hypothesis Testing and Types of Errors¶

1. Motivation¶

In data science, we constantly face questions of the form:

  • Is this effect real or just noise?
  • Is a model improvement statistically significant?
  • Does a new algorithm outperform the baseline?
  • Is a parameter equal to some reference value?

Statistical hypothesis testing provides a principled framework for answering such questions under uncertainty.


2. Statistical Model and Data¶

Let
$$ X = (X_1, \dots, X_n) $$ be observed data, modeled as a random sample from a distribution $$ X_i \sim P_\theta, \quad \theta \in \Theta, $$ where:

  • $\theta$ is an unknown parameter,
  • $\Theta$ is the parameter space.

Examples:

  • $X_i \sim \mathcal{N}(\mu, \sigma^2)$, $\theta = \mu$
  • $X_i \sim \text{Bernoulli}(p)$, $\theta = p$
  • Regression model $Y = X\beta + \varepsilon$

3. Hypotheses¶

3.1 Null and Alternative Hypotheses¶

A hypothesis is a statement about the parameter $\theta$.

  • Null hypothesis $H_0$: baseline or default assumption
  • Alternative hypothesis $H_1$: competing claim

Formally: $$ H_0: \theta \in \Theta_0, \quad H_1: \theta \in \Theta_1, $$ where $$ \Theta_0 \cap \Theta_1 = \varnothing, \quad \Theta_0 \cup \Theta_1 \subseteq \Theta. $$


3.2 Types of Alternatives¶

  • Two-sided $$ H_0: \theta = \theta_0, \quad H_1: \theta \neq \theta_0 $$

  • One-sided $$ H_0: \theta \le \theta_0, \quad H_1: \theta > \theta_0 $$

The choice of alternative must be made before seeing the data.


4. Test as a Decision Rule¶

A statistical test is a decision rule $$ \varphi(X) = \begin{cases} 1 & \text{reject } H_0 \\ 0 & \text{do not reject } H_0 \end{cases} $$

Equivalently, define a rejection region $\mathcal{R}$: $$ \varphi(X) = 1 \iff X \in \mathcal{R}. $$


5. Types of Errors¶

5.1 Type I and Type II Errors¶

| Decision \ Truth | $H_0$ true | $H_1$ true |
| --- | --- | --- |
| Reject $H_0$ | Type I error | Correct |
| Do not reject $H_0$ | Correct | Type II error |

  • Type I error: rejecting a true null hypothesis
  • Type II error: failing to reject a false null hypothesis

5.2 Error Probabilities¶

Type I Error Probability (Significance Level)¶

$$ \alpha = \mathbb{P}_{\theta \in \Theta_0}(\text{reject } H_0) $$
  • Also called size or significance level
  • Fixed in advance (e.g. $\alpha = 0.05$)

Formally: $$ \sup_{\theta \in \Theta_0} \mathbb{P}_\theta(X \in \mathcal{R}) \le \alpha $$


Type II Error Probability¶

$$ \beta(\theta) = \mathbb{P}_\theta(\text{do not reject } H_0), \quad \theta \in \Theta_1 $$

5.3 Power Function¶

The power of a test is $$ \pi(\theta) = \mathbb{P}_\theta(\text{reject } H_0) = 1 - \beta(\theta), \quad \theta \in \Theta_1 $$
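For intuition, the power of the right-tailed Z-test for a mean with known variance (introduced later in these notes) has a closed form, $\pi(\mu) = 1 - \Phi\bigl(z_{1-\alpha} - (\mu - \mu_0)\sqrt{n}/\sigma\bigr)$. A minimal sketch; the values of $n$, $\sigma$, and $\alpha$ below are illustrative choices, not from the text:

```python
from math import sqrt
from scipy.stats import norm


def power_one_sided_ztest(mu, mu0, sigma, n, alpha=0.05):
    """Power pi(mu) of the right-tailed Z-test H0: mu = mu0 vs H1: mu > mu0.

    Reject H0 when Z = (Xbar - mu0) / (sigma / sqrt(n)) >= z_{1-alpha};
    under the true mean mu, Z is normal with mean (mu - mu0) * sqrt(n) / sigma.
    """
    z_crit = norm.ppf(1 - alpha)
    shift = (mu - mu0) * sqrt(n) / sigma
    return 1 - norm.cdf(z_crit - shift)


# At mu = mu0 the rejection probability equals alpha (the size of the test);
# it grows toward 1 as mu moves away from mu0.
print(power_one_sided_ztest(mu=0.0, mu0=0.0, sigma=1.0, n=25))
print(power_one_sided_ztest(mu=0.5, mu0=0.0, sigma=1.0, n=25))
```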


6. Test Statistic¶

A test statistic is a function $$ T = T(X) $$ summarizing evidence against $H_0$.

Examples:

  • Sample mean
  • $t$-statistic
  • Likelihood ratio
  • Wald statistic
  • Score statistic

Decision rule: $$ \text{Reject } H_0 \iff T \in \mathcal{C} $$


7. Distribution Under the Null¶

Key idea:

The distribution of the test statistic under $H_0$ is known or approximable.

Let $$ F_0(t) = \mathbb{P}_{H_0}(T \le t) $$

Critical value $c_\alpha$ satisfies $$ \mathbb{P}_{H_0}(T \ge c_\alpha) = \alpha $$


8. p-value¶

8.1 Definition¶

The p-value is $$ p = \mathbb{P}_{H_0}(T \ge T_{\text{obs}}) $$


8.2 Decision Rule¶

$$ \text{Reject } H_0 \iff p \le \alpha $$
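The framework above can be sketched numerically; a toy example, assuming a statistic that is standard normal under $H_0$ and a made-up observed value:

```python
from scipy.stats import norm

alpha = 0.05
t_obs = 2.1                       # hypothetical observed test statistic

# Right-tailed p-value under a N(0,1) null: P_{H0}(T >= t_obs)
p_value = 1 - norm.cdf(t_obs)

# Decision rule: reject H0 iff p-value <= alpha
reject = p_value <= alpha
print(p_value, reject)
```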

One-Sample Test for Proportions

1. Problem Setting and Motivation¶

The one-sample test for proportions is used when we want to test a claim about a single population proportion.

Typical data science questions:

  • Is the click-through rate equal to 5%?
  • Is the defect rate below 1%?
  • Has the conversion rate changed after a product update?
  • Is the fraction of positive labels larger than a benchmark?

2. Statistical Model¶

Let
$$ X_1, \dots, X_n \;\text{i.i.d.}\; \sim \text{Bernoulli}(p), $$ where:

  • $X_i = 1$ indicates “success”
  • $X_i = 0$ indicates “failure”
  • $p \in (0,1)$ is the unknown population proportion

The total number of successes: $$ S = \sum_{i=1}^n X_i \sim \text{Binomial}(n, p) $$

The sample proportion: $$ \hat{p} = \frac{S}{n} $$


3. Hypotheses¶

3.1 Null and Alternative Hypotheses¶

We test: $$ H_0: p = p_0 $$

Against one of the following alternatives:

  • Two-sided $$ H_1: p \neq p_0 $$

  • Right-tailed $$ H_1: p > p_0 $$

  • Left-tailed $$ H_1: p < p_0 $$

The alternative must be chosen before seeing the data.


4. Distribution of the Sample Proportion¶

We have: $$ \mathbb{E}[\hat{p}] = p, \qquad \mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n} $$

Under $H_0$: $$ \mathbb{E}[\hat{p}] = p_0, \qquad \mathrm{Var}(\hat{p}) = \frac{p_0(1-p_0)}{n} $$


5. Normal Approximation (CLT)¶

By the Central Limit Theorem: $$ \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \;\xrightarrow{d}\; \mathcal{N}(0,1) $$

Under $H_0$: $$ Z = \frac{\hat{p} - p_0} {\sqrt{p_0(1-p_0)/n}} \;\approx\; \mathcal{N}(0,1) $$


6. Validity Conditions¶

The normal approximation is valid when: $$ np_0 \ge 5 \quad \text{and} \quad n(1-p_0) \ge 5 $$

(Thresholds like 5 or 10 are common rules of thumb.)

If these conditions fail:

  • Use an exact binomial test
  • Or a simulation-based test
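When the rule-of-thumb conditions fail, `scipy` provides an exact binomial test; a minimal sketch (the counts below are illustrative):

```python
from scipy.stats import binomtest

# 3 successes out of 10 trials, testing H0: p = 0.5 exactly --
# the p-value is computed from the Binomial(10, 0.5) pmf, no CLT needed
res = binomtest(k=3, n=10, p=0.5, alternative="two-sided")
print(res.pvalue)
```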

7. Test Statistic¶

The one-sample z-statistic for proportions is: $$ Z = \frac{\hat{p} - p_0} {\sqrt{p_0(1-p_0)/n}} $$

This statistic measures how many standard deviations $\hat{p}$ is away from $p_0$ under $H_0$.


8. Rejection Regions¶

Let $\alpha$ be the significance level.

Two-Sided Test¶

Reject $H_0$ if: $$ |Z| \ge z_{1-\alpha/2} $$

Right-Tailed Test¶

Reject $H_0$ if: $$ Z \ge z_{1-\alpha} $$

Left-Tailed Test¶

Reject $H_0$ if: $$ Z \le -z_{1-\alpha} $$
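The critical values $z_{1-\alpha/2}$ and $z_{1-\alpha}$ come from the standard normal quantile function, e.g. at $\alpha = 0.05$:

```python
from scipy.stats import norm

alpha = 0.05
z_two_sided = norm.ppf(1 - alpha / 2)   # z_{1-alpha/2}, for |Z| >= z
z_one_sided = norm.ppf(1 - alpha)       # z_{1-alpha}, for one-tailed tests
print(z_two_sided, z_one_sided)         # approx 1.96 and 1.645
```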


9. p-value¶

Definition¶

  • Two-sided $$ p\text{-value} = 2\bigl(1 - \Phi(|Z_{\text{obs}}|)\bigr) $$

  • Right-tailed $$ p\text{-value} = 1 - \Phi(Z_{\text{obs}}) $$

  • Left-tailed $$ p\text{-value} = \Phi(Z_{\text{obs}}) $$

where $\Phi$ is the CDF of $\mathcal{N}(0,1)$.


10. Decision Rule¶

$$ \text{Reject } H_0 \iff p\text{-value} \le \alpha $$

11. Exact Binomial Test (Brief)¶

When $n$ is small or $p_0$ is extreme:

$$ S \sim \text{Binomial}(n, p_0) $$

p-values are computed exactly using the binomial distribution rather than a normal approximation.


12. Confidence Interval Connection¶

A $(1-\alpha)$ confidence interval for $p$: $$ \hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$

Relationship (approximate):

$p_0$ is rejected by the two-sided test at level $\alpha$
iff $p_0$ is outside the $(1-\alpha)$ confidence interval.

(The correspondence is not exact here: the test standardizes with the null standard error $\sqrt{p_0(1-p_0)/n}$, while the Wald interval uses the estimated standard error $\sqrt{\hat{p}(1-\hat{p})/n}$.)
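A quick numerical check of this duality, using $\hat{p} = 0.84$, $n = 500$, $p_0 = 0.856$ from the undergraduate example further below (note the interval uses the estimated SE, so the match with the $p_0$-based z-test is only approximate):

```python
from math import sqrt
from scipy.stats import norm

p_hat, n, p0, alpha = 0.84, 500, 0.856, 0.05

z = norm.ppf(1 - alpha / 2)
half_width = z * sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half_width, p_hat + half_width)

# p0 lies inside the 95% interval, matching the two-sided test's
# failure to reject H0: p = 0.856
print(ci, ci[0] <= p0 <= ci[1])
```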

In [12]:
import math
from scipy.stats import norm


def one_sample_proportion_test(
    data=None,
    p_hat=None,
    n=None,
    p0=0.5,
    alternative="two-sided",
    alpha=0.05,
    success_values=("1", "success", "yes", "true")
):
    """
    One-sample Z-test for proportions.

    Parameters
    ----------
    data : str, optional
        Raw data as a string (e.g. "10101", "1 0 1 1 0", "success failure success").
    p_hat : float, optional
        Sample proportion (used if data is not provided).
    n : int, optional
        Sample size (required if p_hat is provided).
    p0 : float
        Null hypothesis proportion.
    alternative : {"two-sided", "less", "greater"}
        Type of alternative hypothesis.
    alpha : float
        Significance level.
    success_values : tuple
        Values interpreted as "success" in the data string.

    Returns
    -------
    dict
        Test results.
    """

    # ---- Case 1: raw data is given ----
    if data is not None:
        tokens = data.lower().replace(",", " ").split()
        n = len(tokens)
        x = sum(token in success_values for token in tokens)
        p_hat = x / n

    # ---- Case 2: p_hat and n are given ----
    elif p_hat is not None and n is not None:
        x = p_hat * n

    else:
        raise ValueError("Provide either `data` or both `p_hat` and `n`.")

    # ---- Z statistic ----
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z_obs = (p_hat - p0) / standard_error

    # ---- p-value ----
    if alternative == "two-sided":
        p_value = 2 * (1 - norm.cdf(abs(z_obs)))
    elif alternative == "greater":
        p_value = 1 - norm.cdf(z_obs)
    elif alternative == "less":
        p_value = norm.cdf(z_obs)
    else:
        raise ValueError("alternative must be 'two-sided', 'less', or 'greater'")

    # ---- Decision ----
    reject = p_value <= alpha  # decision rule: reject H0 iff p-value <= alpha

    return {
        "n": n,
        "x": x,
        "p_hat": p_hat,
        "z_obs": z_obs,
        "p_value": p_value,
        "alpha": alpha,
        "reject_H0": reject
    }

1. Using raw data (string)¶

In [13]:
result = one_sample_proportion_test(
    data="1 0 1 1 0 1 1 0 1 1",
    p0=0.5,
    alternative="two-sided"
)

print(result)
{'n': 10, 'x': 7, 'p_hat': 0.7, 'z_obs': 1.2649110640673513, 'p_value': 0.2059032107320684, 'alpha': 0.05, 'reject_H0': False}

2. Using raw data with words¶

In [15]:
result = one_sample_proportion_test(
    data="success failure success success failure",
    p0=0.6,
    alternative="greater"
)

print(result)
{'n': 5, 'x': 3, 'p_hat': 0.6, 'z_obs': 0.0, 'p_value': 0.5, 'alpha': 0.05, 'reject_H0': False}

3. Using a ready sample proportion¶

It has been found that 85.6% of all enrolled college and university students in the United States are undergraduates. A random sample of 500 enrolled college students in a particular state revealed that 420 of them were undergraduates. Is there sufficient evidence to conclude that the proportion differs from the national percentage? Use $\alpha= 0.05$.

In [18]:
result = one_sample_proportion_test(
    n=500,
    p_hat=(420 / 500),
    p0=0.856,
    alternative="two-sided"
)

print(result)
{'n': 500, 'x': 420.0, 'p_hat': 0.84, 'z_obs': -1.0190297341929058, 'p_value': 0.30818885050252565, 'alpha': 0.05, 'reject_H0': False}

One-Sample Test for the Mean

1. Problem Setting and Motivation¶

The one-sample test for the mean is used when we want to test a claim about the population mean based on a single sample.

Typical data science questions:

  • Is the average response time equal to 200 ms?
  • Has the mean revenue per user changed?
  • Is the mean prediction error zero?
  • Is the expected value of a feature equal to a benchmark?

2. Statistical Model¶

Let $$ X_1, \dots, X_n \;\text{i.i.d.}\; \sim P $$ with: $$ \mathbb{E}[X_i] = \mu, \qquad \mathrm{Var}(X_i) = \sigma^2 $$

The parameter of interest is the population mean $\mu$.

The sample mean is $$ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i $$


3. Hypotheses¶

We test $$ H_0: \mu = \mu_0 $$

against one of the following alternatives:

  • Two-sided $$ H_1: \mu \neq \mu_0 $$

  • Right-tailed $$ H_1: \mu > \mu_0 $$

  • Left-tailed $$ H_1: \mu < \mu_0 $$

The alternative must be chosen before observing the data.


4. Case I: Variance Known (Z-Test)¶

4.1 Assumptions¶

  • $X_i \sim \mathcal{N}(\mu, \sigma^2)$, or $n$ is large (CLT applies)
  • The variance $\sigma^2$ is known

4.2 Distribution of the Sample Mean¶

We have: $$ \mathbb{E}[\bar{X}] = \mu, \qquad \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} $$

Under $H_0$: $$ \bar{X} \sim \mathcal{N}\left(\mu_0, \frac{\sigma^2}{n}\right) $$


4.3 Test Statistic (Z-Statistic)¶

Define: $$ Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} $$

Under $H_0$: $$ Z \sim \mathcal{N}(0,1) $$


4.4 Rejection Regions¶

Let $\alpha$ be the significance level.

  • Two-sided $$ |Z| \ge z_{1-\alpha/2} $$

  • Right-tailed $$ Z \ge z_{1-\alpha} $$

  • Left-tailed $$ Z \le -z_{1-\alpha} $$


4.5 p-value¶

Let $Z_{\text{obs}}$ be the observed value of the test statistic.

  • Two-sided $$ p\text{-value} = 2\bigl(1 - \Phi(|Z_{\text{obs}}|)\bigr) $$

  • Right-tailed $$ p\text{-value} = 1 - \Phi(Z_{\text{obs}}) $$

  • Left-tailed $$ p\text{-value} = \Phi(Z_{\text{obs}}) $$
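A minimal worked sketch of the known-variance Z-test; the numbers below are illustrative, not from the text:

```python
from math import sqrt
from scipy.stats import norm

# H0: mu = 200 vs H1: mu != 200, with sigma known
n, x_bar, mu0, sigma, alpha = 36, 202.0, 200.0, 6.0, 0.05

z_obs = (x_bar - mu0) / (sigma / sqrt(n))   # standardized distance from mu0
p_value = 2 * (1 - norm.cdf(abs(z_obs)))    # two-sided p-value
print(z_obs, p_value, p_value <= alpha)
```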


5. Case II: Variance Unknown (t-Test)¶

5.1 Assumptions¶

  • $X_i \sim \mathcal{N}(\mu, \sigma^2)$
  • The variance $\sigma^2$ is unknown

This is the most common real-world situation.


5.2 Sample Variance¶

The sample variance is: $$ S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 $$


5.3 Test Statistic (t-Statistic)¶

Define: $$ T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} $$

Under $H_0$: $$ T \sim t_{n-1} $$ (Student’s t-distribution with $n-1$ degrees of freedom)


5.4 Why t-Distribution?¶

Replacing $\sigma$ with the random variable $S$ introduces extra uncertainty.
The t-distribution has:

  • heavier tails than the normal distribution
  • convergence to $\mathcal{N}(0,1)$ as $n \to \infty$
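Both properties are easy to verify numerically by comparing the 97.5% quantiles of $t_{df}$ with the normal one:

```python
from scipy.stats import norm, t

z975 = norm.ppf(0.975)   # approx 1.96
for df in (2, 10, 30, 1000):
    # t critical values exceed the normal one and shrink toward it as df grows
    print(df, t.ppf(0.975, df))
```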

5.5 Rejection Regions¶

Let $t_{n-1,\,1-\alpha}$ denote the $(1-\alpha)$ quantile of $t_{n-1}$.

  • Two-sided $$ |T| \ge t_{n-1,\,1-\alpha/2} $$

  • Right-tailed $$ T \ge t_{n-1,\,1-\alpha} $$

  • Left-tailed $$ T \le -t_{n-1,\,1-\alpha} $$


5.6 p-value¶

Let $T_{\text{obs}}$ be the observed statistic.

  • Two-sided $$ p\text{-value} = 2\bigl(1 - F_{t_{n-1}}(|T_{\text{obs}}|)\bigr) $$

  • Right-tailed $$ p\text{-value} = 1 - F_{t_{n-1}}(T_{\text{obs}}) $$

  • Left-tailed $$ p\text{-value} = F_{t_{n-1}}(T_{\text{obs}}) $$


6. Decision Rule¶

For both tests: $$ \text{Reject } H_0 \iff p\text{-value} \le \alpha $$


7. Confidence Interval Connection¶

Known Variance¶

$$ \bar{X} \pm z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} $$

Unknown Variance¶

$$ \bar{X} \pm t_{n-1,\,1-\alpha/2}\frac{S}{\sqrt{n}} $$

Relationship:

$H_0: \mu = \mu_0$ is rejected at level $\alpha$
iff $\mu_0$ is outside the $(1-\alpha)$ confidence interval.


8. Practical Data Science Remarks¶

  • The t-test is quite robust to mild departures from normality, especially for moderate to large $n$
  • For large $n$, Z-test and t-test give nearly identical results
  • Always report:
    • estimated mean $\bar{X}$
    • confidence interval
    • effect size

9. Summary¶

  • Parameter of interest: population mean $\mu$
  • Known variance → Z-test
  • Unknown variance → t-test
  • Test statistics follow known distributions under $H_0$
  • Strong connection to confidence intervals
In [32]:
import math
from scipy.stats import t


def one_sample_ttest(
    data=None,
    x_bar=None,
    s=None,
    n=None,
    mu0=0.0,
    alternative="two-sided",   # "two-sided", "greater", "less"
    alpha=0.05
):
    """
    One-sample t-test for the population mean.
    Uses both:
      (1) p-value method
      (2) critical region method
    """

    # ---------- Parse input ----------
    if data is not None:
        values = [float(x) for x in data.replace(",", " ").split()]
        n = len(values)
        if n < 2:
            raise ValueError("Sample size must be at least 2.")

        x_bar = sum(values) / n
        s = math.sqrt(
            sum((x - x_bar) ** 2 for x in values) / (n - 1)
        )

    elif x_bar is not None and s is not None and n is not None:
        if n < 2:
            raise ValueError("Sample size must be at least 2.")
    else:
        raise ValueError("Provide either `data` OR (`x_bar`, `s`, `n`).")

    # ---------- Test statistic ----------
    se = s / math.sqrt(n)
    t_obs = (x_bar - mu0) / se
    df = n - 1

    # ---------- p-value method ----------
    if alternative == "two-sided":
        p_value = 2 * (1 - t.cdf(abs(t_obs), df))
    elif alternative == "greater":
        p_value = 1 - t.cdf(t_obs, df)
    elif alternative == "less":
        p_value = t.cdf(t_obs, df)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'.")

    reject_by_pvalue = (p_value <= alpha)  # reject H0 iff p-value <= alpha

    # ---------- Critical region method ----------
    if alternative == "two-sided":
        t_crit = t.ppf(1 - alpha / 2, df)
        reject_by_critical = abs(t_obs) > t_crit
        critical_region = f"|T| > {t_crit:.4f}"

    elif alternative == "greater":
        t_crit = t.ppf(1 - alpha, df)
        reject_by_critical = t_obs > t_crit
        critical_region = f"T > {t_crit:.4f}"

    else:  # "less"
        t_crit = t.ppf(alpha, df)
        reject_by_critical = t_obs < t_crit
        critical_region = f"T < {t_crit:.4f}"

    # ---------- Return results ----------
    return {
        "inputs": {
            "n": n,
            "x_bar": x_bar,
            "s": s,
            "mu0": mu0,
            "alternative": alternative,
            "alpha": alpha
        },
        "statistic": {
            "t_obs": t_obs,
            "df": df,
            "se": se
        },
        "p_value_method": {
            "p_value": p_value,
            "reject_H0": reject_by_pvalue
        },
        "critical_region_method": {
            "critical_region": critical_region,
            "t_crit": t_crit,
            "reject_H0": reject_by_critical
        }
    }

Using raw data¶

The bumblebee bat (also known as Kitti’s hog-nosed bat, Craseonycteris thonglongyai) is the world’s smallest mammal; its weight is approximately normally distributed with a mean of 1.9 grams. Such bats are roughly the size of a large bumblebee. A chiropterologist believes that the Kitti’s hog-nosed bats in a new geographical region under study have a different average weight than 1.9 grams. The weights (in grams) of a sample of 10 bats from the new region are shown below. Use the confidence interval method to test the claim that the mean weight of bumblebee bats in this region is not 1.9 g, at a 10% level of significance.

In [39]:
res = one_sample_ttest(
    data="1.9 2.24 2.13 2 1.54 1.96 1.79 2.18 1.81 2.3",
    mu0=1.9,
    alternative="two-sided",
    alpha=0.1
)

print(res)
{'inputs': {'n': 10, 'x_bar': 1.9849999999999999, 's': 0.23524219198283478, 'mu0': 1.9, 'alternative': 'two-sided', 'alpha': 0.1}, 'statistic': {'t_obs': 1.1426249638667096, 'df': 9, 'se': 0.07439011284363593}, 'p_value_method': {'p_value': 0.28267920117045664, 'reject_H0': False}, 'critical_region_method': {'critical_region': '|T| > 1.8331', 't_crit': 1.8331129326536333, 'reject_H0': False}}
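The problem asks for the confidence interval method, which `one_sample_ttest` does not report directly; a quick follow-up using the summary values printed above:

```python
from math import sqrt
from scipy.stats import t

# Summary statistics of the 10 bat weights (matching the output above)
n, x_bar, s, mu0, alpha = 10, 1.985, 0.235242, 1.9, 0.10

t_crit = t.ppf(1 - alpha / 2, n - 1)
half_width = t_crit * s / sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)

# mu0 = 1.9 lies inside the 90% interval, so H0 is not rejected --
# consistent with the p-value and critical-region results above
print(ci, ci[0] <= mu0 <= ci[1])
```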

Using summary statistics¶

The label on a particular brand of cream of mushroom soup states that (on average) there is 870 mg of sodium per serving. A nutritionist would like to test if the average is actually more than the stated value. To test this, 13 servings of this soup were randomly selected and amount of sodium measured. The sample mean was found to be 882.4 mg and the sample standard deviation was 24.3 mg. Assume that the amount of sodium per serving is normally distributed. Test this claim using the traditional method of hypothesis testing. Use the α = 0.05 level of significance.

In [35]:
res = one_sample_ttest(
    x_bar=882.4,
    s=24.3,
    n=13,
    mu0=870,
    alternative="greater",
    alpha=0.05
)

print(res)
{'inputs': {'n': 13, 'x_bar': 882.4, 's': 24.3, 'mu0': 870, 'alternative': 'greater', 'alpha': 0.05}, 'statistic': {'t_obs': 1.8398697866565177, 'df': 12, 'se': 6.7396073841365345}, 'p_value_method': {'p_value': 0.04532103678298238, 'reject_H0': True}, 'critical_region_method': {'critical_region': 'T > 1.7823', 't_crit': 1.7822875556491589, 'reject_H0': True}}

Two-Sample Test for Proportions

1. Problem Setting and Motivation¶

The two-sample test for proportions is used when we want to compare two population proportions based on independent samples.

Typical data science questions:

  • Is the conversion rate different between version A and version B?
  • Does a new recommendation algorithm increase click-through rate?
  • Is the defect rate lower for supplier 1 than supplier 2?
  • Are positive label rates equal across two groups?

This test is a core statistical tool behind A/B testing.


2. Statistical Model¶

Let $$ X_1, \dots, X_{n_1} \sim \text{Bernoulli}(p_1), \qquad Y_1, \dots, Y_{n_2} \sim \text{Bernoulli}(p_2), $$ where:

  • $p_1$ and $p_2$ are the unknown population proportions
  • the two samples are independent

Define the numbers of successes: $$ S_1 = \sum_{i=1}^{n_1} X_i, \qquad S_2 = \sum_{j=1}^{n_2} Y_j $$

Sample proportions: $$ \hat{p}_1 = \frac{S_1}{n_1}, \qquad \hat{p}_2 = \frac{S_2}{n_2} $$


3. Parameter of Interest¶

The quantity of interest is the difference of proportions: $$ \Delta = p_1 - p_2 $$


4. Hypotheses¶

We test: $$ H_0: p_1 = p_2 \quad \text{(equivalently } \Delta = 0 \text{)} $$

Against one of the following alternatives:

  • Two-sided $$ H_1: p_1 \neq p_2 $$

  • Right-tailed $$ H_1: p_1 > p_2 $$

  • Left-tailed $$ H_1: p_1 < p_2 $$

The alternative must be chosen before observing the data.


5. Sampling Distribution of the Difference¶

We have: $$ \mathbb{E}[\hat{p}_1 - \hat{p}_2] = p_1 - p_2 $$ $$ \mathrm{Var}(\hat{p}_1 - \hat{p}_2) = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2} $$


6. Null Hypothesis and Pooled Proportion¶

Under $H_0: p_1 = p_2 = p$, the common proportion is estimated by the pooled estimator: $$ \hat{p} = \frac{S_1 + S_2}{n_1 + n_2} $$

This pooling reflects the assumption that both samples come from the same population under $H_0$.


7. Normal Approximation (CLT)¶

By the Central Limit Theorem, under $H_0$: $$ Z = \frac{(\hat{p}_1 - \hat{p}_2)} {\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \;\approx\; \mathcal{N}(0,1) $$


8. Validity Conditions¶

The normal approximation is valid when: $$ n_1\hat{p} \ge 5,\quad n_1(1-\hat{p}) \ge 5, $$ $$ n_2\hat{p} \ge 5,\quad n_2(1-\hat{p}) \ge 5 $$

If these conditions fail:

  • use exact tests (e.g. Fisher’s exact test)
  • or permutation tests
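For small samples, `scipy.stats.fisher_exact` runs the exact test on the 2×2 table of successes and failures; the counts below are illustrative:

```python
from scipy.stats import fisher_exact

# Rows: group 1 and group 2; columns: successes, failures
table = [[7, 3],
         [3, 7]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)
```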

9. Test Statistic¶

The two-sample z-statistic for proportions is: $$ Z = \frac{\hat{p}_1 - \hat{p}_2} {\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$

This measures how many standard deviations the observed difference is from zero under $H_0$.


10. Rejection Regions¶

Let $\alpha$ be the significance level.

Two-Sided Test¶

Reject $H_0$ if: $$ |Z| \ge z_{1-\alpha/2} $$

Right-Tailed Test¶

Reject $H_0$ if: $$ Z \ge z_{1-\alpha} $$

Left-Tailed Test¶

Reject $H_0$ if: $$ Z \le -z_{1-\alpha} $$


11. p-value¶

Let $Z_{\text{obs}}$ be the observed value of the test statistic.

  • Two-sided $$ p\text{-value} = 2\bigl(1 - \Phi(|Z_{\text{obs}}|)\bigr) $$

  • Right-tailed $$ p\text{-value} = 1 - \Phi(Z_{\text{obs}}) $$

  • Left-tailed $$ p\text{-value} = \Phi(Z_{\text{obs}}) $$


12. Decision Rule¶

$$ \text{Reject } H_0 \iff p\text{-value} \le \alpha $$

13. Confidence Interval for Difference of Proportions¶

A $(1-\alpha)$ confidence interval for $p_1 - p_2$: $$ (\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2} \sqrt{ \frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2} } $$

Important:

  • No pooling is used in confidence intervals
  • Pooling is only used under the null hypothesis
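For instance, the unpooled interval for the late-student example further below ($\hat{p}_1 = 13/200$, $\hat{p}_2 = 16/200$) can be computed directly:

```python
from math import sqrt
from scipy.stats import norm

p1, n1 = 13 / 200, 200
p2, n2 = 16 / 200, 200
alpha = 0.05

# Unpooled SE: pooling is only appropriate under H0, not for the interval
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = norm.ppf(1 - alpha / 2)
diff = p1 - p2
ci = (diff - z * se, diff + z * se)

# 0 inside the interval -> consistent with not rejecting H0: p1 = p2
print(ci, ci[0] <= 0 <= ci[1])
```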

14. Relation to A/B Testing¶

  • Group A → proportion $p_1$
  • Group B → proportion $p_2$
  • Null hypothesis: no difference
  • This test is the classical frequentist A/B test

In practice, it is often complemented or replaced by:

  • logistic regression
  • Bayesian A/B testing
  • bootstrap methods

15. Practical Data Science Remarks¶

  • Always report:
    • $\hat{p}_1$, $\hat{p}_2$
    • difference $\hat{p}_1 - \hat{p}_2$
    • confidence interval
  • Statistical significance does not imply business relevance
  • Large samples can make tiny differences significant

16. Summary¶

  • Two independent Bernoulli samples
  • Parameter of interest: $p_1 - p_2$
  • Test statistic: pooled z-test
  • CLT-based approximation
  • Foundation of classical A/B testing

Example¶

A vice principal wants to see if there is a difference between the proportion of students who are late to the first class of the day and the proportion who are late to the class right after lunch. To test this claim, the vice principal randomly selects 200 students from a first-period class and records whether they are late, then randomly selects 200 students from an after-lunch class and records whether they are late. At the 0.05 level of significance, can a difference be concluded?

|  | First Class | After-Lunch Class |
| --- | --- | --- |
| Sample size | 200 | 200 |
| Number of late students | 13 | 16 |

In [42]:
import math
from scipy.stats import norm


def two_sample_proportion_ztest(
    data1=None,
    data2=None,
    p1_hat=None,
    n1=None,
    p2_hat=None,
    n2=None,
    diff0=0.0,                 # H0: p1 - p2 = diff0 (usually 0)
    alternative="two-sided",    # "two-sided", "greater", "less"
    alpha=0.05,
    success_values=("1", "success", "yes", "true")
):
    """
    Two-sample Z-test for proportions.
    Works with either:
      (A) raw data strings (data1, data2)
      (B) summary inputs (p1_hat, n1, p2_hat, n2)

    Uses BOTH:
      (1) p-value method
      (2) critical region method

    Note: For the classical pooled two-proportion z-test (valid when diff0 = 0),
          we pool the proportions under H0. If diff0 != 0, we use the unpooled SE.
    """

    # ---------- Parse input ----------
    if data1 is not None and data2 is not None:
        tokens1 = data1.lower().replace(",", " ").split()
        tokens2 = data2.lower().replace(",", " ").split()

        n1 = len(tokens1)
        n2 = len(tokens2)

        x1 = sum(tok in success_values for tok in tokens1)
        x2 = sum(tok in success_values for tok in tokens2)

        p1_hat = x1 / n1
        p2_hat = x2 / n2

    elif (p1_hat is not None and n1 is not None and
          p2_hat is not None and n2 is not None):
        x1 = p1_hat * n1
        x2 = p2_hat * n2
    else:
        raise ValueError("Provide either (data1, data2) OR (p1_hat, n1, p2_hat, n2).")

    if n1 <= 0 or n2 <= 0:
        raise ValueError("n1 and n2 must be positive.")
    if not (0 <= p1_hat <= 1) or not (0 <= p2_hat <= 1):
        raise ValueError("p1_hat and p2_hat must be in [0,1].")

    # ---------- Test statistic ----------
    # If H0 is p1 - p2 = 0, use pooled SE (classical two-proportion z-test)
    if diff0 == 0.0:
        p_pool = (p1_hat * n1 + p2_hat * n2) / (n1 + n2)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
        z_obs = (p1_hat - p2_hat - diff0) / se
        se_type = "pooled (H0: p1-p2=0)"
    else:
        # General diff0 ≠ 0: use unpooled SE (common practical approach)
        se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
        z_obs = (p1_hat - p2_hat - diff0) / se
        se_type = "unpooled (general diff0)"

    # ---------- p-value method ----------
    if alternative == "two-sided":
        p_value = 2 * (1 - norm.cdf(abs(z_obs)))
    elif alternative == "greater":
        # H1: p1 - p2 > diff0
        p_value = 1 - norm.cdf(z_obs)
    elif alternative == "less":
        # H1: p1 - p2 < diff0
        p_value = norm.cdf(z_obs)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'.")

    reject_by_pvalue = (p_value <= alpha)  # reject H0 iff p-value <= alpha

    # ---------- Critical region method ----------
    if alternative == "two-sided":
        z_crit = norm.ppf(1 - alpha / 2)
        reject_by_critical = abs(z_obs) > z_crit
        critical_region = f"|Z| > {z_crit:.4f}"
    elif alternative == "greater":
        z_crit = norm.ppf(1 - alpha)
        reject_by_critical = z_obs > z_crit
        critical_region = f"Z > {z_crit:.4f}"
    else:  # "less"
        z_crit = norm.ppf(alpha)
        reject_by_critical = z_obs < z_crit
        critical_region = f"Z < {z_crit:.4f}"

    # ---------- Return results ----------
    return {
        "inputs": {
            "n1": n1, "p1_hat": p1_hat, "x1": x1,
            "n2": n2, "p2_hat": p2_hat, "x2": x2,
            "diff0": diff0,
            "alternative": alternative,
            "alpha": alpha
        },
        "statistic": {
            "z_obs": z_obs,
            "se": se,
            "se_type": se_type
        },
        "p_value_method": {
            "p_value": p_value,
            "reject_H0": reject_by_pvalue
        },
        "critical_region_method": {
            "critical_region": critical_region,
            "z_crit": z_crit,
            "reject_H0": reject_by_critical
        }
    }


# ------------------ Example usage ------------------
if __name__ == "__main__":
    # Example 1: raw data strings (1 = success, 0 = failure)
    res1 = two_sample_proportion_ztest(
        data1="1 0 1 1 0 1 1 0 1 1",
        data2="1 0 0 0 1 0 0 0 1 0",
        diff0=0.0,
        alternative="two-sided",
        alpha=0.05
    )
    print("Example 1 (data strings):")
    print(res1, "\n")

    # Example 2: summary inputs
    res2 = two_sample_proportion_ztest(
        p1_hat=13/200, n1=200,
        p2_hat=16/200, n2=200,
        diff0=0.0,
        alternative="two-sided",
        alpha=0.05
    )
    print("Example 2 (summary inputs):")
    print(res2)
Example 1 (data strings):
{'inputs': {'n1': 10, 'p1_hat': 0.7, 'x1': 7, 'n2': 10, 'p2_hat': 0.3, 'x2': 3, 'diff0': 0.0, 'alternative': 'two-sided', 'alpha': 0.05}, 'statistic': {'z_obs': 1.7888543819998317, 'se': 0.22360679774997896, 'se_type': 'pooled (H0: p1-p2=0)'}, 'p_value_method': {'p_value': 0.07363827012030266, 'reject_H0': False}, 'critical_region_method': {'critical_region': '|Z| > 1.9600', 'z_crit': 1.959963984540054, 'reject_H0': False}} 

Example 2 (summary inputs):
{'inputs': {'n1': 200, 'p1_hat': 0.065, 'x1': 13.0, 'n2': 200, 'p2_hat': 0.08, 'x2': 16.0, 'diff0': 0.0, 'alternative': 'two-sided', 'alpha': 0.05}, 'statistic': {'z_obs': -0.5784492956984421, 'se': 0.025931399885081405, 'se_type': 'pooled (H0: p1-p2=0)'}, 'p_value_method': {'p_value': 0.5629608205677976, 'reject_H0': False}, 'critical_region_method': {'critical_region': '|Z| > 1.9600', 'z_crit': 1.959963984540054, 'reject_H0': False}}

Example¶

Adults in the general United States population volunteer an average of 4.2 hours per week. A random sample of 18 undergraduate college students and 20 graduate college students produced the results below concerning the amount of time spent in volunteer service per week. At the α = 0.01 level of significance, is there sufficient evidence to conclude that a difference exists between the mean number of volunteer hours per week for undergraduate and graduate college students? Assume that the number of volunteer hours per week is normally distributed.

|  | Undergraduate | Graduate |
| --- | --- | --- |
| Sample mean | 2.5 | 3.8 |
| Sample variance | 2.2 | 3.5 |
| Sample size | 18 | 20 |
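This example calls for a two-sample t-test for means, which the notes above do not implement; a sketch from summary statistics using `scipy.stats.ttest_ind_from_stats`, here with the classical pooled-variance test (Welch's test via `equal_var=False` is a common alternative when equal variances are doubtful):

```python
from math import sqrt
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the table above (std = sqrt of the sample variance)
t_obs, p_value = ttest_ind_from_stats(
    mean1=2.5, std1=sqrt(2.2), nobs1=18,
    mean2=3.8, std2=sqrt(3.5), nobs2=20,
    equal_var=True,              # pooled-variance two-sample t-test
)

alpha = 0.01
# p-value exceeds 0.01, so the difference is not significant at this level
print(t_obs, p_value, p_value <= alpha)
```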