Seminar 14

Testing Normality of a Sample¶

Skewness and Kurtosis¶

1. Introduction¶

When checking whether a dataset is approximately normal, two important numerical characteristics are:

  • Skewness
  • Kurtosis

These quantities describe:

  • asymmetry,
  • tail behavior,
  • concentration of probability mass.

They help us understand how a distribution differs from a normal distribution.


skewness


3.1 Definition¶

The skewness of a random variable is

$$ \gamma_1 = \frac{ \mathbb{E}[(X-\mu)^3] }{ \sigma^3 }. $$

This is the standardized third central moment.

In [ ]:
 


3.2 Why Divide by $\sigma^3$?¶

The numerator has units:

$$ (\text{units of }X)^3. $$

Dividing by

$$ \sigma^3 $$

makes skewness dimensionless.

Thus skewness is scale invariant.


3.3 Interpretation¶

Skewness measures asymmetry.


Symmetric distribution¶

$$ \gamma_1=0. $$

Example:

  • normal distribution.

Positive skewness¶

$$ \gamma_1>0. $$

Long right tail.

Example:

  • income distributions.

Negative skewness¶

$$ \gamma_1<0. $$

Long left tail.

Example:

  • difficult exam scores.

3.4 Why the Third Power Detects Asymmetry¶

Positive deviations contribute positively:

$$ (X-\mu)^3>0. $$

Negative deviations contribute negatively:

$$ (X-\mu)^3<0. $$

Thus asymmetry does not cancel.


3.6 Sample Skewness¶

For data

$$ x_1,\dots,x_n, $$

sample skewness is

$$ g_1 = \frac{ \frac1n\sum_{i=1}^n (x_i-\bar x)^3 }{ \left( \frac1n\sum_{i=1}^n (x_i-\bar x)^2 \right)^{3/2} }. $$
In [ ]:
 

Kurtosis


4.1 Definition¶

The kurtosis is

$$ \gamma_2 = \frac{ \mathbb{E}[(X-\mu)^4] }{ \sigma^4 }. $$

This is the standardized fourth central moment.


4.2 Excess Kurtosis¶

For the normal distribution:

$$ \gamma_2=3. $$

Therefore statisticians often define:

$$ \text{Excess Kurtosis} = \gamma_2-3. $$

Thus:

Distribution Excess Kurtosis
Normal 0
Heavy-tailed Positive
Light-tailed Negative

4.3 Interpretation¶

Kurtosis measures:

  • tail heaviness,
  • outlier-proneness,
  • concentration in tails.

It is NOT simply “peakedness”.

This is a common misconception.


4.4 Why the Fourth Power Matters¶

The fourth power magnifies large deviations enormously.

Example:

$$ 2^4=16, $$

but

$$ 5^4=625. $$

Thus extreme observations dominate the fourth moment.

Therefore kurtosis is highly sensitive to outliers and heavy tails.

4.7 Sample Kurtosis¶

Sample kurtosis is

$$ g_2 = \frac{ \frac1n\sum_{i=1}^n (x_i-\bar x)^4 }{ \left( \frac1n\sum_{i=1}^n (x_i-\bar x)^2 \right)^2 }. $$

Sample excess kurtosis is

$$ g_2-3. $$

5. Relationship with Normality¶

The normal distribution satisfies:

$$ \text{Skewness}=0, $$

and

$$ \text{Excess Kurtosis}=0. $$

Therefore:

  • nonzero skewness suggests asymmetry,
  • nonzero excess kurtosis suggests non-normal tails.

In [ ]:
 

Testing Normality of a Sample

1. Introduction¶

Many classical statistical methods assume that the data are normally distributed.

Examples:

  • t-tests,
  • ANOVA,
  • linear regression inference,
  • confidence intervals,
  • Pearson correlation inference.

Therefore, before applying these methods, we often want to check whether the sample is approximately normal.

Suppose we observe a sample

$$ X_1,\dots,X_n. $$

We want to test:

$$ H_0:\text{the data come from a normal distribution} $$

against

$$ H_1:\text{the data are not normally distributed}. $$

2. Important Philosophy¶

Normality tests should NEVER be used mechanically.

For large samples:

  • even tiny deviations from normality may become significant.

For small samples:

  • tests may have low power and fail to detect non-normality.

Therefore, in practice we usually combine:

  • graphical methods,
  • numerical tests,
  • domain knowledge.

The most common tools are:

Method Type
Histogram Visual
QQ-plot Visual
Shapiro–Wilk Formal test
Anderson–Darling Formal test
Kolmogorov–Smirnov Formal test
Pearson Chi-Square Formal test

3. Histogram¶

The simplest method is to draw a histogram.

If the data are approximately normal:

  • the histogram should look bell-shaped,
  • symmetric,
  • unimodal (one peak).

However:

  • histograms depend strongly on bin width,
  • small samples can be misleading.

Therefore histograms are useful but insufficient.


In [ ]:
 

4. QQ-Plot (Quantile-Quantile Plot)¶

The QQ-plot is one of the most important graphical methods.


4.1 Main Idea¶

If the sample is normal, then:

  • the empirical quantiles,
  • and the theoretical normal quantiles

should approximately coincide.


4.2 Ordered Sample¶

Let

$$ X_{(1)}\le X_{(2)}\le \cdots \le X_{(n)} $$

be the ordered observations.

These are called the order statistics.


4.3 Theoretical Normal Quantiles¶

For each observation we define

$$ p_i=\frac{i-0.5}{n}. $$

Then compute

$$ z_i=\Phi^{-1}(p_i), $$

where

$$ \Phi $$

is the standard normal CDF.

Thus:

$$ z_i $$

is the theoretical quantile of the standard normal distribution.


4.4 Constructing the QQ-Plot¶

We plot:

  • x-axis: theoretical normal quantiles
$$ z_i $$
  • y-axis: ordered observations
$$ X_{(i)}. $$

If the data are normal, the points should lie approximately on a straight line.


4.5 Interpretation¶

Approximately linear¶

Suggests approximate normality.


S-shaped curve¶

Indicates skewness.


Curved tails¶

Indicates heavy tails or light tails.


Extreme deviations at ends¶

Suggests outliers.


4.6 Why QQ-Plots are Important¶

QQ-plots are often more informative than formal tests because they show:

  • where deviations occur,
  • skewness,
  • tail behavior,
  • outliers.

Many statisticians trust QQ-plots more than p-values.


5. Pearson Chi-Square Test for Normality¶

This is one of the oldest normality tests.


5.1 Main Idea¶

We compare:

  • observed frequencies,
  • expected frequencies under normality.

5.2 Step 1: Create Intervals¶

Partition the real line into bins:

$$ I_1,\dots,I_k. $$

Example:

$$ (-\infty,-1],(-1,0],(0,1],(1,\infty). $$

5.3 Step 2: Observed Frequencies¶

Let

$$ O_j $$

be the number of observations inside interval

$$ I_j. $$

5.4 Step 3: Estimate Parameters¶

Usually the mean and variance are unknown.

Estimate them using:

$$ \hat{\mu}=\bar X, $$

and

$$ \hat{\sigma}^2 = \frac1n\sum_{i=1}^n(X_i-\bar X)^2. $$

5.5 Step 4: Expected Frequencies¶

Under the fitted normal distribution,

$$ X\sim N(\hat{\mu},\hat{\sigma}^2), $$

the probability of interval

$$ I_j $$

is

$$ p_j=P(X\in I_j). $$

Therefore the expected count is

$$ E_j=np_j. $$

5.6 Test Statistic¶

The Pearson chi-square statistic is

$$ \chi^2 = \sum_{j=1}^k \frac{(O_j-E_j)^2}{E_j}. $$

Large values indicate poor fit.


5.7 Degrees of Freedom¶

The degrees of freedom are

$$ df=k-1-m, $$

where:

  • $k$ = number of bins,
  • $m$ = number of estimated parameters.

For normality:

  • mean estimated,
  • variance estimated,

so

$$ m=2. $$

Thus:

$$ df=k-3. $$

5.8 Limitations¶

The Pearson test:

  • depends strongly on binning,
  • loses information,
  • performs poorly for small samples.

Therefore it is rarely the preferred method today.


In [ ]:
 

8. Anderson–Darling Test¶

The Anderson–Darling test improves upon KS by giving more weight to the tails.

This is extremely important because:

  • non-normality often appears in tails,
  • outliers strongly affect inference.

8.1 Main Idea¶

Compare:

  • empirical distribution,
  • theoretical normal distribution,

but emphasize tail deviations.


8.2 Test Statistic¶

Suppose

$$ X_{(1)}\le \cdots \le X_{(n)} $$

are ordered observations.

Define

$$ Z_i=\frac{X_{(i)}-\hat{\mu}}{\hat{\sigma}}. $$

Then:

$$ A^2 = -n - \frac1n \sum_{i=1}^n (2i-1) \left[ \log(\Phi(Z_i)) + \log(1-\Phi(Z_{n+1-i})) \right]. $$

Large values indicate non-normality.


8.3 Why It Is Powerful¶

The logarithms strongly emphasize:

  • extreme quantiles,
  • tail discrepancies.

Therefore Anderson–Darling is particularly good at detecting:

  • heavy tails,
  • outliers,
  • tail asymmetry.

8.4 Advantages¶

  • strong tail sensitivity,
  • generally powerful,
  • better than KS in many situations.

9. Shapiro–Wilk Test¶

This is usually considered the best default normality test.

Especially for:

  • small samples,
  • moderate samples.

9.1 Main Idea¶

For normal data, ordered observations should behave approximately like:

  • expected normal order statistics.

The Shapiro–Wilk test measures how well:

  • the observed order statistics,
  • align with normal order statistics.

9.2 Ordered Sample¶

Let

$$ X_{(1)}\le \cdots \le X_{(n)} $$

be the ordered observations.


9.3 Test Statistic¶

The statistic is

$$ W = \frac{ \left( \sum_{i=1}^n a_i X_{(i)} \right)^2 }{ \sum_{i=1}^n (X_i-\bar X)^2 }. $$

The constants

$$ a_i $$

depend on:

  • expected normal order statistics,
  • covariance matrix of order statistics.

They are precomputed numerically.


9.4 Interpretation¶

If the sample is normal:

  • numerator is large,
  • points align well,
  • $W$ is close to 1.

If the sample is non-normal:

  • deviations appear,
  • $W$ becomes smaller.

Thus:

  • small values of $W$ indicate non-normality.

9.5 Why It Works Well¶

The test is sensitive to:

  • skewness,
  • kurtosis,
  • tail behavior,
  • general departures from normality.

It usually has very high power.


9.6 Advantages¶

  • excellent power,
  • works well for small samples,
  • widely recommended.

9.7 Limitations¶

For extremely large samples:

  • even tiny harmless deviations become significant.

Thus interpretation should always include graphical diagnostics.


10. Comparison of Methods¶

Method Main Idea Strengths Weaknesses
Histogram Visual shape Simple Subjective
QQ-plot Compare quantiles Very informative Subjective
Pearson Chi-Square Compare frequencies Classical Depends on bins
KS/Lilliefors Compare CDFs Simple Weak power
Anderson–Darling Tail-sensitive EDF test Strong tails Slightly more complex
Shapiro–Wilk Order statistics Best general power Sensitive for huge n

11. Which Test Should Be Used?¶

General recommendation:

Situation Recommended Method
Small/moderate sample Shapiro–Wilk
Tail behavior important Anderson–Darling
Visual diagnosis QQ-plot
Classical teaching Pearson Chi-Square
Distribution-free EDF idea Lilliefors

12. Important Practical Advice¶

Never rely only on p-values.

Always inspect:

  • QQ-plots,
  • histograms,
  • residual plots.

Remember:

  • with large $n$, tiny deviations become significant;
  • with small $n$, tests may miss serious deviations.

Statistical significance is not the same as practical importance.


13. Normality in Regression¶

In regression, the assumption is NOT:

$$ Y_i \sim N(\mu,\sigma^2). $$

Instead, the assumption is:

$$ \varepsilon_i \sim N(0,\sigma^2). $$

Therefore we test normality of:

  • residuals,
  • not necessarily raw observations.

This is extremely important.


14. Summary¶

To verify normality we use:

  • graphical methods,
  • formal hypothesis tests.

The most important tools are:

Method Type
Histogram Visual
QQ-plot Visual
Shapiro–Wilk Formal test
Anderson–Darling Formal test
Lilliefors Formal test
Pearson Chi-Square Formal test

Main recommendations:

  • QQ-plot is often the most informative graphical tool.
  • Shapiro–Wilk is usually the best default test.
  • Anderson–Darling is especially strong for tail deviations.
  • Pearson chi-square is classical but less modern.
  • Lilliefors corrects the KS test when parameters are estimated.

In practice, good statistical analysis combines:

  • theory,
  • graphics,
  • formal testing,
  • subject-matter understanding.
In [ ]:
 

Jarque–Bera Test

$$ H_0: \text{sample is normal} $$

against

$$ H_1: \text{sample is not normal}. $$

6.2 Test Statistic¶

The statistic is

$$ JB = \frac{n}{6} \left( g_1^2 + \frac{(g_2-3)^2}{4} \right). $$

where:

  • $g_1$ = sample skewness,
  • $g_2$ = sample kurtosis.

Under $H_0$:

$$ JB \xrightarrow{d} \chi^2_2. $$

Large values indicate non-normality.


In [ ]:
 
In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import jarque_bera, shapiro, anderson, chisquare, norm
import scipy.stats as stats

# -----------------------------
# Generate normal data
# -----------------------------

np.random.seed(42)

x = np.random.normal(loc=0, scale=1, size=1000)

alpha = 0.05

# -----------------------------
# 1. Jarque-Bera test
# -----------------------------

jb_stat, jb_p = jarque_bera(x)

# -----------------------------
# 2. Shapiro-Wilk test
# -----------------------------

shapiro_stat, shapiro_p = shapiro(x)

# -----------------------------
# 3. Anderson-Darling test
# -----------------------------

ad_result = anderson(x, dist="norm")

ad_stat = ad_result.statistic

# scipy gives critical values, not p-values
# We compare the statistic with the 5% critical value

ad_critical_5 = ad_result.critical_values[
    list(ad_result.significance_level).index(5.0)
]

ad_decision = (
    "Reject H0"
    if ad_stat > ad_critical_5
    else "Fail to reject H0"
)

# -----------------------------
# 4. Pearson Chi-Square test
# -----------------------------

# Estimate parameters of normal distribution
mu_hat = np.mean(x)
sigma_hat = np.std(x, ddof=1)

# Choose number of bins
k = 10

# Use equal-probability bins under fitted normal distribution
probabilities = np.linspace(0, 1, k + 1)

bin_edges = norm.ppf(probabilities, loc=mu_hat, scale=sigma_hat)

# Replace infinities at the endpoints
bin_edges[0] = -np.inf
bin_edges[-1] = np.inf

# Observed frequencies
observed, _ = np.histogram(x, bins=bin_edges)

# Expected frequencies
expected = np.ones(k) * len(x) / k

# Pearson chi-square statistic manually
chi_stat = np.sum((observed - expected) ** 2 / expected)

# Degrees of freedom:
# k bins - 1 constraint - 2 estimated parameters
df = k - 1 - 2

chi_p = 1 - stats.chi2.cdf(chi_stat, df=df)

# -----------------------------
# Create comparison table
# -----------------------------

results = pd.DataFrame({
    "Test": [
        "Jarque-Bera",
        "Shapiro-Wilk",
        "Anderson-Darling",
        "Pearson Chi-Square"
    ],
    "Statistic": [
        jb_stat,
        shapiro_stat,
        ad_stat,
        chi_stat
    ],
    "p-value / critical value": [
        jb_p,
        shapiro_p,
        f"5% critical value = {ad_critical_5:.4f}",
        chi_p
    ],
    "Decision at alpha=0.05": [
        "Reject H0" if jb_p < alpha else "Fail to reject H0",
        "Reject H0" if shapiro_p < alpha else "Fail to reject H0",
        ad_decision,
        "Reject H0" if chi_p < alpha else "Fail to reject H0"
    ]
})

results
Out[11]:
Test Statistic p-value / critical value Decision at alpha=0.05
0 Jarque-Bera 2.456373 0.292823 Fail to reject H0
1 Shapiro-Wilk 0.998608 0.626629 Fail to reject H0
2 Anderson-Darling 0.347470 5% critical value = 0.7840 Fail to reject H0
3 Pearson Chi-Square 7.700000 0.359789 Fail to reject H0
In [12]:
# -----------------------------
# Print detailed output
# -----------------------------

print("Jarque-Bera test")
print("Statistic:", jb_stat)
print("p-value:", jb_p)
print()

print("Shapiro-Wilk test")
print("Statistic:", shapiro_stat)
print("p-value:", shapiro_p)
print()

print("Anderson-Darling test")
print("Statistic:", ad_stat)
print("Critical values:", ad_result.critical_values)
print("Significance levels:", ad_result.significance_level)
print("5% critical value:", ad_critical_5)
print()

print("Pearson Chi-Square test")
print("Statistic:", chi_stat)
print("Degrees of freedom:", df)
print("p-value:", chi_p)
Jarque-Bera test
Statistic: 2.4563732018799604
p-value: 0.2928231016469254

Shapiro-Wilk test
Statistic: 0.9986082911491394
p-value: 0.6266290545463562

Anderson-Darling test
Statistic: 0.3474697767348971
Critical values: [0.574 0.653 0.784 0.914 1.088]
Significance levels: [15.  10.   5.   2.5  1. ]
5% critical value: 0.784

Pearson Chi-Square test
Statistic: 7.700000000000001
Degrees of freedom: 7
p-value: 0.3597893327261251
In [13]:
# -----------------------------
# Plot histogram
# -----------------------------

plt.figure(figsize=(8, 5))

plt.hist(x, bins=30, density=True, alpha=0.6)

x_grid = np.linspace(min(x), max(x), 300)
normal_density = norm.pdf(x_grid, loc=mu_hat, scale=sigma_hat)

plt.plot(x_grid, normal_density, linewidth=2, label="Fitted normal density")

plt.title("Histogram of Normal Data")
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.grid(True)
plt.show()
In [14]:
# -----------------------------
# QQ-plot
# -----------------------------

stats.probplot(x, dist="norm", plot=plt)

plt.title("QQ-Plot of Normal Data")
plt.grid(True)
plt.show()
In [15]:
# -----------------------------
# Pearson Chi-Square observed vs expected counts
# -----------------------------

pearson_table = pd.DataFrame({
    "Bin": np.arange(1, k + 1),
    "Observed": observed,
    "Expected": expected
})

pearson_table
Out[15]:
Bin Observed Expected
0 1 103 100.0
1 2 97 100.0
2 3 113 100.0
3 4 94 100.0
4 5 91 100.0
5 6 111 100.0
6 7 98 100.0
7 8 106 100.0
8 9 83 100.0
9 10 104 100.0
In [ ]:
 

Example 2 — Test for Non-Normal Data¶

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import jarque_bera, shapiro, anderson, chisquare, norm
import scipy.stats as stats

# -----------------------------
# Generate exponential data
# -----------------------------

np.random.seed(42)

x = np.random.exponential(scale=1, size=1000)

alpha = 0.05

# -----------------------------
# 1. Jarque-Bera test
# -----------------------------

jb_stat, jb_p = jarque_bera(x)

# -----------------------------
# 2. Shapiro-Wilk test
# -----------------------------

shapiro_stat, shapiro_p = shapiro(x)

# -----------------------------
# 3. Anderson-Darling test
# -----------------------------

ad_result = anderson(x, dist="norm")

ad_stat = ad_result.statistic

# scipy returns critical values instead of p-values

ad_critical_5 = ad_result.critical_values[
    list(ad_result.significance_level).index(5.0)
]

ad_decision = (
    "Reject H0"
    if ad_stat > ad_critical_5
    else "Fail to reject H0"
)

# -----------------------------
# 4. Pearson Chi-Square test
# -----------------------------

# Estimate normal parameters
mu_hat = np.mean(x)
sigma_hat = np.std(x, ddof=1)

# Number of bins
k = 10

# Equal-probability bins under fitted normal
probabilities = np.linspace(0, 1, k + 1)

bin_edges = norm.ppf(probabilities, loc=mu_hat, scale=sigma_hat)

# Replace infinities
bin_edges[0] = -np.inf
bin_edges[-1] = np.inf

# Observed frequencies
observed, _ = np.histogram(x, bins=bin_edges)

# Expected frequencies
expected = np.ones(k) * len(x) / k

# Pearson chi-square statistic
chi_stat = np.sum((observed - expected) ** 2 / expected)

# Degrees of freedom:
# k bins - 1 constraint - 2 estimated parameters
df = k - 1 - 2

chi_p = 1 - stats.chi2.cdf(chi_stat, df=df)

# -----------------------------
# Create comparison table
# -----------------------------

results = pd.DataFrame({
    "Test": [
        "Jarque-Bera",
        "Shapiro-Wilk",
        "Anderson-Darling",
        "Pearson Chi-Square"
    ],
    "Statistic": [
        jb_stat,
        shapiro_stat,
        ad_stat,
        chi_stat
    ],
    "p-value / critical value": [
        jb_p,
        shapiro_p,
        f"5% critical value = {ad_critical_5:.4f}",
        chi_p
    ],
    "Decision at alpha=0.05": [
        "Reject H0" if jb_p < alpha else "Fail to reject H0",
        "Reject H0" if shapiro_p < alpha else "Fail to reject H0",
        ad_decision,
        "Reject H0" if chi_p < alpha else "Fail to reject H0"
    ]
})

results
Out[16]:
Test Statistic p-value / critical value Decision at alpha=0.05
0 Jarque-Bera 1714.130613 0.0 Reject H0
1 Shapiro-Wilk 0.824626 0.0 Reject H0
2 Anderson-Darling 46.519607 5% critical value = 0.7840 Reject H0
3 Pearson Chi-Square 376.680000 0.0 Reject H0
In [17]:
# -----------------------------
# Print detailed output
# -----------------------------

print("Jarque-Bera test")
print("Statistic:", jb_stat)
print("p-value:", jb_p)
print()

print("Shapiro-Wilk test")
print("Statistic:", shapiro_stat)
print("p-value:", shapiro_p)
print()

print("Anderson-Darling test")
print("Statistic:", ad_stat)
print("Critical values:", ad_result.critical_values)
print("Significance levels:", ad_result.significance_level)
print("5% critical value:", ad_critical_5)
print()

print("Pearson Chi-Square test")
print("Statistic:", chi_stat)
print("Degrees of freedom:", df)
print("p-value:", chi_p)
Jarque-Bera test
Statistic: 1714.1306134960598
p-value: 0.0

Shapiro-Wilk test
Statistic: 0.8246256113052368
p-value: 1.0949680507146874e-31

Anderson-Darling test
Statistic: 46.51960676156841
Critical values: [0.574 0.653 0.784 0.914 1.088]
Significance levels: [15.  10.   5.   2.5  1. ]
5% critical value: 0.784

Pearson Chi-Square test
Statistic: 376.68
Degrees of freedom: 7
p-value: 0.0
In [18]:
# -----------------------------
# Plot histogram
# -----------------------------

plt.figure(figsize=(8, 5))

plt.hist(x, bins=30, density=True, alpha=0.6)

x_grid = np.linspace(min(x), max(x), 300)

normal_density = norm.pdf(
    x_grid,
    loc=mu_hat,
    scale=sigma_hat
)

plt.plot(
    x_grid,
    normal_density,
    linewidth=2,
    label="Fitted normal density"
)

plt.title("Histogram of Exponential Data")
plt.xlabel("x")
plt.ylabel("Density")

plt.legend()
plt.grid(True)
plt.show()
In [19]:
# -----------------------------
# QQ-plot
# -----------------------------

stats.probplot(x, dist="norm", plot=plt)

plt.title("QQ-Plot of Exponential Data")

plt.grid(True)
plt.show()
In [20]:
# -----------------------------
# Skewness and kurtosis
# -----------------------------

skewness = stats.skew(x)
excess_kurtosis = stats.kurtosis(x)

print("Sample skewness:", skewness)
print("Sample excess kurtosis:", excess_kurtosis)
Sample skewness: 1.865034222864843
Sample excess kurtosis: 5.2178275282033955
In [ ]:
 

Salary Data¶

In [22]:
# =========================================
# CELL 1 — Import libraries
# =========================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.stats import (
    jarque_bera,
    shapiro,
    anderson,
    norm,
    skew,
    kurtosis
)

import scipy.stats as stats
In [23]:
# =========================================
# CELL 2 — Load dataset
# =========================================

df = pd.read_csv("ds_salaries.csv")

df.head()
Out[23]:
Unnamed: 0 work_year experience_level employment_type job_title salary salary_currency salary_in_usd employee_residence remote_ratio company_location company_size
0 0 2020 MI FT Data Scientist 70000 EUR 79833 DE 0 DE L
1 1 2020 SE FT Machine Learning Scientist 260000 USD 260000 JP 0 JP S
2 2 2020 SE FT Big Data Engineer 85000 GBP 109024 GB 50 GB M
3 3 2020 MI FT Product Data Analyst 20000 USD 20000 HN 0 HN S
4 4 2020 SE FT Machine Learning Engineer 150000 USD 150000 US 50 US L
In [24]:
# =========================================
# CELL 3 — Inspect dataset
# =========================================

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          607 non-null    int64 
 1   work_year           607 non-null    int64 
 2   experience_level    607 non-null    object
 3   employment_type     607 non-null    object
 4   job_title           607 non-null    object
 5   salary              607 non-null    int64 
 6   salary_currency     607 non-null    object
 7   salary_in_usd       607 non-null    int64 
 8   employee_residence  607 non-null    object
 9   remote_ratio        607 non-null    int64 
 10  company_location    607 non-null    object
 11  company_size        607 non-null    object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB
In [25]:
# =========================================
# CELL 4 — Column names
# =========================================

df.columns
Out[25]:
Index(['Unnamed: 0', 'work_year', 'experience_level', 'employment_type',
       'job_title', 'salary', 'salary_currency', 'salary_in_usd',
       'employee_residence', 'remote_ratio', 'company_location',
       'company_size'],
      dtype='object')
In [26]:
# =========================================
# CELL 5 — Extract salary column
# =========================================

salary = df["salary_in_usd"]

salary.head()
Out[26]:
0     79833
1    260000
2    109024
3     20000
4    150000
Name: salary_in_usd, dtype: int64
In [27]:
# =========================================
# CELL 6 — Clean the data
# =========================================

salary = salary.dropna()

salary = salary[salary > 0]

salary = salary.values

print("Sample size:", len(salary))
Sample size: 607
In [28]:
# =========================================
# CELL 7 — Basic statistics
# =========================================

print("Mean:", np.mean(salary))
print("Median:", np.median(salary))
print("Standard deviation:", np.std(salary, ddof=1))
Mean: 112297.86985172982
Median: 101570.0
Standard deviation: 70957.25941139569
In [29]:
# =========================================
# CELL 8 — Histogram of salaries
# =========================================

plt.figure(figsize=(10,6))

plt.hist(salary, bins=40, density=True)

plt.title("Histogram of Salaries")
plt.xlabel("Salary (USD)")
plt.ylabel("Density")

plt.grid(True)

plt.show()
In [30]:
# =========================================
# CELL 9 — QQ-plot
# =========================================

stats.probplot(salary, dist="norm", plot=plt)

plt.title("QQ-Plot of Salaries")

plt.grid(True)

plt.show()
In [31]:
# =========================================
# CELL 10 — Skewness and kurtosis
# =========================================

salary_skewness = skew(salary)

salary_kurtosis = kurtosis(salary)

print("Skewness:", salary_skewness)

print("Excess kurtosis:", salary_kurtosis)
Skewness: 1.6634213360977623
Excess kurtosis: 6.291709208027671
In [32]:
# =========================================
# CELL 11 — Jarque-Bera test
# =========================================

jb_stat, jb_p = jarque_bera(salary)

print("Jarque-Bera statistic:", jb_stat)

print("p-value:", jb_p)
Jarque-Bera statistic: 1281.1111067841746
p-value: 6.460376212006185e-279
In [33]:
# =========================================
# CELL 12 — Shapiro-Wilk test
# =========================================

shapiro_stat, shapiro_p = shapiro(salary)

print("Shapiro-Wilk statistic:", shapiro_stat)

print("p-value:", shapiro_p)
Shapiro-Wilk statistic: 0.8983551263809204
p-value: 1.254381529574248e-19
In [34]:
# =========================================
# CELL 13 — Anderson-Darling test
# =========================================

ad_result = anderson(salary, dist="norm")

print("Anderson-Darling statistic:")

print(ad_result.statistic)

print("\nCritical values:")

print(ad_result.critical_values)

print("\nSignificance levels:")

print(ad_result.significance_level)
Anderson-Darling statistic:
7.167722523512339

Critical values:
[0.572 0.652 0.782 0.912 1.085]

Significance levels:
[15.  10.   5.   2.5  1. ]
In [35]:
# =========================================
# CELL 14 — Pearson Chi-Square test
# =========================================

mu_hat = np.mean(salary)

sigma_hat = np.std(salary, ddof=1)

k = 10

probabilities = np.linspace(0, 1, k + 1)

bin_edges = norm.ppf(
    probabilities,
    loc=mu_hat,
    scale=sigma_hat
)

bin_edges[0] = -np.inf
bin_edges[-1] = np.inf

observed, _ = np.histogram(
    salary,
    bins=bin_edges
)

expected = np.ones(k) * len(salary) / k

chi_stat = np.sum(
    (observed - expected)**2 / expected
)

df_chi = k - 1 - 2

chi_p = 1 - stats.chi2.cdf(
    chi_stat,
    df=df_chi
)

print("Chi-square statistic:", chi_stat)

print("Degrees of freedom:", df_chi)

print("p-value:", chi_p)
Chi-square statistic: 36.80560131795716
Degrees of freedom: 7
p-value: 5.10556341737356e-06
In [36]:
# =========================================
# CELL 15 — Pearson observed vs expected
# =========================================

pearson_table = pd.DataFrame({
    "Bin": np.arange(1, k + 1),
    "Observed": observed,
    "Expected": expected
})

pearson_table
Out[36]:
Bin Observed Expected
0 1 36 60.7
1 2 80 60.7
2 3 79 60.7
3 4 76 60.7
4 5 66 60.7
5 6 65 60.7
6 7 46 60.7
7 8 64 60.7
8 9 42 60.7
9 10 53 60.7
In [37]:
# =========================================
# CELL 16 — Summary table
# =========================================

alpha = 0.05

ad_critical_5 = ad_result.critical_values[
    list(ad_result.significance_level).index(5.0)
]

ad_decision = (
    "Reject H0"
    if ad_result.statistic > ad_critical_5
    else "Fail to reject H0"
)

results = pd.DataFrame({
    "Test": [
        "Jarque-Bera",
        "Shapiro-Wilk",
        "Anderson-Darling",
        "Pearson Chi-Square"
    ],
    "Statistic": [
        jb_stat,
        shapiro_stat,
        ad_result.statistic,
        chi_stat
    ],
    "p-value / critical value": [
        jb_p,
        shapiro_p,
        f"5% critical value = {ad_critical_5:.4f}",
        chi_p
    ],
    "Decision at alpha=0.05": [
        "Reject H0" if jb_p < alpha else "Fail to reject H0",
        "Reject H0" if shapiro_p < alpha else "Fail to reject H0",
        ad_decision,
        "Reject H0" if chi_p < alpha else "Fail to reject H0"
    ]
})

results
Out[37]:
Test Statistic p-value / critical value Decision at alpha=0.05
0 Jarque-Bera 1281.111107 0.0 Reject H0
1 Shapiro-Wilk 0.898355 0.0 Reject H0
2 Anderson-Darling 7.167723 5% critical value = 0.7820 Reject H0
3 Pearson Chi-Square 36.805601 0.000005 Reject H0
In [ ]:
 
In [38]:
# ==========================================================
# NORMALITY ANALYSIS OF NHANES HEIGHT DATA
# ==========================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.stats import (
    jarque_bera,
    shapiro,
    anderson,
    skew,
    kurtosis,
    norm
)

import scipy.stats as stats

# ----------------------------------------------------------
# 1. Load dataset
# ----------------------------------------------------------

df = pd.read_csv("NHANES Weight and Height.csv")

print("Columns:")
print(df.columns)

# ----------------------------------------------------------
# 2. Select height column
# ----------------------------------------------------------

# Adjust column name if necessary
# Common names: "Height", "Height_cm", etc.

height_col = [col for col in df.columns if "height" in col.lower()][0]

print("\nUsing column:", height_col)

x = df[height_col]

# ----------------------------------------------------------
# 3. Clean data
# ----------------------------------------------------------

x = x.dropna()

x = x[x > 0]

x = x.values

print("\nSample size:", len(x))

# ----------------------------------------------------------
# 4. Basic statistics
# ----------------------------------------------------------

print("\nBasic statistics")
print("----------------------")
print("Mean:", np.mean(x))
print("Median:", np.median(x))
print("Standard deviation:", np.std(x, ddof=1))

# ----------------------------------------------------------
# 5. Histogram
# ----------------------------------------------------------

plt.figure(figsize=(10,6))

plt.hist(x, bins=30, density=True, alpha=0.6)

mu_hat = np.mean(x)
sigma_hat = np.std(x, ddof=1)

x_grid = np.linspace(min(x), max(x), 500)

normal_density = norm.pdf(
    x_grid,
    loc=mu_hat,
    scale=sigma_hat
)

plt.plot(
    x_grid,
    normal_density,
    linewidth=2,
    label="Fitted normal density"
)

plt.title("Histogram of Heights")
plt.xlabel("Height")
plt.ylabel("Density")

plt.legend()
plt.grid(True)

plt.show()

# ----------------------------------------------------------
# 6. QQ-plot
# ----------------------------------------------------------

stats.probplot(x, dist="norm", plot=plt)

plt.title("QQ-Plot of Heights")

plt.grid(True)

plt.show()

# ----------------------------------------------------------
# 7. Skewness and kurtosis
# ----------------------------------------------------------

sample_skewness = skew(x)

sample_kurtosis = kurtosis(x)

print("\nSkewness and Kurtosis")
print("----------------------")
print("Skewness:", sample_skewness)
print("Excess kurtosis:", sample_kurtosis)

# ----------------------------------------------------------
# 8. Jarque-Bera test
# ----------------------------------------------------------

jb_stat, jb_p = jarque_bera(x)

# ----------------------------------------------------------
# 9. Shapiro-Wilk test
# ----------------------------------------------------------

# Shapiro can fail for very large datasets
# Use a random subsample if necessary

if len(x) > 5000:
    x_shapiro = np.random.choice(x, 5000, replace=False)
else:
    x_shapiro = x

shapiro_stat, shapiro_p = shapiro(x_shapiro)

# ----------------------------------------------------------
# 10. Anderson-Darling test
# ----------------------------------------------------------

ad_result = anderson(x, dist="norm")

ad_stat = ad_result.statistic

ad_critical_5 = ad_result.critical_values[
    list(ad_result.significance_level).index(5.0)
]

ad_decision = (
    "Reject H0"
    if ad_stat > ad_critical_5
    else "Fail to reject H0"
)

# ----------------------------------------------------------
# 11. Pearson Chi-Square test
# ----------------------------------------------------------

k = 10

probabilities = np.linspace(0, 1, k + 1)

bin_edges = norm.ppf(
    probabilities,
    loc=mu_hat,
    scale=sigma_hat
)

bin_edges[0] = -np.inf
bin_edges[-1] = np.inf

observed, _ = np.histogram(
    x,
    bins=bin_edges
)

expected = np.ones(k) * len(x) / k

chi_stat = np.sum(
    (observed - expected)**2 / expected
)

df_chi = k - 1 - 2

chi_p = 1 - stats.chi2.cdf(
    chi_stat,
    df=df_chi
)

# ----------------------------------------------------------
# 12. Results table
# ----------------------------------------------------------

alpha = 0.05

results = pd.DataFrame({
    "Test": [
        "Jarque-Bera",
        "Shapiro-Wilk",
        "Anderson-Darling",
        "Pearson Chi-Square"
    ],
    "Statistic": [
        jb_stat,
        shapiro_stat,
        ad_stat,
        chi_stat
    ],
    "p-value / critical value": [
        jb_p,
        shapiro_p,
        f"5% critical value = {ad_critical_5:.4f}",
        chi_p
    ],
    "Decision at alpha=0.05": [
        "Reject H0" if jb_p < alpha else "Fail to reject H0",
        "Reject H0" if shapiro_p < alpha else "Fail to reject H0",
        ad_decision,
        "Reject H0" if chi_p < alpha else "Fail to reject H0"
    ]
})

print("\nNormality Test Results")
print("----------------------")

display(results)

# ----------------------------------------------------------
# 13. Pearson observed vs expected table
# ----------------------------------------------------------

pearson_table = pd.DataFrame({
    "Bin": np.arange(1, k + 1),
    "Observed": observed,
    "Expected": expected
})

print("\nObserved vs Expected Frequencies")
print("--------------------------------")

display(pearson_table)
Columns:
Index(['Unnamed: 0', 'Weight (kg)', 'Standing Height (cm)', 'BMI(kg/m**2)'], dtype='object')

Using column: Standing Height (cm)

Sample size: 8388

Basic statistics
----------------------
Mean: 166.64118979494518
Median: 166.2
Standard deviation: 10.079013460239327
Skewness and Kurtosis
----------------------
Skewness: 0.14247082433871722
Excess kurtosis: -0.41736992735109846

Normality Test Results
----------------------
Test Statistic p-value / critical value Decision at alpha=0.05
0 Jarque-Bera 89.258595 0.0 Reject H0
1 Shapiro-Wilk 0.994761 0.0 Reject H0
2 Anderson-Darling 10.993311 5% critical value = 0.7870 Reject H0
3 Pearson Chi-Square 78.161183 0.0 Reject H0
Observed vs Expected Frequencies
--------------------------------
Bin Observed Expected
0 1 820 838.8
1 2 1010 838.8
2 3 888 838.8
3 4 816 838.8
4 5 780 838.8
5 6 736 838.8
6 7 756 838.8
7 8 780 838.8
8 9 880 838.8
9 10 922 838.8
In [ ]:
 
In [ ]: