Complete guide to all 30 distributions.
Use when: Data is symmetric, bell-shaped, no outliers
from distfit_pro import get_distribution
import numpy as np
# Heights of adult males (cm)
data = np.random.normal(175, 7, 1000)
dist = get_distribution('normal')
dist.fit(data)
print(dist.summary())
Parameters:
- loc (μ): mean/center
- scale (σ): standard deviation
When NOT to use:
- Skewed data
- Heavy tails
- Bounded data (e.g., percentages)
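Before committing to a normal fit, the criteria above (symmetry, no heavy tails) can be checked with two moment statistics. The following is a minimal sketch using only the Python standard library; `sample_skewness` and `excess_kurtosis` are illustrative helper names, not part of distfit_pro.

```python
import random

def sample_skewness(xs):
    """Moment-based skewness: close to 0 for symmetric data."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(xs):
    """Moment-based excess kurtosis: close to 0 for normal data, positive for heavy tails."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(0)
heights = [rng.gauss(175, 7) for _ in range(5000)]            # symmetric
incomes = [rng.lognormvariate(10, 0.5) for _ in range(5000)]  # right-skewed

print(f"heights: skew={sample_skewness(heights):+.2f}, kurtosis={excess_kurtosis(heights):+.2f}")
print(f"incomes: skew={sample_skewness(incomes):+.2f}, kurtosis={excess_kurtosis(incomes):+.2f}")
```

Values near zero for both statistics are consistent with a normal model; a clearly positive skewness, as for the income sample, suggests looking at a right-skewed distribution instead.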
Use when: Data is positive, right-skewed
# Income data ($)
data = np.random.lognormal(10, 0.5, 1000)
dist = get_distribution('lognormal')
dist.fit(data)
print(dist.summary())
Parameters:
- s (σ): shape (log-scale std)
- scale (exp(μ)): scale
Common applications:
- Income/wealth
- Stock prices
- File sizes
- Particle sizes
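The parameter description above (s is the standard deviation of the logs, scale is exp(μ)) follows from the fact that the logarithm of lognormal data is normally distributed. As a rough sketch of that relationship, using only the standard library and not the distfit_pro fitting routine, the parameters can be estimated by hand from the log-transformed data:

```python
import math
import random
import statistics

rng = random.Random(1)
mu_true, sigma_true = 10.0, 0.5
incomes = [rng.lognormvariate(mu_true, sigma_true) for _ in range(5000)]

# Taking logs turns lognormal data into normal data.
logs = [math.log(x) for x in incomes]
mu_hat = statistics.fmean(logs)
sigma_hat = statistics.pstdev(logs)  # maximum-likelihood estimate uses the population std

s_hat = sigma_hat             # shape parameter s (σ)
scale_hat = math.exp(mu_hat)  # scale parameter exp(μ), which is also the median

print(f"s ≈ {s_hat:.3f}, scale ≈ {scale_hat:.0f}")
```

A fitted `s` and `scale` that differ substantially from these hand estimates would be worth investigating, for example for zeros or negative values in the data.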
Use when: Modeling time-to-failure, lifetimes
# Component lifetime (hours)
data = np.random.weibull(1.5, 1000) * 1000
dist = get_distribution('weibull')
dist.fit(data)
print(dist.summary())
# Reliability at t=500 hours
reliability = dist.reliability(500)
print(f"Reliability at 500h: {reliability:.4f}")
Parameters:
- c (k): shape
  - k < 1: decreasing failure rate (infant mortality)
  - k = 1: constant failure rate (random failures)
  - k > 1: increasing failure rate (wear-out)
- scale (λ): scale
Applications:
- Reliability engineering
- Failure time analysis
- Wind speed modeling
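The three failure-rate regimes listed for the shape parameter can be seen directly from the standard Weibull formulas, R(t) = exp(-(t/λ)^k) for reliability and h(t) = (k/λ)(t/λ)^(k-1) for the hazard rate. The sketch below evaluates them with the standard library; the function names are illustrative and are not the distfit_pro implementation of `reliability`.

```python
import math

def weibull_reliability(t, k, lam):
    """Probability of surviving beyond time t."""
    return math.exp(-((t / lam) ** k))

def weibull_hazard(t, k, lam):
    """Instantaneous failure rate at time t."""
    return (k / lam) * (t / lam) ** (k - 1)

lam = 1000.0  # scale in hours
for k in (0.5, 1.0, 1.5):
    early = weibull_hazard(100.0, k, lam)
    late = weibull_hazard(900.0, k, lam)
    if math.isclose(early, late):
        trend = "constant"
    elif early > late:
        trend = "decreasing"
    else:
        trend = "increasing"
    print(f"k={k}: hazard is {trend}, R(500h)={weibull_reliability(500.0, k, lam):.4f}")
```

Whatever the shape, reliability at t = λ is exp(-1), so the scale parameter is the time by which about 63% of units have failed.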
Use when: Waiting times, sum of exponentials
# Waiting time for 5 events
data = np.random.gamma(5, 2, 1000)
dist = get_distribution('gamma')
dist.fit(data)
print(dist.summary())
Parameters:
- a (α): shape
- scale (θ): scale
Special cases:
- α = 1: Exponential distribution
- α = k/2, θ = 2: Chi-square with k df
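Both special cases can be confirmed numerically by comparing density functions. This is a small standard-library sketch; the three density functions are written out from their textbook formulas and are not taken from distfit_pro.

```python
import math

def gamma_pdf(x, alpha, theta):
    """Gamma density with shape alpha and scale theta."""
    return x ** (alpha - 1) * math.exp(-x / theta) / (math.gamma(alpha) * theta ** alpha)

def exponential_pdf(x, theta):
    """Exponential density with mean theta."""
    return math.exp(-x / theta) / theta

def chi2_pdf(x, k):
    """Chi-square density with k degrees of freedom."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

for x in (0.5, 1.0, 3.0, 7.5):
    same_as_exp = math.isclose(gamma_pdf(x, 1.0, 2.0), exponential_pdf(x, 2.0))
    same_as_chi2 = math.isclose(gamma_pdf(x, 2.0, 2.0), chi2_pdf(x, 4))
    print(f"x={x}: alpha=1 matches exponential: {same_as_exp}, "
          f"alpha=2, theta=2 matches chi-square(4): {same_as_chi2}")
```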
Use when: Time between events (memoryless)
# Time between arrivals (minutes)
data = np.random.exponential(5, 1000)
dist = get_distribution('exponential')
dist.fit(data)
print(dist.summary())
# Probability of waiting < 3 minutes
prob = dist.cdf(np.array([3]))[0]
print(f"P(wait < 3 min) = {prob:.4f}")
Key property: Memoryless!
# P(X > 10 | X > 5) = P(X > 5)
# Past doesn't affect future
Use when: Data is bounded between 0 and 1
# Success rates, percentages
data = np.random.beta(2, 5, 1000)
dist = get_distribution('beta')
dist.fit(data)
print(dist.summary())
Parameters:
- a (α): shape 1
- b (β): shape 2
Applications:
- Conversion rates
- Probabilities
- Proportions
- Bayesian priors
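The Bayesian-prior use case relies on a standard conjugacy result: a Beta(a, b) prior on a success probability, combined with observed successes and failures, gives a Beta(a + successes, b + failures) posterior. A minimal sketch follows; the helper names and the trial counts are made up for illustration.

```python
def beta_update(a, b, successes, failures):
    """Posterior Beta parameters after observing binomial outcomes."""
    return a + successes, b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Prior belief about a conversion rate, matching the Beta(2, 5) sample above.
a0, b0 = 2, 5
# Hypothetical experiment: 30 conversions out of 100 visitors.
a1, b1 = beta_update(a0, b0, successes=30, failures=70)

print(f"prior mean     = {beta_mean(a0, b0):.4f}")
print(f"posterior mean = {beta_mean(a1, b1):.4f}  (Beta({a1}, {b1}))")
```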
Use when: Power-law, heavy tails, 80-20 rule
# Wealth distribution
data = (np.random.pareto(2, 1000) + 1) * 50000
dist = get_distribution('pareto')
dist.fit(data)
print(dist.summary())
Applications:
- Wealth/income distribution
- City sizes
- Word frequencies
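The 80-20 rule corresponds to one specific Pareto shape. For a Pareto distribution with shape α > 1, the top fraction p of the population holds a share p^(1 - 1/α) of the total, so an exact 80-20 split needs α = log 5 / log 4 ≈ 1.16. The sketch below evaluates this with the standard library; `top_share` is an illustrative helper, not a distfit_pro function.

```python
import math

def top_share(p, alpha):
    """Share of the total held by the top fraction p under a Pareto(alpha) law."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1, so shares are undefined")
    return p ** (1 - 1 / alpha)

alpha_8020 = math.log(5) / math.log(4)
print(f"alpha for an exact 80-20 split: {alpha_8020:.3f}")
print(f"top 20% share at that alpha:    {top_share(0.2, alpha_8020):.3f}")
print(f"top 20% share at alpha = 2:     {top_share(0.2, 2.0):.3f}")
```

With the shape of 2 used in the sample data above, concentration is milder: the top 20% hold a bit under half of the total.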
Use when: Small samples, heavier tails than normal
# Small sample data
data = np.random.standard_t(5, 100)
dist = get_distribution('studentt')
dist.fit(data)
print(dist.summary())
Parameters:
- df (ν): degrees of freedom
- As df → ∞, approaches Normal
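The convergence to the normal distribution can be illustrated by comparing tail densities as df grows. This sketch writes both densities from their textbook formulas using only the standard library; it is independent of how distfit_pro evaluates them.

```python
import math

def t_pdf(x, df):
    """Student's t density, computed in log space for numerical stability."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

x = 3.0  # a point in the tail, where the difference is most visible
gaps = []
for df in (1, 5, 30, 1000):
    gap = t_pdf(x, df) - normal_pdf(x)
    gaps.append(gap)
    print(f"df={df:>4}: t density at {x} exceeds normal by {gap:.6f}")
```

The excess tail density shrinks steadily as df increases; with df = 1 the t distribution is the Cauchy distribution.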
Use when: Count of rare events in fixed interval
# Number of calls per hour
data = np.random.poisson(lam=3.5, size=1000)
dist = get_distribution('poisson')
dist.fit(data)
print(dist.summary())
# P(exactly 5 calls)
prob = dist.pdf(np.array([5]))[0]
print(f"P(X = 5) = {prob:.4f}")
Parameter:
- mu (λ): rate (mean = variance)
Applications:
- Call center arrivals
- Website visitors
- Defects in manufacturing
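For reference, the probability computed above has a closed form, P(X = k) = e^(-λ) λ^k / k!, and the "mean = variance" property can be checked by summing over the probability mass function. A standard-library sketch, using the true rate of 3.5 rather than a fitted value:

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.5
p5 = poisson_pmf(5, lam)

# Truncating the sums at 60 terms is ample for lam = 3.5.
mean = sum(k * poisson_pmf(k, lam) for k in range(60))
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in range(60))

print(f"P(X = 5) = {p5:.4f}")
print(f"mean = {mean:.4f}, variance = {var:.4f}")
```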
Use when: n independent yes/no trials
# 10 coin flips, p=0.5
data = np.random.binomial(n=10, p=0.5, size=1000)
dist = get_distribution('binomial')
dist.fit(data)
print(dist.summary())
Parameters:
- n: number of trials
- p: success probability
Applications:
- Quality control (pass/fail)
- Survey responses (yes/no)
- A/B testing
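The two parameters determine the whole distribution through P(X = k) = C(n, k) p^k (1 - p)^(n - k), with mean np and variance np(1 - p). A short standard-library sketch for the coin-flip example (n = 10, p = 0.5):

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

print(f"P(X = 5) = {pmf[5]:.4f}")
print(f"mean = {mean:.2f} (n*p = {n * p:.2f}), variance = {var:.2f} (n*p*(1-p) = {n * p * (1 - p):.2f})")
```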
Use when: Overdispersed count data (variance > mean)
# Overdispersed counts
data = np.random.negative_binomial(5, 0.5, 1000)
dist = get_distribution('nbinom')
dist.fit(data)
print(dist.summary())
Better than Poisson when:
- Data shows more variability
- Clustering of events
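One simple way to judge whether counts are overdispersed is the dispersion index, variance divided by mean, which is 1 for a Poisson distribution. The sketch below simulates the same setup as the sample above (failures before 5 successes with p = 0.5, which has mean 5 and variance 10) using only the standard library; `draw_failures` and `dispersion_index` are illustrative helpers.

```python
import random
import statistics

def draw_failures(rng, n_successes, p):
    """Count failures before the n-th success in repeated Bernoulli(p) trials."""
    failures = successes = 0
    while successes < n_successes:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson data, above 1 when overdispersed."""
    return statistics.pvariance(counts) / statistics.fmean(counts)

rng = random.Random(2)
counts = [draw_failures(rng, 5, 0.5) for _ in range(5000)]

d = dispersion_index(counts)
print(f"mean = {statistics.fmean(counts):.2f}, dispersion index = {d:.2f}")
```

An index well above 1, as here, is a sign that a Poisson fit will understate the spread of the data.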
Continuous (25):
- Normal - symmetric, bell curve
- Lognormal - positive, right-skewed
- Weibull - reliability, lifetimes
- Gamma - waiting times
- Exponential - time between events
- Beta - bounded [0,1]
- Uniform - equal probability
- Triangular - three-point estimate
- Logistic - growth models
- Gumbel - extreme values (max)
- Frechet - extreme values (positive)
- Pareto - power law, 80-20
- Cauchy - undefined mean/variance
- Student's t - heavy tails
- Chi-squared - variance tests
- F - variance ratio
- Rayleigh - signal processing
- Laplace - sparse data
- Inverse Gamma - Bayesian priors
- Log-Logistic - survival analysis
Discrete (5):
- Poisson - rare event counts
- Binomial - n trials
- Negative Binomial - overdispersed
- Geometric - trials to first success
- Hypergeometric - sampling without replacement
Decision Tree:
- Is data discrete (counts) or continuous?
  - Discrete → Poisson, Binomial, etc.
  - Continuous → continue
- Is data bounded?
  - [0, 1] → Beta
  - [a, b] → Uniform, Triangular
  - [0, ∞) → Lognormal, Gamma, Weibull, Exponential
  - (-∞, ∞) → Normal, Logistic, Cauchy, Student's t
- Is data skewed?
  - Right-skewed → Lognormal, Gamma, Weibull
  - Symmetric → Normal, Logistic, Student's t
  - Left-skewed → Reflected versions
- Heavy tails?
  - Yes → Student's t, Cauchy, Pareto
  - No → Normal, Logistic
- Special domain?
  - Reliability → Weibull, Exponential
  - Extreme values → Gumbel, Frechet
  - Survival → Weibull, Log-Logistic
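The first questions of the decision tree can be encoded as a small lookup that returns candidate names to try. This is a simplified sketch, not a distfit_pro feature: `suggest_distributions` and its `support` labels are hypothetical, and the skewness and special-domain questions are left out.

```python
def suggest_distributions(discrete, support="real", heavy_tails=False):
    """Return candidate distribution names following the decision tree.

    support: "unit" for [0, 1], "interval" for [a, b],
             "positive" for [0, inf), "real" for (-inf, inf).
    """
    if discrete:
        return ["Poisson", "Binomial", "Negative Binomial", "Geometric", "Hypergeometric"]

    bounded = {
        "unit": ["Beta"],
        "interval": ["Uniform", "Triangular"],
        "positive": ["Lognormal", "Gamma", "Weibull", "Exponential"],
    }
    if support in bounded:
        return bounded[support]
    if support != "real":
        raise ValueError(f"unknown support: {support!r}")

    # Unbounded data: split on tail weight.
    return ["Student's t", "Cauchy"] if heavy_tails else ["Normal", "Logistic"]

print(suggest_distributions(discrete=False, support="positive"))
print(suggest_distributions(discrete=False, support="real", heavy_tails=True))
```

The returned names are only a shortlist; each candidate still needs to be fitted and checked against the data.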
- :doc:`03_fitting_methods` - Different ways to fit
- :doc:`04_gof_tests` - Test if fit is good