Comprehensive Guide to Distribution Fitting in Python
Quick Start • Examples • Learning Path • Real-World Applications
- Quick Start
- Examples Overview
- Learning Path
- Installation
- Examples by Topic
- Real-World Applications
- Best Practices
- Troubleshooting
- Contributing
```python
from distfit_pro import get_distribution
import numpy as np

# Generate sample data
data = np.random.normal(100, 15, 1000)

# Fit distribution
dist = get_distribution('normal')
dist.fit(data)

# Get statistics
print(f"Mean: {dist.mean():.2f}")
print(f"Std: {dist.std():.2f}")
print(f"AIC: {dist.aic():.2f}")
```

```python
from distfit_pro import find_best_distribution

# Try multiple distributions
candidates = ['normal', 'lognormal', 'gamma', 'weibull_min']
best = find_best_distribution(data, candidates)
print(f"Best distribution: {best.name}")
print(f"AIC: {best.aic():.2f}")
```

This repository contains 20+ comprehensive examples organized into 7 categories:
```
examples/
├── 01_basics/               # Start here!
│   ├── basic_fitting.py
│   └── common_distributions.py
├── 02_advanced_fitting/     # Advanced estimation
│   ├── maximum_likelihood.py
│   └── method_of_moments.py
├── 03_model_selection/      # Choose best distribution
│   ├── aic_bic_comparison.py
│   ├── cross_validation.py
│   └── hypothesis_testing.py
├── 04_goodness_of_fit/      # Validate fits
│   ├── ks_test.py
│   ├── chi_square_test.py
│   └── anderson_darling.py
├── 05_visualization/        # Beautiful plots
│   ├── pdf_cdf_plots.py
│   ├── qq_pp_plots.py
│   └── interactive_plots.py
├── 06_real_world/           # Practical applications
│   ├── finance_analysis.py
│   ├── reliability_engineering.py
│   └── quality_control.py
└── 07_advanced_topics/      # Expert techniques
    ├── mixture_models.py
    ├── bootstrap_confidence.py
    └── custom_distributions.py
```
1. `01_basics/basic_fitting.py` ⭐
   - First steps with distribution fitting
   - Understanding parameters
   - Simple visualizations
2. `01_basics/common_distributions.py`
   - Normal, Exponential, Gamma, Weibull
   - When to use each distribution
   - Parameter interpretation
3. `05_visualization/pdf_cdf_plots.py`
   - Visualize fitted distributions
   - Compare multiple fits
   - Publication-quality plots
4. `03_model_selection/aic_bic_comparison.py`
   - AIC vs BIC
   - Model comparison
   - Avoiding overfitting
5. `04_goodness_of_fit/ks_test.py`
   - Validate your fits
   - Kolmogorov-Smirnov test
   - Statistical significance
6. `05_visualization/qq_pp_plots.py`
   - Q-Q plots for diagnostics
   - Identify distribution issues
   - Tail behavior analysis
7. `06_real_world/` (choose your domain)
   - `finance_analysis.py`: Risk, VaR, portfolios
   - `reliability_engineering.py`: Failure analysis, MTBF
   - `quality_control.py`: SPC, Cp/Cpk
8. `07_advanced_topics/mixture_models.py`
   - Gaussian mixture models
   - Multiple populations
   - EM algorithm
9. `07_advanced_topics/bootstrap_confidence.py`
   - Uncertainty quantification
   - Confidence intervals
   - Parameter stability
10. `07_advanced_topics/custom_distributions.py`
    - Create your own distributions
    - Kernel Density Estimation
    - Truncated distributions
File: `06_real_world/finance_analysis.py`

```python
# Value at Risk (VaR) calculation
returns = load_stock_returns()
dist = get_distribution('t')  # Fat-tailed
dist.fit(returns)
var_95 = dist.ppf(0.05)  # 95% VaR
print(f"Maximum expected loss: {var_95*100:.2f}%")
```

Use Cases:
- Portfolio risk assessment
- VaR calculation
- Stress testing
- Option pricing
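The same VaR calculation can be sketched without the repo's helpers, using plain `scipy.stats` and simulated returns in place of `load_stock_returns()` (both the simulation and the parameter values below are illustrative assumptions, not part of the example file):

```python
import numpy as np
from scipy import stats

# Simulated daily returns stand in for load_stock_returns() (assumption)
rng = np.random.default_rng(42)
returns = rng.standard_t(df=4, size=2000) * 0.01

# Fit a Student's t distribution (fat tails) and read off the 95% VaR
df, loc, scale = stats.t.fit(returns)
var_95 = stats.t.ppf(0.05, df, loc=loc, scale=scale)
print(f"Maximum expected loss: {var_95*100:.2f}%")
```

The 5th percentile of the fitted return distribution is the loss threshold exceeded only 5% of the time.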
File: `06_real_world/quality_control.py`

```python
# Process capability analysis
measurements = load_process_data()
dist = get_distribution('normal')
dist.fit(measurements)

# Calculate Cpk
USL, LSL = 10.5, 9.5
Cpk = calculate_cpk(dist, USL, LSL)
print(f"Process capability: Cpk = {Cpk:.3f}")
```

Use Cases:
- Process capability (Cp/Cpk)
- Control charts
- Six Sigma analysis
- Defect rate estimation
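The `calculate_cpk` helper above lives in the example file; a minimal sketch of the standard formula, Cpk = min(USL − μ, μ − LSL) / (3σ), might look like this (the function signature here takes μ and σ directly and is a hypothetical variant, not the repo's actual API):

```python
def calculate_cpk(mu, sigma, usl, lsl):
    """Process capability index for a normal process:
    Cpk = min(USL - mu, mu - LSL) / (3 * sigma)."""
    return min(usl - mu, mu - lsl) / (3 * sigma)

# Example: process centered at 10.0 with sigma 0.1, spec limits 9.5-10.5
cpk = calculate_cpk(mu=10.0, sigma=0.1, usl=10.5, lsl=9.5)
print(f"Cpk = {cpk:.3f}")  # min(0.5, 0.5) / (3 * 0.1)
```

A Cpk of at least 1.33 is a common minimum target for a capable process.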
File: `06_real_world/reliability_engineering.py`

```python
# Weibull failure analysis
failure_times = load_failure_data()
dist = get_distribution('weibull_min')
dist.fit(failure_times)

# Mean Time Between Failures
mtbf = dist.mean()
print(f"MTBF: {mtbf:.0f} hours")

# Reliability at time t
R_1000 = dist.sf(1000)  # Survival function
print(f"Reliability at 1000h: {R_1000:.3f}")
```

Use Cases:
- Failure time analysis
- Maintenance scheduling
- Reliability prediction
- Warranty analysis
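For reference, the same Weibull workflow can be reproduced with `scipy.stats` alone; the simulated failure times below are an assumption standing in for `load_failure_data()`:

```python
import numpy as np
from scipy import stats

# Simulated failure times stand in for load_failure_data() (assumption)
rng = np.random.default_rng(0)
failure_times = stats.weibull_min.rvs(1.5, scale=1200, size=500, random_state=rng)

# Fit with the location pinned at zero (failure times start at t = 0)
shape, loc, scale = stats.weibull_min.fit(failure_times, floc=0)

mtbf = stats.weibull_min.mean(shape, loc=loc, scale=scale)
r_1000 = stats.weibull_min.sf(1000, shape, loc=loc, scale=scale)
print(f"MTBF: {mtbf:.0f} hours, R(1000h): {r_1000:.3f}")
```

A fitted shape above 1 indicates wear-out failures; below 1 indicates infant mortality.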
| Method | File | Complexity | Use When |
|---|---|---|---|
| Maximum Likelihood | `02_advanced_fitting/maximum_likelihood.py` | ★★★ | Standard approach, works well |
| Method of Moments | `02_advanced_fitting/method_of_moments.py` | ★★ | Fast, simple parameters |
| Kernel Density | `07_advanced_topics/custom_distributions.py` | ★★★ | Non-parametric, complex data |
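To make the MLE vs. Method of Moments contrast concrete, here is an illustrative comparison for the gamma distribution using plain `scipy`/`numpy` (the simulated data and parameter values are assumptions). MoM solves mean = αθ and variance = αθ² in closed form, while MLE optimizes the likelihood numerically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.gamma(shape=2.0, scale=3.0, size=5000)

# Maximum likelihood via scipy (location fixed at 0)
a_mle, loc, scale_mle = stats.gamma.fit(data, floc=0)

# Method of moments: mean = a * theta, var = a * theta**2
m, v = data.mean(), data.var()
a_mom = m**2 / v
scale_mom = v / m

print(f"MLE: shape={a_mle:.2f}, scale={scale_mle:.2f}")
print(f"MoM: shape={a_mom:.2f}, scale={scale_mom:.2f}")
```

Both estimators recover parameters near the true (2.0, 3.0); MoM is instantaneous, while MLE is generally more efficient.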
| Criterion | File | Pros | Cons |
|---|---|---|---|
| AIC | `03_model_selection/aic_bic_comparison.py` | Balances fit & complexity | Can overfit |
| BIC | `03_model_selection/aic_bic_comparison.py` | Penalizes complexity more | May underfit |
| Cross-Validation | `03_model_selection/cross_validation.py` | Data-driven | Computationally expensive |
| Hypothesis Tests | `03_model_selection/hypothesis_testing.py` | Statistical rigor | Binary decision |
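Both criteria come from the maximized log-likelihood ln L̂ with k parameters and n observations: AIC = 2k − 2 ln L̂ and BIC = k ln n − 2 ln L̂. A minimal sketch computing both for two `scipy.stats` candidates (simulated data and the helper function are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

def aic_bic(dist, data, **fit_kwargs):
    """AIC = 2k - 2*lnL, BIC = k*ln(n) - 2*lnL from a scipy fit."""
    params = dist.fit(data, **fit_kwargs)
    log_l = np.sum(dist.logpdf(data, *params))
    k, n = len(params), len(data)
    return 2 * k - 2 * log_l, k * np.log(n) - 2 * log_l

for name, dist, kw in [('normal', stats.norm, {}),
                       ('lognormal', stats.lognorm, {'floc': 0})]:
    aic, bic = aic_bic(dist, data, **kw)
    print(f"{name}: AIC={aic:.1f}, BIC={bic:.1f}")
```

On this right-skewed data, the lognormal wins (lower AIC and BIC) despite BIC's stiffer complexity penalty.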
| Test | File | Best For | Limitations |
|---|---|---|---|
| Kolmogorov-Smirnov | `04_goodness_of_fit/ks_test.py` | Overall fit | Most sensitive near the center, weak in the tails |
| Chi-Square | `04_goodness_of_fit/chi_square_test.py` | Categorical/binned data | Requires binning |
| Anderson-Darling | `04_goodness_of_fit/anderson_darling.py` | Tail behavior | Critical values only for specific distributions |
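The KS and Anderson-Darling tests from the table are also available directly in `scipy.stats`; a hedged sketch on simulated data (note that estimating the parameters from the same sample makes the KS p-value optimistic, which is exactly what the example files explore):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(100, 15, size=500)

# Kolmogorov-Smirnov against the fitted normal
mu, sigma = stats.norm.fit(data)
ks_stat, ks_p = stats.kstest(data, 'norm', args=(mu, sigma))
print(f"KS: stat={ks_stat:.3f}, p={ks_p:.3f}")

# Anderson-Darling normality test: compare statistic to critical values
ad = stats.anderson(data, dist='norm')
print(f"AD: stat={ad.statistic:.3f}, 5% critical={ad.critical_values[2]:.3f}")
```

Reject the fit when the AD statistic exceeds the critical value at your chosen significance level.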
| Plot | File | Purpose |
|---|---|---|
| PDF/CDF | `05_visualization/pdf_cdf_plots.py` | See distribution shape |
| Q-Q Plot | `05_visualization/qq_pp_plots.py` | Diagnose fit quality |
| P-P Plot | `05_visualization/qq_pp_plots.py` | Check probability match |
| Interactive | `05_visualization/interactive_plots.py` | Explore data dynamically |
```python
import matplotlib.pyplot as plt

# Look at your data!
plt.hist(data, bins=50, edgecolor='black')
plt.show()

# Check for:
# - Outliers
# - Multimodality
# - Skewness
# - Bounded ranges
```

```python
candidates = ['normal', 'lognormal', 'gamma', 'weibull_min']
results = {}
for dist_name in candidates:
    dist = get_distribution(dist_name)
    dist.fit(data)
    results[dist_name] = dist.aic()

# Choose best by AIC
best = min(results, key=results.get)
print(f"Best: {best} (AIC={results[best]:.2f})")
```

```python
from scipy import stats
import numpy as np

# Fit distribution
dist.fit(data)

# Q-Q plot
percentiles = np.linspace(0.01, 0.99, len(data))
theoretical = dist.ppf(percentiles)
empirical = np.sort(data)
plt.scatter(theoretical, empirical)
plt.plot([data.min(), data.max()], [data.min(), data.max()], 'r--')
plt.show()
# Points should fall on the diagonal line!
```

```python
# Use bootstrap for confidence intervals
from examples.advanced_topics.bootstrap_confidence import bootstrap_ci

ci_mean = bootstrap_ci(data, statistic=np.mean, n_bootstrap=1000)
print(f"Mean: {data.mean():.2f} [95% CI: {ci_mean[0]:.2f}, {ci_mean[1]:.2f}]")
```

```python
# For normal distribution:
# 1. Check normality
from scipy.stats import shapiro
stat, p_value = shapiro(data)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")

# 2. Check for outliers
z_scores = np.abs(stats.zscore(data))
outliers = data[z_scores > 3]
print(f"Outliers: {len(outliers)}")
```

Solution 1: Try different distributions
```python
# Your data might not be normally distributed
candidates = ['lognormal', 'gamma', 'weibull_min', 'beta']
```

Solution 2: Check for a mixture

```python
# Multiple populations?
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=2)
```

Solution 3: Use a non-parametric estimate

```python
# KDE doesn't assume a distribution
from scipy.stats import gaussian_kde
kde = gaussian_kde(data)
```

Solution 1: Check the data range

```python
# Some distributions require positive data
if (data <= 0).any():
    data = data[data > 0]  # Filter
    # Or shift: data = data - data.min() + 0.01
```

Solution 2: Scale the data

```python
# Large values can cause numerical issues
data_scaled = (data - data.mean()) / data.std()
```

Solution 3: Use a different method

```python
# Try Method of Moments instead of MLE
dist.fit(data, method='MoM')
```

Rule of thumb:
- n < 50: Use 1-2 parameter distributions
- n = 50-200: Up to 3 parameters OK
- n > 200: Can use complex distributions

```python
# Check sample size
if len(data) < 50:
    candidates = ['normal', 'exponential']  # Simple
else:
    candidates = ['normal', 'gamma', 'weibull_min']  # More complex
```

| Distribution | Use When | Parameters | Domain |
|---|---|---|---|
| Normal | Symmetric, bell-shaped | μ, σ | (-∞, ∞) |
| Lognormal | Right-skewed, positive | μ, σ | (0, ∞) |
| Exponential | Time between events | λ | (0, ∞) |
| Gamma | Positive, flexible shape | α, β | (0, ∞) |
| Weibull | Failure times | β (shape), η (scale) | (0, ∞) |
| Beta | Bounded [0, 1] | α, β | [0, 1] |
| Uniform | Equal probability | a, b | [a, b] |
```python
# After fitting
dist.mean()    # Expected value
dist.std()     # Standard deviation
dist.var()     # Variance
dist.median()  # 50th percentile
dist.aic()     # Akaike Information Criterion
dist.bic()     # Bayesian Information Criterion
dist.pdf(x)    # Probability density at x
dist.cdf(x)    # Cumulative probability at x
dist.ppf(q)    # Quantile (inverse CDF)
dist.sf(x)     # Survival function (1 - CDF)
```

```bash
# Navigate to examples directory
cd examples/

# Run any example
python 01_basics/basic_fitting.py
python 06_real_world/finance_analysis.py

# Run all examples in a folder
for file in 01_basics/*.py; do python "$file"; done
```

```python
# Import example utilities
from examples.model_selection.aic_bic_comparison import compare_models
from examples.visualization.pdf_cdf_plots import plot_fit

# Use example functions
results = compare_models(data, ['normal', 'lognormal'])
plot_fit(data, best_dist)
```

We welcome contributions! Here's how:
- Report Issues: Found a bug? Open an issue
- Suggest Examples: Have a use case? Share it
- Submit PR: Improved code? Send a pull request
- Follow existing code style
- Add docstrings and comments
- Include example output
- Test with different data
- Statistical Distributions by Forbes et al.
- Probability and Statistics for Engineers by Montgomery
- Akaike (1974) - AIC
- Schwarz (1978) - BIC
- Shapiro & Wilk (1965) - Normality test
MIT License - see LICENSE for details
Ali Sadeghi Aghili
- GitHub: @alisadeghiaghili
- Website: zil.ink/thedatascientist
If these examples helped you, please ⭐ star this repo!
Happy Distribution Fitting!
Made with ❤️ by Ali Sadeghi Aghili