DistFit Pro 🎯

Professional Distribution Fitting for Python

A comprehensive, production-ready library for statistical distribution fitting that surpasses EasyFit and R's fitdistrplus with modern statistical methods, exceptional user experience, and robust software engineering.

Python 3.8+ | License: MIT

English | فارسی | Deutsch | 📋 CHANGELOG


🌟 What's New in v1.0.0

🎉 MAJOR RELEASE - Complete Feature Set!

30 Statistical Distributions (25 continuous + 5 discrete)
Goodness-of-Fit Tests (KS, AD, Chi-Square, Cramér-von Mises)
Bootstrap Confidence Intervals (Parametric & Non-parametric with BCa)
Enhanced Diagnostics (Residuals, Influence, Outlier Detection)
Weighted Data Support (Survey data, stratified sampling, frequency counts)
Multiple Estimation Methods (MLE, Moments, Quantile matching)
Multilingual (English, فارسی, Deutsch)
Comprehensive Documentation (9 tutorials + API reference)
20+ Complete Examples (8,500+ lines across 7 folders) 🆕

📚 New: Comprehensive Examples

Explore 20+ production-ready examples covering:

  • Basics & common distributions
  • Advanced fitting methods (MLE, MoM)
  • Model selection (AIC, BIC, Cross-validation)
  • Goodness-of-fit testing
  • Beautiful visualizations (PDF, CDF, Q-Q plots, interactive)
  • Real-world applications (Finance, Reliability, Quality Control)
  • Advanced topics (Mixture models, Bootstrap, Custom distributions)

🚀 Why Choose DistFit Pro?

Better Than EasyFit

  • ✅ Free and open source (MIT license)
  • ✅ Python ecosystem integration (NumPy, SciPy, pandas)
  • ✅ Advanced GOF tests (not just visual assessment)
  • ✅ Bootstrap CI (uncertainty quantification)
  • ✅ Weighted data support
  • ✅ Automated model selection (AIC/BIC)

Better Than R's fitdistrplus

  • ✅ Simpler, cleaner API
  • ✅ Better performance (parallel processing built-in)
  • ✅ Modern visualizations (matplotlib + plotly)
  • ✅ Self-documenting code and outputs
  • ✅ Multilingual support
  • ✅ More distributions (30 vs 23)

Professional Quality

  • ✅ Production-ready code
  • ✅ Comprehensive test suite
  • ✅ Full documentation (9 tutorials + 20+ examples)
  • ✅ Type hints throughout
  • ✅ Clean, maintainable architecture

📦 Installation

pip install distfit-pro

Development Installation:

git clone https://github.com/alisadeghiaghili/py-distfit-pro.git
cd py-distfit-pro
pip install -e ".[dev]"

Requirements:

  • Python >= 3.8
  • NumPy >= 1.20
  • SciPy >= 1.7
  • Matplotlib >= 3.3
  • Plotly >= 5.0
  • joblib >= 1.0
  • tqdm >= 4.60

⚡ Quick Start

Basic Usage

from distfit_pro import get_distribution
import numpy as np

# Generate data
np.random.seed(42)
data = np.random.normal(loc=10, scale=2, size=1000)

# Fit distribution
dist = get_distribution('normal')
dist.fit(data, method='mle')

# View results
print(dist.summary())  # Complete statistical summary
print(dist.explain())  # Conceptual explanation

Goodness-of-Fit Testing

from distfit_pro.core.gof_tests import GOFTests

# Run all GOF tests
results = GOFTests.run_all_tests(data, dist)
print(GOFTests.summary_table(results))

Bootstrap Confidence Intervals

from distfit_pro.core.bootstrap import Bootstrap

# Parametric bootstrap (1000 samples, parallel)
ci_results = Bootstrap.parametric(data, dist, n_bootstrap=1000, n_jobs=-1)

for param, result in ci_results.items():
    print(result)

Diagnostics & Outliers

from distfit_pro.core.diagnostics import Diagnostics

# Residual analysis
residuals = Diagnostics.residual_analysis(data, dist)
print(residuals.summary())

# Detect outliers
outliers = Diagnostics.detect_outliers(data, dist, method='zscore')
print(outliers.summary())

Weighted Data

from distfit_pro.core.weighted import WeightedFitting

# Data with weights (e.g., survey sampling weights)
weights = np.random.uniform(0.5, 1.5, 1000)

# Weighted fit
params = WeightedFitting.fit_weighted_mle(data, weights, dist)
dist.params = params
dist.fitted = True

print(dist.summary())

📊 Supported Distributions

Continuous Distributions (25)

| Distribution | Use Cases | Key Features |
|---|---|---|
| Normal | Heights, test scores, errors | Symmetric, bell curve |
| Lognormal | Income, stock prices | Right-skewed, positive |
| Weibull | Reliability, lifetimes | Flexible hazard rate |
| Gamma | Waiting times, rainfall | Sum of exponentials |
| Exponential | Time between events | Memoryless property |
| Beta | Probabilities, rates | Bounded [0,1] |
| Student's t | Small samples | Heavy tails |
| Pareto | Wealth, power law | 80-20 rule |
| Gumbel | Extreme maxima | Flood analysis |
| Laplace | Differences, errors | Double exponential |

And 15 more: Uniform, Triangular, Logistic, Frechet, Cauchy, Chi-Square, F, Rayleigh, Inverse Gamma, Log-Logistic, and others.

Discrete Distributions (5)

  • Poisson - Count of rare events
  • Binomial - Success/failure trials
  • Negative Binomial - Overdispersed counts
  • Geometric - Trials to first success
  • Hypergeometric - Sampling without replacement

🎯 Core Features

1. Multiple Estimation Methods

# Maximum Likelihood (asymptotically efficient when the model is correctly specified)
dist.fit(data, method='mle')

# Method of Moments (fast; matches sample moments to theoretical moments)
dist.fit(data, method='moments')

# Quantile Matching (robust to outliers)
dist.fit(data, method='quantile', quantiles=[0.25, 0.5, 0.75])
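
To make the difference between the first two methods concrete, here is a minimal sketch that uses SciPy directly rather than DistFit Pro. It assumes gamma-distributed data with the location fixed at zero; it illustrates the two estimators and is not a description of how `dist.fit` is implemented.

```python
import numpy as np
from scipy import stats

# Synthetic data with known parameters (shape=3, scale=2), for illustration only
rng = np.random.default_rng(42)
data = rng.gamma(shape=3.0, scale=2.0, size=5000)

# Maximum likelihood: SciPy maximizes the log-likelihood numerically.
# floc=0 pins the location so that only shape and scale are estimated.
shape_mle, _, scale_mle = stats.gamma.fit(data, floc=0)

# Method of moments: solve mean = k * theta and var = k * theta**2 for k and theta.
mean, var = data.mean(), data.var(ddof=1)
shape_mom = mean**2 / var
scale_mom = var / mean

print(f"MLE:     shape={shape_mle:.3f}, scale={scale_mle:.3f}")
print(f"Moments: shape={shape_mom:.3f}, scale={scale_mom:.3f}")
```

On a sample this large both estimates should land close to the true values; the gap between them tends to grow as the sample shrinks or the data get more skewed.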

2. Comprehensive GOF Tests

  • Kolmogorov-Smirnov - General purpose
  • Anderson-Darling - Sensitive to tails
  • Chi-Square - Frequency-based
  • Cramér-von Mises - Most sensitive near the center of the distribution

All tests include p-values, critical values, and interpretations.
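
For readers who want to see what three of these tests compute, the sketch below runs them with SciPy on a fitted normal distribution. It does not use `GOFTests`; the synthetic data and variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic sample, for illustration only
rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=500)

# Fit the candidate distribution first (MLE for the normal)
mu, sigma = stats.norm.fit(data)

# Kolmogorov-Smirnov: largest gap between the empirical and fitted CDF
ks = stats.kstest(data, "norm", args=(mu, sigma))

# Cramér-von Mises: integrated squared gap between the two CDFs
cvm = stats.cramervonmises(data, "norm", args=(mu, sigma))

# Anderson-Darling: like CvM but with extra weight on the tails
ad = stats.anderson(data, dist="norm")

print(f"KS:  statistic={ks.statistic:.4f}, p-value={ks.pvalue:.3f}")
print(f"CvM: statistic={cvm.statistic:.4f}, p-value={cvm.pvalue:.3f}")
print(f"AD:  statistic={ad.statistic:.4f}")
```

One caveat: the KS and CvM p-values above assume the parameters were specified in advance. When they are estimated from the same data, as here, those p-values come out too large, so treat them as optimistic.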

3. Bootstrap Uncertainty Quantification

# Parametric bootstrap (resamples from the fitted distribution)
Bootstrap.parametric(data, dist, n_bootstrap=1000)

# Non-parametric bootstrap (resamples the observed data with replacement)
Bootstrap.nonparametric(data, dist, n_bootstrap=1000)

# BCa method (bias-corrected and accelerated intervals)
Bootstrap.bca_ci(boot_samples, estimate, data, estimator_func)

Features:

  • Parallel processing (uses all CPU cores)
  • Progress bars (tqdm integration)
  • Multiple confidence levels (90%, 95%, 99%)
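
The idea behind the parametric bootstrap can be sketched in a few lines of NumPy and SciPy. This is a serial, percentile-interval version written for illustration; it is not the `Bootstrap` class, and it leaves out the BCa correction and the parallelism.

```python
import numpy as np
from scipy import stats

# Synthetic sample, for illustration only
rng = np.random.default_rng(1)
data = rng.normal(loc=10, scale=2, size=300)

# Step 1: fit the model to the observed data
mu_hat, sigma_hat = stats.norm.fit(data)

# Step 2: repeatedly simulate from the fitted model and refit
n_bootstrap = 1000
boot = np.empty((n_bootstrap, 2))
for b in range(n_bootstrap):
    # Parametric: draw from the fitted distribution, not from the data.
    # For a non-parametric bootstrap, use rng.choice(data, size=data.size) instead.
    sample = rng.normal(mu_hat, sigma_hat, size=data.size)
    boot[b] = stats.norm.fit(sample)

# Step 3: percentile interval at the 95% level, one column per parameter
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)

print(f"mu:    {mu_hat:.3f}  95% CI [{lower[0]:.3f}, {upper[0]:.3f}]")
print(f"sigma: {sigma_hat:.3f}  95% CI [{lower[1]:.3f}, {upper[1]:.3f}]")
```

Each refit is independent of the others, which is why this loop parallelizes well across cores.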

4. Enhanced Diagnostics

Residual Analysis:

  • Quantile residuals
  • Pearson residuals
  • Deviance residuals
  • Standardized residuals

Influence Diagnostics:

  • Cook's distance
  • Leverage values
  • DFFITS
  • Automatic identification of influential observations

Outlier Detection (4 methods):

  • Z-score
  • IQR (Interquartile Range)
  • Likelihood-based
  • Mahalanobis distance

Diagnostic Plots:

  • Q-Q plot data
  • P-P plot data
  • Worm plot (detrended Q-Q)
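
As a rough illustration of two of the outlier rules and of Q-Q plot data, here is a sketch in plain NumPy and SciPy. The thresholds (3 standard deviations, 1.5 times the IQR) are the conventional defaults and are assumptions here, not values taken from `Diagnostics`.

```python
import numpy as np
from scipy import stats

# Synthetic measurements with two planted outliers at the end
rng = np.random.default_rng(7)
data = np.concatenate([rng.normal(100, 2, size=500), [115.0, 83.0]])

mu, sigma = stats.norm.fit(data)

# Z-score rule: flag points more than 3 fitted standard deviations from the mean
z_scores = (data - mu) / sigma
z_flags = np.abs(z_scores) > 3.0

# IQR rule (Tukey fences): flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_flags = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)

# Q-Q plot data: sorted observations against fitted quantiles
probs = (np.arange(1, data.size + 1) - 0.5) / data.size
theoretical = stats.norm.ppf(probs, mu, sigma)
empirical = np.sort(data)

print(f"Z-score flags: {z_flags.sum()}, IQR flags: {iqr_flags.sum()}")
```

The IQR rule usually flags more points than the z-score rule because it does not depend on a standard deviation that the outliers themselves inflate.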

5. Weighted Data Support

# Survey weights
WeightedFitting.fit_weighted_mle(data, sampling_weights, dist)

# Frequency data
WeightedFitting.fit_weighted_mle(values, frequencies, dist)

# Precision weights
weights = 1 / measurement_errors**2
WeightedFitting.fit_weighted_mle(measurements, weights, dist)

Utilities:

  • Weighted statistics (mean, var, quantiles)
  • Effective sample size calculation
  • Weighted bootstrap
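
The weighted utilities listed above rest on a few standard formulas. The sketch below works them out for a normal model in NumPy; it assumes the weighted log-likelihood is the sum of `w_i * log f(x_i)` and is not taken from the `WeightedFitting` source.

```python
import numpy as np

# Synthetic data and sampling weights, for illustration only
rng = np.random.default_rng(3)
data = rng.normal(10, 2, size=1000)
weights = rng.uniform(0.5, 1.5, size=1000)

# Normalize the weights so they sum to one
w = weights / weights.sum()

# Weighted MLE for a normal model: weighted mean and weighted variance
w_mean = np.sum(w * data)
w_var = np.sum(w * (data - w_mean) ** 2)
w_std = np.sqrt(w_var)

# Kish effective sample size: (sum of weights)^2 / sum of squared weights
n_eff = weights.sum() ** 2 / np.sum(weights**2)

print(f"Weighted mean: {w_mean:.3f}, weighted std: {w_std:.3f}")
print(f"Effective sample size: {n_eff:.1f} of {data.size}")
```

The effective sample size equals the actual sample size only when all weights are equal, and it shrinks as the weights become more uneven.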

6. Model Selection

# Compare distributions
import numpy as np
from distfit_pro import get_distribution

candidates = ['normal', 'lognormal', 'gamma', 'weibull']
results = {}

for name in candidates:
    dist = get_distribution(name)
    dist.fit(data)
    
    # AIC = 2k - 2*log(L)
    k = len(dist.params)
    log_lik = np.sum(dist.logpdf(data))
    aic = 2 * k - 2 * log_lik
    
    results[name] = {'aic': aic, 'dist': dist}

# Best model
best = min(results.items(), key=lambda x: x[1]['aic'])
print(f"Best: {best[0]}")
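
The same comparison can be reproduced with SciPy alone, which also makes it easy to add BIC (`k * ln(n) - 2 * log(L)`) next to AIC. The candidate set and the choice to fix the location at zero for the positive-support families are assumptions made for this sketch.

```python
import numpy as np
from scipy import stats

# Synthetic positive, right-skewed data, for illustration only
rng = np.random.default_rng(5)
data = rng.gamma(shape=2.0, scale=3.0, size=2000)

# Each candidate: (SciPy family, parameters held fixed during fitting)
candidates = {
    "normal": (stats.norm, {}),
    "lognormal": (stats.lognorm, {"floc": 0}),
    "gamma": (stats.gamma, {"floc": 0}),
    "weibull": (stats.weibull_min, {"floc": 0}),
}

n = data.size
scores = {}
for name, (family, fixed) in candidates.items():
    params = family.fit(data, **fixed)
    log_lik = np.sum(family.logpdf(data, *params))
    k = len(params) - len(fixed)  # count only the free parameters
    scores[name] = {
        "aic": 2 * k - 2 * log_lik,
        "bic": k * np.log(n) - 2 * log_lik,
    }

best_aic = min(scores, key=lambda name: scores[name]["aic"])
for name, s in scores.items():
    print(f"{name:10s} AIC={s['aic']:.1f}  BIC={s['bic']:.1f}")
print(f"Best by AIC: {best_aic}")
```

Counting only free parameters matters: a parameter that is held fixed should not be penalized, or the criteria will be biased against the constrained families.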

🌐 Multilingual Support

DistFit Pro speaks 3 languages!

from distfit_pro import set_language

# 🇬🇧 English
set_language('en')
print(dist.explain())
# Output:
# 📊 Estimated Parameters:
#    • μ (mean): 10.0173
#    • σ (std): 1.9918
# 💡 Practical Applications:
#    • Measurement errors
#    • Heights and weights

# 🇮🇷 فارسی (Persian)
set_language('fa')
print(dist.explain())
# خروجی:
# 📊 پارامترهای برآورد شده:
#    • μ (میانگین): 10.0173
#    • σ (انحراف معیار): 1.9918
# 💡 کاربردهای عملی:
#    • خطاهای اندازه‌گیری
#    • قد و وزن

# 🇩🇪 Deutsch (German)
set_language('de')
print(dist.explain())
# Ausgabe:
# 📊 Geschätzte Parameter:
#    • μ (Mittelwert): 10.0173
#    • σ (Standardabweichung): 1.9918
# 💡 Praktische Anwendungen:
#    • Messfehler
#    • Größe und Gewicht

📚 Documentation

Comprehensive Tutorials

  1. The Basics - Your first distribution fit
  2. Distributions Guide - All 30 distributions explained
  3. Fitting Methods - MLE, Moments, Quantile
  4. GOF Tests - Test goodness-of-fit
  5. Bootstrap CI - Uncertainty quantification
  6. Diagnostics - Residuals, outliers, influence
  7. Weighted Data - Survey weights, frequencies
  8. Visualization - Beautiful plots
  9. Advanced Topics - Custom distributions, mixtures

Complete Examples

📁 examples/ - 20+ production-ready examples (8,500+ lines)

🔬 Real-World Examples

Example 1: Quality Control

import numpy as np
from distfit_pro import get_distribution
from distfit_pro.core.diagnostics import Diagnostics

# Manufacturing measurements
measurements = np.random.normal(100, 2, 1000)

# Fit distribution
dist = get_distribution('normal')
dist.fit(measurements)

# Detect outliers (defects)
outliers = Diagnostics.detect_outliers(
    measurements, 
    dist, 
    method='zscore',
    threshold=2.5  # Stricter for QC
)

print(f"Defect rate: {len(outliers.outlier_indices)/len(measurements)*100:.2f}%")

👉 See full example: examples/06_real_world/quality_control.py

Example 2: Financial Risk Analysis

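from distfit_pro import get_distribution

# Stock returns (load_stock_data is a placeholder for your own data loader)
returns = load_stock_data('AAPL')['daily_return']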

# Fit heavy-tailed distribution
dist = get_distribution('studentt')
dist.fit(returns)

# Value at Risk (99% confidence)
var_99 = dist.ppf(0.01)  # 1st percentile
print(f"VaR(99%): {var_99*100:.2f}%")

# Expected Shortfall
cvar_99 = dist.conditional_var(0.01)
print(f"CVaR(99%): {cvar_99*100:.2f}%")

# Bootstrap CI for VaR
from distfit_pro.core.bootstrap import Bootstrap
ci = Bootstrap.parametric(returns, dist, n_bootstrap=1000)

👉 See full example: examples/06_real_world/finance_analysis.py
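
For a version that runs without market data, the sketch below computes the same two risk measures with SciPy on simulated Student's t returns. The simulated returns and the Monte Carlo estimate of expected shortfall are assumptions for illustration; they do not show how `conditional_var` is implemented.

```python
import numpy as np
from scipy import stats

# Simulated daily returns with heavy tails, standing in for real data
rng = np.random.default_rng(11)
returns = stats.t.rvs(df=4, loc=0.0005, scale=0.01, size=2500, random_state=rng)

# Fit a Student's t distribution by maximum likelihood
df, loc, scale = stats.t.fit(returns)

# Value at Risk at 99% confidence: the 1st percentile of the fitted distribution
alpha = 0.01
var_99 = stats.t.ppf(alpha, df, loc, scale)

# Expected shortfall: average return given that the return falls at or below VaR,
# estimated here by simulating from the fitted model
simulated = stats.t.rvs(df, loc, scale, size=200_000, random_state=rng)
es_99 = simulated[simulated <= var_99].mean()

print(f"VaR(99%):  {var_99 * 100:.2f}%")
print(f"CVaR(99%): {es_99 * 100:.2f}%")
```

Expected shortfall is always at least as extreme as VaR at the same level, because it averages over the losses beyond the VaR cutoff.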

Example 3: Survival Analysis

import numpy as np
from distfit_pro import get_distribution

# Patient survival times (months)
survival_times = np.array([12, 15, 18, 24, 30, 36, 48, 60])

# Fit Weibull distribution
dist = get_distribution('weibull')
dist.fit(survival_times)

# Reliability at 24 months
reliability = dist.reliability(24)
print(f"24-month survival: {reliability*100:.1f}%")

# Median survival time
median_survival = dist.ppf(0.5)
print(f"Median survival: {median_survival:.1f} months")

👉 See full example: examples/06_real_world/reliability_engineering.py
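
The reliability and median figures follow from the Weibull survival function, R(t) = exp(-(t / scale)^shape). The sketch below checks this with SciPy on the same eight observations. It assumes a two-parameter Weibull with location fixed at zero and treats every time as an observed event, so it does not handle censoring.

```python
import numpy as np
from scipy import stats

# Survival times in months (all treated as observed events, no censoring)
survival_times = np.array([12, 15, 18, 24, 30, 36, 48, 60], dtype=float)

# Two-parameter Weibull fit by maximum likelihood, location pinned at zero
shape, _, scale = stats.weibull_min.fit(survival_times, floc=0)

# Reliability at 24 months from the survival function
reliability_24 = stats.weibull_min.sf(24, shape, loc=0, scale=scale)

# The same quantity from the closed form R(t) = exp(-(t / scale) ** shape)
closed_form = np.exp(-((24 / scale) ** shape))

# Median survival: solve R(t) = 0.5, giving scale * ln(2) ** (1 / shape)
median_survival = scale * np.log(2) ** (1 / shape)

print(f"shape={shape:.3f}, scale={scale:.2f}")
print(f"24-month survival: {reliability_24 * 100:.1f}%")
print(f"Median survival: {median_survival:.1f} months")
```

With only eight observations the estimates are very uncertain, so a bootstrap interval around the reliability figure would be a sensible next step.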


🚀 Performance

Benchmarks on Intel i7-10700K (8 cores):

| Task | Dataset Size | Time (serial) | Time (parallel) | Speedup |
|---|---|---|---|---|
| Fit single distribution | 10,000 | 15ms | N/A | - |
| Fit single distribution | 1,000,000 | 450ms | N/A | - |
| Bootstrap (1000 samples) | 10,000 | 18s | 3.2s | 5.6x |
| GOF tests (all 4) | 10,000 | 85ms | N/A | - |
| Model selection (10 dists) | 10,000 | 280ms | 95ms | 2.9x |

Memory: datasets are processed in memory, so the practical size limit is the available RAM.


📋 Version History

See CHANGELOG.md for detailed version history.

v1.0.0 - 2026-02-14 🎉

First Stable and Complete Release

✨ Major Features:

  • 30 Statistical Distributions (25 continuous + 5 discrete)
  • Multiple Estimation Methods (MLE, Moments, Quantile matching)
  • Goodness-of-Fit Tests (4 tests: KS, AD, Chi-Square, CvM)
  • Bootstrap Confidence Intervals (Parametric & Non-parametric with BCa)
  • Enhanced Diagnostics (4 residual types, influence, outlier detection)
  • Weighted Data Support (MLE + Moments)
  • Multilingual (English, فارسی, Deutsch)
  • Comprehensive Documentation (9 tutorials + API reference)
  • 20+ Complete Examples (8,500+ lines of code)
  • Parallel Processing (joblib with all cores)
  • Progress Bars (tqdm)

🛠️ Development

Current Status

Version: 1.0.0 ✅
Release Date: 2026-02-14
Status: Stable and Production-Ready

Completed Features

  • ✅ 30 Statistical Distributions
  • ✅ 3 Estimation Methods (MLE, Moments, Quantile)
  • ✅ 4 GOF Tests (KS, AD, Chi-Square, CvM)
  • ✅ Bootstrap CI (Parametric + Non-parametric + BCa)
  • ✅ Enhanced Diagnostics (4 residual types, influence, outliers)
  • ✅ Weighted Data Support (MLE + Moments)
  • ✅ Multilingual (EN/FA/DE)
  • ✅ Comprehensive Documentation (9 tutorials)
  • ✅ 20+ Complete Examples (8,500+ lines)
  • ✅ Parallel Processing (joblib)
  • ✅ Progress Bars (tqdm)

🤝 Contributing

Contributions welcome! See CONTRIBUTING.md.

Areas we need help:

  • Additional distributions
  • More GOF tests
  • Performance optimizations
  • Documentation improvements
  • Translations (add your language!)
  • More real-world examples

📄 License

MIT License - see LICENSE.

Free for commercial and personal use.


🙏 Acknowledgments

Inspired by:

  • R's fitdistrplus package (Delignette-Muller & Dutang)
  • MathWave's EasyFit software
  • SciPy's statistical distributions

Built with:

  • NumPy & SciPy - numerical computing
  • joblib - parallel processing
  • matplotlib & plotly - visualization
  • tqdm - progress bars

📞 Contact

Ali Sadeghi Aghili
🦄 Data Unicorn

🌐 zil.ink/thedatascientist
🔗 linktr.ee/aliaghili
💻 @alisadeghiaghili


⭐ Star History

If you find this project useful, please consider giving it a star! ⭐

It helps others discover the project and motivates continued development.


Made with ❤️, ☕, and rigorous statistical methodology by Ali Sadeghi Aghili

"Better statistics through better software."
