
Commit 5adeab6

add comparisons to paper, correction in examples notebooks

1 parent ddb09ee, commit 5adeab6

8 files changed (+29 -9 lines)

benckmarks/fit/continuous.ipynb (1 addition & 1 deletion)

```diff
@@ -66,7 +66,7 @@
 " distribution_class = phitter.continuous.CONTINUOUS_DISTRIBUTIONS[id_distribution]\n",
 " data = distribution_class(init_parameters_examples=True).sample(sample_size)\n",
 " ti = time.time()\n",
-" phi = phitter.Phitter(data=data=data)\n",
+" phi = phitter.Phitter(data=data)\n",
 " phi.fit(n_workers=n_workers)\n",
 " tf = time.time() - ti\n",
 " df_fit_time = df_fit_time.fillna(0)\n",
```
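The recurring bug that this commit fixes across the notebooks is the doubled keyword argument `data=data=data`, which is not merely wrong but not valid Python at all: a keyword argument binds its value exactly once, so the second `=` fails to parse. A minimal stdlib-only check with the built-in `compile()` shows this (the bare `Phitter` name here is just parse input for illustration; nothing is imported or executed):

```python
# Old notebook call vs. the corrected one from this commit.
old_call = "phi = Phitter(data=data=data)"
new_call = "phi = Phitter(data=data)"

def parses(source: str) -> bool:
    """Return True if `source` is syntactically valid Python.

    compile() only parses the source; it never runs it, so the
    undefined Phitter name is irrelevant here.
    """
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

print(parses(old_call))  # → False: SyntaxError, the bug being fixed
print(parses(new_call))  # → True: a keyword argument given once
```

Because the error is syntactic, every affected notebook cell would have failed before `phitter` itself was ever exercised.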

benckmarks/fit/discrete.ipynb (1 addition & 1 deletion)

```diff
@@ -72,7 +72,7 @@
 " for id_distribution, distribution_class in phitter.discrete.DISCRETE_DISTRIBUTIONS.items():\n",
 " data = distribution_class(init_parameters_examples=True).sample(sample_size)\n",
 " ti = time.time()\n",
-" phi = phitter.Phitter(data=data=data, fit_type=\"discrete\")\n",
+" phi = phitter.Phitter(data=data, fit_type=\"discrete\")\n",
 " phi.fit(n_workers=n_workers)\n",
 " tf = time.time() - ti\n",
 " df_fit_time = df_fit_time.fillna(0)\n",
```
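Both benchmark notebooks measure each fit with the same wall-clock pattern visible in the context lines above: `ti = time.time()` before the fit, `time.time() - ti` after. A stdlib-only sketch of that pattern, with a dummy function standing in for the real `phitter.Phitter(...).fit(...)` call (the stub is an assumption for illustration, not part of phitter):

```python
import time

def time_call(fn, *args, **kwargs):
    # Same wall-clock timing pattern as the benchmark notebooks:
    # capture time.time() before the call and subtract afterwards.
    ti = time.time()
    result = fn(*args, **kwargs)
    elapsed = time.time() - ti
    return result, elapsed

def dummy_fit(data):
    # Stand-in for a fitting routine: cheap but measurable work.
    return sum(x * x for x in data)

_, elapsed = time_call(dummy_fit, range(100_000))
print(elapsed >= 0.0)  # → True; elapsed is the fit time in seconds
```

Because `time.time()` returns seconds since the epoch, the difference is seconds of wall-clock time, which is also the most natural reading of the timing table added to the paper in this commit.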

examples/fit/fit_continuous_college.ipynb (1 addition & 1 deletion)

```diff
@@ -239,7 +239,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"phi = phitter.Phitter(data=data=data)\n",
+"phi = phitter.Phitter(data=data)\n",
 "phi.fit(n_workers=2)"
 ]
 },
```

examples/fit/fit_continuous_iris.ipynb (1 addition & 1 deletion)

```diff
@@ -158,7 +158,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"phi = phitter.Phitter(data=data=data)\n",
+"phi = phitter.Phitter(data=data)\n",
 "phi.fit(n_workers=2)"
 ]
 },
```

examples/fit/fit_continuous_ncdb.ipynb (1 addition & 1 deletion)

```diff
@@ -257,7 +257,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"phi = phitter.Phitter(data=data=data)\n",
+"phi = phitter.Phitter(data=data)\n",
 "phi.fit(n_workers=4)"
 ]
 },
```

examples/fit/fit_continuous_winequality.ipynb (1 addition & 1 deletion)

```diff
@@ -86,7 +86,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"phi = phitter.Phitter(data=data=data)\n",
+"phi = phitter.Phitter(data=data)\n",
 "phi.fit(n_workers=2)"
 ]
 },
```

examples/fit/fit_specific_distribution.ipynb (1 addition & 1 deletion)

```diff
@@ -75,7 +75,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"phi = phitter.Phitter(data=data=data, distributions_to_fit=[\"beta\"])\n",
+"phi = phitter.Phitter(data=data, distributions_to_fit=[\"beta\"])\n",
 "phi.fit()"
 ]
 },
```

paper/paper.md (22 additions & 2 deletions)

```diff
@@ -62,16 +62,36 @@ To address these challenges, there is a clear need for an accessible, open-sourc
 
 # Comparison with Existing Tools
 
-The process of fitting probability distributions to data is a fundamental step in various scientific and analytical disciplines. It allows for the modeling of random phenomena, enabling tasks such as statistical inference, forecasting, and simulation. Several Python libraries have been developed to facilitate this process, providing tools for identifying the best-fitting theoretical distribution for a given dataset. This section describes two such prominent libraries: `distfit` (Taskesen et al., 2020) and `fitter` (Thomas Cokelaer).
+The process of fitting probability distributions to data is a fundamental step in various scientific and analytical disciplines. It allows for the modeling of random phenomena, enabling tasks such as statistical inference, forecasting, and simulation. Several Python libraries have been developed to facilitate this process, providing tools for identifying the best-fitting theoretical distribution for a given dataset. This section describes two such prominent libraries: `distfit` [@Taskesen_distfit_is_a_2020] and `fitter` [@Thomas2024cokelaer].
 
 ## The `distfit` Library
 
 The `distfit` library, created by Erdogan Taskesen and released in 2020, is a Python package designed for fitting probability density functions to univariate data. It can determine the best fit from 89 theoretical distributions using metrics like RSS/SSE, Wasserstein, KS, and Energy. Beyond parametric fitting, `distfit` also supports non-parametric methods (quantile and percentile) and discrete fitting using the binomial distribution. The library offers functionalities for predictions and a range of visualizations, including basic plots, QQ plots, and the ability to overlay multiple fitted distributions. Notably, `distfit` supports parallel computing to enhance performance and is available under the MIT License .
 
-## The `fitter` Library
+## The `fitter` Library
 
 The `fitter` library, developed by Thomas Cokelaer, is a Python tool for simplifying the process of fitting probability distributions to data. It automatically attempts to fit a dataset to around 80 distributions from the SciPy package, ranking them based on the sum of the square errors (SSE). `fitter` supports parallelism to speed up the fitting process, especially with larger datasets [9, 10, 9]. It also provides a standalone command-line application for fitting distributions from CSV files. Users can manually specify a subset of distributions for fitting if desired. The library is under active development and is licensed under the GNU Library or Lesser General Public License (LGPL).
 
+## Speed Comparison: Phitter vs Distfit vs Fitter
+
+The following table presents a performance comparison of the Phitter, Distfit, and Fitter libraries in terms of parameter estimation time using their default configurations. Each library was evaluated on normally distributed datasets of varying sizes: 100, 1,000, 10,000, 100,000, and 1,000,000 samples.
+
+| Library / Sample Size | 100 | 1,000 | 10,000 | 100,000 | 1,000,000 |
+| :-------------------- | -----: | -----: | -----: | ------: | --------: |
+| **Phitter** | 1.120 | 1.818 | 9.102 | 79.829 | 791.674 |
+| **Distfit** | 2.604 | 5.279 | 28.575 | 299.398 | 2726.630 |
+| **Fitter** | 37.252 | 30.380 | 31.522 | 401.644 | 1322.134 |
+
+- **Phitter** tests 75 continuous probability distributions.
+- **Distfit** evaluates 85 continuous distributions. See [Distfit Parametric Distributions](https://erdogant.github.io/distfit/pages/html/Parametric.html).
+- **Fitter** iterates over all continuous distributions available in `scipy.stats`, automatically excluding those whose parameter estimation exceeds 30 seconds.
+
+## Goodness-of-Fit Comparison
+
+- **Phitter** supports statistical goodness-of-fit tests including **Chi-Square**, **Kolmogorov–Smirnov**, and **Anderson–Darling**.
+- **Distfit**, by default, relies on error-based metrics such as **RSS/SSE**, **Wasserstein distance**, and **energy distance**. It does not perform hypothesis testing unless explicitly instructed to use a function from `scipy.stats.goodness_of_fit`.
+- **Fitter** always reports the **Kolmogorov–Smirnov** test statistic and p-value. However, its primary selection criterion is the minimization of the **sum of squared errors (SSE)**.
+
 # Documentation
 
 Find the complete Phitter documentation [here](https://docs-phitter-kernel.netlify.app/).
```
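The Kolmogorov–Smirnov statistic that the paper's goodness-of-fit comparison refers to is simple enough to sketch without any of the three libraries: it is the largest vertical gap between the empirical CDF of the sample and the candidate distribution's CDF. A stdlib-only illustration against a standard normal follows; the helper names are mine, not APIs from phitter, distfit, or fitter:

```python
import math
import random

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    # Normal CDF expressed with the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf) -> float:
    # One-sample KS statistic: the largest distance between the empirical
    # CDF (steps of 1/n at each sorted point) and the model CDF. Both the
    # gap just after and just before each step must be checked.
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

random.seed(42)
sample = [random.gauss(0.0, 1.0) for _ in range(1_000)]
d = ks_statistic(sample, normal_cdf)
print(0.0 <= d < 0.1)  # → True: a small D for a well-matched sample
```

For a sample that genuinely comes from the candidate distribution, D shrinks roughly like 1/sqrt(n), which is why all three libraries can use it (or SSE-style analogues) to rank candidate fits.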
