
Commit 5adeab6

add comparisons to paper, correction in examples notebooks

1 parent ddb09ee, commit 5adeab6

8 files changed (+29 -9 lines)

benckmarks/fit/continuous.ipynb (1 addition & 1 deletion)

```diff
@@ -66,7 +66,7 @@
 " distribution_class = phitter.continuous.CONTINUOUS_DISTRIBUTIONS[id_distribution]\n",
 " data = distribution_class(init_parameters_examples=True).sample(sample_size)\n",
 " ti = time.time()\n",
-" phi = phitter.Phitter(data=data=data)\n",
+" phi = phitter.Phitter(data=data)\n",
 " phi.fit(n_workers=n_workers)\n",
 " tf = time.time() - ti\n",
 " df_fit_time = df_fit_time.fillna(0)\n",
```
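The recurring bug that this commit fixes across the notebooks is the doubled keyword argument `data=data=data`, which is not merely wrong but not valid Python at all: a keyword argument binds its value exactly once, so the second `=` fails to parse. A minimal stdlib-only check with the built-in `compile()` shows this (the bare `Phitter` name here is just parse input for illustration; nothing is imported or executed):

```python
# Old notebook call vs. the corrected one from this commit.
old_call = "phi = Phitter(data=data=data)"
new_call = "phi = Phitter(data=data)"

def parses(source: str) -> bool:
    """Return True if `source` is syntactically valid Python.

    compile() only parses the source; it never runs it, so the
    undefined Phitter name is irrelevant here.
    """
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

print(parses(old_call))  # → False: SyntaxError, the bug being fixed
print(parses(new_call))  # → True: a keyword argument given once
```

Because the error is syntactic, every affected notebook cell would have failed before `phitter` itself was ever exercised.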

benckmarks/fit/discrete.ipynb (1 addition & 1 deletion)

```diff
@@ -72,7 +72,7 @@
 " for id_distribution, distribution_class in phitter.discrete.DISCRETE_DISTRIBUTIONS.items():\n",
 " data = distribution_class(init_parameters_examples=True).sample(sample_size)\n",
 " ti = time.time()\n",
-" phi = phitter.Phitter(data=data=data, fit_type=\"discrete\")\n",
+" phi = phitter.Phitter(data=data, fit_type=\"discrete\")\n",
 " phi.fit(n_workers=n_workers)\n",
 " tf = time.time() - ti\n",
 " df_fit_time = df_fit_time.fillna(0)\n",
```
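Both benchmark notebooks measure each fit with the same wall-clock pattern visible in the context lines above: `ti = time.time()` before the fit, `time.time() - ti` after. A stdlib-only sketch of that pattern, with a dummy function standing in for the real `phitter.Phitter(...).fit(...)` call (the stub is an assumption for illustration, not part of phitter):

```python
import time

def time_call(fn, *args, **kwargs):
    # Same wall-clock timing pattern as the benchmark notebooks:
    # capture time.time() before the call and subtract afterwards.
    ti = time.time()
    result = fn(*args, **kwargs)
    elapsed = time.time() - ti
    return result, elapsed

def dummy_fit(data):
    # Stand-in for a fitting routine: cheap but measurable work.
    return sum(x * x for x in data)

_, elapsed = time_call(dummy_fit, range(100_000))
print(elapsed >= 0.0)  # → True; elapsed is the fit time in seconds
```

Because `time.time()` returns seconds since the epoch, the difference is seconds of wall-clock time, which is also the most natural reading of the timing table added to the paper in this commit.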

examples/fit/fit_continuous_college.ipynb (1 addition & 1 deletion)

```diff
@@ -239,7 +239,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"phi = phitter.Phitter(data=data=data)\n",
+"phi = phitter.Phitter(data=data)\n",
 "phi.fit(n_workers=2)"
 ]
 },
```

examples/fit/fit_continuous_iris.ipynb (1 addition & 1 deletion)

```diff
@@ -158,7 +158,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"phi = phitter.Phitter(data=data=data)\n",
+"phi = phitter.Phitter(data=data)\n",
 "phi.fit(n_workers=2)"
 ]
 },
```

examples/fit/fit_continuous_ncdb.ipynb (1 addition & 1 deletion)

```diff
@@ -257,7 +257,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"phi = phitter.Phitter(data=data=data)\n",
+"phi = phitter.Phitter(data=data)\n",
 "phi.fit(n_workers=4)"
 ]
 },
```

examples/fit/fit_continuous_winequality.ipynb (1 addition & 1 deletion)

```diff
@@ -86,7 +86,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"phi = phitter.Phitter(data=data=data)\n",
+"phi = phitter.Phitter(data=data)\n",
 "phi.fit(n_workers=2)"
 ]
 },
```

examples/fit/fit_specific_distribution.ipynb (1 addition & 1 deletion)

```diff
@@ -75,7 +75,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"phi = phitter.Phitter(data=data=data, distributions_to_fit=[\"beta\"])\n",
+"phi = phitter.Phitter(data=data, distributions_to_fit=[\"beta\"])\n",
 "phi.fit()"
 ]
 },
```

paper/paper.md (22 additions & 2 deletions)

```diff
@@ -62,16 +62,36 @@ To address these challenges, there is a clear need for an accessible, open-sourc
 
 # Comparison with Existing Tools
 
-The process of fitting probability distributions to data is a fundamental step in various scientific and analytical disciplines. It allows for the modeling of random phenomena, enabling tasks such as statistical inference, forecasting, and simulation. Several Python libraries have been developed to facilitate this process, providing tools for identifying the best-fitting theoretical distribution for a given dataset. This section describes two such prominent libraries: `distfit` (Taskesen et al., 2020) and `fitter` (Thomas Cokelaer).
+The process of fitting probability distributions to data is a fundamental step in various scientific and analytical disciplines. It allows for the modeling of random phenomena, enabling tasks such as statistical inference, forecasting, and simulation. Several Python libraries have been developed to facilitate this process, providing tools for identifying the best-fitting theoretical distribution for a given dataset. This section describes two such prominent libraries: `distfit` [@Taskesen_distfit_is_a_2020] and `fitter` [@Thomas2024cokelaer].
 
 ## The `distfit` Library
 
 The `distfit` library, created by Erdogan Taskesen and released in 2020, is a Python package designed for fitting probability density functions to univariate data. It can determine the best fit from 89 theoretical distributions using metrics like RSS/SSE, Wasserstein, KS, and Energy. Beyond parametric fitting, `distfit` also supports non-parametric methods (quantile and percentile) and discrete fitting using the binomial distribution. The library offers functionalities for predictions and a range of visualizations, including basic plots, QQ plots, and the ability to overlay multiple fitted distributions. Notably, `distfit` supports parallel computing to enhance performance and is available under the MIT License .
 
-## The `fitter` Library
+## The `fitter` Library
 
 The `fitter` library, developed by Thomas Cokelaer, is a Python tool for simplifying the process of fitting probability distributions to data. It automatically attempts to fit a dataset to around 80 distributions from the SciPy package, ranking them based on the sum of the square errors (SSE). `fitter` supports parallelism to speed up the fitting process, especially with larger datasets [9, 10, 9]. It also provides a standalone command-line application for fitting distributions from CSV files. Users can manually specify a subset of distributions for fitting if desired. The library is under active development and is licensed under the GNU Library or Lesser General Public License (LGPL).
 
+## Speed Comparison: Phitter vs Distfit vs Fitter
+
+The following table presents a performance comparison of the Phitter, Distfit, and Fitter libraries in terms of parameter estimation time using their default configurations. Each library was evaluated on normally distributed datasets of varying sizes: 100, 1,000, 10,000, 100,000, and 1,000,000 samples.
+
+| Library / Sample Size | 100 | 1,000 | 10,000 | 100,000 | 1,000,000 |
+| :-------------------- | -----: | -----: | -----: | ------: | --------: |
+| **Phitter** | 1.120 | 1.818 | 9.102 | 79.829 | 791.674 |
+| **Distfit** | 2.604 | 5.279 | 28.575 | 299.398 | 2726.630 |
+| **Fitter** | 37.252 | 30.380 | 31.522 | 401.644 | 1322.134 |
+
+- **Phitter** tests 75 continuous probability distributions.
+- **Distfit** evaluates 85 continuous distributions. See [Distfit Parametric Distributions](https://erdogant.github.io/distfit/pages/html/Parametric.html).
+- **Fitter** iterates over all continuous distributions available in `scipy.stats`, automatically excluding those whose parameter estimation exceeds 30 seconds.
+
+## Goodness-of-Fit Comparison
+
+- **Phitter** supports statistical goodness-of-fit tests including **Chi-Square**, **Kolmogorov–Smirnov**, and **Anderson–Darling**.
+- **Distfit**, by default, relies on error-based metrics such as **RSS/SSE**, **Wasserstein distance**, and **energy distance**. It does not perform hypothesis testing unless explicitly instructed to use a function from `scipy.stats.goodness_of_fit`.
+- **Fitter** always reports the **Kolmogorov–Smirnov** test statistic and p-value. However, its primary selection criterion is the minimization of the **sum of squared errors (SSE)**.
+
 # Documentation
 
 Find the complete Phitter documentation [here](https://docs-phitter-kernel.netlify.app/).
```
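The Kolmogorov–Smirnov statistic that the paper's goodness-of-fit comparison refers to is simple enough to sketch without any of the three libraries: it is the largest vertical gap between the empirical CDF of the sample and the candidate distribution's CDF. A stdlib-only illustration against a standard normal follows; the helper names are mine, not APIs from phitter, distfit, or fitter:

```python
import math
import random

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    # Normal CDF expressed with the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf) -> float:
    # One-sample KS statistic: the largest distance between the empirical
    # CDF (steps of 1/n at each sorted point) and the model CDF. Both the
    # gap just after and just before each step must be checked.
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

random.seed(42)
sample = [random.gauss(0.0, 1.0) for _ in range(1_000)]
d = ks_statistic(sample, normal_cdf)
print(0.0 <= d < 0.1)  # → True: a small D for a well-matched sample
```

For a sample that genuinely comes from the candidate distribution, D shrinks roughly like 1/sqrt(n), which is why all three libraries can use it (or SSE-style analogues) to rank candidate fits.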
