Commit 7f527f3

Merge pull request #70 from Quantmetry/doc_readme
Doc readme
2 parents 8a2c62d + 41ba9bf commit 7f527f3

37 files changed: 444,778 additions and 1,362 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ __pycache__/
 # documentation specific
 docs/_build/
 docs/generated/
+docs/examples/tutorials/
 
 # Distribution / packaging

.pre-commit-config.yaml

Lines changed: 2 additions & 0 deletions
@@ -23,3 +23,5 @@ repos:
     rev: v1.1.1
     hooks:
       - id: mypy
+        args: [--ignore-missing-imports]
+        additional_dependencies: [types-requests]

README.rst

Lines changed: 104 additions & 132 deletions
@@ -39,165 +39,141 @@ Python 3.8+
 🛠 Installation
 ===============
 
-Install via `pip`:
+Qolmat can be installed in different ways:
 
 .. code:: sh
 
-    $ pip install qolmat
-
-If you need to use tensorflow, you can install it with the following 'pip' command:
-
-.. code:: sh
-
-    $ pip install qolmat[tensorflow]
-
-To install directly from the github repository :
-
-.. code:: sh
-
-    $ pip install git+https://github.com/Quantmetry/qolmat
+    $ pip install qolmat # installation via `pip`
+    $ pip install qolmat[tensorflow] # if you need tensorflow
+    $ pip install git+https://github.com/Quantmetry/qolmat # or directly from the github repository
 
 ⚡️ Quickstart
 ==============
 
-Let us start with a basic imputation problem. Here, we generate one-dimensional noisy time series.
-
-.. code-block:: python
+Let us start with a basic imputation problem.
+We generate one-dimensional noisy time series with missing values.
+With just these few lines of code, you can see how easy it is to
 
-    import matplotlib.pyplot as plt
-    import numpy as np
-    import pandas as pd
-
-    np.random.seed(42)
-    t = np.linspace(0,1,1000)
-    y = np.cos(2*np.pi*t*10)+np.random.randn(1000)/2
-    df = pd.DataFrame({'y': y}, index=pd.Series(t, name='index'))
-
-For this demonstration, let us create artificial holes in our dataset.
+- impute missing values with one particular imputer;
+- benchmark multiple imputation methods with different metrics.
 
 .. code-block:: python
 
-    from qolmat.utils.data import add_holes
-    plt.rcParams.update({'font.size': 18})
-
-    ratio_masked = 0.1
-    mean_size = 20
-    df_with_nan = add_holes(df, ratio_masked=ratio_masked, mean_size=mean_size)
-    is_na = df_with_nan['y'].isna()
+    import numpy as np
+    import pandas as pd
 
-    plt.figure(figsize=(25,4))
-    plt.plot(df_with_nan['y'],'.')
-    plt.plot(df.loc[is_na, 'y'],'.')
-    plt. grid()
-    plt.xlim(0,1)
+    from qolmat.benchmark import comparator, missing_patterns
+    from qolmat.imputations import imputers
+    from qolmat.utils import data
 
-    plt.legend(['Data', 'Missing data'])
-    plt.savefig('readme1.png')
-    plt.show()
-
-.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme1.png
-   :align: center
+    # load and prepare csv data
 
-To impute missing data, there are several methods that can be imported with ``from qolmat.imputations import imputers``.
-The creation of an imputation dictionary will enable us to benchmark the various imputations.
+    df_data = data.get_data("Beijing")
+    columns = ["TEMP", "PRES", "WSPM"]
+    df_data = df_data[columns]
+    df_with_nan = data.add_holes(df_data, ratio_masked=0.2, mean_size=120)
 
-.. code-block:: python
-
-    from sklearn.linear_model import LinearRegression
-    from qolmat.imputations import imputers
-
-    imputer_mean = imputers.ImputerMean()
-    imputer_median = imputers.ImputerMedian()
-    imputer_mode = imputers.ImputerMode()
-    imputer_locf = imputers.ImputerLOCF()
-    imputer_nocb = imputers.ImputerNOCB()
-    imputer_interpol = imputers.ImputerInterpolation(method="linear")
-    imputer_spline = imputers.ImputerInterpolation(method="spline", order=2)
-    imputer_shuffle = imputers.ImputerShuffle()
-    imputer_residuals = imputers.ImputerResiduals(period=10, model_tsa="additive", extrapolate_trend="freq", method_interpolation="linear")
-    imputer_rpca = imputers.ImputerRPCA(columnwise=True, period=10, max_iter=200, tau=2, lam=.3)
-    imputer_rpca_opti = imputers.ImputerRPCA(columnwise=True, period = 10, max_iter=100)
-    imputer_ou = imputers.ImputerEM(model="multinormal", method="sample", max_iter_em=34, n_iter_ou=15, dt=1e-3)
-    imputer_tsou = imputers.ImputerEM(model="VAR1", method="sample", max_iter_em=34, n_iter_ou=15, dt=1e-3)
-    imputer_tsmle = imputers.ImputerEM(model="VAR1", method="mle", max_iter_em=34, n_iter_ou=15, dt=1e-3)
-    imputer_knn = imputers.ImputerKNN(k=10)
-    imputer_mice = imputers.ImputerMICE(estimator=LinearRegression(), sample_posterior=False, max_iter=100, missing_values=np.nan)
-    imputer_regressor = imputers.ImputerRegressor(estimator=LinearRegression())
-
-    dict_imputers = {
+    # impute and compare
+    imputer_mean = imputers.ImputerMean(groups=("station",))
+    imputer_interpol = imputers.ImputerInterpolation(method="linear", groups=("station",))
+    imputer_var1 = imputers.ImputerEM(model="VAR", groups=("station",), method="mle", max_iter_em=50, n_iter_ou=15, dt=1e-3, p=1)
+    dict_imputers = {
         "mean": imputer_mean,
-        "median": imputer_median,
-        "mode": imputer_mode,
         "interpolation": imputer_interpol,
-        "spline": imputer_spline,
-        "shuffle": imputer_shuffle,
-        "residuals": imputer_residuals,
-        "OU": imputer_ou,
-        "TSOU": imputer_tsou,
-        "TSMLE": imputer_tsmle,
-        "RPCA": imputer_rpca,
-        "RPCA_opti": imputer_rpca_opti,
-        "locf": imputer_locf,
-        "nocb": imputer_nocb,
-        "knn": imputer_knn,
-        "ols": imputer_regressor,
-        "mice_ols": imputer_mice,
-    }
-
-It is possible to define a parameter dictionary for an imputer with three pieces of information: min, max and type. The aim of the dictionary is to determine the optimal parameters for data imputation. Here, we call this dictionary ``dict_config_opti``.
-
-.. code-block:: python
-
-    search_params = {
-        "RPCA_opti": {
-            "tau": {"min": .5, "max": 5, "type":"Real"},
-            "lam": {"min": .1, "max": 1, "type":"Real"},
-        }
+        "VAR(1) process": imputer_var1
     }
-
-Then with the comparator function in ``from qolmat.benchmark import comparator``, we can compare the different imputation methods. This **does not use knowledge on missing values**, but it relies data masking instead. For more details on how imputors and comparator work, please see the following `link <https://qolmat.readthedocs.io/en/latest/explanation.html>`_.
-
-.. code-block:: python
-
-    from qolmat.benchmark import comparator
-
-    generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)
-
-    comparison = comparator.Comparator(
+    generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)
+    comparison = comparator.Comparator(
         dict_imputers,
-        ['y'],
+        columns,
         generator_holes = generator_holes,
         metrics = ["mae", "wmape", "KL_columnwise", "ks_test", "energy"],
-        n_calls_opt = 10,
-        dict_config_opti = dict_config_opti,
     )
-    results = comparison.compare(df_with_nan)
-
-We can observe the benchmark results.
+    results = comparison.compare(df_with_nan)
+    results.style.highlight_min(color="lightsteelblue", axis=1)
 
-.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme2.png
+.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme_tabular_comparison.png
    :align: center
 
Finally, we keep the best ``TSMLE`` imputor we represent.
98+
📘 Documentation
99+
================
182100

183-
.. code-block:: python
101+
The full documentation can be found `on this link <https://qolmat.readthedocs.io/en/latest/>`_.
102+
103+
**How does Qolmat work ?**
184104

185-
dfs_imputed = imputer_tsmle.fit_transform(df_with_nan)
105+
Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing three steps pictured below:
106+
1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified `hole generator <explanation.html#hole-generator>`_,
107+
2) For each fold and each compared `imputation method <imputers.html>`_, Qolmat fills both the missing and the masked values, then computes each of the default or user specified `performance metrics <explanation.html#metrics>`_.
108+
3) For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.
186109

187-
plt.figure(figsize=(25,5))
188-
plt.plot(df['y'],'.g')
189-
plt.plot(dfs_imputed['y'],'.r')
190-
plt.plot(df_with_nan['y'],'.b')
191-
plt.show()
110+
This is very similar in spirit to the `cross_val_score <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html>`_ function for scikit-learn.
192111

193-
.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme3.png
112+
.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/schema_qolmat.png
194113
:align: center
195114

115+
**Imputation methods**
116+
117+
The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic).
118+
119+
.. list-table::
120+
:widths: 25 70 15 15
121+
:header-rows: 1
122+
123+
* - Method
124+
- Description
125+
- Tabular or Time series
126+
- Single or Multiple
127+
* - mean
128+
- Imputes the missing values using the mean along each column
129+
- tabular
130+
- single
131+
* - median
132+
- Imputes the missing values using the median along each column
133+
- tabular
134+
- single
135+
* - LOCF
136+
- Imputes missing entries by carrying the last observation forward for each columns
137+
- time series
138+
- single
139+
* - shuffle
140+
- Imputes missing entries with the random value of each column
141+
- tabular
142+
- multiple
143+
* - interpolation
144+
- Imputes missing using some interpolation strategies supported by pd.Series.interpolate
145+
- time series
146+
- single
147+
* - impute on residuals
148+
- The series are de-seasonalised, residuals are imputed via linear interpolation, then residuals are re-seasonalised
149+
- time series
150+
- single
151+
* - MICE
152+
- Multiple Imputation by Chained Equation
153+
- tabular
154+
- both
155+
* - RPCA
156+
- Robust Principal Component Analysis
157+
- both
158+
- single
159+
* - SoftImpute
160+
- Iterative method for matrix completion that uses nuclear-norm regularization
161+
- tabular
162+
- single
163+
* - KNN
164+
- K-nearest kneighbors
165+
- tabular
166+
- single
167+
* - EM sampler
168+
- Imputes missing values via EM algorithm
169+
- both
170+
- both
171+
* - TabDDPM
172+
- Imputer based on Denoising Diffusion Probabilistic Models
173+
- both
174+
- both
196175

197-
📘 Documentation
198-
================
199176

200-
The full documentation can be found `on this link <https://qolmat.readthedocs.io/en/latest/>`_.
201177

202178
📝 Contributing
203179
===============
@@ -222,8 +198,6 @@ Qolmat has been developed by Quantmetry.
 🔍 References
 ==============
 
-Qolmat methods belong to the field of conformal inference.
-
 [1] Candès, Emmanuel J., et al. “Robust principal component analysis?.”
 Journal of the ACM (JACM) 58.3 (2011): 1-37,
 (`pdf <https://arxiv.org/abs/0912.3599>`__)
@@ -234,15 +208,13 @@ Journal of advanced transportation 2018 (2018).
 (`pdf <https://www.hindawi.com/journals/jat/2018/7191549/>`__)
 
 [3] Chen, Yuxin, et al. “Bridging convex and nonconvex optimization in
-robust PCA: Noise, outliers, and missing data.” arXiv preprint
-arXiv:2001.05484 (2020), (`pdf <https://arxiv.org/abs/2001.05484>`__)
+robust PCA: Noise, outliers, and missing data.” Annals of statistics, 49(5), 2948 (2021), (`pdf <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9491514/pdf/nihms-1782570.pdf>`__)
 
 [4] Shahid, Nauman, et al. “Fast robust PCA on graphs.” IEEE Journal of
 Selected Topics in Signal Processing 10.4 (2016): 740-756.
 (`pdf <https://arxiv.org/abs/1507.08173>`__)
 
-[5] Jiashi Feng, et al. “Online robust pca via stochastic opti-
-mization.“ Advances in neural information processing systems, 26, 2013.
+[5] Jiashi Feng, et al. “Online robust pca via stochastic optimization.“ Advances in neural information processing systems, 26, 2013.
 (`pdf <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.721.7506&rep=rep1&type=pdf>`__)
 
 [6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.
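
Note on the new "How does Qolmat work ?" section in README.rst: the three steps it lists (mask observed values in each fold, impute and score on the masked entries, pool the fold metrics per imputer) can be illustrated with a small library-free sketch. This is not Qolmat's API. The names `impute_mean`, `impute_locf` and `benchmark`, the toy series, and the use of MAE as the only metric are all assumptions made for illustration.

```python
import random
from statistics import mean


def impute_mean(values):
    """Fill None entries with the mean of the observed entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_locf(values):
    """Carry the last observation forward; leading gaps take the first observed value."""
    last = next(v for v in values if v is not None)
    filled = []
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def benchmark(values, imputers, n_folds=4, ratio_masked=0.1, seed=0):
    """Hypothetical helper: score imputers on artificially masked entries."""
    rng = random.Random(seed)
    observed_idx = [i for i, v in enumerate(values) if v is not None]
    n_masked = max(1, int(ratio_masked * len(observed_idx)))
    scores = {name: [] for name in imputers}
    for _ in range(n_folds):
        # Step 1: hide a random subset of the observed values.
        masked_idx = rng.sample(observed_idx, n_masked)
        masked = set(masked_idx)
        corrupted = [None if i in masked else v for i, v in enumerate(values)]
        for name, impute in imputers.items():
            # Step 2: impute everything, score only where the truth is known.
            imputed = impute(corrupted)
            mae = mean(abs(imputed[i] - values[i]) for i in masked_idx)
            scores[name].append(mae)
    # Step 3: pool the per-fold metrics into one value per imputer.
    return {name: mean(fold_scores) for name, fold_scores in scores.items()}


# Toy data (an assumption): a slow ramp with a few genuinely missing values.
series = [0.1 * i for i in range(100)]
for i in (5, 40, 41, 77):
    series[i] = None

results = benchmark(series, {"mean": impute_mean, "locf": impute_locf})
print(results)
```

On this toy ramp, carrying the last observation forward gets a lower pooled MAE than the column mean, as one would expect for a smooth time series. The genuinely missing entries are never used for scoring, only the artificially masked ones.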
