Commit 4430486

Merge pull request #148 from scikit-learn-contrib/dev
Dev
2 parents beb6c2a + 47565ff commit 4430486

17 files changed, +439 -17 lines changed

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
[bumpversion]
2-
current_version = 0.1.6
2+
current_version = 0.1.7
33
commit = True
44
tag = True
55

HISTORY.rst

Lines changed: 6 additions & 0 deletions
@@ -2,6 +2,12 @@
22
History
33
=======
44

5+
0.1.7 (2024-06-13)
6+
------------------
7+
* Little's test implemented in a new holes_characterization module
8+
* Documentation now includes an analysis section with a tutorial
9+
* Hole generators now provide reproducible outputs
10+
511
0.1.5 (2024-04-17)
612
------------------
713

Makefile

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
11
coverage:
2-
pytest --cov-branch --cov=qolmat --cov-report=xml
2+
pytest --cov-branch --cov=qolmat --cov-report=xml tests
33

44
doctest:
55
pytest --doctest-modules --pyargs qolmat

docs/analysis.rst

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
1+
2+
Analysis
3+
========
4+
This section gives a better understanding of the holes in a dataset.
5+
6+
1. General approach
7+
-------------------
8+
9+
As described in section :ref:`hole_generator`, there are 3 main types of missing data mechanism: MCAR, MAR and MNAR.
10+
The analysis module provides tools to characterize the type of holes.
11+
12+
The MNAR case is the trickiest: the user must first consider whether their missing-data mechanism is MNAR. In the meantime, we assume that the missing-data mechanism is ignorable (i.e., it is not MNAR). If an MNAR mechanism is suspected, please see the article :ref:`An approach to test for MNAR [1]<Noonan-article>` for relevant actions.
13+
14+
Qolmat then provides a test to determine whether the missing-data mechanism is MCAR or MAR.
15+
16+
2. How to use the results
17+
-------------------------
18+
19+
Once the MCAR test has been run, we can assume that the missing-data mechanism is either MCAR or not. This serves three different purposes:
20+
21+
a. Diagnosis
22+
^^^^^^^^^^^^
23+
24+
If the result of the MCAR test is "The MCAR hypothesis is rejected", we can then ask over which range of values holes are more frequent.
25+
The test result can then be used for continuous data quality management.
26+
27+
b. Estimation
28+
^^^^^^^^^^^^^
29+
30+
Some estimation methods are not suitable for the MAR case. For example, dropping the rows that contain NaNs can bias the estimator under MAR, so this method should only be used once the missing-data mechanism has been validated as MCAR.
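
To make the estimation point concrete, here is a minimal simulation sketch that is not part of Qolmat. It assumes two correlated Gaussian columns (the correlation, sample size, hole ratio and threshold are illustrative choices) and compares the mean of the second column after dropping the rows with holes, under an MCAR and under a MAR mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative data: x2 is correlated with x1 and has a true mean of 0.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)

# MCAR: holes in x2 are drawn independently of the data.
mask_mcar = rng.random(n) < 0.2
# MAR: holes in x2 depend only on the always-observed x1.
mask_mar = x1 > 1.0

# "Dropping the NaNs" amounts to averaging x2 over the unmasked rows.
mean_mcar = x2[~mask_mcar].mean()
mean_mar = x2[~mask_mar].mean()

print(f"MCAR estimate: {mean_mcar:.2f}, MAR estimate: {mean_mar:.2f}")
```

Under MCAR the estimate stays close to the true mean of 0, while under this MAR mechanism it is shifted downwards because the rows with a large x1 (and hence, on average, a large x2) are removed.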
31+
32+
c. Imputation
33+
^^^^^^^^^^^^^
34+
35+
Qolmat allows model selection among imputation algorithms. For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified hole generator. It is natural to create these masks according to the missing-data mechanism identified by the test. See the documentation on using Qolmat for imputation `model selection <https://qolmat.readthedocs.io/en/latest/#:~:text=How%20does%20Qolmat%20work%20%3F>`_.
36+
37+
3. The MCAR Tests
38+
-----------------
39+
40+
There are several statistical tests to determine if the missing data mechanism is MCAR or MAR. Most tests are based on the notion of missing pattern.
41+
A missing pattern, also called a pattern, is the structure of observed and missing values in a dataset. For example, for a dataset with two columns, the possible patterns are: (0, 0), (1, 0), (0, 1), (1, 1). The value 1 indicates that the value in the column is missing.
42+
43+
The MCAR missing-data mechanism means that there is independence between the presence of holes and the observed values. In other words, the data distribution is the same for all patterns.
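
As an illustration of the notion of pattern, the sketch below encodes the pattern of each row with pandas. The toy DataFrame and the string encoding ("01" meaning that the first column is observed and the second is missing) are assumptions made for this example and are not Qolmat's internal representation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan, 5.0],
        "b": [np.nan, 2.0, 3.0, np.nan, 5.0],
    }
)

# 1 marks a missing value, 0 an observed one; one string per row.
patterns = df.isna().astype(int).astype(str).agg("".join, axis=1)

# Number of rows per pattern.
pattern_counts = patterns.value_counts()

# Column means within each pattern. Under MCAR, they should be similar
# across the patterns in which the column is observed.
means_by_pattern = df.groupby(patterns).mean()

print(pattern_counts)
print(means_by_pattern)
```

Grouping the rows by pattern in this way is the starting point of most MCAR tests: each test then compares some property of the observed data (means, full distributions) between the groups.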
44+
45+
a. Little's Test
46+
^^^^^^^^^^^^^^^^
47+
48+
The best-known MCAR test is the :ref:`Little [2]<Little-article>` test, which is implemented in :class:`LittleTest`. Keep in mind that Little's test is designed to test the homogeneity of means across missing patterns and will not detect heterogeneity of covariance across missing patterns.
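
For intuition about what is being tested, the sketch below computes the statistic d² of Little's test and its degrees of freedom. It is a simplified illustration and not the :class:`LittleTest` implementation: the global mean and covariance are estimated from the available values with pandas, whereas Little's original test uses maximum likelihood estimates. The function name ``little_statistic`` and the simulated data are hypothetical.

```python
import numpy as np
import pandas as pd


def little_statistic(df: pd.DataFrame) -> tuple[float, int]:
    """Return (d2, degrees of freedom) for a simplified Little's test."""
    # Assumption: available-case estimates replace the ML estimates.
    mu = df.mean()
    sigma = df.cov()
    patterns = df.isna().astype(int).astype(str).agg("".join, axis=1)

    d2, n_observed_cols = 0.0, 0
    for _, group in df.groupby(patterns):
        observed = [col for col in df.columns if group[col].notna().all()]
        if not observed:
            continue  # rows with nothing observed carry no information
        # Gap between the pattern mean and the global mean, scaled by
        # the covariance restricted to the observed columns.
        diff = (group[observed].mean() - mu[observed]).to_numpy()
        sigma_obs = sigma.loc[observed, observed].to_numpy()
        d2 += len(group) * diff @ np.linalg.solve(sigma_obs, diff)
        n_observed_cols += len(observed)
    return float(d2), n_observed_cols - df.shape[1]


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 2)), columns=["x1", "x2"])

df_mcar = df.copy()
df_mcar.loc[rng.random(500) < 0.2, "x2"] = np.nan  # holes independent of data

df_mar = df.copy()
df_mar.loc[df["x1"] > 1.0, "x2"] = np.nan  # holes driven by x1

d2_mcar, dof = little_statistic(df_mcar)
d2_mar, _ = little_statistic(df_mar)
print(dof, round(d2_mcar, 2), round(d2_mar, 2))
```

Under the null hypothesis, d² asymptotically follows a chi-squared distribution with the returned number of degrees of freedom. Here there is one degree of freedom, for which the 5% critical value is about 3.84, so a statistic far above that value leads to rejecting MCAR.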
49+
50+
b. PKLM Test
51+
^^^^^^^^^^^^
52+
53+
The :ref:`PKLM [3]<PKLM-article>` (Projected Kullback-Leibler MCAR) test compares the distributions of different missing patterns on random projections in the variable space of the data. This recent test applies to mixed-type data. It is not yet implemented in Qolmat.
54+
55+
References
56+
----------
57+
58+
.. _Noonan-article:
59+
60+
[1] Noonan, Jack, et al. `An integrated approach to test for missing not at random. <https://arxiv.org/abs/2208.07813>`_ arXiv preprint arXiv:2208.07813 (2022).
61+
62+
.. _Little-article:
63+
64+
[2] Little, R. J. A. `A Test of Missing Completely at Random for Multivariate Data with Missing Values. <https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722>`_ Journal of the American Statistical Association, Volume 83, 1988 - Issue 404.
65+
66+
.. _PKLM-article:
67+
68+
[3] Spohn, Meta-Lina, et al. `PKLM: A flexible MCAR test using Classification. <https://arxiv.org/abs/2109.10150>`_ arXiv preprint arXiv:2109.10150 (2021).

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@
2727
author = "Quantmetry"
2828

2929
# The full version, including alpha/beta/rc tags
30-
version = "0.1.6"
30+
version = "0.1.7"
3131
release = version
3232

3333
# -- General configuration ---------------------------------------------------

docs/index.rst

Lines changed: 8 additions & 0 deletions
@@ -25,3 +25,11 @@
2525
:caption: API
2626

2727
api
28+
29+
.. toctree::
30+
:maxdepth: 2
31+
:hidden:
32+
:caption: ANALYSIS
33+
34+
analysis
35+
examples/tutorials/plot_tuto_mcar

examples/RPCA.md

Lines changed: 0 additions & 1 deletion
@@ -199,7 +199,6 @@ plt.show()
199199

200200
```python
201201
%%time
202-
# rpca_noisy = RpcaNoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
203202
rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
204203
M, A = rpca_noisy.decompose(D, Omega)
205204
# imputed = X

examples/tutorials/plot_tuto_categorical.py

Lines changed: 3 additions & 1 deletion
@@ -22,6 +22,8 @@
2222
# We get the data and focus on the explanatory variables
2323
df = data.get_data("Titanic")
2424
df = df.drop(columns=["Survived"])
25+
print("Dataset shape:", df.shape)
26+
df.head()
2527

2628
# %%
2729
# 2. Mixed type imputation methods
@@ -61,7 +63,7 @@
6163
imputer_hgb = ImputerRegressor(estimator=pipestimator, handler_nan="none")
6264
imputer_wrap_hgb = preprocessing.WrapperTransformer(imputer_hgb, bt)
6365

64-
# %%
66+
# %%
6567
# 3. Mixed type model selection
6668
# ---------------------------------------------------------------
6769
# Let us now compare these three approaches by measuring their ability to impute uniformly

examples/tutorials/plot_tuto_hole_generator.py

Lines changed: 1 addition & 1 deletion
@@ -282,7 +282,7 @@ def plot_cdf(
282282

283283

284284
# %%
285-
# d. Grouped Hole Generator
285+
# e. Grouped Hole Generator
286286
# ***************************************************************
287287
# The holes are generated according to the groups defined by the user.
288288
# This method is implemented in the

examples/tutorials/plot_tuto_mcar.py

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
1+
"""
2+
============================================
3+
Tutorial for Testing the MCAR Case
4+
============================================
5+
6+
In this tutorial, we show how to test the MCAR case using Little's test.
7+
"""
8+
9+
# %%
10+
# First import some libraries
11+
from matplotlib import pyplot as plt
12+
13+
import numpy as np
14+
import pandas as pd
15+
from scipy.stats import norm
16+
17+
from qolmat.analysis.holes_characterization import LittleTest
18+
from qolmat.benchmark.missing_patterns import UniformHoleGenerator
19+
20+
plt.rcParams.update({"font.size": 12})
21+
22+
23+
# %%
24+
# Generating random data
25+
# ----------------------
26+
27+
rng = np.random.RandomState(42)
28+
data = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=200)
29+
df = pd.DataFrame(data=data, columns=["Column 1", "Column 2"])
30+
31+
q975 = norm.ppf(0.975)
32+
33+
# %%
34+
# The Little's test
35+
# ---------------------------------------------------------------
36+
# First, we need to introduce the concept of a missing pattern. A missing pattern, also called a
37+
# pattern, is the structure of observed and missing values in a dataset. For example, in a
38+
# dataset with two columns, the possible patterns are: (0, 0), (1, 0), (0, 1), (1, 1). The value 1
39+
# (0) indicates that the column value is missing (observed).
40+
#
41+
# The null hypothesis, H0, is: "The means of observations within each pattern are similar."
42+
#
43+
# We choose to use the classic threshold of 5%. If the test p-value is below this threshold,
44+
# we reject the null hypothesis.
45+
#
46+
# This notebook shows how Little's test performs on a simplistic case, as well as its limitations. We
47+
# instantiate a test object with a random state for reproducibility.
48+
49+
test_mcar = LittleTest(random_state=rng)
50+
51+
# %%
52+
# Case 1: MCAR holes (True negative)
53+
# ==================================
54+
55+
56+
hole_gen = UniformHoleGenerator(
57+
n_splits=1, random_state=rng, subset=["Column 2"], ratio_masked=0.2
58+
)
59+
df_mask = hole_gen.generate_mask(df)
60+
df_nan = df.where(~df_mask, np.nan)
61+
62+
has_nan = df_mask.any(axis=1)
63+
df_observed = df.loc[~has_nan]
64+
df_hidden = df.loc[has_nan]
65+
66+
plt.scatter(df_observed["Column 1"], df_observed[["Column 2"]], label="Fully observed values")
67+
plt.scatter(df_hidden[["Column 1"]], df_hidden[["Column 2"]], label="Values with missing C2")
68+
69+
plt.legend(
70+
loc="lower left",
71+
fontsize=8,
72+
)
73+
plt.xlabel("Column 1")
74+
plt.ylabel("Column 2")
75+
plt.title("Case 1: MCAR data")
76+
plt.grid()
77+
plt.show()
78+
79+
# %%
80+
result = test_mcar.test(df_nan)
81+
print(f"Test p-value: {result:.2%}")
82+
# %%
83+
# The p-value is larger than 0.05, therefore we don't reject the H0 MCAR assumption. In this case
84+
# this is a true negative.
85+
86+
# %%
87+
# Case 2: MAR holes with mean bias (True positive)
88+
# ================================================
89+
90+
df_mask = pd.DataFrame({"Column 1": False, "Column 2": df["Column 1"] > q975}, index=df.index)
91+
92+
df_nan = df.where(~df_mask, np.nan)
93+
94+
has_nan = df_mask.any(axis=1)
95+
df_observed = df.loc[~has_nan]
96+
df_hidden = df.loc[has_nan]
97+
98+
plt.scatter(df_observed["Column 1"], df_observed[["Column 2"]], label="Fully observed values")
99+
plt.scatter(df_hidden[["Column 1"]], df_hidden[["Column 2"]], label="Values with missing C2")
100+
101+
plt.legend(
102+
loc="lower left",
103+
fontsize=8,
104+
)
105+
plt.xlabel("Column 1")
106+
plt.ylabel("Column 2")
107+
plt.title("Case 2: MAR data with mean bias")
108+
plt.grid()
109+
plt.show()
110+
111+
# %%
112+
113+
result = test_mcar.test(df_nan)
114+
print(f"Test p-value: {result:.2%}")
115+
# %%
116+
# The p-value is smaller than 0.05, therefore we reject the H0 MCAR assumption. In this case
117+
# this is a true positive.
118+
119+
# %%
120+
# Case 3: MAR holes without any mean bias (False negative)
121+
# ========================================================
122+
#
123+
# This case is designed to highlight the limits of Little's test. Here, we generate
124+
# holes when the absolute value of the first feature is high. This missingness mechanism is clearly
125+
# MAR, but the means of the missing patterns are not statistically different.
126+
127+
df_mask = pd.DataFrame(
128+
{"Column 1": False, "Column 2": df["Column 1"].abs() > q975}, index=df.index
129+
)
130+
131+
df_nan = df.where(~df_mask, np.nan)
132+
133+
has_nan = df_mask.any(axis=1)
134+
df_observed = df.loc[~has_nan]
135+
df_hidden = df.loc[has_nan]
136+
137+
plt.scatter(df_observed["Column 1"], df_observed[["Column 2"]], label="Fully observed values")
138+
plt.scatter(df_hidden[["Column 1"]], df_hidden[["Column 2"]], label="Values with missing C2")
139+
140+
plt.legend(
141+
loc="lower left",
142+
fontsize=8,
143+
)
144+
plt.xlabel("Column 1")
145+
plt.ylabel("Column 2")
146+
plt.title("Case 3: MAR data without any mean bias")
147+
plt.grid()
148+
plt.show()
149+
150+
# %%
151+
152+
result = test_mcar.test(df_nan)
153+
print(f"Test p-value: {result:.2%}")
154+
# %%
155+
# The p-value is larger than 0.05, therefore we don't reject the H0 MCAR assumption. In this case
156+
# this is a false negative since the missingness mechanism is MAR.
157+
158+
# %%
159+
# Limitations
160+
# -----------
161+
# In this tutorial, we can see that Little's test fails to detect covariance heterogeneity between
162+
# patterns.
163+
#
164+
# We also note that Little's test does not handle categorical data or temporally
165+
# correlated data.
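
The covariance limitation can be checked numerically. The sketch below is not part of the tutorial: it uses a larger simulated sample than the 200 rows above (an assumption made to stabilize the estimates) and a rounded threshold of 1.96, and it compares the mean and variance of the first column between rows with and without a hole, for the mechanism of Case 3.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(20_000, 2)), columns=["Column 1", "Column 2"])

# Same mechanism as in Case 3: Column 2 is hidden when |Column 1| is large.
is_hidden = df["Column 1"].abs() > 1.96

# Mean and variance of the always-observed column, per missing pattern.
stats = df["Column 1"].groupby(is_hidden).agg(["mean", "var"])
print(stats)
```

The two means are close, and the means are what Little's test compares. The variances differ strongly, but the test does not look at them, which is why the MAR mechanism goes undetected.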
