Merge pull request #144 from scikit-learn-contrib/mcar-test-implementation

JulienRoussel77 · web-flow · commit f7cbba6f42c8 · 2024-06-13T09:41:20.000+02:00
Mcar test implementation
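
Reviewer note: a minimal sketch, not the code of this PR, of how `LittleTest.test` groups rows by missing pattern and accumulates the degrees of freedom of the chi-squared statistic. It uses only the Python standard library. The `pattern_summary` helper and the toy `rows` are hypothetical names introduced for illustration, and `None` stands in for the NaN values a pandas DataFrame would hold.

```python
from collections import Counter


def pattern_summary(rows):
    """Count rows per missing pattern and derive the degrees of freedom.

    Hypothetical helper: True marks an observed value, mirroring the
    df.notna() mask that LittleTest.test groups on.
    """
    n_cols = len(rows[0])
    patterns = Counter(tuple(value is not None for value in row) for row in rows)
    # Each distinct pattern contributes its number of observed columns,
    # and the number of columns is subtracted once.
    degree_f = sum(sum(pattern) for pattern in patterns) - n_cols
    return patterns, degree_f


# Toy data (assumption): the second column is missing in two of four rows.
rows = [
    (0.3, 1.2),
    (-0.5, None),
    (1.1, 0.4),
    (0.9, None),
]
patterns, degree_f = pattern_summary(rows)
print(dict(patterns), degree_f)
```

With two patterns, one fully observed and one with a single observed column, the sketch yields 2 + 1 - 2 = 1 degree of freedom, which is the value the implementation would pass to `chi2.cdf` for a dataset of this shape.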
diff --git a/HISTORY.rst b/HISTORY.rst
@@ -2,6 +2,12 @@
 History
 =======
 
+0.1.7 (2024-06-13)
+------------------
+* Little's test implemented in a new holes_characterization module
+* Documentation now includes an analysis section with a tutorial
+* Hole generators now provide reproducible outputs
+
 0.1.5 (2024-04-17)
 ------------------
 
diff --git a/Makefile b/Makefile
@@ -1,5 +1,5 @@
 coverage:
-	pytest --cov-branch --cov=qolmat --cov-report=xml
+	pytest --cov-branch --cov=qolmat --cov-report=xml tests
 
 doctest:
 	pytest --doctest-modules --pyargs qolmat
diff --git a/docs/analysis.rst b/docs/analysis.rst
@@ -0,0 +1,68 @@
+
+Analysis
+========
+This section gives a better understanding of the holes in a dataset.
+
+1. General approach
+-------------------
+
+As described in section :ref:`hole_generator`, there are 3 main types of missing data mechanism: MCAR, MAR and MNAR.
+The analysis module provides tools to characterize the type of holes.
+
+The MNAR case is the trickiest: the user must first consider whether their missing-data mechanism is MNAR. In the meantime, we assume that the missing-data mechanism is ignorable (i.e., it is not MNAR). If an MNAR mechanism is suspected, please see the article :ref:`An approach to test for MNAR [1]<Noonan-article>` for relevant actions.
+
+Then Qolmat proposes a test to determine whether the missing data mechanism is MCAR or MAR.
+
+2. How to use the results
+-------------------------
+
+At the end of the MCAR test, we can decide whether or not to assume that the missing-data mechanism is MCAR. This serves three different purposes:
+
+a. Diagnosis
+^^^^^^^^^^^^
+
+If the result of the MCAR test is "The MCAR hypothesis is rejected", we can then ask ourselves over which range of values holes are more present.
+The test result can then be used for continuous data quality management.
+
+b. Estimation
+^^^^^^^^^^^^^
+
+Some estimation methods are not suitable for the MAR case. For example, dropping the NaNs introduces bias into the estimator unless the missing-data mechanism is MCAR, so the MCAR hypothesis must be validated first.
+
+c. Imputation
+^^^^^^^^^^^^^
+
+Qolmat allows model selection among imputation algorithms. For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified hole generator. It seems natural to create these masks according to the same missing-data mechanism as the one determined by the test. Here is the documentation on using Qolmat for imputation `model selection <https://qolmat.readthedocs.io/en/latest/#:~:text=How%20does%20Qolmat%20work%20%3F>`_.
+
+3. The MCAR Tests
+-----------------
+
+There are several statistical tests to determine if the missing data mechanism is MCAR or MAR. Most tests are based on the notion of missing pattern.
+A missing pattern, also called a pattern, is the structure of observed and missing values in a dataset. For example, for a dataset with two columns, the possible patterns are: (0, 0), (1, 0), (0, 1), (1, 1). The value 1 indicates that the value in the column is missing.
+
+The MCAR missing-data mechanism means that there is independence between the presence of holes and the observed values. In other words, the data distribution is the same for all patterns.
+
+a. Little's Test
+^^^^^^^^^^^^^^^^
+
+The best-known MCAR test is the :ref:`Little [2]<Little-article>` test, and it has been implemented in :class:`LittleTest`. Keep in mind that Little's test is designed to test the homogeneity of means across the missing patterns and won't be effective at detecting heterogeneity of covariance across missing patterns.
+
+b. PKLM Test
+^^^^^^^^^^^^
+
+The :ref:`PKLM [3]<PKLM-article>` (Projected Kullback-Leibler MCAR) test compares the distributions of different missing patterns on random projections in the variable space of the data. This recent test applies to mixed-type data. It is not yet implemented in Qolmat.
+
+References
+----------
+
+.. _Noonan-article:
+
+[1] Noonan, Jack, et al. `An integrated approach to test for missing not at random. <https://arxiv.org/abs/2208.07813>`_ arXiv preprint arXiv:2208.07813 (2022).
+
+.. _Little-article:
+
+[2] Little, R. J. A. `A Test of Missing Completely at Random for Multivariate Data with Missing Values. <https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722>`_ Journal of the American Statistical Association, Volume 83, 1988 - Issue 404.
+
+.. _PKLM-article:
+
+[3] Spohn, Meta-Lina, et al. `PKLM: A flexible MCAR test using Classification. <https://arxiv.org/abs/2109.10150>`_ arXiv preprint arXiv:2109.10150 (2021).
diff --git a/docs/index.rst b/docs/index.rst
@@ -25,3 +25,11 @@
    :caption: API
 
    api
+
+.. toctree::
+   :maxdepth: 2
+   :hidden:
+   :caption: ANALYSIS
+
+   analysis
+   examples/tutorials/plot_tuto_mcar
diff --git a/examples/RPCA.md b/examples/RPCA.md
@@ -199,7 +199,6 @@ plt.show()
 
 ```python
 %%time
-# rpca_noisy = RpcaNoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
 rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
 M, A = rpca_noisy.decompose(D, Omega)
 # imputed = X
diff --git a/examples/tutorials/plot_tuto_hole_generator.py b/examples/tutorials/plot_tuto_hole_generator.py
@@ -282,7 +282,7 @@ def plot_cdf(
 
 
 # %%
-# d. Grouped Hole Generator
+# e. Grouped Hole Generator
 # ***************************************************************
 # The holes are generated according to the groups defined by the user.
 # This method is implemented in the
diff --git a/examples/tutorials/plot_tuto_mcar.py b/examples/tutorials/plot_tuto_mcar.py
@@ -0,0 +1,165 @@
+"""
+============================================
+Tutorial for Testing the MCAR Case
+============================================
+
+In this tutorial, we show how to test the MCAR case using Little's test.
+"""
+
+# %%
+# First import some libraries
+from matplotlib import pyplot as plt
+
+import numpy as np
+import pandas as pd
+from scipy.stats import norm
+
+from qolmat.analysis.holes_characterization import LittleTest
+from qolmat.benchmark.missing_patterns import UniformHoleGenerator
+
+plt.rcParams.update({"font.size": 12})
+
+
+# %%
+# Generating random data
+# ----------------------
+
+rng = np.random.RandomState(42)
+data = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=200)
+df = pd.DataFrame(data=data, columns=["Column 1", "Column 2"])
+
+q975 = norm.ppf(0.975)
+
+# %%
+# The Little's test
+# ---------------------------------------------------------------
+# First, we need to introduce the concept of a missing pattern. A missing pattern, also called a
+# pattern, is the structure of observed and missing values in a dataset. For example, in a
+# dataset with two columns, the possible patterns are: (0, 0), (1, 0), (0, 1), (1, 1). The value 1
+# (0) indicates that the column value is missing (observed).
+#
+# The null hypothesis, H0, is: "The means of observations within each pattern are similar.".
+#
+# We choose to use the classic threshold of 5%. If the test p-value is below this threshold,
+# we reject the null hypothesis.
+#
+# This notebook shows how Little's test performs on a simplistic case, as well as its limitations.
+# We instantiate a test object with a random state for reproducibility.
+
+test_mcar = LittleTest(random_state=rng)
+
+# %%
+# Case 1: MCAR holes (True negative)
+# ==================================
+
+
+hole_gen = UniformHoleGenerator(
+    n_splits=1, random_state=rng, subset=["Column 2"], ratio_masked=0.2
+)
+df_mask = hole_gen.generate_mask(df)
+df_nan = df.where(~df_mask, np.nan)
+
+has_nan = df_mask.any(axis=1)
+df_observed = df.loc[~has_nan]
+df_hidden = df.loc[has_nan]
+
+plt.scatter(df_observed["Column 1"], df_observed[["Column 2"]], label="Fully observed values")
+plt.scatter(df_hidden[["Column 1"]], df_hidden[["Column 2"]], label="Values with missing C2")
+
+plt.legend(
+    loc="lower left",
+    fontsize=8,
+)
+plt.xlabel("Column 1")
+plt.ylabel("Column 2")
+plt.title("Case 1: MCAR missingness mechanism")
+plt.grid()
+plt.show()
+
+# %%
+result = test_mcar.test(df_nan)
+print(f"Test p-value: {result:.2%}")
+# %%
+# The p-value is quite high, therefore we don't reject H0.
+# We can then suppose that our missingness mechanism is MCAR.
+
+# %%
+# Case 2: MAR holes with mean bias (True positive)
+# ================================================
+
+df_mask = pd.DataFrame({"Column 1": False, "Column 2": df["Column 1"] > q975}, index=df.index)
+
+df_nan = df.where(~df_mask, np.nan)
+
+has_nan = df_mask.any(axis=1)
+df_observed = df.loc[~has_nan]
+df_hidden = df.loc[has_nan]
+
+plt.scatter(df_observed["Column 1"], df_observed[["Column 2"]], label="Fully observed values")
+plt.scatter(df_hidden[["Column 1"]], df_hidden[["Column 2"]], label="Values with missing C2")
+
+plt.legend(
+    loc="lower left",
+    fontsize=8,
+)
+plt.xlabel("Column 1")
+plt.ylabel("Column 2")
+plt.title("Case 2: MAR missingness mechanism")
+plt.grid()
+plt.show()
+
+# %%
+
+result = test_mcar.test(df_nan)
+print(f"Test p-value: {result:.2%}")
+# %%
+# The p-value is lower than the classic threshold (5%).
+# H0 is then rejected and we can suppose that our missingness mechanism is MAR.
+
+# %%
+# Case 3: MAR holes without any mean bias (False negative)
+# ========================================================
+#
+# This specific case is designed to emphasize the limits of Little's test. Here, we generate
+# holes when the absolute value of the first feature is high. This missingness mechanism is clearly
+# MAR, but the means of the missing patterns are not statistically different.
+
+df_mask = pd.DataFrame(
+    {"Column 1": False, "Column 2": df["Column 1"].abs() > q975}, index=df.index
+)
+
+df_nan = df.where(~df_mask, np.nan)
+
+has_nan = df_mask.any(axis=1)
+df_observed = df.loc[~has_nan]
+df_hidden = df.loc[has_nan]
+
+plt.scatter(df_observed["Column 1"], df_observed[["Column 2"]], label="Fully observed values")
+plt.scatter(df_hidden[["Column 1"]], df_hidden[["Column 2"]], label="Values with missing C2")
+
+plt.legend(
+    loc="lower left",
+    fontsize=8,
+)
+plt.xlabel("Column 1")
+plt.ylabel("Column 2")
+plt.title("Case 3: MAR missingness mechanism undetected by the Little's test")
+plt.grid()
+plt.show()
+
+# %%
+
+result = test_mcar.test(df_nan)
+print(f"Test p-value: {result:.2%}")
+# %%
+# The p-value is higher than the classic threshold (5%).
+# H0 is not rejected whereas the missingness mechanism is clearly MAR.
+
+# %%
+# Limitations
+# -----------
+# In this tutorial, we can see that Little's test fails to detect covariance heterogeneity between
+# patterns.
+#
+# We also note that Little's test does not handle categorical data or temporally
+# correlated data.
diff --git a/qolmat/analysis/holes_characterization.py b/qolmat/analysis/holes_characterization.py
@@ -0,0 +1,93 @@
+from abc import ABC, abstractmethod
+from typing import Optional, Union
+
+import numpy as np
+import pandas as pd
+from scipy.stats import chi2
+
+from qolmat.imputations.imputers import ImputerEM
+
+
+class McarTest(ABC):
+    """
+    Abstract class for MCAR tests.
+    """
+
+    @abstractmethod
+    def test(self, df: pd.DataFrame) -> float:
+        pass
+
+
+class LittleTest(McarTest):
+    """
+    This class implements Little's test, which is designed to detect heterogeneity across
+    the missing patterns. The null hypothesis is "The missing data mechanism is MCAR". The
+    shortcoming of this test is that it won't detect heterogeneity of covariance.
+
+    References
+    ----------
+    Little. "A Test of Missing Completely at Random for Multivariate Data with Missing Values."
+    Journal of the American Statistical Association, Volume 83, 1988 - Issue 404
+
+    Parameters
+    ----------
+    imputer : Optional[ImputerEM]
+        Imputer based on the EM algorithm. The 'model' attribute must be equal to 'multinormal'.
+        If None, the default ImputerEM is taken.
+    random_state : Union[None, int, np.random.RandomState], optional
+        Controls the randomness of the fit_transform, by default None
+    """
+
+    def __init__(
+        self,
+        imputer: Optional[ImputerEM] = None,
+        random_state: Union[None, int, np.random.RandomState] = None,
+    ):
+        super().__init__()
+        if imputer and imputer.model != "multinormal":
+            raise AttributeError(
+                "The ImputerEM model must be 'multinormal' to use the Little's test"
+            )
+        self.imputer = imputer
+        self.random_state = random_state
+
+    def test(self, df: pd.DataFrame) -> float:
+        """
+        Apply the Little's test over a real dataframe.
+
+
+        Parameters
+        ----------
+        df : pd.DataFrame
+            The input dataset with missing values.
+
+        Returns
+        -------
+        float
+            The p-value of the test.
+        """
+        imputer = self.imputer or ImputerEM(random_state=self.random_state)
+        imputer = imputer._fit_element(df)
+
+        d0 = 0
+        n_rows, n_cols = df.shape
+        degree_f = -n_cols
+        ml_means = imputer.means
+        ml_cov = n_rows / (n_rows - 1) * imputer.cov
+
+        # Iterate over the patterns
+
+        df_nan = df.notna()
+        for tup_pattern, df_nan_pattern in df_nan.groupby(df_nan.columns.tolist()):
+            n_rows_pattern, _ = df_nan_pattern.shape
+            ind_pattern = df_nan_pattern.index
+            df_pattern = df.loc[ind_pattern, list(tup_pattern)]
+            obs_mean = df_pattern.mean().to_numpy()
+
+            diff_means = obs_mean - ml_means[list(tup_pattern)]
+            inv_sigma_pattern = np.linalg.inv(ml_cov[:, tup_pattern][tup_pattern, :])
+
+            d0 += n_rows_pattern * np.dot(np.dot(diff_means, inv_sigma_pattern), diff_means.T)
+            degree_f += tup_pattern.count(True)
+
+        return 1 - float(chi2.cdf(d0, degree_f))
diff --git a/qolmat/benchmark/missing_patterns.py b/qolmat/benchmark/missing_patterns.py
diff --git a/tests/analysis/test_holes_characterization.py b/tests/analysis/test_holes_characterization.py
diff --git a/tests/benchmark/test_missing_patterns.py b/tests/benchmark/test_missing_patterns.py
diff --git a/tests/imputations/rpca/test_rpca_noisy.py b/tests/imputations/rpca/test_rpca_noisy.py