
Commit 256d8c9

adam2392 and bloebp authored
[ENH] Faster and more flexible code, and code sharing for kernel tests (#19)
Towards: #15

Changes proposed in this pull request:

- refactors code to setup for kcd test
- allows any of the pairwise kernel strings to be passed in from sklearn (which is significantly faster than using partial because sklearn optimizes the in-house kernels)
- also requires kernel functions to be a specific API, so it's easier to test, implement and document

This should all make implementation of the kcd test pretty straightforward

---------

Signed-off-by: Adam Li <[email protected]>
Co-authored-by: Patrick Bloebaum <[email protected]>
1 parent 1938c19 commit 256d8c9

22 files changed, +1536 -995 lines changed

.github/workflows/main.yml

Lines changed: 3 additions & 3 deletions
@@ -22,7 +22,7 @@ jobs:
 runs-on: ubuntu-latest
 strategy:
 matrix:
-poetry-version: [1.3.0]
+poetry-version: [1.6.1]
 steps:
 - name: Checkout repository
 uses: actions/checkout@v3
@@ -59,7 +59,7 @@ jobs:
 matrix:
 os: [ubuntu, macos, windows]
 python-version: [3.8, 3.9, "3.10"]
-poetry-version: [1.3.0]
+poetry-version: [1.6.1]
 name: build ${{ matrix.os }} - py${{ matrix.python-version }}
 runs-on: ${{ matrix.os }}-latest
 defaults:
@@ -122,7 +122,7 @@ jobs:
 matrix:
 os: [ubuntu, macos, windows]
 python-version: [3.8, "3.10"] # oldest and newest supported versions
-poetry-version: [1.3.0]
+poetry-version: [1.6.1]
 name: Unit-test ${{ matrix.os }} - py${{ matrix.python-version }}
 runs-on: ${{ matrix.os }}-latest
 defaults:

CITATION.cff

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# YAML 1.2
+---
+# Metadata for citation of this software according to the CFF format (https://citation-file-format.github.io/)
+cff-version: 1.2.0
+title: 'Pywhy-Stats: Statistical inference in Python.'
+abstract: 'Pywhy-Stats is a Python library that leverages a simple API for performing independence and conditional independence testing.'
+authors:
+- given-names: Adam
+family-names: Li
+affiliation: 'Department of Computer Science, Columbia University, New York, NY, USA'
+orcid: 'https://orcid.org/0000-0001-8421-365X'
+- given-names: Patrick
+family-names: Blöbaum
+affiliation: 'Amazon'
+
+type: software
+repository-code: 'https://github.com/py-why/pywhy-stats'
+license: MIT
+keywords:
+- causality
+- pywhy
+- statistics
+- independece testing
+...

CONTRIBUTING.md

Lines changed: 5 additions & 0 deletions
@@ -141,6 +141,11 @@ When you're ready to contribute code to address an open issue, please follow the
 
 </details>
 
+5. Adding your name to the CITATION.cff file
+
+We are a community-driven open-source project and want to make sure all contributors are acknowledged. If you are a new contributor, add your name
+to the ``CITATION.cff`` file and relevant metadata.
+
 ### Writing docstrings
 
 We use [Sphinx](https://www.sphinx-doc.org/en/master/index.html) to build our API docs, which automatically parses all docstrings

doc/api.rst

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ contains the p-value and the test statistic and optionally additional informatio
 Testing for conditional independence among variables is a core part
 of many data analysis procedures.
 
-.. currentmodule:: pywhy_stats
+.. currentmodule:: pywhy_stats.independence
 .. autosummary::
 :toctree: generated/
 

doc/conditional_independence.rst

Lines changed: 17 additions & 26 deletions
@@ -80,20 +80,21 @@ various proposals in the literature for estimating CMI, which we summarize here:
 estimating :math:`P(y|x)` and :math:`P(y|x,z)`, which can be used as plug-in estimates
 to the equation for CMI.
 
-:mod:`pywhy_stats.fisherz` Partial (Pearson) Correlation
---------------------------------------------------------
+:mod:`pywhy_stats.independence.fisherz` Partial (Pearson) Correlation
+---------------------------------------------------------------------
 Partial correlation based on the Pearson correlation is equivalent to CMI in the setting
 of normally distributed data. Computing partial correlation is fast and efficient and
 thus attractive to use. However, this **relies on the assumption that the variables are Gaussiany**,
 which may be unrealistic in certain datasets.
 
+.. currentmodule:: pywhy_stats.independence
 .. autosummary::
 :toctree: generated/
 
 fisherz
 
-:mod:`pywhy_stats.power_divergence` Discrete, Categorical and Binary Data
--------------------------------------------------------------------------
+:mod:`pywhy_stats.independence.power_divergence` Discrete, Categorical and Binary Data
+--------------------------------------------------------------------------------------
 If one has discrete data, then the test to use is based on Chi-square tests. The :math:`G^2`
 class of tests will construct a contingency table based on the number of levels across
 each discrete variable. An exponential amount of data is needed for increasing levels
@@ -104,8 +105,8 @@ for a discrete variable.
 
 power_divergence
 
-Kernel-Approaches
------------------
+:mod:`pywhy_stats.independence.kci` Kernel-Approaches
+-----------------------------------------------------
 Kernel independence tests are statistical methods used to determine if two random variables are independent or
 conditionally independent. One such test is the Hilbert-Schmidt Independence Criterion (HSIC), which examines the
 independence between two random variables, X and Y. HSIC employs kernel methods and, more specifically, it computes
@@ -125,6 +126,12 @@ Kernel-based tests are attractive for many applications, since they are semi-par
 that have been shown to be robust in the machine-learning field. For more information, see :footcite:`Zhang2011`.
 
 
+.. currentmodule:: pywhy_stats.independence
+.. autosummary::
+:toctree: generated/
+
+kci
+
 Classifier-based Approaches
 ---------------------------
 Another suite of approaches that rely on permutation testing is the classifier-based approach.
@@ -144,9 +151,9 @@ helps maintain dependence between (X, Z) and (Y, Z) (if it exists), but generate
 conditionally independent dataset.
 
 
-=======================
-Conditional Discrepancy
-=======================
+=========================================
+Conditional Distribution 2-Sample Testing
+=========================================
 
 .. currentmodule:: pywhy_stats
 
@@ -170,23 +177,7 @@ indices of the distribution, one can convert the CD test:
 :math:`P_{i=j}(y|x) =? P_{i=k}(y|x)` into the CI test :math:`P(y|x,i) = P(y|x)`, which can
 be tested with the Chi-square CI tests.
 
-Kernel-Approaches
------------------
-Kernel-based tests are attractive since they are semi-parametric and use kernel-based ideas
-that have been shown to be robust in the machine-learning field. The Kernel CD test is a test
-that computes a test statistic from kernels of the data and uses a weighted permutation testing
-based on the estimated propensity scores to generate samples from the null distribution
-:footcite:`Park2021conditional`, which are then used to estimate a pvalue.
-
-
-Bregman-Divergences
--------------------
-The Bregman CD test is a divergence-based test
-that computes a test statistic from estimated Von-Neumann divergences of the data and uses a
-weighted permutation testing based on the estimated propensity scores to generate samples from the null distribution
-:footcite:`Yu2020Bregman`, which are then used to estimate a pvalue.
-
 ==========
 References
 ==========
-.. footbibliography::
+.. footbibliography::
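
Reviewer note on the Fisher-z section touched in this file: the text says partial correlation is cheap to compute but relies on Gaussian data. The sketch below illustrates that idea with NumPy only. It is not the `pywhy_stats.independence.fisherz` implementation, whose signature is not shown in this diff; the function name `fisherz_sketch` and its arguments are hypothetical.

```python
import math

import numpy as np


def fisherz_sketch(x, y, z=None):
    """Hypothetical helper: Fisher-z test of X independent of Y given Z.

    Assumes jointly Gaussian data. Returns (statistic, two-sided p-value).
    """
    columns = [x, y] if z is None else [x, y, z]
    data = np.column_stack(columns)
    n_samples, n_columns = data.shape
    n_conditioning = n_columns - 2

    # The partial correlation of the first two columns given the rest
    # can be read off the inverse of the correlation matrix.
    precision = np.linalg.pinv(np.corrcoef(data, rowvar=False))
    r = -precision[0, 1] / math.sqrt(precision[0, 0] * precision[1, 1])
    r = min(max(r, -0.999999), 0.999999)  # keep the log below finite

    # Fisher z-transform; under the null the scaled value is ~ N(0, 1).
    fisher_z = 0.5 * math.log((1.0 + r) / (1.0 - r))
    statistic = math.sqrt(n_samples - n_conditioning - 3) * abs(fisher_z)

    # Two-sided normal tail probability: 2 * (1 - Phi(s)) = erfc(s / sqrt(2)).
    pvalue = math.erfc(statistic / math.sqrt(2.0))
    return statistic, pvalue


# Synthetic example: X and Y share the common cause Z and nothing else.
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + 0.5 * rng.normal(size=2000)
y = z + 0.5 * rng.normal(size=2000)

stat_marginal, p_marginal = fisherz_sketch(x, y)
stat_conditional, p_conditional = fisherz_sketch(x, y, z)
print(f"marginal:    stat={stat_marginal:.2f}, p={p_marginal:.3g}")
print(f"conditional: stat={stat_conditional:.2f}, p={p_conditional:.3g}")
```

On this synthetic data the marginal test rejects independence decisively, while conditioning on `z` removes most of the association. With non-Gaussian data the normal approximation behind the p-value may no longer hold, which is the caveat the documentation raises.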

doc/conf.py

Lines changed: 1 addition & 3 deletions
@@ -39,7 +39,7 @@
 
 # If your documentation needs a minimal Sphinx version, state it here.
 #
-needs_sphinx = "4.0"
+needs_sphinx = "5.0"
 
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
@@ -146,9 +146,7 @@
 "PValueResult": "pywhy_stats.pvalue_result.PValueResult",
 # numpy
 "NDArray": "numpy.ndarray",
-# "ArrayLike": "numpy.typing.ArrayLike",
 "ArrayLike": ":term:`array_like`",
-"fisherz": "pywhy_stats.fisherz",
 }
 
 autodoc_typehints_format = "short"

doc/whats_new/v0.1.rst

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ Version 0.1
 Changelog
 ---------
 
-- |Feature| Implement partial correlation test :func:`pywhy_stats.fisherz`, by `Adam Li`_ (:pr:`7`)
+- |Feature| Implement partial correlation test :func:`pywhy_stats.independence.fisherz`, by `Adam Li`_ (:pr:`7`)
 - |Feature| Add (un)conditional kernel independence test by `Patrick Blöbaum`_, co-authored by `Adam Li`_ (:pr:`14`)
 - |Feature| Add categorical independence tests by `Adam Li`_, (:pr:`18`)
 
