Commit cefdedf

djinnome and cthoyt authored
JZ paper updates (#309)
Closes #308 --------- Co-authored-by: Charles Tapley Hoyt <[email protected]>
1 parent 082cdfb commit cefdedf

2 files changed (+86, -43 lines changed)

paper/paper.bib

Lines changed: 30 additions & 0 deletions
@@ -456,3 +456,33 @@ @article{tikka2017b
 volume = {18},
 }

+@inproceedings{agrawal2024automated,
+author = {Agrawal, Raj and Witty, Sam and Zane, Andy and Bingham, Eli},
+editor = {Globerson, A. and Mackey, L. and Belgrave, D. and Fan, A. and Paquet, U. and Tomczak, J. and Zhang, C.},
+publisher = {Curran Associates, Inc.},
+url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/1d10fe211f5139de49f94c6f0c7cecbe-Paper-Conference.pdf},
+booktitle = {Advances in Neural Information Processing Systems},
+date = {2024},
+pages = {16102--16132},
+title = {Automated Efficient Estimation using Monte Carlo Efficient Influence Functions},
+volume = {37},
+}
+
+@thesis{wang2023tractable,
+author = {Wang, Benjie},
+url = {https://ora.ox.ac.uk/objects/uuid:2fafc463-3a9f-40cb-a48c-e33272c691b8},
+date = {2023},
+title = {Tractable Probabilistic Models for Causal Learning and Reasoning},
+type = {Doctoral Dissertation},
+urldate = {2024-01-16},
+}
+
+@misc{chirho,
+author = {Bingham, Eli and Witty, Sam},
+publisher = {GitHub},
+date = {2025},
+howpublished = {\url{https://github.com/BasisResearch/chirho}},
+journaltitle = {GitHub repository},
+title = {Causal Reasoning with ChiRho},
+}
+

paper/paper.md

Lines changed: 56 additions & 43 deletions
@@ -165,13 +165,13 @@ old adage that correlation does not imply causation.
 A key step in causal inference is **causal identification** during which it is
 determined whether it is theoretically possible to estimate a causal effect from
 available data, given prior knowledge about relationships between variables and
-a causal query, such as a:
+a causal query, such as:

-1. **Interventional Query**, which asks: _what would happen if we intervene?_
-For example, what would be the average effect if everyone received treatment?
-2. **Counterfactual Query**, which asks: _what would have happened to specific
-individuals in an alternative scenario?_ For example, would a given patient,
-who did recover, have recovered anyway without treatment?.
+1. **Interventional Query**, which asks: _what will happen if we intervene?_ For
+example, what would be the average effect if everyone received treatment?
+2. **Counterfactual Query**, which asks: _what would have happened had we done
+something different?_ For example, would a given patient, who recovered after
+receiving treatment, have recovered anyway without treatment?.
 3. **Transportability Query**, which asks: _can causal findings from one
 population be validly applied to another, and if so, how can evidence from
 multiple studies or populations be combined to draw conclusions about a
@@ -183,11 +183,13 @@ queries to data from (randomized) controlled trials, observational studies, or
 mixtures thereof. $Y_0$ focuses on the qualitative investigation of causation,
 helping researchers determine _whether_ a causal relationship can be estimated
 from available data before attempting to estimate _how strong_ that relationship
-is. $Y_0$ provides a domain-specific language for expressing causal queries,
-tools for representing and manipulating graphical causal models that represent
-prior knowledge about either single or multiple populations, and implementations
-of numerous identification algorithms from the recent causal inference
-literature.
+is. Furthermore, $Y_0$ provides guidance on how to transform the causal query
+into a symbolic estimand that can be non-parametrically estimated from the
+available data. $Y_0$ provides a domain-specific language for representing
+causal queries and estimands as symbolic probabilistic expressions, tools for
+representing causal graphical models with unobserved confounders, such as
+acyclic directed mixed graphs (ADMGs), and implementations of numerous
+identification algorithms from the recent causal inference literature.

 # Statement of Need

@@ -228,11 +230,15 @@ future algorithms and workflows.
 **Probabilistic Expressions** $Y_0$ implements an internal domain-specific
 language that can capture variables, counterfactual variables, population
 variables, and probabilistic expressions in which they appear. It covers the
-three levels of Pearl's Causal Hierarchy [@bareinboim2022], including the
-probability of sufficient causation $P(Y_X \mid X^*, Y^*)$, necessary causation
-$P(Y^*_{X^*} \mid X, Y)$, and necessary and sufficient causation
-$P(Y_X, Y^*_{X^*})$. Expressions can be converted to SymPy [@meurer2017sympy] or
-LaTeX expressions and be rendered in Jupyter notebooks.
+three levels of Pearl's Causal Hierarchy [@bareinboim2022], including
+association $P(Y=y \mid
+X=x^\ast)$, represented as \texttt{P(Y | \textasciitilde
+X)}, interventions $P_{do(X=x^\ast)}(Y=y, Z=z)$, represented as
+\texttt{P[\textasciitilde X](Y, Z)} and counterfactuals
+$P(Y_{do(X=x^\ast)}=y^\ast\mid X=x, Y=y)$, represented as
+\texttt{P(\textasciitilde Y @ \textasciitilde X | X, Y)}. Expressions can be
+converted to SymPy [@meurer2017sympy] or LaTeX expressions and can be rendered
+in Jupyter notebooks.

 **Data Structure** $Y_0$ builds on NetworkX [@hagberg2008networkx] to implement
 an (acyclic) directed mixed graph data structure, used in many identification
@@ -255,9 +261,8 @@ Verma constraints [@tian2012verma].
 algorithms of any causal inference package. It implements `ID`
 [@shpitser2006id], `IDC` [@shpitser2007idc], `ID*` [@shpitser2012idstar], `IDC*`
 [@shpitser2012idstar], surrogate outcomes (`TRSO`) [@tikka2019trso], `tian-ID`
-[@tian2010identifying], transport [@correa2020transport], counterfactual
-transport [@correa2022cftransport], and identification for causal queries over
-hierarchical causal models [@weinstein2024hierarchicalcausalmodels].
+[@tian2010identifying], transport [@correa2020transport], and counterfactual
+transport [@correa2022cftransport].

 # Case Study

@@ -280,10 +285,13 @@ cigarettes. Therefore, we add a _bidirected_ edge in \autoref{cancer}B.
 Unfortunately, `ID` can not produce an estimand for \autoref{cancer}B, which
 motivates the usage of an alternative algorithm that incorporates observational
 and/or interventional data. For example, if data from an observational study
-($\pi^{\ast}$) and data from an interventional trial on smoking ($\pi_1$) are
-available, the surrogate outcomes algorithm (`TRSO`) [@tikka2019trso] estimates
-the effect of smoking on the risk of cancer in \autoref{cancer}B as
-$\sum_{Tar} P^{\pi^{\ast}}(Cancer | Smoking, Tar) P_{\text{Smoking}}^{{\pi_1}}(Tar)$.
+associating smoking with tar and cancer ($\pi^{\ast}$) and data from a
+randomized trial studying the causal effect of smoking on tar buildup in the
+lungs ($\pi_1$) are available, the surrogate outcomes algorithm (`TRSO`)
+[@tikka2019trso] estimates the effect of smoking on the risk of cancer in
+\autoref{cancer}B as
+$\sum_{Tar} P^{\pi^{\ast}}(Cancer |
+Smoking, Tar) P_{\text{Smoking}}^{{\pi_1}}(Tar)$.
 Code and a more detailed description of this case study can be found in the
 following
 [Jupyter notebook](https://github.com/y0-causal-inference/y0/blob/main/notebooks/Surrogate%20Outcomes.ipynb).
@@ -308,34 +316,39 @@ its further development:
 # Future Directions

 There remain several high value identification algorithms to include in $Y_0$ in
-the future. For example, the cyclic identification algorithm (`ioID`)
+the future. First, the cyclic identification algorithm (`ioID`)
 [@forré2019causalcalculuspresencecycles] is important to work with more
 realistic graphs that contain cycles, such as how biomolecular signaling
-pathways often contain feedback loops. Further, missing data identification
+pathways often contain feedback loops. Second, Missing data identification
 algorithms can account for data that is missing not at random (MNAR) by modeling
-the underlying missingness mechanism [@mohan2021]. Several algorithms noted in
-the review by @JSSv099i05, such as generalized identification (`gID`)
-[@lee2019general] and generalized counterfactual identification (`gID*`)
-[@correa2021counterfactual], can be formulated as special cases of
-counterfactual transportability. Therefore, we plan to improve the user
-experience by exposing more powerful algorithms like counterfactual transport
-through a simplified APIs corresponding to special cases like `gID` and `gID*`.
-Similarly, we plan to implement probabilistic expression simplification
-[@tikka2017b] to improve the consistency of the estimands output from
-identification algorithms.
+the underlying missingness mechanism [@mohan2021]. Third, algorithms that
+provide sufficient conditions for identification in hierarchical causal models
+[@weinstein2024hierarchicalcausalmodels] would be useful for supporting causal
+identification in probabilistic programming languages, such as ChiRho [@chirho].
+
+Several algorithms noted in the review by @JSSv099i05, such as generalized
+identification (`gID`) [@lee2019general] and generalized counterfactual
+identification (`gID*`) [@correa2021counterfactual], can be formulated as
+special cases of counterfactual transportability. Therefore, we plan to improve
+the user experience by exposing more powerful algorithms like counterfactual
+transport through a simplified APIs corresponding to special cases like `gID`
+and `gID*`. Similarly, we plan to implement probabilistic expression
+simplification [@tikka2017b] to improve the consistency of the estimands output
+from identification algorithms.

 It remains an open research question how to estimate the causal effect for an
-arbitrary estimand produced by an algorithm more sophisticated than `ID`. Two
-potential avenues for overcoming this might be a combination of the Pyro
-probabilistic programming langauge [@bingham2018pyro] and its causal inference
-extension [ChiRho](https://github.com/BasisResearch/chirho). Tractable circuits
-[@darwiche2022causalinferenceusingtractable] also present a new paradigm for
-generic estimation. Such a generalization would be a lofty achievement and
-enable the automation of downstream applications in experimental design.
+arbitrary estimand produced by an algorithm more sophisticated than `ID`.
+@agrawal2024automated recently demonstrated automatically generating an
+efficient and robust estimator for causal queries more sophisticated than `ID`
+using ChiRho [@chirho], a causal extension of the Pyro probabilistic programming
+language [@bingham2018pyro]. Probabilistic circuits
+[@darwiche2022causalinferenceusingtractable; @wang2023tractable] also present a
+new paradigm for tractable causal estimation. Such a generalization would enable
+the automation of downstream applications in experimental design.

 # Availability and Usage

-`y0` is available as a package on [PyPI](https://pypi.org/project/y0) with the
+$Y_0$ is available as a package on [PyPI](https://pypi.org/project/y0) with the
 source code available at
 [https://github.com/y0-causal-inference/y0](https://github.com/y0-causal-inference/y0)
 under a BSD 3-clause license, archived to Zenodo at
