@@ -165,13 +165,13 @@ old adage that correlation does not imply causation.
165165A key step in causal inference is **causal identification**, during which it is
166166determined whether it is theoretically possible to estimate a causal effect from
167167available data, given prior knowledge about relationships between variables and
168- a causal query, such as a :
168+ a causal query, such as:
169169
170- 1. **Interventional Query**, which asks: _what would happen if we intervene?_
171- For example, what would be the average effect if everyone received treatment?
172- 2. **Counterfactual Query**, which asks: _what would have happened to specific
173- individuals in an alternative scenario?_ For example, would a given patient,
174- who did recover, have recovered anyway without treatment?
170+ 1. **Interventional Query**, which asks: _what will happen if we intervene?_ For
171+ example, what would be the average effect if everyone received treatment?
172+ 2. **Counterfactual Query**, which asks: _what would have happened had we done
173+ something different?_ For example, would a given patient, who recovered after
174+ receiving treatment, have recovered anyway without treatment?
1751753. **Transportability Query**, which asks: _can causal findings from one
176176 population be validly applied to another, and if so, how can evidence from
177177 multiple studies or populations be combined to draw conclusions about a
@@ -183,11 +183,13 @@ queries to data from (randomized) controlled trials, observational studies, or
183183mixtures thereof. $Y_0$ focuses on the qualitative investigation of causation,
184184helping researchers determine _whether_ a causal relationship can be estimated
185185from available data before attempting to estimate _how strong_ that relationship
186- is. $Y_0$ provides a domain-specific language for expressing causal queries,
187- tools for representing and manipulating graphical causal models that represent
188- prior knowledge about either single or multiple populations, and implementations
189- of numerous identification algorithms from the recent causal inference
190- literature.
186+ is. Furthermore, $Y_0$ provides guidance on how to transform the causal query
187+ into a symbolic estimand that can be non-parametrically estimated from the
188+ available data. It also provides a domain-specific language for representing
189+ causal queries and estimands as symbolic probabilistic expressions, tools for
190+ representing causal graphical models with unobserved confounders, such as
191+ acyclic directed mixed graphs (ADMGs), and implementations of numerous
192+ identification algorithms from the recent causal inference literature.
191193
192194# Statement of Need
193195
@@ -228,11 +230,15 @@ future algorithms and workflows.
228230**Probabilistic Expressions** $Y_0$ implements an internal domain-specific
229231language that can capture variables, counterfactual variables, population
230232variables, and probabilistic expressions in which they appear. It covers the
231- three levels of Pearl's Causal Hierarchy [@bareinboim2022], including the
232- probability of sufficient causation $P(Y_X \mid X^*, Y^*)$, necessary causation
233- $P(Y^*_{X^*} \mid X, Y)$, and necessary and sufficient causation
234- $P(Y_X, Y^*_{X^*})$. Expressions can be converted to SymPy [@meurer2017sympy] or
235- LaTeX expressions and be rendered in Jupyter notebooks.
233+ three levels of Pearl's Causal Hierarchy [@bareinboim2022], including
234+ association $P(Y=y \mid X=x^\ast)$, represented as
235+ \texttt{P(Y | \textasciitilde X)}, interventions
236+ $P_{do(X=x^\ast)}(Y=y, Z=z)$, represented as
237+ \texttt{P[\textasciitilde X](Y, Z)}, and counterfactuals
238+ $P(Y_{do(X=x^\ast)}=y^\ast \mid X=x, Y=y)$, represented as
239+ \texttt{P(\textasciitilde Y @ \textasciitilde X | X, Y)}. Expressions can be
240+ converted to SymPy [@meurer2017sympy] or LaTeX expressions and can be rendered
241+ in Jupyter notebooks.
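
As a rough illustration of how notation like this can be embedded in Python, the toy sketch below uses operator overloading to build and render the three expressions above. It is not $Y_0$'s implementation: the classes `Var`, `Given`, `Probability`, and `_P` are hypothetical names invented for this example, and only the surface syntax (`~`, `@`, `|`, and `P[...]`) mirrors the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Toy variable: `star` marks the alternative value, `do` holds interventions."""

    name: str
    star: bool = False
    do: tuple = ()

    def __invert__(self):  # ~X toggles the starred (alternative) value
        return Var(self.name, not self.star, self.do)

    def __matmul__(self, other):  # Y @ X records an intervention on X
        return Var(self.name, self.star, self.do + (other,))

    def __or__(self, other):  # Y | X attaches the first condition to Y
        return Given(self, (other,))

    def value(self):
        return self.name.lower() + (r"^\ast" if self.star else "")

    def to_latex(self):
        label = self.name
        if self.do:
            label += "_{do(" + ", ".join(v.to_latex() for v in self.do) + ")}"
        return label + "=" + self.value()


@dataclass(frozen=True)
class Given:
    """Intermediate result of `outcome | condition` (hypothetical helper)."""

    outcome: Var
    conditions: tuple


@dataclass(frozen=True)
class Probability:
    outcomes: tuple
    conditions: tuple = ()
    do: tuple = ()

    def to_latex(self):
        sub = "_{do(" + ", ".join(v.to_latex() for v in self.do) + ")}" if self.do else ""
        body = ", ".join(v.to_latex() for v in self.outcomes)
        if self.conditions:
            body += r" \mid " + ", ".join(v.to_latex() for v in self.conditions)
        return "P" + sub + "(" + body + ")"


class _P:
    def __init__(self, do=()):
        self.do = do

    def __getitem__(self, item):  # P[~X] sets interventions for the whole expression
        return _P(item if isinstance(item, tuple) else (item,))

    def __call__(self, *args):
        outcomes, conditions = [], []
        for arg in args:
            if isinstance(arg, Given):  # the bar arrives attached to an outcome
                outcomes.append(arg.outcome)
                conditions.extend(arg.conditions)
            elif conditions:  # every argument after the bar is a condition
                conditions.append(arg)
            else:
                outcomes.append(arg)
        return Probability(tuple(outcomes), tuple(conditions), self.do)


P = _P()
X, Y, Z = Var("X"), Var("Y"), Var("Z")

association = P(Y | ~X)
intervention = P[~X](Y, Z)
counterfactual = P(~Y @ ~X | X, Y)

for expr in (association, intervention, counterfactual):
    print(expr.to_latex())
```

The sketch relies on Python's operator precedence: unary `~` binds tighter than `@`, which binds tighter than `|`, so `~Y @ ~X | X, Y` reaches the call as a conditioned outcome followed by one further condition.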
236242
237243**Data Structure** $Y_0$ builds on NetworkX [@hagberg2008networkx] to implement
238244an (acyclic) directed mixed graph data structure, used in many identification
@@ -255,9 +261,8 @@ Verma constraints [@tian2012verma].
255261algorithms of any causal inference package. It implements `ID`
256262[@shpitser2006id], `IDC` [@shpitser2007idc], `ID*` [@shpitser2012idstar], `IDC*`
257263[@shpitser2012idstar], surrogate outcomes (`TRSO`) [@tikka2019trso], `tian-ID`
258- [@tian2010identifying], transport [@correa2020transport], counterfactual
259- transport [@correa2022cftransport], and identification for causal queries over
260- hierarchical causal models [@weinstein2024hierarchicalcausalmodels].
264+ [@tian2010identifying], transport [@correa2020transport], and counterfactual
265+ transport [@correa2022cftransport].
261266
262267# Case Study
263268
@@ -280,10 +285,13 @@ cigarettes. Therefore, we add a _bidirected_ edge in \autoref{cancer}B.
280285Unfortunately, `ID` cannot produce an estimand for \autoref{cancer}B, which
281286motivates the usage of an alternative algorithm that incorporates observational
282287and/or interventional data. For example, if data from an observational study
283- ($\pi^{\ast}$) and data from an interventional trial on smoking ($\pi_1$) are
284- available, the surrogate outcomes algorithm (`TRSO`) [@tikka2019trso] estimates
285- the effect of smoking on the risk of cancer in \autoref{cancer}B as
286- $\sum_{Tar} P^{\pi^{\ast}}(Cancer | Smoking, Tar) P_{\text{Smoking}}^{\pi_1}(Tar)$.
288+ associating smoking with tar and cancer ($\pi^{\ast}$) and data from a
289+ randomized trial studying the causal effect of smoking on tar buildup in the
290+ lungs ($\pi_1$) are available, the surrogate outcomes algorithm (`TRSO`)
291+ [@tikka2019trso] estimates the effect of smoking on the risk of cancer in
292+ \autoref{cancer}B as
293+ $\sum_{Tar} P^{\pi^{\ast}}(Cancer | Smoking, Tar)
294+ P_{\text{Smoking}}^{\pi_1}(Tar)$.
287295Code and a more detailed description of this case study can be found in the
288296following
289297[Jupyter notebook](https://github.com/y0-causal-inference/y0/blob/main/notebooks/Surrogate%20Outcomes.ipynb).
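
To make the surrogate-outcomes estimand concrete, the sketch below evaluates the sum over `Tar` for binary variables in plain Python. All probabilities are hypothetical values invented for illustration and do not come from the case study or its notebook; the names `p_cancer_given_smoking_tar`, `p_tar_do_smoking`, and `cancer_risk` are likewise assumptions of this example.

```python
# Hypothetical probabilities, invented purely for illustration.
# P^{pi*}(Cancer=1 | Smoking=s, Tar=t) from the observational study, keyed by (s, t).
p_cancer_given_smoking_tar = {
    (1, 1): 0.30,
    (1, 0): 0.10,
    (0, 1): 0.25,
    (0, 0): 0.05,
}
# P^{pi_1}_{Smoking=s}(Tar=t) from the randomized trial, keyed by s, then t.
p_tar_do_smoking = {
    1: {1: 0.9, 0: 0.1},
    0: {1: 0.2, 0: 0.8},
}


def cancer_risk(smoking: int) -> float:
    """Evaluate sum_Tar P(Cancer | Smoking, Tar) * P_Smoking(Tar) for one smoking level."""
    return sum(
        p_cancer_given_smoking_tar[(smoking, tar)] * p_tar_do_smoking[smoking][tar]
        for tar in (0, 1)
    )


# Contrast the two interventions to obtain a risk difference.
risk_difference = cancer_risk(1) - cancer_risk(0)
print(f"{cancer_risk(1):.2f} {cancer_risk(0):.2f} {risk_difference:.2f}")
```

With these made-up numbers, the first factor is read from the observational table and the second from the trial table, which is the division of labor between $\pi^{\ast}$ and $\pi_1$ that the estimand encodes.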
@@ -308,34 +316,39 @@ its further development:
308316# Future Directions
309317
310318There remain several high-value identification algorithms to include in $Y_0$ in
311- the future. For example, the cyclic identification algorithm (`ioID`)
319+ the future. First, the cyclic identification algorithm (`ioID`)
312320[@forré2019causalcalculuspresencecycles] is important for working with more
313321realistic graphs that contain cycles, such as biomolecular signaling
314- pathways, which often contain feedback loops. Further, missing data identification
322+ pathways, which often contain feedback loops. Second, missing data identification
315323algorithms can account for data that is missing not at random (MNAR) by modeling
316- the underlying missingness mechanism [@mohan2021]. Several algorithms noted in
317- the review by @JSSv099i05, such as generalized identification (`gID`)
318- [@lee2019general] and generalized counterfactual identification (`gID*`)
319- [@correa2021counterfactual], can be formulated as special cases of
320- counterfactual transportability. Therefore, we plan to improve the user
321- experience by exposing more powerful algorithms like counterfactual transport
322- through simplified APIs corresponding to special cases like `gID` and `gID*`.
323- Similarly, we plan to implement probabilistic expression simplification
324- [@tikka2017b] to improve the consistency of the estimands output from
325- identification algorithms.
324+ the underlying missingness mechanism [@mohan2021]. Third, algorithms that
325+ provide sufficient conditions for identification in hierarchical causal models
326+ [@weinstein2024hierarchicalcausalmodels] would be useful for supporting causal
327+ identification in probabilistic programming languages, such as ChiRho [@chirho].
328+
329+ Several algorithms noted in the review by @JSSv099i05, such as generalized
330+ identification (`gID`) [@lee2019general] and generalized counterfactual
331+ identification (`gID*`) [@correa2021counterfactual], can be formulated as
332+ special cases of counterfactual transportability. Therefore, we plan to improve
333+ the user experience by exposing more powerful algorithms like counterfactual
334+ transport through simplified APIs corresponding to special cases like `gID`
335+ and `gID*`. Similarly, we plan to implement probabilistic expression
336+ simplification [@tikka2017b] to improve the consistency of the estimands output
337+ from identification algorithms.
326338
327339It remains an open research question how to estimate the causal effect for an
328- arbitrary estimand produced by an algorithm more sophisticated than `ID`. Two
329- potential avenues for overcoming this might be a combination of the Pyro
330- probabilistic programming language [@bingham2018pyro] and its causal inference
331- extension [ChiRho](https://github.com/BasisResearch/chirho). Tractable circuits
332- [@darwiche2022causalinferenceusingtractable] also present a new paradigm for
333- generic estimation. Such a generalization would be a lofty achievement and
334- enable the automation of downstream applications in experimental design.
340+ arbitrary estimand produced by an algorithm more sophisticated than `ID`.
341+ @agrawal2024automated recently demonstrated the automatic generation of an
342+ efficient and robust estimator for causal queries beyond the scope of `ID`
343+ using ChiRho [@chirho], a causal extension of the Pyro probabilistic programming
344+ language [@bingham2018pyro]. Probabilistic circuits
345+ [@darwiche2022causalinferenceusingtractable; @wang2023tractable] also present a
346+ new paradigm for tractable causal estimation. Such a generalization would enable
347+ the automation of downstream applications in experimental design.
335348
336349# Availability and Usage
337350
338- `y0` is available as a package on [PyPI](https://pypi.org/project/y0) with the
351+ $Y_0$ is available as a package on [PyPI](https://pypi.org/project/y0) with the
339352source code available at
340353[https://github.com/y0-causal-inference/y0](https://github.com/y0-causal-inference/y0)
341354under a BSD 3-clause license, archived to Zenodo at