**File: docs/src/index.md**
```@meta
CurrentModule = TMLE
```

## Overview

TMLE.jl is a Julia implementation of the Targeted Minimum Loss-Based Estimation ([TMLE](https://link.springer.com/book/10.1007/978-1-4419-9782-1)) framework. If you are interested in leveraging the power of modern machine-learning methods while preserving interpretability and statistical inference guarantees, you are in the right place. TMLE.jl is compatible with any [MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/) compliant algorithm and any dataset respecting the [Tables](https://tables.juliadata.org/stable/) interface.

## Installation
```julia
Pkg> add TMLE
```

To run an estimation procedure, we need 3 ingredients:

### 1. A dataset: here a simulation dataset

For illustration, assume we know the actual data generating process is as follows:
The goal of this package is to provide an entry point for semi-parametric, asymptotically unbiased and efficient estimation in Julia. The two main general estimators known to achieve these properties are the One-Step estimator and the Targeted Maximum-Likelihood estimator. Most of the current effort has centered around estimands that are composites of the counterfactual mean.
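To make this concrete, the Average Treatment Effect can be written as a composition of two counterfactual means (a sketch in standard potential-outcome notation, not taken verbatim from the package documentation):

```math
CM_a = \mathbb{E}[Y(a)], \qquad ATE = CM_1 - CM_0 = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]
```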
Distinguishing Features:

- Estimands: Counterfactual Mean, Average Treatment Effect, Interactions, any composition thereof
- Estimators: TMLE, One-Step, in both canonical and cross-validated versions
- Machine-Learning: any [MLJ](https://alan-turing-institute.github.io/MLJ.jl/stable/) compatible model
- Dataset: any dataset respecting the [Tables](https://tables.juliadata.org/stable/) interface (e.g. [DataFrames.jl](https://dataframes.juliadata.org/stable/))
- Factorial Treatment Variables:
  - Multiple treatments
  - Categorical treatment values
## Citing TMLE.jl
If you use TMLE.jl for your own work and would like to cite us, here are the BibTeX and APA formats:

- BibTeX

```bibtex
@software{Labayle_TMLE_jl,
  author = {Labayle, Olivier and Beentjes, Sjoerd and Khamseh, Ava and Ponting, Chris},
  title = {{TMLE.jl}},
  url = {https://github.com/olivierlabayle/TMLE.jl}
}
```

- APA

Labayle, O., Beentjes, S., Khamseh, A., & Ponting, C. TMLE.jl [Computer software]. https://github.com/olivierlabayle/TMLE.jl
---

**File: docs/src/user_guide/estimation.md**
```@meta
CurrentModule = TMLE
```

# Estimation

## Constructing and Using Estimators
```@setup estimation
using Random
# … dataset simulation elided in this excerpt …
scm = SCM([
    # … structural equations elided in this excerpt …
])
```
Once a statistical estimand has been defined, we can proceed with estimation. There are two semi-parametric efficient estimators in TMLE.jl:

- The Targeted Maximum-Likelihood Estimator (`TMLEE`)
- The One-Step Estimator (`OSE`)

While they have similar asymptotic properties, their finite-sample performance may differ. They also have one distinguishing feature: the TMLE is a plugin estimator, which means it respects the natural bounds of the estimand of interest. In contrast, the OSE may in theory report values outside these bounds. In practice this is rarely the case, and the estimand of interest may not impose any restriction on its domain anyway.

Drawing from the example dataset and `SCM` from the Walk Through section, we can estimate the ATE for `T₁`. Let's use TMLE:
```@example estimation
# … estimator construction and call elided in this excerpt …
result₁
nothing # hide
```

The `cache` (see below) contains estimates for the nuisance functions that were necessary to estimate the ATE. For instance, we can inspect the value of ``\epsilon`` corresponding to the clever covariate:

```@example estimation
ϵ = last_fluctuation_epsilon(cache)
```
The `result₁` structure corresponds to the estimation result and will display the result of a T-Test including:

- A point estimate.
- A 95% confidence interval.
- A p-value (corresponding to the test that the estimand is different from 0).
Both the TMLE and OSE are asymptotically linear estimators, so standard Z/T tests from [HypothesisTests.jl](https://juliastats.org/HypothesisTests.jl/stable/) can be performed, and the `confint` and `pvalue` methods can be used.
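For intuition, asymptotic linearity is what justifies such Wald-type inference: the estimator behaves like a sample mean of its influence curve. Here is a toy sketch with made-up numbers, independent of TMLE.jl's API:

```julia
using Statistics

# Toy stand-in for estimated influence-curve values (illustrative numbers only,
# not TMLE.jl output).
ic = [0.2, -0.1, 0.4, 0.3, -0.2, 0.1, 0.5, 0.0, 0.2, 0.1]

est = mean(ic)                        # toy point estimate
se  = std(ic) / sqrt(length(ic))      # standard error of the mean
lo, hi = est - 1.96se, est + 1.96se   # Wald-type 95% confidence interval
```

The 1.96 quantile gives the usual 95% interval; a Z- or T-test against 0 follows the same logic.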
Let us now turn to the Average Treatment Effect of `T₂`, which we will estimate with an `OSE`:
```@example estimation
Ψ₂ = ATE(
    # … estimand definition elided in this excerpt …
)
nothing # hide
```

Again, required nuisance functions are fitted and stored in the cache.
## Specifying Models

By default, TMLE.jl uses generalized linear models for the estimation of relevant and nuisance factors such as the outcome mean and the propensity score. However, this is not the recommended usage, since the estimators' performance is closely tied to how well these factors are estimated. More sophisticated models can be provided using the `models` keyword argument of each estimator, which is essentially a `NamedTuple` mapping variables' names to their respective models.

Rather than specifying a model for each variable, it may be easier to override the default models using the `default_models` function.

For example, one can override all default models with XGBoost models from `MLJXGBoostInterface`:
```@example estimation
using MLJXGBoostInterface
xgboost_regressor = XGBoostRegressor()
xgboost_classifier = XGBoostClassifier()
models = default_models(
    Q_binary=xgboost_classifier,
    Q_continuous=xgboost_regressor,
    G=xgboost_classifier
)
tmle_gboost = TMLEE(models=models)
```
The advantage of using `default_models` is that it automatically prepends each model with a [ContinuousEncoder](https://alan-turing-institute.github.io/MLJ.jl/dev/transformers/#MLJModels.ContinuousEncoder) to make sure the correct types are passed to the downstream models.

Super Learning ([Stack](https://alan-turing-institute.github.io/MLJ.jl/dev/model_stacking/#Model-Stacking)) as well as variable-specific models can be defined as well. Here is a more customized version:
```@example estimation
lr = LogisticClassifier(lambda=0.)
stack_binary = Stack(
    metalearner=lr,
    xgboost=xgboost_classifier,
    lr=lr
)

models = (
    T₁ = with_encoder(xgboost_classifier), # T₁ with XGBoost prepended with a ContinuousEncoder
    default_models( # For all other variables use the following defaults
        Q_binary=stack_binary,          # A Super Learner
        Q_continuous=xgboost_regressor, # An XGBoost
        # Unspecified G defaults to Logistic Regression
    )...
)

tmle_custom = TMLEE(models=models)
```
Notice that `with_encoder` is simply a shorthand to construct a pipeline with a `ContinuousEncoder`, and that the resulting `models` is simply a `NamedTuple`.
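Since `models` is a plain `NamedTuple`, the pattern of a variable-specific entry followed by splatted defaults can be illustrated with symbols standing in for actual MLJ models (a toy sketch, not TMLE.jl code):

```julia
# Toy stand-in values: symbols replace actual MLJ models.
defaults = (Q_binary = :stack, Q_continuous = :xgboost, G = :logistic)

# A variable-specific entry first, then the defaults splatted in after it.
models = (T₁ = :xgboost_with_encoder, defaults...)

models.T₁    # → :xgboost_with_encoder
models.G     # → :logistic
keys(models) # → (:T₁, :Q_binary, :Q_continuous, :G)
```

Any estimator factor not named in the tuple simply falls back to the package defaults.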
## CV-Estimation
Canonical TMLE/OSE essentially use the dataset twice: once to estimate the nuisance functions and once to estimate the parameter of interest. This means there is a risk of over-fitting and residual bias ([see here](https://arxiv.org/abs/2203.06469) for some discussion). One way to address this limitation is a technique called sample-splitting / cross-validation. To activate the sample-splitting mode, simply provide an `MLJ.ResamplingStrategy` using the `resampling` keyword argument:

```@example estimation
TMLEE(resampling=StratifiedCV());
```

or
```julia
OSE(resampling=StratifiedCV(nfolds=3));
```
There are some practical considerations:

- Choice of `resampling` strategy: the theory behind sample-splitting requires the nuisance functions to be sufficiently well estimated on **each and every** fold. In practice, each fold should contain a sample representative of the dataset. In particular, when the treatment and outcome variables are categorical, it is important to make sure the proportions are preserved. This is typically done using `StratifiedCV`.
- Computational complexity: sample-splitting results in ``K`` fits of the nuisance functions, drastically increasing computational cost. In particular, if the nuisance functions are estimated using (P-fold) Super Learning, this will result in two nested cross-validation loops and ``K \times P`` fits.
- Caching of nuisance functions: because the `resampling` strategy typically needs to preserve the outcome and treatment proportions, very little reuse of cached models is possible (see [Caching Models](@ref)).

## Caching Models
Let's now see how the `cache` can be reused with a new estimand, say the Total Average Treatment Effect of both `T₁` and `T₂`.