Commit fb34846

Author: Miruna Oprescu
Add details on choosing first stage models (#372)
* Add details on choosing first stage models
* Modified docs, README and added a new notebook
1 parent ab70f73 commit fb34846

4 files changed: +740 -2 lines changed

README.md

Lines changed: 34 additions & 0 deletions
@@ -31,6 +31,7 @@ For information on use cases and background material on causal inference and het
 - [Usage Examples](#usage-examples)
 - [Estimation Methods](#estimation-methods)
 - [Interpretability](#interpretability)
+- [Causal Model Selection and Cross-Validation](#causal-model-selection-and-cross-validation)
 - [Inference](#inference)
 - [For Developers](#for-developers)
 - [Running the tests](#running-the-tests)
@@ -416,6 +417,39 @@ See the <a href="#references">References</a> section for more details.
 mdl, _ = scorer.ensemble([mdl for _, mdl in models])
 ```

+</details>
+
+<details>
+<summary>First Stage Model Selection (click to expand)</summary>
+
+First stage models can be selected either by passing cross-validated models (e.g. `sklearn.linear_model.LassoCV`) to EconML's estimators or by performing the first stage model selection outside of EconML and passing in the selected model. Unless selecting among a large set of hyperparameters, choosing first stage models externally is the preferred method due to statistical and computational advantages.
+
+```Python
+from econml.dml import LinearDML
+from sklearn import clone
+from sklearn.ensemble import RandomForestRegressor
+from sklearn.model_selection import GridSearchCV
+
+cv_model = GridSearchCV(
+    estimator=RandomForestRegressor(),
+    param_grid={
+        "max_depth": [3, None],
+        "n_estimators": (10, 30, 50, 100, 200),
+        "max_features": (2, 4, 6),
+    },
+    cv=5,
+)
+# First stage model selection within EconML
+# This is more direct, but computationally and statistically less efficient
+est = LinearDML(model_y=cv_model, model_t=cv_model)
+# First stage model selection outside of EconML
+# This is the most efficient, but requires boilerplate code
+model_t = clone(cv_model).fit(W, T).best_estimator_
+model_y = clone(cv_model).fit(W, Y).best_estimator_
+est = LinearDML(model_y=model_y, model_t=model_t)
+```
+
+
 </details>

 ### Inference
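
Note on the README hunk above: the external selection step can be tried without EconML installed. The following is a minimal sketch, assuming synthetic stand-ins for `W`, `T` and `Y` (the README does not define them) and a deliberately reduced grid so it runs quickly. It only shows that `best_estimator_` yields a plain regressor with fixed hyperparameters, which is what would then be handed to `LinearDML` as in the diff.

```python
import numpy as np
from sklearn import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# ASSUMPTION: synthetic stand-ins for the controls W, treatment T and outcome Y.
rng = np.random.default_rng(0)
n = 200
W = rng.normal(size=(n, 6))
T = W[:, 0] + rng.normal(size=n)
Y = 2.0 * T + W[:, 1] + rng.normal(size=n)

# ASSUMPTION: a smaller grid than in the README, chosen only to keep the sketch fast.
cv_model = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={
        "max_depth": [3, None],
        "n_estimators": (10, 30),
        "max_features": (2, 4),
    },
    cv=3,
)

# Run each search once on the full data and keep only the winning estimator.
model_t = clone(cv_model).fit(W, T).best_estimator_
model_y = clone(cv_model).fit(W, Y).best_estimator_

# The selected models are ordinary regressors, not GridSearchCV wrappers.
print(type(model_t).__name__, model_t.get_params()["n_estimators"])
print(type(model_y).__name__, model_y.get_params()["n_estimators"])
```

Using `clone` gives each target its own search object, so the treatment and outcome models can end up with different hyperparameters.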

doc/spec/estimation/dml.rst

Lines changed: 24 additions & 2 deletions
@@ -430,19 +430,41 @@ Usage FAQs

 .. testcode::

-    from econml.dml import DML
+    from econml.dml import SparseLinearDML
     from sklearn.ensemble import RandomForestRegressor
     from sklearn.model_selection import GridSearchCV
     first_stage = lambda: GridSearchCV(
         estimator=RandomForestRegressor(),
         param_grid={
             'max_depth': [3, None],
-            'n_estimators': (10, 30, 50, 100, 200, 400, 600, 800, 1000),
+            'n_estimators': (10, 30, 50, 100, 200),
             'max_features': (2,4,6)
         }, cv=10, n_jobs=-1, scoring='neg_mean_squared_error'
     )
     est = SparseLinearDML(model_y=first_stage(), model_t=first_stage())

+Alternatively, you can pick the best first stage models outside of the EconML framework and pass the selected models to EconML.
+This can save runtime and computational resources. Furthermore, it is statistically more stable, since model selection uses
+all of the data rather than a single fold. E.g.:
+
+.. testcode::
+
+    from econml.dml import LinearDML
+    from sklearn.ensemble import RandomForestRegressor
+    from sklearn.model_selection import GridSearchCV
+    first_stage = lambda: GridSearchCV(
+        estimator=RandomForestRegressor(),
+        param_grid={
+            'max_depth': [3, None],
+            'n_estimators': (10, 30, 50, 100, 200),
+            'max_features': (2,4,6)
+        }, cv=10, n_jobs=-1, scoring='neg_mean_squared_error'
+    )
+    model_y = first_stage().fit(X, Y).best_estimator_
+    model_t = first_stage().fit(X, T).best_estimator_
+    est = LinearDML(model_y=model_y, model_t=model_t)
+
+
 - **How do I select the hyperparameters of the final model (if any)?**

 You can use cross-validated classes for the final model too. Our default debiased lasso performs cross validation
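
Note on the runtime claim in the dml.rst hunk above: a rough count of model fits illustrates where the savings could come from. This is a back-of-the-envelope sketch, not a description of EconML's internals. It assumes that a `GridSearchCV` passed as a first stage model is re-run in full on every cross-fitting fold, and the fold count `crossfit_folds` is a hypothetical value chosen for illustration.

```python
from math import prod

# Grid from the dml.rst example: 2 depths x 5 forest sizes x 3 feature counts.
param_grid = {
    "max_depth": [3, None],
    "n_estimators": (10, 30, 50, 100, 200),
    "max_features": (2, 4, 6),
}
inner_cv = 10        # cv=10 in the GridSearchCV of the example
crossfit_folds = 2   # ASSUMPTION: number of cross-fitting folds in the estimator

n_candidates = prod(len(values) for values in param_grid.values())
fits_per_search = n_candidates * inner_cv + 1  # +1 for the final refit

# Search passed into the estimator: the whole search repeats on every fold.
fits_inside = crossfit_folds * fits_per_search
# Search run once up front: afterwards each fold fits one fixed model.
fits_outside = fits_per_search + crossfit_folds

print(n_candidates, fits_inside, fits_outside)  # 30 602 303
```

These counts are per first stage model, so under the same assumptions the totals double when both `model_y` and `model_t` are tuned.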

doc/spec/estimation/dr.rst

Lines changed: 33 additions & 0 deletions
@@ -431,6 +431,39 @@ Usage FAQs
     est.fit(y, T, X=X, W=W)
     point = est.effect(X, T0=T0, T1=T1)

+Alternatively, you can pick the best first stage models outside of the EconML framework and pass the selected models to EconML.
+This can save runtime and computational resources. Furthermore, it is statistically more stable, since model selection uses
+all of the data rather than a single fold. E.g.:
+
+.. testcode::
+
+    from econml.drlearner import DRLearner
+    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
+    from sklearn.model_selection import GridSearchCV
+    model_reg = lambda: GridSearchCV(
+        estimator=RandomForestRegressor(),
+        param_grid={
+            'max_depth': [3, None],
+            'n_estimators': (10, 50, 100)
+        }, cv=5, n_jobs=-1, scoring='neg_mean_squared_error'
+    )
+    model_clf = lambda: GridSearchCV(
+        estimator=RandomForestClassifier(min_samples_leaf=10),
+        param_grid={
+            'max_depth': [3, None],
+            'n_estimators': (10, 50, 100)
+        }, cv=5, n_jobs=-1, scoring='neg_mean_squared_error'
+    )
+    XW = np.hstack([X, W])
+    model_regression = model_reg().fit(XW, y).best_estimator_
+    model_propensity = model_clf().fit(XW, T).best_estimator_
+    est = DRLearner(model_regression=model_regression,
+                    model_propensity=model_propensity,
+                    model_final=model_regression, cv=5)
+    est.fit(y, T, X=X, W=W)
+    point = est.effect(X, T0=T0, T1=T1)
+
+
 - **What if I have many treatments?**

 The method allows for multiple discrete (categorical) treatments and will estimate a CATE model for each treatment.
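
Note on the propensity model in the dr.rst hunk above: the classifier search there is scored with `'neg_mean_squared_error'`. Since a propensity model is used for its predicted probabilities, a probability-based score such as log loss may be a reasonable alternative. The sketch below swaps in `scoring="neg_log_loss"`; that swap, the synthetic data and the reduced grid are all assumptions of this example, not something the commit prescribes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# ASSUMPTION: synthetic stand-ins for features X, controls W and a binary treatment T.
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 2))
W = rng.normal(size=(n, 3))
T = (X[:, 0] + W[:, 0] + rng.normal(size=n) > 0).astype(int)

# The propensity model is selected on features and controls stacked together.
XW = np.hstack([X, W])

search = GridSearchCV(
    estimator=RandomForestClassifier(min_samples_leaf=10, random_state=0),
    param_grid={"max_depth": [3, None], "n_estimators": (10, 50)},
    cv=3,
    scoring="neg_log_loss",  # ASSUMPTION: replaces the diff's 'neg_mean_squared_error'
)
model_propensity = search.fit(XW, T).best_estimator_

# Predicted treatment probabilities, one column per treatment level.
propensities = model_propensity.predict_proba(XW)
print(propensities.shape)
```

The selected `model_propensity` could then be passed to `DRLearner` in the same way as in the hunk.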

notebooks/Choosing First Stage Models.ipynb

Lines changed: 649 additions & 0 deletions
Large diffs are not rendered by default.
