Add details on choosing first stage models (#372)

Miruna Oprescu · web-flow · commit fb3484615c5b · 2021-01-15T13:14:45.000-05:00
* Add details on choosing first stage models

* Modified docs, README and added a new notebook
diff --git a/README.md b/README.md
@@ -31,6 +31,7 @@ For information on use cases and background material on causal inference and het
   - [Usage Examples](#usage-examples)
     - [Estimation Methods](#estimation-methods)
     - [Interpretability](#interpretability)
+    - [Causal Model Selection and Cross-Validation](#causal-model-selection-and-cross-validation)
     - [Inference](#inference)
 - [For Developers](#for-developers)
   - [Running the tests](#running-the-tests)
@@ -416,6 +417,39 @@ See the <a href="#references">References</a> section for more details.
   mdl, _ = scorer.ensemble([mdl for _, mdl in models])
   ```
 
+</details>
+
+<details>
+  <summary>First Stage Model Selection (click to expand)</summary>
+
+First stage models can be selected either by passing cross-validated models (e.g. `sklearn.linear_model.LassoCV`) to EconML's estimators, or by performing first stage model selection outside of EconML and passing in the selected models. Unless selecting among a large set of hyperparameters, choosing first stage models externally is the preferred method due to statistical and computational advantages.
+
+```Python
+from econml.dml import LinearDML
+from sklearn import clone
+from sklearn.ensemble import RandomForestRegressor
+from sklearn.model_selection import GridSearchCV
+
+cv_model = GridSearchCV(
+              estimator=RandomForestRegressor(),
+              param_grid={
+                  "max_depth": [3, None],
+                  "n_estimators": (10, 30, 50, 100, 200),
+                  "max_features": (2, 4, 6),
+              },
+              cv=5,
+           )
+# First stage model selection within EconML
+# This is more direct, but computationally and statistically less efficient
+est = LinearDML(model_y=cv_model, model_t=cv_model)
+# First stage model selection outside of EconML
+# This is the most efficient, but requires boilerplate code
+model_t = clone(cv_model).fit(W, T).best_estimator_
+model_y = clone(cv_model).fit(W, Y).best_estimator_
+est = LinearDML(model_y=model_y, model_t=model_t)
+```
+
+
 </details>
 
 ### Inference
diff --git a/doc/spec/estimation/dml.rst b/doc/spec/estimation/dml.rst
@@ -430,19 +430,41 @@ Usage FAQs
 
     .. testcode::
 
-        from econml.dml import DML
+        from econml.dml import SparseLinearDML
         from sklearn.ensemble import RandomForestRegressor
         from sklearn.model_selection import GridSearchCV
         first_stage = lambda: GridSearchCV(
                         estimator=RandomForestRegressor(),
                         param_grid={
                                 'max_depth': [3, None],
-                                'n_estimators': (10, 30, 50, 100, 200, 400, 600, 800, 1000),
+                                'n_estimators': (10, 30, 50, 100, 200),
                                 'max_features': (2,4,6)
                             }, cv=10, n_jobs=-1, scoring='neg_mean_squared_error'
                         )
         est = SparseLinearDML(model_y=first_stage(), model_t=first_stage())
 
+    Alternatively, you can pick the best first stage models outside of the EconML framework and pass the selected models to EconML.
+    This can save on runtime and computational resources. Furthermore, it is statistically more stable, since hyperparameters are
+    selected using all of the data rather than a single training fold. E.g.:
+
+    .. testcode::
+
+        from econml.dml import LinearDML
+        from sklearn.ensemble import RandomForestRegressor
+        from sklearn.model_selection import GridSearchCV
+        first_stage = lambda: GridSearchCV(
+                        estimator=RandomForestRegressor(),
+                        param_grid={
+                                'max_depth': [3, None],
+                                'n_estimators': (10, 30, 50, 100, 200),
+                                'max_features': (2,4,6)
+                            }, cv=10, n_jobs=-1, scoring='neg_mean_squared_error'
+                        )
+        model_y = first_stage().fit(X, Y).best_estimator_
+        model_t = first_stage().fit(X, T).best_estimator_
+        est = LinearDML(model_y=model_y, model_t=model_t)
+
+
 - **How do I select the hyperparameters of the final model (if any)?**
 
     You can use cross-validated classes for the final model too. Our default debiased lasso performs cross validation
diff --git a/doc/spec/estimation/dr.rst b/doc/spec/estimation/dr.rst
@@ -431,6 +431,39 @@ Usage FAQs
         est.fit(y, T, X=X, W=W)
         point = est.effect(X, T0=T0, T1=T1)
 
+    Alternatively, you can pick the best first stage models outside of the EconML framework and pass the selected models to EconML.
+    This can save on runtime and computational resources. Furthermore, it is statistically more stable, since hyperparameters are
+    selected using all of the data rather than a single training fold. E.g.:
+
+    .. testcode::
+
+        from econml.drlearner import DRLearner
+        from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
+        from sklearn.model_selection import GridSearchCV
+        model_reg = lambda: GridSearchCV(
+                        estimator=RandomForestRegressor(),
+                        param_grid={
+                                'max_depth': [3, None],
+                                'n_estimators': (10, 50, 100)
+                            }, cv=5, n_jobs=-1, scoring='neg_mean_squared_error'
+                        )
+        model_clf = lambda: GridSearchCV(
+                        estimator=RandomForestClassifier(min_samples_leaf=10),
+                        param_grid={
+                                'max_depth': [3, None],
+                                'n_estimators': (10, 50, 100)
+                            }, cv=5, n_jobs=-1, scoring='neg_log_loss'
+                        )
+        XW = np.hstack([X, W])
+        model_regression = model_reg().fit(XW, y).best_estimator_
+        model_propensity = model_clf().fit(XW, T).best_estimator_
+        est = DRLearner(model_regression=model_regression, 
+                        model_propensity=model_propensity,
+                        model_final=model_regression, cv=5)
+        est.fit(y, T, X=X, W=W)
+        point = est.effect(X, T0=T0, T1=T1)
+
+
 - **What if I have many treatments?**
 
     The method allows for multiple discrete (categorical) treatments and will estimate a CATE model for each treatment.
diff --git a/notebooks/Choosing First Stage Models.ipynb b/notebooks/Choosing First Stage Models.ipynb