Is there a way to retrieve the model estimated by DoWhy? I'm asking because I can't replicate it exactly from the formula given when there are multiple treatments and effect modifiers:
Y = T1 + T2 + W + T1:Z + T2:Z
Where:
- Y: outcome
- T1 and T2: treatments
- W: confounder
- Z: effect modifier
I generated this example to show what I mean when I say the regression estimates don't match exactly:
# --- Imports ---
import numpy as np
import pandas as pd
import statsmodels.api as sm
from dowhy import CausalModel
# --- 1. Synthetic data ---
np.random.seed(42)
n = 1000
Z = np.random.normal(0, 1, n)
W = np.random.normal(0, 1, n)
p1 = 1 / (1 + np.exp(-0.5 * W))
p2 = 1 / (1 + np.exp(-0.3 * W + 0.2 * np.random.normal(size=n)))
T1 = np.random.binomial(1, p1)
T2 = np.random.binomial(1, p2)
Y = (
    2
    + 2*T1
    + 1.5*T2
    + 1.5*(T1*Z)
    + 0.5*(T2*Z)
    + 0.8*W
    + np.random.normal(0, 1, n)
)
data = pd.DataFrame({"Y": Y, "T1": T1, "T2": T2, "Z": Z, "W": W})
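As a sanity check on the numbers below: the structural equation above implies an individual-level effect of 2 + 1.5·Z for T1 and 1.5 + 0.5·Z for T2, so averaging over the sampled Z the true ATEs should be close to 2 and 1.5, and their sum close to 3.5. A quick standalone check (re-drawing Z with the same seed as above):

```python
import numpy as np

np.random.seed(42)
n = 1000
Z = np.random.normal(0, 1, n)  # identical to the Z drawn in the script above

# Average the individual-level treatment effects implied by the structural equation
true_ate_T1 = (2.0 + 1.5 * Z).mean()  # close to 2, since E[Z] = 0
true_ate_T2 = (1.5 + 0.5 * Z).mean()  # close to 1.5
print(true_ate_T1, true_ate_T2, true_ate_T1 + true_ate_T2)
```

Both the DoWhy estimate and the manual g-computation below land near this 3.5 target, so the disagreement between them is small here.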
# --- 2. DoWhy model ---
model = CausalModel(
    data=data,
    treatment=["T1", "T2"],
    outcome="Y",
    common_causes=["W"],
    effect_modifiers=["Z"],
)
identified_estimand = model.identify_effect()
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
)
# --- 3. Fit equivalent model manually ---
data["T1_Z"] = data["T1"] * data["Z"]
data["T2_Z"] = data["T2"] * data["Z"]
X = sm.add_constant(data[["T1", "T2", "W", "Z", "T1_Z", "T2_Z"]])
ols = sm.OLS(data["Y"], X).fit()
print("\nOLS params:\n", ols.params)
# --- 4. Prediction-based ATEs (g-computation) ---
def make_design(df, T1_value=None, T2_value=None):
    df_copy = df.copy()
    if T1_value is not None:
        df_copy["T1"] = T1_value
    if T2_value is not None:
        df_copy["T2"] = T2_value
    df_copy["T1_Z"] = df_copy["T1"] * df_copy["Z"]
    df_copy["T2_Z"] = df_copy["T2"] * df_copy["Z"]
    X_ = sm.add_constant(df_copy[["T1", "T2", "W", "Z", "T1_Z", "T2_Z"]], has_constant="add")
    X_ = X_[ols.model.exog_names]
    return X_
# ATE for T1
X_t1_1 = make_design(data, T1_value=1)
X_t1_0 = make_design(data, T1_value=0)
pred_t1_1 = ols.predict(X_t1_1)
pred_t1_0 = ols.predict(X_t1_0)
ate_T1 = (pred_t1_1 - pred_t1_0).mean()
# ATE for T2
X_t2_1 = make_design(data, T2_value=1)
X_t2_0 = make_design(data, T2_value=0)
pred_t2_1 = ols.predict(X_t2_1)
pred_t2_0 = ols.predict(X_t2_0)
ate_T2 = (pred_t2_1 - pred_t2_0).mean()
print("\nDoWhy estimate.value:", estimate.value)
print("Prediction-based ATEs (statsmodels):", ate_T1+ate_T2)
This code generates this output:
OLS params:
const 1.990098
T1 1.973336
T2 1.511018
W 0.823388
Z -0.007988
T1_Z 1.507216
T2_Z 0.589484
dtype: float64
DoWhy estimate.value: 3.524925884282054
Prediction-based ATEs (statsmodels): 3.5248872024993263
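For what it's worth, in a model that is linear in the treatments the prediction-based contrast collapses to a closed form: ate_T1 = β_T1 + β_{T1:Z} · mean(Z), and likewise for T2. With the coefficients printed above this gives 3.484 + 2.097·Z̄ ≈ 3.525 for Z̄ ≈ 0.019, matching the prediction-based number. A self-contained numpy-only sketch of that identity on toy data (single treatment, hypothetical coefficients; this is not DoWhy internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=n)
W = rng.normal(size=n)
T = rng.binomial(1, 0.5, size=n)
Y = 1.0 + 2.0 * T + 0.8 * W + 1.5 * T * Z + rng.normal(size=n)

# Fit OLS by least squares; columns are [const, T, W, Z, T*Z]
X = np.column_stack([np.ones(n), T, W, Z, T * Z])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

# g-computation: predict under T=1 and T=0 for everyone, average the difference
X1 = np.column_stack([np.ones(n), np.ones(n), W, Z, Z])
X0 = np.column_stack([np.ones(n), np.zeros(n), W, Z, np.zeros(n)])
ate_gcomp = (X1 @ beta - X0 @ beta).mean()

# Closed form for a linear model: T coefficient plus interaction times mean(Z)
ate_closed = beta[1] + beta[4] * Z.mean()
print(ate_gcomp, ate_closed)  # identical up to floating-point error
```

So any residual gap between the DoWhy value and the g-computation value here comes down to which design matrix and averaging DoWhy uses internally, which is why I'd like to see the fitted model.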
With real data, this discrepancy gets even bigger and starts behaving strangely. Is there a way to replicate the regression DoWhy runs with statsmodels, or to get the fitted regression model itself, so that I can do residual analysis and the other usual regression diagnostics?