
Replication of DoWhy backdoor.linear_regression with multiple treatments and effect modifiers #1367

@Cirokun

Description

Is there a way to get the regression model estimated by DoWhy? I'm asking because I can't replicate it exactly with the formula below, where there are multiple treatments and effect modifiers:

Y = T1 + T2 + W + T1:Z + T2:Z

Where:

  • Y: outcome
  • T1 and T2: treatments
  • W: confounder
  • Z: effect modifier

I generated this example to show what I mean when I say the regression estimates just don't match exactly:

# --- Imports ---
import numpy as np
import pandas as pd
import statsmodels.api as sm
from dowhy import CausalModel

# --- 1. Synthetic data ---
np.random.seed(42)
n = 1000
Z = np.random.normal(0, 1, n)
W = np.random.normal(0, 1, n)
p1 = 1 / (1 + np.exp(-0.5 * W))
p2 = 1 / (1 + np.exp(-0.3 * W + 0.2 * np.random.normal(size=n)))
T1 = np.random.binomial(1, p1)
T2 = np.random.binomial(1, p2)
Y = (
    2
    + 2*T1
    + 1.5*T2
    + 1.5*(T1*Z)
    + 0.5*(T2*Z)
    + 0.8*W
    + np.random.normal(0, 1, n)
)
data = pd.DataFrame({"Y": Y, "T1": T1, "T2": T2, "Z": Z, "W": W})

# --- 2. DoWhy model ---
model = CausalModel(
    data=data,
    treatment=["T1", "T2"],
    outcome="Y",
    common_causes=["W"],
    effect_modifiers=["Z"],
)
identified_estimand = model.identify_effect()

estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
)

# --- 3. Fit equivalent model manually ---
data["T1_Z"] = data["T1"] * data["Z"]
data["T2_Z"] = data["T2"] * data["Z"]
X = sm.add_constant(data[["T1", "T2", "W", "Z", "T1_Z", "T2_Z"]])
ols = sm.OLS(data["Y"], X).fit()
print("\nOLS params:\n", ols.params)

# --- 4. Prediction-based ATEs (g-computation) ---
def make_design(df, T1_value=None, T2_value=None):
    df_copy = df.copy()
    if T1_value is not None:
        df_copy["T1"] = T1_value
    if T2_value is not None:
        df_copy["T2"] = T2_value
    df_copy["T1_Z"] = df_copy["T1"] * df_copy["Z"]
    df_copy["T2_Z"] = df_copy["T2"] * df_copy["Z"]
    X_ = sm.add_constant(df_copy[["T1", "T2", "W", "Z", "T1_Z", "T2_Z"]], has_constant="add")
    X_ = X_[ols.model.exog_names]
    return X_

# ATE for T1
X_t1_1 = make_design(data, T1_value=1)
X_t1_0 = make_design(data, T1_value=0)
pred_t1_1 = ols.predict(X_t1_1)
pred_t1_0 = ols.predict(X_t1_0)
ate_T1 = (pred_t1_1 - pred_t1_0).mean()

# ATE for T2
X_t2_1 = make_design(data, T2_value=1)
X_t2_0 = make_design(data, T2_value=0)
pred_t2_1 = ols.predict(X_t2_1)
pred_t2_0 = ols.predict(X_t2_0)
ate_T2 = (pred_t2_1 - pred_t2_0).mean()

print("\nDoWhy estimate.value:", estimate.value)
print("Prediction-based ATEs (statsmodels):", ate_T1+ate_T2)

This code generates this output:

OLS params:
 const    1.990098
T1       1.973336
T2       1.511018
W        0.823388
Z       -0.007988
T1_Z     1.507216
T2_Z     0.589484
dtype: float64

DoWhy estimate.value: 3.524925884282054
Sum of prediction-based ATEs (statsmodels): 3.5248872024993263
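
For reference, the manual fit above can also be written with the statsmodels formula API; this is just the same design matrix expressed as a formula string, not anything taken from DoWhy's internals:

import statsmodels.formula.api as smf

# Same regressors as the manual sm.OLS fit above, written as a patsy formula.
ols_formula = smf.ols("Y ~ T1 + T2 + W + Z + T1:Z + T2:Z", data=data).fit()
print(ols_formula.params)  # should match ols.params from the manual fit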

With real data, this discrepancy gets even bigger and starts behaving strangely. Is there a way to replicate the regression DoWhy runs with statsmodels, or to get the fitted regression model itself, so that I can run residual analysis and the other usual checks one would do in regression modelling?
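
Something along these lines is what I'd like to be able to do; the .estimator and .model attributes below are only my guess at where the fitted statsmodels results might live, so treat it as a sketch rather than working code:

# Hypothetical sketch: these attribute names are assumptions, not documented DoWhy API.
estimator = estimate.estimator   # assumed: the estimator object behind the estimate
fitted = estimator.model         # assumed: the fitted statsmodels results object
print(fitted.summary())          # coefficients, standard errors, R^2, ...
residuals = fitted.resid         # residuals for the usual diagnostic checks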
