
Replication of DoWhy backdoor.linear_regression with multiple treatments and effect modifiers #1367

@Cirokun

Description

Is there a way to get the regression model estimated by DoWhy? I'm asking because I can't replicate it exactly with the formula below, where there are multiple treatments and effect modifiers:

Y = T1 + T2 + W + T1:Z + T2:Z

Where:

  • Y: outcome
  • T1 and T2: treatments
  • W: confounder
  • Z: effect modifier

I generated this example to show what I mean when I say the regression estimates just don't match exactly:

# --- Imports ---
import numpy as np
import pandas as pd
import statsmodels.api as sm
from dowhy import CausalModel

# --- 1. Synthetic data ---
np.random.seed(42)
n = 1000
Z = np.random.normal(0, 1, n)
W = np.random.normal(0, 1, n)
p1 = 1 / (1 + np.exp(-0.5 * W))
p2 = 1 / (1 + np.exp(-0.3 * W + 0.2 * np.random.normal(size=n)))
T1 = np.random.binomial(1, p1)
T2 = np.random.binomial(1, p2)
Y = (
    2
    + 2*T1
    + 1.5*T2
    + 1.5*(T1*Z)
    + 0.5*(T2*Z)
    + 0.8*W
    + np.random.normal(0, 1, n)
)
data = pd.DataFrame({"Y": Y, "T1": T1, "T2": T2, "Z": Z, "W": W})

# --- 2. DoWhy model ---
model = CausalModel(
    data=data,
    treatment=["T1", "T2"],
    outcome="Y",
    common_causes=["W"],
    effect_modifiers=["Z"],
)
identified_estimand = model.identify_effect()

estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
)

# --- 3. Fit equivalent model manually ---
data["T1_Z"] = data["T1"] * data["Z"]
data["T2_Z"] = data["T2"] * data["Z"]
X = sm.add_constant(data[["T1", "T2", "W", "Z", "T1_Z", "T2_Z"]])
ols = sm.OLS(data["Y"], X).fit()
print("\nOLS params:\n", ols.params)

# --- 4. Prediction-based ATEs (g-computation) ---
def make_design(df, T1_value=None, T2_value=None):
    df_copy = df.copy()
    if T1_value is not None:
        df_copy["T1"] = T1_value
    if T2_value is not None:
        df_copy["T2"] = T2_value
    df_copy["T1_Z"] = df_copy["T1"] * df_copy["Z"]
    df_copy["T2_Z"] = df_copy["T2"] * df_copy["Z"]
    X_ = sm.add_constant(df_copy[["T1", "T2", "W", "Z", "T1_Z", "T2_Z"]], has_constant="add")
    X_ = X_[ols.model.exog_names]
    return X_

# ATE for T1
X_t1_1 = make_design(data, T1_value=1)
X_t1_0 = make_design(data, T1_value=0)
pred_t1_1 = ols.predict(X_t1_1)
pred_t1_0 = ols.predict(X_t1_0)
ate_T1 = (pred_t1_1 - pred_t1_0).mean()

# ATE for T2
X_t2_1 = make_design(data, T2_value=1)
X_t2_0 = make_design(data, T2_value=0)
pred_t2_1 = ols.predict(X_t2_1)
pred_t2_0 = ols.predict(X_t2_0)
ate_T2 = (pred_t2_1 - pred_t2_0).mean()

print("\nDoWhy estimate.value:", estimate.value)
print("Prediction-based ATEs (statsmodels):", ate_T1+ate_T2)

This code generates this output:

OLS params:
 const    1.990098
T1       1.973336
T2       1.511018
W        0.823388
Z       -0.007988
T1_Z     1.507216
T2_Z     0.589484
dtype: float64

DoWhy estimate.value: 3.524925884282054
Sum of prediction-based ATEs (statsmodels): 3.5248872024993263
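
For reference, the manual fit above can also be written with the statsmodels formula API; this is just the same design matrix expressed as a formula string, not anything taken from DoWhy's internals:

import statsmodels.formula.api as smf

# Same regressors as the manual sm.OLS fit above, written as a patsy formula.
ols_formula = smf.ols("Y ~ T1 + T2 + W + Z + T1:Z + T2:Z", data=data).fit()
print(ols_formula.params)  # should match ols.params from the manual fit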

With real data, this discrepancy gets even bigger and starts behaving strangely. Is there a way to replicate the regression DoWhy runs with statsmodels, or to get the fitted regression model itself, so that I can run residual analysis and the other usual checks one would do in regression modelling?
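
Something along these lines is what I'd like to be able to do; the .estimator and .model attributes below are only my guess at where the fitted statsmodels results might live, so treat it as a sketch rather than working code:

# Hypothetical sketch: these attribute names are assumptions, not documented DoWhy API.
estimator = estimate.estimator   # assumed: the estimator object behind the estimate
fitted = estimator.model         # assumed: the fitted statsmodels results object
print(fitted.summary())          # coefficients, standard errors, R^2, ...
residuals = fitted.resid         # residuals for the usual diagnostic checks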
