Skip to content

Commit 2986496

Browse files
authored
Add a geolift example to the docs (#135)
* add geolift data simulation code * add data to datesets + tests * save the time index with geolift.csv + add example to integration tests * add geolift example + add to docs * suppress progress bar for posterior predictive sampling * geolift notebook updates * add blurb to index.rst and README.md
1 parent 1257e87 commit 2986496

File tree

11 files changed

+1137
-5
lines changed

11 files changed

+1137
-5
lines changed

README.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -91,6 +91,9 @@ This is appropriate when you have multiple units, one of which is treated. You b
9191

9292
> The data (treated and untreated units), pre-treatment model fit, and counterfactual (i.e. the synthetic control) are plotted (top). The causal impact is shown as a blue shaded region. The Bayesian analysis shows shaded Bayesian credible regions of the model fit and counterfactual. Also shown is the causal impact (middle) and cumulative causal impact (bottom).
9393
94+
### Geographical lift (Geolift)
95+
We can also use synthetic control methods to analyse data from geographical lift studies. For example, we can try to evaluate the causal impact of an intervention (e.g. a marketing campaign) run in one geographical area by using control geographical areas which are similar to the intervention area but which did not recieve the specific marketing intervention.
96+
9497
### ANCOVA
9598

9699
This is appropriate for non-equivalent group designs when you have a single pre and post intervention measurement and have a treament and a control group.

causalpy/data/datasets.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@
1515
"rd": {"filename": "regression_discontinuity.csv"},
1616
"sc": {"filename": "synthetic_control.csv"},
1717
"anova1": {"filename": "ancova_generated.csv"},
18+
"geolift1": {"filename": "geolift1.csv"},
1819
}
1920

2021

causalpy/data/geolift1.csv

Lines changed: 209 additions & 0 deletions
Large diffs are not rendered by default.

causalpy/data/simulate_data.py

Lines changed: 72 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -220,3 +220,75 @@ def generate_ancova_data(
220220
post = pre + treatment_effect * group + np.random.normal(size=N) * sigma
221221
df = pd.DataFrame({"group": group, "pre": pre, "post": post})
222222
return df
223+
224+
225+
def generate_geolift_data():
226+
"""Generate synthetic data for a geolift example. This will consists of 6 untreated
227+
countries. The treated unit `Denmark` is a weighted combination of the untreated
228+
units. We additionally specify a treatment effect which takes effect after the
229+
`treatment_time`. The timeseries data is observed at weekly resolution and has
230+
annual seasonality, with this seasonality being a drawn from a Gaussian Process with
231+
a periodic kernel."""
232+
n_years = 4
233+
treatment_time = pd.to_datetime("2022-01-01")
234+
causal_impact = 0.2
235+
236+
def create_series(n=52, amplitude=1, length_scale=2):
237+
return np.tile(
238+
generate_seasonality(n=n, amplitude=amplitude, length_scale=2) + 3, n_years
239+
)
240+
241+
time = pd.date_range(start="2019-01-01", periods=52 * n_years, freq="W")
242+
243+
untreated = [
244+
"Austria",
245+
"Belgium",
246+
"Bulgaria",
247+
"Croatia",
248+
"Cyprus",
249+
"Czech_Republic",
250+
]
251+
252+
df = (
253+
pd.DataFrame({country: create_series() for country in untreated})
254+
.assign(time=time)
255+
.set_index("time")
256+
)
257+
258+
# create treated unit as a weighted sum of the untreated units
259+
weights = np.random.dirichlet(np.ones(len(untreated)), size=1)[0]
260+
df = df.assign(Denmark=np.dot(df[untreated].values, weights))
261+
262+
# add observation noise
263+
for col in untreated + ["Denmark"]:
264+
df[col] += np.random.normal(size=len(df), scale=0.1)
265+
266+
# add treatment effect
267+
df["Denmark"] += np.where(df.index < treatment_time, 0, causal_impact)
268+
return df
269+
270+
271+
# -----------------
272+
# UTILITY FUNCTIONS
273+
# -----------------
274+
275+
276+
def generate_seasonality(n=12, amplitude=1, length_scale=0.5):
277+
"""Generate monthly seasonality by sampling from a Gaussian process with a
278+
Gaussian kernel, using numpy code"""
279+
# Generate the covariance matrix
280+
x = np.linspace(0, 1, n)
281+
x1, x2 = np.meshgrid(x, x)
282+
cov = periodic_kernel(
283+
x1, x2, period=1, length_scale=length_scale, amplitude=amplitude
284+
)
285+
# Generate the seasonality
286+
seasonality = np.random.multivariate_normal(np.zeros(n), cov)
287+
return seasonality
288+
289+
290+
def periodic_kernel(x1, x2, period=1, length_scale=1, amplitude=1):
291+
"""Generate a periodic kernal for gaussian process"""
292+
return amplitude**2 * np.exp(
293+
-2 * np.sin(np.pi * np.abs(x1 - x2) / period) ** 2 / length_scale**2
294+
)

causalpy/pymc_models.py

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -46,15 +46,17 @@ def fit(self, X, y, coords: Optional[Dict[str, Any]] = None) -> None:
4646
with self.model:
4747
self.idata = pm.sample(**self.sample_kwargs)
4848
self.idata.extend(pm.sample_prior_predictive())
49-
self.idata.extend(pm.sample_posterior_predictive(self.idata))
49+
self.idata.extend(
50+
pm.sample_posterior_predictive(self.idata, progressbar=False)
51+
)
5052
return self.idata
5153

5254
def predict(self, X):
5355
"""Predict data given input data `X`"""
5456
self._data_setter(X)
5557
with self.model: # sample with new input data
5658
post_pred = pm.sample_posterior_predictive(
57-
self.idata, var_names=["y_hat", "mu"]
59+
self.idata, var_names=["y_hat", "mu"], progressbar=False
5860
)
5961
return post_pred
6062

causalpy/tests/test_integration_pymc_examples.py

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -186,3 +186,22 @@ def test_ancova():
186186
assert isinstance(result, cp.pymc_experiments.PrePostNEGD)
187187
assert len(result.idata.posterior.coords["chain"]) == sample_kwargs["chains"]
188188
assert len(result.idata.posterior.coords["draw"]) == sample_kwargs["draws"]
189+
190+
191+
@pytest.mark.integration
192+
def test_geolift1():
193+
df = cp.load_data("geolift1")
194+
df["time"] = pd.to_datetime(df["time"])
195+
df.set_index("time", inplace=True)
196+
treatment_time = pd.to_datetime("2022-01-01")
197+
result = cp.pymc_experiments.SyntheticControl(
198+
df,
199+
treatment_time,
200+
formula="""Denmark ~ 0 + Austria + Belgium + Bulgaria + Croatia + Cyprus
201+
+ Czech_Republic""",
202+
prediction_model=cp.pymc_models.WeightedSumFitter(sample_kwargs=sample_kwargs),
203+
)
204+
assert isinstance(df, pd.DataFrame)
205+
assert isinstance(result, cp.pymc_experiments.SyntheticControl)
206+
assert len(result.idata.posterior.coords["chain"]) == sample_kwargs["chains"]
207+
assert len(result.idata.posterior.coords["draw"]) == sample_kwargs["draws"]

docs/examples.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ Synthetic Control
1919
notebooks/sc2_skl.ipynb
2020
notebooks/sc_pymc_brexit.ipynb
2121
notebooks/its_covid.ipynb
22+
notebooks/geolift1.ipynb
2223

2324

2425
Difference in Differences

docs/index.rst

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -69,6 +69,10 @@ This is appropriate when you have multiple units, one of which is treated. You b
6969

7070
.. image:: ../img/synthetic_control_pymc.svg
7171

72+
Geographical Lift / Geolift
73+
""""""""""""""""""""""""""""
74+
We can also use synthetic control methods to analyse data from geographical lift studies. For example, we can try to evaluate the causal impact of an intervention (e.g. a marketing campaign) run in one geographical area by using control geographical areas which are similar to the intervention area but which did not recieve the specific marketing intervention.
75+
7276
ANCOVA
7377
""""""
7478

0 commit comments

Comments
 (0)