
FEAT: placebo testing for paid media channels #1274

Open
toj9 wants to merge 3 commits into facebookexperimental:main from toj9:placebo

Conversation


@toj9 toj9 commented Jun 1, 2025

Project Robyn

Summary

Introduces a placebo test feature that, for any chosen paid media channel, shuffles its weekly spend, reruns Robyn, and compares the resulting NRMSE distribution against the original one, via the new robyn_placebo() and plot_placebo() functions.

Specifically, it:

  • Stress-tests one paid media variable of choice by randomly shuffling its weekly spend data (for example, tv_S = TV spend), turning it into a placebo with no real effect.
  • Reruns the model on the shuffled data to generate a placebo distribution of errors (NRMSEs).
  • Compares the original model’s errors (real data) against the placebo model’s errors (shuffled data) using statistical tests: a) a t-test to check whether the mean error increases after shuffling, and b) a supporting F-test to check whether the variance of errors increases.
  • Produces density curves and violin plots that make it easy to see whether the shuffled (placebo) version led to a worse model fit.
  • Exports and saves the chart in the same folder as the other exported visuals.
  • Prints the t-test and F-test results, and can render the chart directly in RStudio.
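The core shuffling idea can be illustrated with a minimal base-R sketch (independent of Robyn; the spend vector below is simulated, standing in for a real channel column such as tv_S):

```r
# Toy weekly spend series standing in for one paid channel (e.g. tv_S)
set.seed(42)                                      # reproducible shuffle
spend <- round(runif(104, min = 0, max = 50000))  # ~2 years of weekly spend

# Placebo: permute the weeks so the timing no longer aligns with the
# response, while the total budget and value distribution stay identical
placebo_spend <- sample(spend)

# Totals and sorted values are preserved; only the temporal order changes
sum(placebo_spend) == sum(spend)             # TRUE
identical(sort(placebo_spend), sort(spend))  # TRUE
```

Because only the order changes, any difference in model fit between the two series can be attributed to timing-dependent signal (carryover, seasonality alignment) rather than to the spend level itself.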

Motivation

When we include media spend as an independent variable in an MMM, we implicitly assume it has some predictive influence on the response (e.g., sales, conversions). But in reality we don’t know whether that’s true. Some channels’ spend might genuinely help the model predict the outcome; others might not add anything beyond what other predictors already cover (i.e., they are redundant or collinear). This placebo test challenges that assumption and stress-tests one paid media variable at a time:

  • If shuffling the spend significantly worsens model performance on average (a higher mean of the error distribution), it may indicate that the original spend variable contributed strong, unique predictive information -- a.k.a. it significantly mattered.
  • If shuffling doesn’t change the model’s error distribution much, it suggests the spend variable may not carry unique, strong information, since the model can still perform similarly using the other predictors.

Why “original” may be lower on average
Because the optimizer searches hyperparameter combinations to minimize NRMSE, we expect those thousands of original fits to tend toward lower errors (they “learn” real signal from the unshuffled data). In contrast, if we scramble a channel that genuinely carried predictive information, the optimizer can no longer recover those patterns and is instead fitting noise, so its candidate fits should, on average, be worse (higher NRMSE).

When we might see the opposite
If we turn a paid channel into a placebo that has little or no real predictive power on the response (as placebos should), the optimizer might still use the remaining variables to reach a similarly low error. In that case the original and placebo distributions end up roughly the same, or the placebo distribution occasionally even dips slightly below the original, purely due to random chance in the stochastic search. But whenever a channel truly matters, the placebo treatment should increase the average NRMSE, so we see a higher-centered placebo distribution after rerunning the model with the shuffled media spend variable.

Variance as a supporting metric
In practice, after applying the placebo in Robyn, a higher variance of NRMSE values may flag that the optimizer is “flailing” more when it can’t lean on a real driver: it has to search harder once that useful independent variable is gone. A one-tailed F-test then tells us whether the increase in spread is significant. If it is, we have extra supporting evidence that the shuffled channel carried real additional predictive power; if the variance doesn’t rise much, the channel might not have contributed much signal to begin with.
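Both tests map directly onto base R’s t.test() and var.test(); the NRMSE vectors below are simulated stand-ins for the values Robyn’s candidate models would produce:

```r
# Simulated NRMSE values standing in for original vs. placebo model runs
set.seed(1)
nrmse_original <- rnorm(500, mean = 0.10, sd = 0.010)
nrmse_placebo  <- rnorm(500, mean = 0.13, sd = 0.018)

# One-tailed Welch t-test: did the mean NRMSE rise after shuffling?
t_res <- t.test(nrmse_placebo, nrmse_original, alternative = "greater")

# One-tailed F-test: did the variance of NRMSE rise as well?
f_res <- var.test(nrmse_placebo, nrmse_original, alternative = "greater")

t_res$p.value < 0.05  # TRUE here: the mean error increased significantly
f_res$p.value < 0.05  # TRUE here: the spread also increased significantly
```

With real Robyn output the two vectors would be the NRMSE columns of the original and placebo runs; the one-tailed alternatives encode the directional hypothesis that shuffling can only hurt the fit.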

Output Examples

The placebo test has 3 main outputs:

  1. density distributions of the original NRMSE vs the placebo NRMSE
  2. violin plot with variance data labels for original vs placebo
  3. t-test and f-test results

Placebo tests suggest OOH spend might be an important spend variable, as the error distribution rises significantly after shuffling

  • 1st replication:
    image

  • 2nd replication:
    image

  • 3rd replication:
    image

Placebo tests suggest FB media spend might not be important given other predictive variables in the model

  • 1st replication:
    image
  • 2nd replication:
    image
  • 3rd replication:
    image

These stress-testing charts are automatically saved into the same folder as other visuals when exported:
image

Code Example

library(Robyn)

# Check version
packageVersion("Robyn")

# Enable multi-core
Sys.setenv(R_FUTURE_FORK_ENABLE = "true")
options(future.fork.enable = TRUE)

# Load data
data("dt_simulated_weekly")
data("dt_prophet_holidays")

# Output folder
robyn_directory <- "~/Desktop"

# Step 1: Inputs
InputCollect <- robyn_inputs(
  dt_input         = dt_simulated_weekly,
  dt_holidays      = dt_prophet_holidays,
  date_var         = "DATE",
  dep_var          = "revenue",
  dep_var_type     = "revenue",
  prophet_vars     = c("trend", "season", "holiday"),
  prophet_country  = "US",
  context_vars     = c("competitor_sales_B", "events"),
  paid_media_spends= c("tv_S", "ooh_S", "print_S", "facebook_S", "search_S"),
  paid_media_vars  = c("tv_S", "ooh_S", "print_S", "facebook_I", "search_clicks_P"),
  organic_vars     = c("newsletter"),
  factor_vars      = c("events"),
  window_start     = "2016-01-01",
  window_end       = "2018-12-31",
  adstock          = "geometric"
)

# Step 2: Hyperparameters
hyperparameters <- list(
  facebook_I_alphas      = c(0.5, 3),
  facebook_I_gammas      = c(0.3, 1),
  facebook_I_thetas      = c(0, 0.3),
  print_S_alphas         = c(0.5, 1),
  print_S_gammas         = c(0.3, 1),
  print_S_thetas         = c(0.1, 0.4),
  tv_S_alphas            = c(0.5, 1),
  tv_S_gammas            = c(0.3, 1),
  tv_S_thetas            = c(0.3, 0.8),
  search_clicks_P_alphas = c(0.5, 3),
  search_clicks_P_gammas = c(0.3, 1),
  search_clicks_P_thetas = c(0, 0.3),
  ooh_S_alphas           = c(0.5, 1),
  ooh_S_gammas           = c(0.3, 1),
  ooh_S_thetas           = c(0.1, 0.4),
  newsletter_alphas      = c(0.5, 3),
  newsletter_gammas      = c(0.3, 1),
  newsletter_thetas      = c(0.1, 0.4),
  train_size             = c(0.5, 0.8)
)
InputCollect <- robyn_inputs(InputCollect = InputCollect, hyperparameters = hyperparameters)

# Step 3: Train
OutputModels <- robyn_run(
  InputCollect = InputCollect,
  cores        = NULL,
  iterations   = 1500,
  trials       = 3,
  ts_validation= TRUE
)

# Step 4: Analyze & export
OutputCollect <- robyn_outputs(
  InputCollect = InputCollect,
  OutputModels = OutputModels,
  pareto_fronts= "auto",
  clusters     = TRUE,
  export       = TRUE,
  plot_folder  = robyn_directory
)

# Step 5: Select & save model
select_model    <- "1_283_1"
ExportedModel   <- robyn_write(InputCollect, OutputCollect, select_model, export = TRUE)
json_file <- "/Users/jerry/Desktop/Robyn_202505301936_init/RobynModel-1_283_1.json"

# Step 6: Allocation
AllocatorCollect <- robyn_allocator(
  InputCollect       = InputCollect,
  OutputCollect      = OutputCollect,
  select_model       = select_model,
  channel_constr_low = 0.7,
  channel_constr_up  = c(1.2, 1.5, 1.5, 1.5, 1.5),
  scenario           = "max_response"
)
plot(AllocatorCollect)

# NEW FEATURE, Step 7: placebo test and export to the same folder
oc_placebo <- robyn_placebo(OutputCollect,
                            channel = "facebook_S",
                            export = TRUE)

# Inspect the t-test and variance f-test results
print(oc_placebo$placebo$t_test)
print(oc_placebo$placebo$f_test)

# Plot in R if necessary
plot_placebo(oc_placebo)

What's next?

The code could be extended to support other variables in the future, not just paid media. We could also introduce random synthetic confounders to see whether the channel effect disappears, run a “subset refuter” that holds out part of the data, or remove certain variables (or time periods) entirely to confirm stability.

Type of change

feat: New feature (non-breaking change which adds functionality)

How Has This Been Tested?

image

@facebook-github-bot

Hi @toj9!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 1, 2025
@toj9
Author

toj9 commented Jun 1, 2025

Hey @gufengzhou! I have been using Robyn for years and I love it! I have just put together this PR for a potential new feature I have been thinking about. Let me know what you think, or whether I need to tag somebody else from the community. Thanks, cheers.

@gufengzhou
Contributor

Hello @toj9, very cool idea, beautiful viz, and thanks for the contribution! I'd like to get your thoughts on a couple of things.

  1. How do you think about shuffling time series with a temporal structure? Using the "sample" function to reshuffle the media is like an experiment "let's switch the summer and Christmas campaigns and see what happens". If the error stays the same after reshuffle, does it really mean this channel is insignificant? Or does it rather say "this channel is more like an always on instead of seasonal campaigns"?

  2. What would be the difference between reshuffling a channel vs removing a channel as a placebo?

@toj9
Author

toj9 commented Jun 3, 2025

@gufengzhou ah yes, thanks for this feedback!!

  1. For the first point:
  • You are right: here the timing is actually a crucial variable given the spend data and the Robyn setup, whereas in, for example, a causal-inference treatment-vs-control setup, we are fine with shuffling the variable because there is no real notion of timing: “either you got treated or you didn't” -- and swapping those labels should break any real effect quite nicely.
  • You are also right about the always-on case, which I had not considered. An always-on setup for a channel would actually be a scenario where this shuffling falls short: it might not break the effect, so the channel could be flagged as unimportant under the assumption “we shuffled it, so the error must have increased”. If I shuffle the weekly spend, I am effectively asking “what if this week's FB budget had actually run in November instead of June?” -- so for a channel that is effectively always-on (flat spend that only varies slightly week to week), shuffling may not destroy the predictive power, and the fit might not change when the series is permuted.

I will say, though, that the idea from the start was that shuffling breaks the Hill and adstock effects: the model should no longer easily learn how each week's spend for the channel carries over or follows a diminishing-returns curve. Even if total dollars stay the same, the dose-response and carryover signals "vanish" and NRMSE inflates.
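A quick base-R sketch of that intuition (geometric adstock only, a fixed theta, and a toy burst-style spend series; none of this uses Robyn's internals):

```r
# Geometric adstock: carryover with a fixed decay rate theta
geometric_adstock <- function(x, theta) {
  out <- numeric(length(x))
  out[1] <- x[1]
  for (t in 2:length(x)) out[t] <- x[t] + theta * out[t - 1]
  out
}

# A concentrated 4-week burst of spend within a year of weekly data
spend <- c(rep(0, 10), rep(10000, 4), rep(0, 38))
set.seed(7)
shuffled <- sample(spend)

ad_orig <- geometric_adstock(spend,    theta = 0.6)
ad_shuf <- geometric_adstock(shuffled, theta = 0.6)

# Raw totals are identical, but the adstocked series are not: the
# carryover that accumulates across the consecutive burst weeks is
# destroyed once those weeks are scattered by the permutation
sum(shuffled) == sum(spend)  # TRUE
max(ad_orig)                 # 21760: carryover compounds over the burst
```

In the original series, the fourth burst week reaches 10000 + 0.6·19600 = 21760 of adstocked pressure; once the burst weeks are scattered, no such build-up occurs, which is exactly the timing signal the model loses.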

  2. For the second point, the two differ, and specifically for Robyn it would be:
  • Placebo shuffle -- tests whether the precise timing of this channel matters. If timing matters, a shuffled series will break it; if it doesn't, the model may be picking up only the average spend or a pattern already captured by other covariates.

  • Placebo removal -- tests whether any information about the channel (timing + amount + its very presence) helps drive predictions.

So, for an always-on channel:
Shuffle test: NRMSE might not change, because timing isn't carrying any unique lift --> so a sequential removal placebo test should be run to double-check --> the removal test should increase NRMSE, since the sheer presence of always-on spend is hypothesized under H0 to drive the response in some way (otherwise why would we have it). If the NRMSE distribution stays identical after this test too, the channel was most likely truly redundant, since even removing it did not statistically change the NRMSE across models.

So the sequential combo could actually be a solution for the always-on case, which is what I realized after writing this lol. The two tests target different things and would serve different stress-testing purposes.


What I am also thinking is that we could have a third option: inject a placebo by adding a paid spend variable, which could sidestep the always-on problem. It would be essentially the same random noise as the shuffling, but in practice we would generate a truly random spend series (for example, sampled with the same mean/SD as an existing channel like search_S or ooh_S so it lives on the same scale) and then inject that column into the Robyn inputs. Because the hypothesis is that this placebo has a nonzero spend share but, being a placebo, should have 0% effect share, we can check the following:

Spend-Share Check -- in the one-pager output, the placebo variable will show up in the bar chart of Robyn's spend share (since it is literally inserted as a paid channel), but its effect share must be essentially zero. If Robyn ever assigns it a nonzero effect, the optimizer is mistaking random noise for real signal, which indicates overfitting:

image

Saturation Curve Check -- its Hill/adstock curve should be flat, with no significant shape or upward slope. Any curvature or significant slope is a red flag that the model is overfitting again.

image

NRMSE Impact Check -- we compare Pareto NRMSE distributions with and without the injected placebo. Because it truly drives nothing, the minimum NRMSE (across all candidate models) should ideally stay the same or even improve slightly (if Robyn happens to shrink away the noise). If adding the placebo ever lowers NRMSE significantly, Robyn is "learning" from pure randomness and overfitting again.
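Generating such an injected placebo series could look like the sketch below (illustrative only; placebo_S is a hypothetical column name, and the reference channel here is a simulated stand-in for something like search_S):

```r
# Simulated reference channel on a realistic weekly spend scale
set.seed(99)
search_S <- round(runif(104, min = 5000, max = 20000))

# Synthetic placebo: random noise matched to the reference channel's
# mean/SD so it lives on the same scale, but with no causal link to
# the response variable
placebo_S <- rnorm(length(search_S),
                   mean = mean(search_S),
                   sd   = sd(search_S))
placebo_S <- pmax(placebo_S, 0)  # spend cannot be negative

# This column would then be added to dt_input and listed in
# paid_media_spends before calling robyn_inputs()
length(placebo_S) == length(search_S)  # TRUE
```

Matching the mean/SD of an existing channel matters: a placebo on a wildly different scale could be ignored (or over-weighted) by the hyperparameter search for reasons unrelated to signal versus noise.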

After injecting a placebo as a diagnostic step into what we believed was a strong model, its behavior tells us whether the original fit was genuine or just noise:

  • If the placebo remains flat = our “good” model was probably capturing real media signals, and we can trust its channel-level ROI, spend share, and Hill/adstock shapes more confidently.
  • If the placebo picks up a signal = part of that “great” R² was driven by noise. The channel contributions may be inflated, so we should avoid trusting the model's recommendations until we tighten hyperparameters, remove suspect predictors, add omitted variables, etc.

If Robyn “learns” from the noise variable, it is demonstrating a tendency to overfit. If it does not, that is evidence its Hill/adstock machinery was not flexible enough to mistake random fluctuations for real signal.


Here is an example where the effect share actually went up to 2.9% for a placebo:

image

And here is an example where the effect share went up to as much as 3.7%:

image

  • In both one-pagers above, the residuals form a shape they should not; the assumption is that they should be randomly distributed around 0.

