diff --git a/causal-inference-for-the-brave-and-true/08-Instrumental-Variables.ipynb b/causal-inference-for-the-brave-and-true/08-Instrumental-Variables.ipynb index fa802a6..b9b5c51 100644 --- a/causal-inference-for-the-brave-and-true/08-Instrumental-Variables.ipynb +++ b/causal-inference-for-the-brave-and-true/08-Instrumental-Variables.ipynb @@ -131,7 +131,7 @@ "source": [ "One way to avoid this is to control for constant levels of ability when measuring the effect of education on wage. We could do that by including ability in our linear regression model. However, we don't have good measurements of ability. The best we have are some very questionable proxies, like IQ.\n", "\n", - "But all is not lost. Here is where Instrumental Variables enters the picture. The idea of IV is to find another variable that causes the treatment and it is only correlated with the outcome through the treatment. Another way of saying this is that this instrument $Z_i$ is uncorrelated with $Y_0$, but it is correlated with $T$. This is sometimes referred to as the exclusion restriction." + "But all is not lost. Here is where Instrumental Variables enters the picture. The idea of IV is to find another variable that causes the treatment and it is only correlated with the outcome through the treatment. Another way of saying this is that this instrument $Z_i$ is uncorrelated with $Y_0$ and $Y_1$, but it is correlated with $T$. This is sometimes referred to as the exclusion restriction." ] }, { @@ -260,7 +260,7 @@ "\\kappa = \\dfrac{\\mathrm{Cov}(Y_i, Z_i)/V(Z_i)}{\\mathrm{Cov}(T_i, Z_i)/V(Z_i)} = \\dfrac{\\text{Reduced Form}}{\\text{1st Stage}} \n", "$\n", "\n", - "Notice that both the numerator and the denominator are regression coefficients (covariances divided by variances). The numerator is the result from the regression of Y on Z. In other words, it's the \"impact\" of Z on Y. Remember that this is not to say that Z causes Y, since we have a requirement that Z impacts Y only through T. Rather, it is only capturing how big is this effect of Z on Y through T. This numerator is so famous it has its own name: the reduced form coefficient.\n", + "Notice that both the numerator and the denominator are regression coefficients (covariances divided by variances). The numerator is the result from the regression of Y on Z. In other words, it's the \"impact\" of Z on Y. Remember that this is not to say that Z causes Y, since we have a requirement that Z impacts Y only through T. Rather, it is only capturing how big this effect of Z on Y through T is. This numerator is so famous it has its own name: the reduced form coefficient.\n", "\n", "The denominator is also a regression coefficient. This time, it is the regression of T on Z. This regression captures what is the impact of Z on T and it is also so famous that it is called the 1st Stage coefficient. \n", "\n", @@ -288,7 +288,7 @@ "\n", "Still, we do have some interesting examples of instruments to make things a little more concrete. We will again try to estimate the effect of education on wage. To do so, we will use the person's quarter of birth as the instrument Z.\n", "\n", - "This idea takes advantage of US compulsory attendance law. Usually, they state that a kid must have turned 6 years by January 1 of the year they enter school. For this reason, kids that are born at the beginning of the year will enter school at an older age. Compulsory attendance law also requires students to be in school until they turn 16, at which point they are legally allowed to drop out. 
The result is that people born later in the year have, on average, more years of education than those born in the beginning of the year.\n", + "This idea takes advantage of US compulsory attendance law. Usually, they state that a kid must have turned 6 by January 1 of the year they enter school. For this reason, kids that are born at the beginning of the year will enter school at an older age. Compulsory attendance law also requires students to be in school until they turn 16, at which point they are legally allowed to drop out. The result is that people born later in the year have, on average, more years of education than those born in the beginning of the year.\n", "\n", "![img](./data/img/iv/qob.png)\n", "\n", diff --git a/causal-inference-for-the-brave-and-true/09-Non-Compliance-and-LATE.ipynb b/causal-inference-for-the-brave-and-true/09-Non-Compliance-and-LATE.ipynb index b880700..3cbc876 100644 --- a/causal-inference-for-the-brave-and-true/09-Non-Compliance-and-LATE.ipynb +++ b/causal-inference-for-the-brave-and-true/09-Non-Compliance-and-LATE.ipynb @@ -305,13 +305,13 @@ "\n", "Instrumental Variable assumptions can now be rewritten as follows\n", "\n", - "1. $T_{0i}, T_{1i} \\perp Z_i $ and $Y_i(T_{1i},1), Y_i(T_{0i},0) \\perp Z_i $. This is the independence Assumption. This says that the instrument is as good as randomly assigned. In other words, Z, the instrument, is not correlated with the potential treatments, which is the same as saying that people in different instrument groups are comparable. \n", + "1. $T_{0i}, T_{1i} \\perp Z_i $ and $Y_i(T_{1i},1), Y_i(T_{0i},0) \\perp Z_i $. This is the Independence Assumption. This says that the instrument is as good as randomly assigned. In other words, Z, the instrument, is not correlated with the potential treatments, which is the same as saying that people in different instrument groups are comparable. \n", "\n", "2. $Y_i(1, 0)=Y_i(1, 1)=Y_{i1}$ and $Y_i(0, 0)=Y_i(0, 1)=Y_{i0}$. This is the exclusion restriction. It says that if I'm looking at the potential outcome for the treated, it is the same for both instrument groups. In other words, the instrument does not affect the potential outcome, which is the same as saying that the instrument only affects the outcome through the treatment.\n", "\n", "3. $E[T_{1i}-T_{0i}] \\neq 0$. This is the existence of a 1st stage. It is saying that the potential outcome of the 1st stage, that is, the potential treatment, is NOT the same. Another way of saying this is that the instrument does affect the treatment.\n", "\n", - "4. $T_{i1} > T_{i0}$. This is the monotonicity assumption. It is saying that if everyone had the instrument turned on, the treatment level would be higher than if everyone had the treatment turned off. \n", + "4. $T_{i1} \\geq T_{i0}$. This is the monotonicity assumption. It is saying that if everyone had the instrument turned on, the treatment level would be equal or higher than if everyone had the instrument turned off. \n", "\n", "Now, let's review the Wald estimator to gain some further intuition on IV:\n", "\n", @@ -322,7 +322,7 @@ "Let's take the first bit of it, $E[Y|Z=1]$. 
Using the exclusion restriction, we can rewrite Y in terms of potential outcome like this.\n", "\n", "$\n", - "E[Y_i|Z_i=1]=E[Y_{i0} + T_{i1}(Y_{i1} - Y_{i0})|Z=1]\n", + "E[Y_i|Z_i=1]=E[Y_{i0} + T_{i1}(Y_{i1} - Y_{i0})|Z_i=1]\n", "$\n", "\n", "Using independence, we can take out the conditioning on Z.\n", @@ -553,7 +553,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Looks like we have a strong first stage. Those that get assigned to get the push get it 71.8% of the time. This means that we have something like 28% of never takers. We also have strong reasons to believe there are no always takers, since the intercept parameter is estimated to be zero. This means that no one get's the push if it is not assigned to it. Given the design of our experiment, this is expected. \n", + "Looks like we have a strong first stage. Those that get assigned to get the push get it 71.8% of the time. This means that we have something like 28% of never takers. We also have strong reasons to believe there are no always takers, since the intercept parameter is estimated to be zero. This means that no one gets the push if it is not assigned to it. Given the design of our experiment, this is expected. \n", "\n", "Let's now run the reduced form:" ] diff --git a/causal-inference-for-the-brave-and-true/10-Matching.ipynb b/causal-inference-for-the-brave-and-true/10-Matching.ipynb index 18978fc..ba77904 100644 --- a/causal-inference-for-the-brave-and-true/10-Matching.ipynb +++ b/causal-inference-for-the-brave-and-true/10-Matching.ipynb @@ -163,7 +163,7 @@ "\\hat{ATE} = \\sum^K_{k=1}(\\bar{Y}_{k1} - \\bar{Y}_{k0}) * \\dfrac{N_k}{N}\n", "$\n", "\n", - "where the bar represent the mean of the outcome on the treated, $Y_{k1}$, and non-treated, $Y_{k0}$, at cell k and $N_{k}$ is the number of observations in that same cell. As you can see, we are computing a local ATE for each cell and combining them using a weighted average, where the weights are the sample size of the cell. In our medicine example above, this would be the first estimate, which gave us −2.6.\n", + "where the bar represents the mean of the outcome on the treated, $Y_{k1}$, and non-treated, $Y_{k0}$, at cell k and $N_{k}$ is the number of observations in that same cell. As you can see, we are computing a local ATE for each cell and combining them using a weighted average, where the weights are the sample size of the cell. In our medicine example above, this would be the first estimate, which gave us −2.6.\n", "\n", "## Matching Estimator\n", "\n", @@ -828,7 +828,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "But this was a very contrived example, just to introduce matching. In reality, we usually have more than one feature and units don't match perfectly. In this case, we have to define some measurement of proximity to compare how units are close to each other. One common metric for this is the euclidean norm $||X_i - X_j||$. This difference, however, is not invariant to the scale of the features. This means that features like age, that take values on the tenths, will be much less important when computing this norm compared to features like income, which take the order of hundreds. For this reason, before applying the norm, we need to scale the features so that they are on roughly the same scale.\n", + "But this was a very contrived example, just to introduce matching. In reality, we usually have more than one feature and units don't match perfectly. 
In this case, we have to define some measurement of proximity to compare how units are close to each other. One common metric for this is the euclidean norm $||X_i - X_j||$. This difference, however, is not invariant to the scale of the features. This means that features like age, that take values in the tens, will be much less important when computing this norm compared to features like income, which take the order of thousands. For this reason, before applying the norm, we need to scale the features so that they are on roughly the same scale.\n", "\n", "Having defined a distance measure, we can now define the match as the nearest neighbour to that sample we wish to match. In math terms, we can write the matching estimator the following way\n", "\n", @@ -1253,7 +1253,7 @@ "\\sqrt{N_1}(\\hat{ATET} - ATET)\n", "$\n", "\n", - "However, this doesn't alway happen. If we define the mean outcome for the untreated given X, $\\mu_0(x)=E[Y|X=x, T=0]$, we will have that (btw, I've omitted the proof for that because it's a little beyond the point here).\n", + "However, this doesn't always happen. If we define the mean outcome for the untreated given X, $\\mu_0(x)=E[Y|X=x, T=0]$, we will have that (btw, I've omitted the proof for that because it's a little beyond the point here).\n", "\n", "$\n", "E[\\sqrt{N_1}(\\hat{ATET} - ATET)] = E[\\sqrt{N_1}(\\mu_0(X_i) - \\mu_0(X_j(i)))]\n", @@ -1269,7 +1269,7 @@ "\\hat{ATET} = \\frac{1}{N_1}\\sum \\big((Y_i - Y_{j(i)}) - (\\hat{\\mu_0}(X_i) - \\hat{\\mu_0}(X_{j(i)}))\\big)\n", "$\n", "\n", - "where $\\hat{\\mu_0}(x)$ is some estimative of $E[Y|X, T=0]$, like a linear regression fitted only on the untreated sample." + "where $\\hat{\\mu_0}(x)$ is some estimate of $E[Y|X, T=0]$, like a linear regression fitted only on the untreated sample." ] }, { @@ -1496,9 +1496,11 @@ "\n", "As it turns out, the answer is quite simple and intuitive. It is easy to find people that match on a few characteristics, like sex. But if we add more characteristics, like age, income, city of birth and so on, it becomes harder and harder to find matches. In more general terms, the more features we have, the higher will be the distance between units and their matches. \n", "\n", - "This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomena pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n", + "This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. 
And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomenon pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n", + "\n", + "![img](./data/img/curse-of-dimensionality.jpg)", + "\n", "\n", - "![img](./data/img/curse-of-dimensionality.jpg)\n", "Image Source: https://deepai.org/machine-learning-glossary-and-terms/curse-of-dimensionality\n", "\n", "Despite its scary and pretentious name, this only means that the number of data points required to fill a feature space grows exponentially with the number of features, or dimensions. So, if it takes X data points to fill the space of, say, 3 feature spaces, it takes exponentially more points to fill in the space of 4 features. \n", @@ -1515,7 +1517,7 @@ "\n", "From there, we've derived a very general causal inference estimator with subclassification. We saw how that estimator is not very useful in practice but it gave us some interesting insights on how to tackle the problem of causal inference estimation. That gave us the opportunity to talk about the matching estimator. \n", "\n", - "Matching controls for the confounders by looking at each treated unit and finding an untreated pair that is very similar to it and similarly for the untreated units. We saw how to implement this method using the KNN algorithm and also how to debiase it using regression. Finally, we discussed the difference between matching and linear regression. We saw how matching is a non parametric estimator that doesn't rely on linearity the way linear regression does.\n", + "Matching controls for the confounders by looking at each treated unit and finding an untreated pair that is very similar to it and similarly for the untreated units. We saw how to implement this method using the KNN algorithm and also how to debias it using regression. Finally, we discussed the difference between matching and linear regression. We saw how matching is a non-parametric estimator that doesn't rely on linearity the way linear regression does.\n", "\n", "Finally, we've delved into the problem of high dimensional datasets and we saw how causal inference methods can suffer from it.\n", "\n", diff --git a/causal-inference-for-the-brave-and-true/11-Propensity-Score.ipynb b/causal-inference-for-the-brave-and-true/11-Propensity-Score.ipynb index a194b91..384035b 100644 --- a/causal-inference-for-the-brave-and-true/11-Propensity-Score.ipynb +++ b/causal-inference-for-the-brave-and-true/11-Propensity-Score.ipynb @@ -793,7 +793,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Propensity score weighting is saying that we should expect treated individuals to be 0.38 standard deviations above their untreated fellows, in terms of achievements. We can also see that if no one got the treatment, we should expect the general level of achievements to be 0.12 standard deviation lower than what it is now. By the same reasoning, we should expect the general level of achievement to be 0.25 standards deviation higher if we've given everyone the seminar. Contrast this to the 0.47 ATE estimate we've got by simply comparing treated and untreated. 
This is evidence that the bias we have is indeed positive and that controlling for X gives us a more modest estimate of the impact of the growth mindset.\n", + "Propensity score weighting is saying that we should expect treated individuals to be 0.38 standard deviations above their untreated fellows, in terms of achievements. We can also see that if no one got the treatment, we should expect the general level of achievements to be 0.12 standard deviation lower than what it is now. By the same reasoning, we should expect the general level of achievement to be 0.25 standard deviations higher if we've given everyone the seminar. Contrast this to the 0.47 ATE estimate we've got by simply comparing treated and untreated. This is evidence that the bias we have is indeed positive and that controlling for X gives us a more modest estimate of the impact of the growth mindset.\n", "\n", "## Standard Error\n", "\n", @@ -916,7 +916,7 @@ "\n", "![img](./data/img/ps/ml-trap.png)\n", "\n", - "To see this, consider the following example (adapted from Hernán's Book). You have 2 schools, one of them apply the growth mindset seminar to 99% of its students and the other to 1%. Suppose that the schools have no impact on the treatment effect (except through the treatment), so it's not necessary to control for it. If you add the school variable to the propensity score model, it's going to have a very high predictive power. However, by chance, we could end up with a sample where everyone in school A got the treatment, leading to a propensity score of 1 for that school, which would lead to an infinite variance. This is an extreme example, but let's see how it would work with simulated data." + "To see this, consider the following example (adapted from Hernán's Book). You have 2 schools, one of them applies the growth mindset seminar to 99% of its students and the other to 1%. Suppose that the schools have no impact on the treatment effect (except through the treatment), so it's not necessary to control for it. If you add the school variable to the propensity score model, it's going to have a very high predictive power. However, by chance, we could end up with a sample where everyone in school A got the treatment, leading to a propensity score of 1 for that school, which would lead to an infinite variance. This is an extreme example, but let's see how it would work with simulated data." ] }, { @@ -1114,7 +1114,7 @@ "\n", "## Propensity Score Matching\n", "\n", - "As I've said before, you don't need to control for X when you have the propensity score. It suffices to control for it. As such, you can think of the propensity score as performing a kind of dimensionality reduction on the feature space. It condenses all the features in X into a single treatment assignment dimension. For this reason, we can treat the propensity score as an input feature for other models. Take a regression, model for instance." + "As I've said before, you don't need to control for X when you have the propensity score. It suffices to control for it. As such, you can think of the propensity score as performing a kind of dimensionality reduction on the feature space. It condenses all the features in X into a single treatment assignment dimension. For this reason, we can treat the propensity score as an input feature for other models. Take a regression model, for instance." 
] }, { @@ -1162,7 +1162,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If we control for the propensity score, we now estimate a ATE of 0.39, which is lower than the 0.47 we got previously with a regression model without controlling for the propensity score. We can also use matching on the propensity score. This time, instead of trying to find matches that are similar in all the X features, we can find matches that just have the same propensity score.\n", + "If we control for the propensity score, we now estimate an ATE of 0.39, which is lower than the 0.47 we got previously with a regression model without controlling for the propensity score. We can also use matching on the propensity score. This time, instead of trying to find matches that are similar in all the X features, we can find matches that just have the same propensity score.\n", "\n", "This is a huge improvement on top of the matching estimator, since it deals with the curse of dimensionality. Also, if a feature is unimportant for the treatment assignment, the propensity score model will learn that and give low importance to it when fitting the treatment mechanism. Matching on the features, on the other hand, would still try to find matches where individuals are similar on this unimportant feature." ] diff --git a/causal-inference-for-the-brave-and-true/14-Panel-Data-and-Fixed-Effects.ipynb b/causal-inference-for-the-brave-and-true/14-Panel-Data-and-Fixed-Effects.ipynb index 2c21772..c2ad2f7 100644 --- a/causal-inference-for-the-brave-and-true/14-Panel-Data-and-Fixed-Effects.ipynb +++ b/causal-inference-for-the-brave-and-true/14-Panel-Data-and-Fixed-Effects.ipynb @@ -56,7 +56,7 @@ "\\widehat{ATT} = \\underbrace{Y_1(1)|D=1}_{\\substack{\\text{POA outcome} \\\\ \\text{after intervention}}} - \\widehat{Y_0(1)|D=1}\n", "$$\n", " \n", - "In other words, the effect of placing a billboard in POA is the outcome we saw on POA after placing the billboard minus our estimate of what would have happened if we hadn't placed the billboard. Also, recall that the power of DiD comes from the fact that estimating the mentioned counterfactual only requires that the growth deposits in POA matches the growth in deposits in FLW. This is the key parallel trends assumption. We should definitely spend some time on it because it is going to become very important later on. \n", + "In other words, the effect of placing a billboard in POA is the outcome we saw on POA after placing the billboard minus our estimate of what would have happened if we hadn't placed the billboard. Also, recall that the power of DiD comes from the fact that estimating the mentioned counterfactual only requires that the growth deposits in POA matches the growth in deposits in FLN. This is the key parallel trends assumption. We should definitely spend some time on it because it is going to become very important later on. \n", " \n", " \n", "## Parallel Trends\n", @@ -67,7 +67,7 @@ "Y_d \\perp D\n", "$$\n", " \n", - "This means we don't give more treatment to units with higher outcome (which would cause upward bias in the effect estimation) or lower outcome (which would cause downward bias). In less abstract terms, back to our example, let's say that your marketing manager decides to add billboards only to cities that already have very high deposits. That way, he or she can later boast that cities with billboards generate more deposties, so of course the marketing campaign was a success. 
Setting aside the moral discussion here, I think you can see that this violates the independence assumption: we are giving the treatment to cities with high $Y_0$. Also, remember that a natural extension of this assumption is the conditional independence assumption, which allows the potential outcomes to be dependent on the treatment at first, but independent once we condition on the confounders $X$\n", +    "This means we don't give more treatment to units with higher outcome (which would cause upward bias in the effect estimation) or lower outcome (which would cause downward bias). In less abstract terms, back to our example, let's say that your marketing manager decides to add billboards only to cities that already have very high deposits. That way, he or she can later boast that cities with billboards generate more deposits, so of course the marketing campaign was a success. Setting aside the moral discussion here, I think you can see that this violates the independence assumption: we are giving the treatment to cities with high $Y_0$. Also, remember that a natural extension of this assumption is the conditional independence assumption, which allows the potential outcomes to be dependent on the treatment at first, but independent once we condition on the confounders $X$\n", " \n", "$\n", "Y_d \\perp D | X\n", "$\n", " \n", @@ -81,7 +81,7 @@ "$\n", " \n", " \n", -    "In less mathematical terms, this assumption is saying it is fine that we assign the treatment to units that have a higher or lower level of the outcome. What we can't do is assign the treatment to units based on how the outcome is growing. In out billboard example, this means it is OK to place billboards only in cities with originally high deposits level. What we can't do is place billboards only in cities where the deposits are growing the most. That makes a lot of sense if we remember that DiD is inputting the counterfactual growth in the treated unit with the growth in the control unit. If growth in the treated unit under the control is different than the growth in the control unit, then we are in trouble. " +    "In less mathematical terms, this assumption is saying it is fine that we assign the treatment to units that have a higher or lower level of the outcome. What we can't do is assign the treatment to units based on how the outcome is growing. In our billboard example, this means it is OK to place billboards only in cities with originally high deposit levels. What we can't do is place billboards only in cities where the deposits are growing the most. That makes a lot of sense if we remember that DiD is imputing the counterfactual growth in the treated unit with the growth in the control unit. If growth in the treated unit under the control is different than the growth in the control unit, then we are in trouble. " ] }, { @@ -915,7 +915,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ -    "Take a minute to appreciate what the image above is telling you about what fixed effect is doing. Notice that fixed effect is fitting **one regression line per city**. Also notice that the lines are parallel. The slope of the line is the effect of marketing costs on in-app purchase. So the **fixed effect is assuming that the causal effect is constants across all entities**, which are cities in this case. This can be a weakness or an advantage, depending on how you see it. It is a weakness if you are interested in finding the causal effect per city. 
Since the FE model assumes this effect is constant across entities, you won't find any difference in the causal effect. However, if you want to find the overall impact of marketing on in-app purchase, the panel structure of the data is a very useful leverage that fixed effects can explore. \n", + "Take a minute to appreciate what the image above is telling you about what fixed effect is doing. Notice that fixed effect is fitting **one regression line per city**. Also notice that the lines are parallel. The slope of the line is the effect of marketing costs on in-app purchase. So the **fixed effect is assuming that the causal effect is constant across all entities**, which are cities in this case. This can be a weakness or an advantage, depending on how you see it. It is a weakness if you are interested in finding the causal effect per city. Since the FE model assumes this effect is constant across entities, you won't find any difference in the causal effect. However, if you want to find the overall impact of marketing on in-app purchase, the panel structure of the data is a very useful leverage that fixed effects can explore. \n", "\n", "## Time Effects\n", "\n", diff --git a/causal-inference-for-the-brave-and-true/15-Synthetic-Control.ipynb b/causal-inference-for-the-brave-and-true/15-Synthetic-Control.ipynb index e1fbcc2..1da952b 100644 --- a/causal-inference-for-the-brave-and-true/15-Synthetic-Control.ipynb +++ b/causal-inference-for-the-brave-and-true/15-Synthetic-Control.ipynb @@ -1218,7 +1218,7 @@ "\n", "To correct for that, we learned that we can build a synthetic control that combines multiple control units to make them resemble the treated unit. With this synthetic control, we were able to see what would have happened to our treated unit in the absence of a treatment. \n", "\n", - "Finally, we saw how we could use Fisher's Exact Tests to do inference with synthetic control. Namely, we've pretended that the non-treated units were actually the treated and computed their effect. These were the placebo effects: the effects we would observe even without a treatment. We uses these to see if the treatment effect we've estimated was statistically significant. \n", + "Finally, we saw how we could use Fisher's Exact Tests to do inference with synthetic control. Namely, we've pretended that the non-treated units were actually the treated and computed their effect. These were the placebo effects: the effects we would observe even without a treatment. We used these to see if the treatment effect we've estimated was statistically significant. \n", "\n", "## References\n", "\n", diff --git a/causal-inference-for-the-brave-and-true/16-Regression-Discontinuity-Design.ipynb b/causal-inference-for-the-brave-and-true/16-Regression-Discontinuity-Design.ipynb index e5a702a..a3277e6 100644 --- a/causal-inference-for-the-brave-and-true/16-Regression-Discontinuity-Design.ipynb +++ b/causal-inference-for-the-brave-and-true/16-Regression-Discontinuity-Design.ipynb @@ -24,7 +24,7 @@ "D_i = \\mathcal{1}\\{R_i>c\\}\n", "$\n", "\n", - "In other words, this is saying that treatment is zero when $R$ is below a threshold $c$ and one otherwise. This means that we get to observe $Y_1$ when $R>c$ and $Y_0$ when $Rc$ and $Y_0$ when $RT_{i0} \\ \\forall i$. This means that crossing the threshold from the left to the right only increases your chance of getting a diploma (or that there are no defiers). 
With these 2 assumptions, we have a Wald Estimator for LATE.\n", + "Just like when we've assumed smoothness on the potential outcome, we now assume it for the potential treatment. Also, we need to assume monotonicity, just like in IV. In case you don't remember, it states that $T_{i1} \\geq T_{i0} \\ \\forall i$. This means that crossing the threshold from the left to the right only increases your chance of getting a diploma (or that there are no defiers). With these 2 assumptions, we have a Wald Estimator for LATE.\n", "\n", "$$\n", "\\dfrac{\\lim_{r \\to c^+} E[Y_i|R_i=r] - \\lim_{r \\to c^-} E[Y_i|R_i=r]}{\\lim_{r \\to c^+} E[T_i|R_i=r] - \\lim_{r \\to c^-} E[T_i|R_i=r]} = E[Y_{1i} - Y_{0i} | T_{1i} > T_{0i}, R_i=c]\n", @@ -700,7 +700,7 @@ "\n", "Notice how this is a local estimate in two senses. First, it is local because it only gives the treatment effect at the threshold $c$. This is the RD locality. Second, it is local because it only estimates the treatment effect for the compliers. This is the IV locality.\n", "\n", - "To estimate this, we will use 2 linear regression. The numerator can be estimated just like we've done before. To get the denominator, we simply replace the outcome with the treatment. But first, let's talk about a sanity check we need to run to make sure we can trust our RDD estimates.\n", + "To estimate this, we will use 2 linear regressions. The numerator can be estimated just like we've done before. To get the denominator, we simply replace the outcome with the treatment. But first, let's talk about a sanity check we need to run to make sure we can trust our RDD estimates.\n", "\n", "### The McCrary Test\n", "\n", diff --git a/causal-inference-for-the-brave-and-true/17-Predictive-Models-101.ipynb b/causal-inference-for-the-brave-and-true/17-Predictive-Models-101.ipynb index 6410d90..fcdfb13 100644 --- a/causal-inference-for-the-brave-and-true/17-Predictive-Models-101.ipynb +++ b/causal-inference-for-the-brave-and-true/17-Predictive-Models-101.ipynb @@ -6,7 +6,7 @@ "source": [ "# 17 - Predictive Models 101\n", "\n", - "We are leaving Part I of this book. That part covered the core about causal inference. Techniques over there are very well known and established. They have survived the test of time. Part I builds the solid foundation we can rely upon. In more technical terms, Part I focuses on defining what is causal inference, what are the biases that prevents correlation from being causation, multiple ways to adjust for those biases (regression, matching and propensity score) and canonical identification strategies (instrumental variables, diff-in-diff and RDD). In summary, Part I focuses on the standard techniques we use to identify the average treatment effect $E[Y_1 - Y_0]$. \n", + "We are leaving Part I of this book. That part covered the core about causal inference. Techniques over there are very well known and established. They have survived the test of time. Part I builds the solid foundation we can rely upon. In more technical terms, Part I focuses on defining what is causal inference, what are the biases that prevent correlation from being causation, multiple ways to adjust for those biases (regression, matching and propensity score) and canonical identification strategies (instrumental variables, diff-in-diff and RDD). In summary, Part I focuses on the standard techniques we use to identify the average treatment effect $E[Y_1 - Y_0]$. \n", " \n", "As we move to Part II, things will get a bit shaky. 
We will cover recent developments in the causal inference literature, its relationship with Machine Learning and applications in the industry. In that sense, we trade-off academic rigour for applicability and empiricism. Some methods presented in Part II don't have a solid theory about why they work. Still, when we try them, they seem to work nevertheless. In that sense, Part II might be more useful for industry practitioners that want to use causal inference in their day to day work, rather than scientists who want to research a fundamental causal relationship in the world. \n", " \n", @@ -27,7 +27,7 @@ " \n", "![img](./data/img/industry-ml/translation.png)\n", " \n", - "What machine learning really does is it learns this mapping function, even if it is a very complicated mapping function. The bottom line is that if you can frame a problem as this mapping from an input to an output, then machine learning might be a good candidate to solve it. As for self-driving cars, you can think of it as not one, but multiple complex prediction problems: predicting the correct angle of the wheel from sensors in the front of the car, predicting the pressure in the brakes from cameras around the car, predicting the pressure in the accelerator from gps data. Solving those (and a tone more) of prediction problems is what makes a self driving car.\n", + "What machine learning really does is it learns this mapping function, even if it is a very complicated mapping function. The bottom line is that if you can frame a problem as this mapping from an input to an output, then machine learning might be a good candidate to solve it. As for self-driving cars, you can think of it as not one, but multiple complex prediction problems: predicting the correct angle of the wheel from sensors in the front of the car, predicting the pressure in the brakes from cameras around the car, predicting the pressure in the accelerator from GPS data. Solving those (and a ton more) of prediction problems is what makes a self-driving car.\n", "\n", "A more technical way of thinking about ML is in term of estimating (possibly very complex) expectation functions: \n", " \n", @@ -35,7 +35,7 @@ "E[Y|X]\n", "$\n", " \n", - "Where $Y$ is what you want to know (translated sentence, diagnostica) and $X$ is what you already know (input sentence, x-ray image). Machine learning is simply a way of estimating that conditional expectation function. \n", + "Where $Y$ is what you want to know (translated sentence, diagnostics) and $X$ is what you already know (input sentence, x-ray image). Machine learning is simply a way of estimating that conditional expectation function. \n", "\n", "\n", "OK… You now understand how prediction can be more powerful than we first thought. Self-driving cars and language translation are cool and all, but they are quite distant, unless you work at a major tech company like Google or Uber. So, to make things more relatable, let's talk in terms of problems almost every company has: customer acquisition (that is getting new customers). \n", @@ -312,7 +312,7 @@ "source": [ "What we need to do now is distinguish the good from the bad customers according to this transactional data. For the sake of simplicity, I'll just sum up all transactions and the CACQ. Keep in mind that this throws under the rug a lot of nuances, like distinguishing customers that are churned from those that are in a break between one purchase and the next.\n", " \n", - "I'll then join this sum, which I call `net_value`, with customer specific features. 
Since my goals is to figure out which customer will be profitable **before** deciding to engage with them, you can only use data prior to the acquisition period. In our case, these features are age, region and income, which are all available at another `csv` file." + "I'll then join this sum, which I call `net_value`, with customer specific features. Since my goal is to figure out which customer will be profitable **before** deciding to engage with them, you can only use data prior to the acquisition period. In our case, these features are age, region and income, which are all available at another `csv` file." ] }, { @@ -423,15 +423,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Good! Our task is becoming less abstract. We wish to identify the profitable customers (`net_value > 0`) from the non profitable ones. Let's try different things and see which one works better. But before that, we need to take a quick look into Machine Learning (feel free skip if you know how ML works) \n", + "Good! Our task is becoming less abstract. We wish to identify the profitable customers (`net_value > 0`) from the non profitable ones. Let's try different things and see which one works better. But before that, we need to take a quick look into Machine Learning (feel free to skip if you know how ML works) \n", " \n", "## Machine Learning Crash Course\n", " \n", - "For our intent and purpose, we can think of ML as an overpowered way of making predictions. For it to work, you need some data with labels or the ground truth of what you are predicting. Then, you can train a ML model on that data and use it to make predictions where the ground truth is not yet known. The image below exemplifies the typical machine learning flow.\n", + "For our intent and purpose, we can think of ML as an overpowered way of making predictions. For it to work, you need some data with labels or the ground truth of what you are predicting. Then, you can train an ML model on that data and use it to make predictions where the ground truth is not yet known. The image below exemplifies the typical machine learning flow.\n", " \n", "![img](./data/img/industry-ml/ml-flow.png)\n", " \n", - "First, you need data where the ground truth, `net_value` here, is known. Then, you train a ML model that will use features - region, income and age in our case - to predict `net_value`. This training or estimating step will produce a machine learning model that can be used to make predictions about `net_value` when you don't yet have the true `net_value`. This is shown in the left part of the image. You have some new data where you have the features (region, income and age) but you don't know the `net_value` yet. So you pass this data through your model and it provides you with `net_value` predictions. \n", + "First, you need data where the ground truth, `net_value` here, is known. Then, you train an ML model that will use features - region, income and age in our case - to predict `net_value`. This training or estimating step will produce a machine learning model that can be used to make predictions about `net_value` when you don't yet have the true `net_value`. This is shown in the left part of the image. You have some new data where you have the features (region, income and age) but you don't know the `net_value` yet. So you pass this data through your model and it provides you with `net_value` predictions. 
\n", " \n", "If you are more into technical notation, another way of understanding machine learning is in term of estimating a conditional expectation $E[Y|X]$, where $Y$ is called the target variable or outcome and $X$ is called the feature variables. ML is just a powerful way of obtaining $\\hat{E}[Y|X]$, usually by optimizing some error or loss function.\n", " \n", @@ -549,7 +549,7 @@ " \n", "Moving forward, what is the next simplest thing we can think of? One idea is taking our features and seeing if they alone distinguish the good from the bad customers. Take `income`, for instance. It's intuitive that richer customers should be more profitable, right? What if we do business only with the top richest customers? Would that be a good idea?\n", " \n", - "To figure this out we can partition our data into income quantiles (a quantile has the propriety of dividing the data into partitions of equal size, that's why I like them). Then, for each income quantile, let's compute the average net value. The hope here is that, although the average net value in negative, $E[NetValue]<0$, there might be some subpopulation defined by income where the net value is positive, $E[NetValue|Income=x]>0$, probably, higher income levels." + "To figure this out we can partition our data into income quantiles (a quantile has the property of dividing the data into partitions of equal size, that's why I like them). Then, for each income quantile, let's compute the average net value. The hope here is that, although the average net value in negative, $E[NetValue]<0$, there might be some subpopulation defined by income where the net value is positive, $E[NetValue|Income=x]>0$, probably, higher income levels." ] }, { @@ -719,7 +719,7 @@ " \n", "If you are willing to do even better, we can now use the power of machine learning. Keep in mind that this might add tones of complexity to the whole thing and usually only marginal gains. But, depending on the circumstances, marginal gains can be translated into huge piles of money and that's why machine learning is so valuable these days.\n", " \n", - "Here, I'll use a Gradient Boosting model. It's a fairly complicated model to explain, but one that is very simple to use. For our purpose, we don't need to get into the details of how it works. Instead, just remember what we've seen in our ML Crash course: a ML model is a super powerful predictive machine that has some complexity parameters. It's a tool to estimate $E[Y|X]$. The more complex, the more powerful the model becomes. However, if the complexity is too high, the model will overfit, learn noise and not generalize well to unseen data. Hence, we need to use cross validation here to see if the model has the right complexity. \n", + "Here, I'll use a Gradient Boosting model. It's a fairly complicated model to explain, but one that is very simple to use. For our purpose, we don't need to get into the details of how it works. Instead, just remember what we've seen in our ML Crash course: an ML model is a super powerful predictive machine that has some complexity parameters. It's a tool to estimate $E[Y|X]$. The more complex, the more powerful the model becomes. However, if the complexity is too high, the model will overfit, learn noise and not generalize well to unseen data. Hence, we need to use cross validation here to see if the model has the right complexity. \n", " \n", "Now, we need to ask, how can good predictions be used to improve upon our simple region policy to identify and engage with profitable customers? 
I think there are two main improvements that we can make here. First, you will have to agree that going through all the features looking for one that distinguishes good from bad customers is a cumbersome process. Here, we had only 3 of them (age, income and region), so it wasn't that bad, but imagine if we had more than 100. Also, you have to be careful with issues of [multiple testing](https://en.wikipedia.org/wiki/Multiple_comparisons_problem) and false positive rates. The second reason is that it is probably the case that you need more than one feature to distinguish between customers. In our example, we believe that features other than region also have some information on customer profitability. Sure, when we looked at income alone it didn't give us much, but what about income in those regions that are just barely unprofitable? Maybe, in those regions, if we focus only on richer customers, we could still get some profit. Technically speaking, we are saying that $E[NetValue|Region, Income, Age]$ is a better predictor of `NetValue` than $E[NetValue|Region]$. This makes a lot of sense. Using more information about income and age on top of region should allow us to predict net value better. \n", " \n", @@ -954,7 +954,7 @@ "source": [ "Here, notice how there are model bands where the net value is super negative, while there are also bands where it is very positive. Also, there are bands where we don't know exactly if the net value is negative or positive. Finally, notice how they have an upward trend, from left to right. Since we are predicting net value, it is expected that the prediction will be proportional to what it predicts.\n", "\n", - "Now, to compare this policy using a machine learning model with the one using only the regions we can also show the histogram of net gains, along with the total net value in the test set." + "Now, using a model policy that selects customers where $\\hat{E}[NetValue|Region, Income, Age] > 0$, and comparing it against the one using only the regions, we can show their histogram of net gains, along with their average net value in the test set." ] }, { @@ -1014,7 +1014,7 @@ " \n", "Here, for the next example, suppose your decision is not just who to do business with, but how much marketing costs you should invest in each customer. And for the sake of the example, assume that you are competing with other firms and whoever spends more on marketing in a particular customer wins that customer (much like a bidding mechanism). In that case, it makes sense to invest more in highly profitable customers, less in marginally profitable customers and not at all in non profitable customers.\n", " \n", - "One way to do that is to discritize your predictions into bands. We've done this previously for the purpose of model comparison, but here we'll do it for decision making. Let's create 20 bands. We can think of those as quantiles or equal size groups. The first band will contain the 5% less profitable customers *according to our predictions*, the second band will contain from the 5% to the 10% less profitable and so on. The last band, 20, will contain the most profitable customers.\n", + "One way to do that is to discretize your predictions into bands. We've done this previously for the purpose of model comparison, but here we'll do it for decision making. Let's create 20 bands. We can think of those as quantiles or equal size groups. 
The first band will contain the 5% less profitable customers *according to our predictions*, the second band will contain from the 5% to the 10% less profitable and so on. The last band, 20, will contain the most profitable customers.\n", " \n", "Notice that the binning too has to be estimated on the training set and applied on the test set! For this reason, we will compute the bins using `pd.qcut` on the training set. To actually do the binning, we will use `np.digitize`, passing the bins that were precomputed on the training set." ] diff --git a/causal-inference-for-the-brave-and-true/18-Heterogeneous-Treatment-Effects-and-Personalization.ipynb b/causal-inference-for-the-brave-and-true/18-Heterogeneous-Treatment-Effects-and-Personalization.ipynb index c919a3e..55533d4 100644 --- a/causal-inference-for-the-brave-and-true/18-Heterogeneous-Treatment-Effects-and-Personalization.ipynb +++ b/causal-inference-for-the-brave-and-true/18-Heterogeneous-Treatment-Effects-and-Personalization.ipynb @@ -70,7 +70,7 @@ " \n", "![img](./data/img/causal-model/customers.png)\n", " \n", - "To do that, you have to segment your customers. You have created groups that respond differently to your treatment. For example, you want to find customers that respond well to discounts and customers who respond poorly to it. Well, the customer's response to a treatment is given by the conditional treatment effect $\\frac{\\delta Y}{ \\delta T}$. So, we could somehow estimate that for each customer, we could group together those that respond great to the treatment (high treatment effect) and those that don't respond very well to it. If we did that, we would split the customers space somewhat like the following image.\n", + "To do that, you have to segment your customers. You have to create groups that respond differently to your treatment. For example, you want to find customers that respond well to discounts and customers who respond poorly to it. Well, the customer's response to a treatment is given by the conditional treatment effect $\\frac{\\delta Y}{ \\delta T}$. So, if we could somehow estimate that for each customer, we could group together those that respond great to the treatment (high treatment effect) and those that don't respond very well to it. If we did that, we would split the customers space somewhat like the following image.\n", " \n", "![img](./data/img/causal-model/elast-partition.png)\n", " \n", @@ -84,7 +84,7 @@ " \n", "![img](./data/img/causal-model/elasticity.png)\n", " \n", - "Of course, we can't see those individual slope coefficients. For us to see the individual slopes, we would have to observe each day under two different prices and calculate how the sales changes for each of those prices.\n", + "Of course, we can't see those individual slope coefficients. For us to see the individual slopes, we would have to observe each day under two different prices and calculate how the sales change for each of those prices.\n", " \n", "$$\n", "\\frac{\\delta Y_i}{ \\delta T_i} \\approx \\frac{Y(T_i) - Y(T_i + \\epsilon)}{T_i - (T_i + \\epsilon)}\n", @@ -99,7 +99,7 @@ "source": [ "## Predicting Sensitivity\n", " \n", - "We got ourselves into a complicated situation here. We've agreed that we need to predict $\\frac{\\delta Y_i}{ \\delta T_i}$, which is sadly not observable. So it's not like we could use a ML algorithm and plug that as it's target. But maybe we don't need to observe $\\frac{\\delta Y_i}{ \\delta T_i}$ in order to predict it\n", + "We got ourselves into a complicated situation here. 
We've agreed that we need to predict $\\frac{\\delta Y_i}{ \\delta T_i}$, which is sadly not observable. So it's not like we could use an ML algorithm and plug that as it's target. But maybe we don't need to observe $\\frac{\\delta Y_i}{ \\delta T_i}$ in order to predict it\n", " \n", "Here is an idea. What if we use linear regression?\n", " \n", @@ -135,7 +135,7 @@ " \n", "We are finally getting somewhere. The model above allows us to make a sensitivity prediction for each of our entities. With those predictions we can make more useful groups. We can take the units with high predicted sensitivity and group them together. We can do the same with the ones that have low predicted sensitivity. Finally, with our sensitivity predictions, we can group entities by how much we think they will respond to the treatment.\n", " \n", - "Enough of theory for now. It's time to walk through an example of how to make this sort of sensitivity model. Let's consider our ice cream example. Each unit $i$ is a day. For each day, we know if it's a weekday or not, what was the cost we had to make the ice cream (you can think of the cost as a proxy for quality) and the average temperature for that day. Those will be our feature space $X$. Then, we have our treatment, price, and our outcome, the number of ice cream sold. For this example, we will consider that the treatment is randomized, just so that we don't have to worry about bias for now." + "Enough of theory for now. It's time to walk through an example of how to make this sort of sensitivity model. Let's consider our ice cream example. Each unit $i$ is a day. For each day, we know if it's a weekday or not, what was the cost we had to make the ice cream (you can think of the cost as a proxy for quality) and the average temperature for that day. Those will make up our feature space $X$. Then, we have our treatment, price, and our outcome, the number of ice creams sold. For this example, we will consider that the treatment is randomized, just so that we don't have to worry about bias for now." ] }, { @@ -483,7 +483,7 @@ " \n", "where $\\hat{y}$ is given by our model's predictions. In words, I'll make two predictions with my models: one passing the original data and another passing the original data but with the treatment incremented by one unit. The difference between those predictions is my CATE prediction. \n", " \n", - "Below, you can see a function for doing that. Since we've used the train set to estimate our model, we will now make predictions on the test set. First, let's use our first, ATE model, $m1$." + "Below, you can see a function for doing that. Since we've used the train set to estimate our model, we will now make predictions on the test set. First, let's use our first ATE model, $m1$." ] }, { @@ -709,9 +709,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Notice how the predictions are numbers that go from something like -9 to something 1. Those are not predictions of the sales column, which is in the order of the hundreds. Rather, **it's a prediction of how much sales would change if we increased price by one unit**. Right off of the bet, we can see some strange numbers. For example, take a look at day 4764. It's predicting a positive sensitivity. In other words, we are predicting that sales will increase if we increase the price of ice cream. This doesn't appeal to our economic sense. It's probably the case that the model is doing some weird extrapolation on that prediction. Fortunately, you don't have to worry too much about it. 
Remember that our ultimate goal is to segment the units by how sensitive they are to the treatment. It's **not** to come up with the most accurate sensitivity prediction ever. For our main goal, it suffices if the sensitivity predictions orders the units according to how sensitive they are. In other words, even if positive sensitivity predictions like 1.1, or 0.5 don't make much sense, all we need is that the ordering is correct, that is, we want the units with prediction 1.1 to be less impacted by price increase than units with predictions 0.5. \n", + "Notice how the predictions are numbers that go from something like -9 to something like 1. Those are not predictions of the sales column, which is in the order of the hundreds. Rather, **it's a prediction of how much sales would change if we increased price by one unit**. Right off of the bat, we can see some strange numbers. For example, take a look at day 4764. It's predicting a positive sensitivity. In other words, we are predicting that sales will increase if we increase the price of ice cream. This doesn't appeal to our economic sense. It's probably the case that the model is doing some weird extrapolation on that prediction. Fortunately, you don't have to worry too much about it. Remember that our ultimate goal is to segment the units by how sensitive they are to the treatment. It's **not** to come up with the most accurate sensitivity prediction ever. For our main goal, it suffices if the sensitivity predictions orders the units according to how sensitive they are. In other words, even if positive sensitivity predictions like 1.1, or 0.5 don't make much sense, all we need is that the ordering is correct, that is, we want the units with prediction 1.1 to be less impacted by price increase than units with predictions 0.5. \n", " \n", - "Ok, we have our sensitivity or CATE model. But there is still a lurking question: how do they compare to a ML predictive model? Let's try that now. We will use a machine learning algorithm that uses price, temperature, weekday and cost as features $X$ and tries to predict ice cream sales." + "Ok, we have our sensitivity or CATE model. But there is still a lurking question: how do they compare to an ML predictive model? Let's try that now. We will use a machine learning algorithm that uses price, temperature, weekday and cost as features $X$ and tries to predict ice cream sales." ] }, { @@ -886,7 +886,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Next, we need to compare which of these two segmentations is the best one. I might be getting ahead of myself now, since we will only look at CATE model evaluation in the next chapter. But I feel I can give you a taste of what it looks like. One very simple way to check how good are those partition schemas - and by good I mean useful - is to plot a regression line of prices on sales for each partition. We can achieve that easily with Seaborn's `regplot` combined with `FacetGrid`. \n", + "Next, we need to compare which of these two segmentations is the best one. I might be getting ahead of myself now, since we will only look at CATE model evaluation in the next chapter. But I feel I can give you a taste of what it looks like. One very simple way to check how good those partition schemas are - and by good I mean useful - is to plot a regression line of prices on sales for each partition. We can achieve that easily with Seaborn's `regplot` combined with `FacetGrid`. \n", " \n", "Below, we can see the partitions made using the sensitivity predictions. 
Remember that all of this is done in the test set."
   ]
  }
diff --git a/causal-inference-for-the-brave-and-true/19-Evaluating-Causal-Models.ipynb b/causal-inference-for-the-brave-and-true/19-Evaluating-Causal-Models.ipynb
index 7af77e1..88d6ce1 100644
--- a/causal-inference-for-the-brave-and-true/19-Evaluating-Causal-Models.ipynb
+++ b/causal-inference-for-the-brave-and-true/19-Evaluating-Causal-Models.ipynb
@@ -6,7 +6,7 @@
    "source": [
    "# 19 - Evaluating Causal Models\n",
    "\n",
-    "In the vast majority of material about causality, researchers use synthetic data to check if their methods are any good. Much like we did in the When Prediction Fails chapter, they generate data on both $Y_{0i}$ and $Y_{1i}$ so that they can check if their model is correctly capturing the treatment effect $Y_{1i} - Y_{0i}$. That's fine for academic purposes, but in the real world, we don't have that luxury. When applying these techniques in the industry, we'll be asked time and again to prove why our model is better, why should it replace the current one in production or why it won't fail miserably. This is so crucial that it's beyond my comprehension why we don't see any material whatsoever explaining how we should evaluate causal inference models with real data. \n",
+    "In the vast majority of material about causality, researchers use synthetic data to check if their methods are any good. Much like we do in the When Prediction Fails appendix chapter, they generate data on both $Y_{0i}$ and $Y_{1i}$ so that they can check if their model is correctly capturing the treatment effect $Y_{1i} - Y_{0i}$. That's fine for academic purposes, but in the real world, we don't have that luxury. When applying these techniques in the industry, we'll be asked time and again to prove why our model is better, why it should replace the current one in production or why it won't fail miserably. This is so crucial that it's beyond my comprehension why we don't see any material whatsoever explaining how we should evaluate causal inference models with real data. \n",
    "\n",
    "As a consequence, data scientists that want to apply causal inference models have a really hard time convincing management to trust them. The approach they take is one of showing how sound the theory is and how careful they've been while training the model. Unfortunately, in a world where train-test split paradigm is the norm, that just won't cut it. The quality of your model will have to be grounded on something more concrete than a beautiful theory. Think about it. Machine learning has only achieved its huge success because predictive model validation is very straightforward. There is something reassuring about seeing that the predictions match what really happened. \n",
    "\n",
@@ -14,7 +14,7 @@
    "\n",
    "![img](./data/img/evaluate-causal-models/sneak.png)\n",
    "\n",
-    "This is a very very very hard thing to wrap our heads around and it took me years to find something close to an answer. Is not a definitive one, but it works in practice and it has that concreteness, which I hope will approach causal inference from a train-test paradigm similar to the one we have with machine learning. The trick is to use aggregate measurements of sensitivity. Even if you can't estimate sensitivity individually, you can do it for a group and that is what we will leverage here."
+    "This is a very very very hard thing to wrap our heads around and it took me years to find something close to an answer. 
It's not a definitive one, but it works in practice and it has that concreteness, which I hope will bring causal inference closer to the train-test paradigm we have with machine learning. The trick is to use aggregate measurements of sensitivity. Even if you can't estimate sensitivity individually, you can do it for a group and that is what we will leverage here."
   ]
  },
  {
@@ -418,7 +418,7 @@
    "\n",
    "Now that we have our predictions, we need to evaluate how good they are. And remember, we can't observe sensitivity, so there isn't a simple ground truth we can compare against. Instead, let's think back to what we want from our sensitivity models. Perhaps that will give us some insights into how we should evaluate them. \n",
    "\n",
-    "The idea of making treatment sensitivity models came from the necessity of finding which units are more sensitive to the treatment and which are less. It came from a desire for personalisation. Maybe a marketing campaign is very effective in only one segment of the population. Maybe discounts only work for some type of customers. A good causal model should help us find which customers will respond better and worse to a proposed treatment. They should be able to separate units into how elastic or sensitive they are to the treatment. In our ice cream example, the model should be able to figure out in which days are people willing to spend more on ice cream or, in which days is the price sensitivity less negative. \n",
+    "The idea of making treatment sensitivity models came from the necessity of finding which units are more sensitive to the treatment and which are less. It came from a desire for personalisation. Maybe a marketing campaign is very effective in only one segment of the population. Maybe discounts only work for some types of customers. A good causal model should help us find which customers will respond better and worse to a proposed treatment. It should be able to separate units by how elastic or sensitive they are to the treatment. In our ice cream example, the model should be able to figure out in which days people are willing to spend more on ice cream or in which days the price sensitivity is less negative. \n",
    "\n",
    "If that is the goal, it would be very useful if we could somehow order units from more sensitive to less sensitive. Since we have the predicted sensitivity, we can order the units by that prediction and hope it also orders them by the real sensitivity. Sadly, we can't evaluate that ordering on a unit level. But, what if we don't need to? What if, instead, we evaluate groups defined by the ordering? If our treatment is randomly distributed (and here is where randomness enters), estimating sensitivity for a group of units is easy. All we need is to compare the outcome between the treated and untreated.\n",
    "\n",
@@ -561,7 +561,7 @@
    "(\widehat{y'(t)}_1, \widehat{y'(t)}_2, \widehat{y'(t)}_3,..., \widehat{y'(t)}_N)\n",
    "$$\n",
    "\n",
-    "This is a very interesting sequence in terms of model evaluation because we can make preferences statements about it. First, a model is better to the degree that\n",
+    "This is a very interesting sequence in terms of model evaluation because we can make preference statements about it. First, a model is better to the degree that\n",
    "\n",
    "$\hat{y}'(t)_k > \hat{y}'(t)_{k+a}$\n",
    "\n",
@@ -656,7 +656,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Interpreting a Cumulative Sensitivity Curve can be a bit challenging, but here is how I see it. 
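A quick aside before the interpretation: the group-level estimate described above can be sketched in a few lines. For our continuous treatment, one simple option is the slope of a regression of the outcome on the treatment within each group, which is only a valid sensitivity estimate because the prices were randomized. The dataframe and column names below are hypothetical, not the chapter's actual variables.

```python
import numpy as np
import pandas as pd

def band_sensitivity(df, treatment, outcome):
    # slope of a simple regression of outcome on treatment: Cov(T, Y) / Var(T)
    t, y = df[treatment], df[outcome]
    return np.cov(t, y)[0, 1] / t.var()

# hypothetical usage: bands defined by ordering the model's predicted sensitivity
# bands = pd.qcut(test_df["sens_pred"], q=4)
# test_df.groupby(bands).apply(band_sensitivity, treatment="price", outcome="sales")
```

Comparing these band-level estimates against the ordering of the predictions is exactly the kind of aggregate check this chapter relies on.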
Again, it might be easier to think about the binary case. The X axis of the curve represents how many samples are we treating. Here, I've normalized the axis to be the proportion of the dataset, so .4 means we are treating 40% of the samples. The Y axis is the sensitivity we should expect at that many samples. So, if a curve has value -1 at 40%, it means that the sensitivity of the top 40% units is -1. Ideally, we want the highest sensitivity for the largest possible sample. An ideal curve then would start high up on the Y axis and descend very slowly to the average sensitivity, representing we can treat a high percentage of units while still maintaining an above average sensitivity. \n",
+    "Interpreting a Cumulative Sensitivity Curve can be a bit challenging, but here is how I see it. Again, it might be easier to think about the binary case. The X axis of the curve represents how many samples we are treating. Here, I've normalized the axis to be the proportion of the dataset, so .4 means we are treating 40% of the samples. The Y axis is the sensitivity we should expect at that many samples. So, if a curve has value -1 at 40%, it means that the sensitivity of the top 40% of units is -1. Ideally, for this case, we want the least negative sensitivity for the largest possible sample. An ideal curve then would start high up on the Y axis and descend very slowly to the average sensitivity, representing that we can treat a high percentage of units while still maintaining a low-magnitude sensitivity.\n",
    "\n",
    "Needless to say, none of our models gets even close to an ideal sensitivity curve. The random model `rand_m` oscillates around the average sensitivity and never goes too far away from it. This means that the model can't find groups where the sensitivity is different from the average one. As for the predictive model `pred_m`, it appears to be reversely ordering sensitivity, because the curve starts below the average sensitivity. Not only that, it also converges to the average sensitivity pretty quickly, at around 50% of the samples. Finally, the causal model `sensitivity_m` seems more interesting. It has this weird behavior at first, where the cumulative sensitivity increases away from the average, but then it reaches a point where we can treat about 75% of the units while keeping a pretty decent sensitivity of almost 0. This is probably happening because this model can identify the very low sensitivity (high price sensitivity) days. Hence, provided we don't increase prices on those days, we are allowed to do it for most of the sample (about 75%), while still having a low price sensitivity. \n",
    "\n",
@@ -760,7 +760,7 @@
    "source": [
    "Now it is very clear that the causal model (`sensitivity_m`) is much better than the other two. It diverges much more from the random line than both `rand_m` and `pred_m`. Also, notice how the actual random model follows very closely the theoretical random model. The difference between both is probably just random noise. \n",
    "\n",
-    "With that, we covered some really nice ideas on how to evaluate causal models. That alone is a huge deed. We managed to evaluate how good are models in ordering sensitivity even though we didn't have a ground truth to compare against. There is only one final thing missing, which to include a confidence interval around those measurements. After all, we are not barbarians, are we?\n",
+    "With that, we covered some really nice ideas on how to evaluate causal models. That alone is a huge deed. 
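For concreteness, here is a minimal sketch of how a curve like the one discussed above could be computed, reusing the regression slope as the group-level sensitivity estimate. It is a sketch under assumed names (`test_df`, `sens_pred`, `price`, `sales`), not the chapter's exact implementation.

```python
import numpy as np

def cumulative_sensitivity_curve(df, prediction, treatment, outcome, min_size=30, steps=40):
    # order units from highest to lowest predicted sensitivity and, for a growing
    # top-k subset, estimate the sensitivity as the slope of outcome on treatment
    ordered = df.sort_values(prediction, ascending=False).reset_index(drop=True)
    n = len(ordered)
    sizes = np.linspace(min_size, n, steps).astype(int)
    curve = []
    for k in sizes:
        top = ordered.iloc[:k]
        curve.append(np.cov(top[treatment], top[outcome])[0, 1] / top[treatment].var())
    return sizes / n, np.array(curve)

# hypothetical usage:
# frac, sens = cumulative_sensitivity_curve(test_df, "sens_pred", "price", "sales")
```

Plotting the returned fractions against the estimated slopes gives a curve of this kind, with the right-most point equal to the average sensitivity of the whole sample.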
We managed to evaluate how good our models are at ordering sensitivity even though we didn't have a ground truth to compare against. There is only one final thing missing, which is to include a confidence interval around those measurements. After all, we are not barbarians, are we?\n",
    "\n",
    "![img](./data/img/evaluate-causal-models/uncivilised.png)\n",
    "\n",
diff --git a/causal-inference-for-the-brave-and-true/20-Plug-and-Play-Estimators.ipynb b/causal-inference-for-the-brave-and-true/20-Plug-and-Play-Estimators.ipynb
index 7e356bc..7129851 100644
--- a/causal-inference-for-the-brave-and-true/20-Plug-and-Play-Estimators.ipynb
+++ b/causal-inference-for-the-brave-and-true/20-Plug-and-Play-Estimators.ipynb
@@ -6,7 +6,7 @@
    "source": [
    "# 20 - Plug-and-Play Estimators\n",
    " \n",
-    "So far, we've seen how to debias our data in the case where the treatment is not randomly assigned, which results in confounding bias. That helps us with the identification problem in causal inference. In other words, once the units are exchangeable, or $ Y(0), Y(1) \perp X$, it becomes possible to learn the treatment effect. But we are far from done. \n",
+    "So far, we've seen how to debias our data in the case where the treatment is not randomly assigned, which results in confounding bias. That helps us with the identification problem in causal inference. In other words, once the units are exchangeable, or $ Y(0), Y(1) \perp T|X$, it becomes possible to learn the treatment effect. But we are far from done. \n",
    " \n",
    "Identification means that we can find the average treatment effect. In other words, we know how effective a treatment is on average. Of course this is useful, as it helps us to decide if we should roll out a treatment or not. But we want more than that. We want to know if there are subgroups of units that respond better or worse to the treatment. That should allow for a much better policy, one where we only treat the ones that will benefit from it.\n",
    " \n",
@@ -19,7 +19,7 @@
    "\tau_i = Y_i(1) − Y_i(0),\n",
    "$\n",
    " \n",
-    "or, the continuous treatment case, $\tau_i = \partial Y(t)$, where $t$ is the treatment variable. Of course, we can never observe the individual treatment effect, because we only get to see the one of potential outcomes\n",
+    "or, in the continuous treatment case, $\tau_i = \partial Y(t)$, where $t$ is the treatment variable. Of course, we can never observe the individual treatment effect, because we only get to see one of the potential outcomes\n",
    " \n",
    "$\n",
    "Y^{obs}_i(t)= \n",
@@ -41,7 +41,7 @@
    "\tau(x) = E[Y_i(1) − Y_i(0)|X] = E[\tau_i|X]\n",
    "$\n",
    " \n",
-    "In Part I of this book, we've focused mostly on the ATE. Now, we are interested in the CATE. The CATE is useful for personalising a decision making process. For example, if you have a drug as the treatment $t$, you want to know which type of patients are more responsive to the drug (higher CATE) and if there are some types of patient with a negative response (CATE < 0). \n",
+    "In Part I of this book, we've focused mostly on the ATE. Now, we are interested in the CATE. The CATE is useful for personalising a decision making process. For example, if you have a drug as the treatment $t$, you want to know which types of patients are more responsive to the drug (higher CATE) and if there are some types of patients with a negative response (CATE < 0). 
\n", " \n", "We've seen how to estimate the CATE using a linear regression with interactions between the treatment and the features\n", " \n", @@ -57,7 +57,7 @@ " \n", "Still, the linear models have some drawbacks. The main one being the linearity assumption on $X$. Notice that you don't even care about $\\beta_2$ on this model. But if the features $X$ don't have a linear relationship with the outcome, your estimates of the causal parameters $\\beta_1$ and $\\beta_3$ will be off. \n", " \n", - "It would be great if we could replace the linear model by a more flexible machine learning model. We could even plug the treatment as a feature to a ML model, like boosted trees or a neural network\n", + "It would be great if we could replace the linear model by a more flexible machine learning model. We could even plug the treatment as a feature to an ML model, like boosted trees or a neural network\n", " \n", "$\n", "y_i = M(X_i, T_i) + e_i\n", @@ -99,7 +99,7 @@ " \n", "This seems very odd, because you are saying that the effect of the email can be a negative number, but bear with me. If we do a little bit of math, we can see that, on average or in expectation, this transformed target will be the treatment effect. This is nothing short of amazing. What I'm saying is that by applying this somewhat wacky transformation, I get to estimate something that I can't even observe. \n", " \n", - "To understand that, we need a bit of math. Because of random assignment, we have that $T \\perp Y(0), Y(1)$, which is our old unconfoundedness friend. That implies that $E[T, Y(t)]=E[T]*E[Y(t)]$, which is the definition of independence.\n", + "To understand that, we need a bit of math. Because of random assignment, we have that $T \\perp Y(0), Y(1)$, which is our old unconfoundedness friend. That implies that $E[T, Y(t)]=E[T]*E[Y(t)]$, which is a consequence of independence.\n", "\n", "\n", "Also, we know that\n", @@ -151,7 +151,7 @@ "\\end{align}\n", "$\n", " \n", - "As always, I think this will become much more concrete with an example. Again, consider the investment emails we've sent trying to make people invest more. The outcome variable the binary (invested vs didn't invest) `converted`." + "As always, I think this will become much more concrete with an example. Again, consider the investment emails we've sent trying to make people invest more. The outcome variable is the binary (invested vs didn't invest) `converted`." ] }, { @@ -746,7 +746,7 @@ "\\end{align}\n", "$\n", " \n", - "Bare in mind that this only works when the treatment is randomized. For non randomized treatment, we have to replace $\\bar{T}$ by $M(X_i)$, where $M$ is a model that estimates $E[T_i|X_i=x]$. \n", + "Bear in mind that this only works when the treatment is randomized. For non randomized treatment, we have to replace $\\bar{T}$ by $M(X_i)$, where $M$ is a model that estimates $E[T_i|X_i=x]$. \n", " \n", "$\n", "Y^*_i = (Y_i- \\bar{Y})\\dfrac{(T_i - M(T_i))}{(T_i - M(T_i))^2}\n", @@ -938,7 +938,7 @@ " \n", "### Non Linear Treatment Effects\n", " \n", - "Having talked about the continuous case, there is still an elephant in the room we need to adress. We've assumed a linearity on the treatment effect. However, that is very rarely a reasonable assumption. Usually, treatment effects saturate in one form or another. 
In our example, it's reasonable to think that demand will go down faster at the first units of price increase, but then it will fall slowlier.\n",
+    "Having talked about the continuous case, there is still an elephant in the room we need to address. We've assumed linearity of the treatment effect. However, that is very rarely a reasonable assumption. Usually, treatment effects saturate in one form or another. In our example, it's reasonable to think that demand will go down faster at the first units of price increase, but then it will fall more slowly.\n",
    " \n",
    "![img](./data/img/plug-and-play-estimators/non-linear-case.png)\n",
    " \n",
@@ -978,7 +978,7 @@
    " \n",
    "The things I've written here are mostly stuff from my head. I've learned them through experience. This means that they have **not** passed the academic scrutiny that good science often goes through. Instead, notice how I'm talking about things that work in practice, but I don't spend too much time explaining why that is the case. It's a sort of science from the streets, if you will. However, I am putting this up for public scrutiny, so, by all means, if you find something preposterous, open an issue and I'll address it to the best of my efforts.\n",
    " \n",
-    "Most of this chapter draws from Susan Atheys' and Guido W. Imbens' paper, *Machine Learning Methods for Estimating Heterogeneous Causal Effects*. Some material about target transformation can also be found on Pierre Gutierrez' and Jean-Yves G´erardy's paper, *Causal Inference and Uplift Modeling: A review of the literature*. Note that these papers only cover the binary treatment case. Another review of causal models for CATE estimation that references the F-Learner is *Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning*, by K¨unzel et al, 2019. \n",
+    "Most of this chapter draws from Susan Athey's and Guido W. Imbens' paper, *Machine Learning Methods for Estimating Heterogeneous Causal Effects*. Some material about target transformation can also be found in Pierre Gutierrez's and Jean-Yves Gérardy's paper, *Causal Inference and Uplift Modeling: A review of the literature*. Note that these papers only cover the binary treatment case. Another review of causal models for CATE estimation that references the F-Learner is *Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning*, by Künzel et al., 2019. \n",
    "\n",
    " \n",
    "## Contribute\n",