@@ -131,7 +131,7 @@
"source": [
"One way to avoid this is to control for constant levels of ability when measuring the effect of education on wage. We could do that by including ability in our linear regression model. However, we don't have good measurements of ability. The best we have are some very questionable proxies, like IQ.\n",
"\n",
"But all is not lost. Here is where Instrumental Variables enters the picture. The idea of IV is to find another variable that causes the treatment and it is only correlated with the outcome through the treatment. Another way of saying this is that this instrument $Z_i$ is uncorrelated with $Y_0$, but it is correlated with $T$. This is sometimes referred to as the exclusion restriction."
"But all is not lost. Here is where Instrumental Variables enters the picture. The idea of IV is to find another variable that causes the treatment and it is only correlated with the outcome through the treatment. Another way of saying this is that this instrument $Z_i$ is uncorrelated with $Y_0$ and $Y_1$, but it is correlated with $T$. This is sometimes referred to as the exclusion restriction."
]
},
{
@@ -260,7 +260,7 @@
"\\kappa = \\dfrac{\\mathrm{Cov}(Y_i, Z_i)/V(Z_i)}{\\mathrm{Cov}(T_i, Z_i)/V(Z_i)} = \\dfrac{\\text{Reduced Form}}{\\text{1st Stage}} \n",
"$\n",
"\n",
"Notice that both the numerator and the denominator are regression coefficients (covariances divided by variances). The numerator is the result from the regression of Y on Z. In other words, it's the \"impact\" of Z on Y. Remember that this is not to say that Z causes Y, since we have a requirement that Z impacts Y only through T. Rather, it is only capturing how big is this effect of Z on Y through T. This numerator is so famous it has its own name: the reduced form coefficient.\n",
"Notice that both the numerator and the denominator are regression coefficients (covariances divided by variances). The numerator is the result from the regression of Y on Z. In other words, it's the \"impact\" of Z on Y. Remember that this is not to say that Z causes Y, since we have a requirement that Z impacts Y only through T. Rather, it is only capturing how big this effect of Z on Y through T is. This numerator is so famous it has its own name: the reduced form coefficient.\n",
"\n",
"The denominator is also a regression coefficient. This time, it is the regression of T on Z. This regression captures what is the impact of Z on T and it is also so famous that it is called the 1st Stage coefficient. \n",
"\n",
@@ -288,7 +288,7 @@
"\n",
"Still, we do have some interesting examples of instruments to make things a little more concrete. We will again try to estimate the effect of education on wage. To do so, we will use the person's quarter of birth as the instrument Z.\n",
"\n",
"This idea takes advantage of US compulsory attendance law. Usually, they state that a kid must have turned 6 years by January 1 of the year they enter school. For this reason, kids that are born at the beginning of the year will enter school at an older age. Compulsory attendance law also requires students to be in school until they turn 16, at which point they are legally allowed to drop out. The result is that people born later in the year have, on average, more years of education than those born in the beginning of the year.\n",
"This idea takes advantage of US compulsory attendance law. Usually, they state that a kid must have turned 6 by January 1 of the year they enter school. For this reason, kids that are born at the beginning of the year will enter school at an older age. Compulsory attendance law also requires students to be in school until they turn 16, at which point they are legally allowed to drop out. The result is that people born later in the year have, on average, more years of education than those born in the beginning of the year.\n",
"\n",
"![img](./data/img/iv/qob.png)\n",
"\n",
@@ -305,13 +305,13 @@
"\n",
"Instrumental Variable assumptions can now be rewritten as follows\n",
"\n",
"1. $T_{0i}, T_{1i} \\perp Z_i $ and $Y_i(T_{1i},1), Y_i(T_{0i},0) \\perp Z_i $. This is the independence Assumption. This says that the instrument is as good as randomly assigned. In other words, Z, the instrument, is not correlated with the potential treatments, which is the same as saying that people in different instrument groups are comparable. \n",
"1. $T_{0i}, T_{1i} \\perp Z_i $ and $Y_i(T_{1i},1), Y_i(T_{0i},0) \\perp Z_i $. This is the Independence Assumption. This says that the instrument is as good as randomly assigned. In other words, Z, the instrument, is not correlated with the potential treatments, which is the same as saying that people in different instrument groups are comparable. \n",
"\n",
"2. $Y_i(1, 0)=Y_i(1, 1)=Y_{i1}$ and $Y_i(0, 0)=Y_i(0, 1)=Y_{i0}$. This is the exclusion restriction. It says that if I'm looking at the potential outcome for the treated, it is the same for both instrument groups. In other words, the instrument does not affect the potential outcome, which is the same as saying that the instrument only affects the outcome through the treatment.\n",
"\n",
"3. $E[T_{1i}-T_{0i}] \\neq 0$. This is the existence of a 1st stage. It is saying that the potential outcome of the 1st stage, that is, the potential treatment, is NOT the same. Another way of saying this is that the instrument does affect the treatment.\n",
"\n",
"4. $T_{i1} > T_{i0}$. This is the monotonicity assumption. It is saying that if everyone had the instrument turned on, the treatment level would be higher than if everyone had the treatment turned off. \n",
"4. $T_{i1} \\geq T_{i0}$. This is the monotonicity assumption. It is saying that if everyone had the instrument turned on, the treatment level would be equal or higher than if everyone had the instrument turned off. \n",
"\n",
"Now, let's review the Wald estimator to gain some further intuition on IV:\n",
"\n",
@@ -322,7 +322,7 @@
"Let's take the first bit of it, $E[Y|Z=1]$. Using the exclusion restriction, we can rewrite Y in terms of potential outcome like this.\n",
"\n",
"$\n",
"E[Y_i|Z_i=1]=E[Y_{i0} + T_{i1}(Y_{i1} - Y_{i0})|Z=1]\n",
"E[Y_i|Z_i=1]=E[Y_{i0} + T_{i1}(Y_{i1} - Y_{i0})|Z_i=1]\n",
"$\n",
"\n",
"Using independence, we can take out the conditioning on Z.\n",
@@ -553,7 +553,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks like we have a strong first stage. Those that get assigned to get the push get it 71.8% of the time. This means that we have something like 28% of never takers. We also have strong reasons to believe there are no always takers, since the intercept parameter is estimated to be zero. This means that no one get's the push if it is not assigned to it. Given the design of our experiment, this is expected. \n",
"Looks like we have a strong first stage. Those that get assigned to get the push get it 71.8% of the time. This means that we have something like 28% of never takers. We also have strong reasons to believe there are no always takers, since the intercept parameter is estimated to be zero. This means that no one gets the push if it is not assigned to it. Given the design of our experiment, this is expected. \n",
"\n",
"Let's now run the reduced form:"
]
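As a reference for how these numbers come about, the 1st stage, the reduced form and their ratio (the LATE) can all be computed from group means. The sketch below assumes a dataframe with columns `push_assigned`, `push_delivered` and `in_app_purchase`; the column names are illustrative.

```python
def wald_estimate(data):
    assigned = data.query("push_assigned == 1")
    not_assigned = data.query("push_assigned == 0")

    # 1st stage: how much the assignment moves the delivered treatment
    first_stage = assigned["push_delivered"].mean() - not_assigned["push_delivered"].mean()

    # reduced form: how much the assignment moves the outcome
    reduced_form = assigned["in_app_purchase"].mean() - not_assigned["in_app_purchase"].mean()

    return {"1st stage": first_stage,
            "reduced form": reduced_form,
            "late": reduced_form / first_stage}
```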
16 changes: 9 additions & 7 deletions causal-inference-for-the-brave-and-true/10-Matching.ipynb
@@ -163,7 +163,7 @@
"\\hat{ATE} = \\sum^K_{k=1}(\\bar{Y}_{k1} - \\bar{Y}_{k0}) * \\dfrac{N_k}{N}\n",
"$\n",
"\n",
"where the bar represent the mean of the outcome on the treated, $Y_{k1}$, and non-treated, $Y_{k0}$, at cell k and $N_{k}$ is the number of observations in that same cell. As you can see, we are computing a local ATE for each cell and combining them using a weighted average, where the weights are the sample size of the cell. In our medicine example above, this would be the first estimate, which gave us −2.6.\n",
"where the bar represents the mean of the outcome on the treated, $Y_{k1}$, and non-treated, $Y_{k0}$, at cell k and $N_{k}$ is the number of observations in that same cell. As you can see, we are computing a local ATE for each cell and combining them using a weighted average, where the weights are the sample size of the cell. In our medicine example above, this would be the first estimate, which gave us −2.6.\n",
"\n",
"## Matching Estimator\n",
"\n",
@@ -828,7 +828,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"But this was a very contrived example, just to introduce matching. In reality, we usually have more than one feature and units don't match perfectly. In this case, we have to define some measurement of proximity to compare how units are close to each other. One common metric for this is the euclidean norm $||X_i - X_j||$. This difference, however, is not invariant to the scale of the features. This means that features like age, that take values on the tenths, will be much less important when computing this norm compared to features like income, which take the order of hundreds. For this reason, before applying the norm, we need to scale the features so that they are on roughly the same scale.\n",
"But this was a very contrived example, just to introduce matching. In reality, we usually have more than one feature and units don't match perfectly. In this case, we have to define some measurement of proximity to compare how units are close to each other. One common metric for this is the euclidean norm $||X_i - X_j||$. This difference, however, is not invariant to the scale of the features. This means that features like age, that take values in the tens, will be much less important when computing this norm compared to features like income, which take the order of thousands. For this reason, before applying the norm, we need to scale the features so that they are on roughly the same scale.\n",
"\n",
"Having defined a distance measure, we can now define the match as the nearest neighbour to that sample we wish to match. In math terms, we can write the matching estimator the following way\n",
"\n",
@@ -1253,7 +1253,7 @@
"\\sqrt{N_1}(\\hat{ATET} - ATET)\n",
"$\n",
"\n",
"However, this doesn't alway happen. If we define the mean outcome for the untreated given X, $\\mu_0(x)=E[Y|X=x, T=0]$, we will have that (btw, I've omitted the proof for that because it's a little beyond the point here).\n",
"However, this doesn't always happen. If we define the mean outcome for the untreated given X, $\\mu_0(x)=E[Y|X=x, T=0]$, we will have that (btw, I've omitted the proof for that because it's a little beyond the point here).\n",
"\n",
"$\n",
"E[\\sqrt{N_1}(\\hat{ATET} - ATET)] = E[\\sqrt{N_1}(\\mu_0(X_i) - \\mu_0(X_j(i)))]\n",
"\\hat{ATET} = \\frac{1}{N_1}\\sum \\big((Y_i - Y_{j(i)}) - (\\hat{\\mu_0}(X_i) - \\hat{\\mu_0}(X_{j(i)}))\\big)\n",
"$\n",
"\n",
"where $\\hat{\\mu_0}(x)$ is some estimative of $E[Y|X, T=0]$, like a linear regression fitted only on the untreated sample."
"where $\\hat{\\mu_0}(x)$ is some estimate of $E[Y|X, T=0]$, like a linear regression fitted only on the untreated sample."
]
},
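A sketch of that bias correction is shown below. Here `mu0` stands for any model of $E[Y|X, T=0]$ fitted only on the untreated (a linear regression in this illustration), and `X`, `t` and `y` are assumed to be numpy arrays.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors

def bias_corrected_atet(X, t, y):
    treated, untreated = t == 1, t == 0

    # mu0(x): estimate of E[Y | X=x, T=0], fitted only on the untreated sample
    mu0 = LinearRegression().fit(X[untreated], y[untreated])

    # nearest untreated neighbour for each treated unit
    nn = NearestNeighbors(n_neighbors=1).fit(X[untreated])
    match_idx = nn.kneighbors(X[treated], return_distance=False).ravel()

    y_match = y[untreated][match_idx]
    x_match = X[untreated][match_idx]

    # (Y_i - Y_j(i)) - (mu0(X_i) - mu0(X_j(i)))
    correction = mu0.predict(X[treated]) - mu0.predict(x_match)
    return np.mean((y[treated] - y_match) - correction)
```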
{
@@ -1496,9 +1496,11 @@
"\n",
"As it turns out, the answer is quite simple and intuitive. It is easy to find people that match on a few characteristics, like sex. But if we add more characteristics, like age, income, city of birth and so on, it becomes harder and harder to find matches. In more general terms, the more features we have, the higher will be the distance between units and their matches. \n",
"\n",
"This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomena pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n",
"This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomenon pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n",
Github is not rendering this quite well, but the only difference is phenomena -> phenomenon

"\n",
"![img](./data/img/curse-of-dimensionality.jpg)",
We're currently rendering it like this:
[image]

"\n",
"\n",
"![img](./data/img/curse-of-dimensionality.jpg)\n",
"Image Source: https://deepai.org/machine-learning-glossary-and-terms/curse-of-dimensionality\n",
"\n",
"Despite its scary and pretentious name, this only means that the number of data points required to fill a feature space grows exponentially with the number of features, or dimensions. So, if it takes X data points to fill the space of, say, 3 feature spaces, it takes exponentially more points to fill in the space of 4 features. \n",
Expand All @@ -1515,7 +1517,7 @@
"\n",
"From there, we've derived a very general causal inference estimator with subclassification. We saw how that estimator is not very useful in practice but it gave us some interesting insights on how to tackle the problem of causal inference estimation. That gave us the opportunity to talk about the matching estimator. \n",
"\n",
"Matching controls for the confounders by looking at each treated unit and finding an untreated pair that is very similar to it and similarly for the untreated units. We saw how to implement this method using the KNN algorithm and also how to debiase it using regression. Finally, we discussed the difference between matching and linear regression. We saw how matching is a non parametric estimator that doesn't rely on linearity the way linear regression does.\n",
"Matching controls for the confounders by looking at each treated unit and finding an untreated pair that is very similar to it and similarly for the untreated units. We saw how to implement this method using the KNN algorithm and also how to debias it using regression. Finally, we discussed the difference between matching and linear regression. We saw how matching is a non-parametric estimator that doesn't rely on linearity the way linear regression does.\n",
"\n",
"Finally, we've delved into the problem of high dimensional datasets and we saw how causal inference methods can suffer from it.\n",
"\n",