tcarleton · fatiqnadeem · Sep 8, 2021 · Sep 14, 2021 · Sep 14, 2021 · Sep 14, 2021
diff --git a/labs/2-week-two/code/1-generate-govt-data.R b/labs/2-week-two/code/1-generate-govt-data.R
@@ -0,0 +1,17 @@
+
+library(readxl)
+library(tidyverse)
+library(readr)
+set.seed(1234)
+
+
+setwd("/Users/fatiq/Documents/GitHub/EDS-222-stats/")
+
+read_excel("labs/2-week-two/data/in/EPD_USAQI_LOG.xlsx") %>% 
+  transmute(date = Date, 
+            aqi = as.numeric(USAQI), 
+            rand = round(runif(396, min=40, max=200)),
+            aqi = if_else(is.na(aqi), rand, aqi)) %>% 
+  select(-rand) %>% 
+  distinct(date, .keep_all = TRUE) %>% 
+  write_csv2("labs/2-week-two/data/out/airpol-PK-govt.csv")
diff --git a/labs/2-week-two/code/2-generate-public-data.R b/labs/2-week-two/code/2-generate-public-data.R
@@ -0,0 +1,16 @@
+
+library(readxl)
+library(tidyverse)
+library(readr)
+library(lubridate)
+set.seed(1234)
+
+setwd("/Users/fatiq/Documents/GitHub/EDS-222-stats/")
+
+read_excel("labs/2-week-two/data/in/PakAir 2019-8-2.xlsx") %>% 
+  select(city = City, longitude, latitude, date_time = `Datetime (UTC+5)`, pm_25 = `PM2.5 (μg/m3)`, aqi = `US AQI`) %>% 
+  separate(`date_time`, c('date', 'time'), sep = ' ', extra = 'merge') %>% 
+  distinct(date, .keep_all = TRUE) %>% 
+  select(-time) %>% 
+  write_csv2("labs/2-week-two/data/out/airpol-PK-crowdsourced.csv")
+
diff --git a/labs/2-week-two/rmd/week-2-lab-answers.Rmd b/labs/2-week-two/rmd/week-2-lab-answers.Rmd
@@ -0,0 +1,284 @@
+---
+title: "EDS 222: Week 1: In-class Assignment"
+author: "{STUDENT NAME}"
+date: "`r Sys.Date()`"
+output: pdf_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+# Load all the packages needed here
+library(tidyverse)
+library(readr)
+library(gt)
+
+setwd("/Users/fatiq/Dropbox/EDS222_data/data")
+
+```
+
+_(All case studies in this exercise are fictional Any resemblance to actual events or locales or persons, living or dead, is entirely coincidental.)_
+
+# Sampling in R
+
+Knowing how the observational units were selected from a larger entity will allow for generalizations back to the population from which the data were randomly selected. Additionally, by understanding the structure of the study, causal relationships can be separated from those relationships which are only associated. A good question to ask oneself before working with the data at all is,“How were these observations collected?” The following exercise will help you ask this question. 
+
+In an effort to reverse the effects of climate change and understand the distribution of air quality in the U.S., the Environmental Protection Agency (EPA) intends to initiate an effort to collect data on air quality across the U.S. However, limited funding from President Grump only allow them to collect data from 150 counties. You are an environmental data scientist at the EPA who is tasked to select 150 counties. You are conveniently given access to a list of all 3142 counties from the census. This is publicly available information and is contained in the county data-set in the `usdata` package. You can read more about the data using `help(county)`
+
+
+You have previously learned in class that there two types of sampling methodologies here: (i) Simple Random Sampling and; (ii) Stratified Sampling. Use these strategies to solve the first part of this exercise. For this we will make use of the county level data from the `usdata` package. This data frame contains data from counties in all 50 states plus the District of Columbia. 
+
+## Step 1: 
+
+Use the `library(usdata)` to load the data. 
+
+```{r}
+library(usdata)
+county <- 
+  county
+
+```
+
+## Step 2: 
+
+Take a random sample of 150 counties that you think can be representative of the US and store them in a tidy data frame called `county_srs` You are encouraged to write a function to do this.
+
+```{r}
+# simple way
+county_srs <- 
+  county %>%
+  slice_sample(n = 150)
+```
+
+## Step 3: 
+
+Evaluate the distribution of sampled counties across different states in the data frame `county_srs`. Are the number of sampled counties in each state different? Would you call this a representative sample? 
+
+```{r}
+county_srs %>%
+  group_by(state) %>%
+  count() %>% 
+  ungroup()
+
+# Yes, they are different. It is not a representative sample since it is not blocked on states. 
+```
+
+## Step 4: 
+
+Using your answer from step 3, think about why or why not is the sampling strategy in step 2 representative of the population. Use the concepts of stratified random sampling learned in class to select an equal number of counties in each state. This can also be referred to as stratification on state, can you explain why or why not would this strategy make the sampling more representative. 
+
+```{r}
+county_str <- 
+  county %>%
+  group_by(state) %>%
+  slice_sample(n = 3) %>% 
+  ungroup()
+```
+
+# Step 5: 
+
+President Grump and his team is not convinced that you are selecting a random sample of all the counties of the U.S. Write in a few words how you would defend your sampling strategy in Step 4. 
+
+
+# Sampling issues in real data: 
+
+## How does sampling influence your data and what you can learn from it?
+
+As we have learned in the lectures that statistic derived from the population data and the sample data differ. The statistic derived from the population is referred to as the estimand where the statistic derived from the sample data is referred to as the estimate.
+
+In this exercise we will look at a case study from the developing world. We will take the case of air quality in South Asia. Out of the 6 South Asian countries, Pakistan was ranked to contain the second most air pollution in the world in 2020 (IQAIR, 2020). In 2019 Lahore, Pakistan was the 12th most polluted city in the world with a population of 11.1 million people. You are given two data-sets from Lahore, Pakistan and are asked to compare the two different data collection strategies from this city. 
+
+We make use of two sources of data: (i) crowd-sourced data from air quality monitors in people's homes; (ii) government data from official monitors installed at selected places in Lahore.
+
+## Step 1: Load the data from dropbox and label it as `crowdsourced` and `govt` accordingly: 
+
+```{r}
+crowdsourced <- read_delim("2-week-two/airpol-PK-crowdsourced.csv", ";", escape_double = FALSE, trim_ws = TRUE)
+govt <- read_delim("2-week-two/airpol-PK-govt.csv", ";", escape_double = FALSE, trim_ws = TRUE)
+```
+
+## Step 2: Explain in words what the ideal population dataframe for air quality in Lahore would look like. Use both the crowdsourced data and the government and try to identify how both of these datasets can be products of different sampling strategies.
+
+```{r}
+
+```
+
+
+## Step 3: For each of the government data and crowdsourced generate estimates of the mean, min, max of the data and discuss how the sampling strategy biases (or doesn’t) those parameters as approximations of population statistics. 
+
+```{r}
+crowdsourced %>% 
+  summarise(mean(aqi), 
+            min(aqi), 
+            max(aqi), 
+            sd(aqi)) %>% 
+  gather() %>% 
+  gt()
+```
+
+
+```{r}
+govt %>% 
+  summarise(mean(aqi), 
+            min(aqi), 
+            max(aqi), 
+            sd(aqi)) %>% 
+  gather() %>% 
+  gt()
+```
+
+## Step 5: Identify why the government and corwdsourced data differ and how can they be related to biases generated as a result of the sampling process. 
+
+
+```{r}
+
+```
+
+
+# Designing an experiment
+
+Randomized controlled trials (RCT) are the gold-standard for causal data collection. However, conducting an RCT in real life is monetarily and logistically expensive to implement. 
+
+You are tasked to design a randomized controlled trial on air quality in Lahore, Pakistan. To know more about the context of air quality in Lahore, please read [this paper](https://www.dropbox.com/s/lrv496l15p2mgap/paper_432.pdf?dl=0). The experiment you design should have one treatment and one control group. Since, RCT are expensive and difficult to execute, the Jill and Belinda Gates Foundation (your funder) wants to see some preliminary data simulations on what kinds of results you can hypothetically expect. As described in [Blair et al 2019](https://www.cambridge.org/core/journals/american-political-science-review/article/declaring-and-diagnosing-research-designs/3CB0C0BB0810AEF8FF65446B3E2E4926) the benefits of doing Monte Carlo simulations before implementing a research design help reduce researcher bias, improve the design and forecast problems in potential analyses. Although, the RCT design is due in your homework, in this exercise, we will generate do Monte Carlo simulations to foresee some potential issues in our analyses and address concerns by our funder. 
+
+
+## Step 1: Generate a data frame with 100 observations. Add a random normal variable for potential outcomes for Y0 and Y1:
+
+
+```{r}
+sample <-
+  tibble(Y0 = rnorm(n = 100),
+         Y1 = Y0 + 0.5)
+```
+
+## Step 2: Assign a binary treatment variable `Z` which has equal probability of being in control and treatment:
+
+```{r}
+sample <-
+  sample %>%
+  mutate(Z = sample(
+    c(0, 1),
+    size = 100,
+    replace = TRUE,
+    prob = c(0.5, 0.5)
+  ))
+```
+
+## Step 3: Contruct observed outcome using Y0 and Y1:
+
+```{r}
+sample <-
+  sample %>%
+  mutate(
+    Y = if_else(Z == 1, Y1, Y0))
+```
+
+## Step 4: Use a simple difference in means to calculate the treatment effect as the estimand and the estimate:
+
+```{r}
+    fixed_population <- sample %>% summarize(
+      estimand = mean(Y1 - Y0),
+      estimate = mean(Y[Z == 1]) - mean(Y[Z == 0]))
+```
+
+
+## Step 5: Now, we would like to do this in a simulation over multiple trials. First, start with doing steps 1-4 in a single pipeline and generating a function to execute them.
+
+```{r}
+experiment_run <- function() {
+  sample %>%
+    mutate(Z = sample(c(0, 1), 100, replace = TRUE, prob = c(.5, .5))) %>%
+    mutate(Y = if_else(Z == 1, Y1, Y0)) %>%
+    summarize(estimand = mean(Y1 - Y0),
+              estimate = mean(Y[Z == 1]) - mean(Y[Z == 0])) %>%
+    as_vector()
+}
+```
+
+## Step 6: Using the function in 5 and replicate it times and report what you observe about the results:
+
+```{r}
+    many_runs <-
+      replicate(
+        n = 1000,
+        expr = experiment_run(),
+        simplify = TRUE)
+
+    apply(many_runs, 1, mean)
+
+```
+
+
+## Step 8: Jill and Belinda and Gates Foundation are not accustomed to the technical training in statistics and are worried that sampling or measurement error can affect the results that come out of your RCT.  Write a note to your funder about how will your mean treatment effects will be within a certain range using your results from step 1-6. Please note that your funders are not at all accustomed to statistical terminlogies and you need to explain them your monte carlo simulations in jargon-free language.  
+
+```{r}
+    many_runs <-
+      replicate(
+        n = 1000,
+        expr = run_experiment(),
+        simplify = TRUE)
+
+    apply(many_runs, 1, max)
+    apply(many_runs, 1, min)
+
+```
+
+
+
+## Step 9: add blocking within the sampling part:  
+
+### Step 7.1: First we need to identify what blocking is and why is it needed. For this we need to see what is the bias that our experiment suffers from due to lack of blocking. To do this, design an experiment, randomize the treatment status and see the bias in the simulated variable. Prove that in over 10,000 iterations this bias is reduced to 0. 
+
+```{r}    
+    tibble(u = rnorm(100),  
+           x = rnorm(100)) %>%  
+      mutate(Z = sample(rep(c(0, 1), each = n() / 2), n(), replace = FALSE)) %>% 
+      summarize(balance = mean(x[Z == 1]) - mean(x[Z == 0])) %>%
+      pull(balance)
+
+    my_experiment <- function() {
+      tibble(u = rnorm(100),  
+             x = rnorm(100)) %>%  
+        mutate(Z = sample(rep(c(0, 1), each = n() / 2), n(), replace = FALSE)) %>% 
+        summarize(balance = mean(x[Z == 1]) - mean(x[Z == 0])) %>%
+        pull(balance)
+      }
+
+    balance_statistics <- replicate(10000, my_experiment(), simplify = TRUE)
+    mean(balance_statistics)
+```
+
+
+### Step 7.2: Show the iterated bias generated in step 1 in a plot
+
+```{r}
+    ggplot(data.frame(balance = balance_statistics), aes(balance)) + geom_histogram()
+```  
+
+### Step 7.3: Now we need to think about blocking. We will start to do this by changing the function to use block randomization. For this, first, we should construct blocks. First generate one observed covariate and construct blocks according to that. 
+
+### Step 7.3.1: we need to create a block indicator. Blocks are based on observed units being similar to each other in the greatest way possible. Make use of the function `arrange()` to sort the data in the order of the observed covariate and then mutate a block indicator where the the data resembles a series where the first two units are in block 1, the second two are in block 2, etc.
+
+### Step 7.3.2: Now, once we have blocks, we need to randomize all the units within the blocks where one unit will go to treatment and one will go to control. randomize *within* blocks, exactly one to treatment and one to control. 
+```{r}
+    blocking <-
+      function(data) {
+        data$Z <- NA
+        for (b in unique(data$block)) {
+          data$Z[data$block == b] <-
+            sample(c(0, 1), size = 2, replace = FALSE)
+        }
+        data
+      }
+
+    my_experiment <- function() {
+      tibble(u = rnorm(100), 
+             x = rnorm(100)) %>%  
+        mutate(Y0 = x + u,
+               Y1 = Y0 + 5) %>%
+        arrange(x) %>%
+        mutate(block = rep(1:(n() / 2), each = 2)) %>% 
+        block_ra() %>%
+        mutate(Y = if_else(Z == 0, Y0, Y1))
+    }
+
+```