diff --git a/Contributing/Contributing.md b/Contributing/Contributing.md index be9d7c2d..8d9b12c0 100644 --- a/Contributing/Contributing.md +++ b/Contributing/Contributing.md @@ -184,3 +184,4 @@ Please be sure to add alt text to images for sight-impaired users. Image filenam - **How can I discuss what I'm doing with other contributors?** Head to the [Issues](https://github.com/LOST-STATS/LOST-STATS.github.io/issues) page and find (or post) a thread with the title of the page you're talking about. - **How can I [add an image/link to another LOST page/add an external link/bold text] in the LOST wiki?** See the Markdown section above. - **I want to contribute but I do not like all the rules and structure on this page. I don't even want my FAQ entry to be a question. Just let me write what I want.** If you have valuable knowledge about statistical techniques to share with people and are able to explain things clearly, I don't want to stop you. So go for it. Maybe post something in [Issues](https://github.com/LOST-STATS/LOST-STATS.github.io/issues) when you're done and perhaps someone else will help make your page more consistent with the rest of the Wiki. I mean, it would be nicer if you did that yourself, but hey, we all have different strengths, right? + diff --git a/Data/README.md b/Data/README.md index 6adfa164..ddc3e114 100644 --- a/Data/README.md +++ b/Data/README.md @@ -1 +1,2 @@ -Folder for adding user contributed data. \ No newline at end of file +Folder for adding user contributed data. + diff --git a/Data_Manipulation/Combining_Datasets/combining_datasets_horizontal_merge_deterministic.md b/Data_Manipulation/Combining_Datasets/combining_datasets_horizontal_merge_deterministic.md index 8ba1a884..a4ec908e 100644 --- a/Data_Manipulation/Combining_Datasets/combining_datasets_horizontal_merge_deterministic.md +++ b/Data_Manipulation/Combining_Datasets/combining_datasets_horizontal_merge_deterministic.md @@ -29,15 +29,20 @@ There are three main ways to join datasets horizontally in python using the `mer ```python import pandas as pd -gdp_2018 = pd.DataFrame({'country': ['UK', 'USA', 'France'], - 'currency': ['GBP', 'USD', 'EUR'], - 'gdp_trillions': [2.1, 20.58, 2.78]}) - -dollar_value_2018 = pd.DataFrame({'currency': ['EUR', 'GBP', 'YEN', 'USD'], - 'in_dollars': [1.104, 1.256, .00926, 1]}) +gdp_2018 = pd.DataFrame( + { + "country": ["UK", "USA", "France"], + "currency": ["GBP", "USD", "EUR"], + "gdp_trillions": [2.1, 20.58, 2.78], + } +) + +dollar_value_2018 = pd.DataFrame( + {"currency": ["EUR", "GBP", "YEN", "USD"], "in_dollars": [1.104, 1.256, 0.00926, 1]} +) # Perform a left merge, which discards 'YEN' -GDPandExchange = pd.merge(gdp_2018, dollar_value_2018, how='left', on='currency') +GDPandExchange = pd.merge(gdp_2018, dollar_value_2018, how="left", on="currency") ``` ## R @@ -50,12 +55,16 @@ There are several ways to combine data sets horizontally in R, including base-R library(dplyr) # This data set contains information on GDP in local currency -GDP2018 <- data.frame(Country = c("UK", "USA", "France"), - Currency = c("Pound", "Dollar", "Euro"), - GDPTrillions = c(2.1, 20.58, 2.78)) +GDP2018 <- data.frame( + Country = c("UK", "USA", "France"), + Currency = c("Pound", "Dollar", "Euro"), + GDPTrillions = c(2.1, 20.58, 2.78) +) # This data set contains dollar exchange rates -DollarValue2018 <- data.frame(Currency = c("Euro", "Pound", "Yen", "Dollar"), - InDollars = c(1.104, 1.256, .00926, 1)) +DollarValue2018 <- data.frame( + Currency = c("Euro", "Pound", "Yen", "Dollar"), + InDollars 
= c(1.104, 1.256, .00926, 1) +) ``` Next we want to join together `GDP2018` and `DollarValue2018` so we can convert all the GDPs to dollars and compare them. There are three kinds of observations we could get - observations in `GDP2018` but not `DollarValue2018`, observations in `DollarValue2018` but not `GDP2018`, and observations in both. Use `help(join)` to pick the variant of `join` that keeps the observations we want. The "Yen" observation won't have a match, and we don't need to keep it. So let's do a `left_join` and list `GDP2018` first, so it keeps matched observations, plus any observations only in `GDP2018`. @@ -117,3 +126,4 @@ A one-to-many merge is the opposite of a many to one merge, with multiple observ #### Many-to-Many A many-to-many merge is intended for use when there are multiple observations for each combination of the set of merging variables in both master and using data. However, `merge m:m` has strange behavior that is effectively never what you want, and it is not recommended. + diff --git a/Data_Manipulation/Combining_Datasets/combining_datasets_overview.md b/Data_Manipulation/Combining_Datasets/combining_datasets_overview.md index 00f3f3ab..2cd69f80 100644 --- a/Data_Manipulation/Combining_Datasets/combining_datasets_overview.md +++ b/Data_Manipulation/Combining_Datasets/combining_datasets_overview.md @@ -39,3 +39,4 @@ Alternatively, the below example has two datasets that collect the same informat | Donald Akliberti | B72197 | 34 | These ways of combining data are referred to by different names across different programming languages, but will largely be referred to by one common set of terms (used by Stata and Python’s Pandas): merge for horizontal combination and append for for vertical combination. + diff --git a/Data_Manipulation/Combining_Datasets/combining_datasets_vertical_combination.md b/Data_Manipulation/Combining_Datasets/combining_datasets_vertical_combination.md index 43a7d926..2fd6e421 100644 --- a/Data_Manipulation/Combining_Datasets/combining_datasets_vertical_combination.md +++ b/Data_Manipulation/Combining_Datasets/combining_datasets_vertical_combination.md @@ -8,11 +8,11 @@ nav_order: 1 # Combining Datasets: Vertical Combination -When combining two datasets that collect the same information about different people, they get combined vertically because they have variables in common but different observations. The result of this combination will more rows than the original dataset because it contains all of the people that are present in each of the original datasets. Here we combine the files based on the name or position of the columns in the dataset. It is a "vertical" combination in the sense that one set of observations gets added to the bottom of the other set of observations. +When combining two datasets that collect the same information about different people, they get combined vertically because they have variables in common but different observations. The result of this combination will more rows than the original dataset because it contains all of the people that are present in each of the original datasets. Here we combine the files based on the name or position of the columns in the dataset. It is a "vertical" combination in the sense that one set of observations gets added to the bottom of the other set of observations. -# Keep in Mind -- Vertical combinations require datasets to have variables in common to be of much use. That said, it may not be necessary for the two datasets to have exactly the same variables. 
Be aware of how your statistical package handles observations for a variable that is in one dataset but not another (e.g. are such observations set to missing?). -- It may be the case that the datasets you are combining have the same variables but those variables are stored differently (numeric vs. string storage types). Be aware of how the variables are stored across datasets and how your statistical package handles attempts to combine the same variable with different storage types (e.g. Stata throws an error and will now allow the combination, unless the ", force" option is specified.) +# Keep in Mind +- Vertical combinations require datasets to have variables in common to be of much use. That said, it may not be necessary for the two datasets to have exactly the same variables. Be aware of how your statistical package handles observations for a variable that is in one dataset but not another (e.g. are such observations set to missing?). +- It may be the case that the datasets you are combining have the same variables but those variables are stored differently (numeric vs. string storage types). Be aware of how the variables are stored across datasets and how your statistical package handles attempts to combine the same variable with different storage types (e.g. Stata throws an error and will now allow the combination, unless the ", force" option is specified.) # Implementations @@ -24,12 +24,11 @@ When combining two datasets that collect the same information about different pe import pandas as pd # Load California Population data from the internet -df_ca = pd.read_stata('http://www.stata-press.com/data/r14/capop.dta') -df_il = pd.read_stata('http://www.stata-press.com/data/r14/ilpop.dta') +df_ca = pd.read_stata("http://www.stata-press.com/data/r14/capop.dta") +df_il = pd.read_stata("http://www.stata-press.com/data/r14/ilpop.dta") # Concatenate a list of the dataframes (works on any number of dataframes) df = pd.concat([df_ca, df_il]) - ``` ## R @@ -45,8 +44,8 @@ library(dplyr) data(mtcars) # Split it in two, so we can combine them back together -mtcars1 <- mtcars[1:10,] -mtcars2 <- mtcars[11:32,] +mtcars1 <- mtcars[1:10, ] +mtcars2 <- mtcars[11:32, ] # Use bind_rows to vertically combine the data sets mtcarswhole <- bind_rows(mtcars1, mtcars2) @@ -56,26 +55,27 @@ mtcarswhole <- bind_rows(mtcars1, mtcars2) ```stata * Load California Population data -webuse http://www.stata-press.com/data/r14/capop.dta // Import data from the web +webuse http://www.stata-press.com/data/r14/capop.dta // Import data from the web -append using http://www.stata-press.com/data/r14/ilpop.dta // Merge on Illinois population data from the web +append using http://www.stata-press.com/data/r14/ilpop.dta // Merge on Illinois population data from the web ``` -You can also append multiple datasets at once, by simply listing both datasets separated by a space: +You can also append multiple datasets at once, by simply listing both datasets separated by a space: ```stata * Load California Population data -* Import data from the web -webuse http://www.stata-press.com/data/r14/capop.dta +* Import data from the web +webuse http://www.stata-press.com/data/r14/capop.dta -* Merge on Illinois and Texas population data from the web -append using http://www.stata-press.com/data/r14/ilpop.dta http://www.stata-press.com/data/r14/txpop.dta +* Merge on Illinois and Texas population data from the web +append using http://www.stata-press.com/data/r14/ilpop.dta http://www.stata-press.com/data/r14/txpop.dta ``` -Note that, if there are 
columns in one but not the other of the datasets, Stata will still append the two datasets, but observations from the dataset that did not contain those columns will have their values for that variable set to missing. +Note that, if there are columns in one but not the other of the datasets, Stata will still append the two datasets, but observations from the dataset that did not contain those columns will have their values for that variable set to missing. ```stata -* Load Even Number Data -webuse odd.dta, clear +* Load Even Number Data +webuse odd.dta, clear -append using http://www.stata-press.com/data/r14/even.dta +append using http://www.stata-press.com/data/r14/even.dta ``` + diff --git a/Data_Manipulation/Creating_Dummy_Variables/creating_dummy_variables.md b/Data_Manipulation/Creating_Dummy_Variables/creating_dummy_variables.md index 174624fd..d6012f95 100644 --- a/Data_Manipulation/Creating_Dummy_Variables/creating_dummy_variables.md +++ b/Data_Manipulation/Creating_Dummy_Variables/creating_dummy_variables.md @@ -25,11 +25,12 @@ Several python libraries have functions to turn categorical variables into dummi import pandas as pd # Create a dataframe -df = pd.DataFrame({'colors': ['red', 'green', 'blue', 'red', 'blue'], - 'numbers': [5, 13, 1, 7, 5]}) +df = pd.DataFrame( + {"colors": ["red", "green", "blue", "red", "blue"], "numbers": [5, 13, 1, 7, 5]} +) # Replace the colors column with a dummy column for each color -df = pd.get_dummies(df, columns=['colors']) +df = pd.get_dummies(df, columns=["colors"]) ``` ## R @@ -41,10 +42,10 @@ data(iris) # To retain the column of dummies for the first # categorical value we remove the intercept -model.matrix(~-1+Species, data=iris) +model.matrix(~ -1 + Species, data = iris) # Then we can add the dummies to the original data -iris <- cbind(iris, model.matrix(~-1+Species, data=iris)) +iris <- cbind(iris, model.matrix(~ -1 + Species, data = iris)) # Of course, in a regression we can skip this process summary(lm(Sepal.Length ~ Petal.Length + Species, data = iris)) @@ -69,7 +70,7 @@ data(iris) # mutated_data. # Note: new variables do not have to be based on old # variables -mutated_data = iris %>% +mutated_data <- iris %>% mutate(Long.Petal = Petal.Length > Petal.Width) ``` @@ -77,42 +78,41 @@ This will create a new column of logical (`TRUE`/`FALSE`) variables. This works ```r?example=dplyr mutated_data <- mutated_data %>% - mutate(Long.Petal = Long.Petal*1) + mutate(Long.Petal = Long.Petal * 1) ``` You could also nest that operation inside the original creation of new_dummy like so: ```r?example=dplyr -mutated_data = iris %>% - mutate(Long.Petal = (Petal.Length > Petal.Width)*1) +mutated_data <- iris %>% + mutate(Long.Petal = (Petal.Length > Petal.Width) * 1) ``` ### Base R ```r?example=baser -#the following creates a 5 x 2 data frame -letters = c("a","b","c", "d", "e") -numbers = c(1,2,3,4,5) -df = data.frame(letters,numbers) +# the following creates a 5 x 2 data frame +letters <- c("a", "b", "c", "d", "e") +numbers <- c(1, 2, 3, 4, 5) +df <- data.frame(letters, numbers) ``` Now I'll show several different ways to create a dummy indicating if the numbers variable is odd. 
```r?example=baser -df$dummy = df$numbers%%2 +df$dummy <- df$numbers %% 2 -df$dummy = ifelse(df$numbers%%2==1,1,0) +df$dummy <- ifelse(df$numbers %% 2 == 1, 1, 0) -df$dummy = df$numbers%%2==1 +df$dummy <- df$numbers %% 2 == 1 # the last one created a logical outcome to convert to numerical we can either -df$dummy = df$dummy * 1 +df$dummy <- df$dummy * 1 # or -df$dummy = (df$numbers%%2==1) *1 - +df$dummy <- (df$numbers %% 2 == 1) * 1 ``` ## MATLAB @@ -121,7 +121,7 @@ df$dummy = (df$numbers%%2==1) *1 The equivalent of `model.matrix()` in MATLAB is `dummyvar` which creates columns of one-hot encoded dummies from categorical variables. The following example is taken from MathWorks documentation. -```MATLAB +```matlab Colors = {'Red';'Blue';'Green';'Red';'Green';'Blue'}; Colors = categorical(Colors); @@ -132,7 +132,7 @@ D = dummyvar(Colors) In MATLAB you can store variables as columns in arrays. If you know you are going to add columns multiple times to the same array it is best practice to pre-allocate the final size of the array for computational efficiency. If you do this you can simply select the column you are designating for your dummy variable and story the dummys in that column. -```MATLAB +```matlab arr = [1,2,3;5,2,6;1,8,3]; dum = sum(data(:,:),2) <10; data = horzcat(arr,dum); @@ -167,3 +167,4 @@ regress mpg weight b_* * Create a logical variable gen highmpg = mpg > 30 ``` + diff --git a/Data_Manipulation/Reshaping/reshape.md b/Data_Manipulation/Reshaping/reshape.md index 781fafc2..c68803d5 100644 --- a/Data_Manipulation/Reshaping/reshape.md +++ b/Data_Manipulation/Reshaping/reshape.md @@ -7,3 +7,4 @@ nav_order: 1 --- # Reshaping Data + diff --git a/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.md b/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.md index 0ddef05b..f49b5eaf 100644 --- a/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.md +++ b/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.md @@ -56,8 +56,10 @@ import pandas as pd # Load WHO data on population as an example, which has 'country', 'year', # and 'population' columns. -df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/tidyr/population.csv', - index_col=0) +df = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/tidyr/population.csv", + index_col=0, +) # In this example, we would like to have one row per country but the data have # multiple rows per country, each corresponding with @@ -69,9 +71,7 @@ print(df.head()) # the pivot function and set 'country' as the index. As we'd like to # split out years into different columns, we set columns to 'years', and the # values within this new dataframe will be population: -df_wide = df.pivot(index='country', - columns='year', - values='population') +df_wide = df.pivot(index="country", columns="year", values="population") # What if there are multiple year-country pairs? Pivot can't work # because it needs unique combinations. In this case, we can use @@ -81,20 +81,19 @@ df_wide = df.pivot(index='country', # 5% higher values for all years. 
# Copy the data for France -synth_fr_data = df.loc[df['country'] == 'France'] +synth_fr_data = df.loc[df["country"] == "France"] # Add 5% for all years -synth_fr_data['population'] = synth_fr_data['population']*1.05 +synth_fr_data["population"] = synth_fr_data["population"] * 1.05 # Append it to the end of the original data df = pd.concat([df, synth_fr_data], axis=0) # Compute the wide data - averaging over the two estimates for France for each # year. -df_wide = df.pivot_table(index='country', - columns='year', - values='population', - aggfunc='mean') +df_wide = df.pivot_table( + index="country", columns="year", values="population", aggfunc="mean" +) ``` ## R @@ -121,15 +120,16 @@ Now we think: ```r?example=pivot_wider pop_wide <- pivot_wider(population, - names_from = year, - values_from = population, - names_prefix = "pop_") + names_from = year, + values_from = population, + names_prefix = "pop_" +) ``` Another way to do this is using `data.table`. ```r?example=pivot_wider -#install.packages('data.table') +# install.packages('data.table') library(data.table) # The second argument here is the formula describing the observation level of the data @@ -137,11 +137,11 @@ library(data.table) # The parts before the ~ are what we want the new observation level to be in the wide data (one row per country) # The parts after the ~ are for the variables we want to no longer be part of the observation level (we no longer want a row per year) -population = as.data.table(population) -pop_wide = dcast(population, - country ~ year, - value.var = "population" - ) +population <- as.data.table(population) +pop_wide <- dcast(population, + country ~ year, + value.var = "population" +) ``` ## Stata @@ -210,3 +210,4 @@ restore ``` Note: there is much more guidance to the usage of greshape on the [Gtools reshape page](https://gtools.readthedocs.io/en/latest/usage/greshape/index.html). + diff --git a/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.md b/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.md index 42bf6283..6d5c10b6 100644 --- a/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.md +++ b/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.md @@ -57,29 +57,31 @@ All of the columns that we would like to convert to long format begin with the p import pandas as pd -df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/tidyr/billboard.csv', - index_col=0) +df = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/tidyr/billboard.csv", + index_col=0, +) # stubnames is the prefix for the columns we want to convert to long. i is the # unique id for each row, and j will be the name of the new column. Finally, # the values from the original wide columns (the chart position) adopt the # stubname, so we rename 'wk' to 'position' in the last step. -long_df = (pd.wide_to_long(df, - stubnames='wk', - i=['artist', 'track', 'date.entered'], - j='week') - .rename(columns={'wk': 'position'})) +long_df = pd.wide_to_long( + df, stubnames="wk", i=["artist", "track", "date.entered"], j="week" +).rename(columns={"wk": "position"}) # The wide_to_long function is a special case of the 'melt' function, which # can be used in more complex cases. Here we melt any columns that have the # string 'wk' in their names. In the final step, we extract the number of weeks # from the prefix 'wk' using regex. The final dataframe is the same as above. 
-long_df = pd.melt(df, - id_vars=['artist', 'track', 'date.entered'], - value_vars=[x for x in df.columns if 'wk' in x], - var_name='week', - value_name='position') -long_df['week'] = long_df['week'].str.extract(r'(\d+)') +long_df = pd.melt( + df, + id_vars=["artist", "track", "date.entered"], + value_vars=[x for x in df.columns if "wk" in x], + var_name="week", + value_name="position", +) +long_df["week"] = long_df["week"].str.extract(r"(\d+)") # A more complex case taken from the pandas docs: @@ -92,17 +94,19 @@ import numpy as np # their first letter (A or B) # Create some synthetic data -df = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"}, - "A1980" : {0 : "d", 1 : "e", 2 : "f"}, - "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7}, - "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1}, - "X" : dict(zip(range(3), np.random.randn(3))) - }) +df = pd.DataFrame( + { + "A1970": {0: "a", 1: "b", 2: "c"}, + "A1980": {0: "d", 1: "e", 2: "f"}, + "B1970": {0: 2.5, 1: 1.2, 2: 0.7}, + "B1980": {0: 3.2, 1: 1.3, 2: 0.1}, + "X": dict(zip(range(3), np.random.randn(3))), + } +) # Set an index df["id"] = df.index # Wide to multiple long columns df_long = pd.wide_to_long(df, ["A", "B"], i="id", j="year") - ``` ## R @@ -132,11 +136,12 @@ Now we think: ```r?example=pivoting billboard_long <- pivot_longer(billboard, - col = starts_with("wk"), - names_to = "week", - names_prefix = "wk", - values_to = "position", - values_drop_na = TRUE) + col = starts_with("wk"), + names_to = "week", + names_prefix = "wk", + values_to = "position", + values_drop_na = TRUE +) # values_drop_na says to drop any rows containing missing values of position. # If reshaping to create multiple variables, see the names_sep or names_pattern options. ``` @@ -144,16 +149,16 @@ billboard_long <- pivot_longer(billboard, This task can also be done through `data.table`. ```r?example=pivoting -#install.packages('data.table') +# install.packages('data.table') library(data.table) -billboard = as.data.table(billboard) -billboard_long = melt(billboard, - id = 1:3, - na.rm=TRUE, - variable.names = "Week", - value.name = "Position" - ) +billboard <- as.data.table(billboard) +billboard_long <- melt(billboard, + id = 1:3, + na.rm = TRUE, + variable.names = "Week", + value.name = "Position" +) ``` ## Stata @@ -224,3 +229,4 @@ restore ``` Note: there is much more guidance to the usage of greshape on the [**Gtools** reshape page](https://gtools.readthedocs.io/en/latest/usage/greshape/index.html). + diff --git a/Data_Manipulation/collapse_a_data_set.md b/Data_Manipulation/collapse_a_data_set.md index 2f8573b0..dfc43224 100644 --- a/Data_Manipulation/collapse_a_data_set.md +++ b/Data_Manipulation/collapse_a_data_set.md @@ -17,7 +17,7 @@ The *observation level* of a data set is the set of case-identifying variables w | 2 | 1 | 2 | | 2 | 2 | 4.5 | -the variables $$I$$ and $$J$$ uniquely identify rows. The first row has $$I = 1$$ and $$J = 1$$, and there is no other row with that combination. We could also say that $$X$$ uniquely identifies rows, but in this example $$X$$ is not a case-identifying variable, it's actual data. +the variables $$I$$ and $$J$$ uniquely identify rows. The first row has $$I = 1$$ and $$J = 1$$, and there is no other row with that combination. We could also say that $$X$$ uniquely identifies rows, but in this example $$X$$ is not a case-identifying variable, it's actual data. It is common to want to *collapse* a data set from one level to another, coarser level. 
For example, perhaps instead of one row per combination of $$I$$ and $$J$$, we simply want one row per $$I$$, perhaps with the average $$X$$ across all $$I$$ observations. This would result in: @@ -53,7 +53,7 @@ using DataFrames, Statistics, CSV, HTTP url = "https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv" storms = DataFrame(CSV.File(HTTP.get(url).body)) -combine(groupby(storms, [:name, :year, :month, :day]), +combine(groupby(storms, [:name, :year, :month, :day]), [:wind, :pressure] .=> mean, [:category] .=> first) ``` @@ -65,14 +65,14 @@ For our Python implementation we'll use the very popular [**pandas**](https://pa import pandas as pd # Pull in data on storms -storms = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv') +storms = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv" +) # We'll save the collapsed data as a new object called `storms_collapsed` (this is optional) -storms_collapsed = (storms - .groupby(['name', 'year', 'month', 'day']) - .agg({'wind': 'mean', - 'pressure': 'mean', - 'category': 'first'})) +storms_collapsed = storms.groupby(["name", "year", "month", "day"]).agg( + {"wind": "mean", "pressure": "mean", "category": "first"} +) ``` ## R @@ -90,8 +90,8 @@ library(dplyr) data("storms") # We'll save the collapsed data as a new object called `storms_collapsed` (this is optional) -storms_collapsed = storms %>% - group_by(name, year, month, day) %>% +storms_collapsed <- storms %>% + group_by(name, year, month, day) %>% summarize(across(c(wind, pressure), mean), category = first(category)) ``` @@ -104,9 +104,10 @@ library(data.table) # Set the already-loaded storms dataset as a data.table setDT(storms) -storms[, - .(wind=mean(wind), pressure=mean(pressure), category=first(category)), - by = .(name, year, month, day)] +storms[, + .(wind = mean(wind), pressure = mean(pressure), category = first(category)), + by = .(name, year, month, day) +] ``` Third: [**collapse**](https://sebkrantz.github.io/collapse/): @@ -114,9 +115,10 @@ Third: [**collapse**](https://sebkrantz.github.io/collapse/): # install.packages('collapse') library(collapse) -collap(storms, - by = ~name+year+month+day, - custom = list(fmean=c('wind', 'pressure'), ffirst='category')) +collap(storms, + by = ~ name + year + month + day, + custom = list(fmean = c("wind", "pressure"), ffirst = "category") +) ``` ## Stata @@ -129,7 +131,7 @@ import delimited https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms collapse (mean) wind (mean) pressure (first) category, by(name year month day) ``` -With big datasets, Stata can be slow compared to other languages, though they do seem to be trying to [change that](https://www.stata.com/new-in-stata/faster-stata-speed-improvements/) a bit. The community-contributed [**gtools**](https://gtools.readthedocs.io/en/latest/usage/gtools/index.html) suite can help a lot with speedups and, fortunately, has a faster version of collapse, called `gcollapse`. Note that we won't necessarily see a benefit for small(ish) datasets like the one that we are using here. But note that the syntax is otherwise identical. +With big datasets, Stata can be slow compared to other languages, though they do seem to be trying to [change that](https://www.stata.com/new-in-stata/faster-stata-speed-improvements/) a bit. 
The community-contributed [**gtools**](https://gtools.readthedocs.io/en/latest/usage/gtools/index.html) suite can help a lot with speedups and, fortunately, has a faster version of collapse, called `gcollapse`. Note that we won't necessarily see a benefit for small(ish) datasets like the one that we are using here. But note that the syntax is otherwise identical. ```stata * ssc install gtools @@ -137,3 +139,4 @@ With big datasets, Stata can be slow compared to other languages, though they do gcollapse (mean) wind (mean) pressure (first) category, by(name year month day) ``` + diff --git a/Data_Manipulation/creating_a_variable_with_group_calculations.md b/Data_Manipulation/creating_a_variable_with_group_calculations.md index d88ac20d..099db921 100644 --- a/Data_Manipulation/creating_a_variable_with_group_calculations.md +++ b/Data_Manipulation/creating_a_variable_with_group_calculations.md @@ -17,7 +17,7 @@ Many data sets have hierarchical structures, where individual observations belon | 2 | 1 | 2 | | 2 | 2 | 5 | -Here, we have data where each group $$I$$ has multiple rows, one for each $$J$$. +Here, we have data where each group $$I$$ has multiple rows, one for each $$J$$. We often might want to create a new variable that performs a calculation *within* each group, and assigns the result to each value in that group. For example, perhaps we want to calculate the mean of $$X$$ within each group $$I$$, so we can know how far above or below the group average each observation is. Our goal is: @@ -26,7 +26,7 @@ We often might want to create a new variable that performs a calculation *within | - | - | - | - | | 1 | 1 | 3 | 3.25 | | 1 | 2 | 3.5 | 3.25 | -| 2 | 1 | 2 | 3.5 | +| 2 | 1 | 2 | 3.5 | | 2 | 2 | 5 | 3.5 | @@ -44,20 +44,22 @@ We often might want to create a new variable that performs a calculation *within import pandas as pd # Pull in data on storms -storms = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv') +storms = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv" +) # Use groupby and agg to perform a group calculation # Here it's a mean, but it could be anything -meanwind = (storms.groupby(['name', 'year', 'month', 'day']) - .agg({'wind': 'mean'}) - # Rename so that when we merge it in it has a - # different name - .rename({'wind': 'mean_wind'})) - # make sure it's a data frame so we can join it - +meanwind = ( + storms.groupby(["name", "year", "month", "day"]).agg({"wind": "mean"}) + # Rename so that when we merge it in it has a + # different name + .rename({"wind": "mean_wind"}) +) +# make sure it's a data frame so we can join it + # Use merge to bring the result back into the data -storms = pd.merge(storms,meanwind, - on = ['name', 'year', 'month', 'day']) +storms = pd.merge(storms, meanwind, on=["name", "year", "month", "day"]) ``` ## R @@ -88,3 +90,4 @@ import delimited "https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storm * Use bysort to determine the grouping, and egen to do the calculation bysort name year month day: egen mean_wind = mean(wind) ``` + diff --git a/Data_Manipulation/creating_categorical_variables.md b/Data_Manipulation/creating_categorical_variables.md index 625d6a18..9d25e5fa 100644 --- a/Data_Manipulation/creating_categorical_variables.md +++ b/Data_Manipulation/creating_categorical_variables.md @@ -29,15 +29,24 @@ import pandas as pd # and purely for the dataset import statsmodels.api as sm -mtcars = sm.datasets.get_rdataset('mtcars').data + +mtcars = 
sm.datasets.get_rdataset("mtcars").data # Now we go through each pair of conditions and group assignments, # using loc to only send that group assignment to observations # satisfying the given condition -mtcars.loc[(mtcars.mpg <= 19) & (mtcars.hp <= 123), 'classification'] = 'Efficient and Non-Powerful' -mtcars.loc[(mtcars.mpg > 19) & (mtcars.hp <= 123), 'classification'] = 'Inefficient and Non-Powerful' -mtcars.loc[(mtcars.mpg <= 19) & (mtcars.hp > 123), 'classification'] = 'Efficient and Powerful' -mtcars.loc[(mtcars.mpg > 19) & (mtcars.hp > 123), 'classification'] = 'Inefficient and Powerful' +mtcars.loc[ + (mtcars.mpg <= 19) & (mtcars.hp <= 123), "classification" +] = "Efficient and Non-Powerful" +mtcars.loc[ + (mtcars.mpg > 19) & (mtcars.hp <= 123), "classification" +] = "Inefficient and Non-Powerful" +mtcars.loc[ + (mtcars.mpg <= 19) & (mtcars.hp > 123), "classification" +] = "Efficient and Powerful" +mtcars.loc[ + (mtcars.mpg > 19) & (mtcars.hp > 123), "classification" +] = "Inefficient and Powerful" ``` There's another way to achieve the same outcome using *lambda functions*. In this case, we'll create a dictionary of pairs of classification names and conditions, for example `'Efficient': lambda x: x['mpg'] <= 19`. We'll then find the first case where the condition is true for each row and create a new column with the paired classification name. @@ -45,18 +54,16 @@ There's another way to achieve the same outcome using *lambda functions*. In thi ```python?example=catpy # Dictionary of classification names and conditions expressed as lambda functions conds_dict = { - 'Efficient and Non-Powerful': lambda x: (x['mpg'] <= 19) & (x['hp'] <= 123), - 'Inefficient and Non-Powerful': lambda x: (x['mpg'] > 19) & (x['hp'] <= 123), - 'Efficient and Powerful': lambda x: (x['mpg'] <= 19) & (x['hp'] > 123), - 'Inefficient and Powerful': lambda x: (x['mpg'] > 19) & (x['hp'] > 123), + "Efficient and Non-Powerful": lambda x: (x["mpg"] <= 19) & (x["hp"] <= 123), + "Inefficient and Non-Powerful": lambda x: (x["mpg"] > 19) & (x["hp"] <= 123), + "Efficient and Powerful": lambda x: (x["mpg"] <= 19) & (x["hp"] > 123), + "Inefficient and Powerful": lambda x: (x["mpg"] > 19) & (x["hp"] > 123), } # Find name of first condition that evaluates to True -mtcars['classification'] = mtcars.apply(lambda x: next(key for - key, value in - conds_dict.items() - if value(x)), - axis=1) +mtcars["classification"] = mtcars.apply( + lambda x: next(key for key, value in conds_dict.items() if value(x)), axis=1 +) ``` There's quite a bit to unpack here! `.apply(lambda x: ..., axis=1)` applies a lambda function rowwise to the entire dataframe, with individual columns accessed by, for example, `x['mpg']`. (You can apply functions on an index using `axis=0`.) The `next` keyword returns the next entry in a list that evaluates to true or exists (so in this case it will just return the first entry that exists). Finally, `key for key, value in conds_dict.items() if value(x)` iterates over the pairs in the dictionary and returns only the condition names (the 'keys' in the dictionary) for conditions (the 'values' in the dictionary) that evaluate to true. 
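To see the pattern outside of the `mtcars` example, here is a minimal, self-contained sketch of the same dictionary-of-lambdas plus `next()` trick. The toy data frame and the labels are invented purely for illustration.

```python
import pandas as pd

# Toy data, invented for this sketch
df = pd.DataFrame({"mpg": [15, 22, 18, 30], "hp": [150, 100, 200, 90]})

# Map each label to a condition expressed as a lambda taking a row
conds = {
    "Low mpg": lambda x: x["mpg"] <= 19,
    "High mpg": lambda x: x["mpg"] > 19,
}

# For each row, next() returns the first label whose condition is True
df["classification"] = df.apply(
    lambda x: next(key for key, value in conds.items() if value(x)), axis=1
)

print(df)
```

Because `next()` stops at the first match, the order of the entries in the dictionary matters whenever the conditions overlap.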
@@ -71,10 +78,10 @@ data(mtcars) mtcars <- mtcars %>% mutate(classification = case_when( - mpg <= 19 & hp <= 123 ~ 'Efficient and Non-Powerful', # Here we list each pair of conditions and group assignments - mpg > 19 & hp <= 123 ~ 'Inefficient and Non-Powerful', - mpg <= 19 & hp > 123 ~ 'Efficient and Powerful', - mpg > 19 & hp > 123 ~ 'Inefficient and Powerful' + mpg <= 19 & hp <= 123 ~ "Efficient and Non-Powerful", # Here we list each pair of conditions and group assignments + mpg > 19 & hp <= 123 ~ "Inefficient and Non-Powerful", + mpg <= 19 & hp > 123 ~ "Efficient and Powerful", + mpg > 19 & hp > 123 ~ "Inefficient and Powerful" )) %>% mutate(classification = as.factor(classification)) # Storing categorical variables as factors often makes sense ``` @@ -89,10 +96,10 @@ data(mtcars) mtcars <- as.data.table(mtcars) mtcars[, classification := fcase( - mpg <= 19 & hp <= 123, 'Efficient and Non-Powerful', # Here we list each pair of conditions and group assignments - mpg > 19 & hp <= 123, 'Inefficient and Non-Powerful', - mpg <= 19 & hp > 123, 'Efficient and Powerful', - mpg > 19 & hp > 123, 'Inefficient and Powerful' + mpg <= 19 & hp <= 123, "Efficient and Non-Powerful", # Here we list each pair of conditions and group assignments + mpg > 19 & hp <= 123, "Inefficient and Non-Powerful", + mpg <= 19 & hp > 123, "Efficient and Powerful", + mpg > 19 & hp > 123, "Inefficient and Powerful" )] # Storing categorical variables as factors often makes sense @@ -113,3 +120,4 @@ replace classification = "Inefficient and Non-Powerful" if mpg > 19 & gear_ratio replace classification = "Efficient and Powerful" if mpg <= 19 & gear_ratio > 2.9 replace classification = "Inefficient and Powerful" if mpg > 19 & gear_ratio > 2.9 ``` + diff --git a/Data_Manipulation/data_manipulation.md b/Data_Manipulation/data_manipulation.md index 0b09ea9e..537937b3 100644 --- a/Data_Manipulation/data_manipulation.md +++ b/Data_Manipulation/data_manipulation.md @@ -5,3 +5,4 @@ nav_order: 2 --- # Data Manipulation Techniques + diff --git a/Data_Manipulation/determine_the_observation_level_of_a_data_set.md b/Data_Manipulation/determine_the_observation_level_of_a_data_set.md index 069cca07..e5495db3 100644 --- a/Data_Manipulation/determine_the_observation_level_of_a_data_set.md +++ b/Data_Manipulation/determine_the_observation_level_of_a_data_set.md @@ -41,15 +41,16 @@ To check for duplicate rows when using [**pandas**](https://pandas.pydata.org/) import pandas as pd -storms = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv') +storms = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv" +) # Find the duplicates by name, year, month, day, and hour -level_variables = ['name', 'year', 'month', 'day', 'hour'] +level_variables = ["name", "year", "month", "day", "hour"] storms[storms.duplicated(subset=level_variables)] # Drop these duplicates, but retain the first occurrence of each -storms = storms.drop_duplicates(subset=level_variables, keep='first') - +storms = storms.drop_duplicates(subset=level_variables, keep="first") ``` ## R @@ -68,12 +69,12 @@ data("storms") # name, year, month, day, and hour # anyDuplicated will return 0 if there are no duplicate combinations of these # so if we get 0, the variables in c() are our observation level. -anyDuplicated(storms[,c('name','year','month','day','hour')]) +anyDuplicated(storms[, c("name", "year", "month", "day", "hour")]) # We get 2292, telling us that row 2292 is a duplicate (and possibly others!) 
# We can pick just the rows that are duplicates of other rows for inspection # (note this won't get the first time that duplicate shows up, just the subsequent times) -duplicated_rows <- storms[duplicated(storms[,c('name','year','month','day','hour')]),] +duplicated_rows <- storms[duplicated(storms[, c("name", "year", "month", "day", "hour")]), ] # Alternately, we can use dplyr @@ -126,3 +127,4 @@ For especially large datasets the [**Gtools**](https://gtools.readthedocs.io/en/ gduplicates report latitude longitude gduplicates tag latitude longitude, g(g_duplicated_data) ``` + diff --git a/Data_Manipulation/rowwise_calculations.md b/Data_Manipulation/rowwise_calculations.md index 340c11eb..490ff738 100644 --- a/Data_Manipulation/rowwise_calculations.md +++ b/Data_Manipulation/rowwise_calculations.md @@ -30,25 +30,26 @@ Although not demonstrated in the example below, [lambda](https://www.analyticsvi import pandas as pd # Grab the data -df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/midwest.csv", - index_col=0) +df = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/midwest.csv", + index_col=0, +) # Let's assume that we want to sum, row-wise, every column # that contains 'perc' in its column name and check that # the total is 300. Use a list comprehension to get only # relevant columns, sum across them (axis=1), and create a # new column to store them: -df['perc_sum'] = df[[x for x in df.columns if 'perc' in x]].sum(axis=1) +df["perc_sum"] = df[[x for x in df.columns if "perc" in x]].sum(axis=1) # We can now check whether, on aggregate, each row entry of this new column # is 300 (it's not!) -df['perc_sum'].describe() - +df["perc_sum"].describe() ``` ## R -There are a few ways to perform rowwise operations in R. If you are summing the columns or taking their mean, `rowSums` and `rowMeans` in base R are great. +There are a few ways to perform rowwise operations in R. If you are summing the columns or taking their mean, `rowSums` and `rowMeans` in base R are great. For something more complex, `apply` in base R can perform any necessary rowwise calculation, but `pmap` in the `purrr` package is likely to be faster. @@ -58,7 +59,7 @@ In all cases, the **tidyselect** helpers in the **dplyr** package can help you t # If necessary # install.packages(c('purrr','ggplot2','dplyr')) # ggplot2 is only for the data -data(midwest, package = 'ggplot2') +data(midwest, package = "ggplot2") # dplyr is for the tidyselect functions, the pipe %>%, and select() to pick columns library(dplyr) @@ -66,33 +67,35 @@ library(dplyr) # add up to 300 as they maybe should # Use starts_with to select the variables -# First, do it with rowSums, +# First, do it with rowSums, # either by picking column indices or using tidyselect -midwest$rowsum_rowSums1 <- rowSums(midwest[,c(12:16,18:20,22:26)]) +midwest$rowsum_rowSums1 <- rowSums(midwest[, c(12:16, 18:20, 22:26)]) midwest$rowsum_rowSums2 <- midwest %>% - select(starts_with('perc')) %>% + select(starts_with("perc")) %>% rowSums() # Next, with apply - we're doing sum() here for the function # but it could be anything midwest$rowsum_apply <- apply( - midwest %>% select(starts_with('perc')), - MARGIN = 1, - sum) + midwest %>% select(starts_with("perc")), + MARGIN = 1, + sum +) # Next, two ways with purrr: library(purrr) # First, using purrr::reduce, which is good for some functions like summing # Note that . 
is the data set being sent by %>% midwest <- midwest %>% - mutate(rowsum_purrrReduce = reduce(select(., starts_with('perc')), `+`)) + mutate(rowsum_purrrReduce = reduce(select(., starts_with("perc")), `+`)) # More flexible, purrr::pmap, which works for any function # using pmap_dbl here to get a numeric variable rather than a list midwest <- midwest %>% mutate(rowsum_purrrpmap = pmap_dbl( - select(.,starts_with('perc')), - sum)) + select(., starts_with("perc")), + sum + )) # So do we get 300? summary(midwest$rowsum_rowSums2) @@ -101,7 +104,7 @@ summary(midwest$rowsum_rowSums2) ## Stata -Stata has a series of built-in row operations that use the `egen` command. See `help egen` for the full list, and look for functions beginning with `row` like `rowmean`. +Stata has a series of built-in row operations that use the `egen` command. See `help egen` for the full list, and look for functions beginning with `row` like `rowmean`. The full list includes: `rowfirst` and `rowlast` (first or last non-missing observation), `rowmean`, `rowmedian`, `rowmax`, `rowmin`, `rowpctile`, and `rowtotal` (the mean, median, max, min, given percentile, or sum of all the variables), and `rowmiss` and `rownonmiss` (the count of the number of missing or nonmissing observations across the variables). @@ -129,3 +132,4 @@ egen total_ed = rowtotal(perchsd-percprof) * and also leaves out non-HS graduates. summ total_ed ``` + diff --git a/Desired_Nonexistent_Pages/desired_nonexistent_pages.md b/Desired_Nonexistent_Pages/desired_nonexistent_pages.md index 23e559c3..caa866f4 100644 --- a/Desired_Nonexistent_Pages/desired_nonexistent_pages.md +++ b/Desired_Nonexistent_Pages/desired_nonexistent_pages.md @@ -67,7 +67,7 @@ If you create one of these pages, please remove it from this list. * Cluster Bootstrap Standard Errors * Jackknife Standard Errors -## Machine Learning +## Machine Learning * A-B Testing * Artificial Neural Networks @@ -89,7 +89,7 @@ If you create one of these pages, please remove it from this list. 
* Stationarity and Weak Dependence * Granger Causality * Moving Average Model -* Kalman Filtering/Smoothing +* Kalman Filtering/Smoothing * ARIMAX Model * GARCH Model * TARCH Model diff --git a/Geo-Spatial/Geo-spatial.md b/Geo-Spatial/Geo-spatial.md index 73a849cb..70f013a5 100644 --- a/Geo-Spatial/Geo-spatial.md +++ b/Geo-Spatial/Geo-spatial.md @@ -5,3 +5,4 @@ nav_order: 3 --- # Geo-Spatial + diff --git a/Geo-Spatial/choropleths.md b/Geo-Spatial/choropleths.md index 16088514..84c95933 100644 --- a/Geo-Spatial/choropleths.md +++ b/Geo-Spatial/choropleths.md @@ -30,26 +30,27 @@ The [**geopandas**](https://geopandas.org/) package is the easiest way to start import matplotlib.pyplot as plt import geopandas as gpd -world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) +world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres")) world = world[(world.pop_est > 0) & (world.name != "Antarctica")] -world['gdp_per_cap'] = 1.0e6 * world.gdp_md_est / world.pop_est +world["gdp_per_cap"] = 1.0e6 * world.gdp_md_est / world.pop_est # Simple choropleth -world.plot(column='gdp_per_cap') +world.plot(column="gdp_per_cap") # Much better looking choropleth -plt.style.use('seaborn-paper') +plt.style.use("seaborn-paper") fig, ax = plt.subplots(1, 1) -world.plot(column='gdp_per_cap', - ax=ax, - cmap='plasma', - legend=True, - vmin=0., - legend_kwds={'label': "GDP per capita (USD)", - 'orientation': "horizontal"}) -plt.axis('off') +world.plot( + column="gdp_per_cap", + ax=ax, + cmap="plasma", + legend=True, + vmin=0.0, + legend_kwds={"label": "GDP per capita (USD)", "orientation": "horizontal"}, +) +plt.axis("off") # Now let's try a cartogram # If you don't have it already, geoplot can be installed by runnning @@ -58,13 +59,13 @@ import geoplot as gplt ax = gplt.cartogram( world, - scale='gdp_per_cap', - hue='gdp_per_cap', - cmap='plasma', + scale="gdp_per_cap", + hue="gdp_per_cap", + cmap="plasma", linewidth=0.5, - figsize=(8, 12) + figsize=(8, 12), ) -gplt.polyplot(world, facecolor='lightgray', edgecolor='None', ax=ax) +gplt.polyplot(world, facecolor="lightgray", edgecolor="None", ax=ax) plt.title("GDP per capita (USD)") plt.show() ``` @@ -78,46 +79,57 @@ The [**sf**](https://github.com/r-spatial/sf/) is a fantastic package to make ch if (!require("pacman")) install.packages("pacman") pacman::p_load(sf, tidyverse, ggplot2, rnaturalearth, viridis, cartogram, scales) -world <- ne_download( scale = 110, type = 'countries' ) %>% +world <- ne_download(scale = 110, type = "countries") %>% st_as_sf() -world = world %>% - filter(POP_EST > 0, - NAME != "Antarctica") %>% - mutate(gdp_per_capita = 1.0e6*(GDP_MD_EST / as.numeric(POP_EST))) +world <- world %>% + filter( + POP_EST > 0, + NAME != "Antarctica" + ) %>% + mutate(gdp_per_capita = 1.0e6 * (GDP_MD_EST / as.numeric(POP_EST))) ## Simple choropleth plot ggplot(data = world) + - geom_sf(aes(fill = gdp_per_capita)) + geom_sf(aes(fill = gdp_per_capita)) ## Much better looking choropleth with ggplot2 -world %>% - st_transform(crs = "+proj=eqearth +wktext") %>% +world %>% + st_transform(crs = "+proj=eqearth +wktext") %>% ggplot() + - geom_sf(aes(fill = gdp_per_capita)) + - theme_void() + - labs(title = "", - caption = "Data downloaded from www.naturalearthdata.com", - fill = "GDP per capita (USD)") + - scale_fill_viridis(labels = comma) + - theme(legend.position = "bottom", - legend.key.width = unit(1.5, "cm")) + geom_sf(aes(fill = gdp_per_capita)) + + theme_void() + + labs( + title = "", + caption = "Data downloaded from www.naturalearthdata.com", + 
fill = "GDP per capita (USD)" + ) + + scale_fill_viridis(labels = comma) + + theme( + legend.position = "bottom", + legend.key.width = unit(1.5, "cm") + ) ## Now let's try a cartogram using the cartogram package that was loaded above -world_cartogram = world %>% +world_cartogram <- world %>% st_transform(crs = "+proj=eqearth +wktext") %>% cartogram_ncont("gdp_per_capita", k = 100, inplace = TRUE) ggplot() + geom_sf(data = world, alpha = 1, color = "grey70", fill = "grey70") + - geom_sf(data = world_cartogram, aes(fill = gdp_per_capita), - alpha = 1, color = "black", size = 0.1) + + geom_sf( + data = world_cartogram, aes(fill = gdp_per_capita), + alpha = 1, color = "black", size = 0.1 + ) + scale_fill_viridis(labels = comma) + - labs(title = "Cartogram - GDP per capita", - caption = "Data downloaded from www.naturalearthdata.com", - fill = "GDP per capita") + + labs( + title = "Cartogram - GDP per capita", + caption = "Data downloaded from www.naturalearthdata.com", + fill = "GDP per capita" + ) + theme_void() ``` + diff --git a/Geo-Spatial/geocoding.md b/Geo-Spatial/geocoding.md index 14bcc1a7..e445dbe9 100644 --- a/Geo-Spatial/geocoding.md +++ b/Geo-Spatial/geocoding.md @@ -76,10 +76,10 @@ from geopy.geocoders import Nominatim # Create a geolocator using Open Street Map (aka Nominatim) # Use your own user agent identifier here -geolocator = Nominatim(user_agent='LOST_geocoding_page') +geolocator = Nominatim(user_agent="LOST_geocoding_page") # Pass an address to retrieve full location information: -location = geolocator.geocode('Bank of England') +location = geolocator.geocode("Bank of England") print(location.address) # >> Bank of England, 8AH, Threadneedle Street, Bishopsgate, City of London, @@ -127,26 +127,25 @@ To save the API in your **Renvrion**: 1. Open the **Renviron** by running `usethis::edit_r_environ()` 2. Once you are in the **Renviron** name and save the API you got from Geocodio. Maybe something like: -```r?example=rgeocodeio&skip=true&skipReason=requires_api_key -#geocodio_API = 'your api` +```r?example=rgeocodeio&skip=true&skipreason=requires_api_key +# geocodio_API = 'your api` ``` 3. Save your **Renviron** and then restart your R session just to be sure that the API is saved. Now that you have your API saved in R you still need to authorize the API in your R session. Do so by running `gio_auth()`. -```r?example=rgeocodeio&skip=true&skipReason=requires_api_key +```r?example=rgeocodeio&skip=true&skipreason=requires_api_key # If necessary # install.packages(c('rgeocodio','readxl','tidyverse')) library(rgeocodio) gio_auth(force = F) - ``` A quick note, `force` makes you set a new geocodio API key for the current environment. In general you will want to run `force=F`. Lets try a regeocodio example. Say you want to get the coordinates of the White House. You could run: -```r?example=rgeocodeio&skip=true&skipReason=requires_api_key -rgeocodio::gio_geocode('1600 Pennsylvania Ave NW, Washington DC 20500') +```r?example=rgeocodeio&skip=true&skipreason=requires_api_key +rgeocodio::gio_geocode("1600 Pennsylvania Ave NW, Washington DC 20500") ``` Most of these variables are intuitive but I want to spend a few seconds on **accuracy** and **accuracy type** which we can learn more about [here](https://www.geocod.io/docs/#accuracy-score). @@ -157,32 +156,31 @@ Most of these variables are intuitive but I want to spend a few seconds on **acc What if we want to geocode a bunch of addresses at once? To geocode multiple addresses at once we will use `gio_batch_geocode`. 
The data that we enter will need to be a *character vector of addresses*. -```r?example=rgeocodeio&skip=true&skipReason=requires_api_key +```r?example=rgeocodeio&skip=true&skipreason=requires_api_key library(readxl) library(tidyverse) -addresses<- c('Yosemite National Park, California', '1600 Pennsylvania Ave NW, Washington DC 20500', '2975 Kincaide St Eugene, Oregon, 97405') +addresses <- c("Yosemite National Park, California", "1600 Pennsylvania Ave NW, Washington DC 20500", "2975 Kincaide St Eugene, Oregon, 97405") gio_batch_geocode(addresses) ``` You will notice that the output is a list with dataframes of the results embedded. There are a number of ways to extract the relevant data but one approach would be: -```r?example=rgeocodeio&skip=true&skipReason=requires_api_key -addresses<- c('Yosemite National Park, California', '1600 Pennsylvania Ave NW, Washington DC 20500', '2975 Kincaide St Eugene, Oregon, 97405') - -extract_function<- function(addresses){ +```r?example=rgeocodeio&skip=true&skipreason=requires_api_key +addresses <- c("Yosemite National Park, California", "1600 Pennsylvania Ave NW, Washington DC 20500", "2975 Kincaide St Eugene, Oregon, 97405") -data<-gio_batch_geocode(addresses) -vector<- (1: length(addresses)) +extract_function <- function(addresses) { + data <- gio_batch_geocode(addresses) + vector <- (1:length(addresses)) -df_function<-function(vector){ - df<-data$response_results[vector] - df<-df%>%as.data.frame() -} + df_function <- function(vector) { + df <- data$response_results[vector] + df <- df %>% as.data.frame() + } -geocode_data<-do.call(bind_rows, lapply(vector, df_function)) -return(geocode_data) + geocode_data <- do.call(bind_rows, lapply(vector, df_function)) + return(geocode_data) } extract_function(addresses) @@ -192,15 +190,15 @@ Reverse geocoding uses `gio_reverse` and `gio_batch_reverse`. For `gio_reverse` you submit a longitude-latitude pair: -```r?example=rgeocodeio&skip=true&skipReason=requires_api_key +```r?example=rgeocodeio&skip=true&skipreason=requires_api_key gio_reverse(38.89767, -77.03655) ``` For `gio_batch_reverse` we will submit a vector of numeric entries ordered by c(longitude, latitude): -```r?example=rgeocodeio&skip=true&skipReason=requires_api_key -#make a dataset -data<-data.frame( +```r?example=rgeocodeio&skip=true&skipreason=requires_api_key +# make a dataset +data <- data.frame( lat = c(35.9746000, 32.8793700, 33.8337100, 35.4171240), lon = c(-77.9658000, -96.6303900, -117.8362320, -80.6784760) ) @@ -212,8 +210,9 @@ Notice that the output gives us multiple accuracy types. What about geocoding the rest of the world, chico? -```r?example=rgeocodeio&skip=true&skipReason=requires_api_key -rgeocodio::gio_batch_geocode('523-303, 350 Mokdongdong-ro, Yangcheon-Gu, Seoul, South Korea 07987') +```r?example=rgeocodeio&skip=true&skipreason=requires_api_key +rgeocodio::gio_batch_geocode("523-303, 350 Mokdongdong-ro, Yangcheon-Gu, Seoul, South Korea 07987") ``` *gasp* Geocodio only works, from my understanding, in the United States and Canada. We would need to use a different service like **Google's geocoder** to do the rest of the world. 
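One free alternative for addresses outside the US and Canada is the **geopy**/Nominatim approach already shown in the Python section above. Below is a minimal sketch that reuses it on a simplified version of the South Korean address from the example (free geocoders often handle street-level queries better than full unit numbers, and a given lookup may still return nothing, so the result is checked before use).

```python
from geopy.geocoders import Nominatim

# Nominatim (Open Street Map) has worldwide coverage
# Use your own user agent identifier here
geolocator = Nominatim(user_agent="LOST_geocoding_page")

# A simplified version of the South Korean address used above
location = geolocator.geocode("350 Mokdongdong-ro, Yangcheon-gu, Seoul, South Korea")

# Free services may fail to match a detailed address, so check before using
if location is not None:
    print(location.address)
    print((location.latitude, location.longitude))
```

Note that Nominatim's usage policy is fairly strict about request rates, so for large batches you will want to throttle your queries or use a paid service.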
+ diff --git a/Geo-Spatial/merging_shape_files.md b/Geo-Spatial/merging_shape_files.md index 9503d6dd..973f1bdd 100644 --- a/Geo-Spatial/merging_shape_files.md +++ b/Geo-Spatial/merging_shape_files.md @@ -38,49 +38,50 @@ library(ggplot2) library(reprex) ### Step 1: Read in nc file as a dataframe* -pm2010 = nc_open("https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Geo-Spatial/Data/Merging_Shape_Files/GlobalGWRwUni_PM25_GL_201001_201012-RH35_Median_NoDust_NoSalt.nc?raw=true") -nc.brick = brick("https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Geo-Spatial/Data/Merging_Shape_Files/GlobalGWRwUni_PM25_GL_201001_201012-RH35_Median_NoDust_NoSalt.nc?raw=true") +pm2010 <- nc_open("https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Geo-Spatial/Data/Merging_Shape_Files/GlobalGWRwUni_PM25_GL_201001_201012-RH35_Median_NoDust_NoSalt.nc?raw=true") +nc.brick <- brick("https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Geo-Spatial/Data/Merging_Shape_Files/GlobalGWRwUni_PM25_GL_201001_201012-RH35_Median_NoDust_NoSalt.nc?raw=true") # Check the dimensions dim(nc.brick) # Turn into a data frame for use -nc.df = as.data.frame(nc.brick[[1]], xy = T) +nc.df <- as.data.frame(nc.brick[[1]], xy = T) head(nc.df) ### Step 2: Filter out a specific country. # Global data is very big. I am going to focus only on Brazil. -nc.brazil = nc.df %>% filter(x >= -73.59 & x <= 34.47 & y >= -33.45 & y <= 5.16) +nc.brazil <- nc.df %>% filter(x >= -73.59 & x <= 34.47 & y >= -33.45 & y <= 5.16) rm(nc.df) head(nc.brazil) ### Step 3: Change the dataframe to a sf object using the st_as_sf function -pm25_sf = st_as_sf(nc.brazil, coords = c("x", "y"), crs = 4326, agr = "constant") +pm25_sf <- st_as_sf(nc.brazil, coords = c("x", "y"), crs = 4326, agr = "constant") rm(nc.brazil) head(pm25_sf) ### Step 4: Read in the Brazil shp file. we plan to merge to -Brazil_map_2010 = st_read("https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Geo-Spatial/Data/Merging_Shape_Files/geo2_br2010.shp?raw=true") +Brazil_map_2010 <- st_read("https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Geo-Spatial/Data/Merging_Shape_Files/geo2_br2010.shp?raw=true") head(Brazil_map_2010) ### Step 5: Intersect pm25 sf object with the shp file.* # Now let's use a sample from pm25 data and intersect it with the shp file. 
Since the sf object is huge, I recommend running the analysis on a cloud server -pm25_sample = sample_n(pm25_sf, 1000, replace = FALSE) +pm25_sample <- sample_n(pm25_sf, 1000, replace = FALSE) # Now look for the intersection between the pollution data and the Brazil map to merge them -pm25_municipal_2010 = st_intersection(pm25_sample, Brazil_map_2010) +pm25_municipal_2010 <- st_intersection(pm25_sample, Brazil_map_2010) head(pm25_municipal_2010) ### Step 6: Make a map using ggplot -pm25_municipal_2010 = pm25_municipal_2010 %>% - select(1,6) -pm25_municipal_2010 = st_drop_geometry(pm25_municipal_2010) -Brazil_pm25_2010 = left_join(Brazil_map_2010, pm25_municipal_2010) +pm25_municipal_2010 <- pm25_municipal_2010 %>% + select(1, 6) +pm25_municipal_2010 <- st_drop_geometry(pm25_municipal_2010) +Brazil_pm25_2010 <- left_join(Brazil_map_2010, pm25_municipal_2010) ggplot(Brazil_pm25_2010) + # geom_sf creates the map we need - geom_sf(aes(fill = -layer), alpha=0.8, lwd = 0, col="white") + + geom_sf(aes(fill = -layer), alpha = 0.8, lwd = 0, col = "white") + # and we fill with the pollution concentration data scale_fill_viridis_c(option = "viridis", name = "PM25") + - ggtitle("PM25 in municipals of Brazil")+ + ggtitle("PM25 in municipals of Brazil") + ggsn::blank() ``` + diff --git a/Geo-Spatial/spatial_joins.md b/Geo-Spatial/spatial_joins.md index 0c820e58..fc16499c 100644 --- a/Geo-Spatial/spatial_joins.md +++ b/Geo-Spatial/spatial_joins.md @@ -17,7 +17,7 @@ Joins are typically interesections of objects, but can be expressed in different - Geospatial packages in R and Python tend to have a large number of complex dependencies, which can make installing them painful. Best practice is to install geospatial packages in a new virtual environment. - When it comes to the package we are using in R for the US boundaries, it is much easier to install via the [devtools](https://cran.r-project.org/web/packages/devtools/index.html). This will save you the trouble of getting errors when installing the data packages for the boundaries. Otherwise, your mileage may vary. When I installed USAboundariesData via USAboundaries, I received errors. -```r?skip=True&skipReason=dont_install_packages +```r?skip=true&skipreason=dont_install_packages devtools::install_github("ropensci/USAboundaries") devtools::install_github("ropensci/USAboundariesData") ``` @@ -39,13 +39,13 @@ The [**geopandas**](https://geopandas.org/) package is the easiest way to start import geopandas as gpd # Grab a world map -world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) +world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres")) # Plot the map of the world world.plot() # Grab data on cities -cities = gpd.read_file(gpd.datasets.get_path('naturalearth_cities')) +cities = gpd.read_file(gpd.datasets.get_path("naturalearth_cities")) # We can plot the cities too - but they're just dots of lat/lon without any # context for now @@ -57,8 +57,8 @@ cities.plot() cities = cities.to_crs(world.crs) # Combine them on a plot: -base = world.plot(color='white', edgecolor='black') -cities.plot(ax=base, marker='o', color='red', markersize=5) +base = world.plot(color="white", edgecolor="black") +cities.plot(ax=base, marker="o", color="red", markersize=5) # We want to perform a spatial merge, but there are many kinds in 2D # projections, including withins, touches, crosses, and overlaps. 
We want to @@ -66,7 +66,7 @@ cities.plot(ax=base, marker='o', color='red', markersize=5) # point) with the shapes of countries and determine which city goes in which # country (even if it's on the boundary). We use the 'sjoin' function: -cities_with_country = gpd.sjoin(cities, world, how="inner", op='intersects') +cities_with_country = gpd.sjoin(cities, world, how="inner", op="intersects") cities_with_country.head() # name_left geometry pop_est continent \ @@ -104,40 +104,40 @@ library(GSODR) - We start with the boundaries of the United States to get desirable polygons to work with for our analysis. To pay homage to the states of my alma maters, we will do some analysis with Oregon, Ohio, and Michigan. ```r?example=spatial -#Selecting the United States Boundaries, but omitting Alaska, Hawaii, and Puerto Rico for it to be scaled better +# Selecting the United States Boundaries, but omitting Alaska, Hawaii, and Puerto Rico for it to be scaled better -usa <- us_boundaries(type="state", resolution = "low") %>% +usa <- us_boundaries(type = "state", resolution = "low") %>% filter(!state_abbr %in% c("PR", "AK", "HI")) -#Ohio with high resolution +# Ohio with high resolution oh <- USAboundaries::us_states(resolution = "high", states = "OH") -#Oregon with high resolution +# Oregon with high resolution or <- USAboundaries::us_states(resolution = "high", states = "OR") -#Michigan with high resolution +# Michigan with high resolution mi <- USAboundaries::us_states(resolution = "high", states = "MI") -#Insets for the identified states +# Insets for the identified states -#Oregon +# Oregon or_box <- st_make_grid(or, n = 1) -#Ohio +# Ohio oh_box <- st_make_grid(oh, n = 1) -#Michigan +# Michigan mi_box <- st_make_grid(mi, n = 1) -#We can also include the counties boundaries within the state too! +# We can also include the counties boundaries within the state too! 
+# We can also include the counties boundaries within the state too!
-#Oregon +# Oregon or_co <- USAboundaries::us_counties(resolution = "high", states = "OR") -#Ohio +# Ohio oh_co <- USAboundaries::us_counties(resolution = "high", states = "OH") -#Michigan +# Michigan mi_co <- USAboundaries::us_counties(resolution = "high", states = "MI") ``` @@ -148,9 +148,9 @@ mi_co <- USAboundaries::us_counties(resolution = "high", states = "MI") ```r?example=spatial plot(usa$geometry) -plot(or$geometry, add=T, col="gray50", border="black") -plot(or_co$geometry, add=T, border="green", col=NA) -plot(or_box, add=T, border="yellow", col=NA, lwd=2) +plot(or$geometry, add = T, col = "gray50", border = "black") +plot(or_co$geometry, add = T, border = "green", col = NA) +plot(or_box, add = T, border = "yellow", col = NA, lwd = 2) ``` ![Oregon highlighted](Images/spatial_joins/join_image_1.png) @@ -158,9 +158,9 @@ plot(or_box, add=T, border="yellow", col=NA, lwd=2) ```r?example=spatial plot(usa$geometry) -plot(oh$geometry, add=T, col="gray50", border="black") -plot(oh_co$geometry, add=T, border="yellow", col=NA) -plot(oh_box, add=T, border="blue", col=NA, lwd=2) +plot(oh$geometry, add = T, col = "gray50", border = "black") +plot(oh_co$geometry, add = T, border = "yellow", col = NA) +plot(oh_box, add = T, border = "blue", col = NA, lwd = 2) ``` ![Ohio highlighted](Images/spatial_joins/join_image_2.png) @@ -168,9 +168,9 @@ plot(oh_box, add=T, border="blue", col=NA, lwd=2) ```r?example=spatial plot(usa$geometry) -plot(mi$geometry, add=T, col="gray50", border="black") -plot(mi_co$geometry, add=T, border="gray", col=NA) -plot(mi_box, add=T, border="green", col=NA, lwd=2) +plot(mi$geometry, add = T, col = "gray50", border = "black") +plot(mi_co$geometry, add = T, border = "gray", col = NA) +plot(mi_box, add = T, border = "green", col = NA, lwd = 2) ``` ![Michigan highlighted](Images/spatial_joins/join_image_3.png) @@ -178,15 +178,15 @@ plot(mi_box, add=T, border="green", col=NA, lwd=2) ```r?example=spatial plot(usa$geometry) -plot(mi$geometry, add=T, col="gray50", border="black") -plot(mi_co$geometry, add=T, border="gray", col=NA) -plot(mi_box, add=T, border="green", col=NA, lwd=2) -plot(oh$geometry, add=T, col="gray50", border="black") -plot(oh_co$geometry, add=T, border="yellow", col=NA) -plot(oh_box, add=T, border="blue", col=NA, lwd=2) -plot(or$geometry, add=T, col="gray50", border="black") -plot(or_co$geometry, add=T, border="green", col=NA) -plot(or_box, add=T, border="yellow", col=NA, lwd=2) +plot(mi$geometry, add = T, col = "gray50", border = "black") +plot(mi_co$geometry, add = T, border = "gray", col = NA) +plot(mi_box, add = T, border = "green", col = NA, lwd = 2) +plot(oh$geometry, add = T, col = "gray50", border = "black") +plot(oh_co$geometry, add = T, border = "yellow", col = NA) +plot(oh_box, add = T, border = "blue", col = NA, lwd = 2) +plot(or$geometry, add = T, col = "gray50", border = "black") +plot(or_co$geometry, add = T, border = "green", col = NA) +plot(or_box, add = T, border = "yellow", col = NA, lwd = 2) ``` ![Oregon, Ohio, and Michigan highlighted](Images/spatial_joins/join_image_4.png) @@ -197,14 +197,14 @@ plot(or_box, add=T, border="yellow", col=NA, lwd=2) ```r?example=spatial load(system.file("extdata", "isd_history.rda", package = "GSODR")) -#We want this to be spatial data +# We want this to be spatial data isd_history <- as.data.frame(isd_history) %>% - st_as_sf(coords=c("LON","LAT"), crs=4326, remove=FALSE) + st_as_sf(coords = c("LON", "LAT"), crs = 4326, remove = FALSE) -#There are many observations, so we want to narrow it to our three candidate 
states -isd_history_or <- dplyr::filter(isd_history, CTRY=="US", STATE=="OR") -isd_history_oh <- dplyr::filter(isd_history, CTRY=="US", STATE=="OH") -isd_history_mi <- dplyr::filter(isd_history, CTRY=="US", STATE=="MI") +# There are many observations, so we want to narrow it to our three candidate states +isd_history_or <- dplyr::filter(isd_history, CTRY == "US", STATE == "OR") +isd_history_oh <- dplyr::filter(isd_history, CTRY == "US", STATE == "OH") +isd_history_mi <- dplyr::filter(isd_history, CTRY == "US", STATE == "MI") ``` **This filtering should take you from around 26,700 observation sites around the world to approximately 200 in Michigan, 85 in Ohio, and 100 in Oregon. These numbers may vary based on when you independently do your analysis.** @@ -215,9 +215,9 @@ isd_history_mi <- dplyr::filter(isd_history, CTRY=="US", STATE=="MI") **Oregon** ```r?example=spatial -plot(isd_history_or$geometry, cex=0.5) -plot(or$geometry, col=alpha("gray", 0.5), border="#1F968BFF", lwd=1.5, add=TRUE) -plot(isd_history_or$geometry, add=T, pch=21, bg="#FDE725FF", cex=0.7, col="black") +plot(isd_history_or$geometry, cex = 0.5) +plot(or$geometry, col = alpha("gray", 0.5), border = "#1F968BFF", lwd = 1.5, add = TRUE) +plot(isd_history_or$geometry, add = T, pch = 21, bg = "#FDE725FF", cex = 0.7, col = "black") title("Oregon GSOD Climate Stations") ``` @@ -227,9 +227,9 @@ title("Oregon GSOD Climate Stations") **Ohio** ```r?example=spatial -plot(isd_history_oh$geometry, cex=0.5) -plot(oh$geometry, col=alpha("red", 0.5), border="gray", lwd=1.5, add=TRUE) -plot(isd_history_oh$geometry, add=T, pch=21, bg="black", cex=0.7, col="black") +plot(isd_history_oh$geometry, cex = 0.5) +plot(oh$geometry, col = alpha("red", 0.5), border = "gray", lwd = 1.5, add = TRUE) +plot(isd_history_oh$geometry, add = T, pch = 21, bg = "black", cex = 0.7, col = "black") title("Ohio GSOD Climate Stations") ``` @@ -238,9 +238,9 @@ title("Ohio GSOD Climate Stations") **Michigan** ```r?example=spatial -plot(isd_history_mi$geometry, cex=0.5) -plot(mi$geometry, col=alpha("green", 0.5), border="blue", lwd=1.5, add=TRUE) -plot(isd_history_mi$geometry, add=T, pch=21, bg="white", cex=0.7, col="black") +plot(isd_history_mi$geometry, cex = 0.5) +plot(mi$geometry, col = alpha("green", 0.5), border = "blue", lwd = 1.5, add = TRUE) +plot(isd_history_mi$geometry, add = T, pch = 21, bg = "white", cex = 0.7, col = "black") title("Michigan GSOD Climate Stations") ``` ![Michigan](Images/spatial_joins/join_image_7.png) @@ -253,7 +253,7 @@ title("Michigan GSOD Climate Stations") ```r?example=spatial or_co_isd_poly <- or_co[isd_history, ] -plot(or_co_isd_poly$geometry, col=alpha("green",0.7)) +plot(or_co_isd_poly$geometry, col = alpha("green", 0.7)) title("Oregon Counties with GSOD Climate Stations") ``` @@ -264,7 +264,7 @@ title("Oregon Counties with GSOD Climate Stations") ```r?example=spatial cand_co <- USAboundaries::us_counties(resolution = "high", states = c("OR", "OH", "MI")) cand_co_isd_poly <- cand_co[isd_history, ] -plot(cand_co_isd_poly$geometry, col=alpha("blue",0.7)) +plot(cand_co_isd_poly$geometry, col = alpha("blue", 0.7)) title("Counties in Candidate States with GSOD Climate Stations") ``` @@ -286,12 +286,12 @@ title("Counties in Candidate States with GSOD Climate Stations") ```r?example=spatial isd_or_co_pts <- st_join(isd_history, left = FALSE, or_co["name"]) -#Rename the county name variable county instead of name, since we already have NAME for the station location +# Rename the county name variable county instead of name, since we 
already have NAME for the station location colnames(isd_or_co_pts)[which(names(isd_or_co_pts) == "name")] <- "county" -plot(isd_or_co_pts$geometry, pch=21, cex=0.7, col="black", bg="orange") -plot(or_co$geometry, border="gray", col=NA, add=T) +plot(isd_or_co_pts$geometry, pch = 21, cex = 0.7, col = "black", bg = "orange") +plot(or_co$geometry, border = "gray", col = NA, add = T) ``` **You now have successfully joined the county name data into your new point data set! Those points in the plot now contain the county information for data analysis purposes.** @@ -322,3 +322,4 @@ isd_or_co_pts <- st_join(isd_history, left = FALSE, or_co) **You can use these to pare down your selections and joins to specific relationships.** **Good luck with your geospatial analysis!** + diff --git a/Geo-Spatial/spatial_lag_model.md b/Geo-Spatial/spatial_lag_model.md index a460867a..cf33d688 100644 --- a/Geo-Spatial/spatial_lag_model.md +++ b/Geo-Spatial/spatial_lag_model.md @@ -37,45 +37,42 @@ These examples will use some data on US colleges from [IPEDS](https://nces.ed.go ```python import pandas as pd + # can install all below with: # !pip install pysal from libpysal.cg import KDTree, RADIUS_EARTH_MILES from libpysal.weights import KNN from spreg import ML_Lag -url = ('https://github.com/LOST-STATS/lost-stats.github.io/raw/source' - '/Geo-Spatial/Data/Merging_Shape_Files/colleges_covid.csv') +url = ( + "https://github.com/LOST-STATS/lost-stats.github.io/raw/source" + "/Geo-Spatial/Data/Merging_Shape_Files/colleges_covid.csv" +) # specify index cols we need only for identification -- not modeling -df = pd.read_csv(url, index_col=['unitid', 'instnm']) +df = pd.read_csv(url, index_col=["unitid", "instnm"]) # we'll `pop` renaming columns so they're no longer in our dataframe -x = df.copy().dropna(how='any') +x = df.copy().dropna(how="any") # tree object is the main input to nearest neighbors tree = KDTree( - data=zip(x.pop('longitude'), x.pop('latitude')), + data=zip(x.pop("longitude"), x.pop("latitude")), # default is euclidean, but we want to use arc or haversine distance - distance_metric='arc', - radius=RADIUS_EARTH_MILES + distance_metric="arc", + radius=RADIUS_EARTH_MILES, ) nn = KNN(tree, k=5) -y = x.pop('covid_cases_per_cap_jul312020') +y = x.pop("covid_cases_per_cap_jul312020") # spreg only accepts numpy arrays or lists as arguments mod = ML_Lag( - y=y.to_numpy(), - x=x.to_numpy(), - w=nn, - name_y=y.name, - name_x=x.columns.tolist() + y=y.to_numpy(), x=x.to_numpy(), w=nn, name_y=y.name, name_x=x.columns.tolist() ) # results print(mod.summary) - - ``` ## R @@ -90,7 +87,7 @@ library(spdep) library(spatialreg) # Load data -df <- read.csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Geo-Spatial/Data/Merging_Shape_Files/colleges_covid.csv') +df <- read.csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Geo-Spatial/Data/Merging_Shape_Files/colleges_covid.csv") # Use latitude and longitude to determine the list of neighbors # Here we're using K-nearest-neighbors to find 5 neighbors for each college @@ -98,7 +95,7 @@ df <- read.csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Ge # Get latitude and longitude into a matrix # Make sure longitude comes first -loc_matrix <- as.matrix(df[, c('longitude','latitude')]) +loc_matrix <- as.matrix(df[, c("longitude", "latitude")]) # Get 5 nearest neighbors kn <- knearneigh(loc_matrix, 5) @@ -112,9 +109,10 @@ listw <- nb2listw(nb) # Use a spatial regression # This uses the method from Bivand & Piras (2015) 
https://www.jstatsoft.org/v63/i18/. -m <- lagsarlm(covid_cases_per_cap_jul312020 ~ pctdesom + pctdenon, - data = df, - listw = listw) +m <- lagsarlm(covid_cases_per_cap_jul312020 ~ pctdesom + pctdenon, + data = df, + listw = listw +) # Note that, whlie summary(m) will show rho below the regression results, # most regression-table functions like modelsummary::msummary() or jtools::export_summs() @@ -149,3 +147,4 @@ spregress covid_cases_per_cap_jul312020 pctdesom pctdenon, ml dvarlag(M) * Get impact of each predictor, including spillovers, with estat impact estat impact ``` + diff --git a/Machine_Learning/Machine_Learning.md b/Machine_Learning/Machine_Learning.md index 3e2eeb52..e7da9d4b 100644 --- a/Machine_Learning/Machine_Learning.md +++ b/Machine_Learning/Machine_Learning.md @@ -5,3 +5,4 @@ nav_order: 4 --- # Machine Learning + diff --git a/Machine_Learning/Nearest_Neighbor.md b/Machine_Learning/Nearest_Neighbor.md index 02808468..0e919072 100644 --- a/Machine_Learning/Nearest_Neighbor.md +++ b/Machine_Learning/Nearest_Neighbor.md @@ -3,7 +3,7 @@ title: K-Nearest Neighbor Matching parent: Machine Learning has_children: false nav_order: 1 -mathjax: true +mathjax: true --- ## Introduction @@ -13,48 +13,48 @@ K-Nearest Neighbor Matching is to classify a new input vector x, examine the k-c ## Keep in Mind -When to Consider | Advantages | Disadvantages +When to Consider | Advantages | Disadvantages ---------------- | ---------- | ------------- Instances map to points in $R^{n}$ | **Traning is very fast** | **Slow at query time** -Less than 20 attributes per instance | Learn complex target functions | Easily fooled by irrelevant attributes -Lots of training data | Do not lose information +Less than 20 attributes per instance | Learn complex target functions | Easily fooled by irrelevant attributes +Lots of training data | Do not lose information ## Also Consider -1. Distance measure +1. Distance measure * Most common: Euclidean distance * Euclidean distance makes sense when different measurements are commensurate; each is variable measured in the same units. * If the measurements are different, say length and weight, it is not clear. - + $$d_{E}(x^{i}, x^{j}) = (\sum_{k=1}^{p}(x^{i}_k - x^{j}_k)^2)^\frac{1}{2}$$ 2. Standardization - * When variables are not commensurate, we want to standardize them by dividing by the sample standard deviation. This makes them all equally important. + * When variables are not commensurate, we want to standardize them by dividing by the sample standard deviation. This makes them all equally important. * The estimate for the standard deviation of $x_k$: $$\hat{\sigma}_k = \biggl(\frac{1}{n}\sum_{i=1}^{n}(x^{i}_k - \bar{x}_k)^2\biggr)^\frac{1}{2}$$ - where $\bar{x}_k$ is the sample mean: + where $\bar{x}_k$ is the sample mean: $$\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n}x^i_k $$ 3. Weighted Euclidean Distance * Finally, if we have some idea of the relative importance of each variable, we can weight them: - + $$d_{WE}(i,j) = \biggl(\sum_{k=1}^{p}w_k(x^i_k - x^j_k)^2\biggr)^\frac{1}{2} $$ -4. Choosing k +4. Choosing k * Increasing k reduces variance and increases bias. - + 5. For high-dimensional space, problem that the nearest neighbor may not be very close at all. -6. Memory-based technique. Must make a pass through the data for each classification. This can be prohibitive for large data sets. +6. Memory-based technique. Must make a pass through the data for each classification. This can be prohibitive for large data sets. 
+6. Memory-based technique. Must make a pass through the data for each classification. This can be prohibitive for large data sets.
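The distance measures in the list above can be computed in a few lines. Here is a minimal **numpy** sketch; the feature vectors, the small sample used to estimate the standard deviations, and the weights are all illustrative assumptions, not part of any real dataset.

```python
import numpy as np

# Two illustrative feature vectors (hypothetical values in mixed units)
x_i = np.array([1.0, 200.0, 3.5])
x_j = np.array([2.0, 150.0, 1.5])

# Plain Euclidean distance
d_euclidean = np.sqrt(np.sum((x_i - x_j) ** 2))

# Standardized Euclidean distance: divide each variable by its sample
# standard deviation, estimated here from a tiny illustrative sample X
# (np.std with its default ddof=0 matches the 1/n formula above)
X = np.array([[1.0, 200.0, 3.5],
              [2.0, 150.0, 1.5],
              [0.5, 180.0, 2.0],
              [1.5, 210.0, 4.0]])
sigma_hat = X.std(axis=0)
d_standardized = np.sqrt(np.sum(((x_i - x_j) / sigma_hat) ** 2))

# Weighted Euclidean distance with (assumed) importance weights w_k
w = np.array([0.5, 0.3, 0.2])
d_weighted = np.sqrt(np.sum(w * (x_i - x_j) ** 2))

print(d_euclidean, d_standardized, d_weighted)
```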
# Implementations ## Python -For KNN, it is not required to import packages other than **numpy**. You can basically do KNN with one package because it is mostly about computing distance and normalization. You would need TensorFlow and Keras as you try more advanced algorithms such as convolutional neural network. +For KNN, it is not required to import packages other than **numpy**. You can basically do KNN with one package because it is mostly about computing distance and normalization. You would need TensorFlow and Keras as you try more advanced algorithms such as convolutional neural network. ```c import argparse @@ -148,9 +148,9 @@ def main(): (test_x, test_y) = read_data(args.test) # Normalize the training data - (train_x, test_x) = normalize_data(train_x, test_x, + (train_x, test_x) = normalize_data(train_x, test_x, args.rangenorm, args.varnorm, args.exnorm) - + acc = runTest(test_x, test_y,train_x, train_y,args.k) print("Accuracy: ",acc) diff --git a/Machine_Learning/artificial_neural_network.md b/Machine_Learning/artificial_neural_network.md index aabbb19e..3c3a5491 100644 --- a/Machine_Learning/artificial_neural_network.md +++ b/Machine_Learning/artificial_neural_network.md @@ -42,9 +42,9 @@ X, y = make_regression(n_samples=1000, n_features=10) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1) # Create and fit model -regr = MLPRegressor(hidden_layer_sizes=(100,), - activation='relu').fit(X_train, y_train) +regr = MLPRegressor(hidden_layer_sizes=(100,), activation="relu").fit(X_train, y_train) # Compute R^2 score regr.score(X_test, y_test) ``` + diff --git a/Machine_Learning/boosted_regression_trees.md b/Machine_Learning/boosted_regression_trees.md index 9e4596e8..cece491c 100644 --- a/Machine_Learning/boosted_regression_trees.md +++ b/Machine_Learning/boosted_regression_trees.md @@ -49,10 +49,9 @@ X_train, X_test, y_train, y_test = train_test_split(X, y) # The number of trees is set by n_estimators; there are many other options that # you should experiment with. Typically the defaults will be sensible but are # unlikely to be perfect for your use case. 
Let's create the empty model: -reg = GradientBoostingRegressor(n_estimators=100, - max_depth=3, - learning_rate=0.1, - min_samples_split=3) +reg = GradientBoostingRegressor( + n_estimators=100, max_depth=3, learning_rate=0.1, min_samples_split=3 +) # Fit the model reg.fit(X_train, y_train) @@ -102,10 +101,10 @@ carInsurance_train <- read_csv("https://raw.githubusercontent.com/LOST-STATS/LOS summary(carInsurance_train) # Produce a training and a testing subset of the data -sample = sample.split(carInsurance_train$Id, SplitRatio = .8) -train = subset(carInsurance_train, sample == TRUE) -test = subset(carInsurance_train, sample == FALSE) -total <- rbind(train ,test) +sample <- sample.split(carInsurance_train$Id, SplitRatio = .8) +train <- subset(carInsurance_train, sample == TRUE) +test <- subset(carInsurance_train, sample == FALSE) +total <- rbind(train, test) gg_miss_upset(total) ``` @@ -113,18 +112,18 @@ Step 1: Produce dummies as appropriate ```r?example=boosting total$CallStart <- as.character(total$CallStart) -total$CallStart <- strptime(total$CallStart,format=" %H:%M:%S") +total$CallStart <- strptime(total$CallStart, format = " %H:%M:%S") total$CallEnd <- as.character(total$CallEnd) -total$CallEnd <- strptime(total$CallEnd,format=" %H:%M:%S") -total$averagetimecall <- as.numeric(as.POSIXct(total$CallEnd)-as.POSIXct(total$CallStart),units="secs") -time <- mean(total$averagetimecall,na.rm = TRUE) +total$CallEnd <- strptime(total$CallEnd, format = " %H:%M:%S") +total$averagetimecall <- as.numeric(as.POSIXct(total$CallEnd) - as.POSIXct(total$CallStart), units = "secs") +time <- mean(total$averagetimecall, na.rm = TRUE) ``` Produce dummy variables as appropriate ```r?example=boosting total_df <- dummy.data.frame(total %>% - dplyr::select(-CallStart, -CallEnd, -Id, -Outcome)) + dplyr::select(-CallStart, -CallEnd, -Id, -Outcome)) summary(total_df) ``` @@ -132,8 +131,8 @@ Fill in missing values ```r?example=boosting total_df$Job[is.na(total_df$Job)] <- "management" -total_df$Education [is.na(total_df$Education)] <- "secondary" -total_df$Marital[is.na(total_df$Marital)] <-"married" +total_df$Education[is.na(total_df$Education)] <- "secondary" +total_df$Marital[is.na(total_df$Marital)] <- "married" total_df$Communication[is.na(total_df$Communication)] <- "cellular" total_df$LastContactMonth[is.na(total_df$LastContactMonth)] <- "may" ``` @@ -143,7 +142,7 @@ Step 2: Preprocess data with median imputation and a central scaling ```r?example=boosting clean_new <- preProcess( x = total_df %>% dplyr::select(-CarInsurance) %>% as.matrix(), - method = c('medianImpute') + method = c("medianImpute") ) %>% predict(total_df) ``` @@ -167,8 +166,8 @@ Step 5: Train the boosted regression tree Notice that `trControl` is being set to select parameters using five-fold cross-validation (`"cv"`). 
```r?example=boosting -carinsurance_boost = train( - factor(CarInsurance)~., +carinsurance_boost <- train( + factor(CarInsurance) ~ ., data = trainclean, method = "gbm", trControl = trainControl( @@ -179,6 +178,8 @@ carinsurance_boost = train( "n.trees" = seq(25, 200, by = 25), "interaction.depth" = 1:3, "shrinkage" = c(0.1, 0.01, 0.001), - "n.minobsinnode" = 5) + "n.minobsinnode" = 5 + ) ) ``` + diff --git a/Machine_Learning/causal_forest.md b/Machine_Learning/causal_forest.md index 233da72c..28f8be6f 100644 --- a/Machine_Learning/causal_forest.md +++ b/Machine_Learning/causal_forest.md @@ -34,18 +34,26 @@ from sklearn.model_selection import train_test_split from sklearn.tree import DecisionTreeRegressor from econml.ortho_forest import ContinuousTreatmentOrthoForest as CausalForest -df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Crime.csv') +df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Crime.csv") # Set the categorical variables: -cat_vars = ['year', 'region', 'smsa'] +cat_vars = ["year", "region", "smsa"] # Transform the categorical variables to dummies and add them back in xf = pd.get_dummies(df[cat_vars]) df = pd.concat([df.drop(cat_vars, axis=1), xf], axis=1) cat_var_dummy_names = list(xf.columns) -regressors = ['prbarr', 'prbconv', 'prbpris', - 'avgsen', 'polpc', 'density', 'taxpc', - 'pctmin', 'wcon'] +regressors = [ + "prbarr", + "prbconv", + "prbpris", + "avgsen", + "polpc", + "density", + "taxpc", + "pctmin", + "wcon", +] # Add in the dummy names to the list of regressors regressors = regressors + cat_var_dummy_names @@ -53,13 +61,10 @@ regressors = regressors + cat_var_dummy_names train, test = train_test_split(df, test_size=0.2) # Estimate causal forest -estimator = CausalForest(n_trees=100, - model_T=DecisionTreeRegressor(), - model_Y=DecisionTreeRegressor()) -estimator.fit(train['crmrte'], - train['pctymle'], - train[regressors], - inference='blb') +estimator = CausalForest( + n_trees=100, model_T=DecisionTreeRegressor(), model_Y=DecisionTreeRegressor() +) +estimator.fit(train["crmrte"], train["pctymle"], train[regressors], inference="blb") effects_train = estimator.effect(train[regressors]) effects_test = estimator.effect(test[regressors]) conf_intrvl = estimator.effect_interval(test[regressors]) @@ -69,13 +74,13 @@ conf_intrvl = estimator.effect_interval(test[regressors]) The **grf** package has a `causal_forest` function that can be used to estimate causal forests. Additional functions afterwards can estimate, for example, the `average_treatment_effect()`. See `help(package='grf')` for more options. -```R +```r # If necessary # install.packages('grf') library(grf) # Get crime data from North Carolina -df <- read.csv('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Crime.csv') +df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Crime.csv") # It's not, but let's pretend that "percentage of young males" pctymle is exogenous # and see how the effect of it on crmrte varies across the other measured covariates @@ -83,12 +88,12 @@ df <- read.csv('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Crime.c # Make sure the data has no missing values. 
Here I'm dropping observations # with missing values in any variable, but you can limit the data first to just # variables used in analysis to only drop observations with missing values in those variables -df <- df[complete.cases(df),] +df <- df[complete.cases(df), ] # Let's use training and holdout data split <- sample(c(FALSE, TRUE), nrow(df), replace = TRUE) -df.train <- df[split,] -df.hold <- df[!split,] +df.train <- df[split, ] +df.hold <- df[!split, ] # Isolate the "treatment" as a matrix pctymle <- as.matrix(df.train$pctymle) @@ -98,20 +103,20 @@ crmrte <- as.matrix(df.train$crmrte) # Use model.matrix to get our predictor matrix # We might also consider adding interaction terms -X <- model.matrix(lm(crmrte ~ -1 + factor(year) + prbarr + prbconv + prbpris + - avgsen + polpc + density + taxpc + factor(region) + factor(smsa) + - pctmin + wcon, data = df.train)) +X <- model.matrix(lm(crmrte ~ -1 + factor(year) + prbarr + prbconv + prbpris + + avgsen + polpc + density + taxpc + factor(region) + factor(smsa) + + pctmin + wcon, data = df.train)) # Estimate causal forest -cf <- causal_forest(X,crmrte,pctymle) +cf <- causal_forest(X, crmrte, pctymle) # Get predicted causal effects for each observation effects <- predict(cf)$predictions # And use holdout X's for prediction -X.hold <- model.matrix(lm(crmrte ~ -1 + factor(year) + prbarr + prbconv + prbpris + - avgsen + polpc + density + taxpc + factor(region) + factor(smsa) + - pctmin + wcon, data = df.hold)) +X.hold <- model.matrix(lm(crmrte ~ -1 + factor(year) + prbarr + prbconv + prbpris + + avgsen + polpc + density + taxpc + factor(region) + factor(smsa) + + pctmin + wcon, data = df.hold)) # And get effects effects.hold <- predict(cf, X.hold)$predictions @@ -167,3 +172,4 @@ causal_forest crmrte pctymle year prbarr prbconv prbpris avgsen polpc density ta * Look at the holdout effects predicted di "`r(effects_hold)'" ``` + diff --git a/Machine_Learning/decision_trees.md b/Machine_Learning/decision_trees.md index fb25eeeb..a8bdc8fd 100644 --- a/Machine_Learning/decision_trees.md +++ b/Machine_Learning/decision_trees.md @@ -53,20 +53,22 @@ from sklearn.model_selection import train_test_split from sklearn.metrics import plot_confusion_matrix import pandas as pd -titanic = pd.read_csv("https://raw.githubusercontent.com/Evanmj7/Decision-Trees/master/titanic.csv", - index_col=0) +titanic = pd.read_csv( + "https://raw.githubusercontent.com/Evanmj7/Decision-Trees/master/titanic.csv", + index_col=0, +) # Let's ensure the columns we want to treat as continuous are indeed continuous by using pd.to_numeric # The errors = 'coerce' keyword argument will force any values that cannot be # cast into continuous variables to become NaNs. -continuous_cols = ['age', 'fare'] +continuous_cols = ["age", "fare"] for col in continuous_cols: - titanic[col] = pd.to_numeric(titanic[col], errors='coerce') + titanic[col] = pd.to_numeric(titanic[col], errors="coerce") # Set categorical cols & convert to dummies -cat_cols = ['sex', 'pclass'] +cat_cols = ["sex", "pclass"] for col in cat_cols: - titanic[col] = titanic[col].astype('category').cat.codes + titanic[col] = titanic[col].astype("category").cat.codes # Clean the dataframe. 
An alternative would be to retain some rows with missing values by giving # a special value to nan for each column, eg by imputing some values, but one should be careful not to @@ -78,14 +80,13 @@ titanic = titanic.dropna() # Create list of regressors regressors = continuous_cols + cat_cols # Predicted var -y_var = ['survived'] +y_var = ["survived"] # Create a test (25% of data) and train set train, test = train_test_split(titanic, test_size=0.25) # Now let's create an empty decision tree to solve the classification problem: -clf = tree.DecisionTreeClassifier(max_depth=10, min_samples_split=5, - ccp_alpha=0.01) +clf = tree.DecisionTreeClassifier(max_depth=10, min_samples_split=5, ccp_alpha=0.01) # The last option, ccp_alpha, prunes low-value complexity from the tree to help # avoid overfitting. @@ -97,10 +98,10 @@ tree.plot_tree(clf) # How does it perform on the train and test data? train_accuracy = round(clf.score(train[regressors], train[y_var]), 4) -print(f'Accuracy on train set is {train_accuracy}') +print(f"Accuracy on train set is {train_accuracy}") test_accuracy = round(clf.score(test[regressors], test[y_var]), 4) -print(f'Accuracy on test set is {test_accuracy}') +print(f"Accuracy on test set is {test_accuracy}") # Show the confusion matrix plot_confusion_matrix(clf, test[regressors], test[y_var]) @@ -133,8 +134,8 @@ titanic$fare <- as.numeric(titanic$fare) # As with all machine learning methodologies, we want to create a test and a training dataset # Take a random sample of the data, here we have chosen to use 75% for training and 25% for validation -samp_size <- floor(0.75*nrow(titanic)) -train_index <- sample(seq_len(nrow(titanic)),size=samp_size,replace=FALSE) +samp_size <- floor(0.75 * nrow(titanic)) +train_index <- sample(seq_len(nrow(titanic)), size = samp_size, replace = FALSE) train <- titanic[train_index, ] test <- titanic[-train_index, ] @@ -145,29 +146,29 @@ test <- titanic[-train_index, ] basic_tree <- rpart( survived ~ pclass + sex + age + fare + embarked, # our formula - data=train, + data = train, method = "class", # tell the model we are doing classification - minsplit=2, # set a minimum number of splits - cp=.02 # set an optional penalty rate. It is often useful to try out many different ones, use the caret package to test many at once + minsplit = 2, # set a minimum number of splits + cp = .02 # set an optional penalty rate. It is often useful to try out many different ones, use the caret package to test many at once ) basic_tree # plot it using the packages we loaded above -fancyRpartPlot(basic_tree,caption="Basic Decision Tree") +fancyRpartPlot(basic_tree, caption = "Basic Decision Tree") # This plot gives a very intuitive visual representation on what is going on behind the scenes. # Now we should predict using the test data we left out! -predictions <- predict(basic_tree,newdata=test,type="class") +predictions <- predict(basic_tree, newdata = test, type = "class") # Make the numeric responses as well as the variables that we are testing on into factors predictions <- as.factor(predictions) test$survived <- as.factor(test$survived) # Create a confusion matrix which tells us how well we did. -confusionMatrix(predictions,test$survived) +confusionMatrix(predictions, test$survived) # This particular model got ~80% accuracy. This varies each time if you do not set a seed. Much better than a coin toss, but not great. With some additional tuning a decision tree can be much more accurate! 
Try it for yourself by changing the factors that go into the prediction and the penalty rates. - ``` + diff --git a/Machine_Learning/penalized_regression.md b/Machine_Learning/penalized_regression.md index 21194056..921c28cc 100644 --- a/Machine_Learning/penalized_regression.md +++ b/Machine_Learning/penalized_regression.md @@ -8,7 +8,7 @@ mathjax: true ## Switch to false if this page has no equations or other math ren # Penalized Regression -When running a regression, especially one with many predictors, the results have a tendency to overfit the data, reducing out-of-sample predictive properties. +When running a regression, especially one with many predictors, the results have a tendency to overfit the data, reducing out-of-sample predictive properties. Penalized regression eases this problem by forcing the regression estimator to shrink its coefficients towards 0 in order to avoid the "penalty" term imposed on the coefficients. This process is closely related to the idea of Bayesian shrinkage, and indeed standard penalized regression results are equivalent to regression performed using [certain Bayesian priors](https://amstat.tandfonline.com/doi/abs/10.1198/016214508000000337?casa_token=DE6O93Bz7uUAAAAA:Ff_MiPXvPH32NA2hnGtZtqb8grXEiEqF0fdO3B0p_a6wOaqRciCZ4ASwxn69gdOb93Lbt-HSyK1o4As). @@ -24,7 +24,7 @@ $$ \min\left(\sum_i(y_i - X_i\hat{\beta})^2 + \lambda\left\lVert\beta\right\rVert_p \right) $$ -Typically $$p$$ is set to 1 for LASSO regression (least absolute shrinkage and selection operator), which has the effect of tending to set coefficients to 0, i.e. model selection, or to 2 for Ridge Regression. Elastic net regression provides a weighted mix of LASSO and Ridge penalties, commonly referring to the weight as $$\alpha$$. +Typically $$p$$ is set to 1 for LASSO regression (least absolute shrinkage and selection operator), which has the effect of tending to set coefficients to 0, i.e. model selection, or to 2 for Ridge Regression. Elastic net regression provides a weighted mix of LASSO and Ridge penalties, commonly referring to the weight as $$\alpha$$. ## Keep in Mind @@ -55,13 +55,13 @@ library(glmnet) data(iris) # Create a matrix with all variables other than our dependent vairable, Sepal.Length -# and interactions. +# and interactions. # -1 to omit the intercept M <- model.matrix(lm(Sepal.Length ~ (.)^2 - 1, data = iris)) # Add squared terms of numeric variables numeric.var.names <- names(iris)[2:4] -M <- cbind(M,as.matrix(iris[,numeric.var.names]^2)) -colnames(M)[16:18] <- paste(numeric.var.names,'squared') +M <- cbind(M, as.matrix(iris[, numeric.var.names]^2)) +colnames(M)[16:18] <- paste(numeric.var.names, "squared") # Create a matrix for our dependent variable too Y <- as.matrix(iris$Sepal.Length) @@ -80,7 +80,7 @@ Y <- scale(Y) cv.lasso <- cv.glmnet(M, Y, family = "gaussian", nfolds = 20, alpha = 1) # We might want to see how the choice of lambda relates to out-of-sample error with a plot plot(cv.lasso) -# After doing CV, we commonly pick the lambda.min for lambda, +# After doing CV, we commonly pick the lambda.min for lambda, # which is the lambda that minimizes out-of-sample error # or lambda.1se, which is one standard error above lambda.min, # which penalizes more harshly. The choice depends on context. 
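# As an illustration (a minimal sketch using the objects created above):
# once cv.glmnet() has run, the coefficients at either choice of lambda can
# be extracted directly -- with the LASSO penalty, some of them will be
# exactly zero -- and fitted values can be generated at the same lambda.
coef(cv.lasso, s = "lambda.min")
coef(cv.lasso, s = "lambda.1se")
fitted_lasso <- predict(cv.lasso, newx = M, s = "lambda.min")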
@@ -151,3 +151,4 @@ lassocoef * By default, alpha will be selected by cross-validation as well elasticnet linear wage `numeric_vars' f*_* interact_*, sel(cv) ``` + diff --git a/Machine_Learning/random_forest.md b/Machine_Learning/random_forest.md index 333261ea..297aba39 100644 --- a/Machine_Learning/random_forest.md +++ b/Machine_Learning/random_forest.md @@ -55,7 +55,6 @@ y_pred = model.predict(X_test) # Evaluate model prediction print(f"Accuracy is {accuracy_score(y_pred, y_test)*100:.2f} %") - ``` ## R @@ -65,7 +64,7 @@ There are a number of packages in R capable of training a random forest, includi We'll be using a built-in dataset in R, called "Iris". There are five variables in this dataset, including species, petal width and length as well as sepal length and width. ```r -#Load packages +# Load packages library(tidyverse) library(rvest) library(dplyr) @@ -74,37 +73,38 @@ library(randomForest) library(Metrics) library(readr) -#Read data in R +# Read data in R data(iris) iris -#Create features and target +# Create features and target X <- iris %>% select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) y <- iris$Species -#Split data into training and test sets -index <- createDataPartition(y, p=0.75, list=FALSE) -X_train <- X[ index, ] +# Split data into training and test sets +index <- createDataPartition(y, p = 0.75, list = FALSE) +X_train <- X[index, ] X_test <- X[-index, ] y_train <- y[index] -y_test<-y[-index] +y_test <- y[-index] -#Train the model -iris_rf <- randomForest(x = X_train, y = y_train , maxnodes = 10, ntree = 10) +# Train the model +iris_rf <- randomForest(x = X_train, y = y_train, maxnodes = 10, ntree = 10) print(iris_rf) -#Make predictions +# Make predictions predictions <- predict(iris_rf, X_test) result <- X_test -result['Species'] <- y_test -result['Prediction']<- predictions +result["Species"] <- y_test +result["Prediction"] <- predictions head(result) -#Check the classification accuracy (number of correct predictions out of total datapoints used to test the prediction) -print(sum(predictions==y_test)) +# Check the classification accuracy (number of correct predictions out of total datapoints used to test the prediction) +print(sum(predictions == y_test)) print(length(y_test)) -print(sum(predictions==y_test)/length(y_test)) +print(sum(predictions == y_test) / length(y_test)) ``` + diff --git a/Machine_Learning/support_vector_machine.md b/Machine_Learning/support_vector_machine.md index 3a107c2e..cc31b153 100644 --- a/Machine_Learning/support_vector_machine.md +++ b/Machine_Learning/support_vector_machine.md @@ -8,33 +8,33 @@ nav_order: 1 # Support Vector Machine -A support vector machine (hereinafter, SVM) is a supervised machine learning algorithm in that it is trained by a set of data and then classifies any new input data depending on what it learned during the training phase. SVM can be used both for classification and regression problems but here we focus on its use for classification. +A support vector machine (hereinafter, SVM) is a supervised machine learning algorithm in that it is trained by a set of data and then classifies any new input data depending on what it learned during the training phase. SVM can be used both for classification and regression problems but here we focus on its use for classification. -The idea is to separate two distinct groups by maximizing the distance between those points that are most hard to classify. 
To put it more formally, it maximizes the distance or margin between support vectors around the separating hyperplane. Support vectors here imply the data points that lie closest to the hyperplane. Hyperplanes are decision boundaries that are represented by a line (in two dimensional space) or a plane (in three dimensional space) that separate the two groups. +The idea is to separate two distinct groups by maximizing the distance between those points that are most hard to classify. To put it more formally, it maximizes the distance or margin between support vectors around the separating hyperplane. Support vectors here imply the data points that lie closest to the hyperplane. Hyperplanes are decision boundaries that are represented by a line (in two dimensional space) or a plane (in three dimensional space) that separate the two groups. -Suppose a hypothetical problem of classifying apples from lemons. Support vectors in this case are apples that look closest to lemons and lemons that look closest to apples. They are the most difficult ones to classify. SVM draws a separating line or hyperplane that maximizes the distance or margin between support vectors, in this case the apples that look closest to the lemons and lemons that look closest to apples. Therefore support vectors are critical in determining the position as well as the slope of the hyperplane. +Suppose a hypothetical problem of classifying apples from lemons. Support vectors in this case are apples that look closest to lemons and lemons that look closest to apples. They are the most difficult ones to classify. SVM draws a separating line or hyperplane that maximizes the distance or margin between support vectors, in this case the apples that look closest to the lemons and lemons that look closest to apples. Therefore support vectors are critical in determining the position as well as the slope of the hyperplane. For additional information about the support vector regression or support vector machine, refer to [Wikipedia: Support-vector machine](https://en.wikipedia.org/wiki/Support-vector_machine). # Keep in Mind -- Note that optimization problem to solve for a linear separator is maximizing the margin which could be calculated as $$\frac{2}{\lVert w \rVert}$$. This could then be rewritten as minimizing $$\lVert w \rVert$$, or minimizing a monotonic transformation version of it expressed as $$\frac{1}{2}\lVert w \rVert^2$$. Additional constraint of $$y_i(w^T x_i + b) \geq 1$$ needs to be imposed to ensure that the data points are still correctly classified. As such, the constrained optimization problem for SVM looks as the following: +- Note that optimization problem to solve for a linear separator is maximizing the margin which could be calculated as $$\frac{2}{\lVert w \rVert}$$. This could then be rewritten as minimizing $$\lVert w \rVert$$, or minimizing a monotonic transformation version of it expressed as $$\frac{1}{2}\lVert w \rVert^2$$. Additional constraint of $$y_i(w^T x_i + b) \geq 1$$ needs to be imposed to ensure that the data points are still correctly classified. As such, the constrained optimization problem for SVM looks as the following: $$ \text{min} \frac{\lVert w \rVert ^2}{2} $$ -s.t. $$y_i(w^T x_i + b) \geq 1$$, +s.t. $$y_i(w^T x_i + b) \geq 1$$, -where $$w$$ is a weight vector, $$x_i$$ is each data point, $$b$$ is bias, and $$y_i$$ is each data point's corresponding label that takes the value of either $$\{-1, 1\}$$. 
+where $$w$$ is a weight vector, $$x_i$$ is each data point, $$b$$ is bias, and $$y_i$$ is each data point's corresponding label that takes the value of either $$\{-1, 1\}$$. For detailed information about derivation of the optimization problem, refer to [MIT presentation slides](http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf), [The Math Behind Support Vector Machines](https://www.byteofmath.com/the-math-behind-support-vector-machines/), and [Demystifying Maths of SVM - Part1](https://towardsdatascience.com/demystifying-maths-of-svm-13ccfe00091e). -- If data points are not linearly separable, non-linear SVM introduces higher dimensional space that projects data points from original finite-dimensional space to gain linearly separation. Such process of mapping data points into a higher dimensional space is known as the Kernel Trick. There are numerous types of Kernels that can be used to create higher dimensional space including linear, polynomial, Sigmoid, and Radial Basis Function. +- If data points are not linearly separable, non-linear SVM introduces higher dimensional space that projects data points from original finite-dimensional space to gain linearly separation. Such process of mapping data points into a higher dimensional space is known as the Kernel Trick. There are numerous types of Kernels that can be used to create higher dimensional space including linear, polynomial, Sigmoid, and Radial Basis Function. - Setting the right form of Kernel is important as it determines the structure of the separator or hyperplane. -# Also Consider +# Also Consider -- See the alternative classification method described on the [K-Nearest Neighbor Matching]({{ "/Machine_Learning/Nearest_Neighbor.html" | relative_url }}). +- See the alternative classification method described on the [K-Nearest Neighbor Matching]({{ "/Machine_Learning/Nearest_Neighbor.html" | relative_url }}). 
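Before turning to the implementations, one quick way to connect the optimization problem above to code: after a linear SVM is fit, the margin implied by the estimated weight vector is $$\frac{2}{\lVert w \rVert}$$, which can be computed directly. Below is a minimal sketch assuming **scikit-learn** (the same library used in the Python implementation that follows); the generated toy data are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy, roughly separable data (illustrative only)
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=1, class_sep=2.0, random_state=0)

# Fit a linear SVM and pull out the estimated weight vector w
svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
w = svm.coef_.ravel()

# Margin implied by the fitted separator: 2 / ||w||
print(f"Implied margin: {2 / np.linalg.norm(w):.3f}")
```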
# Implementations @@ -55,10 +55,10 @@ import numpy as np ########################### # Generate linearly separable data: -X, y = make_classification(n_features=2, n_redundant=0, n_informative=1, - n_clusters_per_class=1) -X_train, X_test, y_train, y_test = train_test_split( - X, y, test_size=0.2) +X, y = make_classification( + n_features=2, n_redundant=0, n_informative=1, n_clusters_per_class=1 +) +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # Train linear SVM model svm = LinearSVC(tol=1e-5) @@ -66,7 +66,7 @@ svm.fit(X_train, y_train) # Test model test_score = svm.score(X_test, y_test) -print(f'The test score is {test_score}') +print(f"The test score is {test_score}") ############################### # Example 2: Non-linear SVM ### @@ -74,37 +74,43 @@ print(f'The test score is {test_score}') # Generate non-linearly separable data X, y = make_gaussian_quantiles(n_features=2, n_classes=2) -X_train, X_test, y_train, y_test = train_test_split( - X, y, test_size=0.2) +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # Train non-linear SVM model -nl_svm = SVC(kernel='rbf', C=50) +nl_svm = SVC(kernel="rbf", C=50) nl_svm.fit(X_train, y_train) # Test model test_score = nl_svm.score(X_test, y_test) -print(f'The non-linear test score is {test_score}') +print(f"The non-linear test score is {test_score}") #################################### # Plot non-linear SVM boundaries ### #################################### plt.figure() decision_function = nl_svm.decision_function(X) -support_vector_indices = np.where( - np.abs(decision_function) <= 1 + 1e-15)[0] +support_vector_indices = np.where(np.abs(decision_function) <= 1 + 1e-15)[0] support_vectors = X[support_vector_indices] plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired) ax = plt.gca() xlim = ax.get_xlim() ylim = ax.get_ylim() -xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50), - np.linspace(ylim[0], ylim[1], 50)) +xx, yy = np.meshgrid( + np.linspace(xlim[0], xlim[1], 50), np.linspace(ylim[0], ylim[1], 50) +) Z = nl_svm.decision_function(np.c_[xx.ravel(), yy.ravel()]) Z = Z.reshape(xx.shape) -plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, - linestyles=['--', '-', '--']) -plt.scatter(support_vectors[:, 0], support_vectors[:, 1], s=100, - linewidth=1, facecolors='none', edgecolors='k') +plt.contour( + xx, yy, Z, colors="k", levels=[-1, 0, 1], alpha=0.5, linestyles=["--", "-", "--"] +) +plt.scatter( + support_vectors[:, 0], + support_vectors[:, 1], + s=100, + linewidth=1, + facecolors="none", + edgecolors="k", +) plt.tight_layout() plt.show() ``` @@ -112,18 +118,18 @@ plt.show() ## R -There are a couple of ways to implement SVM in R. Here we'll demonstrate using the **e1071** package. To learn more about the package, check out its [CRAN page](https://cran.r-project.org/web/packages/e1071/index.html), as well as [this vignette](https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf). Note that we'll also load the **tidyverse** to help with some data wrangling and plotting. +There are a couple of ways to implement SVM in R. Here we'll demonstrate using the **e1071** package. To learn more about the package, check out its [CRAN page](https://cran.r-project.org/web/packages/e1071/index.html), as well as [this vignette](https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf). Note that we'll also load the **tidyverse** to help with some data wrangling and plotting. -Two examples are shown below that use linear SVM and non-linear SVM respectively. 
The first example shows how to implement linear SVM. We start by constructing data, separating them into training and test set. Using the training set, we fit the data using the `svm()` function. Notice that kernel argument for ``svm()`` function is specified as *linear* for our first example. Next, we predict the test data based on the model estimates using the `predict()` function. The first example result suggests that only one out of 59 data points is incorrectly classified. +Two examples are shown below that use linear SVM and non-linear SVM respectively. The first example shows how to implement linear SVM. We start by constructing data, separating them into training and test set. Using the training set, we fit the data using the `svm()` function. Notice that kernel argument for ``svm()`` function is specified as *linear* for our first example. Next, we predict the test data based on the model estimates using the `predict()` function. The first example result suggests that only one out of 59 data points is incorrectly classified. -The second example shows how to implement non-linear SVM. The data in example two is generated in a way to have data points of one class centered around the middle whereas data points of the other class spread on two sides. Notice that kernel argument for the `svm()` function is specified as **radial** for our second example, based on the shape of the data. The second example result suggests that only two out of 58 data points are incorrectly classified. +The second example shows how to implement non-linear SVM. The data in example two is generated in a way to have data points of one class centered around the middle whereas data points of the other class spread on two sides. Notice that kernel argument for the `svm()` function is specified as **radial** for our second example, based on the shape of the data. The second example result suggests that only two out of 58 data points are incorrectly classified. 
```r # Install and load the packages if (!require("tidyverse")) install.packages("tidyverse") if (!require("e1071")) install.packages("e1071") library(tidyverse) # package for data manipulation -library(e1071) # package for SVM +library(e1071) # package for SVM ########################### # Example 1: Linear SVM ### @@ -131,31 +137,31 @@ library(e1071) # package for SVM # Construct a completely separable data set ## Set seed for replication -set.seed(0715) -## Make variable x -x = matrix(rnorm(200, mean = 0, sd = 1), nrow = 100, ncol = 2) +set.seed(0715) +## Make variable x +x <- matrix(rnorm(200, mean = 0, sd = 1), nrow = 100, ncol = 2) ## Make variable y that labels x by either -1 or 1 -y = rep(c(-1, 1), c(50, 50)) -## Make x to have unilaterally higher value when y equals 1 -x[y == 1,] = x[y == 1,] + 3.5 +y <- rep(c(-1, 1), c(50, 50)) +## Make x to have unilaterally higher value when y equals 1 +x[y == 1, ] <- x[y == 1, ] + 3.5 ## Construct data set -d1 = data.frame(x1 = x[,1], x2 = x[,2], y = as.factor(y)) +d1 <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y)) ## Split it into training and test data -flag = sample(c(0,1), size = nrow(d1), prob=c(0.5,0.5), replace = TRUE) -d1 = setNames(split(d1, flag), c("train", "test")) +flag <- sample(c(0, 1), size = nrow(d1), prob = c(0.5, 0.5), replace = TRUE) +d1 <- setNames(split(d1, flag), c("train", "test")) # Plot ggplot(data = d1$train, aes(x = x1, y = x2, color = y, shape = y)) + - geom_point(size = 2) + + geom_point(size = 2) + scale_color_manual(values = c("darkred", "steelblue")) -# SVM classification -svmfit1 = svm(y ~ ., data = d1$train, kernel = "linear", cost = 10, scale = FALSE) +# SVM classification +svmfit1 <- svm(y ~ ., data = d1$train, kernel = "linear", cost = 10, scale = FALSE) print(svmfit1) plot(svmfit1, d1$train) # Predictability -pred.d1 = predict(svmfit1, newdata = d1$test) +pred.d1 <- predict(svmfit1, newdata = d1$test) table(pred.d1, d1$test$y) ############################### @@ -163,41 +169,40 @@ table(pred.d1, d1$test$y) ############################### # Construct less separable data set -## Make variable x -x = matrix(rnorm(200, mean = 0, sd = 1), nrow = 100, ncol = 2) +## Make variable x +x <- matrix(rnorm(200, mean = 0, sd = 1), nrow = 100, ncol = 2) ## Make variable y that labels x by either -1 or 1 -y <- rep(c(-1, 1) , c(50, 50)) -## Make x to have extreme values when y equals 1 -x[y == 1, ][1:25,] = x[y==1,][1:25,] + 3.5 -x[y == 1, ][26:50,] = x[y==1,][26:50,] - 3.5 +y <- rep(c(-1, 1), c(50, 50)) +## Make x to have extreme values when y equals 1 +x[y == 1, ][1:25, ] <- x[y == 1, ][1:25, ] + 3.5 +x[y == 1, ][26:50, ] <- x[y == 1, ][26:50, ] - 3.5 ## Construct data set -d2 = data.frame(x1 = x[,1], x2 = x[,2], y = as.factor(y)) +d2 <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y)) ## Split it into training and test data -d2 = setNames(split(d2, flag), c("train", "test")) +d2 <- setNames(split(d2, flag), c("train", "test")) # Plot data ggplot(data = d2$train, aes(x = x1, y = x2, color = y, shape = y)) + - geom_point(size = 2) + + geom_point(size = 2) + scale_color_manual(values = c("darkred", "steelblue")) # SVM classification -svmfit2 = svm(y ~ ., data = d2$train, kernel = "radial", cost = 10, scale = FALSE) +svmfit2 <- svm(y ~ ., data = d2$train, kernel = "radial", cost = 10, scale = FALSE) print(svmfit2) plot(svmfit2, d2$train) # Predictability -pred.d2 = predict(svmfit2, newdata = d2$test) +pred.d2 <- predict(svmfit2, newdata = d2$test) table(pred.d2, d2$test$y) - ``` -## Stata +## Stata The below 
code shows how to implement support vector machines in Stata using the svmachines command. To learn more about this community contriuted command, you can read [this Stata Journal article.](http://schonlau.net/publication/16svm_stata.pdf) ```stata clear all -set more off +set more off *Install svmachines ssc install svmachines @@ -205,21 +210,21 @@ ssc install svmachines *Import Data with a binary outcome for classification use http://www.stata-press.com/data/r16/fvex.dta, clear -*First try logistic regression to benchmark the prediction quality of SVM against +*First try logistic regression to benchmark the prediction quality of SVM against logit outcome group sex arm age distance y // Run the regression predict outcome_predicted // Generate predictions from the regression *Calculate the log loss - see https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html for more info gen log_loss = outcome*log(outcome_predicted)+(1-outcome)*log(1-outcome_predicted) -*Run SVM +*Run SVM svmachines outcome group sex arm age distance y, prob // Specifiying the prob option to generate predicted probabilities in the next line predict sv_outcome_predicted, probability ``` Next we will Calculate the [log loss (or cross-entropy loss)](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html) for SVM. - -Note: Predictions following svmachines generate three variables from the stub you provide in the predict command (in this case sv_outcome_predicted). The first is just the same as the stub and stores the best-guess classification (the group with the highest probability out of the possible options). The next n variables store the probability that the given observation will fall into each of the possible classes (in the binary case, this is just n=2 possible classes). These new variables are the stub + the value of each class. In the case below, the suffixes are `_0` and `_1`. We use `sv_outcome_predicted_1` because it produces probabilities that are equivalent in their intepretation (probability of having a class of 1) to the probabilities produced by the logit model and that can be used in calculating the log loss. Calculating loss functions for multi-class classifiers is more complicated, and you can read more about that at the link above. + +Note: Predictions following svmachines generate three variables from the stub you provide in the predict command (in this case sv_outcome_predicted). The first is just the same as the stub and stores the best-guess classification (the group with the highest probability out of the possible options). The next n variables store the probability that the given observation will fall into each of the possible classes (in the binary case, this is just n=2 possible classes). These new variables are the stub + the value of each class. In the case below, the suffixes are `_0` and `_1`. We use `sv_outcome_predicted_1` because it produces probabilities that are equivalent in their intepretation (probability of having a class of 1) to the probabilities produced by the logit model and that can be used in calculating the log loss. Calculating loss functions for multi-class classifiers is more complicated, and you can read more about that at the link above. 
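For reference, the `gen` statements on this page compute, for each observation $$i$$ with predicted probability $$\hat{p}_i$$ of a positive outcome,

$$ y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i) $$

and the log loss defined at the link above is the negative of the average of this quantity across all observations.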
```stata gen log_loss_svm = outcome*log(sv_outcome_predicted_1)+(1-outcome)*log(1-sv_outcome_predicted_1) @@ -227,3 +232,4 @@ gen log_loss_svm = outcome*log(sv_outcome_predicted_1)+(1-outcome)*log(1-sv_outc *Show log loss for both logit and SVM, remember lower is better sum log_loss log_loss_svm ``` + diff --git a/Model_Estimation/GLS/GLS.md b/Model_Estimation/GLS/GLS.md index 7b9f58a4..f1efea5d 100644 --- a/Model_Estimation/GLS/GLS.md +++ b/Model_Estimation/GLS/GLS.md @@ -7,3 +7,4 @@ nav_order: 2 --- # Generalised Least Squares + diff --git a/Model_Estimation/GLS/gmm.md b/Model_Estimation/GLS/gmm.md index 80ef75c0..a5f2925d 100644 --- a/Model_Estimation/GLS/gmm.md +++ b/Model_Estimation/GLS/gmm.md @@ -4,16 +4,16 @@ parent: Generalised Least Squares grand_parent: Model Estimation has_children: false nav_order: 1 -mathjax: true +mathjax: true --- -# Generalized Method of Moments +# Generalized Method of Moments GMM is an estimation technique that does not require strong assumptions about the distributions of the underlying parameters. The key intuition is that if we know the expected value of population moments (such as mean or variance), then the sample equivalents will converge to that expected value using the law of large numbers. If the moments are functions of the parameters that we wish to estimate, then we can use these moment restrictions to estimate our parameters. -Suppose we have a vector of $$K$$ parameters we want to estimate, where the true values of those parameters are $$\theta_0$$ and a set of $$L \geq K$$ moment conditions provided by theory, $$E\left[g(Y_i,\theta_0)\right] = 0$$, where $$Y_i$$ is a vector of variables corresponding to one observation in our data. +Suppose we have a vector of $$K$$ parameters we want to estimate, where the true values of those parameters are $$\theta_0$$ and a set of $$L \geq K$$ moment conditions provided by theory, $$E\left[g(Y_i,\theta_0)\right] = 0$$, where $$Y_i$$ is a vector of variables corresponding to one observation in our data. -The sample version of the population moments are $$\hat{g}(\theta) \equiv \frac{1}{n}\sum_{i=1}^ng(Y_i,\theta)$$. Thus, we are trying find $$\theta$$ that makes $$\hat{g}(\theta)$$ as close to $$0$$ as possible. The GMM estimator is +The sample version of the population moments are $$\hat{g}(\theta) \equiv \frac{1}{n}\sum_{i=1}^ng(Y_i,\theta)$$. Thus, we are trying find $$\theta$$ that makes $$\hat{g}(\theta)$$ as close to $$0$$ as possible. The GMM estimator is $$ \hat{\theta}_{GMM} = \underset{\theta}{\operatorname{argmin}}\hat{g}(\theta)^\prime \hat{W}\hat{g}(\theta) @@ -21,13 +21,13 @@ $$ Where $$\hat{W}$$ is some positive semi-definite matrix, which gives us a consistent estimate of true parameters $$\theta$$ under relatively benign assumptions. -For more details, visit the [Wikipedia Page](https://en.wikipedia.org/wiki/Generalized_method_of_moments). +For more details, visit the [Wikipedia Page](https://en.wikipedia.org/wiki/Generalized_method_of_moments). ## Keep in Mind -- There are two important assumptions necessary for identification, meaning that $$\hat{\theta}_{GMM}$$ is uniquely minimized at the true value $$\theta_0$$ - - **Order Condition**: There are at least as many moment conditions as parameters to be estimated, $$L \geq K$$. 
+- There are two important assumptions necessary for identification, meaning that $$\hat{\theta}_{GMM}$$ is uniquely minimized at the true value $$\theta_0$$ + - **Order Condition**: There are at least as many moment conditions as parameters to be estimated, $$L \geq K$$. - **Rank Condition**: The $$K \times L$$ matrix of derivatives $$\bar{G}_n(\theta_0)$$ will have full column rank, $$L$$. - Any positive semi-definite weight matrix $$\hat{W}$$ will produce an asymptotically consistent estimator for $$\theta$$, but we want to choose the weight matrix that gives estimates the smallest asymptotic variance. There are various methods for choosing $$\hat{W}$$ outlined [here](https://en.wikipedia.org/wiki/Generalized_method_of_moments#Implementation), which are various iterative processes - [Sargan-Hansen J-Test](https://en.wikipedia.org/wiki/Generalized_method_of_moments#Sargan%E2%80%93Hansen_J-test) can be used to test the specification of the model, by determining whether the sample moments are sufficiently close to zero @@ -37,8 +37,8 @@ For more details, visit the [Wikipedia Page](https://en.wikipedia.org/wiki/Gener - Under certain moment conditions, GMM is equivalent to many other estimators that are used more commonly. These include... - **[OLS]({{ "Model_Estimation/OLS/simple_linear_regression.html" | relative_url }})** if $$E[x_i(y_i - x_i^\prime\beta)]=0$$ - - **[Instrumental Variables]({{ "Model_Estimation/Research_Design/instrumental_variables.html" | relative_url }})** if $$E[z_i(y_i - x_i^\prime\beta)]=0$$ -- Maximum likelihood estimation is also a specific case of GMM that makes assumptions about the distributions of the parameters. This gives maximum likelihood better small sample properties, at the cost of the stronger assumptions + - **[Instrumental Variables]({{ "Model_Estimation/Research_Design/instrumental_variables.html" | relative_url }})** if $$E[z_i(y_i - x_i^\prime\beta)]=0$$ +- Maximum likelihood estimation is also a specific case of GMM that makes assumptions about the distributions of the parameters. This gives maximum likelihood better small sample properties, at the cost of the stronger assumptions # Implementations @@ -54,76 +54,74 @@ if (!require("pacman")) install.packages("pacman") pacman::p_load(gmm) # Parameters we are going to estimate -mu = 3 -sigma = 2 +mu <- 3 +sigma <- 2 -# Generating random numbers +# Generating random numbers set.seed(0219) -n = 500 -x = rnorm(n = n, mean = mu, sd = sigma) +n <- 500 +x <- rnorm(n = n, mean = mu, sd = sigma) -# Moment restrictions +# Moment restrictions g1 <- function(theta, x) { - m1 = (theta[1]-x) - m2 = (theta[2]^2 - (x - theta[1])^2) - m3 = x^3-theta[1]*(theta[1]^2+3*theta[2]^2) - f = cbind(m1,m2,m3) - return(f) + m1 <- (theta[1] - x) + m2 <- (theta[2]^2 - (x - theta[1])^2) + m3 <- x^3 - theta[1] * (theta[1]^2 + 3 * theta[2]^2) + f <- cbind(m1, m2, m3) + return(f) } -# Running GMM -gmm_mod = gmm( - # Moment restriction equations +# Running GMM +gmm_mod <- gmm( + # Moment restriction equations g = g1, # Matrix of data x = x, # Starting location for minimization algorithm - t0 = c(0,0) # Required when g argument is a function + t0 = c(0, 0) # Required when g argument is a function ) # Reporting results summary(gmm_mod) - ``` -Another common application of GMM is with linear moment restrictions. These can be specified by writing the regression formula as the `g` argument of the `gmm()` function and the matrix of instruments as the `x` argument. 
Suppose we have a model $$y_i = \alpha + \beta x_i + \epsilon_i$$, but $$E[x_i(y_i - x_i^\prime\beta)]\neq 0$$, so OLS would produce a biased estimate of $$\beta$$. If we have a vector of instruments $$z_i$$ that are correlated with $$x_i$$ and have moment conditions $$E[z_i(y_i - x_i^\prime\beta)]=0$$, then we can use GMM to estimate $$\beta$$. +Another common application of GMM is with linear moment restrictions. These can be specified by writing the regression formula as the `g` argument of the `gmm()` function and the matrix of instruments as the `x` argument. Suppose we have a model $$y_i = \alpha + \beta x_i + \epsilon_i$$, but $$E[x_i(y_i - x_i^\prime\beta)]\neq 0$$, so OLS would produce a biased estimate of $$\beta$$. If we have a vector of instruments $$z_i$$ that are correlated with $$x_i$$ and have moment conditions $$E[z_i(y_i - x_i^\prime\beta)]=0$$, then we can use GMM to estimate $$\beta$$. ```r # library(gmm) # already loaded # Setting parameter values -alpha = 1 -beta = 2 +alpha <- 1 +beta <- 2 # Taking random draws set.seed(0219) -z1 = rnorm(n = 500, 1,2) -z2 = rnorm(n = 500,-1,1) -e = rnorm(n = 500, 0, 1) +z1 <- rnorm(n = 500, 1, 2) +z2 <- rnorm(n = 500, -1, 1) +e <- rnorm(n = 500, 0, 1) # Collecting instruments -Z = cbind(z1, z2) +Z <- cbind(z1, z2) # Specifying model, where x is endogenous -x = z1 + z2 + e -y = alpha + beta * x + e +x <- z1 + z2 + e +y <- alpha + beta * x + e # Running GMM -lin_gmm_mod = gmm( +lin_gmm_mod <- gmm( g = y ~ x, x = Z ) # Reporting results summary(lin_gmm_mod) - ``` ## Stata -Stata provides an official command `gmm`, which can be used for the estimation of models via this method if you provide moments of interest. +Stata provides an official command `gmm`, which can be used for the estimation of models via this method if you provide moments of interest. The first example will be in recovering the coefficients that determine the distribution of a variable, assuming that variable follows a normal distribution. @@ -160,24 +158,24 @@ In this case, because the true distribution is normal, you only need two paramet local m1 {mu=1}-x local m2 {sigma=1}^2 - (x-{mu=1})^2 local m3 x^3 - {mu=1}*({mu}^2+3*{sigma=1}^2) -gmm (`m1') (`m2') (`m3'), winitial(identity) +gmm (`m1') (`m2') (`m3'), winitial(identity) est sto m1 -gmm (`m1') (`m2') , winitial(identity) +gmm (`m1') (`m2') , winitial(identity) est sto m2 -gmm (`m1') (`m3') , winitial(identity) +gmm (`m1') (`m3') , winitial(identity) est sto m3 -gmm (`m2') (`m3') , winitial(identity) +gmm (`m2') (`m3') , winitial(identity) est sto m4 est tab m1 m2 m3 m4, se ``` -A second example for the use of `gmm` is for the estimation of standard linear regression models. -For this, lets create some data, where the variable of interest is $$X$$ +A second example for the use of `gmm` is for the estimation of standard linear regression models. 
+For this, lets create some data, where the variable of interest is $$X$$ ```stata **# LR estimation -*** Parameters to estimate +*** Parameters to estimate local a0 1 local a1 2 clear @@ -199,13 +197,13 @@ Next, we can use `gmm` to estimate the model, under different assumptions ```stata *** gmm ignoring endogeneity ** this is your error -local m1 y-{a0}-{a1}*x -** which implies First order condition E(m1)=0 -** and E(x*m1)=0 +local m1 y-{a0}-{a1}*x +** which implies First order condition E(m1)=0 +** and E(x*m1)=0 gmm (`m1'), winitial(identity) instrument(x) *** gmm with endogeneity -** here, it implies E(z*m1)=0 +** here, it implies E(z*m1)=0 local m1 y-{a0}-{a1}*x gmm (`m1'), winitial(identity) instrument(z1) est sto m1 @@ -219,9 +217,9 @@ est tab m1 m2 m3, se local m1 y-{a0}-{a1}*x gmm (`m1') (z1*(`m1')), winitial(identity) est sto m1 -gmm (`m1') (z2*(`m1')), winitial(identity) +gmm (`m1') (z2*(`m1')), winitial(identity) est sto m2 -gmm (`m1') (z1*(`m1')) (z2*(`m1')), winitial(identity) +gmm (`m1') (z1*(`m1')) (z2*(`m1')), winitial(identity) est sto m3 est tab m1 m2 m3, se ** producing the same results as before diff --git a/Model_Estimation/GLS/heckman_correction_model.md b/Model_Estimation/GLS/heckman_correction_model.md index 57caa059..b32caf15 100644 --- a/Model_Estimation/GLS/heckman_correction_model.md +++ b/Model_Estimation/GLS/heckman_correction_model.md @@ -47,7 +47,7 @@ data("Mroz87") # First consider our selection model # We only observe wages for labor force participants (lfp == 1) -# So we model that as a function of work experience (linear and squared), +# So we model that as a function of work experience (linear and squared), # income from the rest of the family, education, and number of kids 5 or younger. # lfp ~ exper + I(exper^2) + faminc + educ + kids5 @@ -59,9 +59,11 @@ data("Mroz87") # wage ~ exper + I(exper^2) + educ + city # Put them together in a selection() command -heck_model <- selection(lfp ~ exper + I(exper^2) + faminc + educ + kids5, - wage ~ exper + I(exper^2) + educ + city, - Mroz87) +heck_model <- selection( + lfp ~ exper + I(exper^2) + faminc + educ + kids5, + wage ~ exper + I(exper^2) + educ + city, + Mroz87 +) summary(heck_model) ``` @@ -76,11 +78,11 @@ Stata allows to estimate the Heckman selection model using two approaches. A Max * (data via the sampleSelection package in R) import delimited "https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Estimation/Data/Heckman_Correction_Model/Mroz87.csv", clear -* First, consider the regression of interest. +* First, consider the regression of interest. * First consider our selection model * We only observe wages for labor force participants (lfp == 1) -* So we model that as a function of work experience (linear and squared), +* So we model that as a function of work experience (linear and squared), * income from the rest of the family, education, and number of kids 5 or younger. * select(lfp = c.exper##c.exper faminc educ kids5) @@ -96,3 +98,4 @@ heckman wage c.exper##c.exper educ city, select(lfp = c.exper##c.exper faminc ed * And this would be estimating the model using a two step-approach, also known as Heckit. 
heckman wage c.exper##c.exper educ city, select(lfp = c.exper##c.exper faminc educ kids5) two ``` + diff --git a/Model_Estimation/GLS/logit_model.md b/Model_Estimation/GLS/logit_model.md index d473cc6c..7d20806a 100644 --- a/Model_Estimation/GLS/logit_model.md +++ b/Model_Estimation/GLS/logit_model.md @@ -9,7 +9,7 @@ nav_order: 1 # Logit Regressions -A logistical regression (Logit) is a statistical method for a best-fit line between a binary [0/1] outcome variable $$Y$$ and any number of independent variables. Logit regressions follow a [logistical distribution](https://en.wikipedia.org/wiki/Logistic_distribution) and the predicted probabilities are bounded between 0 and 1. +A logistical regression (Logit) is a statistical method for a best-fit line between a binary [0/1] outcome variable $$Y$$ and any number of independent variables. Logit regressions follow a [logistical distribution](https://en.wikipedia.org/wiki/Logistic_distribution) and the predicted probabilities are bounded between 0 and 1. For more information about Logit, see [Wikipedia: Logit](https://en.wikipedia.org/wiki/Logit). @@ -40,11 +40,13 @@ There are a number of Python packages that can perform logit regressions but the import pandas as pd import statsmodels.formula.api as smf -df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv', - index_col=0) +df = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv", + index_col=0, +) # Specify the model, regressing vs on mpg and cyl -mod = smf.logit('vs ~ mpg + cyl', data=df) +mod = smf.logit("vs ~ mpg + cyl", data=df) # Fit the model res = mod.fit() @@ -53,7 +55,7 @@ res = mod.fit() res.summary() # Compute marginal effects -marg_effect = res.get_margeff(at='mean', method='dydx') +marg_effect = res.get_margeff(at="mean", method="dydx") # Show marginal effects marg_effect.summary() @@ -73,10 +75,12 @@ library(mfx) data(mtcars) # Use the glm() function to run logit -# Here we are predicting engine type using +# Here we are predicting engine type using # miles per gallon and number of cylinders as predictors -my_logit <- glm(vs ~ mpg + cyl, data = mtcars, - family = binomial(link = 'logit')) +my_logit <- glm(vs ~ mpg + cyl, + data = mtcars, + family = binomial(link = "logit") +) # The family argument says we are working with binary data # and using a logit link function (rather than, say, probit) diff --git a/Model_Estimation/GLS/mcfaddens_choice_model.md b/Model_Estimation/GLS/mcfaddens_choice_model.md index 8d0f37be..12b2a7ac 100644 --- a/Model_Estimation/GLS/mcfaddens_choice_model.md +++ b/Model_Estimation/GLS/mcfaddens_choice_model.md @@ -9,7 +9,7 @@ nav_order: 1 # McFadden's Choice Model (Alternative-Specific Conditional Logit) -Discrete choice models are a regression method used to predict a categorical dependent variable with more than two categories. For example, a discrete choice model might be used to predict whether someone is going to take a train, car, or bus to work. +Discrete choice models are a regression method used to predict a categorical dependent variable with more than two categories. For example, a discrete choice model might be used to predict whether someone is going to take a train, car, or bus to work. 
McFadden's Choice Model is a discrete choice model that uses [conditional logit]({{ "/Model_Estimation/GLS/conditional_logit.html" | relative_url }}), in which the variables that predict choice can vary either at the individual level (perhaps tall people are more likely to take the bus), or at the alternative level (perhaps the train is cheaper than the bus). @@ -50,7 +50,7 @@ This might be referred to as "long" choice data. "Wide" choice data is also comm We will implement McFadden's choice model in R using the **mlogit** package, which can accept "wide" or "long" data in the `mlogit.data` function. -```R +```r # If necessary, install mlogit package # install.packages('mlogit') library(mlogit) @@ -66,25 +66,27 @@ data(Car) # We also need sep = "" since our wide-format variable names are type1, type2, etc. # If the variable names were type_1, type_2, etc., we'd need sep = "_". # If this were long data we'd also want: -# the case identifier with id.var (for individuals) and/or chid.var +# the case identifier with id.var (for individuals) and/or chid.var # (for multiple choices within individuals) # And a variable indicating the alternatives with alt.var # But could skip the alt.levels and sep arguments mlogit.Car <- mlogit.data(Car, - choice = 'choice', - shape = 'wide', - varying = 5:70, - alt.levels = 1:6, - sep="") + choice = "choice", + shape = "wide", + varying = 5:70, + alt.levels = 1:6, + sep = "" +) # mlogit.Car is now in "long" format # Note that if we did start with "long" format we could probably skip the mlogit.data() step. # Now we can run the regression with mlogit(). # We "regress" the choice on the alternative-specific variables like type, fuel, and price -# Then put a pipe separator | +# Then put a pipe separator | # and add our case-specific variables like college -model <- mlogit(choice ~ type + fuel + price | college, - data = mlogit.Car) +model <- mlogit(choice ~ type + fuel + price | college, + data = mlogit.Car +) # Look at the results summary(model) @@ -116,3 +118,4 @@ cmclogit purchase dealers, casevars(gender income) ``` Why bother with the `cmclogit` version? `cmset` gives you a lot more information about your data, and makes it easy to transition between different choice model types, including those incorporating panel data (each person makes multiple choices). + diff --git a/Model_Estimation/GLS/nested_logit.md b/Model_Estimation/GLS/nested_logit.md index 62f923dc..cc63c052 100644 --- a/Model_Estimation/GLS/nested_logit.md +++ b/Model_Estimation/GLS/nested_logit.md @@ -6,7 +6,7 @@ nav_order: 1 mathjax: TRUE --- -A nested logistical regression (nested logit, for short) is a statistical method for finding a best-fit line when the the outcome variable $Y$ is a binary variable, taking values of 0 or 1. Logit regressions, in general, follow a [logistical distribution](https://en.wikipedia.org/wiki/Logistical_distribution) and restrict predicted probabilities between 0 and 1. +A nested logistical regression (nested logit, for short) is a statistical method for finding a best-fit line when the the outcome variable $Y$ is a binary variable, taking values of 0 or 1. Logit regressions, in general, follow a [logistical distribution](https://en.wikipedia.org/wiki/Logistical_distribution) and restrict predicted probabilities between 0 and 1. Traditional logit models require that the [Independence of Irrelevant Alternatives(IIA)](https://en.wikipedia.org/wiki/Independence_of_irrelevant_alternatives) property holds for all possible outcomes of some process. 
Nested logit models differ by allowing 'nests' of outcomes that satisfy IIA within them, but not requiring that all outcomes jointly satisfy IIA. @@ -30,7 +30,7 @@ For a more thorough theoretical treatment, see [SAS Documentation: Nested Logit ## R -R has multiple packages that can estimate a nested logit model. To show a simple example, we will use the `mlogit` package. +R has multiple packages that can estimate a nested logit model. To show a simple example, we will use the `mlogit` package. ```r @@ -46,16 +46,16 @@ data("TravelMode", package = "AER") # Here, we will predict what mode of travel individuals # choose using cost and wait times -nestedlogit = mlogit( +nestedlogit <- mlogit( choice ~ gcost + wait, data = TravelMode, - ##The variable from which our nests are determined - alt.var = 'mode', - #The variable that dictates the binary choice - choice = 'choice', - #List of nests as named vectors - nests = list(Fast = c('air','train'), Slow = c('car','bus')) - ) + ## The variable from which our nests are determined + alt.var = "mode", + # The variable that dictates the binary choice + choice = "choice", + # List of nests as named vectors + nests = list(Fast = c("air", "train"), Slow = c("car", "bus")) +) # The results diff --git a/Model_Estimation/GLS/probit_model.md b/Model_Estimation/GLS/probit_model.md index 5bc2cc0f..de667958 100644 --- a/Model_Estimation/GLS/probit_model.md +++ b/Model_Estimation/GLS/probit_model.md @@ -9,7 +9,7 @@ nav_order: 1 # Probit Regressions -A Probit regression is a statistical method for a best-fit line between a binary [0/1] outcome variable $$Y$$ and any number of independent variables. Probit regressions follow a [standard normal probability distribution](https://en.wikipedia.org/wiki/Normal_distribution) and the predicted values are bounded between 0 and 1. +A Probit regression is a statistical method for a best-fit line between a binary [0/1] outcome variable $$Y$$ and any number of independent variables. Probit regressions follow a [standard normal probability distribution](https://en.wikipedia.org/wiki/Normal_distribution) and the predicted values are bounded between 0 and 1. For more information about Probit, see [Wikipedia: Probit](https://en.wikipedia.org/wiki/Probit_model). 
@@ -40,11 +40,13 @@ import pandas as pd import statsmodels.formula.api as smf # Read in the data -df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv', - index_col=0) +df = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv", + index_col=0, +) # Specify the model -mod = smf.probit('vs ~ mpg + cyl', data=df) +mod = smf.probit("vs ~ mpg + cyl", data=df) # Fit the model res = mod.fit() @@ -53,7 +55,7 @@ res = mod.fit() res.summary() # Compute marginal effects -marge_effect = res.get_margeff(at='mean', method='dydx') +marge_effect = res.get_margeff(at="mean", method="dydx") # Show marginal effects marge_effect.summary() @@ -72,10 +74,12 @@ library(mfx) data(mtcars) # Use the glm() function to run probit -# Here we are predicting engine type using +# Here we are predicting engine type using # miles per gallon and number of cylinders as predictors -my_probit <- glm(vs ~ mpg + cyl, data = mtcars, - family = binomial(link = 'probit')) +my_probit <- glm(vs ~ mpg + cyl, + data = mtcars, + family = binomial(link = "probit") +) # The family argument says we are working with binary data # and using a probit link function (rather than, say, logit) @@ -98,3 +102,4 @@ probit foreign mpg weight headroom trunk * Recover the Marginal Effects (Beta Coefficient in OLS) margins, dydx(*) ``` + diff --git a/Model_Estimation/Matching/entropy_balancing.md b/Model_Estimation/Matching/entropy_balancing.md index b5d3af64..5108362d 100644 --- a/Model_Estimation/Matching/entropy_balancing.md +++ b/Model_Estimation/Matching/entropy_balancing.md @@ -11,9 +11,9 @@ mathjax: true ## Switch to false if this page has no equations or other math ren Entropy balancing is a method for matching treatment and control observations that comes from [Hainmueller (2012)](https://www.jstor.org/stable/41403737). It constructs a set of matching weights that, by design, forces certain balance metrics to hold. This means that, like with [Coarsened Exact Matching]({{ "/Model_Estimation/Matching/coarsened_exact_matching.html" | relative_url }}) there is no need to iterate on a matching model by performing the match, checking the balance, and trying different parameters to improve balance. However, unlike coarsened exact matching, entropy balancing does not require enormous data sets or drop large portions of the sample. -Entropy balancing requires a set of balance conditions to be provided. These are often of the form "the mean of matching variable $$A$$ must be the same between treated and control observations," i.e. +Entropy balancing requires a set of balance conditions to be provided. These are often of the form "the mean of matching variable $$A$$ must be the same between treated and control observations," i.e. -$$\sum_{i|D_i=0}w_iA_i = \sum_{i|D_i=1}A_i$$ +$$\sum_{i|D_i=0}w_iA_i = \sum_{i|D_i=1}A_i$$ where $$D_i$$ indicates treatment status and $$w_i$$ are the matching weights, and similarly for other variables for which the mean should match. However, other conditions can also be included, such as matching to equalize the variance of a matching variable, or the skewness, and so on. This is sort of like the [Generalized Method of Moments]({{ "/Model_Estimation/GLS/gmm.html" | relative_url }}) @@ -39,9 +39,10 @@ Entropy balancing can be implemented in R using the **ebal** package. 
```r # R CODE -library(ebal); library(tidyverse) +library(ebal) +library(tidyverse) -br <- read_csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Matching/Data/broockman2013.csv') +br <- read_csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Matching/Data/broockman2013.csv") # Outcome Y <- br %>% @@ -53,8 +54,10 @@ D <- br %>% X <- br %>% select(medianhhincom, blackpercent, leg_democrat) %>% # Add square terms to match variances if we like - mutate(incsq = medianhhincom^2, - bpsq = blackpercent^2) %>% + mutate( + incsq = medianhhincom^2, + bpsq = blackpercent^2 + ) %>% as.matrix() eb <- ebalance(D, X) @@ -96,3 +99,4 @@ ebalance leg_black medianhhincom blackpercent leg_democrat, targets(2 2 1) g(wt) * Use pweight = wt to adjust estimates reg responded leg_black [pw = wt] ``` + diff --git a/Model_Estimation/Matching/matching.md b/Model_Estimation/Matching/matching.md index 783be758..3a47657c 100644 --- a/Model_Estimation/Matching/matching.md +++ b/Model_Estimation/Matching/matching.md @@ -7,3 +7,4 @@ nav_order: 1 --- # Matching + diff --git a/Model_Estimation/Model_Estimation.md b/Model_Estimation/Model_Estimation.md index 76827c13..1770ad21 100644 --- a/Model_Estimation/Model_Estimation.md +++ b/Model_Estimation/Model_Estimation.md @@ -5,3 +5,4 @@ nav_order: 5 --- # Model Estimation + diff --git a/Model_Estimation/Multilevel_Models/Multilevel_Models.md b/Model_Estimation/Multilevel_Models/Multilevel_Models.md index c0a64183..f381961d 100644 --- a/Model_Estimation/Multilevel_Models/Multilevel_Models.md +++ b/Model_Estimation/Multilevel_Models/Multilevel_Models.md @@ -7,3 +7,4 @@ nav_order: 3 --- # Multilevel Models + diff --git a/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.md b/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.md index bf6201d7..2d826513 100644 --- a/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.md +++ b/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.md @@ -19,7 +19,7 @@ $$ The intercept $$\beta_{0j}$$ has a $j$ subscript and is allowed to vary over the sample at the $$j$$ level, where $$j$$ may indicate individual or group, depending on context. The slope on $$X_{1ij}$$, $$\beta_{1j}$$, is similarly allowed to vary over the sample. These are random effects. $$\beta_{2}$$ is not allowed to vary over the sample and so is fixed. -The random parameters have their own "level-two" equations, which may or may not include level-two covariates. +The random parameters have their own "level-two" equations, which may or may not include level-two covariates. $$ \beta_{0j} = \gamma_{00} + \gamma_{01}W_j + u_{0j} @@ -93,7 +93,7 @@ sysuse nlsw88.dta, clear reg wage tenure married * Now we will allow the intercept to vary with occupation -mixed wage tenure married || occupation: +mixed wage tenure married || occupation: * Next we will allow the slope on tenure to vary with occupation mixed wage tenure married || occupation: tenure, nocons @@ -105,3 +105,4 @@ mixed wage tenure married || occupation: tenure * and age mixed wage tenure married || occupation: tenure || age: tenure ``` + diff --git a/Model_Estimation/Multilevel_Models/mixed_logit.md b/Model_Estimation/Multilevel_Models/mixed_logit.md index 7ca67c74..ea4ea04b 100644 --- a/Model_Estimation/Multilevel_Models/mixed_logit.md +++ b/Model_Estimation/Multilevel_Models/mixed_logit.md @@ -39,30 +39,34 @@ data("Electricity", package = "mlogit") # For further documentation, see dfidx. 
Electricity$index <- 1:nrow(Electricity) -elec = dfidx(Electricity, idx = list(c("index", "id")), - choice = "choice", varying = 3:26, sep = "") +elec <- dfidx(Electricity, + idx = list(c("index", "id")), + choice = "choice", varying = 3:26, sep = "" +) # We then estimate individual choice over electricity providers for # different cost and contract structures with a suppressed intercept -my_mixed_logit = mlogit(data = elec, - formula = choice ~ 0 + pf + cl + loc + wk + tod + seas, - # Specify distributions for random parameter estimates - # "n" indicates we have specified a normal distribution - # note pf is omitted from rpar, so it will not be estimated as random - rpar = c(cl = "n", loc = "u", wk = "n", tod = "n", seas = "n"), - # R is the number of simulation draws - R = 100, - # For simplicity, we won't include correlated parameter estimates - correlation = FALSE, - # This data is from a panel - panel = TRUE) +my_mixed_logit <- mlogit( + data = elec, + formula = choice ~ 0 + pf + cl + loc + wk + tod + seas, + # Specify distributions for random parameter estimates + # "n" indicates we have specified a normal distribution + # note pf is omitted from rpar, so it will not be estimated as random + rpar = c(cl = "n", loc = "u", wk = "n", tod = "n", seas = "n"), + # R is the number of simulation draws + R = 100, + # For simplicity, we won't include correlated parameter estimates + correlation = FALSE, + # This data is from a panel + panel = TRUE +) # Results summary(my_mixed_logit) -# Note that this output will include the simulated coefficient estimates, +# Note that this output will include the simulated coefficient estimates, # simulated standard error estimates, and distributional details for the # random coefficients (all, in this case) # Note also that pf is given as a point estimate, and mlogit does not generate @@ -70,18 +74,16 @@ summary(my_mixed_logit) # You can extract and summarize coefficient estimates using the rpar function -marg_loc = rpar(my_mixed_logit, "loc") +marg_loc <- rpar(my_mixed_logit, "loc") summary(marg_loc) # You can also normalize coefficients and distributions by, say, price -cl_by_pf = rpar(my_mixed_logit, "cl", norm = "pf") +cl_by_pf <- rpar(my_mixed_logit, "cl", norm = "pf") summary(cl_by_pf) - - - ``` For further examples, visit the CRAN vignette [here.](https://cran.r-project.org/web/packages/mlogit/vignettes/c5.mxl.html) For a very detailed example using the Electricity dataset, see [here.](https://cran.r-project.org/web/packages/mlogit/vignettes/e3mxlogit.html) + diff --git a/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.md b/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.md index 061ce3e8..233ce7be 100644 --- a/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.md +++ b/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.md @@ -39,20 +39,22 @@ That is, average earnings of graduates of an institution depends on proportion e Several packages can be used to implement a random effects model in R - such as [**lme4**](https://cran.r-project.org/web/packages/lme4/index.html) and [**nlme**](https://cran.r-project.org/web/packages/nlme/nlme.pdf). **lme4** is more widely used. The example that follows uses the **lme4** package. 
-``` r +```r # If necessary, install lme4 -if(!require(lme4)){install.packages("lme4")} +if (!require(lme4)) { + install.packages("lme4") +} library(lme4) # Read in data from the College Scorecard -df <- read.csv('https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv') +df <- read.csv("https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv") # Calculate proportion of graduates working -df$prop_working <- df$count_working/(df$count_working + df$count_not_working) +df$prop_working <- df$count_working / (df$count_working + df$count_not_working) # We write the mixed effect formula for estimation in lme4 as: -# dependent_var ~ -# covariates (that can include fixed effects) + +# dependent_var ~ +# covariates (that can include fixed effects) + # random effects - we need to specify if our model is random effects in intercepts or in slopes. In our example, we suspect random effects in intercepts at institutions. So we write "...+(1 | inst_name), ...." If we wanted to specify a model where the coefficient on prop_working was also varying by institution - we would use (1 + open | inst_name). # Here we regress average earnings graduates in an institution on prop_working, year fixed effects and random effects in intercepts for institutions. @@ -72,7 +74,7 @@ We will estimate a mixed effects model using Stata using the built in `xtreg` co ```stata * Obtain same data from Fixed Effect tutorial - + import delimited "https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Fix ed_Effects_in_Linear_Regression/Scorecard.csv", clear @@ -91,7 +93,7 @@ encode inst_name, g(name_number) * Set the data as panel data with xtset xtset name_number -* Use xtreg with the "re" option to run random effects on institution intercepts +* Use xtreg with the "re" option to run random effects on institution intercepts * Regressing earnings_med on prop_working * with random effects for name_number (implied by re) * and also year fixed effects (which we'll add manually with i.year) @@ -100,3 +102,4 @@ xtreg earnings_med prop_working i.year, re * We note that comparing with the fixed effects model, our estimates are more precise. But, correlation between X`s and errors suggest bias in our random effect model, and we do see a large increase in estimated beta. ``` + diff --git a/Model_Estimation/OLS/OLS.md b/Model_Estimation/OLS/OLS.md index 1f6c08b9..05d546a3 100644 --- a/Model_Estimation/OLS/OLS.md +++ b/Model_Estimation/OLS/OLS.md @@ -7,3 +7,4 @@ nav_order: 1 --- # Ordinary Least Squares + diff --git a/Model_Estimation/OLS/fixed_effects_in_linear_regression.md b/Model_Estimation/OLS/fixed_effects_in_linear_regression.md index 798a2579..2f198538 100644 --- a/Model_Estimation/OLS/fixed_effects_in_linear_regression.md +++ b/Model_Estimation/OLS/fixed_effects_in_linear_regression.md @@ -57,7 +57,7 @@ reg(df, @formula(earnings_med ~ prop_working + fe(inst_name) + fe(year)), Vcov.c ## Python -There are a few packages for doing the same task in Python, however, there is a well-known issue with these packages.That is, the calculation of standard deviation might be a little different. +There are a few packages for doing the same task in Python, however, there is a well-known issue with these packages.That is, the calculation of standard deviation might be a little different. We are going to use `linearmodels` in python. 
Installation can be done through `pip install linearmodels` and the documentation is [here](https://bashtage.github.io/linearmodels/) @@ -69,26 +69,30 @@ import numpy as np # Load the data -data = pd.read_csv(r"https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv") +data = pd.read_csv( + r"https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv" +) # Set the index for fixed effects -data = data.set_index(['inst_name', 'year']) +data = data.set_index(["inst_name", "year"]) # Calculate and drop the NA Values -data['prop_working'] = data['count_working']/(data['count_working'] + data['count_not_working']) -#data = data.dropna(subset=['earnings_med', 'prop_working']) +data["prop_working"] = data["count_working"] / ( + data["count_working"] + data["count_not_working"] +) +# data = data.dropna(subset=['earnings_med', 'prop_working']) # Regression -FE = PanelOLS(data.earnings_med, data['prop_working'], - entity_effects = True, - time_effects=True - ) - +FE = PanelOLS( + data.earnings_med, data["prop_working"], entity_effects=True, time_effects=True +) + # Result -result = FE.fit(cov_type = 'clustered', - cluster_entity=True, - # cluster_time=True - ) +result = FE.fit( + cov_type="clustered", + cluster_entity=True, + # cluster_time=True +) ``` There are also other packages for fixed effect models, such as `econtools` ([link](https://pypi.org/project/econtools/)), `FixedEffectModelPyHDFE` ([link](https://pypi.org/project/FixedEffectModelPyHDFE/)), `regpyhdfe`([link](https://regpyhdfe.readthedocs.io/en/latest/intro.html)) and `econtools` ([link](https://pypi.org/project/econtools/)). @@ -105,9 +109,9 @@ We first demonstrate fixed effects in R using `felm` from the **lfe** package ([ library(lfe) # Read in data from the College Scorecard -df <- read.csv('https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv') +df <- read.csv("https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv") # Calculate proportion of graduates working -df$prop_working <- df$count_working/(df$count_working + df$count_not_working) +df$prop_working <- df$count_working / (df$count_working + df$count_not_working) # A felm formula is constructed as: # outcome ~ @@ -181,3 +185,4 @@ xtset, clear * We specify both sets of fixed effects in absorb() reghdfe earnings_med prop_working, absorb(name_number year) vce(cluster inst_name) ``` + diff --git a/Model_Estimation/OLS/interaction_terms_and_polynomials.md b/Model_Estimation/OLS/interaction_terms_and_polynomials.md index 00d47a82..58516865 100644 --- a/Model_Estimation/OLS/interaction_terms_and_polynomials.md +++ b/Model_Estimation/OLS/interaction_terms_and_polynomials.md @@ -15,7 +15,7 @@ $$ Y = \beta_0+\beta_1X_1+\beta_2X_2 $$ -However, if the independent variables have a nonlinear effect on the outcome, the model will be incorrectly specified. This is fine as long as that nonlinearity is modeled by including those nonlinear terms in the index. +However, if the independent variables have a nonlinear effect on the outcome, the model will be incorrectly specified. This is fine as long as that nonlinearity is modeled by including those nonlinear terms in the index. 
The two most common ways this occurs is by including interactions or polynomial terms. With an interaction, the effect of one variable varies according to the value of another: @@ -56,25 +56,27 @@ import statsmodels.formula.api as sms from matplotlib import pyplot as plt # Load the R mtcars dataset from a URL -df = pd.read_csv('https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv') +df = pd.read_csv( + "https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv" +) # Include a linear, squared, and cubic term using the I() function. # N.B. Python uses ** for exponentiation (^ means bitwise xor) -model1 = sms.ols('mpg ~ hp + I(hp**2) + I(hp**3) + cyl', data=df) +model1 = sms.ols("mpg ~ hp + I(hp**2) + I(hp**3) + cyl", data=df) print(model1.fit().summary()) # Include an interaction term and the variables by themselves using * # The interaction term is represented by hp:cyl -model2 = sms.ols('mpg ~ hp * cyl', data=df) +model2 = sms.ols("mpg ~ hp * cyl", data=df) print(model2.fit().summary()) # Equivalently, you can request "all quadratic interaction terms" by doing -model3 = sms.ols('mpg ~ (hp + cyl) ** 2', data=df) +model3 = sms.ols("mpg ~ (hp + cyl) ** 2", data=df) print(model3.fit().summary()) # Include only the interaction term and not the variables themselves with : # Hard to interpret! Occasionally useful though. -model4 = sms.ols('mpg ~ hp : cyl', data=df) +model4 = sms.ols("mpg ~ hp : cyl", data=df) print(model4.fit().summary()) ``` @@ -94,7 +96,7 @@ model1 <- lm(mpg ~ hp + I(hp^2) + I(hp^3) + cyl, data = mtcars) model2 <- lm(mpg ~ poly(hp, 3, raw = TRUE) + cyl, data = mtcars) # Include an interaction term and the variables by themselves using * -model3 <- lm(mpg ~ hp*cyl, data = mtcars) +model3 <- lm(mpg ~ hp * cyl, data = mtcars) # Include only the interaction term and not the variables themselves with : # Hard to interpret! Occasionally useful though. @@ -126,7 +128,7 @@ reg mpg c.weight##c.weight##c.weight foreign It is also possible to use other type of functions and obtain correct marginal effects. For example: Say that you want to estimate the model: -$$ y = a_0 + a_1 * x + a_2 * 1/x + e $$ +$$ y = a_0 + a_1 * x + a_2 * 1/x + e $$ and you want to estimate the marginal effects with respect to $x$. You can do this as follows: @@ -147,3 +149,4 @@ margins, dydx(price) nl (mpg = {a0} + {a1} * price + {a2}*1/price), var(price) margins, dydx(price) ``` + diff --git a/Model_Estimation/OLS/simple_linear_regression.md b/Model_Estimation/OLS/simple_linear_regression.md index 53a34128..04e52f30 100644 --- a/Model_Estimation/OLS/simple_linear_regression.md +++ b/Model_Estimation/OLS/simple_linear_regression.md @@ -17,7 +17,7 @@ For more information about OLS, see [Wikipedia: Ordinary Least Squares](https:// - OLS assumes that you have specified a true linear relationship. - OLS results are not guaranteed to have a causal interpretation. Just because OLS estimates a positive relationship between $$X_1$$ and $$Y$$ does not necessarily mean that an increase in $$X_1$$ will cause $$Y$$ to increase. -- OLS does *not* require that your variables follow a normal distribution. +- OLS does *not* require that your variables follow a normal distribution. 
## Also Consider @@ -68,11 +68,10 @@ import statsmodels.formula.api as smf mtcars = sm.datasets.get_rdataset("mtcars").data # Fit OLS regression model to mtcars -ols = smf.ols(formula='mpg ~ cyl + hp + wt', data=mtcars).fit() +ols = smf.ols(formula="mpg ~ cyl + hp + wt", data=mtcars).fit() # Look at the OLS results print(ols.summary()) - ``` ## R @@ -93,13 +92,13 @@ summary(olsmodel) ```sas /* Load Data */ -proc import datafile="C:mtcars.dbf" +proc import datafile="C:mtcars.dbf" out=fromr dbms=dbf; run; /* OLS regression */ proc reg; model mpg = cyl hp wt; -run; +run; ``` ## Stata @@ -112,3 +111,4 @@ sysuse https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Data/auto. * and headroom, trunk, and weight as predictors regress mpg headroom trunk weight ``` + diff --git a/Model_Estimation/Research_Design/Research_Design.md b/Model_Estimation/Research_Design/Research_Design.md index a826e54e..4311efc3 100644 --- a/Model_Estimation/Research_Design/Research_Design.md +++ b/Model_Estimation/Research_Design/Research_Design.md @@ -7,3 +7,4 @@ nav_order: 4 --- # Research Design + diff --git a/Model_Estimation/Research_Design/density_discontinuity_test.md b/Model_Estimation/Research_Design/density_discontinuity_test.md index 9c8175d9..ff3989a1 100644 --- a/Model_Estimation/Research_Design/density_discontinuity_test.md +++ b/Model_Estimation/Research_Design/density_discontinuity_test.md @@ -77,3 +77,4 @@ import delimited "https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github * the "plot" option adds a plot while we're at it rddensity x, c(0) plot ``` + diff --git a/Model_Estimation/Research_Design/event_study.md b/Model_Estimation/Research_Design/event_study.md index a6eaa7dc..b6749c67 100644 --- a/Model_Estimation/Research_Design/event_study.md +++ b/Model_Estimation/Research_Design/event_study.md @@ -68,18 +68,20 @@ import pandas as pd import linearmodels as lm # Read in data -df = pd.read_stata("https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Event_Study_DiD/bacon_example.dta") +df = pd.read_stata( + "https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Event_Study_DiD/bacon_example.dta" +) # create the lag/lead for treated states # fill in control obs with 0 # This allows for the interaction between `treat` and `time_to_treat` to occur for each state. # Otherwise, there may be some missingss and the estimations will be off. 
-df['time_to_treat'] = ( - df['_nfd'].sub(df['year']) - # missing values for _nfd implies no treatment - .fillna(0) - # so we don't have decimals in our factor names - .astype('int') +df["time_to_treat"] = ( + df["_nfd"].sub(df["year"]) + # missing values for _nfd implies no treatment + .fillna(0) + # so we don't have decimals in our factor names + .astype("int") ) # Create our interactions by hand, @@ -88,39 +90,38 @@ df['time_to_treat'] = ( df = ( # returns dataframe with dummy columns in place of the columns # in the named argument, all other columns untouched - pd.get_dummies(df, columns=['time_to_treat'], prefix='INX') - # Be sure not to include the minuses in the name - .rename(columns=lambda x: x.replace('-', 'm')) - # get_dummies has a `drop_first` argument, but if we want to - # refer to a specific level, we should return all levels and - # drop out reference column manually - .drop(columns='INX_m1') - # Set our individual and time (index) for our data - .set_index(['stfips', 'year']) + pd.get_dummies(df, columns=["time_to_treat"], prefix="INX") + # Be sure not to include the minuses in the name + .rename(columns=lambda x: x.replace("-", "m")) + # get_dummies has a `drop_first` argument, but if we want to + # refer to a specific level, we should return all levels and + # drop out reference column manually + .drop(columns="INX_m1") + # Set our individual and time (index) for our data + .set_index(["stfips", "year"]) ) # Estimate the regression -scalars = ['pcinc', 'asmrh', 'cases'] -factors = df.columns[df.columns.str.contains('INX')] +scalars = ["pcinc", "asmrh", "cases"] +factors = df.columns[df.columns.str.contains("INX")] exog = factors.union(scalars) -endog = 'asmrs' +endog = "asmrs" # with the standard api: mod = lm.PanelOLS(df[endog], df[exog], entity_effects=True, time_effects=True) -fit = mod.fit(cov_type='clustered', cluster_entity=True) +fit = mod.fit(cov_type="clustered", cluster_entity=True) fit.summary # with the formula api: # We can save ourselves some time by creating the regression formula automatically -inxnames = df.columns[range(13,df.shape[1])] -formula = '{} ~ {} + EntityEffects + TimeEffects'.format(endog, '+'.join(exog)) +inxnames = df.columns[range(13, df.shape[1])] +formula = "{} ~ {} + EntityEffects + TimeEffects".format(endog, "+".join(exog)) -mod = lm.PanelOLS.from_formula(formula,df) +mod = lm.PanelOLS.from_formula(formula, df) # Specify clustering when we fit the model -clfe = mod.fit(cov_type = 'clustered', - cluster_entity = True) +clfe = mod.fit(cov_type="clustered", cluster_entity=True) # Look at regression results clfe.summary @@ -130,39 +131,38 @@ Now we can plot the results with **matplotlib**. 
Two common approaches are to in ```python?example=pyevent # Get coefficients and CIs -res = pd.concat([clfe.params, clfe.std_errors], axis = 1) +res = pd.concat([clfe.params, clfe.std_errors], axis=1) # Scale standard error to 95% CI -res['ci'] = res['std_error']*1.96 +res["ci"] = res["std_error"] * 1.96 # We only want time interactions -res = res.filter(like='INX', axis=0) +res = res.filter(like="INX", axis=0) # Turn the coefficient names back to numbers res.index = ( - res.index - .str.replace('INX_', '') - .str.replace('m', '-') - .astype('int') - .rename('time_to_treat') + res.index.str.replace("INX_", "") + .str.replace("m", "-") + .astype("int") + .rename("time_to_treat") ) # And add our reference period back in, and sort automatically -res.reindex(range(res.index.min(), res.index.max()+1)).fillna(0) +res.reindex(range(res.index.min(), res.index.max() + 1)).fillna(0) # Plot the estimates as connected lines with error bars ax = res.plot( - y='parameter', - yerr='ci', - xlabel='Time to Treatment', - ylabel='Estimated Effect', - legend=False + y="parameter", + yerr="ci", + xlabel="Time to Treatment", + ylabel="Estimated Effect", + legend=False, ) # Add a horizontal line at 0 -ax.axhline(0, linestyle='dashed') +ax.axhline(0, linestyle="dashed") # And a vertical line at the treatment time # some versions of pandas have bug return x-axis object with data_interval # starting at 0. In that case change 0 to 21 -ax.axvline(0, linestyle='dashed') +ax.axvline(0, linestyle="dashed") ``` Which produces: @@ -182,17 +182,17 @@ library(tidyverse) library(broom) library(haven) -#Load and prepare data +# Load and prepare data bacon_df <- read_dta("https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Event_Study_DiD/bacon_example.dta") %>% mutate( - # create the lag/lead for treated states - # fill in control obs with 0 - # This allows for the interaction between `treat` and `time_to_treat` to occur for each state. - # Otherwise, there may be some NAs and the estimations will be off. - time_to_treat =ifelse(is.na(`_nfd`),0,year - `_nfd`), - # this will determine the difference - # btw controls and treated states - treat = ifelse(is.na(`_nfd`),0,1) + # create the lag/lead for treated states + # fill in control obs with 0 + # This allows for the interaction between `treat` and `time_to_treat` to occur for each state. + # Otherwise, there may be some NAs and the estimations will be off. + time_to_treat = ifelse(is.na(`_nfd`), 0, year - `_nfd`), + # this will determine the difference + # btw controls and treated states + treat = ifelse(is.na(`_nfd`), 0, 1) ) ``` @@ -201,21 +201,7 @@ Also, while it's not necessary given how we're about to use the **fixest** packa We will run the event-study regression using `feols()` from the **fixest** package. **fixest** is very fast, contains support for complex fixed-effects interactions, selecting our own reference group like we need with `i()`, and will also help run the Sun and Abraham (2020) estimator. 
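A minimal sketch of that `feols()` event-study specification, and of the `tidy()` step that builds the `event_1` data frame used for plotting, assuming the `bacon_df` object prepared above (with its `treat` and `time_to_treat` columns) and that **fixest** is installed:

```r
# A minimal sketch, assuming bacon_df, treat, and time_to_treat from above
library(fixest)

m_1 <- feols(
  asmrs ~
    # Time-to-treatment terms interacted with treatment status,
    # using t = -1 as the omitted reference period
    i(treat, time_to_treat, ref = -1) +
    # Controls
    pcinc + asmrh + cases |
    # State and year fixed effects
    stfips + year,
  cluster = ~stfips, data = bacon_df
)

# Turn the results into a data frame with a year column for easy plotting,
# keeping only the time-to-treatment terms (not the controls)
event_1 <- tidy(m_1, conf.int = TRUE) %>%
  mutate(year = as.numeric(parse_number(term))) %>%
  filter(!is.na(year))
```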
```r?example=event_study -m_1 <- feols(asmrs ~ - # The time-treatment interaction terms - i(treat, time_to_treat, ref=-1) + - # Controls - pcinc + asmrh + cases - # State and year fixed effects - | stfips + year, - # feols clusters by the first fixed effect anyway, just making that clear - cluster=~stfips, data=bacon_df) -# Now turn the results into a data frame with a year column for easy plotting -event_1 <- tidy(m_1, conf.int = TRUE) %>% - # For plotting purposes, we only want the terms that reference years - # and not the controls - mutate(year = as.numeric(parse_number(term))) %>% - filter(!is.na(year)) + ``` Now we can plot the results with **ggplot2**. Two common approaches are to include vertical-line confidence intervals with `geom_pointrange()` or to include a confidence interval ribbon with `geom_ribbon()`. I'll show the `geom_pointrange()` version, but this is easy to swap out. @@ -224,30 +210,34 @@ Now, you could just simply use `coefplot(m_1)` from **fixest** and be done with ```r?example=event_study event_1 %>% - ggplot(mapping = aes(x = year, y = estimate, - ymin = conf.low, ymax = conf.high))+ - geom_pointrange(position = position_dodge(width = 1), - # Optional decoration: - color="black", fatten=.5, alpha=.8) + - # Add a line marker for y = 0 (to see if the CI overlaps 0) - geom_hline(yintercept=0, color = "red",alpha=0.2)+ - # A marker for the last pre-event period - geom_vline(xintercept = -1, color = "black", size=0.5, alpha=0.4) + - # And the event period - geom_vline(xintercept = 0, linetype="dotted", color = "black", size=0.5, alpha=0.2)+ - # Additional decoration: - theme_bw()+ - theme( - plot.title = element_text(face = "bold", size = 12), - legend.background = element_rect(fill = "white", size = 4, colour = "white"), - legend.justification = c(0, 1), - legend.position = c(0, 1), - axis.ticks = element_line(colour = "white", size = 0.1), - panel.grid.major = element_line(colour = "white", size = 0.07), - panel.grid.minor = element_blank() - )+ - annotate("text", x = c(0,2), y=30, label = c("","treat"))+ - labs(title="Event Study: Staggered Treatment", y="Estimate", x="Time") + ggplot(mapping = aes( + x = year, y = estimate, + ymin = conf.low, ymax = conf.high + )) + + geom_pointrange( + position = position_dodge(width = 1), + # Optional decoration: + color = "black", fatten = .5, alpha = .8 + ) + + # Add a line marker for y = 0 (to see if the CI overlaps 0) + geom_hline(yintercept = 0, color = "red", alpha = 0.2) + + # A marker for the last pre-event period + geom_vline(xintercept = -1, color = "black", size = 0.5, alpha = 0.4) + + # And the event period + geom_vline(xintercept = 0, linetype = "dotted", color = "black", size = 0.5, alpha = 0.2) + + # Additional decoration: + theme_bw() + + theme( + plot.title = element_text(face = "bold", size = 12), + legend.background = element_rect(fill = "white", size = 4, colour = "white"), + legend.justification = c(0, 1), + legend.position = c(0, 1), + axis.ticks = element_line(colour = "white", size = 0.1), + panel.grid.major = element_line(colour = "white", size = 0.07), + panel.grid.minor = element_blank() + ) + + annotate("text", x = c(0, 2), y = 30, label = c("", "treat")) + + labs(title = "Event Study: Staggered Treatment", y = "Estimate", x = "Time") ``` This results in: @@ -259,62 +249,7 @@ Another common option in these graphs is to link all the individual point estima Of course, as earlier mentioned, this analysis is subject to the critique by Sun and Abraham (2020). 
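As noted above, the connected-line and `geom_ribbon()` variants are straightforward swaps. A minimal sketch, assuming the same `event_1` data frame of estimates and confidence bounds built earlier:

```r
# Sketch: the same estimates drawn as a connected line with a shaded
# confidence ribbon instead of geom_pointrange()
ggplot(event_1, aes(x = year, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_ribbon(alpha = 0.2) +
  geom_line() +
  geom_point() +
  # Reference lines at zero effect and at the last pre-treatment period
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = -1, linetype = "dotted") +
  labs(x = "Time to Treatment", y = "Estimate")
```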
We can also use **fixest** to estimate the Sun and Abraham estimator to calculate effects separately by time-when-treated, and then aggregate to the time-to-treatment level properly, avoiding the way these estimates can "contaminate" each other in the regular model. ```r?example=event_study -# see help(aggregate.fixest) -# As Sun and Abraham indicate, drop any always-treated groups -sun_df <- bacon_df %>% - filter(`_nfd` > min(year) | !treat) %>% - # and set time_to_treat to -1000 for untreated groups - mutate(time_to_treat = case_when( - treat == 0 ~ -1000, - treat == 1 ~ time_to_treat - )) %>% - # and create a new year-treated variable that's impossibly far in the future - # for untreated groups - mutate(year_treated = case_when( - treat == 0 ~ 10000, - treat == 1 ~ `_nfd` - )) %>% - # and a shared identifier for year treated and year - mutate(id = paste0(year_treated, ':', year)) - -# Don't include so many pre- and post-lags that you've got a lot of tiny periods -table(sun_df$time_to_treat) -sun_df <- sun_df %>% - filter(time_to_treat == -1000 | (time_to_treat >= -9 & time_to_treat <= 24)) - -# Read the Sun and Abraham paper before including controls as I do here -m_2 <- feols(asmrs ~ - # This time, interact time_to_treatment with year treated - # Dropping as reference the -1 period and the never-treated - i(time_to_treat, f2 = year_treated, drop = c(-1, -1000)) + - # Controls - pcinc + asmrh + cases - # Fixed effects for group and year - | stfips + year, - data=sun_df) -# Aggregate the coefficients by group -agg_coef = aggregate(m_2, "(time_to_treat)::(-?[[:digit:]]+)") -# And plot -agg_coef %>% - as_tibble() %>% - mutate(conf.low = Estimate - 1.96*`Std. Error`, - conf.high = Estimate + 1.96*`Std. Error`, - `Time to Treatment` = c(-9:-2, 0:24)) %>% - ggplot(mapping = aes(x = `Time to Treatment`, y = Estimate, - ymin = conf.low, ymax = conf.high))+ - geom_pointrange(position = position_dodge(width = 1), - # Optional decoration: - color="black", fatten=.5, alpha=.8) + - geom_line() + - # Add a line marker for y = 0 (to see if the CI overlaps 0) - geom_hline(yintercept=0, color = "red",alpha=0.2)+ - # A marker for the last pre-event period - geom_vline(xintercept = -1, color = "black", size=0.5, alpha=0.4) + - # And the event period - geom_vline(xintercept = 0, linetype="dotted", color = "black", size=0.5, alpha=0.2)+ - # Additional decoration: - theme_bw()+ - labs(title="Event Study: Staggered Treatment with Sun and Abraham (2020) Estimation", y="Estimate", x="Time") + ``` @@ -473,3 +408,4 @@ twoway (sc coef time_to_treat, connect(line)) /// (function y = 0, range(`bottom_range' `top_range') horiz), /// xtitle("Time to Treatment with Sun and Abraham (2020) Estimation") caption("95% Confidence Intervals Shown") ``` + diff --git a/Model_Estimation/Research_Design/instrumental_variables.md b/Model_Estimation/Research_Design/instrumental_variables.md index 8bc0154a..53539c28 100644 --- a/Model_Estimation/Research_Design/instrumental_variables.md +++ b/Model_Estimation/Research_Design/instrumental_variables.md @@ -13,7 +13,7 @@ In the regression model $$ Y = \beta_0 + \beta_1 X + \epsilon $$ -where $$\epsilon$$ is an error term, the estimated $$\hat{\beta}_1$$ will not give the causal effect of $$X$$ on $$Y$$ if $$X$$ is *endogenous* - that is, if $$X$$ is related to $$\epsilon$$ and so determined by forces *within the model* (endogenous). 
+where $$\epsilon$$ is an error term, the estimated $$\hat{\beta}_1$$ will not give the causal effect of $$X$$ on $$Y$$ if $$X$$ is *endogenous* - that is, if $$X$$ is related to $$\epsilon$$ and so determined by forces *within the model* (endogenous). One way to recover the causal effect of $$X$$ on $$Y$$ is to use instrumental variables. If there exists a variable $$Z$$ that is related to $$X$$ but is completely unrelated to $$\epsilon$$ (perhaps after adding some controls), then you can use instrumental variables estimation to isolate only the part of the variation in $$X$$ that is explained by $$Z$$. Naturally, then, this part of the variation is unrelated to $$\epsilon$$ because $$Z$$ is unrelated to $$\epsilon$$, and you can get the causal effect of that part of $$X$$. @@ -45,21 +45,23 @@ from linearmodels.iv import IV2SLS import pandas as pd import numpy as np -df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/AER/CigarettesSW.csv', - index_col=0) +df = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/AER/CigarettesSW.csv", + index_col=0, +) # We will use cigarette taxes as an instrument for cigarette prices # to evaluate the effect of cigarette price on log number of packs smoked # With income per capita as a control # Adjust everything for inflation -df['rprice'] = df['price']/df['cpi'] -df['rincome'] = df['income']/df['population']/df['cpi'] -df['tdiff'] = (df['taxs'] - df['tax'])/df['cpi'] +df["rprice"] = df["price"] / df["cpi"] +df["rincome"] = df["income"] / df["population"] / df["cpi"] +df["tdiff"] = (df["taxs"] - df["tax"]) / df["cpi"] # Specify formula in format of 'y ~ exog + [endog ~ instruments]'. # The '1' on the right-hand side of the formula adds a constant. -formula = 'np.log(packs) ~ 1 + np.log(rincome) + [np.log(rprice) ~ tdiff]' +formula = "np.log(packs) ~ 1 + np.log(rincome) + [np.log(rprice) ~ tdiff]" # Specify model and data mod = IV2SLS.from_formula(formula, df) @@ -69,7 +71,6 @@ res = mod.fit() # Show model summary res.summary - ``` ## R @@ -89,14 +90,15 @@ data(CigarettesSW) # With income per capita as a control # Adjust everything for inflation -CigarettesSW$rprice <- CigarettesSW$price/CigarettesSW$cpi -CigarettesSW$rincome <- CigarettesSW$income/CigarettesSW$population/CigarettesSW$cpi -CigarettesSW$tdiff <- (CigarettesSW$taxs - CigarettesSW$tax)/CigarettesSW$cpi +CigarettesSW$rprice <- CigarettesSW$price / CigarettesSW$cpi +CigarettesSW$rincome <- CigarettesSW$income / CigarettesSW$population / CigarettesSW$cpi +CigarettesSW$tdiff <- (CigarettesSW$taxs - CigarettesSW$tax) / CigarettesSW$cpi # The regression formula takes the format # dependent.variable ~ endogenous.variables + controls | instrumental.variables + controls ivmodel <- ivreg(log(packs) ~ log(rprice) + log(rincome) | tdiff + log(rincome), - data = CigarettesSW) + data = CigarettesSW +) summary(ivmodel) @@ -104,23 +106,25 @@ summary(ivmodel) library(lfe) # The regression formula takes the format -# dependent vairable ~ +# dependent vairable ~ # controls | -# fixed.effects | +# fixed.effects | # (endogenous.variables ~ instruments) | # clusters.for.standard.errors # So if need be it is straightforward to adjust this example to account for # fixed effects and clustering. # Note the 0 indicating no fixed effects ivmodel2 <- felm(log(packs) ~ log(rincome) | 0 | (log(rprice) ~ tdiff), - data = CigarettesSW) + data = CigarettesSW +) summary(ivmodel2) # felm can also use several k-class estimation methods; see help(felm) for the full list. 
-# Let's run it with a limited-information maximum likelihood estimator with +# Let's run it with a limited-information maximum likelihood estimator with # the fuller adjustment set to minimize squared error (4). ivmodel3 <- felm(log(packs) ~ log(rincome) | 0 | (log(rprice) ~ tdiff), - data = CigarettesSW, kclass = 'liml', fuller = 4) + data = CigarettesSW, kclass = "liml", fuller = 4 +) summary(ivmodel3) ``` @@ -144,12 +148,13 @@ g lrprice = ln(rprice) * The syntax for the regression is * name_of_estimator dependent_variable controls (endogenous_variable = instruments) -* where name_of_estimator can be two stage least squares (2sls), -* limited information maximum likelihood (liml, note that ivregress doesn't support k-class estimators), +* where name_of_estimator can be two stage least squares (2sls), +* limited information maximum likelihood (liml, note that ivregress doesn't support k-class estimators), * or generalized method of moments (gmm) * Here we can run two stage least squares ivregress 2sls lpacks rincome (lrprice = tdiff) -* Or gmm. +* Or gmm. ivregress gmm lpacks rincome (lrprice = tdiff) ``` + diff --git a/Model_Estimation/Research_Design/regression_discontinuity_design.md b/Model_Estimation/Research_Design/regression_discontinuity_design.md index 2cde186d..3a1d9d1e 100644 --- a/Model_Estimation/Research_Design/regression_discontinuity_design.md +++ b/Model_Estimation/Research_Design/regression_discontinuity_design.md @@ -85,25 +85,28 @@ df <- read.csv("https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.i # If we want to specify options for bandwidth selection, we can run rdbwselect directly. # Otherwise, rdrobust will run it with default options by itself # c(0) indicates that treatment is assigned at 0 (i.e. someone gets more votes than the opponent) -bandwidth <- rdbwselect(df$y, df$x, c=0) +bandwidth <- rdbwselect(df$y, df$x, c = 0) # Run a sharp RDD with a second-order polynomial term rdd <- rdrobust(df$y, df$x, - c=0, p=2) + c = 0, p = 2 +) summary(rdd) # Run a fuzzy RDD # We don't have a fuzzy RDD in this data, but let's create one, where # probability of treatment jumps from 20% to 60% at the cutoff N <- nrow(df) -df$treatment <- (runif(N) < .2)*(df$x < 0) + (runif(N) < .6)*(df$x >= 0) +df$treatment <- (runif(N) < .2) * (df$x < 0) + (runif(N) < .6) * (df$x >= 0) rddfuzzy <- rdrobust(df$y, df$x, - c=0, p=2, fuzzy = df$treatment) + c = 0, p = 2, fuzzy = df$treatment +) summary(rddfuzzy) # Generate a standard RDD plot with a polynomial of 2 (default is 4) rdplot(df$y, df$x, - c = 0, p = 2) + c = 0, p = 2 +) ``` ## Stata @@ -138,4 +141,3 @@ rdrobust y x, c(0) fuzzy(treatment) rdplot y x, c(0) p(2) ``` - diff --git a/Model_Estimation/Research_Design/synthetic_control_method.md b/Model_Estimation/Research_Design/synthetic_control_method.md index 14b7e6cb..86785bbe 100644 --- a/Model_Estimation/Research_Design/synthetic_control_method.md +++ b/Model_Estimation/Research_Design/synthetic_control_method.md @@ -11,15 +11,15 @@ mathjax: true ## Switch to false if this page has no equations or other math ren Synthetic Control Method is a way of estimating the causal effect of an intervention in comparative case studies. It is typically used with a small number of large units (e.g. countries, states, counties) to estimate the effects of aggregate interventions. 
The idea is to construct a convex combination of similar untreated units (often referred to as the "donor pool") to create a synthetic control that closely resembles the treatment subject and conduct counterfactual analysis with it.
-We have $$j = 1, 2, ..., J+1$$ units, assuming without loss of generality that the first unit is the treated unit, $$Y_{1t}$$. Denoting the potential outcome without intervention as $$Y_{1t}^N$$, our goal is to estimate the treatment effect:
+We have $$j = 1, 2, ..., J+1$$ units, assuming without loss of generality that the first unit is the treated unit, $$Y_{1t}$$. Denoting the potential outcome without intervention as $$Y_{1t}^N$$, our goal is to estimate the treatment effect:
$$
\tau_{1t} = Y_{1t} - Y_{1t}^N
$$
-We won't have data for $$Y_{1t}^N$$ but we can use synthetic controls to estimate it.
+We won't have data for $$Y_{1t}^N$$ but we can use synthetic controls to estimate it.
-Let the $$k$$ x $$J$$ matrix $$X_0 = [X_2 ... X_{J+1}]$$ represent characteristics for the untreated units and the $$k$$-length vector $$X_1$$ represent characteristics for the treatment unit. Last, define our $$J\times 1$$ vector of weights as $$W = (w_2, ..., w_{J+1})'$$. Recall, these weights are used to form a convex combination of the untreated units. Now we have our estimate for the treatment effect:
+Let the $$k$$ x $$J$$ matrix $$X_0 = [X_2 ... X_{J+1}]$$ represent characteristics for the untreated units and the $$k$$-length vector $$X_1$$ represent characteristics for the treatment unit. Last, define our $$J\times 1$$ vector of weights as $$W = (w_2, ..., w_{J+1})'$$. Recall, these weights are used to form a convex combination of the untreated units. Now we have our estimate for the treatment effect:
$$
\hat{\tau_{1t}} = Y_{1t} - \hat{Y_{1t}^N}
@@ -27,10 +27,10 @@ $$
where $$\hat{Y_{1t}^N} = \sum_{j=2}^{J+1} w_j Y_{jt}$$.
-The matrix of weights is found by choosing $$W*$$ to minimize
+The matrix of weights is found by choosing $$W*$$ to minimize
$$
\|X_1 - X_0W\|
-$$
+$$
such that $$W \geq 0$$ (element-wise) and $$\sum_{j=2}^{J+1} w_j = 1$$. Once you've found the $$W*$$, you can put together an estimated $$\hat{Y_{1t}}$$ (synthetic control) for all time periods $$t$$. Because our synthetic control was constructed from untreated units, when the intervention occurs at time $$T_0$$, the difference between the synthetic control and the treated unit gives us our estimated treatment effect.
@@ -43,9 +43,9 @@ As a last bit of intuition, below is a graph depicting the upshot of the method.
 - Unlike the [difference-in-difference](https://lost-stats.github.io/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html) method, parallel trends aren't a necessary assumption. However, the donor pool must still share similar characteristics to the treatment unit in order to construct an accurate estimate.
 - Panel data is necessary for the synthetic control method and, typically, requires observations over many time periods. Specifically, the pre-intervention time frame ought to be large enough to form an accurate estimate.
-- Aggregate data is required for this method. Examples include state-level per-capita GDP, country-level crime rates, and state-level alcohol consumption statistics.
Additionally, if aggregate data doesn't exist, you can sometimes aggregate micro-level data to estimate aggregate values. - As a caveat to the previous bullet point, be wary of structural breaks when using large pre-intervention periods. -- [Abadie and L'Hour (2020)](https://economics.mit.edu/files/18642) also proposes a penalization method for performing the synthetic control method on disaggregated data. +- [Abadie and L'Hour (2020)](https://economics.mit.edu/files/18642) also proposes a penalization method for performing the synthetic control method on disaggregated data. ## Also Consider @@ -74,7 +74,7 @@ data("synth.data") # Once we've gathered our dataprep() output, we can just use that as our sole input for synth() and we'll be good to go. # One important note is that your data must be in long format with id variables (integers) and name variables (character) for each unit. -dataprep_out = dataprep( +dataprep_out <- dataprep( foo = synth.data, # first input is our data predictors = c("X1", "X2", "X3"), # identify our predictor variables predictors.op = "mean", # operation to be performed on the predictor variables for when we form our X_1 and X_0 matrices. @@ -84,8 +84,10 @@ dataprep_out = dataprep( unit.names.variable = "name", # identify our name variable time.variable = "year", # identify our time period variable treatment.identifier = 7, # integer that indicates the id variable value for our treatment unit - controls.identifier = c(2, 13, 17, 29, - 32, 36, 38), # vector that indicates the id variable values for the donor pool + controls.identifier = c( + 2, 13, 17, 29, + 32, 36, 38 + ), # vector that indicates the id variable values for the donor pool time.optimize.ssr = c(1984:1990), # identify the time period you want to optimize over to find the W*. Includes pre-treatment period and the treatment year. time.plot = c(1984:1996) # periods over which results are to be plotted with Synth's plot functions ) @@ -93,7 +95,7 @@ dataprep_out = dataprep( # Now we have our data ready in the form of a list. We have all the matrices we need to run synth() # Our output from the synth() function will be a list that includes our optimal weight matrix W* -synth_out = dataprep_out %>% synth() +synth_out <- dataprep_out %>% synth() # From here, we can plot the treatment variable and the synthetic control using Synth's plot function. # The variable tr.intake is an optional variable if you want a dashed vertical line where the intervention takes place. @@ -102,8 +104,7 @@ synth_out %>% path.plot(dataprep.res = dataprep_out, tr.intake = 1990) # Finally, we can construct our synthetic control variable if we wanted to conduct difference-in-difference analysis on it to estimate the treatment effect. 
-synth_control = dataprep_out$Y0plot %*% synth_out$solution.w - +synth_control <- dataprep_out$Y0plot %*% synth_out$solution.w ``` ## Stata @@ -115,57 +116,58 @@ To implement the synthetic control method in Stata, we will be using the [synth] ssc install blindschemes *Install synth and synth_runner if they're not already installed (uncomment these to install) -* ssc install synth, all +* ssc install synth, all * cap ado uninstall synth_runner //in-case already installed * net install synth_runner, from(https://raw.github.com/bquistorff/synth_runner/master/) replace -*Import Dataset -sysuse synth_smoking.dta, clear +*Import Dataset +sysuse synth_smoking.dta, clear -*Need to set the data as time series, using tsset -tsset state year +*Need to set the data as time series, using tsset +tsset state year ``` -Next we will run the synthetic control analysis using synth_runner, which adds some useful options for estimation. +Next we will run the synthetic control analysis using synth_runner, which adds some useful options for estimation. -Note that this example uses the pre-treatment outcome for just three years (1988, 1980, and 1975), but any combination of pre-treatment outcome years can be specified. The `nested` option specifies a more computationally intensive but comprehensive method for estimating the synthetic control. The `trunit()` option specifies the ID of the treated entity (in this case, the state of California has an ID of 3). +Note that this example uses the pre-treatment outcome for just three years (1988, 1980, and 1975), but any combination of pre-treatment outcome years can be specified. The `nested` option specifies a more computationally intensive but comprehensive method for estimating the synthetic control. The `trunit()` option specifies the ID of the treated entity (in this case, the state of California has an ID of 3). ```stata synth cigsale beer lnincome retprice age15to24 cigsale(1988) /// cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) fig /// - nested keep(synth_results_data.dta) replace + nested keep(synth_results_data.dta) replace -/*Keeping the synth_results_data.dta stores a dataset of all the time series values of cigsale for each +/*Keeping the synth_results_data.dta stores a dataset of all the time series values of cigsale for each year for California (observed) and synthetic California (constructed using a weighted average of - observed data from donor states). We can then import this dataset to create a synth plot whose + observed data from donor states). We can then import this dataset to create a synth plot whose attributes we can control. 
*/ -use synth_results_data.dta, clear +use synth_results_data.dta, clear drop _Co_Number _W_Weight // Drops the columns of the data that store the donor state weights twoway line (_Y_treated _Y_synthetic _time), scheme(plottig) xline(1989) /// xtitle(Year) ytitle(Cigarette Sales) legend(pos(6) rows(1)) ** Run the analysis using synth_runner -*Import Dataset -sysuse synth_smoking.dta, clear +*Import Dataset +sysuse synth_smoking.dta, clear -*Need to set the data as time series, using tsset -tsset state year +*Need to set the data as time series, using tsset +tsset state year *Estimate Synthetic Control using synth_runner synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) /// cigsale(1975), trunit(3) trperiod(1989) gen_vars ``` -We can plot the effects in two ways: displaying both the treated and synthetic time series together and displaying the difference between the two over the time series. The first plot is equivalent to the plot produced by specifying the `fig` option for synth, except you can control aspects of the figure. For both plots you can control the plot appearence by specifying `effect_options()` or `tc_options()`, depending on which plot you would like to control. +We can plot the effects in two ways: displaying both the treated and synthetic time series together and displaying the difference between the two over the time series. The first plot is equivalent to the plot produced by specifying the `fig` option for synth, except you can control aspects of the figure. For both plots you can control the plot appearence by specifying `effect_options()` or `tc_options()`, depending on which plot you would like to control. ```stata effect_graphs, trlinediff(-1) effect_gname(cigsale1_effect) tc_gname(cigsale1_tc) /// effect_options(scheme(plottig)) tc_options(scheme(plottig)) - -/*Graph the outcome paths of all units and (if there is only one treated unit) + +/*Graph the outcome paths of all units and (if there is only one treated unit) a second graph that shows prediction differences for all units */ single_treatment_graphs, trlinediff(-1) raw_gname(cigsale1_raw) /// effects_gname(cigsale1_effects) effects_ylabels(-30(10)30) /// effects_ymax(35) effects_ymin(-35) ``` + diff --git a/Model_Estimation/Research_Design/two_by_two_difference_in_difference.md b/Model_Estimation/Research_Design/two_by_two_difference_in_difference.md index 17d1eeef..88545b32 100644 --- a/Model_Estimation/Research_Design/two_by_two_difference_in_difference.md +++ b/Model_Estimation/Research_Design/two_by_two_difference_in_difference.md @@ -32,7 +32,6 @@ Difference-in-difference makes use of a treatment that was applied to one group ## Python ```python - # Step 1: Load libraries and import data import pandas as pd @@ -49,47 +48,48 @@ url = ( df = pd.read_excel(url) -# Step 2: indicator variables +# Step 2: indicator variables # whether treatment has occured at all -df['after'] = df['year'] >= 2014 +df["after"] = df["year"] >= 2014 # whether it has occurred to this entity -df['treatafter'] = df['after'] * df['treat'] +df["treatafter"] = df["after"] * df["treat"] # Step 3: # use pandas basic built in plot functionality to get a visual # perspective of our parallel trends assumption -ax = df.pivot(index='year', columns='treat', values='murder').plot( +ax = df.pivot(index="year", columns="treat", values="murder").plot( figsize=(20, 10), - marker='.', - markersize=20, - title='Murder and Time', - xlabel='Year', - ylabel='Murder Rate', + marker=".", + 
markersize=20, + title="Murder and Time", + xlabel="Year", + ylabel="Murder Rate", # to make sure each year is displayed on axis - xticks=df['year'].drop_duplicates().sort_values().astype('int') + xticks=df["year"].drop_duplicates().sort_values().astype("int"), ) -# the function returns a matplotlib.pyplot.Axes object +# the function returns a matplotlib.pyplot.Axes object # we can use this axis to add additional decoration to our plot -ax.axvline(x=2014, color='gray', linestyle='--') # treatment year -ax.legend(loc='upper left', title='treat', prop={'size': 20}) # move and label legend +ax.axvline(x=2014, color="gray", linestyle="--") # treatment year +ax.legend(loc="upper left", title="treat", prop={"size": 20}) # move and label legend # Step 4: # statsmodels has two separate APIs # the original API is more complete both in terms of functionality and documentation -X = sm.add_constant(df[['treat', 'treatafter', 'after']].astype('float')) -y = df['murder'] +X = sm.add_constant(df[["treat", "treatafter", "after"]].astype("float")) +y = df["murder"] sm_fit = sm.OLS(y, X).fit() -# the formula API is more familiar for R users +# the formula API is more familiar for R users # it can be accessed through an alternate constructor bound to each model class -smff_fit = sm.OLS.from_formula('murder ~ 1 + treat + treatafter + after', data=df).fit() +smff_fit = sm.OLS.from_formula("murder ~ 1 + treat + treatafter + after", data=df).fit() # it can also be accessed through a separate namespace import statsmodels.formula.api as smf -smf_fit = smf.ols('murder ~ 1 + treat + treatafter + after', data=df).fit() + +smf_fit = smf.ols("murder ~ 1 + treat + treatafter + after", data=df).fit() # if using jupyter, rich output is displayed without the print function # we should see three identical outputs @@ -133,7 +133,7 @@ If the year is after 2014 **and** the state decided to legalize marijuana, the i ```r?example=did DiD <- DiD %>% mutate(after = year >= 2014) %>% - mutate(treatafter = after*treat) + mutate(treatafter = after * treat) ``` Step 3: @@ -141,10 +141,11 @@ Step 3: Then we need to plot the graph to visualize the impact of legalize marijuana on murder rate by using `ggplot`. ```r?example=did -mt <- ggplot(DiD,aes(x=year, y=murder, color = treat)) + - geom_point(size=3)+geom_line() + - geom_vline(xintercept=2014,lty=4) + - labs(title="Murder and Time", x="Year", y="Murder Rate") +mt <- ggplot(DiD, aes(x = year, y = murder, color = treat)) + + geom_point(size = 3) + + geom_line() + + geom_vline(xintercept = 2014, lty = 4) + + labs(title = "Murder and Time", x = "Year", y = "Murder Rate") mt ``` ![Diff-in-Diff](../Images/Two_by_Two_Difference_in_Difference/difindif.jpg) @@ -156,8 +157,9 @@ Step 4: We need to measure the impact of impact of legalize marijuana. If we include `treat`, `after`, and `treatafter` in a regression, the coefficient on `treatafter` can be interpreted as "how much bigger was the before-after difference for the treated group?" which is the DiD estimate. ```r?example=did -reg<-lm(murder ~ treat+treatafter+after, data = DiD) +reg <- lm(murder ~ treat + treatafter + after, data = DiD) summary(reg) ``` After legalization, the murder rate dropped by 0.3% more in treated than untreated states, suggesting that legalization reduced the murder rate. 
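
As a quick sanity check on that interpretation, the same estimate can be computed straight from the four group means, because the 2x2 regression is saturated. A minimal pandas sketch, assuming the `df`, `treat`, and `after` columns built in the Python example above (with `treat` coded 0/1):

```python
# Mean murder rate in each of the four treatment-by-period cells
treated_after    = df.loc[(df["treat"] == 1) & (df["after"]), "murder"].mean()
treated_before   = df.loc[(df["treat"] == 1) & (~df["after"]), "murder"].mean()
untreated_after  = df.loc[(df["treat"] == 0) & (df["after"]), "murder"].mean()
untreated_before = df.loc[(df["treat"] == 0) & (~df["after"]), "murder"].mean()

# "How much bigger was the before-after difference for the treated group?"
did_by_hand = (treated_after - treated_before) - (untreated_after - untreated_before)
print(did_by_hand)  # should line up with the coefficient on treatafter above
```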
+ diff --git a/Model_Estimation/Statistical_Inference/Nonstandard_Errors/bootstrap_se.md b/Model_Estimation/Statistical_Inference/Nonstandard_Errors/bootstrap_se.md index 3317a4da..1558eb0a 100644 --- a/Model_Estimation/Statistical_Inference/Nonstandard_Errors/bootstrap_se.md +++ b/Model_Estimation/Statistical_Inference/Nonstandard_Errors/bootstrap_se.md @@ -16,8 +16,8 @@ Bootstrap is commonly used to calculate standard errors. If you produce many boo ## Keep in Mind -- Although it feels entirely data-driven, bootstrap standard errors rely on assumptions just like everything else. It assumes your original model is correctly specified, for example. Basic bootstrapping assumes observations are independent of each other. -- It is possible to allow for correlations across units by using block-bootstrap. +- Although it feels entirely data-driven, bootstrap standard errors rely on assumptions just like everything else. It assumes your original model is correctly specified, for example. Basic bootstrapping assumes observations are independent of each other. +- It is possible to allow for correlations across units by using block-bootstrap. - Bootstrapping can also be used to calculate other features of the parameter's sample distribution, like the percentile, not just the standard error. ## Also Consider @@ -43,10 +43,10 @@ library(lmtest) data(mtcars) # Run a regression with normal (iid) errors - m <- lm(hp~mpg + cyl, data = mtcars) - - # Obtain the boostrapped SEs - coeftest(m, vcov = vcovBS(m)) +m <- lm(hp ~ mpg + cyl, data = mtcars) + +# Obtain the boostrapped SEs +coeftest(m, vcov = vcovBS(m)) ``` Another approach to obtaining bootstrapping standard errors in R is to use the **boot** package ([link](https://cran.r-project.org/web/packages/boot/)). This is typcally more hands-on, but gives the user a lot of control over how the bootrapping procedure will execute. @@ -62,8 +62,8 @@ library(boot) # A dataset and indices as input, and then # performs analysis and returns a parameter of interest regboot <- function(data, indices) { - m1 <- lm(hp~mpg + cyl, data = data[indices,]) - + m1 <- lm(hp ~ mpg + cyl, data = data[indices, ]) + return(coefficients(m1)) } @@ -86,8 +86,8 @@ library(broom) tidy_results <- tidy(boot_results) library(stargazer) -m1 <- lm(hp~mpg + cyl, data = mtcars) -stargazer(m1, se = list(tidy_results$std.error), type = 'text') +m1 <- lm(hp ~ mpg + cyl, data = mtcars) +stargazer(m1, se = list(tidy_results$std.error), type = "text") ``` ## Stata @@ -107,35 +107,36 @@ reg mpg weight length, vce(bootstrap, reps(200)) ``` Alternatively, most commands will also accept using the `bootstrap` prefix. Even if they do not allow the option `vce(bootstrap)`. -```Stata +```stata * If a command does not support vce(bootstrap), there's a good chance it will * work with a bootstrap: prefix, which works similarly -bootstrap, reps(200): reg mpg weight length +bootstrap, reps(200): reg mpg weight length ``` If your model uses weights, `bootstrap` prefix (or `vce(bootstrap)` ) will not be appropriate, and the above command may give you an error: -```Stata +```stata *This should give you an error bootstrap, reps(200): reg mpg foreign length [pw=weight] ``` `bootstrap`, however, can be used to estimate standard errors of more complex systems. This, however, require some programming. Below an example for bootstrapping marginal effects for `ivprobit`. 
-```Stata +```stata webuse laborsup, clear ** Start creating a small program program two_ivp, eclass * estimate first stage reg other_inc male_educ fem_educ kids -* estimate residuals +* estimate residuals capture drop res predict res, res * add them to the probit first stage -* This is what ivprobit two step does. +* This is what ivprobit two step does. probit fem_work fem_educ kids other_inc res margins, dydx(fem_educ kids other_inc) post end ** now simply bootstrap the program: bootstrap, reps(100):two_ivp ``` + diff --git a/Model_Estimation/Statistical_Inference/Nonstandard_Errors/clustered_se.md b/Model_Estimation/Statistical_Inference/Nonstandard_Errors/clustered_se.md index 718fb6be..819859b4 100644 --- a/Model_Estimation/Statistical_Inference/Nonstandard_Errors/clustered_se.md +++ b/Model_Estimation/Statistical_Inference/Nonstandard_Errors/clustered_se.md @@ -32,9 +32,9 @@ For cluster-robust estimation of (high-dimensional) fixed effect models in R, se Cluster-robust standard errors for many different kinds of regression objects in R can be obtained using the `vcovCL` or `vcovBS` functions from the **sandwich** package ([link](http://sandwich.r-forge.r-project.org/index.html)). To perform statistical inference, we combine these with the `coeftest` function from the **lmtest** package. This approach allows users to adjust the standard errors for a model "[on-the-fly](https://grantmcdermott.com/better-way-adjust-SEs/)" (i.e. post-estimation) and is thus very flexible. -```R?example=clustered +```r?example=clustered # Read in data from the College Scorecard -df <- read.csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv') +df <- read.csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv") # Create a regression model with normal (iid) errors my_model <- lm(repay_rate ~ earnings_med + state_abbr, data = df) @@ -47,15 +47,17 @@ coeftest(my_model, vcov = vcovCL(my_model, cluster = ~inst_name)) Alternately, users can specify clustered standard errors directly in the model call using the `lm_robust` function from the **estimatr** package ([link](https://github.com/DeclareDesign/estimatr)). This latter approach is very similar to how errors are clustered in Stata, for example. -```R?example=clustered +```r?example=clustered # Alternately, use estimator::lm_robust to specify clustered SEs in the original model call. # Standard error types are referred to as CR0, CR1 ("stata"), CR2 here. # Here, CR2 is the default library(estimatr) -my_model2 <- lm_robust(repay_rate ~ earnings_med + state_abbr, data = df, - clusters = inst_name, - se_type = "stata") +my_model2 <- lm_robust(repay_rate ~ earnings_med + state_abbr, + data = df, + clusters = inst_name, + se_type = "stata" +) summary(my_model2) ``` @@ -78,3 +80,4 @@ encode state_abbr, g(state_encoded) * This will give you CR1 regress repay_rate earnings_med i.state_encoded, vce(cluster inst_name_encoded) ``` + diff --git a/Model_Estimation/Statistical_Inference/Nonstandard_Errors/hc_se.md b/Model_Estimation/Statistical_Inference/Nonstandard_Errors/hc_se.md index 42e67de7..fb545380 100644 --- a/Model_Estimation/Statistical_Inference/Nonstandard_Errors/hc_se.md +++ b/Model_Estimation/Statistical_Inference/Nonstandard_Errors/hc_se.md @@ -28,9 +28,9 @@ Many regression models assume homoskedasticity (i.e. 
constant variance of the er ## R -The easiest way to obtain robust standard errors in R is with the **estimatr** package ([link](https://declaredesign.org/r/estimatr/)) and its family of `lm_robust` functions. These will default to "HC2" errors, but users can specify a variety of other options. +The easiest way to obtain robust standard errors in R is with the **estimatr** package ([link](https://declaredesign.org/r/estimatr/)) and its family of `lm_robust` functions. These will default to "HC2" errors, but users can specify a variety of other options. -```R +```r # If necessary, install estimatr # install.packages(c('estimatr')) library(estimatr) @@ -45,7 +45,7 @@ summary(m1) Alternately, users may consider the `vcovHC` function from the **sandwich** package ([link](https://cran.r-project.org/web/packages/sandwich/index.html)), which is very flexible and supports a wide variety of generic regression objects. For inference (t-tests, etc.), use in conjunction with the `coeftest` function from the **lmtest** package ([link](https://cran.r-project.org/web/packages/lmtest/index.html)). -```R +```r # If necessary, install lmtest and sandwich # install.packages(c('lmtest','sandwich')) library(sandwich) @@ -54,8 +54,8 @@ library(lmtest) # Create a normal regression model (i.e. without robust standard errors) m2 <- lm(mpg ~ cyl + disp + hp, data = mtcars) -# Get the robust VCOV matrix using sandwich::vcovHC(). We can pick the kind of robust errors -# with the "type" argument. Note that, unlike estimatr::lm_robust(), the default this time +# Get the robust VCOV matrix using sandwich::vcovHC(). We can pick the kind of robust errors +# with the "type" argument. Note that, unlike estimatr::lm_robust(), the default this time # is "HC3". I'll specify it here anyway just to illustrate. vcovHC(m2, type = "HC3") sqrt(diag(vcovHC(m2))) ## HAC SEs @@ -79,3 +79,4 @@ regress price mpg gear_ratio foreign, robust * For other kinds of robust standard errors use vce() regress price mpg gear_ratio foreign, vce(hc3) ``` + diff --git a/Model_Estimation/Statistical_Inference/Nonstandard_Errors/nonstandard_errors.md b/Model_Estimation/Statistical_Inference/Nonstandard_Errors/nonstandard_errors.md index 1040d242..847a1fa4 100644 --- a/Model_Estimation/Statistical_Inference/Nonstandard_Errors/nonstandard_errors.md +++ b/Model_Estimation/Statistical_Inference/Nonstandard_Errors/nonstandard_errors.md @@ -8,3 +8,4 @@ nav_order: 100 --- # Nonstandard errors + diff --git a/Model_Estimation/Statistical_Inference/Statistical_Inference.md b/Model_Estimation/Statistical_Inference/Statistical_Inference.md index ce208e88..af22609a 100644 --- a/Model_Estimation/Statistical_Inference/Statistical_Inference.md +++ b/Model_Estimation/Statistical_Inference/Statistical_Inference.md @@ -7,3 +7,4 @@ nav_order: 5 --- # Statistical Inference + diff --git a/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.md b/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.md index 89510105..8049590e 100644 --- a/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.md +++ b/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.md @@ -35,7 +35,7 @@ Alternately, you may want to do a joint significance test of multiple linear hyp Linear hypothesis test in R can be performed for most regression models using the `linearHypothesis()` function in the **car** package. See [this guide](https://www.econometrics-with-r.org/7-3-joint-hypothesis-testing-using-the-f-statistic.html) for more information. 
-```R +```r # If necessary # install.packages('car') library(car) @@ -46,13 +46,13 @@ data(mtcars) m1 <- lm(mpg ~ hp + disp + am + wt, data = mtcars) # Test a linear combination of coefficients -linearHypothesis(m1, c('hp + disp = 0')) +linearHypothesis(m1, c("hp + disp = 0")) # Test joint significance of multiple coefficients -linearHypothesis(m1, c('hp = 0','disp = 0')) +linearHypothesis(m1, c("hp = 0", "disp = 0")) # Test joint significance of multiple linear combinations -linearHypothesis(m1, c('hp + disp = 0','am + wt = 0')) +linearHypothesis(m1, c("hp + disp = 0", "am + wt = 0")) ``` ## Stata @@ -80,3 +80,4 @@ test headroom + trunk = 0 test headroom + trunk = 0 test price + rep78 = 0, accumulate ``` + diff --git a/NewPageTemplate.md b/NewPageTemplate.md index 0edef7c1..4227aaa2 100644 --- a/NewPageTemplate.md +++ b/NewPageTemplate.md @@ -35,7 +35,7 @@ $$ ## NAME OF LANGUAGE/SOFTWARE 1 -```identifier for language type, see this page: https://github.com/jmm/gfm-lang-ids/wiki/GitHub-Flavored-Markdown-%28GFM%29-language-IDs +```identifier for language type, see this page: https://github.com/jmm/gfm-lang-ids/wiki/github-flavored-markdown-%28gfm%29-language-ids Commented code demonstrating the technique ``` @@ -45,12 +45,13 @@ There are two ways to perform this technique in language/software 2. First, explanation of what is different about the first way: -```identifier for language type, see this page: https://github.com/jmm/gfm-lang-ids/wiki/GitHub-Flavored-Markdown-%28GFM%29-language-IDs +```identifier for language type, see this page: https://github.com/jmm/gfm-lang-ids/wiki/github-flavored-markdown-%28gfm%29-language-ids Commented code demonstrating the technique ``` Second, explanation of what is different about the second way: -```identifier for language type, see this page: https://github.com/jmm/gfm-lang-ids/wiki/GitHub-Flavored-Markdown-%28GFM%29-language-IDs +```identifier for language type, see this page: https://github.com/jmm/gfm-lang-ids/wiki/github-flavored-markdown-%28gfm%29-language-ids Commented code demonstrating the technique ``` + diff --git a/Other/Other.md b/Other/Other.md index 462450a9..9d400210 100644 --- a/Other/Other.md +++ b/Other/Other.md @@ -5,3 +5,4 @@ nav_order: 9 --- # Other + diff --git a/Other/create_a_conda_package.md b/Other/create_a_conda_package.md index cc080bb5..80d656b2 100644 --- a/Other/create_a_conda_package.md +++ b/Other/create_a_conda_package.md @@ -146,3 +146,4 @@ Then in the `.circleci/build_steps.sh` file, comment out the line that starts `g Then you should be able to run `./.circleci/run_docker_build.sh`. You'll probably see some errors which you'll need to fix. Once these errors are sorted out, you can push your recipe to GitHub and create a PR. Make sure to name the PR something memorable, e.g., `Adding linearmodels recipe`. + diff --git a/Other/get_a_list_of_files.md b/Other/get_a_list_of_files.md index 43415bc2..3b60c918 100644 --- a/Other/get_a_list_of_files.md +++ b/Other/get_a_list_of_files.md @@ -25,33 +25,32 @@ Note that, because these code examples necessarily refer to files on disk, they The `glob` module finds all pathnames matching a specified pattern and stores them in a list. 
```python - import glob # Retrieve all csvs in the working directory -list_of_files = glob.glob('*.csv') +list_of_files = glob.glob("*.csv") # Retrieve all csvs in the working directory and all sub-directories -list_of_files = glob.glob('**/*.csv', recursive=True) +list_of_files = glob.glob("**/*.csv", recursive=True) ``` ## R The `list.files()` function can produce a list of files that can be looped over. -```r?skip=true&skipReason=files_dont_exist +```r?skip=true&skipreason=files_dont_exist # Get a list of all .csv files in the Data folder # (which sits inside our working directory) -filelist <- list.files('Data','*.csv') +filelist <- list.files("Data", "*.csv") # filelist just contains file names now. If we want it to # open them up from the Data folder we must say so -filelist <- paste0('Data/',filelist) +filelist <- paste0("Data/", filelist) # Read them all in and then row-bind them together # (assuming they're all the same format and can be rbind-ed) -datasets <- lapply(filelist,read.csv) -data <- do.call(rbind,datasets) +datasets <- lapply(filelist, read.csv) +data <- do.call(rbind, datasets) # Or, use the tidyverse with purrr # (assuming they're all the same format and can be rbind-ed) @@ -89,3 +88,4 @@ foreach f in `filelist' { local firsttime = 0 } ``` + diff --git a/Other/import_a_foreign_data_file.md b/Other/import_a_foreign_data_file.md index f609d48b..2c302f76 100644 --- a/Other/import_a_foreign_data_file.md +++ b/Other/import_a_foreign_data_file.md @@ -28,28 +28,28 @@ Because there are so many potential foreign formats, these implementations will ## R -```r?skip=true&skipReason=files_dont_exist +```r?skip=true&skipreason=files_dont_exist library(readxl) -data <- read_excel('filename.xlsx') +data <- read_excel("filename.xlsx") # Read Stata, SAS, and SPSS files with the haven package # install.packages('haven') library(haven) -data <- read_stata('filename.dta') -data <- read_spss('filename.sav') +data <- read_stata("filename.dta") +data <- read_spss("filename.sav") # read_sas also supports .sas7bcat, or read_xpt supports transport files -data <- read_sas('filename.sas7bdat') +data <- read_sas("filename.sas7bdat") # Read lots of other types with the foreign package # install.packages('foreign') library(foreign) -data <- read.arff('filename.arff') -data <- read.dbf('filename.dbf') -data <- read.epiinfo('filename.epiinfo') -data <- read.mtb('filename.mtb') -data <- read.octave('filename.octave') -data <- read.S('filename.S') -data <- read.systat('filename.systat') +data <- read.arff("filename.arff") +data <- read.dbf("filename.dbf") +data <- read.epiinfo("filename.epiinfo") +data <- read.mtb("filename.mtb") +data <- read.octave("filename.octave") +data <- read.S("filename.S") +data <- read.systat("filename.systat") ``` ## Stata @@ -61,3 +61,4 @@ import type using filename ``` where `type` can be `excel`, `spss`, `sas`, `haver`, or `dbase` (`import` can also be used to download data directly from sources like FRED). 
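
If you are working in Python, **pandas** provides similar readers for many of these formats. A minimal sketch using placeholder filenames like the ones above (note that `read_excel` needs an engine such as **openpyxl** installed, and `read_spss` relies on the **pyreadstat** package):

```python
import pandas as pd

# Placeholder filenames, as in the R examples above
data = pd.read_excel("filename.xlsx")    # Excel (.xlsx/.xls)
data = pd.read_stata("filename.dta")     # Stata
data = pd.read_sas("filename.sas7bdat")  # SAS; also reads .xpt transport files
data = pd.read_spss("filename.sav")      # SPSS
```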
+ diff --git a/Other/task_scheduling_with_github_actions.md b/Other/task_scheduling_with_github_actions.md index 48870138..6ded8db7 100644 --- a/Other/task_scheduling_with_github_actions.md +++ b/Other/task_scheduling_with_github_actions.md @@ -44,6 +44,7 @@ import requests URL = "https://whereveryourdatais.com/" + def process_page(html: str) -> List[List[Union[int, str]]]: """ This is the meat of your web scraper: @@ -55,7 +56,7 @@ def pull_data(url: str) -> List[List[Union[int, str]]]: resp = requests.get(url) resp.raise_for_status() - content = resp.content.decode('utf8') + content = resp.content.decode("utf8") return process_page(content) @@ -73,7 +74,7 @@ def main(): print(f"Done pulling data.") print("Writing data...") - with open(filename, 'wt') as outfile: + with open(filename, "wt") as outfile: writer = csv.writer(outfile) writer.writerows(data) print("Done writing data.") @@ -91,7 +92,7 @@ python3 main.py Similarly, if you're using `R`, you'll want to create a `main.R` file to similar effect. For instance, it might look something like: -```R?skip=true&reason=fake_urls +```r?skip=true&reason=fake_urls library(readr) library(httr) @@ -100,40 +101,40 @@ URL <- "https://whereveryourdatais.com/" #' This hte meat of your web scraper: #' Pulling out the data you want from the HTML of the web page process_page <- function(html) { - # Process html + # Process html } #' Pull data from a single URL and return a tibble with it nice and ordered pull_data <- function(url) { - resp <- GET(url) - if (resp$status_code >= 400) { - stop(paste0("Something bad occurred in trying to pull ", URL)) - } + resp <- GET(url) + if (resp$status_code >= 400) { + stop(paste0("Something bad occurred in trying to pull ", URL)) + } - return(process_page(content(resp))) + return(process_page(content(resp))) } main <- function() { - # The program takes 1 optional argument: an output filename. If not present, - # we will write the output a default filename, which is: - date <- Sys.time() - attr(date, "tzone") <- "UTC" - filename <- paste0("data/output-", as.Date(date, format = "%Y-%m-%d")) - - args <- commandArgs(trailingOnly = TRUE) - if (length(args) > 0) { - filename <- args[1] - } + # The program takes 1 optional argument: an output filename. 
If not present, + # we will write the output a default filename, which is: + date <- Sys.time() + attr(date, "tzone") <- "UTC" + filename <- paste0("data/output-", as.Date(date, format = "%Y-%m-%d")) + + args <- commandArgs(trailingOnly = TRUE) + if (length(args) > 0) { + filename <- args[1] + } - print(paste0("Will write data to ", filename)) + print(paste0("Will write data to ", filename)) - print(paste0("Pulling data from ", URL)) - data <- pull_data(URL) - print("Done pulling data") + print(paste0("Pulling data from ", URL)) + data <- pull_data(URL) + print("Done pulling data") - print("Writing data...") - write_csv(data, filename) - print("Done writing data.") + print("Writing data...") + write_csv(data, filename) + print("Done writing data.") } ``` @@ -162,7 +163,7 @@ readr If you're using `R`, you'll also need to add the following script in a file called `install.R` to your project: -```R?skip=true&skipReason=installing_packages +```r?skip=true&skipreason=installing_packages CRAN <- "https://mirror.las.iastate.edu/CRAN/" process_file <- function(filepath) { @@ -344,7 +345,7 @@ api_key = os.environ.get("API_KEY", "some_other_way") or in R you might do -```R +```r api_key <- Sys.getenv("API_KEY", unset = "some_other_way") ``` @@ -355,4 +356,5 @@ api_key <- Sys.getenv("API_KEY", unset = "some_other_way") run: python3 main.py env: API_KEY: {% raw %}${{ secrets.API_KEY }}{% endraw %} -``` \ No newline at end of file +``` + diff --git a/Presentation/Figures/Data/Bar_Graphs/README.md b/Presentation/Figures/Data/Bar_Graphs/README.md index ccfcca65..1f94f492 100644 --- a/Presentation/Figures/Data/Bar_Graphs/README.md +++ b/Presentation/Figures/Data/Bar_Graphs/README.md @@ -1,3 +1,4 @@ # Acknowledgments for Data -The data in this folder was originally posted on the [Star Wars Fandom Wiki](https://starwars.fandom.com/wiki). It was then pulled using the defunct Star Wars API and then posted on Kaggle at [https://www.kaggle.com/jsphyg/star-wars]. \ No newline at end of file +The data in this folder was originally posted on the [Star Wars Fandom Wiki](https://starwars.fandom.com/wiki). It was then pulled using the defunct Star Wars API and then posted on Kaggle at [https://www.kaggle.com/jsphyg/star-wars]. + diff --git a/Presentation/Figures/Figures.md b/Presentation/Figures/Figures.md index 5799555e..b5a4086c 100644 --- a/Presentation/Figures/Figures.md +++ b/Presentation/Figures/Figures.md @@ -7,3 +7,4 @@ nav_order: 1 --- # Figures + diff --git a/Presentation/Figures/Scatterplots.md b/Presentation/Figures/Scatterplots.md index 11b0d32a..ca2a2a61 100644 --- a/Presentation/Figures/Scatterplots.md +++ b/Presentation/Figures/Scatterplots.md @@ -44,12 +44,9 @@ df = sns.load_dataset("tips") # alpha sets the transparency of points. There # are various other keyword arguments to add other # dimensions of information too, eg size. -sns.scatterplot(data=df, - x="total_bill", - y="tip", - alpha=.8, - hue='time').set_title('Tips data', loc='right') - +sns.scatterplot(data=df, x="total_bill", y="tip", alpha=0.8, hue="time").set_title( + "Tips data", loc="right" +) ``` This results in: @@ -63,19 +60,19 @@ In R, one of the best tools for creating scatterplots is the function `ggplot()` To begin we will need to make sure we install and load `ggplot2` as well as any other packages that are useful. 
```r?example=ggplot -#install and load necessary packages +# install and load necessary packages library(ggplot2) -#load the dataset +# load the dataset data(mtcars) ``` Next, we will use `ggplot()`, `aes()`, and `geom_point()` in order to create a basic scatterplot. For this plot, we will put car weight on the x-axis and miles-per-gallon on the y-axis. ```r?example=ggplot -#assign the mtcars dataset to the plot and set each axis -ggplot(data = mtcars,aes(x=wt,y=mpg)) + -#create points on the plot for each observation +# assign the mtcars dataset to the plot and set each axis +ggplot(data = mtcars, aes(x = wt, y = mpg)) + + # create points on the plot for each observation geom_point() ``` ![Basic Scatterplot](Images/Scatterplots/basic_scatterplot.png) @@ -86,13 +83,15 @@ It is important to remember to include the + after each line when creating a plo Labelling is also an important task. In order to give our scatterplot axis labels and title, we will use the `labs()` function, in conjunction with our previous code. Don't forget your +'s! ```r?example=ggplot -#assign our dataset and variables of interest to the plot +# assign our dataset and variables of interest to the plot ggplot(data = mtcars, aes(x = wt, y = mpg)) + - #create the points + # create the points geom_point() + - #create axis labels and a title - labs(x = "Weight", y = "Miles Per Gallon", - title = "Car MPG by Weight") + # create axis labels and a title + labs( + x = "Weight", y = "Miles Per Gallon", + title = "Car MPG by Weight" + ) ``` ![Scatterplot with Title and Axis Labels](Images/Scatterplots/scatter_titles.png) @@ -102,29 +101,34 @@ That is starting to look better, but our graph could still use a little variety * To change the color of our points, we will use `color`. In this example we will make our points blue. ```r?example=ggplot -#assign our dataset and variables of interest -ggplot(data = mtcars, aes(x =wt, y = mpg)) + - #create points and tell ggplot we want them to be size 4 and blue +# assign our dataset and variables of interest +ggplot(data = mtcars, aes(x = wt, y = mpg)) + + # create points and tell ggplot we want them to be size 4 and blue geom_point(size = 4, color = "blue") + - #don't forget the labels - labs(x = "Weight", y = "Miles Per Gallon", - title = "Car MPG by Weight") + # don't forget the labels + labs( + x = "Weight", y = "Miles Per Gallon", + title = "Car MPG by Weight" + ) ``` ![Scatterplot with Large Blue Points](Images/Scatterplots/scatter_size_color.png) Finally, lets label our points. We can do this by adding a new element to our plot, `geom_text()`. For this example we will label the points on our plot with their horse power. This will allow us to see how horsepower is related to weight and miles-per-gallon. We are also going to set the size of our points to 0.5 to avoid cluttering the scatterplot too much. Just like we can change the color of our points, we can change the color of the labels we put on them. We'll make them red in this example, but feel free to choose another color. 
```r?example=ggplot -#assign our dataset and variables of interest -ggplot(data = mtcars, aes(x =wt, y = mpg)) + - #create points and tell ggplot we want them to be size 0.5 and blue - geom_point(size = 0.5, color = 'blue') + - #add the labels for our points - geom_text(label = mtcars$hp, color = 'red') - #don't forget the labels - labs(x = "Weight", y = "Miles Per Gallon", - title = "Car MPG by Weight") +# assign our dataset and variables of interest +ggplot(data = mtcars, aes(x = wt, y = mpg)) + + # create points and tell ggplot we want them to be size 0.5 and blue + geom_point(size = 0.5, color = "blue") + + # add the labels for our points + geom_text(label = mtcars$hp, color = "red") +# don't forget the labels +labs( + x = "Weight", y = "Miles Per Gallon", + title = "Car MPG by Weight" +) ``` ![Scatterplot with Labels Points](Images/Scatterplots/scatter_labels.png) Congrats! You're well on your way to becoming a scatterplot master! Don't forget to check out the LOST page on [styling scatterplots]({{ "/Presentation/Figures/Styling_Scatterplots.html" | relative_url }}) if you would like to learn more. + diff --git a/Presentation/Figures/Styling_Scatterplots.md b/Presentation/Figures/Styling_Scatterplots.md index 625846b8..93031a4c 100644 --- a/Presentation/Figures/Styling_Scatterplots.md +++ b/Presentation/Figures/Styling_Scatterplots.md @@ -59,7 +59,8 @@ If you have questions about how to use `ggplot` and `aes`, check [Here]({{ "/Pre ```r?example=ggplot ggplot(data = iris, aes( ## Put Sepal.Length on the x-axis, Sepal.Width on the y-axis - x=Sepal.Length, y=Sepal.Width))+ + x = Sepal.Length, y = Sepal.Width +)) + ## Make it a scatterplot with geom_point() geom_point() ``` @@ -79,10 +80,12 @@ Notice that attributes set *outside* of `aes()` apply to *all* points (like `siz We can distinguish the `Species` by `alpha` (transparency). ```r?example=ggplot -ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, - ## Where transparency comes in - alpha=Species)) + - geom_point(size =4, color="seagreen") +ggplot(iris, aes( + x = Sepal.Length, y = Sepal.Width, + ## Where transparency comes in + alpha = Species +)) + + geom_point(size = 4, color = "seagreen") ``` ![Scatterplot with Transparency](Images/Styling_Scatterplots/R_transparency.png) @@ -93,10 +96,12 @@ ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, ```r?example=ggplot -ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, - ## Where shape comes in - shape=Species)) + - geom_point(size = 4,color="orange") +ggplot(iris, aes( + x = Sepal.Length, y = Sepal.Width, + ## Where shape comes in + shape = Species +)) + + geom_point(size = 4, color = "orange") ``` ![Scatterplot with Different Shapes](Images/Styling_Scatterplots/R_shape.png) @@ -107,10 +112,12 @@ ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, `size` is a great option that we can take a look at as well. However, note that `size` will work better with continuous variables. 
```r?example=ggplot -ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, - ## Where size comes in - size=Species)) + - geom_point(shape = 18, color = "#FC4E07") +ggplot(iris, aes( + x = Sepal.Length, y = Sepal.Width, + ## Where size comes in + size = Species +)) + + geom_point(shape = 18, color = "#FC4E07") ``` ![Scatterplot With Different Sizes](Images/Styling_Scatterplots/R_size.png) @@ -128,9 +135,11 @@ Last but not least, let's `color` these points depends on the variable `Species` ## iris$Species <- as.factor(iris$Species) ## Then, we are ready to plot -ggplot(data = iris, aes(x=Sepal.Length, y=Sepal.Width, - ## distinguish the species by color - color=Species))+ +ggplot(data = iris, aes( + x = Sepal.Length, y = Sepal.Width, + ## distinguish the species by color + color = Species +)) + geom_point() ``` ![Scatterplot with different colors](Images/Styling_Scatterplots/R_color.png) @@ -141,15 +150,15 @@ ggplot(data = iris, aes(x=Sepal.Length, y=Sepal.Width, * If you do not like all the options that the **RColorBrewer** and **viridis** packages provide, see [here](http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually) to work with color in the **ggplot2** package. ```r?example=ggplot -ggplot(data = iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species))+ - geom_point()+ +ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + + geom_point() + ## Where RColorBrewer package comes in scale_colour_brewer(palette = "Set1") ## There are more options available for palette -ggplot(data = iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species))+ - geom_point()+ +ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + + geom_point() + ## Where viridis package comes in - scale_color_viridis(discrete=TRUE,option = "D") ## There are more options to choose + scale_color_viridis(discrete = TRUE, option = "D") ## There are more options to choose ``` This first graph is using `RColorBrewer` package,and the second graph is using `viridis` package. @@ -172,11 +181,13 @@ The next step that we can do is to figure out what the most fittable themes to m In fact, **ggplot2** package has many cool themes available alreay such as `theme_classic()`, `theme_minimal()` and `theme_bw()`. Another famous theme is the dark theme: `theme_dark()`. Let's check out some of them. ```r?example=ggplot -ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, - col=Species, - shape=Species)) + - geom_point(size=3) + - scale_color_viridis(discrete=TRUE,option = "D") + +ggplot(iris, aes( + x = Sepal.Length, y = Sepal.Width, + col = Species, + shape = Species +)) + + geom_point(size = 3) + + scale_color_viridis(discrete = TRUE, option = "D") + theme_minimal(base_size = 12) ``` @@ -188,11 +199,13 @@ ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, `ggthemes` package is also worth to check out for working any plots (maps,time-series data, and any other plots) that you are working on. `theme_gdocs()`, `theme_tufte()`, and `theme_calc()` all work very well. See [here](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/) to get more cool themes. 
```r?example=ggplot -ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, - col=Species, - shape=Species)) + - geom_point(size=3) + - scale_color_viridis(discrete=TRUE,option = "D") + +ggplot(iris, aes( + x = Sepal.Length, y = Sepal.Width, + col = Species, + shape = Species +)) + + geom_point(size = 3) + + scale_color_viridis(discrete = TRUE, option = "D") + ## Using the theme_tufte() theme_tufte() ``` @@ -213,17 +226,19 @@ Both `labs()` and `ggtitle()` are great tools to deal with labelling information ```r?example=ggplot -ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, - col=Species, - shape=Species)) + - geom_point(size=3) + - scale_color_viridis(discrete=TRUE,option = "D") + - theme_minimal(base_size = 12)+ +ggplot(iris, aes( + x = Sepal.Length, y = Sepal.Width, + col = Species, + shape = Species +)) + + geom_point(size = 3) + + scale_color_viridis(discrete = TRUE, option = "D") + + theme_minimal(base_size = 12) + ## Where the labelling comes in labs( ## Tell people what x and y variables are - x="Sepal Length", - y="Sepal Width", + x = "Sepal Length", + y = "Sepal Width", ## Title of the plot title = "Sepal length vs. Sepal width", subtitle = " plot within different Iris Species" @@ -238,25 +253,30 @@ ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, After the basic labelling, we want to make them nicer by playing around the postion and appearance (text size, color and faces). ```r?example=ggplot -ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, - col=Species, - shape=Species)) + - geom_point(size=3) + - scale_color_viridis(discrete=TRUE,option = "D") + +ggplot(iris, aes( + x = Sepal.Length, y = Sepal.Width, + col = Species, + shape = Species +)) + + geom_point(size = 3) + + scale_color_viridis(discrete = TRUE, option = "D") + labs( - x="Sepal Length", - y="Sepal Width", + x = "Sepal Length", + y = "Sepal Width", title = "Sepal length vs. Sepal width", subtitle = "plot within different Iris Species" - )+ + ) + theme_minimal(base_size = 12) + ## Change the title and subtitle position to the center - theme(plot.title = element_text(hjust = 0.5), - plot.subtitle = element_text(hjust = 0.5))+ + theme( + plot.title = element_text(hjust = 0.5), + plot.subtitle = element_text(hjust = 0.5) + ) + ## Change the appearance of the title and subtitle - theme (plot.title = element_text(color = "black", size = 14, face = "bold"), - plot.subtitle = element_text(color = "grey40",size = 10, face = 'italic') - ) + theme( + plot.title = element_text(color = "black", size = 14, face = "bold"), + plot.subtitle = element_text(color = "grey40", size = 10, face = "italic") + ) ``` ![Scatterplot with Elements Moved](Images/Styling_Scatterplots/R_label_2.png) @@ -271,24 +291,30 @@ After done with step 4, you should end with a very neat and unquie plot. Let's e According to the plot, it seems like there exists a linear relationship between sepal length and sepal width. Thus, let's add a linear trend to our scattplot to help readers see the pattern more directly using `geom_smooth()`. Note that the `method` argument in `geom_smooth()` allows to apply different smoothing method like glm, loess and more. See the [doc](https://ggplot2.tidyverse.org/reference/geom_smooth.html) for more. 
```r?example=ggplot -ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, - col=Species, - shape=Species)) + - geom_point(size=3) + - scale_color_viridis(discrete=TRUE,option = "D") + +ggplot(iris, aes( + x = Sepal.Length, y = Sepal.Width, + col = Species, + shape = Species +)) + + geom_point(size = 3) + + scale_color_viridis(discrete = TRUE, option = "D") + labs( - x="Sepal Length", - y="Sepal Width", + x = "Sepal Length", + y = "Sepal Width", title = "Sepal length vs. Sepal width", subtitle = "plot within different Iris Species" - )+ + ) + theme_minimal(base_size = 12) + - theme(plot.title = element_text(hjust = 0.5), - plot.subtitle = element_text(hjust = 0.5))+ - theme (plot.title = element_text(color = "black", size = 14, face = "bold"), - plot.subtitle = element_text(color = "grey40",size = 10, face = 'italic')) + + theme( + plot.title = element_text(hjust = 0.5), + plot.subtitle = element_text(hjust = 0.5) + ) + + theme( + plot.title = element_text(color = "black", size = 14, face = "bold"), + plot.subtitle = element_text(color = "grey40", size = 10, face = "italic") + ) + ## Where linear trend + confidence interval come in - geom_smooth(method = 'lm',se=TRUE) + geom_smooth(method = "lm", se = TRUE) ``` ![Scatterplot with Linear Trend](Images/Styling_Scatterplots/R_linear_trend.png) @@ -477,3 +503,4 @@ twoway (scatter sepwid seplen if iris==1, color("72 27 109") symbol(O)) /// ![Stata scatterplot with local polynomial lines.](Images/Styling_Scatterplots/stata_sc_9.png) And done. You can use the above guide to modify your plots as needed. + diff --git a/Presentation/Figures/bar_graphs.md b/Presentation/Figures/bar_graphs.md index 9c1c5ad6..486507e3 100644 --- a/Presentation/Figures/bar_graphs.md +++ b/Presentation/Figures/bar_graphs.md @@ -25,9 +25,12 @@ By far the quickest way to plot a bar chart is to use data analysis package [**p ```python?example=barpy import pandas as pd -df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/DAAG/Manitoba.lakes.csv", index_col=0) +df = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/DAAG/Manitoba.lakes.csv", + index_col=0, +) -df.plot.bar(y='area', legend=False, title='Area of lakes in Manitoba'); +df.plot.bar(y="area", legend=False, title="Area of lakes in Manitoba") ``` ![png](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/bar_plot_graphs/bar_py_1.png) @@ -39,10 +42,10 @@ This produces a functional, if not hugely attractive, plot. 
Calling the function ```python?example=barpy import matplotlib.pyplot as plt -plt.style.use('seaborn') +plt.style.use("seaborn") -ax = df.plot.bar(y='area', legend=False, ylabel='Area', rot=15) -ax.set_title('Area of lakes in Manitoba', loc='left'); +ax = df.plot.bar(y="area", legend=False, ylabel="Area", rot=15) +ax.set_title("Area of lakes in Manitoba", loc="left") ``` @@ -60,7 +63,7 @@ import seaborn as sns tips = sns.load_dataset("tips") -sns.barplot(x="day", y="total_bill", hue="sex", data=tips); +sns.barplot(x="day", y="total_bill", hue="sex", data=tips) ``` ![png](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/bar_plot_graphs/bar_py_3.png) @@ -74,8 +77,8 @@ from plotnine import ggplot, geom_bar, aes, labs ( ggplot(tips) - + geom_bar(aes(x='day'), colour='black', fill='blue') - + labs(x = "Day", y = "Number", title = "Number of diners") + + geom_bar(aes(x="day"), colour="black", fill="blue") + + labs(x="Day", y="Number", title="Number of diners") ) ``` @@ -101,17 +104,8 @@ This tutorial will use a dataset that already exists in R, so no need to load an - Next we want to tell `ggplot` what we want to map. We use the mapping function to do this. We set mapping to the aesthetic function. `(mapping = aes(x = species))` Within the `aes` function we want to specify what we want our `x` value to be, in this case `species`. Copy the code below to make your first bar graph! ```r?example=bargraph -starwars <- read.csv("https://github.com/LOST-STATS/LOST-STATS.github.io/raw/source/Presentation/Figures/Data/Bar_Graphs/star_wars_characters.csv") - - ggplot() + - geom_bar(data = starwars, mapping = aes(x = species)) - ``` - -![Unstyled R Bar Graph](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/bar_plot_graphs/r_bar_graph_1.png) -As you can see, there are some issues. We can't tell what the individual species are on the `x` axis. We also might want to give our graph a title, maybe give it some color, etc. How do we do this? By adding additional functions to our graph! - -```r?example=bargraph +``` ggplot(data = starwars) + geom_bar( mapping = aes(x = species), color = "black", fill = "blue") + labs(x = "Species", y = "Total", title = "Character Appearences in Movies by Species") + @@ -130,7 +124,7 @@ Stata, like R, also has pre-installed datasets available for use. To find them, This is fictionalized blood pressure data. In your variables column you should have five variables (`patient, sex, agegrp, when, bp`). Let's make a bar chart that looks at the patients within our dataset by gender and age. To make a bar chart type into your stata command console: -```stata +``` graph bar, over(sex) over(agegrp) ``` and the following output should appear in another window. @@ -139,7 +133,7 @@ and the following output should appear in another window. Congratulations, you've made your first bar chart in Stata! We can now visually see the make-up of our dataset by gender and age. We might want to change the axis labels or give this a title. To do so type the following in your command window: -```stata +``` graph bar, over(sex) over(agegrp) title(Our Graph) ytitle(Percent) ``` @@ -149,12 +143,6 @@ and the following graph shoud appear Notice we gave our graph a title and capitalized the y axis. Lets add some color next. 
To do so type -```stata -graph bar, over(sex) over(agegrp) title(Our Graph) ytitle(Percent) bar(1, fcolor(red)) bar(2, fcolor(blue)) ``` -and the following graph should appear - - -![Colored and Styled Stata Bar Graph](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/bar_plot_graphs/bar_graph_3.png) +graph bar, over(sex) over(agegrp) title(Our Graph) ytitle(Percent) bar(1, fcolor(red)) bar(2, fcolor(blue)) -Our bars are now red with a blue outline. Pretty neat! There are many sources of Stata help on the internet and many different way to customize your bar graphs. There is an official [Stata support](http://www.stata.com/support/) page that can answer queries regarding Stata. diff --git a/Presentation/Figures/density_plots.md b/Presentation/Figures/density_plots.md index da8438b6..34db31f9 100644 --- a/Presentation/Figures/density_plots.md +++ b/Presentation/Figures/density_plots.md @@ -128,7 +128,7 @@ print(diamonds.head()) ```python?example=densitypy -sns.kdeplot(data=diamonds, x="price", cut=0); +sns.kdeplot(data=diamonds, x="price", cut=0) ``` ![png](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/density_plot/py_density_plot_1.png) @@ -138,14 +138,16 @@ This is basic, but there are lots of ways to adjust it through keyword arguments Let's use further keyword arguments to enrich the plot, including different colours ('hues') for each cut of diamond. One keyword argument that may not be obvious is `hue_order`. The default function call would have arranged the `cut` types so that the 'Fair' cut obscured the other types, so the argument passed to the `hue_order` keyword below *reverses* the order of the unique list of diamond cuts via `[::-1]`. ```python?example=densitypy -sns.kdeplot(data=diamonds, - x="price", - hue="cut", - hue_order=diamonds['cut'].unique()[::-1], - fill=True, - alpha=.4, - linewidth=0.5, - cut=0.); +sns.kdeplot( + data=diamonds, + x="price", + hue="cut", + hue_order=diamonds["cut"].unique()[::-1], + fill=True, + alpha=0.4, + linewidth=0.5, + cut=0.0, +) ``` ![png](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/density_plot/py_density_plot_2.png) @@ -179,8 +181,8 @@ ggplot(diamonds, aes(x = price)) + We can always change the color of the density plot using the `col` argument and fill the color inside the density plot using `fill` argument. Furthermore, we can specify the degree of transparency density fill area using the argument `alpha` where `alpha` ranges from 0 to 1. ```r?example=density -ggplot(diamonds, aes(x = price))+ - geom_density(fill = "lightblue", col = 'black', alpha = 0.6) +ggplot(diamonds, aes(x = price)) + + geom_density(fill = "lightblue", col = "black", alpha = 0.6) ``` ![Colored density plot]({{ "/Presentation/Figures/Images/density_plot/2.png" | relative_url }}) @@ -188,7 +190,7 @@ We can also change the type of line of the density plot as well by adding `linet ```r?example=density ggplot(diamonds, aes(x = price)) + - geom_density(fill = "lightblue", col = 'black', linetype = "dashed") + geom_density(fill = "lightblue", col = "black", linetype = "dashed") ``` ![Density plot with linetype]({{ "/Presentation/Figures/Images/density_plot/3.png" | relative_url }}) @@ -197,7 +199,7 @@ Furthermore, you can also combine both histogram and density plots together. 
```r?example=density ggplot(diamonds, aes(x = price)) + geom_histogram(aes(y = ..density..), colour = "black", fill = "grey45") + - geom_density(col = "red", size = 1,linetype = "dashed") + geom_density(col = "red", size = 1, linetype = "dashed") ``` ![Density Plot Overlaid on Histogram]({{ "/Presentation/Figures/Images/density_plot/4.png" | relative_url }}) @@ -206,8 +208,8 @@ What happen if we want to make multiple densities? For example, we want to make multiple densities plots for price based on the type of cut, all we need to do is adding `fill=cut` inside `aes()`. ```r?example=density -ggplot(data=diamonds, aes(x = price, fill = cut)) + - geom_density(adjust = 1.5, alpha = .3) +ggplot(data = diamonds, aes(x = price, fill = cut)) + + geom_density(adjust = 1.5, alpha = .3) ``` ![multiple]({{ "/Presentation/Figures/Images/density_plot/5.png" | relative_url }}) @@ -228,4 +230,5 @@ use http://www.stata-press.com/data/r16/nhanes2.dta, clear *Plot the kernel density kdensity height, scheme(plottig) -``` \ No newline at end of file +``` + diff --git a/Presentation/Figures/faceted_graphs.md b/Presentation/Figures/faceted_graphs.md index 9c836d48..b114f655 100644 --- a/Presentation/Figures/faceted_graphs.md +++ b/Presentation/Figures/faceted_graphs.md @@ -39,9 +39,15 @@ df = sns.load_dataset("penguins") # Plot a scatter of bill properties with # columns (facets) given by island and colour # given by the species of Penguin -sns.relplot(x="bill_depth_mm", y="bill_length_mm", - hue="species", col="island", - alpha=.5, palette="muted", data=df) +sns.relplot( + x="bill_depth_mm", + y="bill_length_mm", + hue="species", + col="island", + alpha=0.5, + palette="muted", + data=df, +) ``` Results in: @@ -54,10 +60,12 @@ If you have used R for plotting, you might be familiar with the **ggplot** packa from plotnine import * from plotnine.data import mtcars -(ggplot(mtcars, aes('wt', 'mpg', color='factor(gear)')) - + geom_point() - + stat_smooth(method='lm') - + facet_wrap('~gear')) +( + ggplot(mtcars, aes("wt", "mpg", color="factor(gear)")) + + geom_point() + + stat_smooth(method="lm") + + facet_wrap("~gear") +) ``` Results in: @@ -75,26 +83,24 @@ x = np.linspace(0, 2 * np.pi, 400) y = np.sin(x ** 2) fig, (ax1, ax2) = plt.subplots(2, sharex=True) -fig.suptitle('Two sine waves') +fig.suptitle("Two sine waves") ax1.plot(x, y) -ax2.scatter(x + 1, -y, color='red') +ax2.scatter(x + 1, -y, color="red") ``` (NB: no figure shown in this case.) Note how everything is specified. While `plt.subplots(nrows, ncols, ...)` allows for a rectangular facet grid, even more complex facets can be constructed using the [mosaic option](https://matplotlib.org/3.3.0/tutorials/provisional/mosaic.html) in **matplotlib** version 3.3.0+. 
The arrangment of facets can be specified either through text, as in the example below, or with lists of lists: ```python - import matplotlib.pyplot as plt axd = plt.figure(constrained_layout=True).subplot_mosaic( """ TTE L.E - """) + """ +) for k, ax in axd.items(): - ax.text(0.5, 0.5, k, - ha='center', va='center', fontsize=36, - color='darkgrey') + ax.text(0.5, 0.5, k, ha="center", va="center", fontsize=36, color="darkgrey") ``` Results in: @@ -129,7 +135,7 @@ Additionally, one can create faceted graph using two variables with `facet_grid( library(tidyverse) ggplot(data = mpg) + - geom_point(mapping = aes(x = displ, y = hwy))+ + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl) ``` The code reults in the follwing panel of subplots: @@ -147,3 +153,4 @@ twoway (scatter mpg length), by(foreign) The code generates the following graph: ![Faceted Graph by Origin of Car](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/Faceted_Graphs/stata_faceted_graph.png) + diff --git a/Presentation/Figures/formatting_graph_legends.md b/Presentation/Figures/formatting_graph_legends.md index aa8f4ae3..11f08f74 100644 --- a/Presentation/Figures/formatting_graph_legends.md +++ b/Presentation/Figures/formatting_graph_legends.md @@ -35,7 +35,7 @@ The dataset used in this article will be the ***mtcars*** dataset as it comes wi ```r?example=legends fig1 <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) + - geom_point() + geom_point() fig1 ``` @@ -46,7 +46,6 @@ fig2 <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl), shape = factor(am))) + geom_point() fig2 + labs(colour = "Number of Cylinders", shape = "Transmission Type") - ``` To change the legend position use the `theme()` modifier in ggplot. From there you can choose top, right, bottom, left, or none (removes the legend). To put the legends inside the plot create column vector of size 2 (the first value refers to the x coordinate. while the second refers to the y) where both elements are between 0 and 1. To ensure that the whole legends is within the graph use the `legend.justification` to set the corner where you want the legend. @@ -68,7 +67,7 @@ There are other cool things you can do to the legend to better customize the vis ```r?example=legends fig3 <- fig2 + theme( - legend.box.background = element_rect(color="red", size=2), + legend.box.background = element_rect(color = "red", size = 2), legend.box.margin = margin(116, 6, 6, 6), legend.key = element_rect(fill = "white", colour = "black"), legend.text = element_text(size = 8, colour = "red") @@ -90,12 +89,12 @@ You can alternately remove legends (or components of legends) with `guides` ```r?example=legends # Here we've removed the color legend, but the shape legend is still there. fig5 <- fig2 + - guides(color = FALSE) + guides(color = FALSE) fig5 # This removes both fig6 <- fig2 + - guides(color = FALSE, shape = FALSE) + guides(color = FALSE, shape = FALSE) fig6 ``` @@ -163,3 +162,4 @@ In regards to legend positioning, the same rules discussed above apply. 
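Returning for a moment to the `legend.position` and `legend.justification` arguments described in the R section above, here is a minimal sketch of placing the legend inside the plotting area. It rebuilds a plot equivalent to the `fig2` object used earlier rather than assuming it still exists, and the `fig_inside` name is purely illustrative.

```r?example=legends
# Minimal sketch: place the legend inside the plot area.
# legend.position = c(x, y) uses coordinates between 0 and 1,
# and legend.justification = c(x, y) anchors the legend box by that corner
# so the whole legend stays inside the panel.
library(ggplot2)

fig_inside <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl), shape = factor(am))) +
  geom_point() +
  labs(colour = "Number of Cylinders", shape = "Transmission Type") +
  theme(
    legend.position = c(0.95, 0.95),
    legend.justification = c(1, 1)
  )
fig_inside
```
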
### Sources Stata's manual on two-way graphs: https://www.stata.com/manuals13/g-2graphtwowayline.pdf Stata's manual on legends: https://www.stata.com/manuals13/g-3legend_options.pdf + diff --git a/Presentation/Figures/heatmap_colored_correlation_matrix.md b/Presentation/Figures/heatmap_colored_correlation_matrix.md index df81bc4c..539126c0 100644 --- a/Presentation/Figures/heatmap_colored_correlation_matrix.md +++ b/Presentation/Figures/heatmap_colored_correlation_matrix.md @@ -46,8 +46,7 @@ from sklearn.datasets import fetch_california_housing data = fetch_california_housing() df = pd.DataFrame( - np.c_[data['data'], data['target']], - columns=data['feature_names'] + ['target'] + np.c_[data["data"], data["target"]], columns=data["feature_names"] + ["target"] ) # Create the correlation matrix @@ -66,16 +65,18 @@ cmap = sns.diverging_palette(220, 10, as_cmap=True) # Draw the heatmap with the mask and correct aspect ratio # More details at https://seaborn.pydata.org/generated/seaborn.heatmap.html sns.heatmap( - corr, # The data to plot - mask=mask, # Mask some cells - cmap=cmap, # What colors to plot the heatmap as - annot=True, # Should the values be plotted in the cells? - vmax=.3, # The maximum value of the legend. All higher vals will be same color - vmin=-.3, # The minimum value of the legend. All lower vals will be same color - center=0, # The center value of the legend. With divergent cmap, where white is - square=True, # Force cells to be square - linewidths=.5, # Width of lines that divide cells - cbar_kws={"shrink": .5} # Extra kwargs for the legend; in this case, shrink by 50% + corr, # The data to plot + mask=mask, # Mask some cells + cmap=cmap, # What colors to plot the heatmap as + annot=True, # Should the values be plotted in the cells? + vmax=0.3, # The maximum value of the legend. All higher vals will be same color + vmin=-0.3, # The minimum value of the legend. All lower vals will be same color + center=0, # The center value of the legend. With divergent cmap, where white is + square=True, # Force cells to be square + linewidths=0.5, # Width of lines that divide cells + cbar_kws={ + "shrink": 0.5 + }, # Extra kwargs for the legend; in this case, shrink by 50% ) # You can save this as a png with @@ -101,19 +102,19 @@ library(corrplot) data(mtcars) # Don't use too many variables or it will get messy! 
-mtcars <- mtcars[,c('mpg','cyl','disp','hp','drat','wt','qsec')] +mtcars <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")] # Create a corrgram corrplot(cor(mtcars), - # Using the color method for a heatmap - method = 'color', - # And the lower half only for easier readability - type = 'lower', - # Omit the 1's along the diagonal to bring variable names closer - diag = FALSE, - # Add the number on top of the color - addCoef.col = 'black' - ) + # Using the color method for a heatmap + method = "color", + # And the lower half only for easier readability + type = "lower", + # Omit the 1's along the diagonal to bring variable names closer + diag = FALSE, + # Add the number on top of the color + addCoef.col = "black" +) ``` This results in: @@ -155,41 +156,51 @@ C <- C %>% # Use tidyr's pivot_longer to reshape to long format # There are other ways to reshape too -C_Long <- pivot_longer(C, cols = c(mpg, cyl, disp, hp, drat, wt, qsec), - # We will want this option for sure if we dropped the - # upper half of the triangle earlier - values_drop_na = TRUE) %>% +C_Long <- pivot_longer(C, + cols = c(mpg, cyl, disp, hp, drat, wt, qsec), + # We will want this option for sure if we dropped the + # upper half of the triangle earlier + values_drop_na = TRUE +) %>% # Make both variables into factors - mutate(Variable = factor(Variable), - name = factor(name)) %>% + mutate( + Variable = factor(Variable), + name = factor(name) + ) %>% # Reverse the order of one of the variables so that the x and y variables have # Opposing orders, common for a correlation matrix mutate(Variable = factor(Variable, levels = rev(levels(.$Variable)))) # Now we graph! -ggplot(C_Long, - # Our x and y axis are Variable and name - # And we want to fill each cell with the value - aes(x = Variable, y = name, fill = value))+ +ggplot( + C_Long, + # Our x and y axis are Variable and name + # And we want to fill each cell with the value + aes(x = Variable, y = name, fill = value) +) + # geom_tile to draw the graph geom_tile() + # Color the graph as we like # Here our negative correlations are red, positive are blue # gradient2 instead of gradient gives us a "mid" color which we can make white - scale_fill_gradient2(low = "red", high = "blue", mid = "white", - midpoint = 0, limit = c(-1,1), space = "Lab", - name="Pearson\nCorrelation") + + scale_fill_gradient2( + low = "red", high = "blue", mid = "white", + midpoint = 0, limit = c(-1, 1), space = "Lab", + name = "Pearson\nCorrelation" + ) + # Axis names don't make much sense labs(x = NULL, y = NULL) + # We don't need that background theme_minimal() + # If we need more room for variable names at the bottom, rotate them - theme(axis.text.x = element_text(angle = 45, vjust = 1, - size = 12, hjust = 1)) + + theme(axis.text.x = element_text( + angle = 45, vjust = 1, + size = 12, hjust = 1 + )) + # We want those cells to be square! 
coord_fixed() + # If you also want the correlations to be written directly on there, add geom_text - geom_text(aes(label = round(value,3))) + geom_text(aes(label = round(value, 3))) ``` This results in: diff --git a/Presentation/Figures/histograms.md b/Presentation/Figures/histograms.md index 9cbe9592..8cf78e70 100644 --- a/Presentation/Figures/histograms.md +++ b/Presentation/Figures/histograms.md @@ -38,10 +38,11 @@ By far the quickest way to plot a histogram is to use data analysis package [**p ```python?example=histopy import pandas as pd -df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/PSID.csv", - index_col=0) +df = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/PSID.csv", index_col=0 +) -df['earnings'].plot.hist() +df["earnings"].plot.hist() ``` ![png](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/histogram_graphs/py_hist_1.png) @@ -53,12 +54,12 @@ We can make this plot a bit more appealing by calling on the customisation featu ```python?example=histopy import matplotlib.pyplot as plt -plt.style.use('seaborn') +plt.style.use("seaborn") -ax = df['earnings'].plot.hist(density=True, log=True, bins=80) -ax.set_title('Earnings in the PSID', loc='left') -ax.set_ylabel('Density') -ax.set_xlabel('Earnings'); +ax = df["earnings"].plot.hist(density=True, log=True, bins=80) +ax.set_title("Earnings in the PSID", loc="left") +ax.set_ylabel("Density") +ax.set_xlabel("Earnings") ``` ![png](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/histogram_graphs/py_hist_2.png) @@ -70,10 +71,12 @@ An alternative to the matplotlib-pandas combination is [**seaborn**](), a declar import seaborn as sns age_cut_off = 45 -df[f'Older than {age_cut_off}'] = df['age']>age_cut_off +df[f"Older than {age_cut_off}"] = df["age"] > age_cut_off -ax = sns.histplot(df, x="earnings", hue=f"Older than {age_cut_off}", element="step", stat="density") -ax.set_yscale('log') +ax = sns.histplot( + df, x="earnings", hue=f"Older than {age_cut_off}", element="step", stat="density" +) +ax.set_yscale("log") ``` ![png](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/histogram_graphs/py_hist_3.png) @@ -83,11 +86,7 @@ Finally, let's look at a different declarative library, [**plotnine**](https://p ```python?example=histopy from plotnine import ggplot, aes, geom_histogram -( - ggplot(df, aes(x='earnings', y='stat(density)') - ) - + geom_histogram(bins=80) -) +(ggplot(df, aes(x="earnings", y="stat(density)")) + geom_histogram(bins=80)) ``` ![png](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/histogram_graphs/py_hist_4.png) @@ -101,7 +100,7 @@ Histograms can be represented using base `R`, or more elegantly with `ggplot`. 
` # loading the data -incomes = data.frame(income = state.x77[,'Income']) +incomes <- data.frame(income = state.x77[, "Income"]) # first using base R @@ -149,3 +148,4 @@ histogram mpg, bin(15) frequency hist mpg, width(2) frequency ``` + diff --git a/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.md b/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.md index beda7e76..8e099d34 100644 --- a/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.md +++ b/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.md @@ -37,15 +37,17 @@ import numpy as np import matplotlib.dates as mdates # Read in the data -df = pd.read_csv('https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Presentation/Figures/Data/Line_Graph_with_Labels_at_the_Beginning_or_End_of_Lines/Research_Nobel_Google_Trends.csv', - parse_dates=['date']) +df = pd.read_csv( + "https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Presentation/Figures/Data/Line_Graph_with_Labels_at_the_Beginning_or_End_of_Lines/Research_Nobel_Google_Trends.csv", + parse_dates=["date"], +) # Create the column we wish to plot -title = 'Log of Google Trends Index' -df[title] = np.log(df['hits']) +title = "Log of Google Trends Index" +df[title] = np.log(df["hits"]) # Set a style for the plot -plt.style.use('ggplot') +plt.style.use("ggplot") # Make a plot fig, ax = plt.subplots() @@ -56,38 +58,41 @@ sns.lineplot(ax=ax, data=df, x="date", y=title, hue="name", legend=None) # Add the text--for each line, find the end, annotate it with a label, and # adjust the chart axes so that everything fits on. for line, name in zip(ax.lines, df.columns.tolist()): - y = line.get_ydata()[-1] - x = line.get_xdata()[-1] - if not np.isfinite(y): - y=next(reversed(line.get_ydata()[~line.get_ydata().mask]),float("nan")) - if not np.isfinite(y) or not np.isfinite(x): - continue - text = ax.annotate(name, - xy=(x, y), - xytext=(0, 0), - color=line.get_color(), - xycoords=(ax.get_xaxis_transform(), - ax.get_yaxis_transform()), - textcoords="offset points") - text_width = (text.get_window_extent( - fig.canvas.get_renderer()).transformed(ax.transData.inverted()).width) - if np.isfinite(text_width): - ax.set_xlim(ax.get_xlim()[0], text.xy[0] + text_width * 1.05) + y = line.get_ydata()[-1] + x = line.get_xdata()[-1] + if not np.isfinite(y): + y = next(reversed(line.get_ydata()[~line.get_ydata().mask]), float("nan")) + if not np.isfinite(y) or not np.isfinite(x): + continue + text = ax.annotate( + name, + xy=(x, y), + xytext=(0, 0), + color=line.get_color(), + xycoords=(ax.get_xaxis_transform(), ax.get_yaxis_transform()), + textcoords="offset points", + ) + text_width = ( + text.get_window_extent(fig.canvas.get_renderer()) + .transformed(ax.transData.inverted()) + .width + ) + if np.isfinite(text_width): + ax.set_xlim(ax.get_xlim()[0], text.xy[0] + text_width * 1.05) # Format the date axis to be prettier. 
-ax.xaxis.set_major_formatter(mdates.DateFormatter('%b-%d')) +ax.xaxis.set_major_formatter(mdates.DateFormatter("%b-%d")) ax.xaxis.set_minor_locator(mdates.DayLocator()) ax.xaxis.set_major_locator(mdates.AutoDateLocator(interval_multiples=False)) plt.tight_layout() plt.show() - ``` ![Line Graph of Search Popularity for Research Nobels in Python.](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/Line_Graph_with_Labels_at_the_Beginning_or_End_of_Lines/py_line_labels.png) ## R -```R +```r # If necessary, install ggplot2, lubridate, and directlabels # install.packages(c('ggplot2','directlabels', 'lubridate')) library(ggplot2) @@ -96,7 +101,7 @@ library(directlabels) # Load in Google Trends Nobel Search Data # Which contains the Google Trends global search popularity index for the four # research-based Nobel prizes over a month. -df <- read.csv('https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Presentation/Figures/Data/Line_Graph_with_Labels_at_the_Beginning_or_End_of_Lines/Research_Nobel_Google_Trends.csv') +df <- read.csv("https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Presentation/Figures/Data/Line_Graph_with_Labels_at_the_Beginning_or_End_of_Lines/Research_Nobel_Google_Trends.csv") # Properly treat our date variable as a date # Not necessary in all applications of this technique. @@ -105,20 +110,22 @@ df$date <- lubridate::ymd(df$date) # Construct our standard ggplot line graph # Drawing separate lines by name # And using the log of hits for visibility -ggplot(df, aes(x = date, y = log(hits), color = name)) + - labs(x = "Date", - y = "Log of Google Trends Index")+ - geom_line()+ +ggplot(df, aes(x = date, y = log(hits), color = name)) + + labs( + x = "Date", + y = "Log of Google Trends Index" + ) + + geom_line() + # Since we are about to add line labels, we don't need a legend theme(legend.position = "none") + - # Add, from the directlabels package, - # geom_dl, using method = 'last.bumpup' to put the - # labels at the end, and make sure that if they intersect, + # Add, from the directlabels package, + # geom_dl, using method = 'last.bumpup' to put the + # labels at the end, and make sure that if they intersect, # one is bumped up - geom_dl(aes(label = name), method = 'last.bumpup') + - # Extend the x axis so the labels are visible - + geom_dl(aes(label = name), method = "last.bumpup") + + # Extend the x axis so the labels are visible - # Try the graph a few times until you find a range that works - scale_x_date(limits = c(min(df$date), lubridate::ymd('2019-10-25'))) + scale_x_date(limits = c(min(df$date), lubridate::ymd("2019-10-25"))) ``` This results in: @@ -159,7 +166,7 @@ foreach n in `names' { * Add in the line graph code * by building on the local we already have (`lines') and adding a new twoway segment local lines `lines' (line loghits ymddate if name == "`n'") - + * Figure out the value this line hits on the last point on the graph quietly summ loghits if name == "`n'" & ymddate == `lastday' * The text command takes the y-value (from the mean we just took) @@ -182,3 +189,4 @@ twoway `lines', `textlabs' legend(off) xscale(range(`start' `end')) xtitle("Date This results in: ![Line Graph of Search Popularity for Research Nobels in Stata](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/Line_Graph_with_Labels_at_the_Beginning_or_End_of_Lines/stata_line_graph_with_labels.png) + diff --git a/Presentation/Figures/line_graphs.md 
b/Presentation/Figures/line_graphs.md index 2a45fa42..d4356090 100644 --- a/Presentation/Figures/line_graphs.md +++ b/Presentation/Figures/line_graphs.md @@ -35,17 +35,17 @@ import matplotlib.pyplot as plt import seaborn as sns # Load in data -Orange = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Orange.csv') +Orange = pd.read_csv( + "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Orange.csv" +) # Specify a line plot in Seaborn using # age and circumference on the x and y axis # and picking just Tree 1 from the data -sns.lineplot(x = 'age', - y = 'circumference', - data = Orange.loc[Orange.Tree == 1]) +sns.lineplot(x="age", y="circumference", data=Orange.loc[Orange.Tree == 1]) # And title the axes -plt.xlabel('Age (days since 12/31/1968)') -plt.ylabel('Circumference') +plt.xlabel("Age (days since 12/31/1968)") +plt.ylabel("Circumference") ``` The result is: @@ -57,13 +57,10 @@ If we want to include all the trees on the graph, with color to distinguish them ```python?example=seaborn # Add on a hue axis to add objects of different color by tree # So we can graph all the trees -sns.lineplot(x = 'age', - y = 'circumference', - hue = 'Tree', - data = Orange) +sns.lineplot(x="age", y="circumference", hue="Tree", data=Orange) # And title the axes -plt.xlabel('Age (days since 12/31/1968)') -plt.ylabel('Circumference') +plt.xlabel("Age (days since 12/31/1968)") +plt.ylabel("Circumference") ``` Which results in: @@ -76,19 +73,19 @@ Which results in: To make a line graph in R, we'll be using a dataset that's already built in to R, called 'Orange'. This dataset tracks the growth in circumference of several trees as they age. -```R?example=basicline +```r?example=basicline library(dplyr) library(lubridate) library(ggplot2) -#load in dataset +# load in dataset data(Orange) ``` This dataset has measurements for four different trees. To start off, we'll only be graphing the growth of Tree #1, so we first need to subset our data. -```R?example=basicline -#subset data to just tree #1 +```r?example=basicline +# subset data to just tree #1 tree_1_df <- Orange %>% filter(Tree == 1) ``` @@ -98,9 +95,9 @@ Then we will construct our plot using `ggplot()`. We'll create our line graph us - To make the actual line of the line graph, we will add the line geom_line() to our ggplot line using the `+` symbol. Using the `+` symbol allows us to add different lines of code to the same graph in order to create new elements within it. - Putting those steps together, we get the following code resulting in our first line graph: -```R?example=basicline +```r?example=basicline ggplot(tree_1_df, aes(x = age, y = circumference)) + - geom_line() + geom_line() ``` ![Unstyled R Line Graph](Images/Line_Graphs/line_graph_basic_R.png) @@ -112,12 +109,14 @@ This does show us how the tree grows over time, but it's rather plain and lacks - Using the function `theme()` allows us to manipulate the apperance of our labels through the element_text function - Let's change the line color, add a title and center it, and also add more information to our axes labels. 
-```R?example=basicline +```r?example=basicline ggplot(tree_1_df, aes(x = age, y = circumference)) + - geom_line(color = "orange") + - labs(x = "Age (days since 12/31/1968)", y = "Circumference (mm)", - title = "Orange Tree Circumference Growth by Age") + - theme(plot.title = element_text(hjust = 0.5)) + geom_line(color = "orange") + + labs( + x = "Age (days since 12/31/1968)", y = "Circumference (mm)", + title = "Orange Tree Circumference Growth by Age" + ) + + theme(plot.title = element_text(hjust = 0.5)) ``` ![Styled R Line Graph](Images/Line_Graphs/line_graph_styled_R.png) @@ -128,11 +127,11 @@ A great way to employ line graphs is to compare the changes of different values To add multiple lines using data from the same dataframe, simply add the `color` argument to the `aes()` function within our `ggplot()` line. Set the color argument to the identifying variable within your data set, here, that variable is `Tree`, so we will set `color = Tree`. -```R?example=basicline +```r?example=basicline ggplot(Orange, aes(x = age, y = circumference, color = Tree)) + - geom_line() + - labs(x = "Age (days since 12/31/1968)", y = "Circumference (mm)", title = "Orange Tree Circumference Growth by Age") + - theme(plot.title = element_text(hjust = 0.5)) + geom_line() + + labs(x = "Age (days since 12/31/1968)", y = "Circumference (mm)", title = "Orange Tree Circumference Growth by Age") + + theme(plot.title = element_text(hjust = 0.5)) ``` ![R Line Graph with Multiple Lines](Images/Line_Graphs/line_graph_multi_R.png) diff --git a/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.md b/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.md index 0740b3e4..77016a25 100644 --- a/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.md +++ b/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.md @@ -36,64 +36,61 @@ import matplotlib.pyplot as plt import linearmodels as lm # Read in data -od = pd.read_csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Data/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/organ_donation.csv') +od = pd.read_csv( + "https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Data/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/organ_donation.csv" +) # Create Treatment Variable -od['California'] = od['State'] == 'California' +od["California"] = od["State"] == "California" # PanelOLS requires a numeric time variable -od['Qtr'] = 1 -od.loc[od['Quarter'] == 'Q12011', 'Qtr'] = 2 -od.loc[od['Quarter'] == 'Q22011', 'Qtr'] = 3 -od.loc[od['Quarter'] == 'Q32011', 'Qtr'] = 4 -od.loc[od['Quarter'] == 'Q42011', 'Qtr'] = 5 -od.loc[od['Quarter'] == 'Q12012', 'Qtr'] = 6 +od["Qtr"] = 1 +od.loc[od["Quarter"] == "Q12011", "Qtr"] = 2 +od.loc[od["Quarter"] == "Q22011", "Qtr"] = 3 +od.loc[od["Quarter"] == "Q32011", "Qtr"] = 4 +od.loc[od["Quarter"] == "Q42011", "Qtr"] = 5 +od.loc[od["Quarter"] == "Q12012", "Qtr"] = 6 # Create our interactions by hand, # skipping quarter 3, the last one before treatment for i in range(1, 7): - name = f"INX{i}" - od[name] = 1 * od['California'] - od.loc[od['Qtr'] != i, name] = 0 + name = f"INX{i}" + od[name] = 1 * od["California"] + od.loc[od["Qtr"] != i, name] = 0 # Set our individual and time (index) for our data -od = od.set_index(['State','Qtr']) +od = od.set_index(["State", "Qtr"]) -mod = lm.PanelOLS.from_formula('''Rate ~ +mod = 
lm.PanelOLS.from_formula( + """Rate ~ INX1 + INX2 + INX4 + INX5 + INX6 + -EntityEffects + TimeEffects''',od) +EntityEffects + TimeEffects""", + od, +) # Specify clustering when we fit the model -clfe = mod.fit(cov_type = 'clustered', - cluster_entity = True) +clfe = mod.fit(cov_type="clustered", cluster_entity=True) # Get coefficients and CIs -res = pd.concat([clfe.params, clfe.std_errors], axis = 1) +res = pd.concat([clfe.params, clfe.std_errors], axis=1) # Scale standard error to CI -res['ci'] = res['std_error']*1.96 +res["ci"] = res["std_error"] * 1.96 # Add our quarter values -res['Qtr'] = [1, 2, 4, 5, 6] +res["Qtr"] = [1, 2, 4, 5, 6] # And add our reference period back in -reference = pd.DataFrame([[0,0,0,3]], - columns = ['parameter', - 'lower', - 'upper', - 'Qtr']) +reference = pd.DataFrame([[0, 0, 0, 3]], columns=["parameter", "lower", "upper", "Qtr"]) res = pd.concat([res, reference]) # For plotting, sort and add labels -res = res.sort_values('Qtr') -res['Quarter'] = ['Q42010','Q12011', - 'Q22011','Q32011', - 'Q42011','Q12012'] +res = res.sort_values("Qtr") +res["Quarter"] = ["Q42010", "Q12011", "Q22011", "Q32011", "Q42011", "Q12012"] # Plot the estimates as connected lines with error bars -plt.errorbar(x = 'Quarter', y = 'parameter', - yerr = 'ci', data = res) +plt.errorbar(x="Quarter", y="parameter", yerr="ci", data=res) # Add a horizontal line at 0 -plt.axhline(0, linestyle = 'dashed') +plt.axhline(0, linestyle="dashed") ``` ![Categorical marginal effect plot in Python](https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Images/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/Python_Categorical_Interaction_Effect.png) @@ -109,19 +106,19 @@ library(tidyverse) library(fixest) library(broom) -od <- read_csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Data/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/organ_donation.csv') +od <- read_csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Data/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/organ_donation.csv") # Treatment variable od <- od %>% - mutate(Treated = State == 'California' & - Quarter %in% c('Q32011','Q42011','Q12012')) %>% + mutate(Treated = State == "California" & + Quarter %in% c("Q32011", "Q42011", "Q12012")) %>% # Create an ordered version of Quarter so we can graph it # and make sure we drop the last pre-treatment interaction, # which is quarter 2 of 2011 - mutate(Quarter = relevel(factor(Quarter), ref = 'Q22011')) %>% + mutate(Quarter = relevel(factor(Quarter), ref = "Q22011")) %>% # The treated group is the state of California # The 1* is only necessary for the first fixest method below; optional for the second, more general method - mutate(California = 1*(State == 'California')) + mutate(California = 1 * (State == "California")) ``` Next, our steps to do the **fixest**-specific method: @@ -129,39 +126,50 @@ Next, our steps to do the **fixest**-specific method: ```r?example=categorical # in the *specific example* of fixest, there is a simple and easy method: od <- od %>% mutate(fQuarter = factor(Quarter, - levels = c('Q42010','Q12011','Q22011', - 'Q32011','Q42011','Q12012'))) -femethod <- feols(Rate ~ i(California, fQuarter, drop = 'Q22011') | - State + Quarter, data = od) - -coefplot(femethod, ref = c('Q22011' = 3), pt.join = TRUE) + levels = c( + "Q42010", "Q12011", "Q22011", + "Q32011", "Q42011", "Q12012" + ) +)) +femethod <- feols(Rate ~ 
i(California, fQuarter, drop = "Q22011") | + State + Quarter, data = od) + +coefplot(femethod, ref = c("Q22011" = 3), pt.join = TRUE) ``` However, for other packages this may not work, so I will also do it by hand in a way that will work with models more generally (even though we'll still run the model in fixest): ```rr?example=categorical # Interact quarter with being in the treated group -clfe <- feols(Rate ~ California*Quarter | State, - data = od) +clfe <- feols(Rate ~ California * Quarter | State, + data = od +) -coefplot(clfe, ref = 'Q22011') +coefplot(clfe, ref = "Q22011") # Use broom::tidy to get the coefficients and SEs res <- tidy(clfe) %>% # Keep only the interactions - filter(str_detect(term, ':')) %>% + filter(str_detect(term, ":")) %>% # Pull the quarter out of the term mutate(Quarter = str_sub(term, -6)) %>% # Add in the term we dropped as 0 - add_row(estimate = 0, std.error = 0, - Quarter = 'Q22011') %>% + add_row( + estimate = 0, std.error = 0, + Quarter = "Q22011" + ) %>% # and add 95% confidence intervals - mutate(ci_bottom = estimate - 1.96*std.error, - ci_top = estimate + 1.96*std.error) %>% + mutate( + ci_bottom = estimate - 1.96 * std.error, + ci_top = estimate + 1.96 * std.error + ) %>% # And put the quarters in order mutate(Quarter = factor(Quarter, - levels = c('Q42010','Q12011','Q22011', - 'Q32011','Q42011','Q12012'))) + levels = c( + "Q42010", "Q12011", "Q22011", + "Q32011", "Q42011", "Q12012" + ) + )) # And graph @@ -174,9 +182,9 @@ ggplot(res, aes(x = Quarter, y = estimate, group = 1)) + # Add confidence intervals geom_linerange(aes(ymin = ci_bottom, ymax = ci_top)) + # Add a line so we know where 0 is - geom_hline(aes(yintercept = 0), linetype = 'dashed') + + geom_hline(aes(yintercept = 0), linetype = "dashed") + # Always label! - labs(caption = '95% Confidence Intervals Shown') + labs(caption = "95% Confidence Intervals Shown") ``` ![Categorical marginal effect plot in R/ggplot2](https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Images/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/R_Categorical_Interaction_Effect.png) @@ -231,3 +239,4 @@ twoway (sc coef Qtr, connect(line)) (rcap ci_top ci_bottom Qtr) (function y = 0, ``` ![Categorical marginal effect plot in Stata](https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Images/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/Stata_Categorical_Interaction_Effect.png) + diff --git a/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.md b/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.md index 233f8ea4..aa26f8cb 100644 --- a/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.md +++ b/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.md @@ -42,19 +42,23 @@ library(interplot) data(txhousing) # Estimate a regression with a nonlinear term -cubic_model <- lm(sales ~ listings + I(listings^2) + - I(listings^3), - data = txhousing) +cubic_model <- lm(sales ~ listings + I(listings^2) + + I(listings^3), +data = txhousing +) # Get the marginal effect of var1 (listings) # at different values of var2 (listings), with confidence ribbon. -# This will return a ggplot object, so you can +# This will return a ggplot object, so you can # customize using ggplot elements like labs(). 
-interplot(cubic_model, - var1 = "listings", - var2 = "listings")+ - labs(x = "Number of Listings", - y = "Marginal Effect of Listings") +interplot(cubic_model, + var1 = "listings", + var2 = "listings" +) + + labs( + x = "Number of Listings", + y = "Marginal Effect of Listings" + ) # Try setting adding listings*date to the regression model # and then in interplot set var2 = "date" to get the effect of listings at different values of date ``` @@ -77,7 +81,7 @@ regress wage c.tenure##c.tenure * Put the variable we're interested in getting the effect of in dydx() * And the values we want to evaluate it at in at() margins, dydx(tenure) at(tenure = (0(1)26)) -* (If we had interacted with another variable, say age, we would specify similarly, +* (If we had interacted with another variable, say age, we would specify similarly, * with at(age = (start(count-by)end))) * Then, marginsplot @@ -88,3 +92,4 @@ marginsplot, xtitle("Tenure") ytitle("Marginal Effect of Tenure") recast(line) r This results in: ![Marginal effect of tenure varying over tenure, produced with Stata.](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/Marginal-Effects-Plots-for-Interactions-with-Continuous-Variables/stata_marginal_effects_continuous_interaction.png) + diff --git a/Presentation/Figures/scatterplot_by_group_on_shared_axes.md b/Presentation/Figures/scatterplot_by_group_on_shared_axes.md index b4a74da1..21954559 100644 --- a/Presentation/Figures/scatterplot_by_group_on_shared_axes.md +++ b/Presentation/Figures/scatterplot_by_group_on_shared_axes.md @@ -13,7 +13,7 @@ nav_order: 1 ## Keep in Mind -- Scatterplots may not work well if the data is discrete, or if there are a large number of data points. +- Scatterplots may not work well if the data is discrete, or if there are a large number of data points. ## Also Consider @@ -32,15 +32,18 @@ data(mtcars) # Make sure that our grouping variable is a factor # and labeled properly -mtcars$Transmission <- factor(mtcars$am, - labels = c("Automatic", "Manual")) - -# Put wt on the x-axis, mpg on the y-axis, -ggplot(mtcars, aes(x = wt, y = mpg, - # distinguish the Transmission values by color, - color = Transmission)) + +mtcars$Transmission <- factor(mtcars$am, + labels = c("Automatic", "Manual") +) + +# Put wt on the x-axis, mpg on the y-axis, +ggplot(mtcars, aes( + x = wt, y = mpg, + # distinguish the Transmission values by color, + color = Transmission +)) + # make it a scatterplot with geom_point() - geom_point()+ + geom_point() + # And label properly labs(x = "Car Weight", y = "MPG") ``` @@ -66,3 +69,4 @@ twoway (scatter weight mpg if foreign == 0, mcolor(black)) (scatter weight mpg i This results in: ![Scatterplot of car weight against MPG, differentiated by foreign, in Stata](https://github.com/LOST-STATS/LOST-STATS.github.io/raw/master/Presentation/Figures/Images/Scatterplot-by-Groups-on-Shared-Axes/stata_scatterplot_by_group.png) + diff --git a/Presentation/Figures/styling_line_graphs.md b/Presentation/Figures/styling_line_graphs.md index 7ea81088..51bd6592 100644 --- a/Presentation/Figures/styling_line_graphs.md +++ b/Presentation/Figures/styling_line_graphs.md @@ -32,7 +32,7 @@ eco_df <- economics ## basic plot p1 <- ggplot() + - geom_line(aes(x=date, y = uempmed), data = eco_df) + geom_line(aes(x = date, y = uempmed), data = eco_df) p1 ## Change line color and chart labels @@ -41,29 +41,35 @@ p1 ## a different line for each value of the factor variable, colored differently. 
p2 <- ggplot() + ## choose a color of preference - geom_line(aes(x=date, y = uempmed), color = "navyblue", data = eco_df) + + geom_line(aes(x = date, y = uempmed), color = "navyblue", data = eco_df) + ## add chart title and change axes labels labs( title = "Median Duration of Unemployment", x = "Date", - y = "") + + y = "" + ) + ## Add a ggplot theme theme_light() - ## center the chart title - theme(plot.title = element_text(hjust = 0.5)) + +## center the chart title +theme(plot.title = element_text(hjust = 0.5)) + -p2 + p2 ## plotting multiple charts (of different line types and sizes) -p3 <-ggplot() + - geom_line(aes(x=date, y = uempmed), color = "navyblue", - size = 1.5, data = eco_df) + - geom_line(aes(x=date, y = psavert), color = "red2", - linetype = "dotted", size = 0.8, data = eco_df) + +p3 <- ggplot() + + geom_line(aes(x = date, y = uempmed), + color = "navyblue", + size = 1.5, data = eco_df + ) + + geom_line(aes(x = date, y = psavert), + color = "red2", + linetype = "dotted", size = 0.8, data = eco_df + ) + labs( title = "Unemployment Duration (Blue) and Savings Rate (Red)", x = "Date", - y = "") + + y = "" + ) + theme_light() + theme(plot.title = element_text(hjust = 0.5)) @@ -71,36 +77,41 @@ p3 ## Plotting a different line type for each group ## There isn't a natural factor in this data so let's just duplicate the data and make one up -eco_df$fac <- factor(1, levels = c(1,2)) +eco_df$fac <- factor(1, levels = c(1, 2)) eco_df2 <- eco_df eco_df2$fac <- 2 eco_df2$uempmed <- eco_df2$uempmed - 2 + rnorm(nrow(eco_df2)) eco_df <- rbind(eco_df, eco_df2) p4 <- ggplot() + ## This time, color goes inside aes - geom_line(aes(x=date, y = uempmed, color = fac), data = eco_df) + + geom_line(aes(x = date, y = uempmed, color = fac), data = eco_df) + ## add chart title and change axes labels labs( title = "Median Duration of Unemployment", x = "Date", - y = "") + + y = "" + ) + ## Add a ggplot theme theme_light() + ## center the chart title - theme(plot.title = element_text(hjust = 0.5), - ## Move the legend onto some blank space on the diagram - legend.position = c(.25,.8), - ## And put a box around it - legend.background = element_rect(color="black")) + + theme( + plot.title = element_text(hjust = 0.5), + ## Move the legend onto some blank space on the diagram + legend.position = c(.25, .8), + ## And put a box around it + legend.background = element_rect(color = "black") + ) + ## Retitle the legend that pops up to explain the discrete (factor) difference in colors ## (note if we just want a name change we could do guides(color = guide_legend(title = 'Random Factor')) instead) - scale_color_manual(name = "Random Factor", - # And specify the colors for the factor levels (1 and 2) by hand if we like - values = c("1" = "red", "2" = "blue")) + scale_color_manual( + name = "Random Factor", + # And specify the colors for the factor levels (1 and 2) by hand if we like + values = c("1" = "red", "2" = "blue") + ) p4 # Put them all together with cowplot for LOST upload -plot_grid(p1,p2,p3,p4, nrow=2) +plot_grid(p1, p2, p3, p4, nrow = 2) ``` The four plots generated by the code are (in order p1, p2, then p3 and p4): @@ -108,7 +119,7 @@ The four plots generated by the code are (in order p1, p2, then p3 and p4): ## Stata -In Stata, one can create plot lines using the command `line`, which in combination with `twoway` allows you to modify components of sub-plots individually. In this demonstration, I will use minimal formatting, but will apply minimal modifications using Ben Jann's `grstyle`. 
+In Stata, one can create plot lines using the command `line`, which in combination with `twoway` allows you to modify components of sub-plots individually. In this demonstration, I will use minimal formatting, but will apply minimal modifications using Ben Jann's `grstyle`. ```stata ** Setup: Install grstyle @@ -173,7 +184,7 @@ line uempmed date, sort /// ![Stata line graph with title](Images/Styling-Line-Graphs/stsc_4.png) ### Changing Line characteristics. -It is also possible to modify the line width `lwidth()`, line color `lcolor()`, and line pattern `lpattern()`. To show how this can affect the plot, below 4 examples are provided. +It is also possible to modify the line width `lwidth()`, line color `lcolor()`, and line pattern `lpattern()`. To show how this can affect the plot, below 4 examples are provided. Notice that each plot is saved in memory using `name()`, and all are combined using `graph combine`. @@ -208,11 +219,11 @@ You may also want to plot multiple variables in the same figure. There are two w ```stata twoway (line uempmed date, sort lwidth(.75) lpattern(solid) ) /// (line psavert date, sort lwidth(.25) lpattern(dash) ), /// - legend (order(1 "Unemployment duration" 2 "Saving rate")) - + legend (order(1 "Unemployment duration" 2 "Saving rate")) + line uempmed psavert date, sort lwidth(0.75 .25) lpattern(solid dash) /// - legend(order(1 "Unemployment duration" 2 "Saving rate")) -``` + legend(order(1 "Unemployment duration" 2 "Saving rate")) +``` Both options provide the same figure, however, I prefer the first option since that allows for more flexibility. ![Stata line graph with multiple lines](Images/Styling-Line-Graphs/stsc_6.png) @@ -223,7 +234,7 @@ You can also choose to plot each variable in a different `axis`. Each axis can h twoway (line uempmed date, sort lwidth(.75) lpattern(solid) yaxis(1)) /// (line psavert date, sort lwidth(.25) lpattern(dash) yaxis(2)), /// legend(order(1 "Unemployment duration" 2 "Saving rate")) /// - ytitle(Weeks ,axis(1) ) ytitle(Interest rate,axis(2) ) + ytitle(Weeks ,axis(1) ) ytitle(Interest rate,axis(2) ) ``` ![Stata line graph with multiple axes](Images/Styling-Line-Graphs/stsc_7.png) @@ -231,7 +242,7 @@ twoway (line uempmed date, sort lwidth(.75) lpattern(solid) yaxis(1)) /// Finally, it is possible to add vertical lines. This may be useful, for example, to differentiate the great recession period. Additionally, in this plot, I add a note. -```stata +```stata twoway (line uempmed date, sort lwidth(.75) lpattern(solid) yaxis(1)) /// (line psavert date, sort lwidth(.25) lpattern(dash) yaxis(2)), /// legend(order(1 "Unemployment duration" 2 "Saving rate")) /// diff --git a/Presentation/Presentation.md b/Presentation/Presentation.md index fa3c5602..dec24a52 100644 --- a/Presentation/Presentation.md +++ b/Presentation/Presentation.md @@ -5,3 +5,4 @@ nav_order: 6 --- # Presentation + diff --git a/Presentation/Tables/Balance_Tables.md b/Presentation/Tables/Balance_Tables.md index 2480a9ac..33f28a7f 100644 --- a/Presentation/Tables/Balance_Tables.md +++ b/Presentation/Tables/Balance_Tables.md @@ -8,7 +8,7 @@ nav_order: 1 # Balance Tables -Balance Tables are a method by which you can statistically compare differences in characteristics between a treatment and control group. Common in experimental work and when using matching estimators, balance tables show if the treatment and control group are 'balanced' and can be seen as similarly 'identical' for comparison of a causal effect. 
+Balance Tables are a method by which you can statistically compare differences in characteristics between a treatment and control group. Common in experimental work and when using matching estimators, balance tables show if the treatment and control group are 'balanced' and can be seen as similarly 'identical' for comparison of a causal effect. ## Keep in Mind @@ -36,24 +36,25 @@ Another approach provides an omnibus summary of overall balance on many covariat ```r library(RItools) -options(show.signif.stars=FALSE,digits=3) -xb_res <- xBalance(am~mpg+hp+cyl+wt,strata=list(nostrat=NULL,vsstrat=~vs),data=mtcars,report="all") +options(show.signif.stars = FALSE, digits = 3) +xb_res <- xBalance(am ~ mpg + hp + cyl + wt, strata = list(nostrat = NULL, vsstrat = ~vs), data = mtcars, report = "all") xb_res$overall -xb_res$results[,c(1:3,6:7),] +xb_res$results[, c(1:3, 6:7), ] ``` ## Stata ```stata -* Import Dependency: 'ssc install table1' +* Import Dependency: 'ssc install table1' * Load Data sysuse auto, clear * Create Balance Table -* You need to declare the kind of variable for each, as well as the variable by which you define treatment and control. +* You need to declare the kind of variable for each, as well as the variable by which you define treatment and control. * Adding test gives the statistical difference between the two groups. The ending saves your output as an .xls file table1, by(foreign) vars(price conts \ mpg conts \ weight contn \ length conts) test saving(bal_tab.xls, replace) ``` #### Also Consider The World Bank's very useful [ietoolkit](https://blogs.worldbank.org/impactevaluations/ie-analytics-introducing-ietoolkit) for Stata has a very flexible command for creating balance tables, iebaltab. You can learn more about how to use it on their [Wiki page on the command](https://dimewiki.worldbank.org/wiki/Iebaltab). + diff --git a/Presentation/Tables/Regression_Tables.md b/Presentation/Tables/Regression_Tables.md index 431ee337..2f2163b5 100644 --- a/Presentation/Tables/Regression_Tables.md +++ b/Presentation/Tables/Regression_Tables.md @@ -9,7 +9,7 @@ mathjax: true ## Switch to false if this page has no equations or other math ren # Regression Tables -Statistical packages often report regression results in a way that is not how you would want to display them in a paper or on a website. Additionally, they rarely provide an option to display multiple regression results in the same table. +Statistical packages often report regression results in a way that is not how you would want to display them in a paper or on a website. Additionally, they rarely provide an option to display multiple regression results in the same table. Two (bad) options for including regression results in your paper include copying over each desied number by hand, or taking a screenshot of your regression output. Much better is using a command that outputs regression results in a nice format, in a way you can include in your presentation. @@ -44,12 +44,14 @@ lm2 <- lm(mpg ~ cyl + hp, data = mtcars) # Let's output an HTML table, perhaps for pasting into Word # We could instead set type = 'latex' for LaTeX or type = 'text' for a text-only table. 
-stargazer(lm1, lm2, type = 'html', out = 'my_reg_table.html') +stargazer(lm1, lm2, type = "html", out = "my_reg_table.html") # In line with good practices, we should use readable names for our variables -stargazer(lm1, lm2, type = 'html', out = 'my_reg_table.html', - covariate.labels = c('Cylinders','Horsepower'), - dep.var.labels = 'Miles per Gallon') +stargazer(lm1, lm2, + type = "html", out = "my_reg_table.html", + covariate.labels = c("Cylinders", "Horsepower"), + dep.var.labels = "Miles per Gallon" +) ``` This produces: @@ -122,7 +124,7 @@ huxreg(lm1, lm2, # We can send it to the screen to view it instantly print_screen() -# Or we can send it to a file with the quick_ functions, which can +# Or we can send it to a file with the quick_ functions, which can # output to pdf, docx, html, xlsx, pptx, rtf, or latex. huxreg(lm1, lm2, coefs=c('Cylinders' = 'cyl', @@ -181,7 +183,7 @@ estimates store weightandforeign * replacing any table we've already made * and making an HTML table with style(html) * style(tex) also works, and the default is tab-delimited data for use in Excel. -* Note also the default is to display t-statistics in parentheses. If we want +* Note also the default is to display t-statistics in parentheses. If we want * standard errors instead, we say so with se esttab weightonly weightandforeign using my_reg_output.html, label replace style(html) se ``` diff --git a/Presentation/Tables/Summary_Statistics_Tables.md b/Presentation/Tables/Summary_Statistics_Tables.md index 7c01fa71..cdeb2e43 100644 --- a/Presentation/Tables/Summary_Statistics_Tables.md +++ b/Presentation/Tables/Summary_Statistics_Tables.md @@ -33,7 +33,7 @@ library(vtable) data(mtcars) # Feed sumtable a data.frame with the variables you want summarized -mt_tosum <- mtcars[,c('mpg','cyl','disp')] +mt_tosum <- mtcars[, c("mpg", "cyl", "disp")] # By default, the table shows up in the Viewer pane (in RStudio) or your browser (otherwise) # (or if being run inside of RMarkdown, in the RMarkdown document format) sumtable(mt_tosum) @@ -44,15 +44,15 @@ st(mt_tosum) # help(sumtable) # Some useful ones include out, which designates a file to send the table to # (note that HTML tables can be copied straight into Word from an output file) -sumtable(mt_tosum, out = 'html', file = 'my_summary.html') +sumtable(mt_tosum, out = "html", file = "my_summary.html") # sumtable will handle factor variables as expected, # and you can replace variable names with "labels" -mt_tosum$trans <- factor(mtcars$am, labels = c('Manual','Automatic')) -st(mt_tosum, labels = c('Miles per Gallon','Cylinders','Displacement','Transmission')) +mt_tosum$trans <- factor(mtcars$am, labels = c("Manual", "Automatic")) +st(mt_tosum, labels = c("Miles per Gallon", "Cylinders", "Displacement", "Transmission")) # Use group to get summary statistics by group -st(mt_tosum, labels = c('Miles per Gallon','Cylinders','Displacement'), group = 'trans') +st(mt_tosum, labels = c("Miles per Gallon", "Cylinders", "Displacement"), group = "trans") ``` Another good option is the package **skimr**, which is an excellent alternative to `base::summary()`. `skimr::skim()` takes different data types and outputs a summary statistic data frame. Numeric data gets miniature histograms and all types of data get information about the number of missing entries. @@ -66,21 +66,20 @@ library(skimr) skim(starwars) -#If you're wondering which columns have missing values, you can use skim() in a pipeline. 
+# If you're wondering which columns have missing values, you can use skim() in a pipeline. starwars %>% skim() %>% dplyr::filter(n_missing > 0) %>% dplyr::select(skim_variable, n_missing, complete_rate) - -#You can analyze grouped data with skimr. You can also easily customize the output table using skim_with(). + +# You can analyze grouped data with skimr. You can also easily customize the output table using skim_with(). my_skim <- skim_with(base = sfl( - n = length + n = length )) starwars %>% - group_by(species) %>% - my_skim() %>% - dplyr::filter(skim_variable == "height" & n > 1) - + group_by(species) %>% + my_skim() %>% + dplyr::filter(skim_variable == "height" & n > 1) ``` ## Stata @@ -106,7 +105,7 @@ estpost summarize price mpg rep78 f_* * We can then use esttab and cells() to pick columns * Now it's nicely formatted -* The quotes around the statistics put all the statistics in one row +* The quotes around the statistics put all the statistics in one row esttab, cells("count mean sd min max") * If we want to limit the number of significant digits we must do this stat by stat @@ -131,3 +130,4 @@ outreg2 using mysmalltable.doc, word sum(log) eqkeep(N mean) dec(3) replace restore ``` + diff --git a/Presentation/Tables/Tables.md b/Presentation/Tables/Tables.md index ecfd5a72..b2f73fda 100644 --- a/Presentation/Tables/Tables.md +++ b/Presentation/Tables/Tables.md @@ -7,3 +7,4 @@ nav_order: 2 --- # Tables + diff --git a/README.md b/README.md index ff590feb..4d01a659 100644 --- a/README.md +++ b/README.md @@ -28,12 +28,10 @@ We have some facilities for testing to make sure that all the code samples in th ### Requirements -You will first need to install [Docker](https://docs.docker.com/desktop/). You will also need Python 3.8 or above. After this, you will need to run the following commands: +You will first need to install [Docker](https://docs.docker.com/desktop/). You will also need Python 3.8 or above and [poetry](https://python-poetry.org). After this, you will need to run the following commands: ```bash -python3 -m venv venv -source venv/bin/activate -pip install 'mistune==2.0.0rc1' 'py.test==6.1.1' 'pytest-xdist==2.1.0' +poetry install docker pull ghcr.io/lost-stats/lost-docker-images/tester-r docker pull ghcr.io/lost-stats/lost-docker-images/tester-python @@ -46,20 +44,19 @@ At this point, the docker images will _not_ be updated unless you explicitly rep After completing the setup, you can simply run ``` -source venv/bin/activate -py.test test_samples.py +poetry run py.test -k test_samples ``` Note that this will take a _long_ time to run. You can reduce the set of tests run using the `--mdpath` option. For instance, to find and run all the code samples in the `Time_Series` and `Presentation` folders, you can run ``` -py.test test_samples.py --mdpath Time_Series --mdpath Presentation +poetry run py.test -k test_samples --mdpath Time_Series --mdpath Presentation ``` Furthermore, you can run tests in parallel by adding the `-n` parameter: ``` -py.test test_samples.py -n 3 --mdpath Time_Series +poetry run py.test -k test_samples -n 3 --mdpath Time_Series ``` ### Adding dependencies @@ -73,4 +70,18 @@ docker pull ghcr.io/lost-stats/lost-docker-images/tester-python ### Connecting code samples -Note that a lot of code samples in this repository are broken up by raw markdown text. If you would like to connect these in a single runtime, you should specify the language as `language?example=some_id` for each code sample in the chain. 
For instance, a Python example might be specified as `python?example=seaborn` as you can see in the [Line Graphs Example](https://github.com/LOST-STATS/lost-stats.github.io/blob/source/Presentation/Figures/line_graphs.md). \ No newline at end of file +Note that a lot of code samples in this repository are broken up by raw markdown text. If you would like to connect these in a single runtime, you should specify the language as `language?example=some_id` for each code sample in the chain. For instance, a Python example might be specified as `python?example=seaborn` as you can see in the [Line Graphs Example](https://github.com/LOST-STATS/lost-stats.github.io/blob/source/Presentation/Figures/line_graphs.md). + +### Style code samples + +In order to keep all of our code samples styled consistently, we use [black](https://github.com/psf/black) for Python and [styler](https://styler.r-lib.org/) for R. To style an individual file, run + +``` +poetry run lost style path/to/file.md +``` + +To style _all_ files, run + +``` +poetry run lost style . --skip tests --skip README.md +``` diff --git a/Time_Series/AR-models.md b/Time_Series/AR-models.md index 07ac6afc..671b4315 100644 --- a/Time_Series/AR-models.md +++ b/Time_Series/AR-models.md @@ -38,8 +38,10 @@ Using GDP data, let’s fit an auto-regressive model of order 1, an AR(1), with import pandas as pd from statsmodels.tsa.ar_model import AutoReg, ar_select_order -gdp = pd.read_csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv", - index_col=0) +gdp = pd.read_csv( + "https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv", + index_col=0, +) ar1_model = AutoReg(gdp, 1) results = ar1_model.fit() print(results.summary()) @@ -112,17 +114,17 @@ print(results_p.summary()) ## R ```r -#load data -gdp = read.csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv") +# load data +gdp <- read.csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv") -#estimation via ols: pay attention to the selection of the 'GDPC1' column. -#if the column is not specified, the function call also interprets the date column as a time series variable! -ar_gdp = ar.ols(gdp$GDPC1) +# estimation via ols: pay attention to the selection of the 'GDPC1' column. +# if the column is not specified, the function call also interprets the date column as a time series variable! +ar_gdp <- ar.ols(gdp$GDPC1) ar_gdp -#lag order is automatically selected by minimizing AIC -#disable this feature with the optional command 'aic = F'. Note: you will also likely wish to specify the argument 'order.max'. -#ar.ols() defaults to demeaning the data automatically. Also consider taking logs and first differencing for statistically meaningful results. +# lag order is automatically selected by minimizing AIC +# disable this feature with the optional command 'aic = F'. Note: you will also likely wish to specify the argument 'order.max'. +# ar.ols() defaults to demeaning the data automatically. Also consider taking logs and first differencing for statistically meaningful results. ``` ## STATA @@ -146,3 +148,4 @@ tsset date_index reg gdpc1 L.gdpc1 L2.gdpc1 *variables are not demeaned automatically by STATA. Also consider taking logs and first differencing for statistically meaningful results. 
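* As a purely illustrative sketch (not from the original page), the log
* first-difference version of the same AR(2) regression might look like:
*   gen lgdpc1 = ln(gdpc1)
*   gen dlgdpc1 = D.lgdpc1
*   reg dlgdpc1 L.dlgdpc1 L2.dlgdpc1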
``` + diff --git a/Time_Series/ARCH_Model.md b/Time_Series/ARCH_Model.md index c693662c..80d2d4d6 100644 --- a/Time_Series/ARCH_Model.md +++ b/Time_Series/ARCH_Model.md @@ -33,7 +33,7 @@ For additional information, see [Wikipedia: Autoregressive conditional heteroske ## Also Consider -- ARCH models can be univariate (scalar) or multivariate (vector). +- ARCH models can be univariate (scalar) or multivariate (vector). - ARCH models are commonly employed in modeling financial time series that exhibit time-varying volatility and volatility clustering, i.e. periods of swings interspersed with periods of relative calm. - If an autoregressive moving average (ARMA) model is assumed for the error variance, the model is a generalized autoregressive conditional heteroskedasticity (GARCH) model. For more information on GARCH models, see [Wikipedia: GARCH](https://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity#GARCH). For information about estimating an GARCH models, see [LOST: GARCH models]({{ "/Time_Series/GARCH_Model.html" | relative_url }}). @@ -47,18 +47,19 @@ from random import seed from matplotlib import pyplot from arch import arch_model import numpy as np + # seed the process np.random.seed(1) # Simulating a ARCH(1) process a0 = 1 -a1 = .5 +a1 = 0.5 w = np.random.normal(size=1000) e = np.random.normal(size=1000) Y = np.empty_like(w) for t in range(1, len(w)): - Y[t] = w[t] * np.sqrt((a0 + a1*Y[t-1]**2)) + Y[t] = w[t] * np.sqrt((a0 + a1 * Y[t - 1] ** 2)) # fit model -model = arch_model(Y, vol = "ARCH", rescale = "FALSE") +model = arch_model(Y, vol="ARCH", rescale="FALSE") model_fit = model.fit() print(model_fit.summary) ``` @@ -74,9 +75,11 @@ set.seed(1) e <- NULL obs <- 1000 e[1] <- rnorm(1) -for (i in 2:obs) {e[i] <- rnorm(1)*(1+0.5*(e[i-1])^2)^0.5} +for (i in 2:obs) { + e[i] <- rnorm(1) * (1 + 0.5 * (e[i - 1])^2)^0.5 +} # fit the model -arch.fit <- garchFit(~garch(1,0), data = e, trace = F) +arch.fit <- garchFit(~ garch(1, 0), data = e, trace = F) summary(arch.fit) ``` @@ -92,6 +95,7 @@ tsset time gen e=. replace e=rnormal() if time==1 replace e=rnormal()*(1 + .5*(e[_n-1])^2)^.5 if time>=2 & time<=2000 -* Estimate arch parameters.. +* Estimate arch parameters.. arch e, arch(1) ``` + diff --git a/Time_Series/ARIMA-models.md b/Time_Series/ARIMA-models.md index b92d9ced..ac2dbd67 100644 --- a/Time_Series/ARIMA-models.md +++ b/Time_Series/ARIMA-models.md @@ -66,25 +66,25 @@ which leads to $$\Delta^{d} y_{t}$$ being an $$ARMA(p,q)$$ process. The `stats` package, which comes standard-loaded on an RStudio workspace, includes the function `arima`, which allows one to estimate an arima model, if they know $$p,d,$$ and $$q$$ already. ```r?example=rarima -#load data -gdp = read.csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv") -gdp_ts = ts(gdp[ ,2], frequency = 4, start = c(1947, 01), end = c(2019, 04)) -y = log(gdp_ts)*100 +# load data +gdp <- read.csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv") +gdp_ts <- ts(gdp[, 2], frequency = 4, start = c(1947, 01), end = c(2019, 04)) +y <- log(gdp_ts) * 100 ``` The output for `arima()` is a list. Use `$coef` to get only the AR and MA estimates. Use `$model` to get the entire estimated model. 
If you want to see the maximized log-likelihood value, $$sigma^{2}$$, and AIC, simply run the function on the data: ```r?example=rarima -#estimate an ARIMA(2,1,2) model -lgdp_arima <- arima(y, c(2,1,2)) +# estimate an ARIMA(2,1,2) model +lgdp_arima <- arima(y, c(2, 1, 2)) -#To see maximized log-likelihood value, $sigma^{2}$, and AIC: +# To see maximized log-likelihood value, $sigma^{2}$, and AIC: lgdp_arima -#To get only the AR and MA parameter estimates: +# To get only the AR and MA parameter estimates: lgdp_arima$coef -#To see the estimated model: +# To see the estimated model: lgdp_arima$model ``` @@ -94,10 +94,10 @@ differencing the series unit stationary - Create likelihood functions at various ```r?example=rarima library(forecast) -#Finding optimal parameters for an ARIMA using the previous data +# Finding optimal parameters for an ARIMA using the previous data lgdp_auto <- auto.arima(y) -#A seasonal model was selected, with non-seasonal components (p,d,q)=(1,2,1), and seasonal components (P,D,Q)=(2,0,1) +# A seasonal model was selected, with non-seasonal components (p,d,q)=(1,2,1), and seasonal components (P,D,Q)=(2,0,1) ``` `auto.arima()` contains a lot of flexibility. If one knows the value of $$d$$, it can be passed to the function. Maximum and starting values for $$p,q,$$ and $$d$$ can be specified in the seasonal- and non-seasonal cases. If one would like to restrict themselves to a non-seasonal model, or use a different test, these can also be done. Some of these features are demonstrated below. The method for testing unit roots can also be specified. See `?auto.arima` or [the package documentation](https://cran.r-project.org/web/packages/forecast/forecast.pdf) for more. @@ -110,12 +110,13 @@ lgdp_auto <- auto.arima(y) ## p starts at 1 and does not exceed 4 # no drift lgdp_ns <- auto.arima(y, - seasonal = F, - test = "adf", - start.p = 1, - max.p = 4, - allowdrift = F) -#An ARIMA(3,1,0) was specified + seasonal = F, + test = "adf", + start.p = 1, + max.p = 4, + allowdrift = F +) +# An ARIMA(3,1,0) was specified lgdp_ns ``` @@ -125,8 +126,9 @@ given an ARIMA model. Note that the input here should come from either `stats::arima()`. 
```r?example=rarima -#Simulate data using a non-seasonal ARIMA() -arima_222 <- Arima(y, c(2,2,2)) +# Simulate data using a non-seasonal ARIMA() +arima_222 <- Arima(y, c(2, 2, 2)) sim_arima <- forecast:::simulate.Arima(arima_222) tail(sim_arima, 20) ``` + diff --git a/Time_Series/ARMA-models.md b/Time_Series/ARMA-models.md index eb84af01..311b70fc 100644 --- a/Time_Series/ARMA-models.md +++ b/Time_Series/ARMA-models.md @@ -47,8 +47,10 @@ import numpy as np import pandas as pd from statsmodels.tsa.arima.model import ARIMA -gdp = pd.read_csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv", - index_col=0) +gdp = pd.read_csv( + "https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv", + index_col=0, +) # Take 1st diff of log of gdp d_ln_gdp = np.log(gdp).diff() @@ -122,27 +124,28 @@ if (!require("tsibble")) install.packages("tsibble") library(tsibble) -#load data -gdp = read.csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv") +# load data +gdp <- read.csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv") -#set our data up as a time-series +# set our data up as a time-series gdp$DATE <- as.Date(gdp$DATE) gdp_ts <- as_tsibble(gdp, - index = DATE, - regular = FALSE) %>% - index_by(qtr = ~ yearquarter(.)) + index = DATE, + regular = FALSE +) %>% + index_by(qtr = ~ yearquarter(.)) -#construct our first difference of log gdp variable -gdp_ts$lgdp=log(gdp_ts$GDPC1) +# construct our first difference of log gdp variable +gdp_ts$lgdp <- log(gdp_ts$GDPC1) -gdp_ts$ldiffgdp=difference(gdp_ts$lgdp, lag=1, difference=1) +gdp_ts$ldiffgdp <- difference(gdp_ts$lgdp, lag = 1, difference = 1) -#Estimate our ARMA(3,1) -##Note that because we are modeling for the first difference of log GDP, we cannot use our first observation of -##log GDP to estimate our model. -arma_gdp = arima(gdp_ts$lgdp[2:292], order=c(3,0,1)) +# Estimate our ARMA(3,1) +## Note that because we are modeling for the first difference of log GDP, we cannot use our first observation of +## log GDP to estimate our model. 
+arma_gdp <- arima(gdp_ts$lgdp[2:292], order = c(3, 0, 1)) arma_gdp ``` @@ -177,3 +180,4 @@ gen dlgdp = D.lgdp arima dlgdp, arima(3,0,1) ``` + diff --git a/Time_Series/GARCH_Model.md b/Time_Series/GARCH_Model.md index 9f39dffc..4f56ae49 100644 --- a/Time_Series/GARCH_Model.md +++ b/Time_Series/GARCH_Model.md @@ -38,6 +38,7 @@ from random import seed from matplotlib import pyplot from arch import arch_model import numpy as np + # seed the process np.random.seed(1) # Simulating a GARCH(1, 1) process @@ -49,7 +50,7 @@ w = np.random.normal(size=n) eps = np.zeros_like(w) sigsq = np.zeros_like(w) for i in range(1, n): - sigsq[i] = a0 + a1*(eps[i-1]**2) + b1*sigsq[i-1] + sigsq[i] = a0 + a1 * (eps[i - 1] ** 2) + b1 * sigsq[i - 1] eps[i] = w[i] * np.sqrt(sigsq[i]) model = arch_model(eps) model_fit = model.fit() @@ -71,12 +72,14 @@ a1 <- 0.5 b1 <- 0.3 obs <- 1000 eps <- rep(0, obs) -sigsq <- rep(0,obs) +sigsq <- rep(0, obs) for (i in 2:obs) { - sigsq[i] = a0 + a1*(eps[i-1]^2) + b1*sigsq[i-1] - eps[i] <- rnorm(1)*sqrt(sigsq[i])} + sigsq[i] <- a0 + a1 * (eps[i - 1]^2) + b1 * sigsq[i - 1] + eps[i] <- rnorm(1) * sqrt(sigsq[i]) +} # fit the model -garch.fit <- garchFit(~garch(1,1), data = eps, trace = F) +garch.fit <- garchFit(~ garch(1, 1), data = eps, trace = F) summary(garch.fit) ``` + diff --git a/Time_Series/Granger_Causality.md b/Time_Series/Granger_Causality.md index dfb09cdf..76bc9a87 100644 --- a/Time_Series/Granger_Causality.md +++ b/Time_Series/Granger_Causality.md @@ -78,11 +78,15 @@ alpha <- 0.5 # Intercept of the model Y # Function to create the error of Y ARsim2 <- function(rho, first, serieslength, distribution) { - if(distribution=="runif"){a <- runif(serieslength,min=0,max=1)} - else {a <- rnorm(serieslength,0,1)} + if (distribution == "runif") { + a <- runif(serieslength, min = 0, max = 1) + } + else { + a <- rnorm(serieslength, 0, 1) + } Y <- first - for (i in (length(rho)+1):serieslength){ - Y[i] <- rho*Y[i-1]+(sqrt(1-(rho^2)))*a[i] + for (i in (length(rho) + 1):serieslength) { + Y[i] <- rho * Y[i - 1] + (sqrt(1 - (rho^2))) * a[i] } return(Y) } @@ -104,8 +108,8 @@ for (i in 2:200) { ### Data ```r?example=grangertest -data <- as.data.frame(cbind(1:200,X,as.ts(Y))) -colnames(data) <- c("time", "X","Y") +data <- as.data.frame(cbind(1:200, X, as.ts(Y))) +colnames(data) <- c("time", "X", "Y") ``` ### Graph @@ -118,16 +122,16 @@ colnames(data) <- c("time", "X","Y") library(tidyr) library(ggplot2) -graphdata <- data[2:200,] %>% +graphdata <- data[2:200, ] %>% pivot_longer( - cols = -c(time), names_to="variable", values_to="value" + cols = -c(time), names_to = "variable", values_to = "value" ) -ggplot(graphdata, aes(x = time, y = value, group=variable)) + +ggplot(graphdata, aes(x = time, y = value, group = variable)) + geom_line(aes(color = variable), size = 0.7) + scale_color_manual(values = c("#00AFBB", "#E7B800")) + - theme_minimal()+ - labs(title = "Simulated ADL models")+ + theme_minimal() + + labs(title = "Simulated ADL models") + theme(text = element_text(size = 15)) ``` @@ -144,8 +148,8 @@ ggplot(graphdata, aes(x = time, y = value, group=variable)) + library(tseries) ## ADF test -adf.test(X, k=3) -adf.test(na.omit(Y), k=3) #na.omit() to delete the first 2 periods of lag +adf.test(X, k = 3) +adf.test(na.omit(Y), k = 3) # na.omit() to delete the first 2 periods of lag ``` * With a p-value of 0.01 and 0.01 for series X, and Y, we assure that both are stationary. @@ -203,3 +207,4 @@ Granger, C. W. (1969). Investigating Causal Relations by Econometric Models and Pierce, D.A. (1977). 
$R^2$ Measures for Time Series. Special Studies Paper No. 93, Washington, D.C.: Federal Reserve Board. + diff --git a/Time_Series/MA_Model.md b/Time_Series/MA_Model.md index 09fc1917..67fcc80c 100644 --- a/Time_Series/MA_Model.md +++ b/Time_Series/MA_Model.md @@ -74,9 +74,9 @@ Additional helpful information can be found at [Wikipedia: Moving Average Models ## R ```r -#in the stats package we can simulate an ARIMA Model. ARIMA stands for Auto-Regressive Integrated Moving Average model. We will be setting the AR and I parts to 0 and only simulating a MA(q) model. +# in the stats package we can simulate an ARIMA Model. ARIMA stands for Auto-Regressive Integrated Moving Average model. We will be setting the AR and I parts to 0 and only simulating a MA(q) model. set.seed(123) -DT = arima.sim(n = 1000, model = list(ma = c(0.1, 0.3, 0.5))) +DT <- arima.sim(n = 1000, model = list(ma = c(0.1, 0.3, 0.5))) plot(DT, ylab = "Value") ``` @@ -85,10 +85,10 @@ plot(DT, ylab = "Value") ```r set.seed(123) -DT = arima.sim(n = 1000, model = list(ma = c(0.1, 0.3, 0.5))) +DT <- arima.sim(n = 1000, model = list(ma = c(0.1, 0.3, 0.5))) -#ACF stands for Autocorrelation Function -#Here we can see that there may be potential for 3 lags in our MA process. (Note: This is due to property (3): the covariance of y_t and y_{t-3} is nonzero while the covariance of y_t and y_{t-4} is 0) +# ACF stands for Autocorrelation Function +# Here we can see that there may be potential for 3 lags in our MA process. (Note: This is due to property (3): the covariance of y_t and y_{t-3} is nonzero while the covariance of y_t and y_{t-4} is 0) acf(DT, type = "covariance") ``` @@ -96,148 +96,149 @@ acf(DT, type = "covariance") ```r set.seed(123) -DT = arima.sim(n = 1000, model = list(ma = c(0.1, 0.3, 0.5))) - -#Here I'm estimating an ARIMA(0,0,3) model which is a MA(3) model. Changing c(0,0,q) allows us to estimate a MA(q) process. -arima(x = DT, order = c(0,0,3)) - - ## - ## Call: - ## arima(x = DT, order = c(0, 0, 3)) - ## - ## Coefficients: - ## ma1 ma2 ma3 intercept - ## 0.0722 0.2807 0.4781 0.0265 - ## s.e. 0.0278 0.0255 0.0294 0.0573 - ## - ## sigma^2 estimated as 0.9825: log likelihood = -1410.63, aic = 2831.25 - -#We can also estimate a MA(7) model and see that the ma4, ma5, ma6, and ma7 are close to 0 and insignificant. -arima(x = DT, order = c(0,0,7)) - - - ## - ## Call: - ## arima(x = DT, order = c(0, 0, 7)) - ## - ## Coefficients: - ## ma1 ma2 ma3 ma4 ma5 ma6 ma7 intercept - ## 0.0714 0.2694 0.4607 -0.0119 -0.0380 -0.0256 -0.0219 0.0267 - ## s.e. 0.0316 0.0321 0.0324 0.0363 0.0339 0.0332 0.0328 0.0533 - ## - ## sigma^2 estimated as 0.9806: log likelihood = -1409.65, aic = 2837.3 - -#fable is a package designed to estimate ARIMA models. We can use it to estimate our MA(3) model. +DT <- arima.sim(n = 1000, model = list(ma = c(0.1, 0.3, 0.5))) + +# Here I'm estimating an ARIMA(0,0,3) model which is a MA(3) model. Changing c(0,0,q) allows us to estimate a MA(q) process. +arima(x = DT, order = c(0, 0, 3)) + +## +## Call: +## arima(x = DT, order = c(0, 0, 3)) +## +## Coefficients: +## ma1 ma2 ma3 intercept +## 0.0722 0.2807 0.4781 0.0265 +## s.e. 0.0278 0.0255 0.0294 0.0573 +## +## sigma^2 estimated as 0.9825: log likelihood = -1410.63, aic = 2831.25 + +# We can also estimate a MA(7) model and see that the ma4, ma5, ma6, and ma7 are close to 0 and insignificant. 
+arima(x = DT, order = c(0, 0, 7)) + + +## +## Call: +## arima(x = DT, order = c(0, 0, 7)) +## +## Coefficients: +## ma1 ma2 ma3 ma4 ma5 ma6 ma7 intercept +## 0.0714 0.2694 0.4607 -0.0119 -0.0380 -0.0256 -0.0219 0.0267 +## s.e. 0.0316 0.0321 0.0324 0.0363 0.0339 0.0332 0.0328 0.0533 +## +## sigma^2 estimated as 0.9806: log likelihood = -1409.65, aic = 2837.3 + +# fable is a package designed to estimate ARIMA models. We can use it to estimate our MA(3) model. library(fable) -#an extension of tidyverse to temporal data (this allows us to create time series data into tibbles which are needed for fable functionality) +# an extension of tidyverse to temporal data (this allows us to create time series data into tibbles which are needed for fable functionality) library(tsibble) -#visit https://dplyr.tidyverse.org/ to understand dplyr syntax; this package is important for fable functionality +# visit https://dplyr.tidyverse.org/ to understand dplyr syntax; this package is important for fable functionality library(dplyr) -#When using the fable package, we need to convert our object into a tsibble (a time series tibble). This gives us a data frame with values and an index for the time periods -DT = DT %>% +# When using the fable package, we need to convert our object into a tsibble (a time series tibble). This gives us a data frame with values and an index for the time periods +DT <- DT %>% as_tsibble() head(DT) - ## # A tsibble: 6 x 2 [1] - ## index value - ## - ## 1 1 -0.123 - ## 2 2 0.489 - ## 3 3 2.53 - ## 4 4 0.706 - ## 5 5 -0.640 - ## 6 6 0.182 - -#Now we can use the dplyr package to pipe our dataset and create a fitted model -#Note: the ARIMA function in the fable package uses an information criterion for model selection; these can be set as shown below; additional information is above in the Keep in Mind section (the default criterion is aicc) -MAfit = DT %>% +## # A tsibble: 6 x 2 [1] +## index value +## +## 1 1 -0.123 +## 2 2 0.489 +## 3 3 2.53 +## 4 4 0.706 +## 5 5 -0.640 +## 6 6 0.182 + +# Now we can use the dplyr package to pipe our dataset and create a fitted model +# Note: the ARIMA function in the fable package uses an information criterion for model selection; these can be set as shown below; additional information is above in the Keep in Mind section (the default criterion is aicc) +MAfit <- DT %>% model(arima = ARIMA(value, ic = "aicc")) -#report() is needed to view our model +# report() is needed to view our model report(MAfit) - ## Series: value - ## Model: ARIMA(0,0,3) - ## - ## Coefficients: - ## ma1 ma2 ma3 - ## 0.0723 0.2808 0.4782 - ## s.e. 0.0278 0.0255 0.0294 - ## - ## sigma^2 estimated as 0.9857: log likelihood=-1410.73 - ## AIC=2829.47 AICc=2829.51 BIC=2849.1 +## Series: value +## Model: ARIMA(0,0,3) +## +## Coefficients: +## ma1 ma2 ma3 +## 0.0723 0.2808 0.4782 +## s.e. 0.0278 0.0255 0.0294 +## +## sigma^2 estimated as 0.9857: log likelihood=-1410.73 +## AIC=2829.47 AICc=2829.51 BIC=2849.1 -#if instead we want to specify the model manually, we need to specify it. For MA models, set the pdq(0,0,q) term to the MA(q) order you want to estimate. For example: Estimating a MA(7) would mean that I should put pdq(0,0,7). Additionally, you can add a constant if wanted; this is shown below +# if instead we want to specify the model manually, we need to specify it. For MA models, set the pdq(0,0,q) term to the MA(q) order you want to estimate. For example: Estimating a MA(7) would mean that I should put pdq(0,0,7). 
Additionally, you can add a constant if wanted; this is shown below -#with constant -MAfit = DT %>% - model(arima = ARIMA(value ~ 1 + pdq(0,0,3), ic = "aicc")) +# with constant +MAfit <- DT %>% + model(arima = ARIMA(value ~ 1 + pdq(0, 0, 3), ic = "aicc")) report(MAfit) - ## Series: value - ## Model: ARIMA(0,0,3) w/ mean - ## - ## Coefficients: - ## ma1 ma2 ma3 constant - ## 0.0722 0.2807 0.4781 0.0265 - ## s.e. 0.0278 0.0255 0.0294 0.0573 - ## - ## sigma^2 estimated as 0.9865: log likelihood=-1410.63 - ## AIC=2831.25 AICc=2831.31 BIC=2855.79 - -#without constant -MAfit = DT %>% - model(arima = ARIMA(value ~ 0 + pdq(0,0,3), ic = "aicc")) +## Series: value +## Model: ARIMA(0,0,3) w/ mean +## +## Coefficients: +## ma1 ma2 ma3 constant +## 0.0722 0.2807 0.4781 0.0265 +## s.e. 0.0278 0.0255 0.0294 0.0573 +## +## sigma^2 estimated as 0.9865: log likelihood=-1410.63 +## AIC=2831.25 AICc=2831.31 BIC=2855.79 + +# without constant +MAfit <- DT %>% + model(arima = ARIMA(value ~ 0 + pdq(0, 0, 3), ic = "aicc")) report(MAfit) - ## Series: value - ## Model: ARIMA(0,0,3) - ## - ## Coefficients: - ## ma1 ma2 ma3 - ## 0.0723 0.2808 0.4782 - ## s.e. 0.0278 0.0255 0.0294 - ## - ## sigma^2 estimated as 0.9857: log likelihood=-1410.73 - ## AIC=2829.47 AICc=2829.51 BIC=2849.1 +## Series: value +## Model: ARIMA(0,0,3) +## +## Coefficients: +## ma1 ma2 ma3 +## 0.0723 0.2808 0.4782 +## s.e. 0.0278 0.0255 0.0294 +## +## sigma^2 estimated as 0.9857: log likelihood=-1410.73 +## AIC=2829.47 AICc=2829.51 BIC=2849.1 -#A faster, more compact way to write a code would be as follows: +# A faster, more compact way to write a code would be as follows: -#Automatic estimation +# Automatic estimation DT %>% as_tsibble() %>% model(arima = ARIMA(value)) %>% report() - ## Series: value - ## Model: ARIMA(0,0,3) - ## - ## Coefficients: - ## ma1 ma2 ma3 - ## 0.0723 0.2808 0.4782 - ## s.e. 0.0278 0.0255 0.0294 - ## - ## sigma^2 estimated as 0.9857: log likelihood=-1410.73 - ## AIC=2829.47 AICc=2829.51 BIC=2849.1 - -#Manual estimation +## Series: value +## Model: ARIMA(0,0,3) +## +## Coefficients: +## ma1 ma2 ma3 +## 0.0723 0.2808 0.4782 +## s.e. 0.0278 0.0255 0.0294 +## +## sigma^2 estimated as 0.9857: log likelihood=-1410.73 +## AIC=2829.47 AICc=2829.51 BIC=2849.1 + +# Manual estimation DT %>% as_tsibble() %>% model(arima = ARIMA(value ~ 0 + pdq(0, 0, 3))) %>% report() - ## Series: value - ## Model: ARIMA(0,0,3) - ## - ## Coefficients: - ## ma1 ma2 ma3 - ## 0.0723 0.2808 0.4782 - ## s.e. 0.0278 0.0255 0.0294 - ## - ## sigma^2 estimated as 0.9857: log likelihood=-1410.73 - ## AIC=2829.47 AICc=2829.51 BIC=2849.1 -``` \ No newline at end of file +## Series: value +## Model: ARIMA(0,0,3) +## +## Coefficients: +## ma1 ma2 ma3 +## 0.0723 0.2808 0.4782 +## s.e. 0.0278 0.0255 0.0294 +## +## sigma^2 estimated as 0.9857: log likelihood=-1410.73 +## AIC=2829.47 AICc=2829.51 BIC=2849.1 +``` + diff --git a/Time_Series/State_Space_Models.md b/Time_Series/State_Space_Models.md index cf6624c3..e76b8a3a 100644 --- a/Time_Series/State_Space_Models.md +++ b/Time_Series/State_Space_Models.md @@ -8,18 +8,18 @@ nav_order: 1 # Linear Gaussian State Space Models -The state space model can be used to represent a variety of dynamic processes, including standard ARMA processes. +The state space model can be used to represent a variety of dynamic processes, including standard ARMA processes. 
It has two main components: (1) a hidden/latent $$x_t$$ process referred to as the state process, and (2) an observed process $$y_t$$ that is independent conditional on $$x_t$$. Let us consider the most basic state space model -- the linear Gaussian model -- in which $$x_t$$ follows a linear autoregressive process and $$y_t$$ is a linear mapping of $$x_t$$ with added noise. The linear Gaussian state space model is characterized by the following state equation: $$ x_{t+1} = F \, x_{t} + u_{t+1} \, ,$$ -where $$x_t$$ and $$u_t$$ are both $$p \times 1$$ vectors, such that $$u_t \sim i.i.d. N(0,Q)$$. -It is assumed that the initial state vector $$x_0$$ is drawn from a normal distribution. -The observation equation is expressed as +where $$x_t$$ and $$u_t$$ are both $$p \times 1$$ vectors, such that $$u_t \sim i.i.d. N(0,Q)$$. +It is assumed that the initial state vector $$x_0$$ is drawn from a normal distribution. +The observation equation is expressed as -$$y_t = A_t \, x_t + v_t \, ,$$ +$$y_t = A_t \, x_t + v_t \, ,$$ where $$y_t$$ is a $$q \times 1$$ observed vector, $$A_t$$ is a $$q \times p$$ observation matrix, and $$v_t \sim i.i.d. N(0,R)$$ is a $$q \times 1$$ noise vector. @@ -30,14 +30,14 @@ For additional information about the state-space repsentation, refer to [Wikiped - Expressing a dynamic process in state-space form allows us to apply the Kalman filter and smoother. - The parameters of a linear Gaussian state space model can be estimated using a maximum likelihood approach. This is made possible by the fact that the innovation vectors $$u_t$$ and $$v_t$$ are assumed to be multivariate standard normal. -The Kalman filter can be used to construct the likelihood function, which can be transformed into a log-likelihood function and simply optimized with respect to the parameters. +The Kalman filter can be used to construct the likelihood function, which can be transformed into a log-likelihood function and simply optimized with respect to the parameters. - If the innovations are assumed to be non-Gaussian, then we may still apply the maximum likelihood procedure to yield quasi-maximum likelihood parameter estimates that are consistent and asymptotically normal. -- Unless appropriate restrictions are placed on the parameter matrices, the parameter matrices obtained from the above-mentioned optimization procedure will not be unique. -In other words, in the absence of restrictions, the parameters of the state space model are unidentified. +- Unless appropriate restrictions are placed on the parameter matrices, the parameter matrices obtained from the above-mentioned optimization procedure will not be unique. +In other words, in the absence of restrictions, the parameters of the state space model are unidentified. - The Kalman filter can be used to recursively generate forecasts of the state vector within a sample period given information up to time $$t \in \{t_0,\ldots,T\}$$, where $$t_0$$ and $$T$$ represent the initial and final periods of a sample, respectively. - The Kalman smoother can be used to generate historical estimates of the state vector throughout the entire sample period given all available information in the sample (information up to time $$T$$). -# Also Consider +# Also Consider - Recall that a stationary ARMA process can be expressed as a state space model. This may not be necessary, however, unless the given data has missing observations.
@@ -45,21 +45,21 @@ If there are no missing data, then one can defer to the standard method of estim # Implementations -First, follow the [instructions]({{ "/Time_Series/creating_time_series_dataset.html" | relative_url }}) for creating and formatting time-series data using your software of choice. +First, follow the [instructions]({{ "/Time_Series/creating_time_series_dataset.html" | relative_url }}) for creating and formatting time-series data using your software of choice. We will again use quarterly US GDP data downloaded from [FRED](https://fred.stlouisfed.org/series/GDPC1) as an example. -We estimate the quarterly log change in GDP using an ARMA(3,1) model in state space form to follow the [ARMA implementation]({{ "/Time_Series/ARMA-models.html" | relative_url }}). +We estimate the quarterly log change in GDP using an ARMA(3,1) model in state space form to follow the [ARMA implementation]({{ "/Time_Series/ARMA-models.html" | relative_url }}). -An ARMA($$p,q$$) process +An ARMA($$p,q$$) process $$ y_t = c + \sum_{i = 1}^{3} \phi_i Y_{t-i} + \sum_{j = 1}^{1} \theta_j \varepsilon_{t-j} + \varepsilon_t $$ may be expressed in state-space form in a variety of ways -- the following is an example of a common parsimonious approach. -The state equation (also referred to as the transition equation) may be expressed as +The state equation (also referred to as the transition equation) may be expressed as -$$ +$$ \begin{bmatrix} y_t \\ y_{t-1} \\ y_{t-2} \\ \varepsilon_t \end{bmatrix} -= -\begin{bmatrix} c \\ 0 \\ 0 \\ 0 \end{bmatrix} += +\begin{bmatrix} c \\ 0 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 & \theta \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \, \begin{bmatrix} y_{t-1} \\ y_{t-2} \\ y_{t-3} \\ \varepsilon_{t-1} \end{bmatrix} @@ -67,12 +67,12 @@ $$ \begin{bmatrix} \varepsilon_t \\ 0 \\ 0 \\ \varepsilon_t \end{bmatrix} \, , $$ -while the observation equation (also referred to as the measurement equation) may be expressed as -$$ +while the observation equation (also referred to as the measurement equation) may be expressed as +$$ y_t = \begin{bmatrix} 1&0&0&0 \end{bmatrix} \, \begin{bmatrix} y_t \\ y_{t-1} \\ y_{t-2} \\ \varepsilon_t \end{bmatrix} \, . $$ -The observation matrix $$A_t$$ in our implementation will be time-invariant ($$A_t = A, \forall t$$). +The observation matrix $$A_t$$ in our implementation will be time-invariant ($$A_t = A, \forall t$$). 
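For readers who want a quick cross-check in Python, a minimal sketch is given below. It assumes `statsmodels`, whose `SARIMAX` class casts the ARMA(3,1) into a state-space form internally and estimates it by maximum likelihood with the Kalman filter; the filtered and smoothed state estimates are then available from the results object.

```python
# Rough Python cross-check (this sketch assumes statsmodels is available);
# the implementation under the R heading below uses the dlm package instead.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

gdp = pd.read_csv(
    "https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv",
    index_col=0,
)

# Quarterly log change in GDP; drop the undefined first difference
y = np.log(gdp["GDPC1"]).diff().dropna()

# ARMA(3,1) with a constant, estimated in state-space form via the Kalman filter
model = SARIMAX(y, order=(3, 0, 1), trend="c")
results = model.fit()
print(results.summary())

# Kalman-filtered and Kalman-smoothed estimates of the state vector
filtered_states = results.filtered_state
smoothed_states = results.smoothed_state
```

Note that `SARIMAX` uses its own internal state-space parameterization rather than the matrices written out above, so its reported state vector need not match the one in the equations element for element, though the fitted ARMA coefficients should be directly comparable.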
## R @@ -96,20 +96,21 @@ library(dlm) # Prepare the data ## Load data -gdp = read.csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv") +gdp <- read.csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv") ## Set our data up as a time-series gdp$DATE <- as.Date(gdp$DATE) gdp_ts <- as_tsibble(gdp, - index = DATE, - regular = FALSE) %>% + index = DATE, + regular = FALSE +) %>% index_by(qtr = ~ yearquarter(.)) ## Construct our first difference of log gdp variable -gdp_ts$lgdp=log(gdp_ts$GDPC1) +gdp_ts$lgdp <- log(gdp_ts$GDPC1) -gdp_ts$ldiffgdp=difference(gdp_ts$lgdp, lag=1, difference=1) +gdp_ts$ldiffgdp <- difference(gdp_ts$lgdp, lag = 1, difference = 1) # Estimate ARMA(3,1) using the above data @@ -118,13 +119,15 @@ y <- gdp_ts$ldiffgdp ## Build ARMA(3,1) model fn <- function(parm) { - dlmModARMA(ar = c(parm[1], parm[2], parm[3]), - ma = parm[4], - sigma2 = parm[5]) + dlmModARMA( + ar = c(parm[1], parm[2], parm[3]), + ma = parm[4], + sigma2 = parm[5] + ) } ## Fit the model to the data -fit <- dlmMLE(y, c(rep(0, 4),1), build = fn, hessian = TRUE) +fit <- dlmMLE(y, c(rep(0, 4), 1), build = fn, hessian = TRUE) (conv <- fit$convergence) ## Store var-cov stats @@ -137,5 +140,5 @@ filtered <- dlmFilter(y, mod = mod) # Apply the Kalman smoother smoothed <- dlmSmooth(filtered) - ``` + diff --git a/Time_Series/Time_Series.md b/Time_Series/Time_Series.md index 3cb744b6..3fe44939 100644 --- a/Time_Series/Time_Series.md +++ b/Time_Series/Time_Series.md @@ -5,3 +5,4 @@ nav_order: 8 --- # Time Series + diff --git a/Time_Series/creating_time_series_dataset.md b/Time_Series/creating_time_series_dataset.md index 1e53c495..090742b6 100644 --- a/Time_Series/creating_time_series_dataset.md +++ b/Time_Series/creating_time_series_dataset.md @@ -26,13 +26,15 @@ Time-series estimators are, by definition, a function of the temporal ordering o import pandas as pd # Read in data -gdp = pd.read_csv("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv") +gdp = pd.read_csv( + "https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv" +) # Convert date column to be of data type datetime64 -gdp['DATE'] = pd.to_datetime(gdp['DATE']) +gdp["DATE"] = pd.to_datetime(gdp["DATE"]) # Create a column with quarter-year combinations -gdp['yr-qtr'] = gdp['DATE'].apply(lambda x: str(x.year) + '-' + str(x.quarter)) +gdp["yr-qtr"] = gdp["DATE"].apply(lambda x: str(x.year) + "-" + str(x.quarter)) ``` ## R @@ -66,9 +68,10 @@ STEP 3) Convert a date variable formats to quarter ```r?example=tsibble gdp_ts <- as_tsibble(gdp, - index = DATE, - regular = FALSE) %>% - index_by(qtr = ~ yearquarter(.)) + index = DATE, + regular = FALSE +) %>% + index_by(qtr = ~ yearquarter(.)) ``` By applying `yearquarter()` to the index variable (referred to as `.`), `index_by()` creates a new variable named `qtr` at a quarterly interval, which corresponds to the year-quarter of the original variable `DATE`. @@ -117,3 +120,4 @@ tsset date_index ``` Now, we have a quarterly Stata time-series dataset. Any data you add to this file in the future will be interpreted as time-series data.
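A possible pandas alternative to the string `yr-qtr` column in the Python example above is to carry the quarterly frequency in the index itself; the sketch below assumes the same GDPC1 file.

```python
# Illustrative alternative (not from the original page): use a quarterly
# PeriodIndex instead of a "year-quarter" string column.
import pandas as pd

gdp = pd.read_csv(
    "https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Time_Series/Data/GDPC1.csv"
)

# Parse the dates and convert them to quarterly periods
gdp["DATE"] = pd.to_datetime(gdp["DATE"])
gdp = gdp.set_index(gdp["DATE"].dt.to_period("Q"))

# The index is now a PeriodIndex carrying the quarterly frequency
print(gdp.index[:4])
```

With a `PeriodIndex`, the quarterly frequency is stored in the data itself rather than in a text column, which pandas and statsmodels time-series routines can use directly.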
+ diff --git a/_config.yml b/_config.yml index 3c14569c..9b56530f 100644 --- a/_config.yml +++ b/_config.yml @@ -13,8 +13,11 @@ heading_anchors: true exclude: - NewPageTemplate.md - - test_samples.py - conftest.py + - lostutils + - tests + - pyproject.toml + - poetry.lock plugins: - jekyll-sitemap diff --git a/conftest.py b/conftest.py index d387382b..ac6ebd3a 100644 --- a/conftest.py +++ b/conftest.py @@ -1,3 +1,8 @@ +from pathlib import Path + +import pytest + + def pytest_addoption(parser): parser.addoption( "--mdpath", @@ -19,3 +24,8 @@ def pytest_addoption(parser): default=[], help="List of languages whose code samples to run (default is all)", ) + + +@pytest.fixture(scope="session") +def fixtures_path() -> Path: + return Path(__file__).parent / "tests" / "fixtures" diff --git a/fix_links.py b/fix_links.py deleted file mode 100644 index 01d894a3..00000000 --- a/fix_links.py +++ /dev/null @@ -1,65 +0,0 @@ -import re -import sys -import textwrap -from pathlib import Path - -USAGE = textwrap.dedent( - """\ - python fix_links.py FILENAME [FILENAME...] - - Attempt to fix links in accordance with a predefined list of rules. - You must specify at least one FILENAME. Note that filenames are - treated as glob patterns relative to the working directory. After - any glob returns, we will filter for filenames that end in `.md`. -""" -) - - -def fix_md(path: Path) -> str: - """ - Given a file, read it and change all the links that look like:: - - http[s]://lost-stats.github.io/blah - - into:: - - {{ "/blah" | relative_url }} - - Args: - path: The path to transform - - Returns: - The transformed md file - """ - with open(path, "rt") as infile: - md_file = infile.read() - - return re.sub( - r"http[s]://lost-stats.github.io(/[a-zA-Z0-9/#-&=+_%\.]*)", - r'{{ "\1" | relative_url }}', - md_file, - ) - - -def main(): - if len(sys.argv) < 2: - print(USAGE, file=sys.stderr) - sys.exit(1) - - cwd = Path(".") - for pattern in sys.argv[1:]: - for path in cwd.glob(pattern): - if path.is_dir(): - # We skip directories - continue - - if path.suffix == ".md": - fixed_md = fix_md(path) - with open(path, "wt") as outfile: - outfile.write(fixed_md) - else: - print(f"Skipping {path} as it is not an md") - - -if __name__ == "__main__": - main() diff --git a/index.md b/index.md index 634dc448..caa898fc 100644 --- a/index.md +++ b/index.md @@ -19,3 +19,4 @@ allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen> LOST was originated in 2019 by Nick Huntington-Klein and is maintained by volunteer contributors. The project's GitHub page is [here](https://github.com/LOST-STATS/lost-stats.github.io). + diff --git a/lostutils/__init__.py b/lostutils/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/lostutils/cli.py b/lostutils/cli.py new file mode 100644 index 00000000..750b7022 --- /dev/null +++ b/lostutils/cli.py @@ -0,0 +1,69 @@ +from typing import List + +import click +from tqdm.cli import tqdm + +from .constants import R_DOCKER_IMAGE +from .fix_links import fix_md +from .pathutils import expand_and_filter_filenames +from .style import format_file + + +@click.group() +def cli(): + """Utilities for testing and cleaning LOST""" + pass + + +@cli.command("style") +@click.argument("filename", nargs=-1, type=click.Path()) +@click.option( + "--skip", + multiple=True, + type=click.Path(), + help="Files to skip. 
Follows same expansion rules as FILENAME", +) +@click.option( + "--docker-r", + type=str, + default=R_DOCKER_IMAGE, + help="Tag of the docker image in which to run styler for R code", +) +def style_command(filename: List[str], skip: List[str], docker_r: str): + """ + Attempt to style the code samples using black and styler. + You must specify at least one FILENAME. Note that filenames are + treated as glob patterns relative to the working directory. After + any glob returns, we will filter for filenames that end in `.md`. + """ + filenames = expand_and_filter_filenames(filename, skip) + for filename in tqdm(filenames): + fixed_md = format_file(filename, r_docker_image=docker_r) + with open(filename, "wt") as outfile: + print(fixed_md, file=outfile) + + +@cli.command("links") +@click.argument("filename", nargs=-1, type=click.Path()) +@click.option( + "--skip", + multiple=True, + type=click.Path(), + help="Files to skip. Follows same expansion rules as FILENAME", +) +def links_command(filename: List[str], skip: List[str]): + """ + Attempt to fix links in accordance with a predefined list of rules. + You must specify at least one FILENAME. Note that filenames are + treated as glob patterns relative to the working directory. After + any glob returns, we will filter for filenames that end in `.md`. + """ + filenames = expand_and_filter_filenames(filename, skip) + for filename in tqdm(filenames): + fixed_md = fix_md(filename) + with open(filename, "wt") as outfile: + outfile.write(fixed_md) + + +if __name__ == "__main__": + cli() diff --git a/lostutils/constants.py b/lostutils/constants.py new file mode 100644 index 00000000..7e168dae --- /dev/null +++ b/lostutils/constants.py @@ -0,0 +1,12 @@ +import os + + +# What docker image should Python code be executed in? +PYTHON_DOCKER_IMAGE = os.environ.get( + "LOST_PYTHON_DOCKER_IMAGE", "ghcr.io/lost-stats/docker-images/tester-python:latest" +) + +# What docker image should R code be executed in? +R_DOCKER_IMAGE = os.environ.get( + "LOST_R_DOCKER_IMAGE", "ghcr.io/lost-stats/docker-images/tester-r:latest" +) diff --git a/lostutils/fix_links.py b/lostutils/fix_links.py new file mode 100644 index 00000000..134ec516 --- /dev/null +++ b/lostutils/fix_links.py @@ -0,0 +1,30 @@ +import re +import sys +import textwrap +from pathlib import Path + + +def fix_md(path: Path) -> str: + """ + Given a file, read it and change all the links that look like:: + + http[s]://lost-stats.github.io/blah + + into:: + + {{ "/blah" | relative_url }} + + Args: + path: The path to transform + + Returns: + The transformed md file + """ + with open(path, "rt") as infile: + md_file = infile.read() + + return re.sub( + r"http[s]://lost-stats.github.io(/[a-zA-Z0-9/#-&=+_%\.]*)", + r'{{ "\1" | relative_url }}', + md_file, + ) diff --git a/lostutils/pathutils.py b/lostutils/pathutils.py new file mode 100644 index 00000000..0ed7903f --- /dev/null +++ b/lostutils/pathutils.py @@ -0,0 +1,46 @@ +from pathlib import Path +from typing import List, Union + + +def expand_filenames( + filenames: List[Union[str, Path]], cwd: Path = Path(".") +) -> List[Path]: + """ + Given a list of filenames, initially treat them as glob patterns. If a + directory appears among them, glob it again with the pattern '**/*.md'. + Finally, filter out the list for only .md files. + + Args: + filenames: The list of filenames to expand + cwd: The current working diretory relative to which the filenames are + + Returns: + The list of expanded filenames, sorted. 
+ """ + new_filenames = [] + for filename in filenames: + if any(char in str(filename) for char in ["*", "?", "["]): + # This is a glob + new_filenames.extend(cwd.glob(filename)) + else: + new_filenames.append(cwd / filename) + + output: List[Path] = [] + for filename in new_filenames: + if filename.is_dir(): + output.extend(filename.glob("**/*.md")) + elif filename.exists() and filename.suffix == ".md": + output.append(filename) + + output = [filename for filename in output if filename.exists()] + return sorted(output) + + +def expand_and_filter_filenames( + filenames: List[Union[str, Path]], + skip_filenames: List[Union[str, Path]], + cwd: Path = Path("."), +) -> List[Path]: + filenames = expand_filenames(filenames, cwd=cwd) + skip_filenames = expand_filenames(skip_filenames, cwd=cwd) + return sorted(set(filenames) - set(skip_filenames)) diff --git a/lostutils/style.py b/lostutils/style.py new file mode 100644 index 00000000..a8e1c0fa --- /dev/null +++ b/lostutils/style.py @@ -0,0 +1,89 @@ +import subprocess +from pathlib import Path +from typing import List, Union + +import black + +from .constants import R_DOCKER_IMAGE + + +def format_str( + src_string: str, parameters: str, r_docker_image: str = R_DOCKER_IMAGE +) -> str: + parameters = parameters.strip().lower() + if parameters.startswith("py"): + return "\n".join( + [ + "```" + parameters, + black.format_str(src_string, mode=black.Mode()).rstrip(), + "```", + ] + ) + + if parameters.startswith("r"): + # Should do something here + proc = subprocess.run( + [ + "docker", + "run", + "--rm", + "-i", + r_docker_image, + "Rscript", + "--vanilla", + "-e", + 'styler::style_text(readr::read_file(file("stdin")))', + ], + input=src_string.encode("utf8"), + capture_output=True, + ) + return "\n".join( + ["```" + parameters, proc.stdout.decode("utf8").rstrip(), "```"] + ) + + return "\n".join(["```" + parameters, src_string, "```"]) + + +def format_file( + filename: Union[str, Path], r_docker_image: str = R_DOCKER_IMAGE +) -> str: + filename = Path(filename) + with open(filename, "rt") as infile: + is_in_fence = False + fence_parameters = None + fenced_lines = [] + + final_lines = [] + + for line in infile: + line = line.rstrip() + if line.startswith("```"): + if is_in_fence: + # End the fence + final_lines.append( + format_str( + "\n".join(fenced_lines), + fence_parameters, + r_docker_image=r_docker_image, + ) + ) + is_in_fence = False + fenced_lines = [] + fence_parameters = None + + else: + is_in_fence = True + fence_parameters = line.strip()[3:] + + elif is_in_fence: + fenced_lines.append(line) + + else: + final_lines.append(line.rstrip()) + + # Merge lines together and remove extra whitespace + output = "\n".join(final_lines).rstrip() + + # Make sure the file ends in a newline + output += "\n" + return output diff --git a/poetry.lock b/poetry.lock index ff78f1db..0c87dcb2 100644 --- a/poetry.lock +++ b/poetry.lock @@ -24,21 +24,21 @@ python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" [[package]] name = "attrs" -version = "20.3.0" +version = "21.1.0" description = "Classes Without Boilerplate" category = "dev" optional = false python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" [package.extras] -dev = ["coverage[toml] (>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "zope.interface", "furo", "sphinx", "pre-commit"] -docs = ["furo", "sphinx", "zope.interface"] -tests = ["coverage[toml] (>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "zope.interface"] -tests_no_zope = ["coverage[toml] 
(>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six"] +dev = ["coverage[toml] (>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "mypy", "pytest-mypy-plugins", "zope.interface", "furo", "sphinx", "sphinx-notfound-page", "pre-commit"] +docs = ["furo", "sphinx", "zope.interface", "sphinx-notfound-page"] +tests = ["coverage[toml] (>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "mypy", "pytest-mypy-plugins", "zope.interface"] +tests_no_zope = ["coverage[toml] (>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "mypy", "pytest-mypy-plugins"] [[package]] name = "black" -version = "21.4b2" +version = "21.5b0" description = "The uncompromising code formatter." category = "dev" optional = false @@ -61,7 +61,7 @@ python2 = ["typed-ast (>=1.4.2)"] name = "click" version = "7.1.2" description = "Composable command line interface toolkit" -category = "dev" +category = "main" optional = false python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*" @@ -95,6 +95,19 @@ category = "dev" optional = false python-versions = "*" +[[package]] +name = "isort" +version = "5.8.0" +description = "A Python utility / library to sort Python imports." +category = "main" +optional = false +python-versions = ">=3.6,<4.0" + +[package.extras] +pipfile_deprecated_finder = ["pipreqs", "requirementslib"] +requirements_deprecated_finder = ["pipreqs", "pip-api"] +colors = ["colorama (>=0.4.3,<0.5.0)"] + [[package]] name = "mistune" version = "2.0.0rc1" @@ -159,7 +172,7 @@ python-versions = ">=2.6, !=3.0.*, !=3.1.*, !=3.2.*" [[package]] name = "pytest" -version = "6.2.3" +version = "6.2.4" description = "pytest: simple powerful testing with Python" category = "dev" optional = false @@ -223,10 +236,23 @@ category = "dev" optional = false python-versions = ">=2.6, !=3.0.*, !=3.1.*, !=3.2.*" +[[package]] +name = "tqdm" +version = "4.60.0" +description = "Fast, Extensible Progress Meter" +category = "main" +optional = false +python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,>=2.7" + +[package.extras] +dev = ["py-make (>=0.1.0)", "twine", "wheel"] +notebook = ["ipywidgets (>=6)"] +telegram = ["requests"] + [metadata] lock-version = "1.1" python-versions = "^3.8" -content-hash = "56fa5ec4352be6624f773f473f4846a551d7f5a0352b858f23fd15b2b7e76940" +content-hash = "c937c3d305ddfc235fdfdc7e73ccc48291d29c3682a824cde716844dec8ee103" [metadata.files] apipkg = [ @@ -242,12 +268,12 @@ atomicwrites = [ {file = "atomicwrites-1.4.0.tar.gz", hash = "sha256:ae70396ad1a434f9c7046fd2dd196fc04b12f9e91ffb859164193be8b6168a7a"}, ] attrs = [ - {file = "attrs-20.3.0-py2.py3-none-any.whl", hash = "sha256:31b2eced602aa8423c2aea9c76a724617ed67cf9513173fd3a4f03e3a929c7e6"}, - {file = "attrs-20.3.0.tar.gz", hash = "sha256:832aa3cde19744e49938b91fea06d69ecb9e649c93ba974535d08ad92164f700"}, + {file = "attrs-21.1.0-py2.py3-none-any.whl", hash = "sha256:8ee1e5f5a1afc5b19bdfae4fdf0c35ed324074bdce3500c939842c8f818645d9"}, + {file = "attrs-21.1.0.tar.gz", hash = "sha256:3901be1cb7c2a780f14668691474d9252c070a756be0a9ead98cfeabfa11aeb8"}, ] black = [ - {file = "black-21.4b2-py3-none-any.whl", hash = "sha256:bff7067d8bc25eb21dcfdbc8c72f2baafd9ec6de4663241a52fb904b304d391f"}, - {file = "black-21.4b2.tar.gz", hash = "sha256:fc9bcf3b482b05c1f35f6a882c079dc01b9c7795827532f4cc43c0ec88067bbc"}, + {file = "black-21.5b0-py3-none-any.whl", hash = "sha256:0e80435b8a88f383c9149ae89d671eb2095b72344b0fe8a1d61d2ff5110ed173"}, + {file = "black-21.5b0.tar.gz", hash = 
"sha256:9dc2042018ca10735366d944c2c12d9cad6dec74a3d5f679d09384ea185d9943"}, ] click = [ {file = "click-7.1.2-py2.py3-none-any.whl", hash = "sha256:dacca89f4bfadd5de3d7489b7c8a566eee0d3676333fbb50030263894c38c0dc"}, @@ -264,6 +290,10 @@ execnet = [ iniconfig = [ {file = "iniconfig-1.1.1.tar.gz", hash = "sha256:bc3af051d7d14b2ee5ef9969666def0cd1a000e121eaea580d4a313df4b37f32"}, ] +isort = [ + {file = "isort-5.8.0-py3-none-any.whl", hash = "sha256:2bb1680aad211e3c9944dbce1d4ba09a989f04e238296c87fe2139faa26d655d"}, + {file = "isort-5.8.0.tar.gz", hash = "sha256:0a943902919f65c5684ac4e0154b1ad4fac6dcaa5d9f3426b732f1c8b5419be6"}, +] mistune = [ {file = "mistune-2.0.0rc1-py2.py3-none-any.whl", hash = "sha256:02437870a8d594e61e4f6cff2f56f5104d17d56e3fea5fb234070ede5e7c2eae"}, {file = "mistune-2.0.0rc1.tar.gz", hash = "sha256:452bdba97c27efc7c1a83823ed0d6a08a9ad6ea5c67648248509f618b68aa7d9"}, @@ -293,8 +323,8 @@ pyparsing = [ {file = "pyparsing-2.4.7.tar.gz", hash = "sha256:c203ec8783bf771a155b207279b9bccb8dea02d8f0c9e5f8ead507bc3246ecc1"}, ] pytest = [ - {file = "pytest-6.2.3-py3-none-any.whl", hash = "sha256:6ad9c7bdf517a808242b998ac20063c41532a570d088d77eec1ee12b0b5574bc"}, - {file = "pytest-6.2.3.tar.gz", hash = "sha256:671238a46e4df0f3498d1c3270e5deb9b32d25134c99b7d75370a68cfbe9b634"}, + {file = "pytest-6.2.4-py3-none-any.whl", hash = "sha256:91ef2131a9bd6be8f76f1f08eac5c5317221d6ad1e143ae03894b862e8976890"}, + {file = "pytest-6.2.4.tar.gz", hash = "sha256:50bcad0a0b9c5a72c8e4e7c9855a3ad496ca6a881a3641b4260605450772c54b"}, ] pytest-forked = [ {file = "pytest-forked-1.3.0.tar.gz", hash = "sha256:6aa9ac7e00ad1a539c41bec6d21011332de671e938c7637378ec9710204e37ca"}, @@ -351,3 +381,7 @@ toml = [ {file = "toml-0.10.2-py2.py3-none-any.whl", hash = "sha256:806143ae5bfb6a3c6e736a764057db0e6a0e05e338b5630894a5f779cabb4f9b"}, {file = "toml-0.10.2.tar.gz", hash = "sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f"}, ] +tqdm = [ + {file = "tqdm-4.60.0-py2.py3-none-any.whl", hash = "sha256:daec693491c52e9498632dfbe9ccfc4882a557f5fa08982db1b4d3adbe0887c3"}, + {file = "tqdm-4.60.0.tar.gz", hash = "sha256:ebdebdb95e3477ceea267decfc0784859aa3df3e27e22d23b83e9b272bf157ae"}, +] diff --git a/pyproject.toml b/pyproject.toml index 58b05f01..764871d9 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -3,15 +3,42 @@ name = "lost-stats.github.io" version = "0.1.0" description = "" authors = ["Kevin Wilson "] +packages = [ + { include = "lostutils", from = "." 
} +] [tool.poetry.dependencies] python = "^3.8" +click = "^7.1.2" +tqdm = "^4.60.0" +isort = "^5.8.0" [tool.poetry.dev-dependencies] pytest = "^6.2.3" mistune = "2.0.0rc1" pytest-xdist = "^2.2.1" black = "^21.4b2" +isort = "^5.8.0" + +[tool.poetry.scripts] +lost = "lostutils.cli:cli" + +[tool.isort] +multi_line_output = 3 +include_trailing_comma = true +force_grid_wrap = 0 +use_parentheses = true +ensure_newline_before_comments = true +line_length = 88 + +[tool.pylint.basic] +good-names = "i,j,k,ex,Run,_,df,pc" + +[tool.pylint.messages_control] +disable = "C0330, C0326, R0912, R0913, R0914, R0915" + +[tool.pylint.format] +max-line-length = "88" [build-system] requires = ["poetry-core>=1.0.0"] diff --git a/tests/fixtures/input_bad_style.md b/tests/fixtures/input_bad_style.md new file mode 100644 index 00000000..a97e282e --- /dev/null +++ b/tests/fixtures/input_bad_style.md @@ -0,0 +1,16 @@ +# Here is a test + +```python?example=something +import pandas as pd + +df = pd.DataFrame({'a': [1, 2,3]}) + +df['a'] +=3 +``` + +```r?something +library(dplyr) +data(mtcars) + +mtcars %>% filter(a == b) %>% mutate(b = 10, c = 20, d=30) +``` \ No newline at end of file diff --git a/tests/fixtures/output_bad_style.md b/tests/fixtures/output_bad_style.md new file mode 100644 index 00000000..78ca8682 --- /dev/null +++ b/tests/fixtures/output_bad_style.md @@ -0,0 +1,18 @@ +# Here is a test + +```python?example=something +import pandas as pd + +df = pd.DataFrame({"a": [1, 2, 3]}) + +df["a"] += 3 +``` + +```r?something +library(dplyr) +data(mtcars) + +mtcars %>% + filter(a == b) %>% + mutate(b = 10, c = 20, d = 30) +``` diff --git a/tests/test_fix_md.py b/tests/test_fix_md.py new file mode 100644 index 00000000..829b1cd7 --- /dev/null +++ b/tests/test_fix_md.py @@ -0,0 +1,12 @@ +from pathlib import Path + +from lostutils.constants import R_DOCKER_IMAGE +from lostutils.style import format_file + + +def test_fix_md(fixtures_path: Path): + actual = format_file(fixtures_path / "input_bad_style.md") + with open(fixtures_path / "output_bad_style.md") as infile: + expected = infile.read() + + assert actual == expected diff --git a/test_samples.py b/tests/test_samples.py similarity index 95% rename from test_samples.py rename to tests/test_samples.py index 2ad89da5..092b91e2 100644 --- a/test_samples.py +++ b/tests/test_samples.py @@ -8,13 +8,7 @@ import mistune - -PYTHON_DOCKER_IMAGE = os.environ.get( - "LOST_PYTHON_DOCKER_IMAGE", "ghcr.io/lost-stats/docker-images/tester-python:latest" -) -R_DOCKER_IMAGE = os.environ.get( - "LOST_R_DOCKER_IMAGE", "ghcr.io/lost-stats/docker-images/tester-r:latest" -) +from lostutils.constants import PYTHON_DOCKER_IMAGE, R_DOCKER_IMAGE @dataclass(frozen=True) @@ -134,7 +128,7 @@ def _expand_paths(paths: List[Path]) -> List[Path]: def pytest_generate_tests(metafunc): base_paths = list(map(Path, metafunc.config.getoption("mdpath"))) - base_paths = base_paths or [Path(__file__).parent] + base_paths = base_paths or [Path(__file__).parent.parent] base_exclude_paths = list(map(Path, metafunc.config.getoption("xmdpath")))