---
title: "assignment-b4"
author: "Alex Pieters"
date: "2024-12-03"
output: github_document
---

## [Instructions]{.underline}

The aim of this assignment is to complete two of three exercises using concepts and tools covered in class (e.g., `stringr`, `purrr`, `regex`, `tidyverse`). I chose exercises 1 and 3, which have the following instructions:

### Exercise 1

Take a Jane Austen book contained in the `janeaustenr` package, or another book from some other source, such as one of the many freely available books from [Project Gutenberg](https://dev.gutenberg.org/) (be sure to indicate where you got the book from). Make a plot of the most common words in the book, removing “stop words” of your choosing (words like “the”, “a”, etc.) or stopwords from a pre-defined source, like the `stopwords` package or `tidytext::stop_words`.

If you use any resources for helping you remove stopwords, or some other resource besides the `janeaustenr` R package for accessing your book, please indicate the source. We aren’t requiring any formal citation styles, just make sure you name the source and link to it.

### Exercise 3

For this exercise, you’ll be evaluating a model that’s fit separately for each group in some dataset. You should fit these models with some question in mind.

Examples (do not use these examples):

- Maybe your model is a linear model (using `lm()`) for each country’s life expectancy over time in the `gapminder::gapminder` dataset, where you are interested in each country’s overall trend in life expectancy.

- Maybe your model is a distribution of body mass (using the `distplyr` [package](https://distplyr.netlify.app/)) for each penguin species in the `palmerpenguins::penguins` dataset, so that you can produce parametric prediction intervals (such as from a Normal distribution instead of using the `quantile()` function) for each species.

Your tasks are as follows:

1. Make a column of model objects. Do this using the appropriate mapping function from the `purrr` package. Note: it’s possible you’ll have to make use of nesting here.

2. Evaluate the model in a way that interests you, but evaluate something other than a single number for each group. Hint: you’ll need to use another `purrr` mapping function.

3. Print out this intermediate tibble for inspection (perhaps others as well, if it makes sense to do so).

4. Unnest the resulting calculations, and print your final tibble to screen. Make sure your tibble makes sense: column names are appropriate, and you’ve gotten rid of columns that no longer make sense.

5. Produce a plot communicating something about the result.

6. Walk a reader through this analysis by providing an explanation as to what’s going on (in terms of a data analysis question and results, not necessarily in terms of what’s happening in the code).

Grading is as follows:

- 7.5 points: the “story” behind your data analysis.

- 7.5 points: the plot.

- 22.5 points: the calculations.

## [Solutions]{.underline}

The exercises require the following packages to run the code:

```{r, message = FALSE, warning = FALSE}
## Load required packages
# If needed, install the packages before loading them with install.packages("name") in the console
library(janeaustenr)
library(stopwords)
library(tidyverse)
library(broom)
```

### Exercise 1

In this exercise, I will plot the most common words in Jane Austen's book *Emma* from the `janeaustenr` package, excluding stop words from the `stopwords` package.

First, let's find all of the words in *Emma*, remove the stop words, and create a tibble.

```{r}
words_without_stopwords <- emma %>%
  str_replace_all("[[:punct:]&&[^-']]|\\d|--+", " ") %>% # replace punctuation (except apostrophes and hyphens, e.g., Emma's, to-day), digits, and runs of two or more hyphens with a space
  str_split(" ") %>% # split each string on spaces
  unlist() %>% # flatten into an atomic character vector
  str_remove_all("^'|'$") %>% # remove leading or trailing apostrophes
  str_to_lower() %>% # convert strings to lower case
  tibble(word = .) %>% # create a tibble with a column named word
  filter(!word %in% stopwords("en") & word != "") %>% # remove stop words (via the stopwords package) and empty strings
  arrange(word) # arrange alphabetically
```
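
As a quick sanity check of the cleaning pattern, here is what it does to a short made-up sentence (the sample string is my own, not a line from the book): apostrophes and single hyphens survive, while other punctuation, digits, and double hyphens become spaces.

```{r}
library(stringr) # already loaded above via tidyverse

# Hypothetical sample line to illustrate the cleaning pattern
sample_line <- "Emma's to-day, 1815 -- \"visit!\""
cleaned <- str_replace_all(sample_line, "[[:punct:]&&[^-']]|\\d|--+", " ")
cleaned
```

Note that `[[:punct:]&&[^-']]` uses ICU character-class intersection (supported by `stringr`), keeping only punctuation that is neither a hyphen nor an apostrophe.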

Now let's count the number of repetitions of each word and sort them in descending order.

```{r}
word_repetition <- words_without_stopwords %>%
  count(word, sort = TRUE) # count the occurrences of each word and sort in descending order

length(word_repetition$word) # number of unique words in the book, excluding stop words
```
| 93 | + |
| 94 | +There are 7259 unique words, not including stop words, in the book Emma. Therefore, plotting the frequency of all the words in the book would be messy. Instead, let's plot the frequency of the thirty most used words in a barplot. |
| 95 | + |
| 96 | +```{r} |
| 97 | +word_repetition %>% |
| 98 | + head(30) %>% # select the thirty most commonly used words |
| 99 | + ggplot(aes(x=n,y=reorder(word,n))) + # reorder words based on frequency |
| 100 | + geom_bar(stat="identity", fill ="darkblue") + # set length of bars equal to count value and set color |
| 101 | + labs(x="Count",y="Word",title="Frequency of the thirty most used words in Jane Austen's book Emma") + |
| 102 | + theme_bw() |
| 103 | +``` |
| 104 | + |
| 105 | +### Exercise 3 |
| 106 | + |
| 107 | +In this exercise, I will evaluate a linear model for each car’s horsepower versus mileage using the `mtcars` dataset for assessing the hypothesis that these variables are inversely proportional. My research question therefore becomes: does a car's mileage increase with decreasing horsepower and how does the transmission type affect these results? |
| 108 | + |
| 109 | +First, let's apply a linear model to evaluate mileage (mpg) and horsepower (hp) for each transmission type (automatic versus manual). |
| 110 | + |
```{r}
compact_hp_vs_mpg <- mtcars %>%
  select(am, mpg, hp) %>% # select columns of interest: transmission (am, where 0 = automatic and 1 = manual), mileage, and horsepower
  nest(data = c(mpg, hp)) %>% # nest mpg and hp for each transmission type
  mutate(model = map(data, ~ lm(hp ~ mpg, data = .x))) # fit a linear model (hp ~ mpg) for each transmission type
print(compact_hp_vs_mpg)
```
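
The nest-then-map pattern used here generalizes to any per-group model. A minimal standalone sketch of the same idea, using toy data and hypothetical column names of my own (not from `mtcars`):

```{r}
library(tidyverse)

# Toy data: two groups, each with an exact linear relationship between x and y
toy <- tibble(
  g = rep(c("a", "b"), each = 5),
  x = rep(1:5, times = 2),
  y = c(2 * (1:5), 3 * (1:5))
)

per_group <- toy %>%
  nest(data = c(x, y)) %>%                           # one row per group, data as a list-column
  mutate(model = map(data, ~ lm(y ~ x, data = .x)))  # one fitted lm object per group
per_group
```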

Now, let's extract the model output using `broom::augment()`. The output includes the selected data, the fitted values, and the standard errors of the fitted values.

```{r}
expanded_hp_vs_mpg <- compact_hp_vs_mpg %>%
  mutate(am, yhat = map(model, ~ augment(.x, se_fit = TRUE)), .keep = "none") %>% # extract the model output, including the standard error of the fitted values
  unnest(yhat) # expand the output into one row per observation
print(expanded_hp_vs_mpg)
```

Next, let's calculate the 95% confidence interval (CI) of the fitted values and keep the columns of interest (transmission type, horsepower, mileage, CI bounds) before plotting.

```{r}
hp_vs_mpg <- expanded_hp_vs_mpg %>%
  select(am, hp, mpg, .fitted, .se.fit) %>%
  mutate(lower_95p_ci = .fitted - 1.96 * .se.fit) %>% # lower bound of the 95% confidence interval (z-score = 1.96)
  mutate(upper_95p_ci = .fitted + 1.96 * .se.fit) %>% # upper bound of the 95% confidence interval (z-score = 1.96)
  rename(transmission = am) %>% # rename am to transmission
  mutate(transmission = case_when(
    transmission == 0 ~ "A",
    transmission == 1 ~ "M")) %>% # recode 0 and 1 as automatic (A) and manual (M)
  select(-.se.fit) # remove column no longer needed for plotting

print(hp_vs_mpg)
```
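
An aside on the 1.96 multiplier used above: it is the standard normal quantile for a two-sided 95% interval, which can be verified directly. With the modest group sizes in `mtcars`, a t-based multiplier (`qt(0.975, df)`) would give a slightly wider interval, but the normal approximation is used here for simplicity.

```{r}
# 97.5th percentile of the standard normal: the two-sided 95% multiplier
z <- qnorm(0.975)
round(z, 2) # 1.96
```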

Finally, let's plot horsepower against mileage and overlay, for each transmission type, the regression line and the 95% confidence interval of the linear prediction.

```{r}
hp_vs_mpg %>%
  ggplot(aes(mpg, hp, colour = transmission)) + # plot hp (y) versus mpg (x)
  geom_point(alpha = 0.4) + # adjust alpha to show the overlap between the transmission types
  geom_line(aes(y = .fitted), linewidth = 0.8) + # add regression lines
  geom_ribbon(aes(ymin = lower_95p_ci, # lower 95% CI bound
                  ymax = upper_95p_ci), # upper 95% CI bound
              alpha = 0.1) +
  scale_colour_manual(values = c("lightblue", "darkred"), # set the colour for each group
                      name = "Transmission",
                      labels = c("Automatic", "Manual")) +
  labs(x = "Miles/(US) gallon", y = "Gross horsepower", title = "Horsepower versus mileage for different transmission cars") +
  theme_bw()
```

[Key takeaways:]{.underline}

- The figure shows that horsepower (hp) generally decreases as mileage (mpg) increases, and that manual cars have higher horsepower than automatic cars at a given mileage.

- The confidence interval for automatic cars is narrower than that for manual cars, suggesting that the linear model predicts hp from a given mpg more precisely for automatic cars.