---
title: "Exercise 1"
author: "Jeenat Mehareen"
date: "2024-12-03"
output: md_document
---

```{r setup, include=FALSE}
library(dplyr)
library(ggplot2)
library(janeaustenr)
library(tidytext)
```

## Background
Using the *janeaustenr* package, I will filter for the book "Emma", count how often each word occurs, and plot the frequency of the most common words in a bar chart using ggplot2. Before counting, I will remove stop words using the `stop_words` dataset from the tidytext package.


```{r}
# Filtering for "Emma" specifically from the package "janeaustenr"
emma_book <- austen_books() %>%
    filter(book == "Emma") %>%
    mutate(line = row_number())


# breaking text into words
emma_tidied <- emma_book %>%
    unnest_tokens(word, text)


# removing stop words using tidytext::stop_words
emma_filtered <- emma_tidied %>%
    anti_join(tidytext::stop_words, by = "word")

# counting frequency of most common words used in the book in descending order
emma_word_count <- emma_filtered %>%
    count(word, sort = TRUE)

# extracting the top 15 most frequently occurring words for visualization
# (emma_word_count is already sorted in descending order by count)
emma_top_words <- emma_word_count %>%
    slice_head(n = 15)
```
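
To make the tokenising and stop-word steps above easier to follow, here is a minimal sketch on a small made-up sample. The two sentences and the names `toy_text` and `toy_counts` are hypothetical and are not taken from the book; the pipeline is the same one applied to "Emma" above.

```{r}
library(dplyr)
library(tidytext)

# Hypothetical two-line sample (not actual text from the novel)
toy_text <- tibble(
    line = 1:2,
    text = c("Emma was the friend of Harriet.",
             "Harriet and Emma were at the party.")
)

toy_counts <- toy_text %>%
    # one row per word; unnest_tokens lowercases and strips punctuation by default
    unnest_tokens(word, text) %>%
    # keep only rows whose word does NOT appear in the stop_words dataset
    anti_join(stop_words, by = "word") %>%
    # tally the remaining words, most frequent first
    count(word, sort = TRUE)

toy_counts
```

Common words such as "the", "was", and "and" are dropped by the `anti_join()`, while "emma" and "harriet" are each counted twice.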

```{r}

# Create the plot
graph_exercise1 <- ggplot(emma_top_words, aes(x = reorder(word, n), y = n)) +
    geom_col(fill = "pink") +
    coord_flip() +
    labs(
        title = "Top 15 Most Common Words in 'Emma' by Jane Austen",
        x = "Words",
        y = "Frequency"
    ) +
    theme_minimal()

# Display the plot
graph_exercise1
```
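
The plot relies on `reorder(word, n)` to sort the bars. As a small illustration of what that call does, the sketch below uses a hypothetical three-row data frame (`toy` and `ordered_words` are made-up names): `reorder()` returns a factor whose levels are sorted by increasing `n`, and after `coord_flip()` the first level is drawn at the bottom, so the most frequent word ends up at the top of the chart.

```{r}
# Hypothetical counts, for illustration only
toy <- data.frame(word = c("b", "a", "c"), n = c(5, 20, 10))

# Factor levels are ordered by increasing n rather than alphabetically
ordered_words <- reorder(toy$word, toy$n)
levels(ordered_words)
```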