Commit d3b8d7b

added vignette
Author: Youzhi Yu
1 parent ea31a25, commit d3b8d7b

7 files changed, +223 -15 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@
33
.RData
44
.Ruserdata
55
inst/doc
6+
inst/extdata/ata_tweets.csv

DESCRIPTION

Lines changed: 7 additions & 1 deletion
@@ -23,5 +23,11 @@ Imports:
2323
tidyr,
2424
utils
2525
Suggests:
26-
testthat (>= 3.0.0)
26+
rmarkdown,
27+
knitr,
28+
testthat (>= 3.0.0),
29+
ggplot2,
30+
readr,
31+
forcats
2732
Config/testthat/edition: 3
33+
VignetteBuilder: knitr

R/emoji-extraction.R

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ emoji_extract_unnest <- function(tweet_tbl, tweet_text){
3131
dplyr::mutate(row_number = dplyr::row_number()) %>%
3232
tidyr::unnest(.emoji_unicode) %>%
3333
dplyr::group_by(row_number, .emoji_unicode) %>%
34-
dplyr::summarize(emoji_count = dplyr::n()) %>%
34+
dplyr::summarize(.emoji_count = dplyr::n()) %>%
3535
dplyr::ungroup()
3636

3737
}

README.Rmd

Lines changed: 1 addition & 1 deletion
@@ -100,7 +100,7 @@ tweet_df %>%
100100
emoji_extract_unnest(tweets)
101101
```
102102

103-
When looking at the tibble above, it has three columns: `row_number`, `.emoji_unicode`, and `emoji_count`. `row_number` is which row each Tweet is located in the raw data. This can give users a global overview of Emoji and counts.
103+
When looking at the tibble above, it has three columns: `row_number`, `.emoji_unicode`, and `.emoji_count`. `row_number` is which row each Tweet is located in the raw data. This can give users a global overview of Emoji and count.
104104

105105

106106
- `emoji_extract_nest()`:

README.md

Lines changed: 12 additions & 12 deletions
@@ -110,22 +110,22 @@ output. By default, it is 20.
110110
tweet_df %>%
111111
emoji_extract_unnest(tweets)
112112
#> # A tibble: 8 x 3
113-
#> row_number .emoji_unicode emoji_count
114-
#> <int> <chr> <int>
115-
#> 1 1 "\U0001f600" 1
116-
#> 2 1 "\U0001f603" 2
117-
#> 3 2 "\U0001f601" 1
118-
#> 4 2 "\U0001f605" 1
119-
#> 5 2 "\U0001f606" 1
120-
#> 6 4 "\U0001f637" 4
121-
#> 7 6 "\U0001f3c1" 1
122-
#> 8 6 "\U0001f600" 1
113+
#> row_number .emoji_unicode .emoji_count
114+
#> <int> <chr> <int>
115+
#> 1 1 "\U0001f600" 1
116+
#> 2 1 "\U0001f603" 2
117+
#> 3 2 "\U0001f601" 1
118+
#> 4 2 "\U0001f605" 1
119+
#> 5 2 "\U0001f606" 1
120+
#> 6 4 "\U0001f637" 4
121+
#> 7 6 "\U0001f3c1" 1
122+
#> 8 6 "\U0001f600" 1
123123
```
124124

125125
When looking at the tibble above, it has three columns: `row_number`,
126-
`.emoji_unicode`, and `emoji_count`. `row_number` is which row each
126+
`.emoji_unicode`, and `.emoji_count`. `row_number` is which row each
127127
Tweet is located in the raw data. This can give users a global overview
128-
of Emoji and counts.
128+
of Emoji and count.
129129

130130
- `emoji_extract_nest()`:
131131

vignettes/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
*.html
2+
*.R

vignettes/introduction.Rmd

Lines changed: 199 additions & 0 deletions
@@ -0,0 +1,199 @@
1+
---
2+
title: "Introduction to tidyEmoji"
3+
author: "Youzhi Yu"
4+
date: "`r Sys.Date()`"
5+
output: rmarkdown::html_vignette
6+
vignette: >
7+
%\VignetteIndexEntry{Introduction to tidyEmoji}
8+
%\VignetteEngine{knitr::rmarkdown}
9+
%\VignetteEncoding{UTF-8}
10+
---
11+
12+
```{r, include = FALSE}
13+
knitr::opts_chunk$set(
14+
collapse = TRUE,
15+
message = FALSE,
16+
warning = FALSE,
17+
fig.width = 8,
18+
fig.height = 5,
19+
comment = "#>"
20+
)
21+
```
22+
23+
## Why use the package?
24+
25+
Extracting Emoji from text is not an easy task. This is especially true when researchers want to understand the distribution of Emoji across a full corpus of text, because Emoji Unicode does not work well with regular expressions. (With the stringr package, the filter step is `str_detect(text, "\Uhhhhhhhh")`.) Part of the difficulty is that each Emoji Unicode has to be entered explicitly to find out how many pieces of text contain that Emoji, and entering every existing Emoji Unicode one by one would be daunting. Another challenge is that not all Unicode characters are Emoji. In other words, even if we could filter all text containing some kind of Unicode character, not all of that text would contain Emoji per se.
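As a minimal sketch of the manual approach described above, the following uses three made-up Tweets (not taken from `ata_tweets`) and two hand-picked Emoji code points. It only illustrates why listing code points by hand does not scale; it is not how tidyEmoji works internally.

```r
library(stringr)

# Made-up example Tweets, for illustration only.
tweets <- c(
  "Good morning \U0001f600",
  "No emoji here",
  "Masks on \U0001f637\U0001f637"
)

# Each Emoji has to be spelled out by its own Unicode escape.
has_grinning <- str_detect(tweets, "\U0001f600")
has_mask <- str_detect(tweets, "\U0001f637")

has_grinning
has_mask

# Covering more Emoji means adding more code points to the pattern by hand.
has_either <- str_detect(tweets, "\U0001f600|\U0001f637")
has_either
```

Every additional Emoji needs another alternative in the pattern, which is the burden the package is meant to remove.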
26+
27+
## When to use the package?
28+
29+
This package is specifically designed for working with Emoji-related text. The ideal use case is analyzing Tweets, which contain Emoji from time to time.
30+
31+
## How to use the package?
32+
33+
In this section, I use 10000 Tweets from Atlanta, Georgia to give a comprehensive introduction to tidyEmoji.
34+
35+
A few extra packages are loaded to help make the plots presented in the vignette more organized.
36+
37+
```{r setup}
38+
library(tidyEmoji)
39+
library(ggplot2)
40+
library(dplyr)
41+
```
42+
43+
Load the data:
44+
45+
```{r}
46+
ata_tweets <- readr::read_csv(system.file("extdata", "ata_tweets.csv", package = "tidyEmoji"))
47+
```
48+
The `full_text` column in `ata_tweets` is where the actual Tweets are located.
49+
50+
First off, we can use `emoji_summary()` to see how many Emoji Tweets the data has.
51+
52+
- `emoji_summary()`:
53+
54+
```{r}
55+
ata_tweets %>%
56+
emoji_summary(full_text)
57+
```
58+
59+
The raw data has 10000 Tweets in total, 2841 of which have at least one Emoji.
60+
61+
62+
If users want to filter the Emoji Tweets, the `emoji_tweets()` function is specifically designed for this purpose. Researchers might be interested in finding differences (such as sentiment or timestamp differences) between Emoji and non-Emoji Tweets.
63+
64+
- `emoji_extract_nest()` / `emoji_extract_unnest()`:
65+
66+
If users would like to see how many Emoji each Tweet has, `emoji_extract_nest()` can help. The function preserves the raw data, in this case `ata_tweets`. The only change is that it adds a list column, `.emoji_unicode`, which lets users see how many Emoji each Tweet has.
67+
68+
```{r}
69+
ata_tweets %>%
70+
emoji_extract_nest(full_text) %>%
71+
select(.emoji_unicode)
72+
```
73+
From the output above, we can see immediately that the first two Tweets do not have any Emoji, while the third and fourth each have one. If users want to see what exactly each Emoji Unicode is, they can either `unnest(.emoji_unicode)` or simply use `emoji_extract_unnest()` as follows:
74+
75+
76+
```{r}
77+
emoji_count_per_tweet <- ata_tweets %>%
78+
emoji_extract_unnest(full_text)
79+
80+
emoji_count_per_tweet
81+
```
82+
83+
`emoji_extract_unnest()` filters out non-Emoji Tweets and outputs the row number of each Emoji Tweet in the `row_number` column, along with the Emoji Unicode(s) present in each Tweet. `.emoji_count` records how many times that `.emoji_unicode` appears in the Tweet.
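To make the shape of this output concrete, here is a rough sketch on a made-up tibble. It does not call tidyEmoji: the `.emoji_unicode` list column is typed in by hand, and the counting steps loosely mirror the `mutate()`, `unnest()`, and per-group count visible in `R/emoji-extraction.R` in this commit. The names `toy` and `toy_counts` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Made-up data: a hand-written list column stands in for extracted Emoji.
toy <- tibble(
  tweets = c("first", "second", "third"),
  .emoji_unicode = list(
    c("\U0001f600", "\U0001f603", "\U0001f603"), # two distinct Emoji, one repeated
    character(0),                                 # a Tweet with no Emoji
    "\U0001f637"
  )
)

toy_counts <- toy %>%
  mutate(row_number = row_number()) %>%  # remember each Tweet's original row
  unnest(.emoji_unicode) %>%             # one row per Emoji; empty entries drop out
  count(row_number, .emoji_unicode, name = ".emoji_count")

toy_counts
```

The second Tweet disappears because it has no Emoji, and the first Tweet yields two rows, one per distinct Emoji.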
84+
85+
The following plot shows the distribution of Emoji Tweets by the number of Emoji they contain.
86+
87+
```{r}
88+
emoji_count_per_tweet %>%
89+
count(row_number, wt = .emoji_count, name = ".emoji_count") %>%
90+
count(.emoji_count) %>%
91+
ggplot(aes(.emoji_count, n)) +
92+
geom_col() +
93+
scale_x_continuous(breaks = seq(1,15)) +
94+
ggtitle("How many Emoji does each Emoji Tweet have?")
95+
```
96+
97+
As the plot above shows, most Emoji Tweets have only 1 Emoji, and far fewer have more than 1.
98+
99+
- `top_n_emojis()`:
100+
101+
```{r}
102+
top_20_emojis <- ata_tweets %>%
103+
top_n_emojis(full_text)
104+
105+
top_20_emojis
106+
```
107+
108+
`top_n_emojis()` counts every Emoji in the entire text corpus and outputs the `n` most frequent ones. By default, `n` is 20.
109+
110+
Here are the top 20 Emojis from `ata_tweets`:
111+
112+
```{r}
113+
top_20_emojis %>%
114+
ggplot(aes(n, emoji_name, fill = emoji_category)) +
115+
geom_col()
116+
```
117+
118+
Tidy up the plot:
119+
120+
```{r}
121+
top_20_emojis %>%
122+
mutate(emoji_name = stringr::str_replace_all(emoji_name, "_", " "),
123+
emoji_name = forcats::fct_reorder(emoji_name, n)) %>%
124+
ggplot(aes(n, emoji_name, fill = emoji_category)) +
125+
geom_col() +
126+
labs(x = "# of Emoji",
127+
y = "Emoji name",
128+
fill = "Emoji category",
129+
title = "The 20 most popular Emojis")
130+
```
131+
132+
Besides having Emoji names, users can put the actual Emoji on the plot:
133+
134+
```{r}
135+
top_20_emojis %>%
136+
mutate(emoji_name = stringr::str_replace_all(emoji_name, "_", " "),
137+
emoji_name = forcats::fct_reorder(emoji_name, n)) %>%
138+
ggplot(aes(n, emoji_name, fill = emoji_category)) +
139+
geom_col() +
140+
geom_text(aes(label = unicode), hjust = 0.1) +
141+
labs(x = "# of Emoji",
142+
y = "Emoji name",
143+
fill = "Emoji category",
144+
title = "The 20 most popular Emojis")
145+
```
146+
147+
With the actual Emoji on the plot, it is easier to see what each Emoji name stands for.
148+
149+
Users can choose `n` based on their preferences. Here we would like to output the 10 most popular Emojis from `ata_tweets`:
150+
151+
```{r}
152+
ata_tweets %>%
153+
top_n_emojis(full_text, n = 10) %>%
154+
ggplot(aes(n, emoji_name, fill = emoji_category)) +
155+
geom_col()
156+
```
157+
158+
- `emoji_categorize()`:
159+
160+
```{r}
161+
ata_emoji_category <- ata_tweets %>%
162+
emoji_categorize(full_text) %>%
163+
select(.emoji_category)
164+
165+
ata_emoji_category
166+
```
167+
168+
Emojis fall into 10 different categories. For more information, type `category_unicode_crosswalk` at the console.
169+
170+
If users want to classify each Emoji Tweet by its category or categories, `emoji_categorize()` is the right function to use. `.emoji_category` is an added column indicating the Emoji category for each Tweet. If a Tweet has more than one category, `|` is used to separate the categories.
171+
172+
The following plot shows the Emoji categories that appear more than 20 times among all Tweets:
173+
174+
```{r}
175+
ata_emoji_category %>%
176+
count(.emoji_category) %>%
177+
filter(n > 20) %>%
178+
mutate(.emoji_category = forcats::fct_reorder(.emoji_category, n)) %>%
179+
ggplot(aes(n, .emoji_category)) +
180+
geom_col()
181+
```
182+
183+
If users want to see only the 10 individual categories, `separate_rows()` from the tidyr package can split the combined categories on `|`.
184+
185+
```{r}
186+
ata_emoji_category %>%
187+
tidyr::separate_rows(.emoji_category, sep = "\\|") %>%
188+
count(.emoji_category) %>%
189+
mutate(.emoji_category = forcats::fct_reorder(.emoji_category, n)) %>%
190+
ggplot(aes(n, .emoji_category)) +
191+
geom_col()
192+
```
193+
194+
Here we see more than 2000 Tweets fall into the "Smileys & Emotion" category, and the second most popular category is "People & Body". One caveat is that some Tweets are counted more than once: a Tweet that falls into several categories is counted in each of them.
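The double counting can be seen on a small made-up example. The three category strings below are invented for illustration and only assume the `|`-separated format described earlier; `toy_category` and `toy_split` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Three made-up Tweets; the second one belongs to two categories.
toy_category <- tibble(
  .emoji_category = c(
    "Smileys & Emotion",
    "Smileys & Emotion|People & Body",
    "People & Body"
  )
)

toy_split <- toy_category %>%
  separate_rows(.emoji_category, sep = "\\|") %>%  # `|` must be escaped in a regex
  count(.emoji_category)

toy_split

# The per-category counts add up to more than the number of Tweets.
sum(toy_split$n)
nrow(toy_category)
```

The category totals therefore describe category memberships, not distinct Tweets.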
195+
196+
To shed a bit more light on how users may use `emoji_categorize()` for further analysis, they can look at how the categories are correlated with one another. In other words, if Emoji from one category appear in a Tweet, which other categories' Emoji are more likely to appear in the same Tweet? To visualize the result, a graph/network visualization is appropriate.
197+
198+
199+
