---
title: An R Markdown document converted from "IntroToDataVis_r.ipynb"
output: html_document
---
# Hands-on with R + ggplot2
# 1. Improving Pie Charts
*What is wrong with this figure?*

## Let's agree that this is a monstrosity. Now, how do we improve it?
```{r}
# import the necessary library
library(ggplot2)
```
## 1.1. Read in the data
*This is a made-up data set from a colleague of mine. We have 10 items, each with a text label and a numeric value.*
*I'm using `read.csv` to read in the data.*
```{r}
url <- 'https://drive.google.com/file/d/1iWAtKk7aOinwb-pJ-Cy-hiB5xBv3Z-b5/view?usp=sharing'
# Extract file ID from the URL
file_id <- strsplit(url, "/")[[1]][6]
# Construct the direct download link
direct_url <- paste0('https://drive.google.com/uc?id=', file_id)
data <- read.csv(direct_url)
data
```
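*The positional index `[6]` above relies on the share link having exactly this shape. As a sketch of a slightly more defensive alternative, the helper below (a hypothetical name, `drive_file_id`, not part of any package) pulls out whatever follows `/d/` with a regular expression. It assumes the link contains a `/d/<id>` segment and stops with an error otherwise.*

```{r}
# Hypothetical helper for this sketch: extract the file ID that follows "/d/" in a Google Drive share link
drive_file_id <- function(share_url) {
  hit <- regmatches(share_url, regexec("/d/([^/?]+)", share_url))[[1]]
  if (length(hit) < 2) {
    stop("Could not find a '/d/<id>' segment in the URL")
  }
  hit[2] # element 1 is the full match, element 2 is the captured ID
}

# Same share link as above, stored under a separate name so nothing is overwritten
share_url_demo <- 'https://drive.google.com/file/d/1iWAtKk7aOinwb-pJ-Cy-hiB5xBv3Z-b5/view?usp=sharing'
drive_file_id(share_url_demo)
```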
## 1.2. For many use cases (including this one), a bar chart is a better option than a pie chart.
*Humans can more easily interpret differences in bar charts. Pie charts require us to compare areas, which is slow, while bar charts let us compare positions, which is fast. Generally, you should choose a bar chart over a pie chart when:*

- *There are too many categories to easily distinguish between pie chart areas (as we have here).*
- *Slice sizes in the pie chart are too similar (as we have here).*
- *You have multiple data sets (which we do not have here).*
- *The raw percentages are at least as meaningful as the fraction of a whole (as we have here).*

*Pie charts are only useful when there are few categories, each category has a very different percentage, AND the purpose of your visualization is to show fractions of a whole.*

*Here is the default bar chart from ggplot. It leaves a lot to be desired...*
```{r}
ggplot(data, aes(x = Label, y = Value)) +
  geom_bar(stat = "identity") # use stat = "identity" because we are supplying the actual bar values
```
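*To see the "similar slice sizes" problem side by side, here is a small sketch that draws the same values as a pie chart and as a bar chart. The data frame `pie_demo` is invented for this illustration (it is not the data set above). `geom_col()` is ggplot2's shorthand for `geom_bar(stat = "identity")`, and a pie chart in ggplot2 is a stacked bar drawn in polar coordinates.*

```{r}
library(ggplot2)

# Made-up values for illustration only: six categories with nearly equal shares
pie_demo <- data.frame(
  Label = paste("Item", LETTERS[1:6]),
  Value = c(18, 17, 17, 16, 16, 16)
)

# Pie chart: one stacked bar wrapped around a circle
pie_plot <- ggplot(pie_demo, aes(x = "", y = Value, fill = Label)) +
  geom_col(width = 1) +
  coord_polar(theta = "y")

# Bar chart: the same values compared by position along a common axis
bar_plot <- ggplot(pie_demo, aes(x = Label, y = Value)) +
  geom_col()

pie_plot
bar_plot
```

*In the pie, the 18% slice is hard to tell apart from the 16% slices. In the bar chart, the difference is small but visible.*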
## 1.3. Improve the axis labels and add a plot title
*The text labels for the bars are unreadable. How should we fix that?*
```{r}
ggplot(data, aes(x = Label, y = Value)) +
  geom_bar(stat = "identity") + # use stat = "identity" because we are supplying the actual bar values
  labs(title = "Percentage of Poor Usage", x = "", y = "Percent")
```
## 1.4. Fix the bar text, sort the data, add the percentage values to each bar
```{r}
ggplot(data, aes(x = reorder(Label, Value), y = Value)) +
  geom_bar(stat = "identity") + # use stat = "identity" because we are supplying the actual bar values
  labs(title = "Percentage of Poor Usage", x = "", y = "") +
  coord_flip() + # this flips the plot to horizontal
  geom_text(aes(label = paste0(Value, "%")), vjust = 0, hjust = -0.2) + # add labels
  ylim(0, 11) # add some space for the text labels; since we flipped the plot we use "ylim" (instead of "xlim")
```
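*The sorting above comes from `reorder(Label, Value)`, which turns the labels into a factor whose levels are ordered by the values. A minimal base R sketch, with made-up labels and values, shows what it returns:*

```{r}
# Made-up labels and values for illustration only
labels_demo <- c("B", "A", "C")
values_demo <- c(3, 9, 1)

# reorder() sorts the factor levels by increasing value
sorted_labels <- reorder(labels_demo, values_demo)
levels(sorted_labels)
```

*ggplot2 draws discrete axes in level order, so after `coord_flip()` the smallest value ends up at the bottom and the largest at the top. Using `reorder(Label, -Value)` would reverse that.*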
## 1.5. Clean this up a bit
- *I don't want the grid lines anymore*
- *We can remove the axes entirely*
- *Make the font larger*
- *Let's change the colors, and highlight one of them*
- *Save the plot*
```{r fig.height=8, fig.width=15, message=TRUE}
# Make the plot wider for display (these repr options apply in Jupyter; when knitting, the chunk's fig.width and fig.height are used)
options(repr.plot.width = 15, repr.plot.height = 8)
ggplot(
  data,
  aes(
    x = reorder(Label, Value),
    y = Value,
    fill = factor(ifelse(Label == "Color Choice", "Highlighted", "Normal")) # to highlight one bar
  )
) +
  geom_bar(stat = "identity", show.legend = FALSE) + # use stat = "identity" because we are supplying the actual bar values
  labs(title = "Percentage of Poor Usage in Data Visualization", x = "", y = "") +
  coord_flip() + # this flips the plot to horizontal
  geom_text(aes(label = paste0(Value, "%")), vjust = 0, hjust = -0.2, size = 6) + # add labels
  ylim(0, 11) + # add some space for the text labels; since we flipped the plot we use "ylim" (instead of "xlim")
  scale_fill_manual(name = "", values = c(Highlighted = "orange", Normal = "grey50")) + # named colors, so each color stays tied to its group
  theme_classic() + # there are many themes to choose from : https://ggplot2.tidyverse.org/reference/ggtheme.html
  theme(
    axis.line = element_blank(), # remove the remaining axis lines
    axis.text.x = element_blank(), # remove x axis labels
    axis.ticks.x = element_blank(), # remove x axis ticks
    axis.ticks.y = element_blank(), # remove y axis ticks
    axis.text = element_text(size = 20), # increase the font size of the labels
    plot.title = element_text(size = 30) # increase the font size of the title
  )
# save the figure (have to specify size here again)
# ggsave("bar_r.pdf", device = "pdf", width = 15, height = 8)
```
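*For the "save the plot" step, here is a small sketch of `ggsave()` in action. By default `ggsave()` writes the last plot that was displayed. Storing the plot in a variable and passing it through the `plot` argument makes that explicit. The toy data and the file name `demo_bar.pdf` are made up for this sketch, and the file goes to a temporary directory so nothing in your project is overwritten.*

```{r}
library(ggplot2)

# Toy plot for illustration only
demo_plot <- ggplot(data.frame(x = c("a", "b"), y = c(1, 2)), aes(x = x, y = y)) +
  geom_col()

# Write to a temporary location; the size is given in inches by default
out_file <- file.path(tempdir(), "demo_bar.pdf")
ggsave(out_file, plot = demo_plot, device = "pdf", width = 15, height = 8)

file.exists(out_file)
```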
# 2. Scatter Plots
```{r}
# import the necessary library
library(ggplot2)
```
## 2.1. Read in the data
*I downloaded [2024 Chicago taxi data](https://data.cityofchicago.org/Transportation/Taxi-Trips-2024-/ajtu-isnz/about_data) from the [Chicago data portal](https://data.cityofchicago.org/). This dataset has millions of rows and many columns (and is about 1.3 GB), and therefore may take some time to load and visualize.*
*If you want to run this code locally, please either download the data from the Chicago Data Portal linked above, or the version that I have on Google Drive [here](https://drive.google.com/file/d/1QPS8DY2bDCbttMf4dEIIC3LOdYlph7sJ/view?usp=sharing). (The dataset is too large to host on GitHub.)*
*Here, we will look at columns for `Fare` and `Tips`.*
```{r}
df <- read.csv('data/Taxi_Trips__2024-__20240731.csv')
head(df)
```
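*Since only `Fare` and `Tips` are used below, one way to cut load time and memory is to skip the other columns while reading. The sketch below shows the idea with base R on a tiny made-up CSV written to a temporary file. The column names other than `Fare` and `Tips` are invented for the illustration. In `colClasses`, the string `"NULL"` drops a column and `NA` lets R detect the type as usual.*

```{r}
# Write a tiny made-up CSV to a temporary file (stand-in for the real taxi file)
demo_csv <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    Trip.ID = c("a", "b", "c"),
    Fare = c(10, 25.5, 40),
    Tips = c(2, 5, 0),
    Company = c("x", "y", "z")
  ),
  demo_csv,
  row.names = FALSE
)

# Read only the header row to learn the column names
header <- names(read.csv(demo_csv, nrows = 1))

# Skip every column except the two we need
col_classes <- rep("NULL", length(header))
col_classes[header %in% c("Fare", "Tips")] <- NA

df_small <- read.csv(demo_csv, colClasses = col_classes)
df_small
```

*The same pattern should carry over to the full file by swapping in its path, assuming its header contains columns named `Fare` and `Tips`.*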
## 2.2. Let's plot the `Fare` vs. `Tips` data as a scatter plot.
*Is there anything that we should improve upon here?*
```{r}
# create the scatter plot
ggplot(data = df, aes(x = Fare, y = Tips)) +
geom_point()
```
## 2.3. Let's improve this
- *Change the axis range.*
- *Try open circles as symbols.*
- *Add a title and some descriptive labels to the axes.*
- *Increase the font sizes.*
```{r fig.height=8, fig.width=14, message=TRUE}
ggplot(data = df, aes(x = Fare, y = Tips)) +
  geom_point(shape = 1, size = 2) +
  labs(
    title = "How Chicagoans Tipped their Cab Drivers in 2024",
    x = "Fare ($)",
    y = "Tip ($)"
  ) +
  xlim(0, 150) + ylim(0, 150) +
  theme(
    panel.grid.major = element_blank(), # remove the grid
    panel.grid.minor = element_blank(), # remove the grid
    axis.title = element_text(size = 20), # increase the font size of the axis titles
    plot.title = element_text(size = 30), # increase the font size of the title
    axis.text = element_text(size = 16), # increase the font size of the tick labels
    aspect.ratio = 1 # so it's not as wide as the default
  )
# save the figure (have to specify size here)
# ggsave("scatter_r.pdf", device = "pdf", width = 8, height = 5)
```
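*A note on changing the axis range: `xlim()` and `ylim()` set scale limits, so rows outside the range are dropped (with a warning) before anything is drawn or computed. `coord_cartesian()` only zooms the view and keeps all rows. For a plain scatter plot the picture looks the same, but the difference matters once a layer computes statistics such as bin counts. A sketch with three made-up points:*

```{r}
library(ggplot2)

# Made-up points for illustration only; the third lies outside the 0-150 range
pts_demo <- data.frame(Fare = c(10, 50, 400), Tips = c(2, 10, 60))

# Scale limits: the out-of-range row is removed from the layer
clipped <- ggplot(pts_demo, aes(x = Fare, y = Tips)) +
  geom_point() +
  xlim(0, 150) + ylim(0, 150)

# Coordinate zoom: all rows are kept, only the visible window changes
zoomed <- ggplot(pts_demo, aes(x = Fare, y = Tips)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 150), ylim = c(0, 150))

# Number of rows that fall inside the window
n_in_range <- sum(pts_demo$Fare <= 150 & pts_demo$Tips <= 150)
n_in_range
```

*Print `clipped` or `zoomed` to draw either version.*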
## 2.4. Can we improve this more?
- *Use a 2D histogram instead. (Often when you have so much overlapping data, it is easier for the viewer to read a 2D histogram, contour plot, or similar.)*
- *Include a colorbar.*
- *Add lines at typical tip rates and label them?*
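
*As a rough sketch of what a 2D histogram computes, the base R code below cuts each axis into intervals and counts the points that fall in each cell. The numbers are made up, and the bin edges are chosen by hand. `geom_bin2d()` picks its own bin edges, so this only illustrates the idea, not ggplot2's exact binning.*

```{r}
# Made-up fares and tips for illustration only
fare_demo <- c(5, 7, 12, 18, 22, 22)
tips_demo <- c(1, 1, 2, 4, 4, 5)

# Assign each value to an interval along its axis
fare_bin <- cut(fare_demo, breaks = c(0, 10, 20, 30))
tips_bin <- cut(tips_demo, breaks = c(0, 3, 6))

# Count the points in every (fare bin, tip bin) cell
bin_counts <- table(fare_bin, tips_bin)
bin_counts
```

*Each cell count is what the fill color encodes in the 2D histogram.*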
```{r}
library(scales) # mostly used for the colormap
```
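*The piece of `scales` used in the next chunk is `squish()`, passed as `oob = scales::squish`. A quick sketch with made-up counts shows what it does to values outside the color limits, compared with `censor()`, which is the default behavior for continuous scales:*

```{r}
library(scales)

# Made-up bin counts for illustration only
counts_demo <- c(0.5, 5, 200, 20000)

# squish(): pull out-of-range values to the nearest limit
squished <- squish(counts_demo, range = c(1, 1e4))
squished

# censor(): replace out-of-range values with NA instead
censored <- censor(counts_demo, range = c(1, 1e4))
censored
```

*With `squish()`, a bin holding more than 10,000 rides is drawn in the darkest color rather than being left blank.*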
```{r fig.height=8, fig.width=12, message=TRUE}
# Create the plot
p <- ggplot(df, aes(x = Fare, y = Tips)) +
  geom_bin2d(bins = 60) +
  scale_fill_continuous(
    trans = 'log',
    low = "white", high = "darkblue",
    limits = c(1, 1e4), # Set the minimum and maximum values for the colormap
    oob = scales::squish, # Clamp out-of-bounds values to the nearest limit of the colormap
    breaks = c(1, 10, 100, 1e3, 1e4), # Define the legend values to show
    guide = guide_colourbar(
      title = "Number of Rides",
      title.theme = element_text(size = 20), # Title font size
      label.theme = element_text(size = 16) # Label font size
    )
  ) +
  labs(
    title = "How Chicagoans Tipped\ntheir Cab Drivers in 2024",
    x = "Fare ($)",
    y = "Tip ($)"
  ) +
  scale_x_continuous(limits = c(0, 150), expand = c(0, 0)) +
  scale_y_continuous(limits = c(0, 60), expand = c(0, 0)) +
  theme_bw() + # remove the gray background
  theme(
    panel.grid.major = element_blank(), # remove the grid
    panel.grid.minor = element_blank(), # remove the grid
    axis.title = element_text(size = 20), # increase the font size of the axis titles
    plot.title = element_text(size = 30), # increase the font size of the title
    axis.text = element_text(size = 16), # increase the font size of the tick labels
    aspect.ratio = 1, # so it's not as wide as the default
    legend.key.height = unit(3, "cm")
  )
# add lines at standard tip rates (uncomment below to include the lines in the plot)
# tip_pcts <- c(0.2, 0.25, 0.30, 0.4)
# p <- p +
# geom_abline(data = data.frame(slope = tip_pcts), aes(intercept = 0, slope = slope), color = 'black', linetype = 'dashed', alpha = 0.7) +
# geom_text(data = data.frame(pct = tip_pcts), aes(x = 120, y = pct*130, label = paste0(pct*100, "%"), angle = 100*pct), color = 'black', hjust = 0, alpha = 0.7)
# show the plot
print(p)
# save the figure (have to specify size here)
# ggsave("hist_r.png", device = "png", width = 12, height = 8)
```
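*The tip-rate lines work because a fixed tip percentage is a straight line through the origin: tip = rate × fare, so the slope of each line is the rate. Here is a small self-contained sketch with made-up rides (not the Chicago data) that computes each ride's tip rate and overlays lines at a few example rates:*

```{r}
library(ggplot2)

# Made-up rides for illustration only
rides_demo <- data.frame(Fare = c(10, 20, 40, 80), Tips = c(2, 5, 8, 20))

# Tip rate of each ride, as a fraction of the fare
rides_demo$TipRate <- rides_demo$Tips / rides_demo$Fare

# One line per example rate; the slope is the rate and the intercept is zero
tip_lines_demo <- data.frame(rate = c(0.20, 0.25, 0.30))

tip_plot_demo <- ggplot(rides_demo, aes(x = Fare, y = Tips)) +
  geom_point() +
  geom_abline(
    data = tip_lines_demo,
    aes(intercept = 0, slope = rate),
    linetype = "dashed"
  )

tip_plot_demo
```

*Rides that sit exactly on a dashed line were tipped at exactly that rate.*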