- Describe the difference in raster and vector output formats
47
-
- Use `ggsave` to save visualizations in `.png` and `.svg` format
46
+
- Describe the difference in raster and vector output formats.
47
+
- Use `ggsave` to save visualizations in `.png` and `.svg` format.
48
48
49
49
## Choosing the visualization
50
50
#### *Ask a question, and answer it* {-}
@@ -65,7 +65,7 @@ from Chapter \@ref(intro).
65
65
With the visualizations we will cover in this chapter,
66
66
we will be able to answer *only descriptive and exploratory* questions.
67
67
Be careful to not answer any *predictive, inferential, causal*
68
-
*or mechanistic* questions with visualizations presented here,
68
+
*or mechanistic* questions with the visualizations presented here,
69
69
as we have not learned the tools necessary to do that properly just yet.
70
70
71
71
As with most coding tasks, it is totally fine (and quite common) to make
@@ -200,11 +200,11 @@ options(warn = -1)
200
200
201
201
The [Mauna Loa CO$_{\text{2}}$ data set](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html),
202
202
curated by [Dr. Pieter Tans, NOAA/GML](https://www.esrl.noaa.gov/gmd/staff/Pieter.Tans/)
203
-
and [Dr. Ralph Keeling, Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/)
203
+
and [Dr. Ralph Keeling, Scripps Institution of Oceanography,](https://scrippsco2.ucsd.edu/)
204
204
records the atmospheric concentration of carbon dioxide
205
205
(CO$_{\text{2}}$, in parts per million)
206
206
at the Mauna Loa research station in \index{Mauna Loa CO2} Hawaii
207
-
from 1959 onwards[@maunadata].
207
+
from 1959 onward[@maunadata].
208
208
For this book, we are going to focus on the last 40 years of the data set,
209
209
1980-2020.
210
210
@@ -247,7 +247,7 @@ that was measured on each date, and is type `double`.
247
247
> For example, `date` type vectors allow functions like `ggplot`
248
248
> to treat them as numeric dates and not as character vectors,
249
249
> even though they contain non-numeric characters
250
-
> (e.g., `-`in the `date_measured` column in the `co2_df` data frame).
250
+
> (e.g., `-` in the `date_measured` column in the `co2_df` data frame).
251
251
> This means R will not accidentally plot the dates in the wrong order
252
252
> (i.e., not alphanumerically as would happen if it was a character vector).
253
253
> An in-depth study of dates and times is beyond the scope of the book,
@@ -268,15 +268,15 @@ There are a few basic aspects of a plot that we need to specify:
268
268
\index{ggplot!aesthetic mapping}
269
269
\index{ggplot!geometric object}
270
270
271
-
- the name of the data frame object to visualize
272
-
    - here, we specify the `co2_df` data frame
273
-
- the **aesthetic mapping**, which tells \index{aesthetic mapping} `ggplot` how the columns in the data frame map to properties of the visualization
274
-
    - to create an aesthetic mapping, we use the `aes` function
275
-
    - here, we set the plot `x` axis to the `date_measured` variable, and the plot `y` axis to the `ppm` variable
276
-
- the `+` operator, which tells `ggplot` that we would like to add another layer to the plot.\index{aaaplussymb@$+$|see{ggplot!add layer}}\index{ggplot!add layer}
277
-
- the **geometric object**, which specifies \index{aesthetic mapping} how the mapped data should be displayed
278
-
    - to create a geometric object, we use a `geom_*` function (see the [ggplot reference](https://ggplot2.tidyverse.org/reference/) for a list of geometric objects)
279
-
    - here, we use the `geom_point` function to visualize our data as a scatter plot
271
+
- The name of the data frame object to visualize.
272
+
    - Here, we specify the `co2_df` data frame.
273
+
- The **aesthetic mapping**, which tells \index{aesthetic mapping} `ggplot` how the columns in the data frame map to properties of the visualization.
274
+
    - To create an aesthetic mapping, we use the `aes` function.
275
+
    - Here, we set the plot `x` axis to the `date_measured` variable, and the plot `y` axis to the `ppm` variable.
276
+
- The `+` operator, which tells `ggplot` that we would like to add another layer to the plot.\index{aaaplussymb@$+$|see{ggplot!add layer}}\index{ggplot!add layer}
277
+
- The **geometric object**, which specifies \index{aesthetic mapping} how the mapped data should be displayed.
278
+
    - To create a geometric object, we use a `geom_*` function (see the [ggplot reference](https://ggplot2.tidyverse.org/reference/) for a list of geometric objects).
279
+
    - Here, we use the `geom_point` function to visualize our data as a scatter plot.
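
As a rough illustration of how these pieces fit together, the sketch below builds a scatter plot from a tiny stand-in data frame. The three rows of values are invented for illustration and are not the real Mauna Loa measurements; only the column names `date_measured` and `ppm` come from the text.

```r
library(ggplot2)

# Hypothetical stand-in for co2_df: made-up values, real column names
co2_df <- data.frame(
  date_measured = as.Date(c("1980-02-01", "1980-03-01", "1980-04-01")),
  ppm = c(338, 340, 341)
)

# Data frame first, then the aesthetic mapping via aes(),
# then `+` to add a layer, then the geometric object
co2_scatter <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
  geom_point()
```

Printing `co2_scatter` (or typing its name at the console) should draw the plot.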
280
280
281
281
Figure \@ref(fig:03-ggplot-function-scatter)
282
282
shows how each of these aspects maps to code
@@ -352,12 +352,6 @@ change the font size, we use the `theme` function with the `text` argument:
352
352
\index{ggplot!xlab,ylab}
353
353
\index{ggplot!theme}
354
354
355
-
> **Note:** The `theme` function is quite complex and has many arguments
356
-
> that can be specified to control many non-data aspects of a visualization.
357
-
> An in-depth discussion of the `theme` function is beyond the scope of this book.
358
-
> Interested readers may consult the `theme` function documentation;
359
-
> see the additional resources section at the end of this chapter.
360
-
361
355
```{r 03-data-co2-line-2, warning=FALSE, message=FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Line plot of atmospheric concentration of CO$_{2}$ over time with clearer axes and labels."}
362
356
co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
We can see in Figure \@ref(fig:03-data-faithful-scatter) the data tend to fall
470
+
We can see in Figure \@ref(fig:03-data-faithful-scatter) that the data tend to fall
471
471
into two groups: one with short waiting and eruption times, and one with long
472
472
waiting and eruption times. Note that in this case, there is no overplotting:
473
473
the points are generally nicely visually separated, and the pattern they form
@@ -1045,7 +1045,7 @@ minus 299,000; this ensures it is coded the same way as the
1045
1045
measurements in the `morley` data frame.
1046
1046
We would also like to fine tune this vertical line,
1047
1047
styling it so that it is dashed and 1 point in thickness.
1048
-
A point is a measurement unit commonly used with font,
1048
+
A point is a measurement unit commonly used with fonts,
1049
1049
and 1 point is about 0.353 mm.
1050
1050
We do this by setting `linetype = "dashed"` and `size = 1`, respectively.
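
A minimal sketch of this step, assuming the built-in `morley` data frame and a plain histogram as the base plot (the `bins = 30` setting and the object name `morley_vline` are arbitrary choices for this illustration). The `xintercept` value is the modern speed of light minus 299,000, as described above. Newer ggplot2 releases prefer `linewidth` over `size` for line thickness, so `size = 1` may produce a deprecation warning there.

```r
library(ggplot2)

# Histogram of the measured speeds (recorded as km/sec minus 299,000)
# with a dashed vertical line at the modern value on the same scale
morley_vline <- ggplot(morley, aes(x = Speed)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1)
```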
1051
1051
There is a similar function, `geom_hline`,
@@ -1099,6 +1099,7 @@ with the data types in the `morley` data frame. In particular, the `Expt` column
1099
1099
is currently an *integer* (you can see the label `<int>` underneath the `Expt` column in \index{integer} the printed
1100
1100
data frame at the start of this section). But we want to treat it as a
1101
1101
*category*, i.e., there should be one category per type of experiment.
1102
+
1102
1103
To fix this issue we can convert the `Expt` variable into a *factor* by \index{factor}
1103
1104
passing it to `as_factor` in the `fill` aesthetic mapping.
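
The conversion might look like the sketch below. It assumes the `forcats` package (loaded as part of the tidyverse) as the source of `as_factor`; the object names and the `bins = 30` setting are illustrative only.

```r
library(ggplot2)
library(forcats)  # provides as_factor

# Expt is stored as an integer in the built-in morley data frame;
# as_factor turns it into a categorical variable with one level per experiment
expt_factor <- as_factor(morley$Expt)

# Converting inside aes() leaves the data frame itself unchanged
morley_by_expt <- ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) +
  geom_histogram(bins = 30)
```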
1104
1105
Recall that factor is a data type in R that is often used to represent
@@ -1124,7 +1125,7 @@ morley_hist
1124
1125
1125
1126
Unfortunately, the attempt to separate out the experiment number visually has
1126
1127
created a bit of a mess. All of the colors in Figure
1127
-
\@ref(fig:03-data-morley-hist-3) are blending together, and although it is
1128
+
\@ref(fig:03-data-morley-hist-with-factor) are blending together, and although it is
1128
1129
possible to derive *some* insight from this (e.g., experiments 1 and 3 had some
1129
1130
of the most incorrect measurements), it isn't the clearest way to convey our
1130
1131
message and answer the question. Let's try a different strategy of creating
@@ -1139,8 +1140,7 @@ If the plot is to be split horizontally, into rows,
1139
1140
then the `rows` argument is used.
1140
1141
If the plot is to be split vertically, into columns,
1141
1142
then the `cols` argument is used.
1142
-
Both the `rows` and `columns` argument take the column names to split the data
1143
-
on when creating the subplots.
1143
+
Both the `rows` and `cols` arguments take the column names on which to split the data when creating the subplots.
1144
1144
One key thing is that the column names must be surrounded by the `vars` function.
1145
1145
This function allows the column names to be correctly evaluated
1146
1146
in the context of the data frame.
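
A short sketch of such a split, assuming the built-in `morley` data frame and the `facet_grid` function from ggplot2. In the ggplot2 releases this sketch was written against, the argument for splitting into columns is spelled `cols`; the object name and `bins = 30` are illustrative choices.

```r
library(ggplot2)

# One histogram per experiment, stacked vertically:
# rows = vars(Expt) splits the plot into one row per value of Expt
morley_facets <- ggplot(morley, aes(x = Speed)) +
  geom_histogram(bins = 30) +
  facet_grid(rows = vars(Expt))
```

Replacing `rows = vars(Expt)` with `cols = vars(Expt)` would place the subplots side by side instead.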
@@ -1161,7 +1161,7 @@ with respect to one another.
1161
1161
The most variable measurements came from Experiment 1.
1162
1162
There, the measurements ranged from about 650 - 1050 km / sec.
1163
1163
The least variable measurements came from Experiment 2.
1164
-
There the measurements ranged from about 750 - 950 km / sec.
1164
+
There, the measurements ranged from about 750 - 950 km / sec.
1165
1165
The most different experiments still obtained quite similar results!
1166
1166
1167
1167
There are two finishing touches to make this visualization even clearer. First and foremost, we need to add informative axis labels
@@ -1320,14 +1320,14 @@ suggest directions for future work.
1320
1320
Regardless of where it appears, a good way to discuss your visualization \index{visualization!explanation} is as
1321
1321
a story:
1322
1322
1323
-
1) Establish the setting and scope, and motivate why you did what you did.
1323
+
1) Establish the setting and scope, and describe why you did what you did.
1324
1324
2) Pose the question that your visualization answers. Justify why the question is important to answer.
1325
1325
3) Answer the question using your visualization. Make sure you describe *all* aspects of the visualization (including describing the axes). But you
1326
1326
can emphasize different aspects based on what is important to answer your question:
1327
1327
- **trends (lines):** Does a line describe the trend well? If so, the trend is *linear*, and if not, the trend is *nonlinear*. Is the trend increasing, decreasing, or neither?
1328
1328
Is there a periodic oscillation (wiggle) in the trend? Is the trend noisy (does the line "jump around" a lot) or smooth?
1329
-
- **distributions (scatters, histograms):** How spread out are the data? Where are they centered, roughly? Are there any obvious "clusters" or "subgroups", which would be visible as multiple bumps in the histogram?
1330
-
- **distributions of two variables (scatters):** is there a clear / strong relationship between the variables (points fall in a distinct pattern), a weak one (points fall in a pattern but there is some noise), or no discernible
1329
+
- **distributions (scatters, histograms):** How spread out are the data? Where are they centered, roughly? Are there any obvious "clusters" or "subgroups", which would be visible as multiple bumps in the histogram?
1330
+
- **distributions of two variables (scatters):** Is there a clear / strong relationship between the variables (points fall in a distinct pattern), a weak one (points fall in a pattern but there is some noise), or no discernible
1331
1331
relationship (the data are too noisy to make any conclusion)?
1332
1332
- **amounts (bars):** How large are the bars relative to one another? Are there patterns in different groups of bars?
1333
1333
4) Summarize your findings, and use them to motivate whatever you will discuss next.
@@ -1342,7 +1342,7 @@ greenhouse gases, typically primarily carbon dioxide (CO$_{\text{2}}$), as a
1342
1342
byproduct. Too much of these gases in the Earth's atmosphere will cause it to
1343
1343
trap more heat from the sun, leading to global warming. (2) In order to assess
1344
1344
how quickly the atmospheric concentration of CO$_{\text{2}}$ is increasing over
1345
-
time, we (3) used a data set from the Mauna Loa observatory from Hawaii,
1345
+
time, we (3) used a data set from the Mauna Loa observatory in Hawaii,
1346
1346
consisting of CO$_{\text{2}}$ measurements from 1980 to 2020. We plotted the
1347
1347
measured concentration of CO$_{\text{2}}$ (on the vertical axis) over time (on
1348
1348
the horizontal axis). From this plot, you can see a clear, increasing, and
@@ -1355,10 +1355,10 @@ perhaps worth investigating more into the causes.
1355
1355
**Michelson Light Speed Experiments:** (1) \index{Michelson speed of light} Our
1356
1356
modern understanding of the physics of light has advanced significantly from
1357
1357
the late 1800s when Michelson and Morley's experiments first demonstrated that
1358
-
it had a finite speed. We now know based on modern experiments that it moves at
1359
-
roughly 299,792.458 kilometres per second. (2) But how accurately were we first
1358
+
it had a finite speed. We now know, based on modern experiments, that it moves at
1359
+
roughly 299,792.458 kilometers per second. (2) But how accurately were we first
1360
1360
able to measure this fundamental physical constant, and did certain experiments
1361
-
produce more accurate results than others? (3) To better understand this we
1361
+
produce more accurate results than others? (3) To better understand this, we
1362
1362
plotted data from 5 experiments by Michelson in 1879, each with 20 trials, as
1363
1363
histograms stacked on top of one another. The horizontal axis shows the
1364
1364
accuracy of the measurements relative to the true speed of light as we know it
@@ -1384,7 +1384,7 @@ and *vector* \index{vector graphics} formats.
1384
1384
**Raster** images are represented as a 2-D grid of square pixels, each
1385
1385
with its own color. Raster images are often *compressed* before storing so they
1386
1386
take up less space. A compressed format is *lossy* if the image cannot be
1387
-
perfectly recreated when loading and displaying, with the hope that the change
1387
+
perfectly re-created when loading and displaying, with the hope that the change
1388
1388
is not noticeable. *Lossless* formats, on the other hand, allow a perfect
1389
1389
display of the original image.
1390
1390
\index{raster graphics!file types}
@@ -1415,7 +1415,7 @@ computer has to draw all the elements each time it is displayed. For example,
1415
1415
if you have a scatter plot with 1 million points stored as an SVG file, it may
1416
1416
take your computer some time to open the image. On the other hand, you can zoom
1417
1417
into / scale up vector graphics as much as you like without the image looking
1418
-
bad, while raster images eventually start to look "pixellated."
1418
+
bad, while raster images eventually start to look "pixelated."
1419
1419
1420
1420
> **Note:** The portable document format [PDF](https://en.wikipedia.org/wiki/PDF) (`.pdf`) is commonly used to
1421
1421
> store *both* raster and vector formats. If you try to open a PDF and it's taking a long time
@@ -1447,7 +1447,7 @@ This can include the path to the directory where you would like to save the file
1447
1447
and the name of the plot object to save as its second argument.
1448
1448
The kind of image to save is specified by the file extension.
1449
1449
For example,
1450
-
to create a PNG image file we specify that the file extension is `.png`.
1450
+
to create a PNG image file, we specify that the file extension is `.png`.
1451
1451
Below we demonstrate how to save PNG, JPG, BMP, TIFF and SVG file types
1452
1452
for the `faithful_plot`:
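
A minimal sketch of the call pattern, assuming a simple scatter plot of the built-in `faithful` data stands in for `faithful_plot`, and that writing to a temporary directory is acceptable. The file names, `width`, and `height` are arbitrary choices here. Saving to `.svg` relies on the `svglite` package being installed, so that step is guarded.

```r
library(ggplot2)

# Stand-in for the faithful_plot object discussed in the text
faithful_plot <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point()

out_dir <- tempdir()

# First argument: path and file name; second argument: the plot object.
# The .png extension selects a raster format.
png_path <- file.path(out_dir, "faithful_plot.png")
ggsave(png_path, faithful_plot, width = 5, height = 4)

# The .svg extension selects a vector format (needs the svglite package)
if (requireNamespace("svglite", quietly = TRUE)) {
  svg_path <- file.path(out_dir, "faithful_plot.svg")
  ggsave(svg_path, faithful_plot, width = 5, height = 4)
}
```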
1453
1453
@@ -1495,8 +1495,9 @@ based on mathematical formulas, vector graphics can be scaled up to arbitrary
1495
1495
sizes. This makes them great for presentation media of all sizes, from papers
1496
1496
to posters to billboards.
1497
1497
1498
+
(ref:03-raster-image) Zoomed in `faithful`, raster (PNG, left) and vector (SVG, right) formats.
@@ -1513,11 +1514,11 @@ found in Chapter \@ref(move-to-your-own-machine).
1513
1514
## Additional resources
1514
1515
- The [`ggplot2` page on the tidyverse website](https://ggplot2.tidyverse.org) is where you should look if you want to learn more about the functions in this chapter, the full set of arguments you can use, and other related functions. The site also provides a very nice cheat sheet that summarizes many of the data wrangling functions from this chapter.
1515
1516
- The [Fundamentals of Data Visualization](https://serialmentor.com/dataviz/) has a wealth of information on designing effective visualizations. It is not specific to any particular programming language or library. If you want to improve your visualization skills, this is the next place to look.
1516
-
- [R for Data Science](https://r4ds.had.co.nz/) has a chapter on [creating visualizations using `ggplot2`](https://r4ds.had.co.nz/data-visualisation.html). This reference is specific to R and `ggplot2`, but provides a much more detailed introduction to the full set of tools that `ggplot2` provides. This chapter is where you should look if you want to learn how to make more intricate visualizations in `ggplot2` than what is included in this chapter.
1517
+
- [*R for Data Science*](https://r4ds.had.co.nz/) has a chapter on [creating visualizations using `ggplot2`](https://r4ds.had.co.nz/data-visualisation.html). This reference is specific to R and `ggplot2`, but provides a much more detailed introduction to the full set of tools that `ggplot2` provides. This chapter is where you should look if you want to learn how to make more intricate visualizations in `ggplot2` than what is included in this chapter.
1517
1518
- The [`theme` function documentation](https://ggplot2.tidyverse.org/reference/theme.html)
1518
1519
is an excellent reference to see how you can fine tune the non-data aspects
1519
1520
of your visualization.
1520
-
- [R for Data Science](https://r4ds.had.co.nz/) has a chapter on
1521
+
- [*R for Data Science*](https://r4ds.had.co.nz/) has a chapter on
1521
1522
[dates and times](https://r4ds.had.co.nz/dates-and-times.html).
1522
1523
This chapter is where you should look if you want to learn about `date` vectors,