minor changes

leem44 · leem44 · commit b4593d84ead3 · 2021-12-03T16:57:46.000-08:00
diff --git a/viz.Rmd b/viz.Rmd
@@ -21,17 +21,17 @@ plots, line plots, and histograms) for data using R.
 
 ## Chapter learning objectives
 
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:
 
 - Describe when to use the following kinds of visualizations to answer specific questions using a data set:
     - scatter plots
     - line plots
     - bar plots 
     - histogram plots
-- Given a data set and a question, select from the above plot types and use R to create a visualization that best answers the question
-- Given a visualization and a question, evaluate the effectiveness of the visualization and suggest improvements to better answer the question
-- Referring to the visualization, communicate the conclusions in non-technical terms
-- Identify rules of thumb for creating effective visualizations 
+- Given a data set and a question, select from the above plot types and use R to create a visualization that best answers the question.
+- Given a visualization and a question, evaluate the effectiveness of the visualization and suggest improvements to better answer the question.
+- Referring to the visualization, communicate the conclusions in non-technical terms.
+- Identify rules of thumb for creating effective visualizations. 
 - Define the three key aspects of ggplot objects:
     - aesthetic mappings
     - geometric objects
@@ -40,11 +40,11 @@ By the end of the chapter, readers will be able to:
     - geometric objects: `geom_point`, `geom_line`, `geom_histogram`, `geom_bar`, `geom_vline`, `geom_hline`
     - scales: `xlim`, `ylim`
     - aesthetic mappings: `x`, `y`, `fill`, `color`, `shape`
-    - labelling: `xlab`, `ylab`, `labs`
+    - labeling: `xlab`, `ylab`, `labs`
     - font control and legend positioning: `theme`
     - subplots: `facet_grid`
-- Describe the difference in raster and vector output formats
-- Use `ggsave` to save visualizations in `.png` and `.svg` format
+- Describe the difference between raster and vector output formats.
+- Use `ggsave` to save visualizations in `.png` and `.svg` formats.
 
 ## Choosing the visualization
 #### *Ask a question, and answer it* {-}
@@ -65,7 +65,7 @@ from Chapter \@ref(intro).
 With the visualizations we will cover in this chapter, 
 we will be able to answer *only descriptive and exploratory* questions. 
 Be careful not to answer any *predictive, inferential, causal* 
-*or mechanistic* questions with visualizations presented here, 
+*or mechanistic* questions with the visualizations presented here, 
 as we have not learned the tools necessary to do that properly just yet.  
 
 As with most coding tasks, it is totally fine (and quite common) to make
@@ -200,11 +200,11 @@ options(warn = -1)
 
 The [Mauna Loa CO$_{\text{2}}$ data set](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html), 
 curated by [Dr. Pieter Tans, NOAA/GML](https://www.esrl.noaa.gov/gmd/staff/Pieter.Tans/) 
-and [Dr. Ralph Keeling, Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/)
+and [Dr. Ralph Keeling, Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/),
 records the atmospheric concentration of carbon dioxide 
 (CO$_{\text{2}}$, in parts per million) 
 at the Mauna Loa research station in \index{Mauna Loa CO2} Hawaii 
-from 1959 onwards [@maunadata].
+from 1959 onward [@maunadata].
 For this book, we are going to focus on the last 40 years of the data set,
 1980-2020.
 
@@ -247,7 +247,7 @@ that was measured on each date, and is type `double`.
 > For example, `date` type vectors allow functions like `ggplot` 
 > to treat them as numeric dates and not as character vectors, 
 > even though they contain non-numeric characters 
-> (e.g., `-` in the `date_measured` column in the `co2_df` data frame).
+> (e.g., in the `date_measured` column in the `co2_df` data frame).
 > This means R will not accidentally plot the dates in the wrong order 
 > (i.e., not alphanumerically as would happen if it was a character vector). 
 > An in-depth study of dates and times is beyond the scope of the book, 
@@ -268,15 +268,15 @@ There are a few basic aspects of a plot that we need to specify:
 \index{ggplot!aesthetic mapping}
 \index{ggplot!geometric object}
 
-- the name of the data frame object to visualize 
-    - here, we specify the `co2_df` data frame
-- the **aesthetic mapping**, which tells \index{aesthetic mapping} `ggplot` how the columns in the data frame map to properties of the visualization
-    - to create an aesthetic mapping, we use the `aes` function
-    - here, we set the plot `x` axis to the `date_measured` variable, and the plot `y` axis to the `ppm` variable
-- the `+` operator, which tells `ggplot` that we would like to add another layer to the plot.\index{aaaplussymb@$+$|see{ggplot!add layer}}\index{ggplot!add layer}
-- the **geometric object**, which specifies \index{aesthetic mapping} how the mapped data should be displayed
-    - to create a geometric object, we use a `geom_*` function (see the [ggplot reference](https://ggplot2.tidyverse.org/reference/) for a list of geometric objects)
-    - here, we use the `geom_point` function to visualize our data as a scatter plot
+- The name of the data frame object to visualize.
+    - Here, we specify the `co2_df` data frame.
+- The **aesthetic mapping**, which tells \index{aesthetic mapping} `ggplot` how the columns in the data frame map to properties of the visualization.
+    - To create an aesthetic mapping, we use the `aes` function.
+    - Here, we set the plot `x` axis to the `date_measured` variable, and the plot `y` axis to the `ppm` variable.
+- The `+` operator, which tells `ggplot` that we would like to add another layer to the plot.\index{aaaplussymb@$+$|see{ggplot!add layer}}\index{ggplot!add layer}
+- The **geometric object**, which specifies \index{geometric object} how the mapped data should be displayed.
+    - To create a geometric object, we use a `geom_*` function (see the [ggplot reference](https://ggplot2.tidyverse.org/reference/) for a list of geometric objects).
+    - Here, we use the `geom_point` function to visualize our data as a scatter plot.
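 
 As a minimal sketch of how these pieces fit together, the code below builds a scatter plot from a stand-in data frame. It assumes the `ggplot2` package is installed; `co2_standin` and `co2_scatter_sketch` are hypothetical names, and the data come from R's built-in `co2` time series (monthly Mauna Loa readings, 1959 to 1997) rather than from the `co2_df` data frame used in this chapter.
 
 ```r
 library(ggplot2)
 
 # Stand-in for the chapter's `co2_df` (an assumption for illustration):
 # reshape R's built-in monthly `co2` time series into a data frame.
 co2_standin <- data.frame(
   date_measured = seq(as.Date("1959-01-01"), by = "month",
                       length.out = length(co2)),
   ppm = as.numeric(co2)
 )
 
 co2_scatter_sketch <- ggplot(co2_standin,        # data frame to visualize
     aes(x = date_measured, y = ppm)) +           # aesthetic mapping
   geom_point()                                   # geometric object
 
 co2_scatter_sketch
 ```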
 
 Figure \@ref(fig:03-ggplot-function-scatter) 
 shows how each of these aspects maps to code
@@ -352,12 +352,6 @@ change the font size, we use the `theme` function with the `text` argument:
 \index{ggplot!xlab,ylab}
 \index{ggplot!theme}
 
-> **Note:** The `theme` function is quite complex and has many arguments 
-> that can be specified to control many non-data aspects of a visualization.
-> An in-depth discussion of the `theme` function is beyond the scope of this book.
-> Interested readers may consult the `theme` function documentation;
-> see the additional resources section at the end of this chapter.
-
 ```{r 03-data-co2-line-2, warning=FALSE, message=FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center",  fig.cap = "Line plot of atmospheric concentration of CO$_{2}$ over time with clearer axes and labels."}
 co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
   geom_line() +
@@ -368,6 +362,12 @@ co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
 co2_line
 ```
 
+> **Note:** The `theme` function is quite complex and has many arguments 
+> that can be specified to control many non-data aspects of a visualization.
+> An in-depth discussion of the `theme` function is beyond the scope of this book.
+> Interested readers may consult the `theme` function documentation;
+> see the additional resources section at the end of this chapter.
+
 Finally, let's see if we can better understand the oscillation by changing the
 visualization slightly. Note that it is totally fine to use a small number of
 visualizations to answer different aspects of the question you are trying to
@@ -467,7 +467,7 @@ faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
 faithful_scatter
 ```
 
-We can see in Figure \@ref(fig:03-data-faithful-scatter) the data tend to fall
+We can see in Figure \@ref(fig:03-data-faithful-scatter) that the data tend to fall
 into two groups: one with short waiting and eruption times, and one with long
 waiting and eruption times. Note that in this case, there is no overplotting:
 the points are generally nicely visually separated, and the pattern they form
@@ -1045,7 +1045,7 @@ minus 299,000; this ensures it is coded the same way as the
 measurements in the `morley` data frame.
 We would also like to fine-tune this vertical line, 
 styling it so that it is dashed and 1 point in thickness.
-A point is a measurement unit commonly used with font, 
+A point is a measurement unit commonly used with fonts, 
 and 1 point is about 0.353 mm. 
 We do this by setting `linetype = "dashed"` and `size = 1`, respectively. 
 There is a similar function, `geom_hline`, 
@@ -1099,6 +1099,7 @@ with the data types in the `morley` data frame. In particular, the `Expt` column
 is currently an *integer* (you can see the label `<int>` underneath the `Expt` column in \index{integer} the printed
 data frame at the start of this section). But we want to treat it as a
 *category*, i.e., there should be one category per type of experiment.  
+
 To fix this issue we can convert the `Expt` variable into a *factor* by \index{factor}
 passing it to `as_factor` in the `fill` aesthetic mapping.
 Recall that a factor is a data type in R that is often used to represent
@@ -1124,7 +1125,7 @@ morley_hist
  
 Unfortunately, the attempt to separate out the experiment number visually has
 created a bit of a mess. All of the colors in Figure
-\@ref(fig:03-data-morley-hist-3) are blending together, and although it is
+\@ref(fig:03-data-morley-hist-with-factor) are blending together, and although it is
 possible to derive *some* insight from this (e.g., experiments 1 and 3 had some
 of the most incorrect measurements), it isn't the clearest way to convey our
 message and answer the question. Let's try a different strategy of creating
@@ -1139,8 +1140,7 @@ If the plot is to be split horizontally, into rows,
 then the `rows` argument is used.
 If the plot is to be split vertically, into columns, 
 then the `cols` argument is used.
-Both the `rows` and `columns` argument take the column names to split the data 
-on when creating the subplots. 
+Both the `rows` and `cols` arguments take the column names on which to split the data when creating the subplots. 
 One key thing is that the column names must be surrounded by the `vars` function.
 This function allows the column names to be correctly evaluated 
 in the context of the data frame.
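 
 A minimal sketch of this faceting pattern, assuming the `ggplot2` package is installed, might look like the following. The name `morley_facet_sketch` is hypothetical, the `bins = 30` setting is an arbitrary choice, and base R's `as.factor` stands in for `as_factor` so that the example needs no extra packages. In `ggplot2`, the argument for splitting into columns is named `cols`.
 
 ```r
 library(ggplot2)
 
 # `morley` ships with R; `Expt` is stored as an integer,
 # so convert it to a factor before mapping it to `fill`.
 morley_facet_sketch <- ggplot(morley, aes(x = Speed, fill = as.factor(Expt))) +
   geom_histogram(bins = 30) +
   facet_grid(rows = vars(Expt))   # one row of subplots per experiment
 
 morley_facet_sketch
 ```
 
 Swapping `rows = vars(Expt)` for `cols = vars(Expt)` would lay the subplots out side by side instead.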
@@ -1161,7 +1161,7 @@ with respect to one another.
 The most variable measurements came from Experiment 1. 
 There, the measurements ranged from about 650 - 1050 km / sec.
 The least variable measurements came from Experiment 2.
-There the measurements ranged from about 750 - 950 km / sec.
+There, the measurements ranged from about 750 - 950 km / sec.
 The most different experiments still obtained quite similar results!
 
 There are two finishing touches to make this visualization even clearer. First and foremost, we need to add informative axis labels
@@ -1320,14 +1320,14 @@ suggest directions for future work.
 Regardless of where it appears, a good way to discuss your visualization \index{visualization!explanation} is as
 a story: 
 
-1) Establish the setting and scope, and motivate why you did what you did. 
+1) Establish the setting and scope, and describe why you did what you did. 
 2) Pose the question that your visualization answers. Justify why the question is important to answer.
 3) Answer the question using your visualization. Make sure you describe *all* aspects of the visualization (including describing the axes). But you 
    can emphasize different aspects based on what is important to answer your question:
     - **trends (lines):** Does a line describe the trend well? If so, the trend is *linear*, and if not, the trend is *nonlinear*. Is the trend increasing, decreasing, or neither?
                         Is there a periodic oscillation (wiggle) in the trend? Is the trend noisy (does the line "jump around" a lot) or smooth?
-    - **distributions (scatters, histograms):** How spread out are the data? Where are they centered, roughly? Are there any obvious "clusters" or "subgroups", which would be visible as multiple bumps in the histogram?  
-    - **distributions of two variables (scatters):** is there a clear / strong relationship between the variables (points fall in a distinct pattern), a weak one (points fall in a pattern but there is some noise), or no discernible
+    - **distributions (scatters, histograms):** How spread out are the data? Where are they centered, roughly? Are there any obvious "clusters" or "subgroups", which would be visible as multiple bumps in the histogram?
+    - **distributions of two variables (scatters):** Is there a clear / strong relationship between the variables (points fall in a distinct pattern), a weak one (points fall in a pattern but there is some noise), or no discernible
       relationship (the data are too noisy to make any conclusion)?
     - **amounts (bars):** How large are the bars relative to one another? Are there patterns in different groups of bars? 
 4) Summarize your findings, and use them to motivate whatever you will discuss next.
@@ -1342,7 +1342,7 @@ greenhouse gases, typically primarily carbon dioxide (CO$_{\text{2}}$), as a
 byproduct. Too much of these gases in the Earth's atmosphere will cause it to
 trap more heat from the sun, leading to global warming. (2) In order to assess
 how quickly the atmospheric concentration of CO$_{\text{2}}$ is increasing over
-time, we (3) used a data set from the Mauna Loa observatory from Hawaii,
+time, we (3) used a data set from the Mauna Loa observatory in Hawaii,
 consisting of CO$_{\text{2}}$ measurements from 1980 to 2020. We plotted the
 measured concentration of CO$_{\text{2}}$ (on the vertical axis) over time (on
 the horizontal axis). From this plot, you can see a clear, increasing, and
@@ -1355,10 +1355,10 @@ perhaps worth investigating more into the causes.
 **Michelson Light Speed Experiments:** (1) \index{Michelson speed of light} Our
 modern understanding of the physics of light has advanced significantly from
 the late 1800s when Michelson and Morley's experiments first demonstrated that
-it had a finite speed. We now know based on modern experiments that it moves at
-roughly 299,792.458 kilometres per second. (2) But how accurately were we first
+it had a finite speed. We now know, based on modern experiments, that it moves at
+roughly 299,792.458 kilometers per second. (2) But how accurately were we first
 able to measure this fundamental physical constant, and did certain experiments
-produce more accurate results than others?  (3) To better understand this we
+produce more accurate results than others?  (3) To better understand this, we
 plotted data from 5 experiments by Michelson in 1879, each with 20 trials, as
 histograms stacked on top of one another.  The horizontal axis shows the
 accuracy of the measurements relative to the true speed of light as we know it
@@ -1384,7 +1384,7 @@ and *vector* \index{vector graphics} formats.
 **Raster** images are represented as a 2-D grid of square pixels, each
 with its own color. Raster images are often *compressed* before storing so they
 take up less space. A compressed format is *lossy* if the image cannot be
-perfectly recreated when loading and displaying, with the hope that the change
+perfectly re-created when loading and displaying, with the hope that the change
 is not noticeable. *Lossless* formats, on the other hand, allow a perfect
 display of the original image.
 \index{raster graphics!file types}
@@ -1415,7 +1415,7 @@ computer has to draw all the elements each time it is displayed. For example,
 if you have a scatter plot with 1 million points stored as an SVG file, it may
 take your computer some time to open the image. On the other hand, you can zoom
 into / scale up vector graphics as much as you like without the image looking
-bad, while raster images eventually start to look "pixellated." 
+bad, while raster images eventually start to look "pixelated." 
 
 > **Note:** The portable document format [PDF](https://en.wikipedia.org/wiki/PDF) (`.pdf`) is commonly used to
 > store *both* raster and vector formats. If you try to open a PDF and it's taking a long time
@@ -1447,7 +1447,7 @@ This can include the path to the directory where you would like to save the file
 and the name of the plot object to save as its second argument.
 The kind of image to save is specified by the file extension.
 For example, 
-to create a PNG image file we specify that the file extension is `.png`.
+to create a PNG image file, we specify that the file extension is `.png`.
 Below we demonstrate how to save PNG, JPG, BMP, TIFF and SVG file types 
 for the `faithful_plot`:
 
@@ -1495,8 +1495,9 @@ based on mathematical formulas, vector graphics can be scaled up to arbitrary
 sizes.  This makes them great for presentation media of all sizes, from papers
 to posters to billboards.
 
+(ref:03-raster-image) Zoomed in `faithful`, raster (PNG, left) and vector (SVG, right) formats.
 
-```{r 03-raster-image, echo=FALSE, fig.cap = "Zoomed in `faithful`, raster (PNG, left) and vector (SVG, right) formats.", fig.show="hold", fig.align= "center", message =F, out.width="100%"}
+```{r 03-raster-image, echo=FALSE, fig.cap = "(ref:03-raster-image)", fig.show="hold", fig.align= "center", message =F, out.width="100%"}
 knitr::include_graphics("img/png-vs-svg.png")
 ```
 
@@ -1513,11 +1514,11 @@ found in Chapter \@ref(move-to-your-own-machine).
 ## Additional resources
 - The [`ggplot2` page on the tidyverse website](https://ggplot2.tidyverse.org) is where you should look if you want to learn more about the functions in this chapter, the full set of arguments you can use, and other related functions. The site also provides a very nice cheat sheet that summarizes many of the data visualization functions from this chapter.
 - [*Fundamentals of Data Visualization*](https://serialmentor.com/dataviz/) has a wealth of information on designing effective visualizations. It is not specific to any particular programming language or library. If you want to improve your visualization skills, this is the next place to look.
-- [R for Data Science](https://r4ds.had.co.nz/) has a chapter on [creating visualizations using `ggplot2`](https://r4ds.had.co.nz/data-visualisation.html). This reference is specific to R and `ggplot2`, but provides a much more detailed introduction to the full set of tools that `ggplot2` provides. This chapter is where you should look if you want to learn how to make more intricate visualizations in `ggplot2` than what is included in this chapter.
+- [*R for Data Science*](https://r4ds.had.co.nz/) has a chapter on [creating visualizations using `ggplot2`](https://r4ds.had.co.nz/data-visualisation.html). This reference is specific to R and `ggplot2`, but provides a much more detailed introduction to the full set of tools that `ggplot2` provides. This chapter is where you should look if you want to learn how to make more intricate visualizations in `ggplot2` than what is included in this chapter.
 - The [`theme` function documentation](https://ggplot2.tidyverse.org/reference/theme.html)
 is an excellent reference to see how you can fine-tune the non-data aspects 
 of your visualization.
-- [R for Data Science](https://r4ds.had.co.nz/) has a chapter on 
+- [*R for Data Science*](https://r4ds.had.co.nz/) has a chapter on 
 [dates and times](https://r4ds.had.co.nz/dates-and-times.html). 
 This chapter is where you should look if you want to learn about `date` vectors,
 including how to create them,