@@ -24,35 +24,35 @@ second step to all data analyses). It’s like making sure your shoelaces are
tied well before going for a run so that you don’t trip later on!

## Chapter learning objectives
- By the end of the chapter, readers will be able to:
+ By the end of the chapter, readers will be able to do the following:

- - define the following:
+ - Define the following:
- absolute file path
- relative file path
- **U**niform **R**esource **L**ocator (URL)
- - read data into R using a relative path and a URL
- - compare and contrast the following functions:
+ - Read data into R using a relative path and a URL.
+ - Compare and contrast the following functions:
- `read_csv`
- `read_tsv`
- `read_csv2`
- `read_delim`
- `read_excel`
- - match the following `tidyverse` `read_*` function arguments to their descriptions:
+ - Match the following `tidyverse` `read_*` function arguments to their descriptions:
- `file`
- `delim`
- `col_names`
- `skip`
- - choose the appropriate `tidyverse` `read_*` function and function arguments to load a given plain text tabular data set into R
- - use `readxl` package's `read_excel` function and arguments to load a sheet from an excel file into R
- - connect to a database using the `DBI` package's `dbConnect` function
- - list the tables in a database using the `DBI` package's `dbListTables` function
- - create a reference to a database table that is queriable using the `tbl` from the `dbplyr` package
- - retrieve data from a database query and bring it into R using the `collect` function from the `dbplyr` package
- - use `write_csv` to save a data frame to a `.csv` file
- - (*optional*) obtain data using **a**pplication **p**rogramming **i**nterfaces (APIs) and web scraping
- - read HTML source code from a URL using the `rvest` package
- - read data from the Twitter API using the `rtweet` package
- - compare downloading tabular data from a plain text file (e.g., `.csv`), accessing data from an API, and scraping the HTML source code from a website
+ - Choose the appropriate `tidyverse` `read_*` function and function arguments to load a given plain text tabular data set into R.
+ - Use the `readxl` package's `read_excel` function and arguments to load a sheet from an Excel file into R.
+ - Connect to a database using the `DBI` package's `dbConnect` function.
+ - List the tables in a database using the `DBI` package's `dbListTables` function.
+ - Create a reference to a database table that is queriable using the `tbl` function from the `dbplyr` package.
+ - Retrieve data from a database query and bring it into R using the `collect` function from the `dbplyr` package.
+ - Use `write_csv` to save a data frame to a `.csv` file.
+ - (*Optional*) Obtain data using **a**pplication **p**rogramming **i**nterfaces (APIs) and web scraping.
+ - Read HTML source code from a URL using the `rvest` package.
+ - Read data from the Twitter API using the `rtweet` package.
+ - Compare downloading tabular data from a plain text file (e.g., `.csv`), accessing data from an API, and scraping the HTML source code from a website.

## Absolute and relative file paths
@@ -69,12 +69,12 @@ think of the path as directions to the file. There are two kinds of paths:
*relative* paths and *absolute* paths. A relative path is where the file is
with respect to where you currently are on the computer (e.g., where the file
you're working in is). On the other hand, an absolute path is where the file is
- in respect to the computer's filesystem's base (or root) folder.
+ with respect to the computer's filesystem base (or root) folder.

Suppose our computer's filesystem looks like the picture in Figure
\@ref(fig:file-system-for-export-to-intro-datascience), and we are working in a
file titled `worksheet_02.ipynb`. If we want to
- read in the `.csv` file named `happiness_report.csv` into R, we could do this
+ read the `.csv` file named `happiness_report.csv` into R, we could do this
using either a relative or an absolute path. We show both choices
below.\index{Happiness Report}
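
As a sketch of those two choices (the absolute path's folder names are assumptions based on the figure, not the book's exact layout):

```r
library(tidyverse)

# Relative path: starts from the folder containing worksheet_02.ipynb
happy_data <- read_csv("data/happiness_report.csv")

# Absolute path: starts from the filesystem's base (root) folder
# (the folders above "data" are hypothetical)
happy_data <- read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
```
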
@@ -135,7 +135,7 @@ Now that we have learned about *where* data could be, we will learn about *how*
to import data into R using various functions. Specifically, we will learn how
to *read* tabular data from a plain text file (a document containing only text)
*into* R and *write* tabular data to a file *out of* R. The function we use to do this
- depends on the file's format. For example, the last chapter, we learned about using
+ depends on the file's format. For example, in the last chapter, we learned about using
the `tidyverse` `read_csv` function when reading .csv (**c**omma-**s**eparated **v**alues)
files. \index{csv} In that case, the separator or *delimiter* \index{reading!delimiter} that divided our columns was a
comma (`,`). We only learned the case where the data matched the expected defaults
@@ -187,8 +187,8 @@ canlang_data <- read_csv("data/can_lang.csv")
> **Note:** It is also normal and expected that \index{warning} a message is
> printed out after using
> the `read_csv` and related functions. This message lets you know the data types
- > of each of the columns that R inferred while reading the data into R. In
- > future when we use this and related functions to load data in this book we will
+ > of each of the columns that R inferred while reading the data into R. In the
+ > future when we use this and related functions to load data in this book, we will
> silence these messages to help with the readability of the book.

```{r view-data}
@@ -197,7 +197,7 @@ canlang_data
```

### Skipping rows when reading in data

- Often times information about how data was collected, or other relevant
+ Oftentimes, information about how data was collected, or other relevant
information, is included at the top of the data file. This information is
usually written in sentence and paragraph form, with no delimiter because it is
not organized into columns. An example of this is shown below. This information
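
As a sketch of the fix this section builds toward, the `skip` argument tells `read_csv` how many such metadata lines to ignore (the file name and line count below are hypothetical):

```r
library(tidyverse)

# Skip the first 3 lines of metadata (hypothetical count) so that
# read_csv starts reading at the header row of the table itself
canlang_data <- read_csv("data/can_lang_meta-data.csv", skip = 3)
```
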
@@ -289,7 +289,7 @@ canlang_data <- read_tsv("data/can_lang_tab.tsv")
canlang_data
```

- Let's compare the data frame here to the resulting data frame in section
+ Let's compare the data frame here to the resulting data frame in Section
\@ref(readcsv) after using `read_csv`. Notice anything? They look the same! The
same number of columns/rows and column names! So we needed to use different
tools for the job depending on the file format and our resulting table
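
The remaining `read_*` variants from the learning objectives follow the same pattern; a brief sketch for contrast (file names hypothetical):

```r
library(tidyverse)

# read_csv2: for files that use ";" as the delimiter
# (and "," as the decimal mark, common in some European locales)
canlang_data <- read_csv2("data/can_lang_semicolon.csv")

# read_delim: the general-purpose reader; pass the delimiter explicitly,
# and use col_names = FALSE if the file has no header row
canlang_data <- read_delim("data/can_lang.tsv", delim = "\t",
                           col_names = FALSE)
```
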
@@ -395,7 +395,7 @@ There are many other ways to store tabular data sets beyond plain text files,
and similarly, many ways to load those data sets into R. For example, it is
very common to encounter, and need to load into R, data stored as a Microsoft
Excel \index{Excel spreadsheet}\index{Microsoft Excel|see{Excel
- spreadsheet}}\index{xlsx|see{Excel spreadsheet}} spreadsheet (with the filename
+ spreadsheet}}\index{xlsx|see{Excel spreadsheet}} spreadsheet (with the file name
extension `.xlsx`). To be able to do this, a key thing to know is that even
though `.csv` and `.xlsx` files look almost identical when loaded into Excel,
the data themselves are stored completely differently. While `.csv` files are
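
As a sketch of the `readxl` workflow this section builds toward (the file name and sheet choice are hypothetical):

```r
library(readxl)

# read_excel reads from a local file; the sheet argument selects
# which worksheet of the .xlsx file to load
canlang_data <- read_excel("data/can_lang.xlsx", sheet = 1)
```
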
@@ -753,10 +753,10 @@ databases at all?

Databases are beneficial in a large-scale setting:

- - they enable storing large data sets across multiple computers with backups
- - they provide mechanisms for ensuring data integrity and validating input
- - they provide security and data access control
- - they allow multiple users to access data simultaneously and remotely without conflicts and errors.
+ - They enable storing large data sets across multiple computers with backups.
+ - They provide mechanisms for ensuring data integrity and validating input.
+ - They provide security and data access control.
+ - They allow multiple users to access data simultaneously and remotely without conflicts and errors.

For example, [there are billions of Google searches conducted daily](https://www.internetlivestats.com/google-search-statistics/).
Can you imagine if Google stored all of the data from those searches in a single `.csv`
file!? Chaos would ensue!
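
The chapter walks through the database workflow in detail later; as a compact sketch of the functions named in the learning objectives (the database file and table name below are hypothetical):

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Connect to an SQLite database stored in a local file
conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")

# List the tables the database contains
dbListTables(conn)

# Create a queriable reference to one table without loading it into memory
lang_db <- tbl(conn, "lang")

# Build a query lazily, then use collect to bring the result into R
aboriginal_lang_db <- filter(lang_db, category == "Aboriginal languages")
aboriginal_lang_data <- collect(aboriginal_lang_db)
```
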
@@ -786,7 +786,7 @@ write_csv(no_official_lang_data, "data/no_official_languages.csv")

Data doesn't just magically appear on your computer; you need to get it from
somewhere. Earlier in the chapter we showed you how to access data stored in a
- plaintext, spreadsheet-like format (e.g., comma- or tab-separated) from a web
+ plain text, spreadsheet-like format (e.g., comma- or tab-separated) from a web
URL using one of the `read_*` functions from the `tidyverse`. But as time goes
on, it is increasingly uncommon to find data (especially large amounts of data)
in this format available for download from a URL. Instead, websites now often
@@ -816,7 +816,7 @@ see, you can collect that data programmatically---in the form of
and **c**ascading **s**tyle **s**heet (CSS) code---and process it
to extract useful information. HTML provides the
basic structure of a site and tells the webpage how to display the content
- (e.g., titles, paragraphs, bullet lists etc.). Whereas CSS helps style the
+ (e.g., titles, paragraphs, bullet lists, etc.), whereas CSS helps style the
content and tells the webpage how the HTML elements should
be presented (e.g., colors, layouts, fonts, etc.).
@@ -931,7 +931,7 @@ websites are quite a bit larger and more complex, and so is their website
source code. Fortunately, there are tools available to make this process
easier. For example,
[SelectorGadget from the Chrome Web Store](https://chrome.google.com/webstore/detail/SelectorGadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) is
- an open-source tool that simplifies identifying the generating and finding CSS selectors.
+ an open-source tool that simplifies generating and finding CSS selectors.
At the end of the chapter in the additional resources section, we include a link to
a short video on how to install and use the SelectorGadget tool to
obtain CSS selectors for use in web scraping.
@@ -965,7 +965,7 @@ knitr::include_graphics("img/sg2.png")
So to scrape information about the square footage and rental price
of apartment listings, we need to use
the two CSS selectors `.housing` and `.result-price`, respectively.
- The selector gadget returns them to us as a comma separated list (here
+ The selector gadget returns them to us as a comma-separated list (here
`.housing , .result-price`), which is exactly the format we need to provide to
R if we are using more than one CSS selector.
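
In code, supplying both selectors amounts to passing that comma-separated string to `html_nodes`; a sketch with a hypothetical Craigslist search URL:

```r
library(rvest)

# Download the HTML source of a search page (hypothetical URL)
page <- read_html("https://vancouver.craigslist.org/search/apa")

# Select the size and price nodes with both CSS selectors at once
listing_nodes <- html_nodes(page, css = ".housing , .result-price")

# Extract the text content of the selected nodes
listing_text <- html_text(listing_nodes)
```
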
@@ -1031,7 +1031,7 @@ instead of displaying the website to you, the `read_html` function just returns
the HTML source code itself, which we have
stored in the `page` variable. Next, we send the page object to the `html_nodes`
function, along with the CSS selectors we obtained from
- the SelectorGadget tool. Make sure to surround the selectors with quotations; the function, `html_nodes`, expects that
+ the SelectorGadget tool. Make sure to surround the selectors with quotation marks; the function, `html_nodes`, expects that
argument to be a string. The `html_nodes` function then selects *nodes* from the HTML document that
match the CSS selectors you specified. A *node* is an HTML tag pair (e.g.,
`<td>` and `</td>` which defines the cell of a table) combined with the content
@@ -1047,7 +1047,7 @@ node that would be selected would be:
We store the result of the `html_nodes` function in the `population_nodes` variable.
Note that below we use the `paste` function with a comma separator (`sep = ","`)
to build the list of selectors. The `paste` function converts
- elements to character and combines the values into a list. We use this function to
+ elements to characters and combines the values into a list. We use this function to
build the list of selectors to maintain code readability; this avoids
having one very long line of code with the string
`"td:nth-child(5),td:nth-child(7),.infobox:nth-child(122) td:nth-child(1),.infobox td:nth-child(3)"`
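
Putting this together, a sketch of the selection step (the exact Wikipedia article is an assumption; the selectors are the ones shown above):

```r
library(rvest)

# Download the HTML source of the article (hypothetical URL)
page <- read_html("https://en.wikipedia.org/wiki/Canada")

# Build the long selector string piece by piece for readability
selectors <- paste("td:nth-child(5)",
                   "td:nth-child(7)",
                   ".infobox:nth-child(122) td:nth-child(1)",
                   ".infobox td:nth-child(3)",
                   sep = ",")

# Select the matching nodes and extract their text content
population_nodes <- html_nodes(page, css = selectors)
population_text <- html_text(population_nodes)
```
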
@@ -1081,7 +1081,7 @@ data frame with one character column for city and one numeric column for
population (like a spreadsheet).
Additionally, the populations contain commas (not useful for programmatically
dealing with numbers), and some even contain a line break character at the end
- (`\n`). In chapter \@ref(wrangling), we will learn more about how to *wrangle* data
+ (`\n`). In Chapter \@ref(wrangling), we will learn more about how to *wrangle* data
such as this into a more useful format for data analysis using R.

### Using an API
@@ -1090,7 +1090,7 @@ Rather than posting a data file at a URL for you to download, many websites thes
provide an API \index{API} that must be accessed through a programming language like R. The benefit of this
is that data owners have much more control over the data they provide to users. However, unlike
web scraping, there is no consistent way to access an API across websites. Every website typically
- has its own API designed specially for its own use-case. Therefore we will just provide one example
+ has its own API designed specifically for its own use case. Therefore, we will provide just one example
of accessing data through an API in this book, with the hope that it gives you enough of a basic
idea that you can learn how to use another API if needed.
@@ -1192,7 +1192,7 @@ tidyverse_tweets <- select(tidyverse_tweets,
tidyverse_tweets
```

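For context, a rough sketch of how a data frame like `tidyverse_tweets` might have been produced with `rtweet` (authentication setup omitted; the column names follow rtweet's pre-1.0 output, and the tweet count is arbitrary):

```r
library(rtweet)
library(dplyr)

# Ask the Twitter API for recent tweets from the @tidyverse account
tidyverse_tweets <- get_timeline("tidyverse", n = 400)

# Keep a few columns of interest (names are assumptions)
tidyverse_tweets <- select(tidyverse_tweets,
                           created_at, retweet_count, favorite_count, text)
```
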
- If you look back up at the image of the Tidyverse twitter page, you will
+ If you look back up at the image of the Tidyverse Twitter page, you will
recognize the text of the most recent few tweets in the above data frame. In
other words, we have successfully created a small data set using the Twitter
API---neat! This data is also quite different from what we obtained from web scraping;
@@ -1228,6 +1228,6 @@ found in Chapter \@ref(move-to-your-own-machine).
- The [`rio` package](https://github.com/leeper/rio) provides an alternative set of tools for reading and writing data in R. It aims to be a "Swiss army knife" for data reading/writing/converting, and supports a wide variety of data types (including data formats generated by other statistical software like SPSS and SAS).
- This [video](https://www.youtube.com/embed/ephId3mYu9o) from the [Udacity course "Linux Command Line Basics"](https://www.udacity.com/course/linux-command-line-basics--ud595) provides a good explanation of absolute versus relative paths.
- If you read the subsection on obtaining data from the web via scraping and APIs, we provide two companion tutorial video links:
- - [A brief video tutorial](https://www.youtube.com/embed/YdIWI6K64zo) on using the SelectorGadget tool to obtain desired CSS selectors for extracting the price and size data for apartment listings on CraigsList
+ - [A brief video tutorial](https://www.youtube.com/embed/YdIWI6K64zo) on using the SelectorGadget tool to obtain desired CSS selectors for extracting the price and size data for apartment listings on Craigslist
- [Another brief video tutorial](https://www.youtube.com/embed/O9HKbdhqYzk) on using the SelectorGadget tool to obtain desired CSS selectors for extracting Canadian city names and 2016 census populations from Wikipedia
- The [`polite` package](https://cran.r-project.org/web/packages/polite/index.html) provides a set of tools for responsibly scraping data from websites.