Commit ea1c317

Merge pull request #386 from UBC-DSCI/reading
reading copyedit pass
2 parents 29f548d + 8f80e59

File tree

1 file changed: +38 -38 lines changed

reading.Rmd

Lines changed: 38 additions & 38 deletions
@@ -24,35 +24,35 @@ second step to all data analyses). It’s like making sure your shoelaces are
 tied well before going for a run so that you don’t trip later on!

 ## Chapter learning objectives
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:

-- define the following:
+- Define the following:
     - absolute file path
     - relative file path
     - **U**niform **R**esource **L**ocator (URL)
-- read data into R using a relative path and a URL
-- compare and contrast the following functions:
+- Read data into R using a relative path and a URL.
+- Compare and contrast the following functions:
     - `read_csv`
     - `read_tsv`
     - `read_csv2`
     - `read_delim`
     - `read_excel`
-- match the following `tidyverse` `read_*` function arguments to their descriptions:
+- Match the following `tidyverse` `read_*` function arguments to their descriptions:
     - `file`
     - `delim`
     - `col_names`
     - `skip`
-- choose the appropriate `tidyverse` `read_*` function and function arguments to load a given plain text tabular data set into R
-- use `readxl` package's `read_excel` function and arguments to load a sheet from an excel file into R
-- connect to a database using the `DBI` package's `dbConnect` function
-- list the tables in a database using the `DBI` package's `dbListTables` function
-- create a reference to a database table that is queriable using the `tbl` from the `dbplyr` package
-- retrieve data from a database query and bring it into R using the `collect` function from the `dbplyr` package
-- use `write_csv` to save a data frame to a `.csv` file
-- (*optional*) obtain data using **a**pplication **p**rogramming **i**nterfaces (APIs) and web scraping
-    - read HTML source code from a URL using the `rvest` package
-    - read data from the Twitter API using the `rtweet` package
-    - compare downloading tabular data from a plain text file (e.g., `.csv`), accessing data from an API, and scraping the HTML source code from a website
+- Choose the appropriate `tidyverse` `read_*` function and function arguments to load a given plain text tabular data set into R.
+- Use `readxl` package's `read_excel` function and arguments to load a sheet from an excel file into R.
+- Connect to a database using the `DBI` package's `dbConnect` function.
+- List the tables in a database using the `DBI` package's `dbListTables` function.
+- Create a reference to a database table that is queriable using the `tbl` from the `dbplyr` package.
+- Retrieve data from a database query and bring it into R using the `collect` function from the `dbplyr` package.
+- Use `write_csv` to save a data frame to a `.csv` file.
+- (*Optional*) Obtain data using **a**pplication **p**rogramming **i**nterfaces (APIs) and web scraping.
+    - Read HTML source code from a URL using the `rvest` package.
+    - Read data from the Twitter API using the `rtweet` package.
+    - Compare downloading tabular data from a plain text file (e.g., `.csv`), accessing data from an API, and scraping the HTML source code from a website.

 ## Absolute and relative file paths

@@ -69,12 +69,12 @@ think of the path as directions to the file. There are two kinds of paths:
 *relative* paths and *absolute* paths. A relative path is where the file is
 with respect to where you currently are on the computer (e.g., where the file
 you're working in is). On the other hand, an absolute path is where the file is
-in respect to the computer's filesystem's base (or root) folder.
+in respect to the computer's filesystem base (or root) folder.

 Suppose our computer's filesystem looks like the picture in Figure
 \@ref(fig:file-system-for-export-to-intro-datascience), and we are working in a
 file titled `worksheet_02.ipynb`. If we want to
-read in the `.csv` file named `happiness_report.csv` into R, we could do this
+read the `.csv` file named `happiness_report.csv` into R, we could do this
 using either a relative or an absolute path. We show both choices
 below.\index{Happiness Report}

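The two choices in the hunk above could look like the following sketch; the absolute path shown is a made-up example (where your project folder lives will differ):

```r
library(tidyverse)

# relative path: interpreted starting from the folder containing worksheet_02.ipynb
happy_data <- read_csv("data/happiness_report.csv")

# absolute path: interpreted starting from the filesystem root (or base) folder
# (this path is hypothetical; yours will depend on your machine)
happy_data <- read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
```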
@@ -135,7 +135,7 @@ Now that we have learned about *where* data could be, we will learn about *how*
 to import data into R using various functions. Specifically, we will learn how
 to *read* tabular data from a plain text file (a document containing only text)
 *into* R and *write* tabular data to a file *out of* R. The function we use to do this
-depends on the file's format. For example, the last chapter, we learned about using
+depends on the file's format. For example, in the last chapter, we learned about using
 the `tidyverse` `read_csv` function when reading .csv (**c**omma-**s**eparated **v**alues)
 files. \index{csv} In that case, the separator or *delimiter* \index{reading!delimiter} that divided our columns was a
 comma (`,`). We only learned the case where the data matched the expected defaults
@@ -187,8 +187,8 @@ canlang_data <- read_csv("data/can_lang.csv")
 > **Note:** It is also normal and expected that \index{warning} a message is
 > printed out after using
 > the `read_csv` and related functions. This message lets you know the data types
-> of each of the columns that R inferred while reading the data into R. In
-> future when we use this and related functions to load data in this book we will
+> of each of the columns that R inferred while reading the data into R. In the
+> future when we use this and related functions to load data in this book, we will
 > silence these messages to help with the readability of the book.

 ```{r view-data}
@@ -197,7 +197,7 @@ canlang_data

 ### Skipping rows when reading in data

-Often times information about how data was collected, or other relevant
+Oftentimes, information about how data was collected, or other relevant
 information, is included at the top of the data file. This information is
 usually written in sentence and paragraph form, with no delimiter because it is
 not organized into columns. An example of this is shown below. This information
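A minimal sketch of handling such metadata with the `read_csv` `skip` argument; the file name and the number of lines to skip are assumptions for illustration:

```r
library(tidyverse)

# suppose the first 3 lines of the file are free-text notes about how the data
# was collected; skip = 3 tells read_csv to begin reading at line 4
canlang_data <- read_csv("data/can_lang_meta-data.csv", skip = 3)
```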
@@ -289,7 +289,7 @@ canlang_data <- read_tsv("data/can_lang_tab.tsv")
 canlang_data
 ```

-Let's compare the data frame here to the resulting data frame in section
+Let's compare the data frame here to the resulting data frame in Section
 \@ref(readcsv) after using `read_csv`. Notice anything? They look the same! The
 same number of columns/rows and column names! So we needed to use different
 tools for the job depending on the file format and our resulting table
@@ -395,7 +395,7 @@ There are many other ways to store tabular data sets beyond plain text files,
 and similarly, many ways to load those data sets into R. For example, it is
 very common to encounter, and need to load into R, data stored as a Microsoft
 Excel \index{Excel spreadsheet}\index{Microsoft Excel|see{Excel
-spreadsheet}}\index{xlsx|see{Excel spreadsheet}} spreadsheet (with the filename
+spreadsheet}}\index{xlsx|see{Excel spreadsheet}} spreadsheet (with the file name
 extension `.xlsx`). To be able to do this, a key thing to know is that even
 though `.csv` and `.xlsx` files look almost identical when loaded into Excel,
 the data themselves are stored completely differently. While `.csv` files are
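Loading a sheet from such a spreadsheet with `readxl` might look like the following sketch; the workbook name and sheet argument are assumptions for illustration:

```r
library(readxl)

# read one sheet from an Excel workbook; sheet can be a sheet name or a
# 1-based position (here the first sheet)
canlang_data <- read_excel("data/can_lang.xlsx", sheet = 1)
```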
@@ -753,10 +753,10 @@ databases at all?

 Databases are beneficial in a large-scale setting:

-- they enable storing large data sets across multiple computers with backups
-- they provide mechanisms for ensuring data integrity and validating input
-- they provide security and data access control
-- they allow multiple users to access data simultaneously and remotely without conflicts and errors.
+- They enable storing large data sets across multiple computers with backups.
+- They provide mechanisms for ensuring data integrity and validating input.
+- They provide security and data access control.
+- They allow multiple users to access data simultaneously and remotely without conflicts and errors.
 For example, [there are billions of Google searches conducted daily](https://www.internetlivestats.com/google-search-statistics/).
 Can you imagine if Google stored all of the data from those searches in a single `.csv
 file`!? Chaos would ensue!
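The database workflow named in the learning objectives (`dbConnect`, `dbListTables`, `tbl`, `collect`) could be sketched as follows; the SQLite file name and table name are assumptions for illustration:

```r
library(DBI)
library(dbplyr)

# connect to an SQLite database stored in a local file (hypothetical file name)
conn_lang_data <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")

# list the tables the database contains
dbListTables(conn_lang_data)

# create a queriable reference to one table (assumed table name), then
# retrieve the query results into an ordinary R data frame
lang_db <- tbl(conn_lang_data, "lang")
lang_data <- collect(lang_db)
```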
@@ -786,7 +786,7 @@ write_csv(no_official_lang_data, "data/no_official_languages.csv")

 Data doesn't just magically appear on your computer; you need to get it from
 somewhere. Earlier in the chapter we showed you how to access data stored in a
-plaintext, spreadsheet-like format (e.g., comma- or tab-separated) from a web
+plain text, spreadsheet-like format (e.g., comma- or tab-separated) from a web
 URL using one of the `read_*` functions from the `tidyverse`. But as time goes
 on, it is increasingly uncommon to find data (especially large amounts of data)
 in this format available for download from a URL. Instead, websites now often
@@ -816,7 +816,7 @@ see, you can collect that data programmatically---in the form of
 and **c**ascading **s**tyle **s**heet (CSS) code---and process it
 to extract useful information. HTML provides the
 basic structure of a site and tells the webpage how to display the content
-(e.g., titles, paragraphs, bullet lists etc.). Whereas CSS helps style the
+(e.g., titles, paragraphs, bullet lists etc.), whereas CSS helps style the
 content and tells the webpage how the HTML elements should
 be presented (e.g., colors, layouts, fonts etc.).

@@ -931,7 +931,7 @@ websites are quite a bit larger and more complex, and so is their website
 source code. Fortunately, there are tools available to make this process
 easier. For example,
 [SelectorGadget from the Chrome Web Store](https://chrome.google.com/webstore/detail/SelectorGadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) is
-an open-source tool that simplifies identifying the generating and finding CSS selectors.
+an open-source tool that simplifies identifying the generating and finding of CSS selectors.
 At the end of the chapter in the additional resources section, we include a link to
 a short video on how to install and use the SelectorGadget tool to
 obtain CSS selectors for use in web scraping.
@@ -965,7 +965,7 @@ knitr::include_graphics("img/sg2.png")
 So to scrape information about the square footage and rental price
 of apartment listings, we need to use
 the two CSS selectors `.housing` and `.result-price`, respectively.
-The selector gadget returns them to us as a comma separated list (here
+The selector gadget returns them to us as a comma-separated list (here
 `.housing , .result-price`), which is exactly the format we need to provide to
 R if we are using more than one CSS selector.

@@ -1031,7 +1031,7 @@ instead of displaying the website to you, the `read_html` function just returns
 the HTML source code itself, which we have
 stored in the `page` variable. Next, we send the page object to the `html_nodes`
 function, along with the CSS selectors we obtained from
-the SelectorGadget tool. Make sure to surround the selectors with quotations; the function, `html_nodes`, expects that
+the SelectorGadget tool. Make sure to surround the selectors with quotation marks; the function, `html_nodes`, expects that
 argument is a string. The `html_nodes` function then selects *nodes* from the HTML document that
 match the CSS selectors you specified. A *node* is an HTML tag pair (e.g.,
 `<td>` and `</td>` which defines the cell of a table) combined with the content
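The `read_html`/`html_nodes` steps described in this hunk could be sketched as follows, using the Craigslist selectors quoted earlier in the chapter; the URL is an assumption for illustration:

```r
library(rvest)

# download the HTML source code of the page (hypothetical URL)
page <- read_html("https://vancouver.craigslist.org/search/apa")

# select the nodes matching our CSS selectors; note the selectors are
# passed as a single quoted string
nodes <- html_nodes(page, ".housing , .result-price")

# extract the text content of the selected nodes
html_text(nodes)
```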
@@ -1047,7 +1047,7 @@ node that would be selected would be:
 We store the result of the `html_nodes` function in the `population_nodes` variable.
 Note that below we use the `paste` function with a comma separator (`sep=",")`
 to build the list of selectors. The `paste` function converts
-elements to character and combines the values into a list. We use this function to
+elements to characters and combines the values into a list. We use this function to
 build the list of selectors to maintain code readability; this avoids
 having one very long line of code with the string
 `"td:nth-child(5),td:nth-child(7),.infobox:nth-child(122) td:nth-child(1),.infobox td:nth-child(3)"`
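The `paste` call this hunk describes, built from the selector string quoted just above, might look like:

```r
# combine the individual CSS selectors into the single comma-separated string
# that html_nodes expects, keeping each selector on its own readable line
selectors <- paste("td:nth-child(5)",
                   "td:nth-child(7)",
                   ".infobox:nth-child(122) td:nth-child(1)",
                   ".infobox td:nth-child(3)",
                   sep = ",")
```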
@@ -1081,7 +1081,7 @@ data frame with one character column for city and one numeric column for
 population (like a spreadsheet).
 Additionally, the populations contain commas (not useful for programmatically
 dealing with numbers), and some even contain a line break character at the end
-(`\n`). In chapter \@ref(wrangling), we will learn more about how to *wrangle* data
+(`\n`). In Chapter \@ref(wrangling), we will learn more about how to *wrangle* data
 such as this into a more useful format for data analysis using R.

 ### Using an API
@@ -1090,7 +1090,7 @@ Rather than posting a data file at a URL for you to download, many websites thes
 provide an API \index{API} that must be accessed through a programming language like R. The benefit of this
 is that data owners have much more control over the data they provide to users. However, unlike
 web scraping, there is no consistent way to access an API across websites. Every website typically
-has its own API designed specially for its own use-case. Therefore we will just provide one example
+has its own API designed especially for its own use-case. Therefore we will just provide one example
 of accessing data through an API in this book, with the hope that it gives you enough of a basic
 idea that you can learn how to use another API if needed.

@@ -1192,7 +1192,7 @@ tidyverse_tweets <- select(tidyverse_tweets,
 tidyverse_tweets
 ```

-If you look back up at the image of the Tidyverse twitter page, you will
+If you look back up at the image of the Tidyverse Twitter page, you will
 recognize the text of the most recent few tweets in the above data frame. In
 other words, we have successfully created a small data set using the Twitter
 API---neat! This data is also quite different from what we obtained from web scraping;
@@ -1228,6 +1228,6 @@ found in Chapter \@ref(move-to-your-own-machine).
 - The [`rio` package](https://github.com/leeper/rio) provides an alternative set of tools for reading and writing data in R. It aims to be a "Swiss army knife" for data reading/writing/converting, and supports a wide variety of data types (including data formats generated by other statistical software like SPSS and SAS).
 - This [video](https://www.youtube.com/embed/ephId3mYu9o) from the [Udacity course "Linux Command Line Basics"](https://www.udacity.com/course/linux-command-line-basics--ud595) provides a good explanation of absolute versus relative paths.
 - If you read the subsection on obtaining data from the web via scraping and APIs, we provide two companion tutorial video links:
-    - [A brief video tutorial](https://www.youtube.com/embed/YdIWI6K64zo) on using the SelectorGadget tool to obtain desired CSS selectors for extracting the price and size data for apartment listings on CraigsList
+    - [A brief video tutorial](https://www.youtube.com/embed/YdIWI6K64zo) on using the SelectorGadget tool to obtain desired CSS selectors for extracting the price and size data for apartment listings on Craigslist
     - [Another brief video tutorial](https://www.youtube.com/embed/O9HKbdhqYzk) on using the SelectorGadget tool to obtain desired CSS selectors for extracting Canadian city names and 2016 census populations from Wikipedia
 - The [`polite` package](https://cran.r-project.org/web/packages/polite/index.html) provides a set of tools for responsibly scraping data from websites.
