@@ -24,35 +24,35 @@ second step to all data analyses). It’s like making sure your shoelaces are
tied well before going for a run so that you don’t trip later on!

## Chapter learning objectives
- By the end of the chapter, readers will be able to:
+ By the end of the chapter, readers will be able to do the following:

- - define the following:
+ - Define the following:
- absolute file path
- relative file path
- **U**niform **R**esource **L**ocator (URL)
- - read data into R using a relative path and a URL
- - compare and contrast the following functions:
+ - Read data into R using a relative path and a URL.
+ - Compare and contrast the following functions:
- `read_csv`
- `read_tsv`
- `read_csv2`
- `read_delim`
- `read_excel`
- - match the following `tidyverse` `read_*` function arguments to their descriptions:
+ - Match the following `tidyverse` `read_*` function arguments to their descriptions:
- `file`
- `delim`
- `col_names`
- `skip`
- - choose the appropriate `tidyverse` `read_*` function and function arguments to load a given plain text tabular data set into R
- - use `readxl` package's `read_excel` function and arguments to load a sheet from an excel file into R
- - connect to a database using the `DBI` package's `dbConnect` function
- - list the tables in a database using the `DBI` package's `dbListTables` function
- - create a reference to a database table that is queriable using the `tbl` from the `dbplyr` package
- - retrieve data from a database query and bring it into R using the `collect` function from the `dbplyr` package
- - use `write_csv` to save a data frame to a `.csv` file
- - (*optional*) obtain data using **a**pplication **p**rogramming **i**nterfaces (APIs) and web scraping
- - read HTML source code from a URL using the `rvest` package
- - read data from the Twitter API using the `rtweet` package
- - compare downloading tabular data from a plain text file (e.g., `.csv`), accessing data from an API, and scraping the HTML source code from a website
+ - Choose the appropriate `tidyverse` `read_*` function and function arguments to load a given plain text tabular data set into R.
+ - Use the `readxl` package's `read_excel` function and arguments to load a sheet from an Excel file into R.
+ - Connect to a database using the `DBI` package's `dbConnect` function.
+ - List the tables in a database using the `DBI` package's `dbListTables` function.
+ - Create a reference to a database table that is queriable using the `tbl` function from the `dbplyr` package.
+ - Retrieve data from a database query and bring it into R using the `collect` function from the `dbplyr` package.
+ - Use `write_csv` to save a data frame to a `.csv` file.
+ - (*Optional*) Obtain data using **a**pplication **p**rogramming **i**nterfaces (APIs) and web scraping.
+ - Read HTML source code from a URL using the `rvest` package.
+ - Read data from the Twitter API using the `rtweet` package.
+ - Compare downloading tabular data from a plain text file (e.g., `.csv`), accessing data from an API, and scraping the HTML source code from a website.

## Absolute and relative file paths
@@ -69,12 +69,12 @@ think of the path as directions to the file. There are two kinds of paths:
*relative* paths and *absolute* paths. A relative path is where the file is
with respect to where you currently are on the computer (e.g., where the file
you're working in is). On the other hand, an absolute path is where the file is
- in respect to the computer's filesystem's base (or root) folder.
+ with respect to the computer's filesystem base (or root) folder.

Suppose our computer's filesystem looks like the picture in Figure
\@ref(fig:file-system-for-export-to-intro-datascience), and we are working in a
file titled `worksheet_02.ipynb`. If we want to
- read in the `.csv` file named `happiness_report.csv` into R, we could do this
+ read the `.csv` file named `happiness_report.csv` into R, we could do this
using either a relative or an absolute path. We show both choices
below.\index{Happiness Report}
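
As a sketch of those two choices (the absolute path's folder names are assumptions based on the figure, not the book's exact layout):

```r
library(tidyverse)

# Relative path: starts from the folder containing worksheet_02.ipynb
happy_data <- read_csv("data/happiness_report.csv")

# Absolute path: starts from the filesystem's base (root) folder
# (the folders above "data" are hypothetical)
happy_data <- read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
```
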
@@ -135,7 +135,7 @@ Now that we have learned about *where* data could be, we will learn about *how*
to import data into R using various functions. Specifically, we will learn how
to *read* tabular data from a plain text file (a document containing only text)
*into* R and *write* tabular data to a file *out of* R. The function we use to do this
- depends on the file's format. For example, the last chapter, we learned about using
+ depends on the file's format. For example, in the last chapter, we learned about using
the `tidyverse` `read_csv` function when reading .csv (**c**omma-**s**eparated **v**alues)
files. \index{csv} In that case, the separator or *delimiter* \index{reading!delimiter} that divided our columns was a
comma (`,`). We only learned the case where the data matched the expected defaults
@@ -187,8 +187,8 @@ canlang_data <- read_csv("data/can_lang.csv")
> **Note:** It is also normal and expected that \index{warning} a message is
> printed out after using
> the `read_csv` and related functions. This message lets you know the data types
- > of each of the columns that R inferred while reading the data into R. In
- > future when we use this and related functions to load data in this book we will
+ > of each of the columns that R inferred while reading the data into R. In the
+ > future when we use this and related functions to load data in this book, we will
> silence these messages to help with the readability of the book.

```{r view-data}
@@ -197,7 +197,7 @@ canlang_data
```

### Skipping rows when reading in data

- Often times information about how data was collected, or other relevant
+ Oftentimes, information about how data was collected, or other relevant
information, is included at the top of the data file. This information is
usually written in sentence and paragraph form, with no delimiter because it is
not organized into columns. An example of this is shown below. This information
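
As a sketch of the fix this section builds toward, the `skip` argument tells `read_csv` how many such metadata lines to ignore (the file name and line count below are hypothetical):

```r
library(tidyverse)

# Skip the first 3 lines of metadata (hypothetical count) so that
# read_csv starts reading at the header row of the table itself
canlang_data <- read_csv("data/can_lang_meta-data.csv", skip = 3)
```
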
@@ -289,7 +289,7 @@ canlang_data <- read_tsv("data/can_lang_tab.tsv")
canlang_data
```

- Let's compare the data frame here to the resulting data frame in section
+ Let's compare the data frame here to the resulting data frame in Section
\@ref(readcsv) after using `read_csv`. Notice anything? They look the same! The
same number of columns/rows and column names! So we needed to use different
tools for the job depending on the file format and our resulting table
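
The remaining `read_*` variants from the learning objectives follow the same pattern; a brief sketch for contrast (file names hypothetical):

```r
library(tidyverse)

# read_csv2: for files that use ";" as the delimiter
# (and "," as the decimal mark, common in some European locales)
canlang_data <- read_csv2("data/can_lang_semicolon.csv")

# read_delim: the general-purpose reader; pass the delimiter explicitly,
# and use col_names = FALSE if the file has no header row
canlang_data <- read_delim("data/can_lang.tsv", delim = "\t",
                           col_names = FALSE)
```
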
@@ -395,7 +395,7 @@ There are many other ways to store tabular data sets beyond plain text files,
and similarly, many ways to load those data sets into R. For example, it is
very common to encounter, and need to load into R, data stored as a Microsoft
Excel \index{Excel spreadsheet}\index{Microsoft Excel|see{Excel
- spreadsheet}}\index{xlsx|see{Excel spreadsheet}} spreadsheet (with the filename
+ spreadsheet}}\index{xlsx|see{Excel spreadsheet}} spreadsheet (with the file name
extension `.xlsx`). To be able to do this, a key thing to know is that even
though `.csv` and `.xlsx` files look almost identical when loaded into Excel,
the data themselves are stored completely differently. While `.csv` files are
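
As a sketch of the `readxl` workflow this section builds toward (the file name and sheet choice are hypothetical):

```r
library(readxl)

# read_excel reads from a local file; the sheet argument selects
# which worksheet of the .xlsx file to load
canlang_data <- read_excel("data/can_lang.xlsx", sheet = 1)
```
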
@@ -753,10 +753,10 @@ databases at all?

Databases are beneficial in a large-scale setting:

- - they enable storing large data sets across multiple computers with backups
- - they provide mechanisms for ensuring data integrity and validating input
- - they provide security and data access control
- - they allow multiple users to access data simultaneously and remotely without conflicts and errors.
+ - They enable storing large data sets across multiple computers with backups.
+ - They provide mechanisms for ensuring data integrity and validating input.
+ - They provide security and data access control.
+ - They allow multiple users to access data simultaneously and remotely without conflicts and errors.

For example, [there are billions of Google searches conducted daily](https://www.internetlivestats.com/google-search-statistics/).
Can you imagine if Google stored all of the data from those searches in a single `.csv`
file!? Chaos would ensue!
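
The chapter walks through the database workflow in detail later; as a compact sketch of the functions named in the learning objectives (the database file and table name below are hypothetical):

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Connect to an SQLite database stored in a local file
conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")

# List the tables the database contains
dbListTables(conn)

# Create a queriable reference to one table without loading it into memory
lang_db <- tbl(conn, "lang")

# Build a query lazily, then use collect to bring the result into R
aboriginal_lang_db <- filter(lang_db, category == "Aboriginal languages")
aboriginal_lang_data <- collect(aboriginal_lang_db)
```
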
@@ -786,7 +786,7 @@ write_csv(no_official_lang_data, "data/no_official_languages.csv")

Data doesn't just magically appear on your computer; you need to get it from
somewhere. Earlier in the chapter we showed you how to access data stored in a
- plaintext, spreadsheet-like format (e.g., comma- or tab-separated) from a web
+ plain text, spreadsheet-like format (e.g., comma- or tab-separated) from a web
URL using one of the `read_*` functions from the `tidyverse`. But as time goes
on, it is increasingly uncommon to find data (especially large amounts of data)
in this format available for download from a URL. Instead, websites now often
@@ -816,7 +816,7 @@ see, you can collect that data programmatically---in the form of
and **c**ascading **s**tyle **s**heet (CSS) code---and process it
to extract useful information. HTML provides the
basic structure of a site and tells the webpage how to display the content
- (e.g., titles, paragraphs, bullet lists etc.). Whereas CSS helps style the
+ (e.g., titles, paragraphs, bullet lists, etc.), whereas CSS helps style the
content and tells the webpage how the HTML elements should
be presented (e.g., colors, layouts, fonts, etc.).
@@ -931,7 +931,7 @@ websites are quite a bit larger and more complex, and so is their website
source code. Fortunately, there are tools available to make this process
easier. For example,
[SelectorGadget from the Chrome Web Store](https://chrome.google.com/webstore/detail/SelectorGadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) is
- an open-source tool that simplifies identifying the generating and finding CSS selectors.
+ an open-source tool that simplifies generating and finding CSS selectors.
At the end of the chapter in the additional resources section, we include a link to
a short video on how to install and use the SelectorGadget tool to
obtain CSS selectors for use in web scraping.
@@ -965,7 +965,7 @@ knitr::include_graphics("img/sg2.png")
So to scrape information about the square footage and rental price
of apartment listings, we need to use
the two CSS selectors `.housing` and `.result-price`, respectively.
- The selector gadget returns them to us as a comma separated list (here
+ The selector gadget returns them to us as a comma-separated list (here
`.housing , .result-price`), which is exactly the format we need to provide to
R if we are using more than one CSS selector.
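
In code, supplying both selectors amounts to passing that comma-separated string to `html_nodes`; a sketch with a hypothetical Craigslist search URL:

```r
library(rvest)

# Download the HTML source of a search page (hypothetical URL)
page <- read_html("https://vancouver.craigslist.org/search/apa")

# Select the size and price nodes with both CSS selectors at once
listing_nodes <- html_nodes(page, css = ".housing , .result-price")

# Extract the text content of the selected nodes
listing_text <- html_text(listing_nodes)
```
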
@@ -1031,7 +1031,7 @@ instead of displaying the website to you, the `read_html` function just returns
the HTML source code itself, which we have
stored in the `page` variable. Next, we send the page object to the `html_nodes`
function, along with the CSS selectors we obtained from
- the SelectorGadget tool. Make sure to surround the selectors with quotations; the function, `html_nodes`, expects that
+ the SelectorGadget tool. Make sure to surround the selectors with quotation marks; the function, `html_nodes`, expects that
argument to be a string. The `html_nodes` function then selects *nodes* from the HTML document that
match the CSS selectors you specified. A *node* is an HTML tag pair (e.g.,
`<td>` and `</td>` which defines the cell of a table) combined with the content
@@ -1047,7 +1047,7 @@ node that would be selected would be:
We store the result of the `html_nodes` function in the `population_nodes` variable.
Note that below we use the `paste` function with a comma separator (`sep = ","`)
to build the list of selectors. The `paste` function converts
- elements to character and combines the values into a list. We use this function to
+ elements to characters and combines the values into a list. We use this function to
build the list of selectors to maintain code readability; this avoids
having one very long line of code with the string
`"td:nth-child(5),td:nth-child(7),.infobox:nth-child(122) td:nth-child(1),.infobox td:nth-child(3)"`
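
Putting this together, a sketch of the selection step (the exact Wikipedia article is an assumption; the selectors are the ones shown above):

```r
library(rvest)

# Download the HTML source of the article (hypothetical URL)
page <- read_html("https://en.wikipedia.org/wiki/Canada")

# Build the long selector string piece by piece for readability
selectors <- paste("td:nth-child(5)",
                   "td:nth-child(7)",
                   ".infobox:nth-child(122) td:nth-child(1)",
                   ".infobox td:nth-child(3)",
                   sep = ",")

# Select the matching nodes and extract their text content
population_nodes <- html_nodes(page, css = selectors)
population_text <- html_text(population_nodes)
```
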
@@ -1081,7 +1081,7 @@ data frame with one character column for city and one numeric column for
population (like a spreadsheet).
Additionally, the populations contain commas (not useful for programmatically
dealing with numbers), and some even contain a line break character at the end
- (`\n`). In chapter \@ref(wrangling), we will learn more about how to *wrangle* data
+ (`\n`). In Chapter \@ref(wrangling), we will learn more about how to *wrangle* data
such as this into a more useful format for data analysis using R.

### Using an API
@@ -1090,7 +1090,7 @@ Rather than posting a data file at a URL for you to download, many websites thes
provide an API \index{API} that must be accessed through a programming language like R. The benefit of this
is that data owners have much more control over the data they provide to users. However, unlike
web scraping, there is no consistent way to access an API across websites. Every website typically
- has its own API designed specially for its own use-case. Therefore we will just provide one example
+ has its own API designed specifically for its own use case. Therefore, we will provide just one example
of accessing data through an API in this book, with the hope that it gives you enough of a basic
idea that you can learn how to use another API if needed.
@@ -1192,7 +1192,7 @@ tidyverse_tweets <- select(tidyverse_tweets,
tidyverse_tweets
```

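For context, a rough sketch of how a data frame like `tidyverse_tweets` might have been produced with `rtweet` (authentication setup omitted; the column names follow rtweet's pre-1.0 output, and the tweet count is arbitrary):

```r
library(rtweet)
library(dplyr)

# Ask the Twitter API for recent tweets from the @tidyverse account
tidyverse_tweets <- get_timeline("tidyverse", n = 400)

# Keep a few columns of interest (names are assumptions)
tidyverse_tweets <- select(tidyverse_tweets,
                           created_at, retweet_count, favorite_count, text)
```
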
- If you look back up at the image of the Tidyverse twitter page, you will
+ If you look back up at the image of the Tidyverse Twitter page, you will
recognize the text of the most recent few tweets in the above data frame. In
other words, we have successfully created a small data set using the Twitter
API---neat! This data is also quite different from what we obtained from web scraping;
@@ -1228,6 +1228,6 @@ found in Chapter \@ref(move-to-your-own-machine).
- The [`rio` package](https://github.com/leeper/rio) provides an alternative set of tools for reading and writing data in R. It aims to be a "Swiss army knife" for data reading/writing/converting, and supports a wide variety of data types (including data formats generated by other statistical software like SPSS and SAS).
- This [video](https://www.youtube.com/embed/ephId3mYu9o) from the [Udacity course "Linux Command Line Basics"](https://www.udacity.com/course/linux-command-line-basics--ud595) provides a good explanation of absolute versus relative paths.
- If you read the subsection on obtaining data from the web via scraping and APIs, we provide two companion tutorial video links:
- - [A brief video tutorial](https://www.youtube.com/embed/YdIWI6K64zo) on using the SelectorGadget tool to obtain desired CSS selectors for extracting the price and size data for apartment listings on CraigsList
+ - [A brief video tutorial](https://www.youtube.com/embed/YdIWI6K64zo) on using the SelectorGadget tool to obtain desired CSS selectors for extracting the price and size data for apartment listings on Craigslist
- [Another brief video tutorial](https://www.youtube.com/embed/O9HKbdhqYzk) on using the SelectorGadget tool to obtain desired CSS selectors for extracting Canadian city names and 2016 census populations from Wikipedia
- The [`polite` package](https://cran.r-project.org/web/packages/polite/index.html) provides a set of tools for responsibly scraping data from websites.