Commit 41754e7

make html scraping section reproducible
1 parent 9cbe2a6 commit 41754e7

2 files changed: +34 -45 lines changed

img/reading/sg4.png

-617 KB

source/reading.Rmd

Lines changed: 34 additions & 45 deletions
@@ -1033,10 +1033,9 @@ in the additional resources section. SelectorGadget provides in its toolbar
 the following list of CSS selectors to use:
 
 ```
-td:nth-child(5),
-td:nth-child(7),
-.infobox:nth-child(122) td:nth-child(1),
-.infobox td:nth-child(3)
+td:nth-child(8) ,
+td:nth-child(4) ,
+.largestCities-cell-background+ td a
 ```
 
 Now that we have the CSS selectors that describe the properties of the elements
@@ -1057,54 +1056,36 @@ Next, we tell R what page we want to scrape by providing the webpage's URL in qu
 page <- read_html("https://en.wikipedia.org/wiki/Canada")
 ```
 
+```{r echo=FALSE, warning = FALSE}
+# the above cell doesn't actually run; this one does run
+# and loads the html data from a local, static file
+
+page <- read_html("data/canada_wiki.html")
+```
+
 The `read_html` function \index{read function!read\_html} directly downloads the source code for the page at
 the URL you specify, just like your browser would if you navigated to that site. But
 instead of displaying the website to you, the `read_html` function just returns
 the HTML source code itself, which we have
 stored in the `page` variable. Next, we send the page object to the `html_nodes`
 function, along with the CSS selectors we obtained from
 the SelectorGadget tool. Make sure to surround the selectors with quotation marks; the function, `html_nodes`, expects that
-argument is a string. The `html_nodes` function then selects *nodes* from the HTML document that
-match the CSS selectors you specified. A *node* is an HTML tag pair (e.g.,
-`<td>` and `</td>` which defines the cell of a table) combined with the content
-stored between the tags. For our CSS selector `td:nth-child(5)`, an example
-node that would be selected would be:
-
-```html
-<td style="text-align:left;background:#f0f0f0;">
-<a href="/wiki/London,_Ontario" title="London, Ontario">London</a>
-</td>
-```
-
-We store the result of the `html_nodes` function in the `population_nodes` variable.
+argument is a string. We store the result of the `html_nodes` function in the `population_nodes` variable.
 Note that below we use the `paste` function with a comma separator (`sep=","`)
 to build the list of selectors. The `paste` function converts
 elements to characters and combines the values into a list. We use this function to
 build the list of selectors to maintain code readability; this avoids
-having one very long line of code with the string
-`"td:nth-child(5),td:nth-child(7),.infobox:nth-child(122) td:nth-child(1),.infobox td:nth-child(3)"`
-as the second argument of `html_nodes`:
+having a very long line of code.
 
-```r
-selectors <- paste("td:nth-child(5)",
-"td:nth-child(7)",
-".infobox:nth-child(122) td:nth-child(1)",
-".infobox td:nth-child(3)", sep = ",")
+```{r}
+selectors <- paste("td:nth-child(8)",
+"td:nth-child(4)",
+".largestCities-cell-background+ td a", sep = ",")
 
 population_nodes <- html_nodes(page, selectors)
 head(population_nodes)
 ```
 
-```
-## {xml_nodeset (6)}
-## [1] <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/London,_On ...
-## [2] <td style="text-align:right;">543,551\n</td>
-## [3] <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/Halifax,_N ...
-## [4] <td style="text-align:right;">465,703\n</td>
-## [5] <td style="text-align:left;background:#f0f0f0;">\n<a href="/wiki/St._Cath ...
-## [6] <td style="text-align:right;">433,604\n</td>
-```
-
 > **Note:** `head` is a function that is often useful for viewing only a short
 > summary of an R object, rather than the whole thing (which may be quite a lot
 > to look at). For example, here `head` shows us only the first 6 items in the
@@ -1113,19 +1094,27 @@ head(population_nodes)
 > But not *all* R objects do this, and that's where the `head` function helps
 > summarize things for you.
 
-Next we extract the meaningful data&mdash;in other words, we get rid of the HTML code syntax and tags&mdash;from
-the nodes using the `html_text`
-function. In the case of the example
-node above, `html_text` function returns `"London"`.
 
-```r
+Each of the items in the `population_nodes` list is a *node* from the HTML
+document that matches the CSS selectors you specified. A *node* is an HTML tag
+pair (e.g., `<td>` and `</td>` which defines the cell of a table) combined with
+the content stored between the tags. For our CSS selector `td:nth-child(4)`, an
+example node that would be selected would be:
+
+```html
+<td style="text-align:left;background:#f0f0f0;">
+<a href="/wiki/London,_Ontario" title="London, Ontario">London</a>
+</td>
+```
+
+Next we extract the meaningful data&mdash;in other words, we get rid of the
+HTML code syntax and tags&mdash;from the nodes using the `html_text` function.
+In the case of the example node above, `html_text` function returns `"London"`.
+
+```{r}
 population_text <- html_text(population_nodes)
 head(population_text)
 ```
-```
-## [1] "London" "543,551\n" "Halifax"
-## [4] "465,703\n" "St. Catharines–Niagara" "433,604\n"
-```
 
 Fantastic! We seem to have extracted the data of interest from the
 raw HTML source code. But we are not quite done; the data
@@ -1306,6 +1295,6 @@ and guidance that the worksheets provide will function as intended.
 APIs, we provide two companion tutorial video links for how to use the
 SelectorGadget tool to obtain desired CSS selectors for:
 - [extracting the data for apartment listings on Craigslist](https://www.youtube.com/embed/YdIWI6K64zo), and
-- [extracting Canadian city names and 2016 populations from Wikipedia](https://www.youtube.com/embed/O9HKbdhqYzk).
+- [extracting Canadian city names and populations from Wikipedia](https://www.youtube.com/embed/O9HKbdhqYzk).
 - The [`polite` R package](https://dmi3kno.github.io/polite/) [@polite] provides
 a set of tools for responsibly scraping data from websites.
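
The idea behind this change (scrape from a static snapshot so the chunk output is reproducible) can be tried without the `data/canada_wiki.html` file. The following is only a minimal sketch, not part of the commit: the inline HTML table and the two selectors are hypothetical stand-ins for the saved Wikipedia page and the SelectorGadget output, and it assumes the `rvest` package is installed.

```r
library(rvest)

# Hypothetical stand-in for the saved page: a tiny table with a linked city
# name in column 1 and a population figure in column 2.
snippet <- paste0(
  "<table>",
  "<tr><td><a href=\"/wiki/London,_Ontario\">London</a></td><td>543,551</td></tr>",
  "<tr><td><a href=\"/wiki/Halifax,_Nova_Scotia\">Halifax</a></td><td>465,703</td></tr>",
  "</table>"
)

# read_html accepts literal HTML as well as a file path or URL, so no network
# access is needed and the result is the same on every run.
page <- read_html(snippet)

# Combine the (assumed) selectors into one comma-separated string, as the
# revised chunk in the diff does with paste(..., sep = ",").
selectors <- paste("td:nth-child(1) a",
                   "td:nth-child(2)", sep = ",")

# Select the matching nodes, then strip the tags to keep only the text.
population_nodes <- html_nodes(page, selectors)
population_text <- html_text(population_nodes)
print(population_text)
```

Because the input is fixed, rerunning the sketch always selects the same four nodes, which is the property the commit restores by reading a local file instead of the live page.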
