@@ -1033,10 +1033,9 @@ in the additional resources section. SelectorGadget provides in its toolbar
  the following list of CSS selectors to use:

  ```
- td:nth-child(5),
- td:nth-child(7),
- .infobox:nth-child(122) td:nth-child(1),
- .infobox td:nth-child(3)
+ td:nth-child(8) ,
+ td:nth-child(4) ,
+ .largestCities-cell-background+ td a
  ```

  Now that we have the CSS selectors that describe the properties of the elements
@@ -1057,54 +1056,36 @@ Next, we tell R what page we want to scrape by providing the webpage's URL in qu
  page <- read_html("https://en.wikipedia.org/wiki/Canada")
  ```

+ ``` {r echo=FALSE, warning = FALSE}
+ # the above cell doesn't actually run; this one does run
+ # and loads the html data from a local, static file
+
+ page <- read_html("data/canada_wiki.html")
+ ```
+
  The `read_html` function \index{read function!read\_html} directly downloads the source code for the page at
  the URL you specify, just like your browser would if you navigated to that site. But
  instead of displaying the website to you, the `read_html` function just returns
  the HTML source code itself, which we have
  stored in the `page` variable. Next, we send the page object to the `html_nodes`
  function, along with the CSS selectors we obtained from
  the SelectorGadget tool. Make sure to surround the selectors with quotation marks; the function, `html_nodes`, expects that
- argument is a string. The `html_nodes` function then selects *nodes* from the HTML document that
- match the CSS selectors you specified. A *node* is an HTML tag pair (e.g.,
- `<td>` and `</td>` which defines the cell of a table) combined with the content
- stored between the tags. For our CSS selector `td:nth-child(5)`, an example
- node that would be selected would be:
-
- ``` html
- <td style="text-align:left;background:#f0f0f0;">
- <a href="/wiki/London,_Ontario" title="London, Ontario">London</a>
- </td>
- ```
-
- We store the result of the `html_nodes` function in the `population_nodes` variable.
+ argument is a string. We store the result of the `html_nodes` function in the `population_nodes` variable.
  Note that below we use the `paste` function with a comma separator (`sep=","`)
  to build the list of selectors. The `paste` function converts
  elements to characters and combines the values into a single string. We use this function to
  build the list of selectors to maintain code readability; this avoids
- having one very long line of code with the string
- `"td:nth-child(5),td:nth-child(7),.infobox:nth-child(122) td:nth-child(1),.infobox td:nth-child(3)"`
- as the second argument of `html_nodes`:
+ having a very long line of code.

- ``` r
- selectors <- paste("td:nth-child(5)",
-                    "td:nth-child(7)",
-                    ".infobox:nth-child(122) td:nth-child(1)",
-                    ".infobox td:nth-child(3)", sep = ",")
+ ``` {r}
+ selectors <- paste("td:nth-child(8)",
+                    "td:nth-child(4)",
+                    ".largestCities-cell-background+ td a", sep = ",")

  population_nodes <- html_nodes(page, selectors)
  head(population_nodes)
  ```

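To make the effect of `paste` with `sep = ","` concrete, here is a minimal sketch in base R. The variable name `example_selectors` is ours, chosen for illustration, and does not appear in the chapter's code; the selector strings are the ones listed earlier.

```r
# Illustrative only: `example_selectors` is a hypothetical name.
example_selectors <- paste("td:nth-child(8)",
                           "td:nth-child(4)",
                           ".largestCities-cell-background+ td a",
                           sep = ",")

# paste() returns one character string (a length-1 character vector) ...
length(example_selectors)
is.character(example_selectors)

# ... identical to writing the whole comma-separated selector on one line.
identical(
  example_selectors,
  "td:nth-child(8),td:nth-child(4),.largestCities-cell-background+ td a"
)
```

Because the result is a single string, it can be passed directly as the CSS selector argument of `html_nodes`.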
- ```
- ## {xml_nodeset (6)}
- ## [1] <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/London,_On ...
- ## [2] <td style="text-align:right;">543,551\n</td>
- ## [3] <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/Halifax,_N ...
- ## [4] <td style="text-align:right;">465,703\n</td>
- ## [5] <td style="text-align:left;background:#f0f0f0;">\n<a href="/wiki/St._Cath ...
- ## [6] <td style="text-align:right;">433,604\n</td>
- ```
-
  > **Note:** `head` is a function that is often useful for viewing only a short
  > summary of an R object, rather than the whole thing (which may be quite a lot
  > to look at). For example, here `head` shows us only the first 6 items in the
@@ -1113,19 +1094,27 @@ head(population_nodes)
  > But not *all* R objects do this, and that's where the `head` function helps
  > summarize things for you.

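As a small illustration of `head` on an ordinary R object, consider the sketch below. The vector `x` is made up for this example and has nothing to do with the scraped data. By default `head` returns the first six elements, and its `n` argument changes how many are returned.

```r
# Hypothetical example data: the integers 1 through 100.
x <- 1:100

first_six <- head(x)           # default is n = 6
first_three <- head(x, n = 3)  # ask for only the first three

print(first_six)
print(first_three)
```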
- Next we extract the meaningful data&mdash;in other words, we get rid of the HTML code syntax and tags&mdash;from
- the nodes using the `html_text`
- function. In the case of the example
- node above, `html_text` function returns `"London"`.

- ``` r
+ Each of the items in the `population_nodes` list is a *node* from the HTML
+ document that matches the CSS selectors you specified. A *node* is an HTML tag
+ pair (e.g., `<td>` and `</td>` which defines the cell of a table) combined with
+ the content stored between the tags. For our CSS selector `td:nth-child(4)`, an
+ example node that would be selected would be:
+
+ ``` html
+ <td style="text-align:left;background:#f0f0f0;">
+ <a href="/wiki/London,_Ontario" title="London, Ontario">London</a>
+ </td>
+ ```
+
+ Next we extract the meaningful data&mdash;in other words, we get rid of the
+ HTML code syntax and tags&mdash;from the nodes using the `html_text` function.
+ In the case of the example node above, `html_text` function returns `"London"`.
+
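The following self-contained sketch shows the same node-then-text idea on a tiny, hand-written HTML fragment rather than the Wikipedia page. The fragment and the names `toy_html`, `toy_page`, `toy_nodes`, and `toy_text` are assumptions made for illustration; only `read_html`, `html_nodes`, and `html_text` come from the text, and the sketch assumes the `rvest` package is installed.

```r
library(rvest)

# A hand-written table row, loosely modelled on the example node
# (hypothetical data, not taken from the real page).
toy_html <- '<table><tr>
  <td><a href="/wiki/London,_Ontario" title="London, Ontario">London</a></td>
  <td>543,551</td>
</tr></table>'

toy_page <- read_html(toy_html)          # parse the HTML string
toy_nodes <- html_nodes(toy_page, "td")  # select every <td> node
toy_text <- html_text(toy_nodes)         # drop the tags, keep the text

toy_text
```

Each selected node keeps its tags and attributes until `html_text` is applied, which is why the extraction is done in two steps.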
+ ``` {r}
  population_text <- html_text(population_nodes)
  head(population_text)
  ```
- ```
- ## [1] "London" "543,551\n" "Halifax"
- ## [4] "465,703\n" "St. Catharines–Niagara" "433,604\n"
- ```

  Fantastic! We seem to have extracted the data of interest from the
  raw HTML source code. But we are not quite done; the data
@@ -1306,6 +1295,6 @@ and guidance that the worksheets provide will function as intended.
  APIs, we provide two companion tutorial video links for how to use the
  SelectorGadget tool to obtain desired CSS selectors for:
  - [extracting the data for apartment listings on Craigslist](https://www.youtube.com/embed/YdIWI6K64zo), and
- - [extracting Canadian city names and 2016 populations from Wikipedia](https://www.youtube.com/embed/O9HKbdhqYzk).
+ - [extracting Canadian city names and populations from Wikipedia](https://www.youtube.com/embed/O9HKbdhqYzk).
  - The [`polite` R package](https://dmi3kno.github.io/polite/) [@polite] provides
  a set of tools for responsibly scraping data from websites.