So far we've used `rvest` functions and learned a little HTML. You **also need to know about** [**CSS** (Cascading Style Sheets)](https://en.wikipedia.org/wiki/CSS).
You can follow the link above for a definition of CSS. As a quick-start approach, I recommend playing this [CSS game](https://flukeout.github.io/); it's a fun and fast way to learn just enough CSS. I completed the first 15 levels of the game, which was enough to build a good foundation. Depending on the complexity of the HTML and CSS, you may need to know a little more or a little less, but you need to know something about both.
**CSS** will help us subset the HTML for a single target page. I didn't mention it before, but you often need to [view the source of the HTML](https://www.lifewire.com/view-html-source-in-chrome-3466725), use the [Chrome browser to inspect elements](https://www.wikihow.com/Inspect-Element-on-Chrome) of pages, and use the Chrome browser extension [SelectorGadget](https://rvest.tidyverse.org/articles/selectorgadget.html) to better understand the HTML and CSS tagging and structure.
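To make that concrete, here is a minimal sketch of how a CSS selector plugs into `rvest`. The URL and the `.result .name a` selector are hypothetical stand-ins, not values from this case study:

```{r}
library(rvest)

# Hypothetical page and selector, for illustration only.
page <- read_html("https://example.com/results?page=1")

page %>%
  html_elements(".result .name a") %>% # CSS selector, e.g. one found with SelectorGadget
  html_attr("href")                    # the link inside each matched <a> tag
```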
Anyway...
Now all we need to do is expand the sequence of pages. (Hint: `tidyr::expand()`.)
The maximum value of the `navigation$page_no` variable is 22, so we should need roughly 22 URLs to summary results pages. Regardless of the total number of target names/pages, our task is to build a tibble with a URL for each summary results page, i.e. pages 1 through 22. If we have a link to each summary results page, then we can get a link for each of the fifty people listed on each summary results page.
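As a sketch of that idea, `tidyr::expand()` can fill in the full page sequence. Here the base URL and query-string pattern are hypothetical placeholders; the case study builds its real links with `stringr::str_extract()` in the chunk further below:

```{r}
library(tidyverse)

# Fill in every page number from 1 through the maximum page_no (22 here),
# then build one URL per page. The query-string pattern is a placeholder.
summary_pages <- navigation %>%
  expand(page_no = full_seq(c(1, page_no), 1)) %>%
  mutate(url = str_glue("https://example.com/search?page={page_no}"))
```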
Honestly, building this list of navigation URLs takes some effort in R, especially if you're new to R. So maybe there's an easier way: build the range of summary page URLs in Excel, then import the Excel (or CSV) file of URLs into R for crawling via `rvest`. But I want a reproducible, code-based approach.
See the example code below for a reproducible approach. With the next code chunk, you can use Tidyverse techniques to build a tibble of URLs to iterate over. In this case, the important reproducible step uses `stringr::str_extract()` to find and match the URL pattern.
```{r}
# Completion sketch: this chunk is truncated in the diff, so the column name
# (`nav_links`) and the URL regex below are assumptions, not the case study's
# exact code.
nav_df <- nav_df %>%
  mutate(url = str_extract(nav_links, "https?://[^\"']+"))
```
Note: Below, the final result is a tibble with a vector, `summary_url`, and an …
```{r}
nav_results_list <- tibble(
  html_results = map(nav_df$url[1:3], # url[1:3] limits the crawl to the first three summary results pages (each page = 50 results)
                     ~ {
                       # DO THIS! Sys.sleep(2) pauses 2 seconds between server requests so the
                       # target web server doesn't mistake my crawling bot for a denial-of-service
                       # (DoS) attack and block me.
                       Sys.sleep(2)
                       # Completion sketch: the rest of this chunk is truncated in the diff;
                       # reading each page with read_html() is an assumption.
                       read_html(.x)
                     })
)
```
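Assuming the completion above is close to the original, `nav_results_list$html_results` is now a list-column of parsed pages, and the per-person links can be pulled from each page with a CSS selector (the selector below is a hypothetical stand-in):

```{r}
# Hypothetical selector: pull the fifty per-person links from each stored page.
person_links <- map(nav_results_list$html_results,
                    ~ html_attr(html_elements(.x, ".result .name a"), "href"))
```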