
Commit eead8ca

f7/f8
1 parent 11a3f67 commit eead8ca

File tree

1 file changed (+4 −5 lines)


01_scrape_case-study_exercise.Rmd

Lines changed: 4 additions & 5 deletions
@@ -206,7 +206,7 @@ After some investigation of the the HTML source, I see the only difference bewte
 
 So far we've used `rvest` functions and learned a little HTML. You **also need to know about** [**CSS** (Cascading Style Sheets)](https://en.wikipedia.org/wiki/CSS).
 
-You can follow the link above for a definition of CSS. As a quickstart appraoch, I recommend playing this [CSS game](https://flukeout.github.io/) as a fun and quick way to learn just enough CSS. I completed the first 15 levels of the game; that was enough to get a good foundation with CSS. Depending on the complexity of the HTML and CSS, you may need to know a little more or a little less. But you need to know something about HTML and CSS.
+You can follow the link above for a definition of CSS. As a quickstart approach, I recommend playing this [CSS game](https://flukeout.github.io/) as a fun and quick way to learn just enough CSS. I completed the first 15 levels of the game; that was enough to get a good foundation with CSS. Depending on the complexity of the HTML and CSS, you may need to know a little more or a little less. But you need to know something about HTML and CSS.
 
 **CSS** will help us subset the HTML for a single target page. I didn't mention it before, but you often need to [view the source of the HTML](https://www.lifewire.com/view-html-source-in-chrome-3466725); use the [Chrome browser to inspect elements](https://www.wikihow.com/Inspect-Element-on-Chrome) of pages in a web browser; and use the Chrome browser extension, [SelectorGadget](https://rvest.tidyverse.org/articles/selectorgadget.html), to better understand the HTML and CSS tagging and structure.
 
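To make the CSS discussion above concrete, here is a minimal editorial sketch (not part of the commit) of how a CSS selector found with SelectorGadget or Chrome's Inspect Element feeds into `rvest`. The URL and the `.result-name` selector are hypothetical placeholders.

```r
library(rvest)

# Hypothetical target page and CSS selector -- substitute the selector
# that SelectorGadget or "Inspect Element" reveals for your own page.
page <- read_html("https://example.com/results?page=1")

page %>%
  html_elements(".result-name a") %>%  # subset the HTML with the CSS selector
  html_attr("href")                    # pull the link URL from each element
```

The same `html_elements()` call accepts any CSS selector, so the fifteen or so levels of the CSS game map directly onto real scraping work.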
@@ -290,11 +290,11 @@ Anyway...
 
 Now all we need to do is expand the sequence of pages. (Hint: `tidyr::expand()`)
 
-The maximum number of summary pages in the `navigation$page_no` variable is 22. This should mean the maximum number of URLs to summary results pages will be roughly 22. Regardless of the total number of target names/pages, our task is to build a tibble with a URL for each summary results page, i.e. pages 1 thorough 22. IF we have a link to each sumamry results page, then can we get a link for each of the fifty people listed on each of the summary result pages.
+The maximum number of summary pages in the `navigation$page_no` variable is 22. This should mean the maximum number of URLs to summary results pages will be roughly 22. Regardless of the total number of target names/pages, our task is to build a tibble with a URL for each summary results page, i.e. pages 1 through 22. If we have a link to each summary results page, then we can get a link for each of the fifty people listed on each of the summary results pages.
 
 Honestly, building this list of navigation URLs takes some effort in R, especially if you're new to R. So, maybe, there's an easier way. It might be easier to build the range of summary page URLs in Excel, then import the Excel file (or CSV file) of URLs into R for crawling via `rvest`. But I want a reproducible, code-based approach.
 
-See example code, below, for a reproducible example. With the next code chunk, you can use Tidyverse techniques and build a tibble of urls to iterate over. In this case, the important reproducible step use `stringr::str_extract()` to find and match the URL pattern.
+See the example code below for a reproducible example. With the next code chunk, you can use Tidyverse techniques to build a tibble of URLs to iterate over. In this case, the important reproducible step uses `stringr::str_extract()` to find and match the URL pattern.
 
 ```{r}
 nav_df <- nav_df %>%
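As an editorial aside (not part of the commit), the URL-building idea in the prose above can be sketched in a few lines of Tidyverse code. The base URL and query pattern are hypothetical; the point is expanding `1:22` into a tibble of summary-page URLs and using `stringr::str_extract()` to confirm each one matches the expected pattern.

```r
library(tidyverse)

# Hypothetical summary-results URL scheme -- one page per page_no, 1 through 22.
nav_df_sketch <- tibble(page_no = 1:22) %>%
  mutate(url = str_glue("https://example.com/results?page={page_no}")) %>%
  # str_extract() finds and matches the URL's page pattern, a quick sanity check
  mutate(page_match = str_extract(url, "page=\\d+"))
```

This keeps the whole navigation list inside R, avoiding the Excel round-trip mentioned above.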
@@ -361,9 +361,8 @@ Note: Below, the final result is a tibble with a vector, `summary_url`, and an
 
 ```{r}
 nav_results_list <- tibble(
-  html_results = map(nav_df$url[1:3],
+  html_results = map(nav_df$url[1:3], # url[1:3] limits the crawl to the first three summary results pages (each page = 50 results)
                      ~ {
-                       #url[1:3] - limiting to the first three summary results pages (each page = 50 results)
                        Sys.sleep(2)
                        # DO THIS! Sys.sleep(2) pauses 2 seconds between server requests, so the target web server doesn't mistake my crawling bot for a DoS attack and block me.
                        .x %>%
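The chunk in the hunk above is truncated at `.x %>%`. As a hedged editorial sketch of how the pattern typically completes (the `.person a` selector is hypothetical and not from the commit), the mapped function reads each summary page and harvests the per-person links:

```r
library(tidyverse)
library(rvest)

# Assumes nav_df$url holds the summary-results page URLs built earlier.
nav_results_list <- tibble(
  html_results = map(nav_df$url[1:3],  # first three summary pages only while testing
                     ~ {
                       Sys.sleep(2)    # polite 2-second pause between requests
                       read_html(.x) %>%
                         html_elements(".person a") %>%  # hypothetical CSS selector
                         html_attr("href")               # link for each listed person
                     })
)
```

Keeping the slow-down inside the mapped function means every request, not just the first, waits before hitting the server.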
