Commit 11a3f67

update and refine
- map() and - html_elements()
1 parent ce54807 commit 11a3f67

2 files changed (+33, -28 lines changed)

01_scrape_case-study_exercise.Rmd

Lines changed: 33 additions & 0 deletions
@@ -458,10 +458,43 @@ children_name <- emanuel %>%
  html_text()
children_name
```

#### Iterate

There now. I just scraped and parsed data for one target, one person in my list of target URLs. Now use purrr to iterate over each target URL in the list. **Do not forget to pause, `Sys.sleep(2)`,** between each iteration of the `read_html()` function.

## Refined Code

O.K., so I didn't really explain iteration with the {purrr} package and the `map()` family of functions. If you want to learn more about that, check out this [workshop on iteration](https://github.com/libjohn/workshop_rfun_iterate). In the meantime, note that `html_nodes()` was renamed in {rvest} 1.0.0: it is now `html_elements()`. If you want to see how this scraping operation can be done with less code, starting from the last manipulation of the `nav_df` tibble, here's some updated and refined code.

```{r}
# Pause two seconds before each request, then fetch and parse the page
get_name_html <- function(url) {
  Sys.sleep(2)
  url |>
    read_html()
}

# Scrape the first three navigation pages; links and names start as list-columns
name_urls_df <- nav_df |>
  slice(1:3) |>
  mutate(html_results = map(url, get_name_html)) |>
  mutate(name_url = map(html_results, ~ .x |> html_elements("#setwidth li a") |> html_attr("href"))) |>
  mutate(name = map(html_results, ~ .x |> html_elements("#setwidth li a") |> html_text())) |>
  unnest(cols = c(name_url, name)) |>
  # Swap the leading ".." of each relative link for "ecartico", then build the full URL
  mutate(name_url = str_replace(name_url, "\\.\\.", "ecartico")) |>
  mutate(name_full_url = str_glue("http://www.vondel.humanities.uva.nl/{name_url}"))
name_urls_df
```
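
The list-column pattern in the chunk above can be tried offline before touching the live site. The sketch below is only an illustration, not part of the workshop code: the `pages` tibble, the HTML fragments, the names in them, and the `parse_page()` helper are all made up as stand-ins for `nav_df` and `get_name_html()`. It uses `minimal_html()` from {rvest} in place of `read_html()` on a URL, so no `Sys.sleep()` pause is needed. The last expression compares the superseded `html_nodes()` with `html_elements()` on the same document.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(rvest)

# Hypothetical stand-in for nav_df: two tiny pages stored as HTML strings
pages <- tibble(
  page = c("A", "B"),
  html = c(
    '<div id="setwidth"><ul><li><a href="../persons/1">Anna</a></li><li><a href="../persons/2">Bram</a></li></ul></div>',
    '<div id="setwidth"><ul><li><a href="../persons/3">Carel</a></li></ul></div>'
  )
)

# Hypothetical helper: parse an HTML string instead of fetching a URL
parse_page <- function(html) {
  minimal_html(html)
}

names_df <- pages |>
  # One parsed document per row, stored in a list-column
  mutate(html_results = map(html, parse_page)) |>
  # Each map() call returns a character vector per row (another list-column)
  mutate(name_url = map(html_results, ~ .x |> html_elements("#setwidth li a") |> html_attr("href"))) |>
  mutate(name = map(html_results, ~ .x |> html_elements("#setwidth li a") |> html_text())) |>
  select(page, name_url, name) |>
  # Expand both list-columns in parallel: one row per link
  unnest(cols = c(name_url, name))

names_df

# html_nodes() still runs in {rvest} >= 1.0.0 but is superseded by html_elements()
doc <- parse_page(pages$html[1])
same_result <- identical(
  doc |> html_nodes("li a") |> html_text(),
  doc |> html_elements("li a") |> html_text()
)
same_result
```

Unnesting `name_url` and `name` in a single `unnest()` call keeps each link paired with its text, which relies on both list-columns having the same length in every row.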

```{r}
# Fetch each matching person's page and pull out the linked children names
name_urls_df |>
  filter(str_detect(name, regex("Boudewijn ", ignore_case = TRUE))) |>
  mutate(children_names_html = map(name_full_url, get_name_html)) |>
  mutate(children_names = map(children_names_html, ~ .x |> html_elements("ul~ h2+ ul li > a") |> html_text())) |>
  unnest(children_names)
```
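
The `filter()` step above decides which person pages get requested, so it may help to check the pattern on plain strings first. This is a small sketch with invented names standing in for the `name` column of `name_urls_df`; the `people` and `matches` objects are hypothetical. Note that the pattern ends in a space, so a name where "Boudewijn" runs straight into more letters is not matched.

```r
library(dplyr)
library(stringr)
library(tibble)

# Made-up names standing in for the `name` column of name_urls_df
people <- tibble(
  name = c("Boudewijn Example", "BOUDEWIJN Other", "Jan Boudewijnsz", "Pieter Sample")
)

# Case-insensitive match on "Boudewijn" followed by a space
matches <- people |>
  filter(str_detect(name, regex("Boudewijn ", ignore_case = TRUE)))

matches
```

Filtering before the `map()` call keeps the number of `read_html()` requests, and therefore the number of two-second pauses, as small as possible.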

## Resources

- https://rvest.tidyverse.org

delme_DESCRIPTION_old

Lines changed: 0 additions & 28 deletions
This file was deleted.
