So far we've used `rvest` functions and learned a little HTML. You **also need to know about** [**CSS** (Cascading Style Sheets)](https://en.wikipedia.org/wiki/CSS).
You can follow the link above for a definition of CSS. As a quick-start approach, I recommend playing this [CSS game](https://flukeout.github.io/); it's a fun and fast way to learn just enough CSS. I completed the first 15 levels of the game, which was enough to build a good foundation. Depending on the complexity of the HTML and CSS, you may need to know a little more or a little less, but you need to know something about both.
**CSS** will help us subset the HTML for a single target page. I didn't mention it before, but you often need to [view the source of the HTML](https://www.lifewire.com/view-html-source-in-chrome-3466725), use the [Chrome browser to inspect elements](https://www.wikihow.com/Inspect-Element-on-Chrome) of pages, and use the Chrome browser extension [SelectorGadget](https://rvest.tidyverse.org/articles/selectorgadget.html) to better understand the HTML and CSS tagging and structure.
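To make that concrete, here is a minimal sketch of how a CSS selector plugs into `rvest`. The URL and the `.result .name a` selector are hypothetical stand-ins, not values from this case study:

```{r}
library(rvest)

# Hypothetical page and selector, for illustration only.
page <- read_html("https://example.com/results?page=1")

page %>%
  html_elements(".result .name a") %>% # CSS selector, e.g. one found with SelectorGadget
  html_attr("href")                    # the link inside each matched <a> tag
```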
Anyway...
Now all we need to do is expand the sequence of pages. (Hint: `tidyr::expand()`.)
The maximum value of the `navigation$page_no` variable is 22, so we should need roughly 22 URLs to summary results pages. Regardless of the total number of target names/pages, our task is to build a tibble with a URL for each summary results page, i.e. pages 1 through 22. If we have a link to each summary results page, then we can get a link for each of the fifty people listed on each summary results page.
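As a sketch of that idea, `tidyr::expand()` can fill in the full page sequence. Here the base URL and query-string pattern are hypothetical placeholders; the case study builds its real links with `stringr::str_extract()` in the chunk further below:

```{r}
library(tidyverse)

# Fill in every page number from 1 through the maximum page_no (22 here),
# then build one URL per page. The query-string pattern is a placeholder.
summary_pages <- navigation %>%
  expand(page_no = full_seq(c(1, page_no), 1)) %>%
  mutate(url = str_glue("https://example.com/search?page={page_no}"))
```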
Honestly, building this list of navigation URLs takes some effort in R, especially if you're new to R. So maybe there's an easier way: build the range of summary page URLs in Excel, then import the Excel (or CSV) file of URLs into R for crawling via `rvest`. But I want a reproducible, code-based approach.
See the example code below for a reproducible approach. With the next code chunk, you can use Tidyverse techniques to build a tibble of URLs to iterate over. In this case, the important reproducible step uses `stringr::str_extract()` to find and match the URL pattern.
```{r}
# Completion sketch: this chunk is truncated in the diff, so the column name
# (`nav_links`) and the URL regex below are assumptions, not the case study's
# exact code.
nav_df <- nav_df %>%
  mutate(url = str_extract(nav_links, "https?://[^\"']+"))
```
Note: Below, the final result is a tibble with a vector, `summary_url`, and an …
```{r}
nav_results_list <- tibble(
  html_results = map(nav_df$url[1:3], # url[1:3] limits the crawl to the first three summary results pages (each page = 50 results)
                     ~ {
                       # DO THIS! Sys.sleep(2) pauses 2 seconds between server requests so the
                       # target web server doesn't mistake my crawling bot for a denial-of-service
                       # (DoS) attack and block me.
                       Sys.sleep(2)
                       # Completion sketch: the rest of this chunk is truncated in the diff;
                       # reading each page with read_html() is an assumption.
                       read_html(.x)
                     })
)
```
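Assuming the completion above is close to the original, `nav_results_list$html_results` is now a list-column of parsed pages, and the per-person links can be pulled from each page with a CSS selector (the selector below is a hypothetical stand-in):

```{r}
# Hypothetical selector: pull the fifty per-person links from each stored page.
person_links <- map(nav_results_list$html_results,
                    ~ html_attr(html_elements(.x, ".result .name a"), "href"))
```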