|
41 | 41 | - Tidyverse iteration with `purrr::map()` |
42 | 42 | - Point out useful documentation & resources |
43 | 43 |
|
44 | | -.f7.i.moon-gray[This is not an research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. This is a demonstration of leveraging the Tidyverse.] |
| 44 | +.f7.i.moon-gray[This is a demonstration of leveraging the Tidyverse. This is not a research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex.] |
45 | 45 |
|
46 | | -### Caveats |
47 | | -- You will be as successful as the web author(s) were consistent |
48 | | -- Read and follow the _Terms of Use_ for any target web host |
49 | | -- Read and honor the host's robots.txt | https://www.robotstxt.org |
50 | | - - Always **pause** to avoid the perception of a _Denial of Service_ (DOS) attack |
51 | | - |
52 | 46 |
|
53 | 47 |
|
54 | 48 | <div class="footercc"> |
|
59 | 53 |
|
60 | 54 |
|
61 | 55 |
|
| 56 | +-- |
| 57 | + |
| 58 | +### Caveats |
| 59 | +- You will be as successful as the web author(s) were consistent |
| 60 | +- Read and follow the _Terms of Use_ for any target web host |
| 61 | +- Read and honor the host's robots.txt | https://www.robotstxt.org |
| 62 | + - Always **pause** to avoid the perception of a _Denial of Service_ (DoS) attack (see the sketch below) |
| 63 | + |
| 64 | + |
| 65 | + |
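A minimal etiquette sketch in R, assuming the rOpenSci `robotstxt` package (not part of the original slides) and a hypothetical target URL:

```r
library(robotstxt)  # rOpenSci helper for fetching and parsing robots.txt

# Hypothetical target URL -- substitute your real scrape target
target_url <- "https://books.toscrape.com/catalogue/page-1.html"

# Returns TRUE when the host's robots.txt permits fetching this path
paths_allowed(target_url)

# Pause between requests so the crawl never looks like a DoS attack
Sys.sleep(2)
```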
62 | 66 | --- |
63 | 67 | class:middle |
64 | 68 |
|
65 | 69 | .left-column[ |
66 | | -### Scraping |
| 70 | +### Scraping = |
67 | 71 |
|
68 | 72 | .f6[Gather or ingest web page data for analysis] |
69 | 73 |
|
|
74 | 78 |
|
75 | 79 | ] |
76 | 80 |
|
| 81 | + |
| 82 | + |
| 83 | +<div class="footercc"> |
| 84 | +<i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a> |
| 85 | +<span class = "opacity30"><a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-01 </span> |
| 86 | +</div> |
| 87 | + |
| 88 | + |
| 89 | + |
| 90 | + |
| 91 | +-- |
| 92 | + |
77 | 93 | .right-column[ |
78 | 94 |
|
79 | | -**<span text-align: left;>Crawling<span> <span text-align: center;>+</span> <span text-align:right;>Parsing</span>** |
| 95 | +**Crawling + Parsing** |
80 | 96 |
|
81 | 97 | <div class = "container" id = "imgspcl" style="width: 100%; max-width: 100%;"> |
82 | 98 | <img src = "images/crawling_med.jpg" width = "50%"> &nbsp; + &nbsp; <img src = "images/strain_comb.jpg" width="50%"> |
83 | 99 | </div> |
84 | 100 |
|
85 | | - |
86 | | - |
87 | 101 | .pull-left[ |
88 | 102 | .f7[Systematically iterating through a website, gathering data from more than one page (URL)] |
89 | 103 |
|
90 | 104 | `purrr::map()` |
91 | 105 | ] |
92 | 106 |
|
| 107 | + |
93 | 108 | .pull-right[ |
94 | 109 | .f7[Separating the syntactic elements of the HTML and keeping only the data you need (see the sketch below)] |
95 | 110 |
|
96 | 111 | `rvest::html_nodes()` |
97 | 112 | `rvest::html_text()` |
98 | 113 | `rvest::html_attr()` |
99 | 114 | ] |
100 | | - |
101 | | - |
102 | 115 | ] |
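As a sketch of the parsing half, here is one way the three rvest functions above might be combined on a single page; the URL and CSS selectors are hypothetical placeholders:

```r
library(rvest)  # also attaches the %>% pipe

# Hypothetical detail page (a "leaf")
page <- read_html("https://books.toscrape.com/catalogue/page-1.html")

# Keep only the elements matched by a CSS selector ...
titles <- page %>%
  html_nodes("h3 a") %>%   # placeholder selector
  html_attr("title")       # ... then pull an attribute value

prices <- page %>%
  html_nodes(".price_color") %>%  # placeholder selector
  html_text()                     # ... or the visible text
```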
103 | 116 |
|
104 | | - |
105 | | - |
106 | | - |
107 | | -<div class="footercc"> |
108 | | -<i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a> |
109 | | -<span class = "opacity30"><a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-01 </span> |
110 | | -</div> |
111 | | - |
112 | | - |
113 | | - |
114 | | - |
115 | | - |
116 | 117 | --- |
117 | 118 | ## HTML |
118 | 119 |
|
|
181 | 182 |
|
182 | 183 | 1. Development |
183 | 184 |
|
184 | | - - Import raw HTML of a single target page (page detail, or “leaf”) |
185 | | - - Parse the HTML of the test page to gather the data you want |
186 | | - - Check robots.txt, terms of use |
187 | | - - In a web browser, manually browse and understand the site navigation of the scrape target (site navigation, or “branches”) |
188 | | - - Parse the site navigation and develop an interation plan |
189 | | - - Iterate: write code that implements iteration, i.e. automated page crawling |
| 185 | + - Import raw HTML of a single target page (page detail: a leaf or node) |
| 186 | + - Parse the HTML of the test page and gather specific data |
| 187 | + - Check _robots.txt_ and _Terms of Use_ (TOU) |
| 188 | + - In a web browser, manually browse and understand the target site's navigation (site navigation: branches) |
| 189 | + - _Parse_ the site navigation and develop an _iteration_ plan |
| 190 | + - _Iterate_: orchestrate/automate page crawling |
190 | 191 | - Perform a dry run with a limited subset of the target web site |
191 | | - - Construct time pauses (to avoid DNS attacks) |
192 | | - - Production |
| 192 | + - Construct pauses: avoid the posture of a DoS (Denial of Service) attack |
193 | 193 |
|
194 | | -1. Iterate |
| 194 | +1. Production |
195 | 195 |
|
196 | | - - Crawl the site navigation (branches) |
197 | | - - Parse HTML for each detail page (leaves) |
| 196 | + - Iterate/Crawl the site (navigation: branches) |
| 197 | + - Parse HTML for each target page (pages: leaves or nodes); see the sketch below |
198 | 198 |
|
199 | 199 |
|
200 | 200 |
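Putting the two phases together, a hedged end-to-end sketch of the workflow above; the site, the selectors, and the `scrape_page()` helper are illustrative assumptions, and the `1:3` range keeps the dry run to a limited subset:

```r
library(tidyverse)  # purrr, stringr, tibble, %>%
library(rvest)

# Illustrative helper: parse one detail page (leaf) into a tibble
scrape_page <- function(url) {
  Sys.sleep(2)  # pause: avoid the posture of a DoS attack
  page <- read_html(url)
  tibble(
    title = page %>% html_nodes("h3 a") %>% html_attr("title"),
    price = page %>% html_nodes(".price_color") %>% html_text()
  )
}

# Crawl the branches: build page URLs for a small dry run, then iterate
page_urls <- str_c("https://books.toscrape.com/catalogue/page-", 1:3, ".html")

books_df <- map_dfr(page_urls, scrape_page)  # purrr::map_dfr row-binds results
```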
|
|
208 | 208 | --- |
209 | 209 | background-image: url(images/selector_graph.png) |
210 | 210 |
|
| 211 | +<!-- an image of branches and nodes --> |
| 212 | + |
| 214 | + |
211 | 215 |
|
212 | 216 |
|
213 | 217 |
|
|