Commit 1ea234f ("tweak")
1 parent 51bbb5c

File tree

4 files changed: +73 additions, -306 deletions


slides/_child-footer.html

Lines changed: 0 additions & 245 deletions
This file was deleted.

slides/index.Rmd

Lines changed: 29 additions & 25 deletions
@@ -44,22 +44,26 @@ I would like to take a moment to honor the land in Durham, NC. Duke University
 - Tidyverse iteration with `purrr::map`
 - Point out useful documentation & resources
 
-.f7.i.moon-gray[This is not an research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. This is a demonstration of leveraging the Tidyverse.]
+.f7.i.moon-gray[This is a demonstration of leveraging the Tidyverse. This is not a research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex.]
+
+```{r child="_child-footer.Rmd", include=FALSE}
+```
+
+--
 
 ### Caveats
 - You will be as successful as the web author(s) were consistent
 - Read and follow the _Terms of Use_ for any target web host
 - Read and honor the host's robots.txt | https://www.robotstxt.org
 - Always **pause** to avoid the perception of a _Denial of Service_ (DoS) attack
 
-```{r child="_child-footer.Rmd", include=FALSE}
-```
+
 
 ---
 class:middle
 
 .left-column[
-### Scraping
+### Scraping =
 
 .f6[Gather or ingest web page data for analysis]
 
@@ -70,38 +74,35 @@ class:middle
 
 ]
 
+```{r child="_child-footer.Rmd", include=FALSE}
+```
+
+--
+
 .right-column[
 
-**<span text-align: left;>Crawling<span> <span text-align: center;>+</span> <span text-align:right;>Parsing</span>**
+**Crawling + Parsing**
 
 <div class = "container" id = "imgspcl" style="width: 100%; max-width: 100%;">
 <img src = "images/crawling_med.jpg" width = "50%"> &nbsp; + &nbsp; <img src = "images/strain_comb.jpg" width="50%">
 </div>
 
-
-
 .pull-left[
 .f7[Systematically iterating through a website, gathering data from more than one page (URL)]
 
 `purrr::map()`
 ]
 
+
 .pull-right[
 .f7[Separating the syntactic elements of the HTML. Keeping only the data you need]
 
 `rvest::html_nodes()`
 `rvest::html_text()`
 `rvest::html_attr()`
 ]
-
-
 ]
 
-
-```{r child="_child-footer.Rmd", include=FALSE}
-```
-
-
 ---
 ## HTML
 
@@ -156,26 +157,29 @@ The basic workflow of web scraping is
 
 1. Development
 
-    - Import raw HTML of a single target page (page detail, or “leaf)
-    - Parse the HTML of the test page to gather the data you want
-    - Check robots.txt, terms of use
-    - In a web browser, manually browse and understand the site navigation of the scrape target (site navigation, or “branches)
-    - Parse the site navigation and develop an interation plan
-    - Iterate: write code that implements iteration, i.e. automated page crawling
+    - Import raw HTML of a single target page (page detail: a leaf or node)
+    - Parse the HTML of the test page and gather specific data
+    - Check _robots.txt_ and _Terms of Use_ (TOU)
+    - In a web browser, manually browse and understand the target site's navigation (site navigation: branches)
+    - _Parse_ the site navigation and develop an _iteration_ plan
+    - _Iterate_: orchestrate/automate page crawling
     - Perform a dry run with a limited subset of the target web site
-    - Construct time pauses (to avoid DNS attacks)
-    - Production
+    - Construct pauses: avoid the posture of a DoS attack
 
-1. Iterate
+1. Production
 
-    - Crawl the site navigation (branches)
-    - Parse HTML for each detail page (leaves)
+    - Iterate/Crawl the site (navigation: branches)
+    - Parse HTML for each target page (pages: leaves or nodes)
 
 ```{r child="_child-footer.Rmd", include=FALSE}
 ```
 ---
 background-image: url(images/selector_graph.png)
 
+<!-- an image of branches and nodes -->
+
+?? an image of branches and nodes
+
 
 ```{r child="_child-footer.Rmd", include=FALSE}
 ```
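
The parse-then-iterate split described in these slides can be sketched with rvest. This is a minimal, hedged sketch of the "parse" step only; the URL (`https://example.com/books`) and the `.title` CSS selector are hypothetical placeholders, not taken from the workshop:

```r
# Sketch of the "parse" step: read one target page (a leaf/node),
# keep only the elements you need, and extract text and attributes.
# The URL and selectors below are hypothetical.
library(rvest)

page <- read_html("https://example.com/books")

titles <- page %>%
  html_nodes(".title") %>%   # select nodes by CSS selector
  html_text()                # keep only the visible text

links <- page %>%
  html_nodes(".title a") %>%
  html_attr("href")          # pull an attribute instead of the text
```

Once this works for a single test page, the same parsing function can be applied across many pages with `purrr::map()`.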

slides/index.html

Lines changed: 40 additions & 36 deletions
@@ -41,14 +41,8 @@
 - Tidyverse iteration with `purrr::map`
 - Point out useful documentation &amp; resources
 
-.f7.i.moon-gray[This is not an research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. This is a demonstration of leveraging the Tidyverse.]
+.f7.i.moon-gray[This is a demonstration of leveraging the Tidyverse. This is not a research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex.]
 
-### Caveats
-- You will be as successful as the web author(s) were consistent
-- Read and follow the _Terms of Use_ for any target web host
-- Read and honor the host's robots.txt | https://www.robotstxt.org
-- Always **pause** to avoid the perception of a _Denial of Service_ (DOS) attack
-
 
 
 &lt;div class="footercc"&gt;
@@ -59,11 +53,21 @@
 
 
 
+--
+
+### Caveats
+- You will be as successful as the web author(s) were consistent
+- Read and follow the _Terms of Use_ for any target web host
+- Read and honor the host's robots.txt | https://www.robotstxt.org
+- Always **pause** to avoid the perception of a _Denial of Service_ (DoS) attack
+
+
+
 ---
 class:middle
 
 .left-column[
-### Scraping
+### Scraping =
 
 .f6[Gather or ingest web page data for analysis]
 
@@ -74,45 +78,42 @@
 
 ]
 
+
+
+&lt;div class="footercc"&gt;
+&lt;i class="fab fa-creative-commons"&gt;&lt;/i&gt;&amp;nbsp; &lt;i class="fab fa-creative-commons-by"&gt;&lt;/i&gt;&lt;i class="fab fa-creative-commons-nc"&gt;&lt;/i&gt; &lt;a href = "https://JohnLittle.info"&gt;&lt;span class = "opacity30"&gt;https://&lt;/span&gt;JohnLittle&lt;span class = "opacity30"&gt;.info&lt;/span&gt;&lt;/a&gt;
+&lt;span class = "opacity30"&gt;&lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-01 &lt;/span&gt;
+&lt;/div&gt;
+
+
+
+
+--
+
 .right-column[
 
-**&lt;span text-align: left;&gt;Crawling&lt;span&gt; &lt;span text-align: center;&gt;+&lt;/span&gt; &lt;span text-align:right;&gt;Parsing&lt;/span&gt;**
+**Crawling + Parsing**
 
 &lt;div class = "container" id = "imgspcl" style="width: 100%; max-width: 100%;"&gt;
 &lt;img src = "images/crawling_med.jpg" width = "50%"&gt; &amp;nbsp; + &amp;nbsp; &lt;img src = "images/strain_comb.jpg" width="50%"&gt;
 &lt;/div&gt;
 
-
-
 .pull-left[
 .f7[Systematically iterating through a website, gathering data from more than one page (URL)]
 
 `purrr::map()`
 ]
 
+
 .pull-right[
 .f7[Separating the syntactic elements of the HTML. Keeping only the data you need]
 
 `rvest::html_nodes()`
 `rvest::html_text()`
 `rvest::html_attr()`
 ]
-
-
 ]
 
-
-
-
-&lt;div class="footercc"&gt;
-&lt;i class="fab fa-creative-commons"&gt;&lt;/i&gt;&amp;nbsp; &lt;i class="fab fa-creative-commons-by"&gt;&lt;/i&gt;&lt;i class="fab fa-creative-commons-nc"&gt;&lt;/i&gt; &lt;a href = "https://JohnLittle.info"&gt;&lt;span class = "opacity30"&gt;https://&lt;/span&gt;JohnLittle&lt;span class = "opacity30"&gt;.info&lt;/span&gt;&lt;/a&gt;
-&lt;span class = "opacity30"&gt;&lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-01 &lt;/span&gt;
-&lt;/div&gt;
-
-
-
-
-
 ---
 ## HTML
 
@@ -181,20 +182,19 @@
 
 1. Development
 
-    - Import raw HTML of a single target page (page detail, or “leaf)
-    - Parse the HTML of the test page to gather the data you want
-    - Check robots.txt, terms of use
-    - In a web browser, manually browse and understand the site navigation of the scrape target (site navigation, or “branches)
-    - Parse the site navigation and develop an interation plan
-    - Iterate: write code that implements iteration, i.e. automated page crawling
+    - Import raw HTML of a single target page (page detail: a leaf or node)
+    - Parse the HTML of the test page and gather specific data
+    - Check _robots.txt_ and _Terms of Use_ (TOU)
+    - In a web browser, manually browse and understand the target site's navigation (site navigation: branches)
+    - _Parse_ the site navigation and develop an _iteration_ plan
+    - _Iterate_: orchestrate/automate page crawling
     - Perform a dry run with a limited subset of the target web site
-    - Construct time pauses (to avoid DNS attacks)
-    - Production
+    - Construct pauses: avoid the posture of a DoS attack
 
-1. Iterate
+1. Production
 
-    - Crawl the site navigation (branches)
-    - Parse HTML for each detail page (leaves)
+    - Iterate/Crawl the site (navigation: branches)
+    - Parse HTML for each target page (pages: leaves or nodes)
 
 
 
@@ -208,6 +208,10 @@
 ---
 background-image: url(images/selector_graph.png)
 
+&lt;!-- an image of branches and nodes --&gt;
+
+?? an image of branches and nodes
+
 
 
 
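
The iterate/crawl step with pauses, which this commit rewords in the workflow slide, can be sketched with `purrr::map()` and `Sys.sleep()`. This is a hedged sketch under assumed placeholders: the page URLs and the `.title` selector are hypothetical, and the 2-second pause is an illustrative choice, not a value from the slides:

```r
# Sketch of the "iterate/crawl" step: map over a small dry-run subset
# of page URLs, pausing between requests so the crawl does not take on
# the posture of a DoS attack. URLs and selector are hypothetical.
library(rvest)
library(purrr)

urls <- paste0("https://example.com/books/page-", 1:3)  # dry-run subset

scrape_page <- function(url) {
  Sys.sleep(2)               # always pause between requests
  read_html(url) %>%
    html_nodes(".title") %>%
    html_text()
}

results <- map(urls, scrape_page)  # one character vector per page
```

Using a named helper keeps the parse logic testable on a single page before the full crawl is run.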

slides/styles/my-theme.css

Lines changed: 4 additions & 0 deletions
@@ -24,6 +24,10 @@ height: auto;
 max-height: 200px;
 }
 
+.myc {
+  text-align: right;
+}
+
 
 
 
