Commit cbdb3bc

committed
beta
1 parent b8334cd commit cbdb3bc

2 files changed

+83
-47
lines changed

slides/index.Rmd

Lines changed: 33 additions & 15 deletions
@@ -37,18 +37,18 @@ I would like to take a moment to honor the land in Durham, NC. Duke University
3737
---
3838
## Demonstration Goals
3939

40+
```{r child="_child-footer.Rmd", include=FALSE}
41+
```
42+
4043
- Building on earlier [Rfun workshops](https://rfun.library.duke.edu/)
4144
- Web scraping is fundamentally a deconstruction process
4245
- Introduce just enough HTML/CSS and HTTP
4346
- Introduce the `library(rvest)` package for harvesting websites/HTML
44-
- Tidyverse iteration with `purrr:map`
47+
- Tidyverse iteration with `purrr::map`
4548
- Point out useful documentation & resources
4649

4750
.f7.i.moon-gray[This is a demonstration of leveraging the Tidyverse. This is not an research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. ]
4851

49-
```{r child="_child-footer.Rmd", include=FALSE}
50-
```
51-
5252
--
5353

5454
### Caveats
@@ -57,10 +57,9 @@ I would like to take a moment to honor the land in Durham, NC. Duke University
5757
- Read and honor the host's robots.txt | https://www.robotstxt.org
5858
- Always **pause** to avoid the perception of a _Denial of Service_ (DOS) attack
5959

60-
61-
6260
---
63-
class:middle
61+
```{r child="_child-footer.Rmd", include=FALSE}
62+
```
6463

6564
.left-column[
6665
### Scraping =
@@ -70,13 +69,9 @@ class:middle
7069
![scraping bee propolis](images/Scraping_propolis.jpg "scraping propolis")
7170
 
7271
`rvest::`
73-
`read_html()`
74-
72+
`read_html()`
7573
]
7674

77-
```{r child="_child-footer.Rmd", include=FALSE}
78-
```
79-
8075
--
8176

8277
.right-column[
@@ -90,7 +85,18 @@ class:middle
9085
.pull-left[
9186
.f7[Systematically iterating through a website, gathering data from more than one page (URL)]
9287

93-
`purrr::map()`
88+
`purrr::map()`
89+
90+
 
91+
92+
 
93+
94+
 
95+
96+
97+
.f7[
98+
https://purrr.tidyverse.org
99+
]
94100
]
95101

96102

@@ -99,10 +105,22 @@ class:middle
99105

100106
`rvest::html_nodes()`
101107
`rvest::html_text()`
102-
`rvest::html_attr()`
108+
`rvest::html_attr()`
109+
110+
 
111+
112+
 
113+
114+
115+
.f7[
116+
https://rvest.tidyverse.org
117+
118+
]
103119
]
104120
]
105121

122+
123+
106124
---
107125
## HTML
108126

@@ -113,7 +131,7 @@ Hypter Text Markup Language
113131
<body>
114132

115133
<h1>My First Heading</h1>
116-
<p>My first paragraph. contains a
134+
<p>My first paragraph contains a
117135
<a href="https://www.w3schools.com">link</a> to
118136
W3schools.com
119137
</p>
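
The functions named on the slides in this file (`read_html()`, `html_nodes()`, `html_text()`, `html_attr()`, and `purrr::map()`) might fit together roughly as in the sketch below. It is a minimal illustration, not code from the workshop. It parses the sample HTML from the HTML slide as an inline string rather than fetching a URL, and the helper name `scrape_links` and its `pause_seconds` argument are hypothetical.

```r
library(rvest)
library(purrr)

# Sample page adapted from the HTML slide; a real scrape would pass a URL instead.
page_source <- '<html><body>
<h1>My First Heading</h1>
<p>My first paragraph contains a
<a href="https://www.w3schools.com">link</a> to
W3schools.com
</p>
</body></html>'

# Hypothetical helper: parse one page and collect each link's text and target.
scrape_links <- function(source, pause_seconds = 0) {
  Sys.sleep(pause_seconds)  # pause between requests, as the Caveats slide advises
  links <- html_nodes(read_html(source), "a")
  data.frame(
    text = html_text(links),
    href = html_attr(links, "href"),
    stringsAsFactors = FALSE
  )
}

# Iterate over several "pages" with purrr::map(); here the same sample twice.
results <- map(list(page_source, page_source), scrape_links)
```

When looping over real URLs, a non-zero `pause_seconds` (for example `map(urls, scrape_links, pause_seconds = 2)`) spaces out the requests so the host is not flooded.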

slides/index.html

Lines changed: 50 additions & 32 deletions
@@ -4,7 +4,7 @@
44
<title>R case study: web scraping</title>
55
<meta charset="utf-8" />
66
<meta name="author" content="John Little" />
7-
<meta name="date" content="2021-03-01" />
7+
<meta name="date" content="2021-03-02" />
88
<script src="libs/header-attrs/header-attrs.js"></script>
99
<link href="libs/tachyons/tachyons.min.css" rel="stylesheet" />
1010
<link href="libs/font-awesome/css/all.css" rel="stylesheet" />
@@ -18,7 +18,7 @@
1818

1919
# R case study: web scraping
2020
### John Little
21-
### 2021-03-01
21+
### 2021-03-02
2222

2323
---
2424

@@ -34,25 +34,25 @@
3434
---
3535
## Demonstration Goals
3636

37-
- Building on earlier [Rfun workshops](https://rfun.library.duke.edu/)
38-
- Web scraping is fundamentally a deconstruction process
39-
- Introduce just enough HTML/CSS and HTTP
40-
- Introduce the `library(rvest)` package for harvesting websites/HTML
41-
- Tidyverse iteration with `purrr:map`
42-
- Point out useful documentation &amp; resources
43-
44-
.f7.i.moon-gray[This is a demonstration of leveraging the Tidyverse. This is not an research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. ]
45-
4637

4738

4839
&lt;div class="footercc"&gt;
4940
&lt;i class="fab fa-creative-commons"&gt;&lt;/i&gt;&amp;nbsp; &lt;i class="fab fa-creative-commons-by"&gt;&lt;/i&gt;&lt;i class="fab fa-creative-commons-nc"&gt;&lt;/i&gt; &lt;a href = "https://JohnLittle.info"&gt;&lt;span class = "opacity30"&gt;https://&lt;/span&gt;JohnLittle&lt;span class = "opacity30"&gt;.info&lt;/span&gt;&lt;/a&gt;
50-
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-01 &lt;/span&gt;
41+
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-02 &lt;/span&gt;
5142
&lt;/div&gt;
5243

5344

5445

5546

47+
- Building on earlier [Rfun workshops](https://rfun.library.duke.edu/)
48+
- Web scraping is fundamentally a deconstruction process
49+
- Introduce just enough HTML/CSS and HTTP
50+
- Introduce the `library(rvest)` package for harvesting websites/HTML
51+
- Tidyverse iteration with `purrr::map`
52+
- Point out useful documentation &amp; resources
53+
54+
.f7.i.moon-gray[This is a demonstration of leveraging the Tidyverse. This is not an research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. ]
55+
5656
--
5757

5858
### Caveats
@@ -61,10 +61,16 @@
6161
- Read and honor the host's robots.txt | https://www.robotstxt.org
6262
- Always **pause** to avoid the perception of a _Denial of Service_ (DOS) attack
6363

64+
---
65+
66+
67+
&lt;div class="footercc"&gt;
68+
&lt;i class="fab fa-creative-commons"&gt;&lt;/i&gt;&amp;nbsp; &lt;i class="fab fa-creative-commons-by"&gt;&lt;/i&gt;&lt;i class="fab fa-creative-commons-nc"&gt;&lt;/i&gt; &lt;a href = "https://JohnLittle.info"&gt;&lt;span class = "opacity30"&gt;https://&lt;/span&gt;JohnLittle&lt;span class = "opacity30"&gt;.info&lt;/span&gt;&lt;/a&gt;
69+
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-02 &lt;/span&gt;
70+
&lt;/div&gt;
71+
6472

6573

66-
---
67-
class:middle
6874

6975
.left-column[
7076
### Scraping =
@@ -74,20 +80,9 @@
7480
![scraping bee propolis](images/Scraping_propolis.jpg "scraping propolis")
7581
&amp;nbsp;
7682
`rvest::`
77-
`read_html()`
78-
83+
`read_html()`
7984
]
8085

81-
82-
83-
&lt;div class="footercc"&gt;
84-
&lt;i class="fab fa-creative-commons"&gt;&lt;/i&gt;&amp;nbsp; &lt;i class="fab fa-creative-commons-by"&gt;&lt;/i&gt;&lt;i class="fab fa-creative-commons-nc"&gt;&lt;/i&gt; &lt;a href = "https://JohnLittle.info"&gt;&lt;span class = "opacity30"&gt;https://&lt;/span&gt;JohnLittle&lt;span class = "opacity30"&gt;.info&lt;/span&gt;&lt;/a&gt;
85-
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-01 &lt;/span&gt;
86-
&lt;/div&gt;
87-
88-
89-
90-
9186
--
9287

9388
.right-column[
@@ -101,7 +96,18 @@
10196
.pull-left[
10297
.f7[Systematically iterating through a website, gathering data from more than one page (URL)]
10398

104-
`purrr::map()`
99+
`purrr::map()`
100+
101+
&amp;nbsp;
102+
103+
&amp;nbsp;
104+
105+
&amp;nbsp;
106+
107+
108+
.f7[
109+
https://purrr.tidyverse.org
110+
]
105111
]
106112

107113

@@ -110,9 +116,21 @@
110116

111117
`rvest::html_nodes()`
112118
`rvest::html_text()`
113-
`rvest::html_attr()`
119+
`rvest::html_attr()`
120+
121+
&amp;nbsp;
122+
123+
&amp;nbsp;
124+
125+
126+
.f7[
127+
https://rvest.tidyverse.org
128+
114129
]
115130
]
131+
]
132+
133+
116134

117135
---
118136
## HTML
@@ -138,7 +156,7 @@
138156

139157
&lt;div class="footercc"&gt;
140158
&lt;i class="fab fa-creative-commons"&gt;&lt;/i&gt;&amp;nbsp; &lt;i class="fab fa-creative-commons-by"&gt;&lt;/i&gt;&lt;i class="fab fa-creative-commons-nc"&gt;&lt;/i&gt; &lt;a href = "https://JohnLittle.info"&gt;&lt;span class = "opacity30"&gt;https://&lt;/span&gt;JohnLittle&lt;span class = "opacity30"&gt;.info&lt;/span&gt;&lt;/a&gt;
141-
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-01 &lt;/span&gt;
159+
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-02 &lt;/span&gt;
142160
&lt;/div&gt;
143161

144162

@@ -175,7 +193,7 @@
175193

176194
&lt;div class="footercc"&gt;
177195
&lt;i class="fab fa-creative-commons"&gt;&lt;/i&gt;&amp;nbsp; &lt;i class="fab fa-creative-commons-by"&gt;&lt;/i&gt;&lt;i class="fab fa-creative-commons-nc"&gt;&lt;/i&gt; &lt;a href = "https://JohnLittle.info"&gt;&lt;span class = "opacity30"&gt;https://&lt;/span&gt;JohnLittle&lt;span class = "opacity30"&gt;.info&lt;/span&gt;&lt;/a&gt;
178-
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-01 &lt;/span&gt;
196+
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-02 &lt;/span&gt;
179197
&lt;/div&gt;
180198

181199

@@ -206,7 +224,7 @@
206224

207225
&lt;div class="footercc"&gt;
208226
&lt;i class="fab fa-creative-commons"&gt;&lt;/i&gt;&amp;nbsp; &lt;i class="fab fa-creative-commons-by"&gt;&lt;/i&gt;&lt;i class="fab fa-creative-commons-nc"&gt;&lt;/i&gt; &lt;a href = "https://JohnLittle.info"&gt;&lt;span class = "opacity30"&gt;https://&lt;/span&gt;JohnLittle&lt;span class = "opacity30"&gt;.info&lt;/span&gt;&lt;/a&gt;
209-
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-01 &lt;/span&gt;
227+
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-02 &lt;/span&gt;
210228
&lt;/div&gt;
211229

212230

@@ -223,7 +241,7 @@
223241

224242
&lt;div class="footercc"&gt;
225243
&lt;i class="fab fa-creative-commons"&gt;&lt;/i&gt;&amp;nbsp; &lt;i class="fab fa-creative-commons-by"&gt;&lt;/i&gt;&lt;i class="fab fa-creative-commons-nc"&gt;&lt;/i&gt; &lt;a href = "https://JohnLittle.info"&gt;&lt;span class = "opacity30"&gt;https://&lt;/span&gt;JohnLittle&lt;span class = "opacity30"&gt;.info&lt;/span&gt;&lt;/a&gt;
226-
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-01 &lt;/span&gt;
244+
&lt;span class = "opacity30"&gt; | &lt;a href="https://github.com/libjohn/workshop_webscraping"&gt;https://github.com/libjohn/workshop_webscraping&lt;/a&gt; | 2021-03-02 &lt;/span&gt;
227245
&lt;/div&gt;
228246

229247
