Commit a1c9f3f: slides 0.7

1 parent 3740a19 commit a1c9f3f

30 files changed: +7133 −0 lines changed

slides/_child-footer.Rmd

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
---
title: "slides child footer"
---

<div class="footercc">
<i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href="https://JohnLittle.info"><span class="opacity30">https://</span>JohnLittle<span class="opacity30">.info</span></a> |
<a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | `r Sys.Date()`
</div>

slides/_child-footer.html

Lines changed: 245 additions & 0 deletions
Large diffs are not rendered by default.
27 KB

slides/images/crawling_med.jpg

190 KB

slides/images/crawling_small.jpg

51.8 KB

slides/images/selector_graph.png

64.7 KB

slides/images/strain_comb.jpg

360 KB

slides/index.Rmd

Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
---
title: "R case study: web scraping"
author: "John Little"
date: "`r Sys.Date()`"
output:
  xaringan::moon_reader:
    lib_dir: libs
    css:
      - xaringan-themer.css
      - styles/my-theme.css
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---

```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
```

```{r xaringan-themer, include=FALSE, warning=FALSE}
library(xaringanthemer)
library(tidyverse)
library(gt)
library(xaringanExtra)
xaringanExtra::use_tachyons()
library(htmltools)
tagList(rmarkdown::html_dependency_font_awesome())

style_duo_accent(primary_color = "#012169", secondary_color = "#005587")
```

## Duke University: Land Acknowledgement

I would like to take a moment to honor the land in Durham, NC. Duke University sits on the ancestral lands of the Shakori, Eno, and Catawba people. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of collective journeys.

---
## Demonstration Goals

- Building on earlier [Rfun workshops](https://rfun.library.duke.edu/)
- Web scraping is fundamentally a deconstruction process
- Introduce just enough HTML/CSS and HTTP
- Introduce the `library(rvest)` package for harvesting websites/HTML
- Tidyverse iteration with `purrr::map()`
- Point out useful documentation & resources

.f7.i.moon-gray[This is not a research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. This is a demonstration of leveraging the Tidyverse.]

### Caveats

- You will only be as successful as the web author(s) were consistent
- Read and follow the _Terms of Use_ for any target web host
- Read and honor the host's robots.txt | https://www.robotstxt.org
- Always **pause** between requests to avoid the perception of a _Denial of Service_ (DoS) attack
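
One lightweight way to build in that pause is a small wrapper around `rvest::read_html()`. This is a sketch, not part of the workshop code; the helper name `polite_read` and the two-second default delay are illustrative choices:

```r
# Hypothetical helper (not from the workshop code): sleep before every
# request so repeated calls never look like a denial-of-service attack.
# The 2-second default is an arbitrary example value.
polite_read <- function(url, pause = 2) {
  Sys.sleep(pause)          # always pause first
  rvest::read_html(url)     # then fetch and parse the page
}

# Usage (not run here; needs a live page):
# page <- polite_read("https://example.com")
```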

```{r child="_child-footer.Rmd", include=FALSE}
```

---
class: middle

.left-column[
### Scraping

.f6[Gather or ingest web page data for analysis]

![scraping bee propolis](images/Scraping_propolis.jpg "scraping propolis")
&nbsp;
`rvest::`
`read_html()`

]

.right-column[

**<span style="text-align: left;">Crawling</span> <span style="text-align: center;">+</span> <span style="text-align: right;">Parsing</span>**

<div class="container" id="imgspcl" style="width: 100%; max-width: 100%;">
<img src="images/crawling_med.jpg" width="50%"> &nbsp; + &nbsp; <img src="images/strain_comb.jpg" width="50%">
</div>

.pull-left[
.f7[Systematically iterating through a website, gathering data from more than one page (URL)]

`purrr::map()`
]

.pull-right[
.f7[Separating the syntactic elements of the HTML, keeping only the data you need]

`rvest::html_nodes()`
`rvest::html_text()`
`rvest::html_attr()`
]

]
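
The three parsing verbs named above can be sketched offline with `rvest::minimal_html()`, which builds an in-memory page; the fragment below is made up for illustration and is not the workshop's target site:

```r
library(rvest)

# An in-memory stand-in page, so the sketch runs without a network call
page <- minimal_html('<p>See the <a href="https://example.com">docs</a></p>')

page %>% html_nodes("a") %>% html_text()        # "docs"
page %>% html_nodes("a") %>% html_attr("href")  # "https://example.com"
```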

```{r child="_child-footer.Rmd", include=FALSE}
```

---
## HTML

HyperText Markup Language

```html
<html>
  <body>

    <h1>My First Heading</h1>
    <p>My first paragraph contains a
      <a href="https://www.w3schools.com">link</a> to
      W3schools.com
    </p>

  </body>
</html>
```

```{r child="_child-footer.Rmd", include=FALSE}
```

---
## CSS

Cascading Style Sheets select elements by attributes such as `class` and `id`:

```html
<html>
  <body>

    <div class="abc"> </div>

    <div id="xyz"> </div>

  </body>
</html>
```

An example stylesheet: http://www.vondel.humanities.uva.nl/style.css
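
In rvest, the `class` and `id` attributes above map to the CSS selectors `.abc` and `#xyz`. A sketch against an in-memory stand-in fragment (made up for illustration, not taken from the stylesheet linked above):

```r
library(rvest)

# Stand-in fragment mirroring the slide: one class, one id
page <- minimal_html('<div class="abc">first</div> <div id="xyz">second</div>')

page %>% html_nodes(".abc") %>% html_text()   # "first"  (class selector)
page %>% html_nodes("#xyz") %>% html_text()   # "second" (id selector)
```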

```{r child="_child-footer.Rmd", include=FALSE}
```

---
## Procedure

The basic workflow of web scraping is

1. Development

    - Import raw HTML of a single target page (page detail, or "leaf")
    - Parse the HTML of the test page to gather the data you want
    - Check robots.txt and the terms of use
    - In a web browser, manually browse and understand the site navigation of the scrape target (site navigation, or "branches")
    - Parse the site navigation and develop an iteration plan
    - Iterate: write code that implements iteration, i.e. automated page crawling
    - Perform a dry run with a limited subset of the target web site
    - Construct time pauses (to avoid the appearance of a DoS attack)

1. Production: iterate

    - Crawl the site navigation (branches)
    - Parse HTML for each detail page (leaves)
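
The two production steps can be sketched with `purrr::map()`. Everything in this sketch is hypothetical: the URLs and the `h1` selector are placeholders, not the workshop's actual target:

```r
library(rvest)
library(purrr)

# Hypothetical leaf URLs discovered while crawling the site navigation
leaf_urls <- c("https://example.com/page-1",
               "https://example.com/page-2")

parse_leaf <- function(url) {
  Sys.sleep(2)                 # pause between requests
  read_html(url) %>%
    html_nodes("h1") %>%       # placeholder selector
    html_text()
}

# map() visits each leaf in turn, returning a list of parsed results
# results <- map(leaf_urls, parse_leaf)   # not run here: needs live pages
```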

```{r child="_child-footer.Rmd", include=FALSE}
```

---
background-image: url(images/selector_graph.png)

```{r child="_child-footer.Rmd", include=FALSE}
```

---
class: middle, center

.bg-washed-blue.b--navy.ba.bw2.br3.shadow-5.ph4.mt5[

## John R Little

.f5.blue[Data Science Librarian
Center for Data & Visualization Sciences
Duke University Libraries
]

.f7[https://johnlittle.info]
.f7[https://Rfun.library.duke.edu]
.f7[https://library.duke.edu/data]
]

<i class="fab fa-creative-commons fa-2x"></i> &nbsp; <i class="fab fa-creative-commons-by fa-2x"></i><i class="fab fa-creative-commons-nc fa-2x"></i>
.f6.moon-gray[Creative Commons: Attribution-NonCommercial 4.0]
.f7.moon-gray[https://creativecommons.org/licenses/by-nc/4.0]
