|
4 | 4 | <title>R case study: web scraping</title> |
5 | 5 | <meta charset="utf-8" /> |
6 | 6 | <meta name="author" content="John Little" /> |
7 | | - <meta name="date" content="2021-03-01" /> |
| 7 | + <meta name="date" content="2021-03-02" /> |
8 | 8 | <script src="libs/header-attrs/header-attrs.js"></script> |
9 | 9 | <link href="libs/tachyons/tachyons.min.css" rel="stylesheet" /> |
10 | 10 | <link href="libs/font-awesome/css/all.css" rel="stylesheet" /> |
|
18 | 18 |
|
19 | 19 | # R case study: web scraping |
20 | 20 | ### John Little |
21 | | -### 2021-03-01 |
| 21 | +### 2021-03-02 |
22 | 22 |
|
23 | 23 | --- |
24 | 24 |
|
|
34 | 34 | --- |
35 | 35 | ## Demonstration Goals |
36 | 36 |
|
37 | | -- Building on earlier [Rfun workshops](https://rfun.library.duke.edu/) |
38 | | -- Web scraping is fundamentally a deconstruction process |
39 | | -- Introduce just enough HTML/CSS and HTTP |
40 | | -- Introduce the `library(rvest)` package for harvesting websites/HTML |
41 | | -- Tidyverse iteration with `purrr:map` |
42 | | -- Point out useful documentation & resources |
43 | | - |
44 | | -.f7.i.moon-gray[This is a demonstration of leveraging the Tidyverse. This is not a research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. ]
45 | | - |
46 | 37 |
|
47 | 38 |
|
48 | 39 | <div class="footercc"> |
49 | 40 | <i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a> |
50 | | -<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-01 </span> |
| 41 | +<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-02 </span> |
51 | 42 | </div> |
52 | 43 |
|
53 | 44 |
|
54 | 45 |
|
55 | 46 |
|
| 47 | +- Building on earlier [Rfun workshops](https://rfun.library.duke.edu/) |
| 48 | +- Web scraping is fundamentally a deconstruction process |
| 49 | +- Introduce just enough HTML/CSS and HTTP |
| 50 | +- Introduce the `library(rvest)` package for harvesting websites/HTML |
| 51 | +- Tidyverse iteration with `purrr::map` |
| 52 | +- Point out useful documentation & resources |
| 53 | + |
| 54 | +.f7.i.moon-gray[This is a demonstration of leveraging the Tidyverse. This is not a research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. ]
| 55 | + |
56 | 56 | -- |
57 | 57 |
|
58 | 58 | ### Caveats |
|
61 | 61 | - Read and honor the host's robots.txt | https://www.robotstxt.org |
62 | 62 | - Always **pause** between requests to avoid the perception of a _Denial of Service_ (DoS) attack |
63 | 63 |
|
| 64 | +--- |
| 65 | + |
| 66 | + |
| 67 | +<div class="footercc"> |
| 68 | +<i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a> |
| 69 | +<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-02 </span> |
| 70 | +</div> |
| 71 | + |
64 | 72 |
|
65 | 73 |
|
66 | | ---- |
67 | | -class:middle |
68 | 74 |
|
69 | 75 | .left-column[ |
70 | 76 | ### Scraping = |
|
74 | 80 |  |
75 | 81 | &nbsp; |
76 | 82 | `rvest::` |
77 | | -`read_html()` |
78 | | - |
| 83 | +`read_html()` |
79 | 84 | ] |
80 | 85 |
|
81 | | - |
82 | | - |
83 | | -<div class="footercc"> |
84 | | -<i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a> |
85 | | -<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-01 </span> |
86 | | -</div> |
87 | | - |
88 | | - |
89 | | - |
90 | | - |
91 | 86 | -- |
92 | 87 |
|
93 | 88 | .right-column[ |
|
101 | 96 | .pull-left[ |
102 | 97 | .f7[Systematically iterating through a website, gathering data from more than one page (URL)] |
103 | 98 |
|
104 | | -`purrr::map()` |
| 99 | +`purrr::map()` |
| 100 | + |
| 101 | +&nbsp; |
| 102 | + |
| 103 | +&nbsp; |
| 104 | + |
| 105 | +&nbsp; |
| 106 | + |
| 107 | + |
| 108 | +.f7[ |
| 109 | +https://purrr.tidyverse.org |
| 110 | +] |
105 | 111 | ] |
106 | 112 |
|
107 | 113 |
|
|
110 | 116 |
|
111 | 117 | `rvest::html_nodes()` |
112 | 118 | `rvest::html_text()` |
113 | | -`rvest::html_attr()` |
| 119 | +`rvest::html_attr()` |
| 120 | + |
| 121 | +&nbsp; |
| 122 | + |
| 123 | +&nbsp; |
| 124 | + |
| 125 | + |
| 126 | +.f7[ |
| 127 | +https://rvest.tidyverse.org |
| 128 | + |
114 | 129 | ] |
115 | 130 | ] |
| 131 | +] |
| 132 | + |
| 133 | + |
116 | 134 |
|
117 | 135 | --- |
118 | 136 | ## HTML |
|
138 | 156 |
|
139 | 157 | <div class="footercc"> |
140 | 158 | <i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a> |
141 | | -<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-01 </span> |
| 159 | +<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-02 </span> |
142 | 160 | </div> |
143 | 161 |
|
144 | 162 |
|
|
175 | 193 |
|
176 | 194 | <div class="footercc"> |
177 | 195 | <i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a> |
178 | | -<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-01 </span> |
| 196 | +<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-02 </span> |
179 | 197 | </div> |
180 | 198 |
|
181 | 199 |
|
|
206 | 224 |
|
207 | 225 | <div class="footercc"> |
208 | 226 | <i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a> |
209 | | -<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-01 </span> |
| 227 | +<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-02 </span> |
210 | 228 | </div> |
211 | 229 |
|
212 | 230 |
|
|
223 | 241 |
|
224 | 242 | <div class="footercc"> |
225 | 243 | <i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a> |
226 | | -<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-01 </span> |
| 244 | +<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-02 </span> |
227 | 245 | </div> |
228 | 246 |
|
229 | 247 |
|
|