|
41 | 41 | - Tidyverse iteration with `purrr::map()` |
42 | 42 | - Point out useful documentation & resources |
43 | 43 |
|
44 | | -.f7.i.moon-gray[This is not an research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. This is a demonstration of leveraging the Tidyverse.] |
| 44 | +.f7.i.moon-gray[This is a demonstration of leveraging the Tidyverse. This is not a research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex.] |
45 | 45 |
|
46 | | -### Caveats |
47 | | -- You will be as successful as the web author(s) were consistent |
48 | | -- Read and follow the _Terms of Use_ for any target web host |
49 | | -- Read and honor the host's robots.txt | https://www.robotstxt.org |
50 | | - - Always **pause** to avoid the perception of a _Denial of Service_ (DOS) attack |
51 | | - |
52 | 46 |
|
53 | 47 |
|
54 | 48 | <div class="footercc"> |
|
59 | 53 |
|
60 | 54 |
|
61 | 55 |
|
| 56 | +-- |
| 57 | + |
| 58 | +### Caveats |
| 59 | +- You will be as successful as the web author(s) were consistent |
| 60 | +- Read and follow the _Terms of Use_ for any target web host |
| 61 | +- Read and honor the host's robots.txt | https://www.robotstxt.org |
| 62 | + - Always **pause** to avoid the perception of a _Denial of Service_ (DoS) attack (see the sketch below) |
| 63 | + |
| 64 | + |
| 65 | + |
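A minimal etiquette sketch in R, assuming the rOpenSci `robotstxt` package (not part of the original slides) and a hypothetical target URL:

```r
library(robotstxt)  # rOpenSci helper for fetching and parsing robots.txt

# Hypothetical target URL -- substitute your real scrape target
target_url <- "https://books.toscrape.com/catalogue/page-1.html"

# Returns TRUE when the host's robots.txt permits fetching this path
paths_allowed(target_url)

# Pause between requests so the crawl never looks like a DoS attack
Sys.sleep(2)
```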
62 | 66 | --- |
63 | 67 | class:middle |
64 | 68 |
|
65 | 69 | .left-column[ |
66 | | -### Scraping |
| 70 | +### Scraping = |
67 | 71 |
|
68 | 72 | .f6[Gather or ingest web page data for analysis] |
69 | 73 |
|
|
74 | 78 |
|
75 | 79 | ] |
76 | 80 |
|
| 81 | + |
| 82 | + |
| 83 | +<div class="footercc"> |
| 84 | +<i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a> |
| 85 | +<span class = "opacity30"><a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-01 </span> |
| 86 | +</div> |
| 87 | + |
| 88 | + |
| 89 | + |
| 90 | + |
| 91 | +-- |
| 92 | + |
77 | 93 | .right-column[ |
78 | 94 |
|
79 | | -**<span text-align: left;>Crawling<span> <span text-align: center;>+</span> <span text-align:right;>Parsing</span>** |
| 95 | +**Crawling + Parsing** |
80 | 96 |
|
81 | 97 | <div class = "container" id = "imgspcl" style="width: 100%; max-width: 100%;"> |
82 | 98 | <img src = "images/crawling_med.jpg" width = "50%"> &nbsp; + &nbsp; <img src = "images/strain_comb.jpg" width="50%"> |
83 | 99 | </div> |
84 | 100 |
|
85 | | - |
86 | | - |
87 | 101 | .pull-left[ |
88 | 102 | .f7[Systematically iterating through a website, gathering data from more than one page (URL)] |
89 | 103 |
|
90 | 104 | `purrr::map()` |
91 | 105 | ] |
92 | 106 |
|
| 107 | + |
93 | 108 | .pull-right[ |
94 | 109 | .f7[Separating the syntactic elements of the HTML and keeping only the data you need (see the sketch below)] |
95 | 110 |
|
96 | 111 | `rvest::html_nodes()` |
97 | 112 | `rvest::html_text()` |
98 | 113 | `rvest::html_attr()` |
99 | 114 | ] |
100 | | - |
101 | | - |
102 | 115 | ] |
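As a sketch of the parsing half, here is one way the three rvest functions above might be combined on a single page; the URL and CSS selectors are hypothetical placeholders:

```r
library(rvest)  # also attaches the %>% pipe

# Hypothetical detail page (a "leaf")
page <- read_html("https://books.toscrape.com/catalogue/page-1.html")

# Keep only the elements matched by a CSS selector ...
titles <- page %>%
  html_nodes("h3 a") %>%   # placeholder selector
  html_attr("title")       # ... then pull an attribute value

prices <- page %>%
  html_nodes(".price_color") %>%  # placeholder selector
  html_text()                     # ... or the visible text
```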
103 | 116 |
|
104 | | - |
105 | | - |
106 | | - |
107 | | -<div class="footercc"> |
108 | | -<i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a> |
109 | | -<span class = "opacity30"><a href="https://github.com/libjohn/workshop_webscraping">https://github.com/libjohn/workshop_webscraping</a> | 2021-03-01 </span> |
110 | | -</div> |
111 | | - |
112 | | - |
113 | | - |
114 | | - |
115 | | - |
116 | 117 | --- |
117 | 118 | ## HTML |
118 | 119 |
|
|
181 | 182 |
|
182 | 183 | 1. Development |
183 | 184 |
|
184 | | - - Import raw HTML of a single target page (page detail, or “leaf”) |
185 | | - - Parse the HTML of the test page to gather the data you want |
186 | | - - Check robots.txt, terms of use |
187 | | - - In a web browser, manually browse and understand the site navigation of the scrape target (site navigation, or “branches”) |
188 | | - - Parse the site navigation and develop an interation plan |
189 | | - - Iterate: write code that implements iteration, i.e. automated page crawling |
| 185 | + - Import raw HTML of a single target page (page detail: a leaf or node) |
| 186 | + - Parse the HTML of the test page and gather specific data |
| 187 | + - Check _robots.txt_ and _Terms of Use_ (TOU) |
| 188 | + - In a web browser, manually browse and understand the target site's navigation (site navigation: branches) |
| 189 | + - _Parse_ the site navigation and develop an _iteration_ plan |
| 190 | + - _Iterate_: orchestrate/automate page crawling |
190 | 191 | - Perform a dry run with a limited subset of the target web site |
191 | | - - Construct time pauses (to avoid DNS attacks) |
192 | | - - Production |
| 192 | + - Construct pauses: avoid the posture of a DoS (Denial of Service) attack |
193 | 193 |
|
194 | | -1. Iterate |
| 194 | +1. Production |
195 | 195 |
|
196 | | - - Crawl the site navigation (branches) |
197 | | - - Parse HTML for each detail page (leaves) |
| 196 | + - Iterate/Crawl the site (navigation: branches) |
| 197 | + - Parse HTML for each target page (pages: leaves or nodes); see the sketch below |
198 | 198 |
|
199 | 199 |
|
200 | 200 |
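Putting the two phases together, a hedged end-to-end sketch of the workflow above; the site, the selectors, and the `scrape_page()` helper are illustrative assumptions, and the `1:3` range keeps the dry run to a limited subset:

```r
library(tidyverse)  # purrr, stringr, tibble, %>%
library(rvest)

# Illustrative helper: parse one detail page (leaf) into a tibble
scrape_page <- function(url) {
  Sys.sleep(2)  # pause: avoid the posture of a DoS attack
  page <- read_html(url)
  tibble(
    title = page %>% html_nodes("h3 a") %>% html_attr("title"),
    price = page %>% html_nodes(".price_color") %>% html_text()
  )
}

# Crawl the branches: build page URLs for a small dry run, then iterate
page_urls <- str_c("https://books.toscrape.com/catalogue/page-", 1:3, ".html")

books_df <- map_dfr(page_urls, scrape_page)  # purrr::map_dfr row-binds results
```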
|
|
208 | 208 | --- |
209 | 209 | background-image: url(images/selector_graph.png) |
210 | 210 |
|
| 211 | +<!-- an image of branches and nodes --> |
| 212 | + |
| 214 | + |
211 | 215 |
|
212 | 216 |
|
213 | 217 |
|
|