---
title: "R case study: web scraping"
author: "John Little"
date: "`r Sys.Date()`"
output:
  xaringan::moon_reader:
    lib_dir: libs
    css:
      - xaringan-themer.css
      - styles/my-theme.css
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---

```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
```

```{r xaringan-themer, include=FALSE, warning=FALSE}
library(xaringanthemer)
library(tidyverse)
library(gt)
library(xaringanExtra)
xaringanExtra::use_tachyons()
library(htmltools)
tagList(rmarkdown::html_dependency_font_awesome())

style_duo_accent(primary_color = "#012169", secondary_color = "#005587")
```

## Duke University: Land Acknowledgement

I would like to take a moment to honor the land in Durham, NC. Duke University sits on the ancestral lands of the Shakori, Eno, and Catawba people. This institution of higher education is built on land stolen from those peoples, who were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out of persistent patterns of colonization and to counter the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of our collective journeys.

---
## Demonstration Goals

- Building on earlier [Rfun workshops](https://rfun.library.duke.edu/)
- Web scraping is fundamentally a deconstruction process
- Introduce just enough HTML/CSS and HTTP
- Introduce the `rvest` package for harvesting websites/HTML
- Tidyverse iteration with `purrr::map()`
- Point out useful documentation & resources

.f7.i.moon-gray[This is not a research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. This is a demonstration of leveraging the Tidyverse.]

### Caveats
- Your success depends on how consistently the web author(s) built the site
- Read and follow the _Terms of Use_ for any target web host
- Read and honor the host's robots.txt (a quick check is sketched below) | https://www.robotstxt.org
    - Always **pause** between requests to avoid the perception of a _Denial of Service_ (DoS) attack

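A host's robots.txt is plain text, so it can be inspected straight from R before you scrape. A minimal sketch; the URL is a placeholder, not a real target:

```r
# Inspect a host's crawling rules before scraping.
# "https://example.com" is a placeholder -- substitute your target host.
robots <- readLines("https://example.com/robots.txt")
cat(robots, sep = "\n")
```
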
```{r child="_child-footer.Rmd", include=FALSE}
```

---
class: middle

.left-column[
### Scraping

.f6[Gather or ingest web page data for analysis]



`rvest::`
`read_html()`

]

.right-column[

**<span style="text-align: left;">Crawling</span> <span style="text-align: center;">+</span> <span style="text-align: right;">Parsing</span>**

<div class="container" id="imgspcl" style="width: 100%; max-width: 100%;">
<img src="images/crawling_med.jpg" width="50%"> + <img src="images/strain_comb.jpg" width="50%">
</div>



.pull-left[
.f7[Systematically iterating through a website, gathering data from more than one page (URL)]

`purrr::map()`
]

.pull-right[
.f7[Separating the syntactic elements of the HTML, keeping only the data you need]

`rvest::html_nodes()`
`rvest::html_text()`
`rvest::html_attr()`
]


]
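
Putting the two halves together: a minimal ingest-then-parse sketch, where the URL and CSS selector are placeholders rather than a real target:

```r
library(rvest)

# Ingest: pull the raw HTML of one page (placeholder URL)
page <- read_html("https://example.com/catalog")

# Parse: keep only the pieces you need (placeholder selector)
page %>%
  html_nodes(".title") %>%
  html_text()
```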


```{r child="_child-footer.Rmd", include=FALSE}
```


---
## HTML

Hypertext Markup Language

```html
<html>
  <body>

    <h1>My First Heading</h1>
    <p>My first paragraph contains a
      <a href="https://www.w3schools.com">link</a> to
      W3schools.com
    </p>

  </body>
</html>
```
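
rvest (via xml2) can parse HTML straight from a string, so the snippet above can be explored without hitting a live site. A quick sketch:

```r
library(rvest)

page <- read_html('<html><body>
  <h1>My First Heading</h1>
  <p>My first paragraph contains a
    <a href="https://www.w3schools.com">link</a> to W3schools.com</p>
  </body></html>')

page %>% html_node("h1") %>% html_text()       # "My First Heading"
page %>% html_node("a") %>% html_attr("href")  # "https://www.w3schools.com"
```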

```{r child="_child-footer.Rmd", include=FALSE}
```
---
## CSS

Cascading Style Sheets

```html

<html>
<body>

  <div class="abc"> </div>

  <div id="xyz"> </div>

</body>
</html>

```

http://www.vondel.humanities.uva.nl/style.css
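
CSS class and id attributes map straight onto rvest selectors: `.abc` selects by class, `#xyz` by id. A small sketch against the HTML above:

```r
library(rvest)

page <- read_html('<html><body>
  <div class="abc">by class</div>
  <div id="xyz">by id</div>
  </body></html>')

page %>% html_node(".abc") %>% html_text()  # select by class -> "by class"
page %>% html_node("#xyz") %>% html_text()  # select by id    -> "by id"
```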


```{r child="_child-footer.Rmd", include=FALSE}
```

---
## Procedure

The basic workflow of web scraping is

1. Development

    - Import the raw HTML of a single target page (page detail, or "leaf")
    - Parse the HTML of the test page to gather the data you want
    - Check robots.txt and the terms of use
    - In a web browser, manually browse and understand the site navigation of the scrape target (site navigation, or "branches")
    - Parse the site navigation and develop an iteration plan
    - Iterate: write code that implements iteration, i.e. automated page crawling (see the sketch below)
    - Perform a dry run with a limited subset of the target web site
    - Construct time pauses (to avoid the appearance of a DoS attack)

1. Production

    - Crawl the site navigation (branches)
    - Parse the HTML of each detail page (leaves)

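A compressed sketch of the production loop. The URLs, selector, and pause length are placeholders to be replaced with whatever you settled on during development:

```r
library(rvest)
library(purrr)

# Branches: detail-page URLs gathered while parsing the site navigation
# (placeholders -- substitute the URLs discovered during development)
leaf_urls <- c("https://example.com/page/1",
               "https://example.com/page/2")

scrape_leaf <- function(url) {
  Sys.sleep(2)  # pause between requests so the crawl never resembles a DoS attack
  read_html(url) %>%
    html_nodes("h1") %>%   # placeholder selector from the development phase
    html_text()
}

# Iterate: crawl every branch and parse each leaf
results <- map(leaf_urls, scrape_leaf)
```
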
```{r child="_child-footer.Rmd", include=FALSE}
```
---
background-image: url(images/selector_graph.png)


```{r child="_child-footer.Rmd", include=FALSE}
```
---
class: middle, center

.bg-washed-blue.b--navy.ba.bw2.br3.shadow-5.ph4.mt5[

## John R Little

.f5.blue[Data Science Librarian  
Center for Data & Visualization Sciences  
Duke University Libraries
]

.f7[https://johnlittle.info]  
.f7[https://Rfun.library.duke.edu]  
.f7[https://library.duke.edu/data]
]

<i class="fab fa-creative-commons fa-2x"></i> <i class="fab fa-creative-commons-by fa-2x"></i> <i class="fab fa-creative-commons-nc fa-2x"></i>  
.f6.moon-gray[Creative Commons: Attribution-NonCommercial 4.0]  
.f7.moon-gray[https://creativecommons.org/licenses/by-nc/4.0]
