---
title: Scraping SDK
---

As part of code generation, Reworkd generates code in its own custom SDK called [Harambe](https://github.com/reworkd/harambe).
Harambe is a web scraping SDK with a number of useful methods and features for:
- Saving data and validating that the data follows a specific schema
- Enqueuing (and automatically formatting) urls
- De-duplicating saved data, urls, etc.
- Effectively handling classic web scraping problems like pagination, pdfs, downloads, etc.

Each of these methods is described below, along with what it does, how it works, and examples of how to use it.
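
For a sense of how these pieces fit together, here is a minimal sketch of a scrape function written against the SDK. The import path, selectors, and schema fields are illustrative assumptions, not generated output:

```python
from harambe import SDK  # assumed import path for the Harambe SDK


async def scrape(sdk: SDK, url: str, *args, **kwargs) -> None:
    page = sdk.page  # assumed attribute exposing the underlying Playwright page

    # Illustrative selector: iterate over the listing cards on the current page
    for card in await page.query_selector_all("div.listing"):
        title_el = await card.query_selector("h2")
        title = await title_el.inner_text() if title_el else None

        # Save a row that matches the configured schema
        await sdk.save_data({"title": title})

        # Enqueue the detail page for a later stage
        link_el = await card.query_selector("a")
        href = await link_el.get_attribute("href") if link_el else None
        if href:
            # Relative hrefs are converted to absolute urls automatically
            await sdk.enqueue(href)
```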

---

## `save_data`
Save scraped data and validate that it matches the current schema.

**Signature:**
```python
def save_data(self, data: dict[str, Any]) -> None
```
**Params:**
- `data`: Rows of data (as dictionaries) to save

**Raises:**
- `SchemaValidationError`: If any of the saved data does not match the provided schema

**Example:**
```python
await sdk.save_data({ "title": "example", "description": "another_example" })
```
---

## `enqueue`
Enqueue url(s) to be scraped later.

**Signature:**
```python
def enqueue(self, urls: str | Awaitable[str], context: dict[str, Any] | None = None, options: dict[str, Any] | None = None) -> None
```
**Params:**
- `urls`: urls to enqueue
- `context`: Additional context to pass to the next stage/url. Typically this is data that is only available on the current page but required by the schema. Only use this when some data is available on this page but not on the page being enqueued.
- `options`: Job level options to pass to the next stage/url

**Example:**
```python
await sdk.enqueue("https://www.test.com")
await sdk.enqueue("/some-path")  # This will automatically be converted into an absolute url
```
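
When a value required by the schema is only visible on the current page, pass it forward with `context`. The path and field name below are hypothetical:

```python
# "category" is a hypothetical schema field that appears on this listing page
# but not on the detail page being enqueued
await sdk.enqueue("/providers/123", context={"category": "Dentists"})
```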
---

## `paginate`
SDK method to automatically facilitate paginating a list of elements.
Simply define a function that returns any of:
- A direct link to the next page
- An element with an href to the next page
- An element to click on to get to the next page

Then call `sdk.paginate` at the end of your scrape function. The returned value will automatically be used to paginate the site and run the scraping code against all pages.
Pagination concludes once all pages have been visited and no next page element is found.
This method should ALWAYS be used for pagination instead of manual for loops and if statements.

**Signature:**
```python
def paginate(self, get_next_page_element: Callable[..., Awaitable[str | ElementHandle | None]], timeout: int = 2000) -> None
```
**Params:**
- `get_next_page_element`: A function returning the url or ElementHandle of the next page
- `timeout`: Milliseconds to sleep before continuing. Only use this if there is no other wait option

**Example:**
```python
async def pager():
    return await page.query_selector("div.pagination > .pager.next")

await sdk.paginate(pager)
```
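
For context, here is a sketch of how `paginate` typically sits at the end of a scrape function. The import path, selectors, and schema field are illustrative assumptions:

```python
from harambe import SDK  # assumed import path for the Harambe SDK


async def scrape(sdk: SDK, url: str, *args, **kwargs) -> None:
    page = sdk.page  # assumed attribute exposing the underlying Playwright page

    # Scrape the rows on the current page (illustrative selector)
    for row in await page.query_selector_all("table.results tbody tr"):
        await sdk.save_data({"title": await row.inner_text()})

    async def pager():
        # Return the next page element; pagination stops when this returns None
        return await page.query_selector("div.pagination > .pager.next")

    # Re-runs the scraping logic above against every subsequent page
    await sdk.paginate(pager)
```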
---

## `capture_url`
Capture the url of a click event. This will click the element and return the url
via network request interception. This is useful for capturing urls that are
generated dynamically (e.g. redirects to document downloads).

**Signature:**
```python
def capture_url(self, clickable: ElementHandle, resource_type: Literal['document', 'stylesheet', 'image', 'media', 'font', 'script', 'texttrack', 'xhr', 'fetch', 'eventsource', 'websocket', 'manifest', 'other', '*'] = 'document', timeout: int | None = 10000) -> str | None
```
**Params:**
- `clickable`: the element to click
- `resource_type`: the type of resource to capture
- `timeout`: the time to wait for the new page to open (in ms)

**Return Value:**
url: the url of the captured resource, or None if no match was found

**Raises:**
- `ValueError`: if more than one page is created by the click event
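
**Example:**
A minimal sketch; the selector and schema field are illustrative assumptions:
```python
# Clicking this element (illustrative selector) triggers a dynamically generated url
link = await page.query_selector("a.view-document")
if link:
    document_url = await sdk.capture_url(link, resource_type="document")
    await sdk.save_data({"document_url": document_url})
```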
---

## `capture_download`
Capture a download event that gets triggered by clicking an element. This method will:
- Handle clicking the element
- Download the resulting file
- Apply download handling logic and build a download URL
- Return a download metadata object

Use this method to manually download dynamic files or files that can only be downloaded in the current browser session.

**Signature:**
```python
def capture_download(self, clickable: ElementHandle, override_filename: str | None = None, override_url: str | None = None, timeout: float | None = None) -> DownloadMeta
```

**Return Value:**
DownloadMeta: A typed dict containing the download metadata such as the `url` and `filename`
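
**Example:**
A minimal sketch; the selector and schema fields are illustrative assumptions:
```python
# Clicking this element (illustrative selector) triggers a file download in the current session
button = await page.query_selector("button.download-report")
if button:
    meta = await sdk.capture_download(button)
    await sdk.save_data({"file_name": meta["filename"], "download_url": meta["url"]})
```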
---

## `capture_html`
Capture and download the HTML content of the document or a specific element.
The returned HTML will be cleaned of any excluded elements and wrapped in a proper HTML document structure.

**Signature:**
```python
def capture_html(self, selector: str = 'html', exclude_selectors: list[str] | None = None, soup_transform: Callable[[BeautifulSoup], None] | None = None, html_converter_type: Literal['markdown', 'text'] = 'markdown') -> HTMLMetadata
```
**Params:**
- `selector`: CSS selector of the element to capture. Defaults to "html" for the document element.
- `exclude_selectors`: List of CSS selectors for elements to exclude from the capture.
- `soup_transform`: A function to transform the BeautifulSoup html prior to saving. Use this to remove aspects of the returned content.
- `html_converter_type`: Type of HTML converter to use for the inner text. Defaults to "markdown".

**Return Value:**
HTMLMetadata containing the `html` of the element and the formatted `text` of the element, along with the `url` and `filename` of the document

**Raises:**
- `ValueError`: If the specified selector doesn't match any element.

**Example:**
```python
meta = await sdk.capture_html(selector="div.content")
await sdk.save_data({"name": meta["filename"], "text": meta["text"], "download_url": meta["url"]})
```
---

## `capture_pdf`
Capture the current page as a pdf, then apply download handling logic from the observer to transform it into a usable URL.

**Signature:**
```python
def capture_pdf(self) -> DownloadMeta
```

**Return Value:**
DownloadMeta: A typed dict containing the download metadata such as the `url` and `filename`

**Example:**
```python
meta = await sdk.capture_pdf()
await sdk.save_data({"file_name": meta["filename"], "download_url": meta["url"]})
```