
Commit 9a72474

📄 Update docs (#1661)
1 parent cd900e5 commit 9a72474

35 files changed: +666 -807 lines

docs/developers/api-keys.mdx

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
---
title: API Keys
---

Before you can use the Reworkd API, you will need to create an API key for your organization. To do this:

1. Navigate to the organization API key page. You may either visit [https://auth.reworkd.ai/org/api_keys/](https://auth.reworkd.ai/org/api_keys/) directly or click the settings button in the organization menu dropdown.
<Frame>
  <img src="/images/organization-menu.png" />
</Frame>
2. Ensure you are on the `Organization API Keys` page.
3. Click the `New API key` button and create a new API key with a reasonable expiration date.

You should now be able to use your new API key to authenticate your requests. To do this, add the following header to your requests:
```
Authorization: Bearer <YOUR-API-KEY>
```
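
For example, a minimal sketch of an authenticated request using Python's `requests` library (the endpoint placeholder is illustrative; substitute the actual Reworkd API URL you are calling):

```python
import requests

API_KEY = "<YOUR-API-KEY>"

# Send the bearer token in the Authorization header of every request
response = requests.get(
    "<REWORKD-API-ENDPOINT>",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())
```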

docs/developers/file-downloads.mdx

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
---
title: Handling File Downloads
---
Different types of file downloads require different code strategies. This page outlines the various strategies you may take.

## Regular Download Links
Regular downloads occur when the file link is directly available within the HTML (typically in the `href` of an `<a>` tag). Clicking these links directly initiates a file download.

To handle these downloads:

1. Save the URL directly from the page.
2. Reworkd will then asynchronously visit and download the file, using `curl-cffi` to mimic browser behavior.

```python
# Select the link element
link = await sdk.page.query_selector('a.download')

# Get the URL directly
href = await link.get_attribute("href")

# Save the URL, Lambda will handle the download
await sdk.save_data({"download_url": href})
```

## Indirect Download Links

Indirect downloads happen when the direct link isn't immediately visible but becomes available after clicking a button or link.

To handle indirect downloads:

1. Click the button/link to open the URL.
2. Capture and save the newly loaded URL.
3. Automatically navigate back.

```python
# Select element to open page
element = await sdk.page.query_selector('button.download')

# Capture the URL after clicking
download_url = await sdk.capture_url(element)

# Save URL for download via Lambda
await sdk.save_data({"download_url": download_url})
```

## JavaScript/Dynamic Downloads

Dynamic downloads occur when a file download is triggered by JavaScript events directly in the browser, without a direct URL.

To handle dynamic downloads:

1. Use the `capture_download` method to trigger and capture the download directly in the browser.
2. Retrieve the file metadata (URL and title).

```python
# Select element triggering download
element = await sdk.page.query_selector('button.download')

# Capture download event directly
download_metadata = await sdk.capture_download(element)

# Save file metadata directly
await sdk.save_data({
    "attachment": {
        "download_url": download_metadata["url"],
        "title": download_metadata["title"],
    },
})
```

## Downloads Requiring Cookies/Session

Some sites require the download to occur within the same browser session that accessed the page, making AWS Lambda unsuitable.

In these cases:

- Follow the same approach as dynamic downloads, handling the download directly in the browser context using `capture_download` (see the sketch below).
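
In this case the code is essentially the same as the dynamic-download sketch above, run inside the page's existing session (selectors are illustrative):

```python
# Click the element inside the current browser session so existing
# cookies and session state are reused for the download
element = await sdk.page.query_selector('button.download')
download_metadata = await sdk.capture_download(element)

await sdk.save_data({
    "attachment": {
        "download_url": download_metadata["url"],
        "title": download_metadata["title"],
    },
})
```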

docs/developers/sdk.mdx

Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
---
title: Scraping SDK
---

As part of code generation, Reworkd generates code in its own custom SDK called [Harambe](https://github.com/reworkd/harambe).
Harambe is a web scraping SDK with a number of useful methods and features for:
- Saving data and validating that the data follows a specific schema
- Enqueuing (and automatically formatting) urls
- De-duplicating saved data, urls, etc.
- Effectively handling classic web scraping problems like pagination, PDFs, downloads, etc.

These methods, what they do, how they work, and some examples of how to use them are highlighted below.
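
To give a sense of how these pieces fit together, here is a rough sketch of a scraper that uses only the methods documented below; the function signature, selectors, and field names are illustrative rather than an exact Harambe entry point:

```python
async def scrape(sdk, current_url, context):
    # Save one row of data; it is validated against the configured schema
    title = await sdk.page.inner_text("h1.title")
    await sdk.save_data({"title": title, "source_url": current_url})

    # Enqueue detail-page urls found on this page for a later stage
    for link in await sdk.page.query_selector_all("a.detail"):
        href = await link.get_attribute("href")
        await sdk.enqueue(href)

    # Let the SDK drive pagination and re-run this code on every page
    async def pager():
        return await sdk.page.query_selector("a.next")

    await sdk.paginate(pager)
```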
---

## `save_data`
Save scraped data and validate that it matches the current schema.

**Signature:**
```python
def save_data(self, data: dict[str, Any]) -> None
```
**Params:**
- `data`: Rows of data (as dictionaries) to save

**Raises:**
- `SchemaValidationError`: If any of the saved data does not match the provided schema

**Example:**
```python
await sdk.save_data({ "title": "example", "description": "another_example" })
```
---

## `enqueue`
Enqueue url(s) to be scraped later.

**Signature:**
```python
def enqueue(self, urls: str | Awaitable[str], context: dict[str, Any] | None = None, options: dict[str, Any] | None = None) -> None
```
**Params:**
- `urls`: urls to enqueue
- `context`: additional context to pass to the next run of the next stage/url. Typically just data that is only available on the current page but required in the schema. Only use this when some data is available on this page, but not on the page that is enqueued.
- `options`: job level options to pass to the next stage/url

**Example:**
```python
await sdk.enqueue("https://www.test.com")
await sdk.enqueue("/some-path")  # This will automatically be converted into an absolute url
```
---

## `paginate`
SDK method to automatically facilitate paginating a list of elements.
Simply define a function that returns any of:
- A direct link to the next page
- An element with an href to the next page
- An element to click on to get to the next page

Then call `sdk.paginate` at the end of your scrape function. The returned element will automatically be used to paginate the site and run the scraping code against all pages.
Pagination concludes once all pages have been visited and no next-page element is found.
This method should ALWAYS be used for pagination instead of manual for loops and if statements.

**Signature:**
```python
def paginate(self, get_next_page_element: Callable[Ellipsis, Awaitable[str | playwright.async_api._generated.ElementHandle | None]], timeout: int = 2000) -> None
```
**Params:**
- `get_next_page_element`: the url or ElementHandle of the next page
- `timeout`: milliseconds to sleep for before continuing. Only use if there is no other wait option

**Example:**
```python
async def pager():
    return await sdk.page.query_selector("div.pagination > .pager.next")

await sdk.paginate(pager)
```
---

## `capture_url`
Capture the url of a click event. This will click the element and return the url via network request interception. This is useful for capturing urls that are generated dynamically (e.g. redirects to document downloads).

**Signature:**
```python
def capture_url(self, clickable: ElementHandle, resource_type: Literal[document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other, *] = 'document', timeout: int | None = 10000) -> str | None
```
**Params:**
- `clickable`: the element to click
- `resource_type`: the type of resource to capture
- `timeout`: the time to wait for the new page to open (in ms)

**Return Value:**
url: the url of the captured resource, or None if no match was found

**Raises:**
- `ValueError`: if more than one page is created by the click event
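
**Example** (a minimal sketch; the `a.view-report` selector is illustrative):
```python
# Click a link whose target URL is generated dynamically and capture it
element = await sdk.page.query_selector("a.view-report")
url = await sdk.capture_url(element, resource_type="document")

if url is not None:
    await sdk.save_data({"download_url": url})
```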
---

## `capture_download`
Capture a download event that gets triggered by clicking an element. This method will:
- Handle clicking the element
- Download the resulting file
- Apply download handling logic and build a download URL
- Return a download metadata object

Use this method to manually download dynamic files or files that can only be downloaded in the current browser session.

**Signature:**
```python
def capture_download(self, clickable: ElementHandle, override_filename: str | None = None, override_url: str | None = None, timeout: float | None = None) -> DownloadMeta
```
**Return Value:**
DownloadMeta: A typed dict containing the download metadata, such as the `url` and `filename`
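
**Example** (a minimal sketch; the selector is illustrative):
```python
# Click the element that triggers the download and capture the resulting file
element = await sdk.page.query_selector("button.download")
meta = await sdk.capture_download(element)

await sdk.save_data({"file_name": meta["filename"], "download_url": meta["url"]})
```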
---

## `capture_html`
Capture and download the html content of the document or a specific element.
The returned HTML will be cleaned of any excluded elements and will be wrapped in a proper HTML document structure.

**Signature:**
```python
def capture_html(self, selector: str = 'html', exclude_selectors: list[str] | None = None, soup_transform: Callable[BeautifulSoup, None] | None = None, html_converter_type: Literal[markdown, text] = 'markdown') -> HTMLMetadata
```
**Params:**
- `selector`: CSS selector of the element to capture. Defaults to "html" for the document element.
- `exclude_selectors`: List of CSS selectors for elements to exclude from capture.
- `soup_transform`: A function to transform the BeautifulSoup html prior to saving. Use this to remove aspects of the returned content.
- `html_converter_type`: Type of HTML converter to use for the inner text. Defaults to "markdown".

**Return Value:**
HTMLMetadata containing the `html` of the element and its formatted `text`, along with the `url` and `filename` of the document

**Raises:**
- `ValueError`: If the specified selector doesn't match any element.

**Example:**
```python
meta = await sdk.capture_html(selector="div.content")
await sdk.save_data({"name": meta["filename"], "text": meta["text"], "download_url": meta["url"]})
```
---

## `capture_pdf`
Capture the current page as a pdf, then apply download handling logic from the observer to transform it into a usable URL.

**Signature:**
```python
def capture_pdf(self) -> DownloadMeta
```
**Return Value:**
DownloadMeta: A typed dict containing the download metadata, such as the `url` and `filename`

**Example:**
```python
meta = await sdk.capture_pdf()
await sdk.save_data({"file_name": meta["filename"], "download_url": meta["url"]})
```

docs/development/auth.mdx

Lines changed: 0 additions & 63 deletions
This file was deleted.

docs/development/memory.mdx

Lines changed: 0 additions & 35 deletions
This file was deleted.
