
Commit 8dd28fc

docs: improve docstrings of storages (#465)
### Description

- Improve docstrings of storage classes.
- I also changed the list of main classes to at least roughly reflect the current public interface.

### Issues

- Relates: #304

### Testing

- Website was rendered locally.

### Checklist

- [x] CI passed
1 parent dbf3b2e commit 8dd28fc

9 files changed (+169, -138 lines)


.github/workflows/check_pr_title.yaml

Lines changed: 2 additions & 2 deletions
@@ -2,11 +2,11 @@ name: Check PR title

on:
  pull_request_target:
-    types: [ opened, edited, synchronize ]
+    types: [opened, edited, synchronize]

jobs:
  check_pr_title:
-    name: 'Check PR title'
+    name: Check PR title
    runs-on: ubuntu-latest
    steps:
      - uses: amannn/[email protected]

.github/workflows/docs.yaml

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+name: docs
+
+on:
+  push:
+    branches:
+      - master
+  workflow_dispatch:
+
+jobs:
+  build:
+    environment:
+      name: github-pages
+    permissions:
+      contents: write
+      pages: write
+      id-token: write
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Use Node.js 20
+        uses: actions/setup-node@v4
+        with:
+          node-version: 20
+
+      - name: Enable corepack
+        run: |
+          corepack enable
+          corepack prepare yarn@stable --activate
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: 3.12
+
+      - name: Install Python dependencies
+        run: make install-dev
+
+      - name: Build generated API reference
+        run: make build-api-reference
+
+      - name: Build & deploy docs
+        run: |
+          # go to website dir
+          cd website
+          # install website deps
+          yarn
+          # build the docs
+          yarn build
+        env:
+          APIFY_SIGNING_TOKEN: ${{ secrets.APIFY_SIGNING_TOKEN }}
+
+      - name: Set up GitHub Pages
+        uses: actions/configure-pages@v5
+
+      - name: Upload GitHub Pages artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: ./website/build
+
+      - name: Deploy artifact to GitHub Pages
+        uses: actions/deploy-pages@v4

.github/workflows/docs.yml

Lines changed: 0 additions & 63 deletions
This file was deleted.

.github/workflows/run_release.yaml

Lines changed: 2 additions & 2 deletions
@@ -23,7 +23,7 @@ on:
        description: The custom version to bump to (only for "custom" type)
        required: false
        type: string
-        default: ''
+        default: ""

jobs:
  # This job determines if the conditions are met for a release to occur. It will proceed if triggered manually,
@@ -110,7 +110,7 @@ jobs:
      with:
        author_name: Apify Release Bot
        author_email: [email protected]
-        message: 'chore(release): Update changelog and package version [skip ci]'
+        message: "chore(release): Update changelog and package version [skip ci]"

  create_github_release:
    name: Create github release

src/crawlee/_request.py

Lines changed: 5 additions & 5 deletions
@@ -37,14 +37,14 @@ class BaseRequestData(BaseModel):
    """URL of the web page to crawl"""

    unique_key: Annotated[str, Field(alias='uniqueKey')]
-    """A unique key identifying the request. Two requests with the same `uniqueKey` are considered as pointing to the
-    same URL.
+    """A unique key identifying the request. Two requests with the same `unique_key` are considered as pointing
+    to the same URL.

-    If `uniqueKey` is not provided, then it is automatically generated by normalizing the URL.
-    For example, the URL of `HTTP://www.EXAMPLE.com/something/` will produce the `uniqueKey`
+    If `unique_key` is not provided, then it is automatically generated by normalizing the URL.
+    For example, the URL of `HTTP://www.EXAMPLE.com/something/` will produce the `unique_key`
    of `http://www.example.com/something`.

-    Pass an arbitrary non-empty text value to the `uniqueKey` property
+    Pass an arbitrary non-empty text value to the `unique_key` property
    to override the default behavior and specify which URLs shall be considered equal.
    """

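For context, the normalization described in the new docstring can be sketched in a few lines. The `normalize_url` helper below is hypothetical and is not part of this commit or of Crawlee's actual `unique_key` logic; it only illustrates the documented example of lowercasing the scheme and host and dropping a trailing slash:

```python
# Hypothetical illustration only - Crawlee computes `unique_key` internally;
# this sketch merely reproduces the transformation shown in the docstring.
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase the scheme and host and strip a trailing slash from the path."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip('/'), parts.query, parts.fragment)
    )


print(normalize_url('HTTP://www.EXAMPLE.com/something/'))  # http://www.example.com/something
```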

src/crawlee/storages/_dataset.py

Lines changed: 21 additions & 11 deletions
@@ -75,22 +75,32 @@ class ExportToKwargs(TypedDict):


class Dataset(BaseStorage):
-    """Represents an append-only structured storage, ideal for tabular data akin to database tables.
+    """Represents an append-only structured storage, ideal for tabular data similar to database tables.

-    Represents a structured data store similar to a table, where each object (row) has consistent attributes (columns).
-    Datasets operate on an append-only basis, allowing for the addition of new records without the modification or
-    removal of existing ones. This class is typically used for storing crawling results.
+    The `Dataset` class is designed to store structured data, where each entry (row) maintains consistent attributes
+    (columns) across the dataset. It operates in an append-only mode, allowing new records to be added, but not
+    modified or deleted. This makes it particularly useful for storing results from web crawling operations.

-    Data can be stored locally or in the cloud, with local storage paths formatted as:
-    `{CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json`. Here, `{DATASET_ID}` is either "default" or
-    a specific dataset ID, and `{INDEX}` represents the zero-based index of the item in the dataset.
+    Data can be stored either locally or in the cloud. It depends on the setup of underlying storage client.
+    By default a `MemoryStorageClient` is used, but it can be changed to a different one.

-    To open a dataset, use the `open` class method with an `id`, `name`, or `config`. If unspecified, the default
-    dataset for the current crawler run is used. Opening a non-existent dataset by `id` raises an error, while
-    by `name`, it is created.
+    By default, data is stored using the following path structure:
+    ```
+    {CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json
+    ```
+    - `{CRAWLEE_STORAGE_DIR}`: The root directory for all storage data specified by the environment variable.
+    - `{DATASET_ID}`: Specifies the dataset, either "default" or a custom dataset ID.
+    - `{INDEX}`: Represents the zero-based index of the record within the dataset.
+
+    To open a dataset, use the `open` class method by specifying an `id`, `name`, or `configuration`. If none are
+    provided, the default dataset for the current crawler run is used. Attempting to open a dataset by `id` that does
+    not exist will raise an error; however, if accessed by `name`, the dataset will be created if it doesn't already
+    exist.

    Usage:
-        dataset = await Dataset.open(id='my_dataset_id')
+    ```python
+    dataset = await Dataset.open(name='my_dataset')
+    ```
    """

    _MAX_PAYLOAD_SIZE = ByteSize.from_mb(9)
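For context, a slightly fuller usage sketch of the append-only interface described by the new docstring might look as follows. This is not part of the commit; the record fields are made up, and it assumes `Dataset` is importable from `crawlee.storages` with `push_data` and `get_data` available, as in the current public interface:

```python
import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    # Opening by name creates the dataset if it does not already exist.
    dataset = await Dataset.open(name='my_dataset')

    # Records can only be appended; existing items are never modified or removed.
    await dataset.push_data({'url': 'https://example.com', 'title': 'Example Domain'})
    await dataset.push_data([{'url': 'https://example.com/a'}, {'url': 'https://example.com/b'}])

    # Read the stored items back.
    data = await dataset.get_data()
    print(data.items)


asyncio.run(main())
```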

src/crawlee/storages/_key_value_store.py

Lines changed: 26 additions & 16 deletions
@@ -15,24 +15,34 @@


class KeyValueStore(BaseStorage):
-    """Represents a key-value based storage for reading data records or files.
-
-    Each record is identified by a unique key and associated with a MIME content type. This class is used within
-    crawler runs to store inputs and outputs, typically in JSON format, but supports other types as well.
-
-    The data can be stored on a local filesystem or in the cloud, determined by the `CRAWLEE_STORAGE_DIR`
-    environment variable.
-
-    By default, data is stored in `{CRAWLEE_STORAGE_DIR}/key_value_stores/{STORE_ID}/{INDEX}.{EXT}`, where
-    `{STORE_ID}` is either "default" or specified by `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID`, `{KEY}` is the record key,
-    and `{EXT}` is the MIME type.
-
-    To open a key-value store, use the class method `open`, providing either an `id` or `name` along with optional
-    `config`. If neither is provided, the default store for the crawler run is used. Opening a non-existent store by
-    `id` raises an error, while a non-existent store by `name` is created.
+    """Represents a key-value based storage for reading and writing data records or files.
+
+    Each data record is identified by a unique key and associated with a specific MIME content type. This class is
+    commonly used in crawler runs to store inputs and outputs, typically in JSON format, but it also supports other
+    content types.
+
+    Data can be stored either locally or in the cloud. It depends on the setup of underlying storage client.
+    By default a `MemoryStorageClient` is used, but it can be changed to a different one.
+
+    By default, data is stored using the following path structure:
+    ```
+    {CRAWLEE_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT}
+    ```
+    - `{CRAWLEE_STORAGE_DIR}`: The root directory for all storage data specified by the environment variable.
+    - `{STORE_ID}`: The identifier for the key-value store, either "default" or as specified by
+      `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID`.
+    - `{KEY}`: The unique key for the record.
+    - `{EXT}`: The file extension corresponding to the MIME type of the content.
+
+    To open a key-value store, use the `open` class method, providing an `id`, `name`, or optional `configuration`.
+    If none are specified, the default store for the current crawler run is used. Attempting to open a store by `id`
+    that does not exist will raise an error; however, if accessed by `name`, the store will be created if it does not
+    already exist.

    Usage:
-        kvs = await KeyValueStore.open(id='my_kvs_id')
+    ```python
+    kvs = await KeyValueStore.open(name='my_kvs')
+    ```
    """

    def __init__(
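Similarly, a short usage sketch of the key-value interface described above might look like this. It is not part of the commit; it assumes `KeyValueStore` is importable from `crawlee.storages` with `set_value` and `get_value` available:

```python
import asyncio

from crawlee.storages import KeyValueStore


async def main() -> None:
    # Opening by name creates the store if it does not already exist.
    kvs = await KeyValueStore.open(name='my_kvs')

    # Values are serialized according to their content type (JSON by default).
    await kvs.set_value('some-key', {'foo': 'bar'})

    # Read the record back; a missing key yields None.
    value = await kvs.get_value('some-key')
    print(value)


asyncio.run(main())
```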
