@@ -75,22 +75,32 @@ class ExportToKwargs(TypedDict):


class Dataset(BaseStorage):
- """Represents an append-only structured storage, ideal for tabular data akin to database tables.
78
+ """Represents an append-only structured storage, ideal for tabular data similar to database tables.
79
79
80
- Represents a structured data store similar to a table , where each object (row) has consistent attributes (columns).
81
- Datasets operate on an append-only basis , allowing for the addition of new records without the modification or
82
- removal of existing ones . This class is typically used for storing crawling results .
80
+    The `Dataset` class is designed to store structured data, where each entry (row) maintains consistent attributes
+    (columns) across the dataset. It operates in an append-only mode, allowing new records to be added, but not
+    modified or deleted. This makes it particularly useful for storing results from web crawling operations.

-    Data can be stored locally or in the cloud, with local storage paths formatted as:
-    `{CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json`. Here, `{DATASET_ID}` is either "default" or
-    a specific dataset ID, and `{INDEX}` represents the zero-based index of the item in the dataset.
+    Data can be stored either locally or in the cloud, depending on the setup of the underlying storage client.
+    By default, a `MemoryStorageClient` is used, but it can be changed to a different one.

-    To open a dataset, use the `open` class method with an `id`, `name`, or `config`. If unspecified, the default
-    dataset for the current crawler run is used. Opening a non-existent dataset by `id` raises an error, while
-    by `name`, it is created.
+    By default, data is stored using the following path structure:
+    ```
+    {CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json
+    ```
+    - `{CRAWLEE_STORAGE_DIR}`: The root directory for all storage data, specified by the environment variable.
+    - `{DATASET_ID}`: Specifies the dataset, either "default" or a custom dataset ID.
+    - `{INDEX}`: Represents the zero-based index of the record within the dataset.
+
+    To open a dataset, use the `open` class method by specifying an `id`, `name`, or `configuration`. If none are
+    provided, the default dataset for the current crawler run is used. Attempting to open a dataset by `id` that does
+    not exist will raise an error; however, if accessed by `name`, the dataset will be created if it doesn't already
+    exist.

    Usage:
-        dataset = await Dataset.open(id='my_dataset_id')
+    ```python
+    dataset = await Dataset.open(name='my_dataset')
+    ```
    """

    _MAX_PAYLOAD_SIZE = ByteSize.from_mb(9)
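
For context on the append-only behavior the new docstring describes, here is a minimal usage sketch. It is not part of this change; it assumes the `crawlee.storages` import path, the `push_data` and `get_data` dataset methods, and that `get_data` returns a page object with an `items` list.

```python
import asyncio

from crawlee.storages import Dataset  # assumed import path


async def main() -> None:
    # Open (or create) a named dataset backed by the configured storage client.
    dataset = await Dataset.open(name='my_dataset')

    # Records can only be appended; existing rows are never modified or deleted.
    await dataset.push_data({'url': 'https://example.com', 'title': 'Example'})
    await dataset.push_data({'url': 'https://example.org', 'title': 'Example Org'})

    # Read the accumulated rows back (assumes a list page with an `items` attribute).
    result = await dataset.get_data()
    for item in result.items:
        print(item)


asyncio.run(main())
```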
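The open-by-`id` versus open-by-`name` rule in the updated docstring can also be shown directly. A sketch under the same import-path assumption; the concrete exception raised for a missing `id` depends on the storage client, so it is caught broadly here.

```python
import asyncio

from crawlee.storages import Dataset  # assumed import path


async def main() -> None:
    # Opening by name creates the dataset if it does not exist yet.
    await Dataset.open(name='my_dataset')

    # Opening by id never creates anything; an unknown id raises an error.
    try:
        await Dataset.open(id='nonexistent-dataset-id')
    except Exception as exc:  # the exact exception type is storage-client specific
        print(f'Dataset not found: {exc}')


asyncio.run(main())
```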