Commit 5dbd212
feat: Add support for NDU storages (#1401)
### Description

- The alias version was implemented; see #1175 (comment) for more context.
- Storage with `alias='default'` is the default unnamed storage.

### Issues

- Closes: #1175

### Testing

- New tests were implemented.

### Checklist

- [x] CI passed
1 parent 9843350 commit 5dbd212

21 files changed: +1,918 −130 lines
docs/guides/code_examples/storages/opening.py

Lines changed: 19 additions & 0 deletions

@@ -0,0 +1,19 @@
+import asyncio
+
+from crawlee.storages import Dataset
+
+
+async def main() -> None:
+    # Named storage (persists across runs)
+    dataset_named = await Dataset.open(name='my-persistent-dataset')
+
+    # Unnamed storage with alias (purged on start)
+    dataset_unnamed = await Dataset.open(alias='temporary-results')
+
+    # Default unnamed storage (both are equivalent and purged on start)
+    dataset_default = await Dataset.open()
+    dataset_default = await Dataset.open(alias='default')
+
+
+if __name__ == '__main__':
+    asyncio.run(main())

docs/guides/storages.mdx

Lines changed: 22 additions & 9 deletions
@@ -9,6 +9,8 @@ import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
 import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

+import OpeningExample from '!!raw-loader!roa-loader!./code_examples/storages/opening.py';
+
 import RqBasicExample from '!!raw-loader!roa-loader!./code_examples/storages/rq_basic_example.py';
 import RqWithCrawlerExample from '!!raw-loader!roa-loader!./code_examples/storages/rq_with_crawler_example.py';
 import RqWithCrawlerExplicitExample from '!!raw-loader!roa-loader!./code_examples/storages/rq_with_crawler_explicit_example.py';
@@ -26,7 +28,9 @@ import KvsWithCrawlerExplicitExample from '!!raw-loader!roa-loader!./code_exampl
 import CleaningDoNotPurgeExample from '!!raw-loader!roa-loader!./code_examples/storages/cleaning_do_not_purge_example.py';
 import CleaningPurgeExplicitlyExample from '!!raw-loader!roa-loader!./code_examples/storages/cleaning_purge_explicitly_example.py';

-Crawlee offers several storage types for managing and persisting your crawling data. Request-oriented storages, such as the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, help you store and deduplicate URLs, while result-oriented storages, like <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, focus on storing and retrieving scraping results. This guide helps you choose the storage type that suits your needs.
+Crawlee offers several storage types for managing and persisting your crawling data. Request-oriented storages, such as the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, help you store and deduplicate URLs, while result-oriented storages, like <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, focus on storing and retrieving scraping results. This guide explains when to use each type, how to interact with them, and how to control their lifecycle.
+
+## Overview

 Crawlee's storage system consists of two main layers:
 - **Storages** (<ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>): High-level interfaces for interacting with different storage types.
@@ -70,6 +74,21 @@ Storage --|> KeyValueStore
 Storage --|> RequestQueue
 ```

+### Named and unnamed storages
+
+Crawlee supports two types of storages:
+
+- **Named storages**: Storages with a specific name that persist across runs. These are useful when you want to share data between different crawler runs or access the same storage from multiple places.
+- **Unnamed storages**: Temporary storages identified by an alias that are scoped to a single run. These are automatically purged at the start of each run (when `purge_on_start` is enabled, which is the default).
+
+### Default storage
+
+Each storage type (<ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>) has a default instance that can be accessed without specifying `id`, `name`, or `alias`. The default unnamed storage is accessed by calling the storage's `open` method without parameters; this is the most common way to use storages in simple crawlers. The special alias `"default"` is equivalent to calling `open` without parameters.
+
+<RunnableCodeBlock className="language-python" language="python">
+    {OpeningExample}
+</RunnableCodeBlock>
+
 ## Request queue

 The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> is the primary storage for URLs in Crawlee, especially useful for deep crawling. It supports dynamic addition of URLs, making it ideal for recursive tasks where URLs are discovered and added during the crawling process (e.g., following links across multiple pages). Each Crawlee project has a **default request queue**, which can be used to store URLs during a specific run.
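The new docs section above demonstrates `Dataset`; the same `name`/`alias`/default semantics apply to the other storage types as well. A minimal sketch (the storage names and aliases here are illustrative, not taken from the commit):

```python
import asyncio

from crawlee.storages import KeyValueStore, RequestQueue


async def main() -> None:
    # Named storage: persists across runs and can be shared between them.
    config_store = await KeyValueStore.open(name='crawler-config')

    # Unnamed storage identified by an alias: scoped to a single run and
    # purged on start when `purge_on_start` is enabled (the default).
    scratch_store = await KeyValueStore.open(alias='scratch')

    # Request queues follow the same rules.
    retry_queue = await RequestQueue.open(alias='retry-later')

    # No arguments (or alias='default') opens the default unnamed storage.
    default_queue = await RequestQueue.open()


if __name__ == '__main__':
    asyncio.run(main())
```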
@@ -186,13 +205,7 @@ Crawlee provides the following helper function to simplify interactions with the

 ## Cleaning up the storages

-By default, Crawlee automatically cleans up **default storages** before each crawler run to ensure a clean state. This behavior is controlled by the <ApiLink to="class/Configuration#purge_on_start">`Configuration.purge_on_start`</ApiLink> setting (default: `True`).
-
-### What gets purged
-
-- **Default storages** are completely removed and recreated at the start of each run, ensuring that you start with a clean slate.
-- **Named storages** are never automatically purged and persist across runs.
-- The behavior depends on the storage client implementation.
+By default, Crawlee cleans up all unnamed storages (including the default one) at the start of each run, so every crawl begins with a clean state. This behavior is controlled by <ApiLink to="class/Configuration#purge_on_start">`Configuration.purge_on_start`</ApiLink> (default: `True`). In contrast, named storages are never purged automatically and persist across runs. The exact behavior may vary depending on the storage client implementation.

 ### When purging happens

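To keep unnamed storages between runs, purging can be switched off. A short sketch, assuming the global configuration is registered via Crawlee's service locator:

```python
from crawlee import service_locator
from crawlee.configuration import Configuration

# Disable purging so unnamed (alias and default) storages survive
# between runs; named storages are unaffected either way.
service_locator.set_configuration(Configuration(purge_on_start=False))
```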
@@ -221,6 +234,6 @@ Note that purging behavior may vary between storage client implementations. For

 ## Conclusion

-This guide introduced you to the different storage types available in Crawlee and how to interact with them. You learned how to manage requests using the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> and store and retrieve scraping results using the <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>. You also discovered how to use helper functions to simplify interactions with these storages. Finally, you learned how to clean up storages before starting a crawler run.
+This guide introduced you to the different storage types available in Crawlee and how to interact with them. You learned about the distinction between named storages (persistent across runs) and unnamed storages with aliases (temporary and purged on start). You discovered how to manage requests using the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> and store and retrieve scraping results using the <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>. You also learned how to use helper functions to simplify interactions with these storages and how to control storage cleanup behavior.

 If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

src/crawlee/_types.py

Lines changed: 18 additions & 8 deletions
@@ -189,6 +189,7 @@ class PushDataFunctionCall(PushDataKwargs):
     data: list[dict[str, Any]] | dict[str, Any]
     dataset_id: str | None
     dataset_name: str | None
+    dataset_alias: str | None


 class KeyValueStoreInterface(Protocol):
@@ -255,7 +256,7 @@ def __init__(self, *, key_value_store_getter: GetKeyValueStoreFunction) -> None:
         self._key_value_store_getter = key_value_store_getter
         self.add_requests_calls = list[AddRequestsKwargs]()
         self.push_data_calls = list[PushDataFunctionCall]()
-        self.key_value_store_changes = dict[tuple[str | None, str | None], KeyValueStoreChangeRecords]()
+        self.key_value_store_changes = dict[tuple[str | None, str | None, str | None], KeyValueStoreChangeRecords]()

     async def add_requests(
         self,
@@ -270,6 +271,7 @@ async def push_data(
         data: list[dict[str, Any]] | dict[str, Any],
         dataset_id: str | None = None,
         dataset_name: str | None = None,
+        dataset_alias: str | None = None,
         **kwargs: Unpack[PushDataKwargs],
     ) -> None:
         """Track a call to the `push_data` context helper."""
@@ -278,6 +280,7 @@
                 data=data,
                 dataset_id=dataset_id,
                 dataset_name=dataset_name,
+                dataset_alias=dataset_alias,
                 **kwargs,
             )
         )
@@ -287,13 +290,14 @@ async def get_key_value_store(
         *,
         id: str | None = None,
         name: str | None = None,
+        alias: str | None = None,
     ) -> KeyValueStoreInterface:
-        if (id, name) not in self.key_value_store_changes:
-            self.key_value_store_changes[id, name] = KeyValueStoreChangeRecords(
-                await self._key_value_store_getter(id=id, name=name)
+        if (id, name, alias) not in self.key_value_store_changes:
+            self.key_value_store_changes[id, name, alias] = KeyValueStoreChangeRecords(
+                await self._key_value_store_getter(id=id, name=name, alias=alias)
             )

-        return self.key_value_store_changes[id, name]
+        return self.key_value_store_changes[id, name, alias]


 @docs_group('Functions')
@@ -424,12 +428,14 @@ def __call__(
         *,
         id: str | None = None,
         name: str | None = None,
+        alias: str | None = None,
     ) -> Coroutine[None, None, KeyValueStore]:
         """Call dunder method.

         Args:
             id: The ID of the `KeyValueStore` to get.
-            name: The name of the `KeyValueStore` to get.
+            name: The name of the `KeyValueStore` to get (global scope, named storage).
+            alias: The alias of the `KeyValueStore` to get (run scope, unnamed storage).
         """

@@ -444,12 +450,14 @@ def __call__(
         *,
         id: str | None = None,
         name: str | None = None,
+        alias: str | None = None,
     ) -> Coroutine[None, None, KeyValueStoreInterface]:
         """Call dunder method.

         Args:
             id: The ID of the `KeyValueStore` to get.
-            name: The name of the `KeyValueStore` to get.
+            name: The name of the `KeyValueStore` to get (global scope, named storage).
+            alias: The alias of the `KeyValueStore` to get (run scope, unnamed storage).
         """

@@ -466,14 +474,16 @@ def __call__(
         data: list[dict[str, Any]] | dict[str, Any],
         dataset_id: str | None = None,
         dataset_name: str | None = None,
+        dataset_alias: str | None = None,
         **kwargs: Unpack[PushDataKwargs],
     ) -> Coroutine[None, None, None]:
         """Call dunder method.

         Args:
             data: The data to push to the `Dataset`.
             dataset_id: The ID of the `Dataset` to push the data to.
-            dataset_name: The name of the `Dataset` to push the data to.
+            dataset_name: The name of the `Dataset` to push the data to (global scope, named storage).
+            dataset_alias: The alias of the `Dataset` to push the data to (run scope, unnamed storage).
             **kwargs: Additional keyword arguments.
         """

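With the context helpers extended this way, a request handler can target an aliased store, and the `(id, name, alias)` key ensures the recorded changes are committed to the right storage after the handler finishes. A sketch (the crawler class and the alias values are illustrative, not from the commit):

```python
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

crawler = HttpCrawler()


@crawler.router.default_handler
async def handler(context: HttpCrawlingContext) -> None:
    # Changes are tracked per (id, name, alias) and committed once the
    # handler completes successfully.
    kvs = await context.get_key_value_store(alias='per-run-cache')
    await kvs.set_value('last-url', context.request.url)

    # `push_data` gains the matching `dataset_alias` keyword.
    await context.push_data({'url': context.request.url}, dataset_alias='results')
```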
src/crawlee/crawlers/_basic/_basic_crawler.py

Lines changed: 19 additions & 11 deletions
@@ -557,18 +557,20 @@ async def get_dataset(
         *,
         id: str | None = None,
         name: str | None = None,
+        alias: str | None = None,
     ) -> Dataset:
         """Return the `Dataset` with the given ID or name. If none is provided, return the default one."""
-        return await Dataset.open(id=id, name=name)
+        return await Dataset.open(id=id, name=name, alias=alias)

     async def get_key_value_store(
         self,
         *,
         id: str | None = None,
         name: str | None = None,
+        alias: str | None = None,
     ) -> KeyValueStore:
         """Return the `KeyValueStore` with the given ID or name. If none is provided, return the default KVS."""
-        return await KeyValueStore.open(id=id, name=name)
+        return await KeyValueStore.open(id=id, name=name, alias=alias)

     def error_handler(
         self, handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]
@@ -772,6 +774,7 @@ async def get_data(
         self,
         dataset_id: str | None = None,
         dataset_name: str | None = None,
+        dataset_alias: str | None = None,
         **kwargs: Unpack[GetDataKwargs],
     ) -> DatasetItemsListPage:
         """Retrieve data from a `Dataset`.
@@ -781,20 +784,22 @@

         Args:
             dataset_id: The ID of the `Dataset`.
-            dataset_name: The name of the `Dataset`.
+            dataset_name: The name of the `Dataset` (global scope, named storage).
+            dataset_alias: The alias of the `Dataset` (run scope, unnamed storage).
             kwargs: Keyword arguments to be passed to the `Dataset.get_data()` method.

         Returns:
             The retrieved data.
         """
-        dataset = await Dataset.open(id=dataset_id, name=dataset_name)
+        dataset = await Dataset.open(id=dataset_id, name=dataset_name, alias=dataset_alias)
         return await dataset.get_data(**kwargs)

     async def export_data(
         self,
         path: str | Path,
         dataset_id: str | None = None,
         dataset_name: str | None = None,
+        dataset_alias: str | None = None,
     ) -> None:
         """Export all items from a Dataset to a JSON or CSV file.

@@ -804,10 +809,11 @@

         Args:
             path: The destination file path. Must end with '.json' or '.csv'.
-            dataset_id: The ID of the Dataset to export from. If None, uses `name` parameter instead.
-            dataset_name: The name of the Dataset to export from. If None, uses `id` parameter instead.
+            dataset_id: The ID of the Dataset to export from.
+            dataset_name: The name of the Dataset to export from (global scope, named storage).
+            dataset_alias: The alias of the Dataset to export from (run scope, unnamed storage).
         """
-        dataset = await self.get_dataset(id=dataset_id, name=dataset_name)
+        dataset = await self.get_dataset(id=dataset_id, name=dataset_name, alias=dataset_alias)

         path = path if isinstance(path, Path) else Path(path)
         dst = path.open('w', newline='')
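On the crawler API, data collected under an alias can then be exported or queried with the new `dataset_alias` keyword. A sketch (the URL, alias, and output path are illustrative, not from the commit):

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        await context.push_data({'url': context.request.url}, dataset_alias='results')

    await crawler.run(['https://crawlee.dev'])

    # Export everything pushed to the aliased dataset; the path must end
    # with '.json' or '.csv'.
    await crawler.export_data('results.json', dataset_alias='results')

    # Or page through the items programmatically.
    page = await crawler.get_data(dataset_alias='results', limit=10)
    print(page.count)


if __name__ == '__main__':
    asyncio.run(main())
```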
@@ -824,6 +830,7 @@ async def _push_data(
         data: list[dict[str, Any]] | dict[str, Any],
         dataset_id: str | None = None,
         dataset_name: str | None = None,
+        dataset_alias: str | None = None,
         **kwargs: Unpack[PushDataKwargs],
     ) -> None:
         """Push data to a `Dataset`.
@@ -834,10 +841,11 @@
         Args:
             data: The data to push to the `Dataset`.
             dataset_id: The ID of the `Dataset`.
-            dataset_name: The name of the `Dataset`.
+            dataset_name: The name of the `Dataset` (global scope, named storage).
+            dataset_alias: The alias of the `Dataset` (run scope, unnamed storage).
             kwargs: Keyword arguments to be passed to the `Dataset.push_data()` method.
         """
-        dataset = await self.get_dataset(id=dataset_id, name=dataset_name)
+        dataset = await self.get_dataset(id=dataset_id, name=dataset_name, alias=dataset_alias)
         await dataset.push_data(data, **kwargs)

     def _should_retry_request(self, context: BasicCrawlingContext, error: Exception) -> bool:
@@ -1226,8 +1234,8 @@ async def _commit_key_value_store_changes(
         result: RequestHandlerRunResult, get_kvs: GetKeyValueStoreFromRequestHandlerFunction
     ) -> None:
         """Store key value store changes recorded in result."""
-        for (id, name), changes in result.key_value_store_changes.items():
-            store = await get_kvs(id=id, name=name)
+        for (id, name, alias), changes in result.key_value_store_changes.items():
+            store = await get_kvs(id=id, name=name, alias=alias)
             for key, value in changes.updates.items():
                 await store.set_value(key, value.content, value.content_type)

src/crawlee/storage_clients/_base/_storage_client.py

Lines changed: 3 additions & 0 deletions
@@ -34,6 +34,7 @@ async def create_dataset_client(
         *,
         id: str | None = None,
         name: str | None = None,
+        alias: str | None = None,
         configuration: Configuration | None = None,
     ) -> DatasetClient:
         """Create a dataset client."""
@@ -44,6 +45,7 @@ async def create_kvs_client(
         *,
         id: str | None = None,
         name: str | None = None,
+        alias: str | None = None,
         configuration: Configuration | None = None,
     ) -> KeyValueStoreClient:
         """Create a key-value store client."""
@@ -54,6 +56,7 @@ async def create_rq_client(
         *,
         id: str | None = None,
         name: str | None = None,
+        alias: str | None = None,
         configuration: Configuration | None = None,
     ) -> RequestQueueClient:
         """Create a request queue client."""
