feat: Add support for NDU storages #1401

vdusek · 2025-09-09T13:39:10Z

Description

The alias-based version was implemented; see the comment on "Add support for non-default unnamed storages" #1175 for more context.
Storage with alias='default' is the default unnamed storage.
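A minimal sketch of the rule stated above, in plain Python rather than Crawlee's actual code. The function name `resolve_storage_key` and the rule that `name` and `alias` cannot be combined are assumptions made for illustration; only the `'default'` alias comes from the description.

```python
from typing import Optional, Tuple

DEFAULT_ALIAS = "default"  # taken from the PR description


def resolve_storage_key(
    name: Optional[str] = None,
    alias: Optional[str] = None,
) -> Tuple[str, str]:
    """Return a (kind, key) pair for a hypothetical storage lookup."""
    if name is not None and alias is not None:
        # Assumption: a storage is either named or aliased, never both.
        raise ValueError("Specify either name or alias, not both.")
    if name is not None:
        return ("named", name)
    # No name given: the storage is unnamed and identified by its alias.
    # Omitting the alias falls back to the default unnamed storage.
    return ("unnamed", alias if alias is not None else DEFAULT_ALIAS)
```

Under these assumptions, opening a storage with no arguments and opening it with `alias='default'` resolve to the same unnamed storage.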

Issues

Closes: Add support for non-default unnamed storages #1175

Testing

New tests were implemented.

Checklist

CI passed

vdusek · 2025-09-09T13:56:21Z

cc @Mantisus in context of #1339

janbuchar · 2025-09-09T14:51:00Z

Can you add some docs/examples please? 🙂

Mantisus

Looks great!

tests/unit/storages/test_dataset.py

tests/unit/storages/test_request_queue.py

vdusek · 2025-09-10T12:15:10Z

> Can you add some docs/examples please? 🙂

I extended the Storages guide.

tests/unit/storages/test_key_value_store.py

tests/unit/storages/test_request_queue.py

src/crawlee/crawlers/_basic/_basic_crawler.py

docs/guides/storages.mdx

src/crawlee/storage_clients/_file_system/_dataset_client.py

src/crawlee/storages/_storage_instance_manager.py

Pijukatel · 2025-09-11T07:15:27Z

Should StorageMetadata contain some information regarding the alias or scope? How will, for example, the Apify platform know what kind of storage it is?

vdusek · 2025-09-11T07:28:17Z

> Should StorageMetadata contain some information regarding the alias or scope? How will, for example Apify platform know what kind of storage it is?

I don't think so. On the Apify platform, the distinction between global-scope and run-scope storages is based solely on naming: named versus unnamed. The alias is just for us (the FS and Apify clients use it, or will use it).
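The point above can be illustrated with a small sketch. `ToyStorageMetadata` and `platform_scope` are hypothetical names, not part of Crawlee or the Apify API, and the mapping of named storages to global scope and unnamed storages to run scope is an assumption drawn from this comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyStorageMetadata:
    """Hypothetical metadata record; note that it carries no alias field."""

    id: str
    name: Optional[str]


def platform_scope(metadata: ToyStorageMetadata) -> str:
    # Assumption: the platform looks only at whether a name is present.
    return "global" if metadata.name is not None else "run"
```

In this sketch the alias never reaches the metadata, which matches the idea that it is a client-side lookup key only.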

Mantisus

LGTM

Pijukatel

Let's merge it and improve if needed based on real usage feedback.

janbuchar

Looks pretty good to me! I have some nitpicky comments, plus I'd like to ask if you already tried implementing this in ApifyStorageClient. It would be good to see that before merging this part (not mandatory I guess...).

janbuchar · 2025-09-12T12:48:52Z

src/crawlee/storage_clients/_memory/_key_value_store_client.py

        *,
        id: str | None,
        name: str | None,
+        alias: str | None,


Um, shouldn't this actually be, you know, used somewhere in the method? I imagine you could just do `if alias: name = alias`, but I don't see that here - am I missing something?

I guess this might work thanks to the StorageInstanceManager doing the actual work needed to distinguish the instances, but it's kinda hard to see at first, if that's the case. Also, we might want to put the alias in the metadata?

As we discussed on Slack, the alias does not have any effect on the memory storage client implementation. I added additional explanatory text to the docstrings.

> Also we might want to put the alias in the metadata?

We might, but I'm not 100% sure about it. I would say it is easier to add it later than to remove it later. So I would suggest keeping it as it is and adding it later if there is a reason or request for it.
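To make the division of labor in this thread concrete, here is a hedged sketch in which the client accepts `alias` but ignores it, while a manager keeps aliased and named instances apart. `ToyMemoryClient` and `ToyInstanceManager` are hypothetical stand-ins, not the real `MemoryStorageClient` or `StorageInstanceManager`, and the "exactly one of name or alias" rule is an assumption.

```python
from typing import Dict, Optional, Tuple


class ToyMemoryClient:
    """Hypothetical in-memory client that accepts `alias` but never reads it."""

    def __init__(self, *, name: Optional[str], alias: Optional[str]) -> None:
        # `alias` is accepted for interface parity but deliberately not stored.
        self.name = name
        self.items: list = []


class ToyInstanceManager:
    """Hypothetical cache that distinguishes instances by name or alias."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, Optional[str]], ToyMemoryClient] = {}

    def open(
        self,
        *,
        name: Optional[str] = None,
        alias: Optional[str] = None,
    ) -> ToyMemoryClient:
        # Assumption: callers pass exactly one of the two identifiers.
        if (name is None) == (alias is None):
            raise ValueError("Pass exactly one of name or alias.")
        key = ("name", name) if name is not None else ("alias", alias)
        if key not in self._cache:
            self._cache[key] = ToyMemoryClient(name=name, alias=alias)
        return self._cache[key]
```

Two calls with the same alias return the same object, and a named storage that happens to share the string stays separate, all without the client itself inspecting the alias.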

janbuchar · 2025-09-12T13:08:41Z

tests/unit/storages/test_dataset.py

+    # Clean up
+    await dataset_1.drop()


Doesn't the storage_client fixture take care of this? And if it doesn't, it definitely should...

The StorageClient class does not have access to the specific storage clients that were created using it, so it cannot drop them.

I see. I guess it's fine then, even though you could theoretically set up some monkey patching to get that access.
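One way to get automatic cleanup without monkey patching is to track opened storages in the test helper itself. The following is only a sketch of that idea: `ToyDataset` and `tracked_storages` are hypothetical and do not exist in Crawlee or its test suite.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import List


class ToyDataset:
    """Hypothetical stand-in for a storage; it only records whether it was dropped."""

    def __init__(self, alias: str) -> None:
        self.alias = alias
        self.dropped = False

    async def drop(self) -> None:
        self.dropped = True


@asynccontextmanager
async def tracked_storages():
    """Yield an opener that remembers every storage and drops them all on exit."""
    opened: List[ToyDataset] = []

    def open_dataset(alias: str) -> ToyDataset:
        dataset = ToyDataset(alias)
        opened.append(dataset)
        return dataset

    try:
        yield open_dataset
    finally:
        # Runs even if the test body raises, so nothing is left behind.
        for dataset in opened:
            await dataset.drop()


async def run_example() -> List[bool]:
    async with tracked_storages() as open_dataset:
        first = open_dataset("first")
        second = open_dataset("second")
    return [first.dropped, second.dropped]
```

A fixture built this way would replace the explicit `await dataset_1.drop()` calls, at the cost of routing every open call in the tests through the helper.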

vdusek · 2025-09-12T15:23:44Z

> Looks pretty good to me! I have some nitpicky comments, plus I'd like to ask if you already tried implementing this in ApifyStorageClient. It would be good to see that before merging this part (not mandatory I guess...).

I've got it working; it just needs some more polishing and tests. Anyway, there were no issues with the implementation, and it's pretty straightforward. Also, no breaking changes, as expected.

vdusek · 2025-09-12T16:01:34Z

I'm merging this as all the comments were clarified and/or addressed, so that we can adopt this in #1386 and #1339.

…ervices (#1386)

### Description

This is a collection of closely related changes that are hard to separate from one another. The main purpose is to enable flexible storage use across the code base without unexpected limitations and to limit unexpected side effects in global services.

#### Top-level changes:

- There can be multiple crawlers with different storage clients, configurations, or event managers. (Previously, this would cause `ServiceConflictError`.)
- `StorageInstanceManager` allows similar but different storage instances to be used at the same time. (Previously, similar storage instances could be incorrectly retrieved instead of creating a new storage instance.)
- Differently configured storages can be used at the same time, even storages that use the same `StorageClient` and differ only in their `Configuration`.
- `Crawler` can no longer cause side effects in the global `service_locator` (apart from adding new instances to `StorageInstanceManager`).
- The global `service_locator` can be used at the same time as local instances of `ServiceLocator` (for example, each Crawler has its own `ServiceLocator` instance, which does not interfere with the global `service_locator`).
- Services in `ServiceLocator` can be set only once. Any attempt to reset them will throw an error. Not setting the services and using them is possible. That will set services in `ServiceLocator` to some implicit default, and it will log warnings, as implicit services can lead to hard-to-predict code. The preferred way is to set services explicitly, either manually or through some helper code, for example, through `Actor`. [See related PR](apify/apify-sdk-python#576)

#### Implementation notes:

- Storage caching now supports all relevant ways to distinguish storage instances. Apart from generic parameters like `name`, `id`, `storage_type`, and `storage_client_type`, there is also an `additional_cache_key`. This can be used by the `StorageClient` to define a unique way to distinguish between two similar but different instances. For example, `FileSystemStorageClient` depends on `Configuration.storage_dir`, which is included in the custom cache key for `FileSystemStorageClient`, but this is not true for `MemoryStorageClient`, as the `storage_dir` is not relevant for it. See the example below. (This `additional_cache_key` could possibly be used for caching of NDU in #1401.)

  ```python
  storage_client = FileSystemStorageClient()
  d1 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))
  d2 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path2"))
  d3 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))
  assert d2 is not d1
  assert d3 is d1

  storage_client_2 = MemoryStorageClient()
  d4 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path1"))
  d5 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path2"))
  assert d4 is d5
  ```

- Each crawler will create its own instance of `ServiceLocator`. It will use either the services explicitly passed to the crawler init (configuration, storage client, event_manager) or the services from the global `service_locator` as implicit defaults. This allows multiple differently configured crawlers to work in the same code. For example:

  ```python
  custom_configuration_1 = Configuration()
  custom_event_manager_1 = LocalEventManager.from_config(custom_configuration_1)
  custom_storage_client_1 = MemoryStorageClient()

  custom_configuration_2 = Configuration()
  custom_event_manager_2 = LocalEventManager.from_config(custom_configuration_2)
  custom_storage_client_2 = MemoryStorageClient()

  crawler_1 = BasicCrawler(
      configuration=custom_configuration_1,
      event_manager=custom_event_manager_1,
      storage_client=custom_storage_client_1,
  )

  crawler_2 = BasicCrawler(
      configuration=custom_configuration_2,
      event_manager=custom_event_manager_2,
      storage_client=custom_storage_client_2,
  )

  # use crawlers without runtime crash...
  ```

- `ServiceLocator` is now much stricter when it comes to setting the services. Previously, it allowed changing services until some service had its `_was_retrieved` flag set to `True`. Then it would throw a runtime error. This led to hard-to-predict code, as the global `service_locator` could be changed as a side effect from many places. Now the services in `ServiceLocator` can be set only once, and the side effects of attempting to change the services are limited as much as possible. Such side effects are also accompanied by warning messages to draw attention to code that could cause RuntimeError.

### Issues

Closes: #1379

Connected to:
- #1354 (through necessary changes in `StorageInstanceManager`)
- apify/apify-sdk-python#513 (through necessary changes in `StorageInstanceManager` and storage clients/configuration related changes in `service_locator`)

### Testing

- New unit tests were added.
- Tested on the `Apify` platform together with SDK changes in [related PR](apify/apify-sdk-python#576)

---------

Co-authored-by: Vlada Dusek <[email protected]>
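The set-once behavior described in the commit message above can be sketched as follows. This is not Crawlee's `ServiceLocator`: the class `ToyServiceLocator`, the error type `ServiceAlreadySetError`, and the use of a plain `dict` as a configuration are all assumptions made for illustration, and treating an implicit default as "already set" is one possible reading of the description.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class ServiceAlreadySetError(RuntimeError):
    """Hypothetical error raised when a service is set a second time."""


class ToyServiceLocator:
    """Hypothetical locator holding one service that can be set only once."""

    def __init__(self) -> None:
        self._configuration: Optional[dict] = None

    def set_configuration(self, configuration: dict) -> None:
        if self._configuration is not None:
            raise ServiceAlreadySetError("Configuration was already set.")
        self._configuration = configuration

    def get_configuration(self) -> dict:
        if self._configuration is None:
            # Using a service before setting it installs an implicit default
            # and warns, because implicit services are hard to reason about.
            logger.warning("No configuration set; using an implicit default.")
            self._configuration = {}
        return self._configuration
```

Giving each crawler its own locator of this kind, as the commit message describes, keeps one crawler's explicit services from leaking into another's.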

vdusek self-assigned this Sep 9, 2025

vdusek added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 9, 2025

github-actions bot added this to the 123rd sprint - Tooling team milestone Sep 9, 2025

github-actions bot added the tested Temporary label used only programatically for some analytics. label Sep 9, 2025

vdusek force-pushed the add-ndu-storages branch from 0a2262b to 73d4ee6 Compare September 9, 2025 13:40

vdusek requested review from janbuchar, Pijukatel and Mantisus September 9, 2025 13:54

vdusek added the adhoc Ad-hoc unplanned task added during the sprint. label Sep 9, 2025

vdusek marked this pull request as ready for review September 9, 2025 13:55

Mantisus reviewed Sep 10, 2025

tests/unit/storages/test_dataset.py (outdated, resolved)

tests/unit/storages/test_request_queue.py (outdated, resolved)

vdusek requested a review from Mantisus September 10, 2025 09:28

Mantisus reviewed Sep 10, 2025

tests/unit/storages/test_key_value_store.py (outdated, resolved)

tests/unit/storages/test_request_queue.py (outdated, resolved)

src/crawlee/crawlers/_basic/_basic_crawler.py (outdated, resolved)

Pijukatel mentioned this pull request Sep 10, 2025

refactor!: Refactor storage creation and caching, configuration and services #1386

Merged

vdusek added 6 commits September 11, 2025 09:05

feat: Add support for NDU storages

9990dca

Improve FS clients creation

eab68bc

Improve memory clients creation

19b7ea3

Improve tests

5d39923

Add docs

f8aba0c

address feedback

50dde42

vdusek force-pushed the add-ndu-storages branch from cb120b3 to 50dde42 Compare September 11, 2025 07:08

vdusek requested a review from Mantisus September 11, 2025 07:08

Pijukatel reviewed Sep 11, 2025

docs/guides/storages.mdx (resolved)

src/crawlee/storage_clients/_file_system/_dataset_client.py (resolved)

src/crawlee/storages/_storage_instance_manager.py (resolved)

vdusek requested a review from Pijukatel September 11, 2025 07:28

fix

ff4058f

Mantisus approved these changes Sep 11, 2025

Pijukatel approved these changes Sep 12, 2025

janbuchar reviewed Sep 12, 2025

Add comment about alias in memory storages

304dce6

vdusek requested a review from janbuchar September 12, 2025 15:59

vdusek merged commit 5dbd212 into master Sep 12, 2025
19 checks passed

vdusek deleted the add-ndu-storages branch September 12, 2025 16:01

janbuchar mentioned this pull request Oct 7, 2025

fix: Use Self type in the open() method of storage clients #1462

Merged
