Conversation

Collaborator

@vdusek vdusek commented Sep 9, 2025

Description

Issues

Testing

  • New tests were implemented.

Checklist

  • CI passed

@vdusek vdusek self-assigned this Sep 9, 2025
@vdusek vdusek added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 9, 2025
@github-actions github-actions bot added this to the 123rd sprint - Tooling team milestone Sep 9, 2025
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Sep 9, 2025
@vdusek vdusek added the adhoc Ad-hoc unplanned task added during the sprint. label Sep 9, 2025
@vdusek vdusek marked this pull request as ready for review September 9, 2025 13:55
@vdusek
Collaborator Author

vdusek commented Sep 9, 2025

cc @Mantisus in context of #1339

@janbuchar
Collaborator

Can you add some docs/examples please? 🙂

Collaborator

@Mantisus Mantisus left a comment

Looks great!

@vdusek vdusek requested a review from Mantisus September 10, 2025 09:28
@vdusek
Collaborator Author

vdusek commented Sep 10, 2025

> Can you add some docs/examples please? 🙂

I extended the Storages guide.

@Pijukatel
Collaborator

Should StorageMetadata contain some information regarding the alias or scope? How will, for example, the Apify platform know what kind of storage it is?

@vdusek
Collaborator Author

vdusek commented Sep 11, 2025

> Should StorageMetadata contain some information regarding the alias or scope? How will, for example, the Apify platform know what kind of storage it is?

I don't think so. On the Apify platform, the distinction between global-scope and run-scope storage is based purely on naming: named versus unnamed. The alias is only for our own use (the file-system and Apify storage clients use it, or will use it).

@vdusek vdusek requested a review from Pijukatel September 11, 2025 07:28
Collaborator

@Mantisus Mantisus left a comment

LGTM

Collaborator

@Pijukatel Pijukatel left a comment

Let's merge it and improve if needed based on real usage feedback.

Collaborator

@janbuchar janbuchar left a comment

Looks pretty good to me! I have some nitpicky comments, plus I'd like to ask if you already tried implementing this in ApifyStorageClient. It would be good to see that before merging this part (not mandatory I guess...).

*,
id: str | None,
name: str | None,
alias: str | None,
Collaborator

Shouldn't this actually be used somewhere in the method? I imagine you could just do if alias: name = alias, but I don't see that here. Am I missing something?

I guess this might work thanks to the StorageInstanceManager doing the actual work needed to distinguish the instances, but if that's the case, it's kind of hard to see at first. Also we might want to put the alias in the metadata?

Collaborator Author

As we discussed on Slack, the alias does not have any effect on the memory storage client implementation. I added additional explanatory text to the docstrings.

> Also we might want to put the alias in the metadata?

We might, but I'm not 100% sure about it. I would say it's better to add it later than to have to remove it later. So I would suggest keeping it as it is and adding it later if there is a reason or request for it.
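
To illustrate the point discussed in this thread, here is a minimal sketch of how an instance manager could keep named and aliased storages apart while the storage itself stays unaware of the alias. It is a toy model under stated assumptions, not the crawlee implementation: `ToyStorage`, `ToyInstanceManager`, and the `'__default__'` key are hypothetical names invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyStorage:
    """Hypothetical stand-in for a storage; `name` is None for unnamed storages."""

    name: str | None


@dataclass
class ToyInstanceManager:
    """Hypothetical cache that keeps named and aliased storages apart."""

    _by_name: dict[str, ToyStorage] = field(default_factory=dict)
    _by_alias: dict[str, ToyStorage] = field(default_factory=dict)

    def open(self, *, name: str | None = None, alias: str | None = None) -> ToyStorage:
        if name is not None and alias is not None:
            raise ValueError('Pass either name or alias, not both.')
        if name is not None:
            # Named storage: the name is a property of the storage itself.
            return self._by_name.setdefault(name, ToyStorage(name=name))
        # Aliased (or default) storage stays unnamed; the alias is only a local lookup key.
        key = alias if alias is not None else '__default__'
        return self._by_alias.setdefault(key, ToyStorage(name=None))


manager = ToyInstanceManager()
named = manager.open(name='products')
aliased = manager.open(alias='products')
```

In this sketch the alias never reaches the storage object, which mirrors the idea that only the manager needs it to tell instances apart.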

Comment on lines +651 to +652
# Clean up
await dataset_1.drop()
Collaborator

Doesn't the storage_client fixture take care of this? And if it doesn't, it definitely should...

Collaborator Author

The StorageClient class does not have access to the specific storage clients that were created using it, so it cannot drop them.

Collaborator

I see. I guess it's fine then, even though you could theoretically set up some monkey patching to get that access.
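
One way to get the kind of access mentioned here, without monkey patching, would be a small wrapper that records every storage it opens so that a test teardown can drop them all. The sketch below only illustrates that idea: `FakeDataset` and `TrackingOpener` are hypothetical helpers, not part of crawlee or its test suite.

```python
import asyncio


class FakeDataset:
    """Hypothetical stand-in for a storage with an async drop() method."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.dropped = False

    async def drop(self) -> None:
        self.dropped = True


class TrackingOpener:
    """Records every storage it opens so a teardown step can drop them all."""

    def __init__(self) -> None:
        self.opened: list[FakeDataset] = []

    async def open(self, label: str) -> FakeDataset:
        dataset = FakeDataset(label)
        self.opened.append(dataset)
        return dataset

    async def drop_all(self) -> None:
        # Drop in reverse order of creation, then forget the instances.
        for dataset in reversed(self.opened):
            await dataset.drop()
        self.opened.clear()


async def demo() -> list[FakeDataset]:
    opener = TrackingOpener()
    created = [await opener.open('dataset_1'), await opener.open('dataset_2')]
    await opener.drop_all()  # what a fixture teardown would call
    return created


datasets = asyncio.run(demo())
```

With something like this behind a fixture, individual tests would not need their own `# Clean up` step.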

@vdusek
Collaborator Author

vdusek commented Sep 12, 2025

> Looks pretty good to me! I have some nitpicky comments, plus I'd like to ask if you already tried implementing this in ApifyStorageClient. It would be good to see that before merging this part (not mandatory I guess...).

I've got it working; it just needs some more polishing and tests. Anyway, there were no issues with the implementation, and it's pretty straightforward. Also, no breaking changes, as expected.

@vdusek vdusek requested a review from janbuchar September 12, 2025 15:59
@vdusek
Collaborator Author

vdusek commented Sep 12, 2025

I'm merging this as all the comments were clarified and/or addressed, so that we can adopt this in #1386 and #1339.

@vdusek vdusek merged commit 5dbd212 into master Sep 12, 2025
19 checks passed
@vdusek vdusek deleted the add-ndu-storages branch September 12, 2025 16:01
Pijukatel added a commit that referenced this pull request Sep 16, 2025
…ervices (#1386)

### Description

This is a collection of closely related changes that are hard to
separate from one another. The main purpose is to enable flexible
storage use across the code base without unexpected limitations and
to limit unexpected side effects in global services.

#### Top-level changes:
- There can be multiple crawlers with different storage clients,
configurations, or event managers. (Previously, this would cause
`ServiceConflictError`.)
- `StorageInstanceManager` allows similar but different storage
instances to be used at the same time. (Previously, a similar storage
instance could be incorrectly retrieved instead of a new storage
instance being created.)
- Differently configured storages can be used at the same time, even the
storages that are using the same `StorageClient` and are different only
by using different `Configuration`.
- `Crawler` can no longer cause side effects in the global
`service_locator` (apart from adding new instances to
`StorageInstanceManager`).
- The global `service_locator` can be used at the same time as local
instances of `ServiceLocator` (for example, each `Crawler` has its own
`ServiceLocator` instance, which does not interfere with the global
`service_locator`).
- Services in `ServiceLocator` can be set only once. Any attempt to
reset them raises an error. Using services without setting them first is
still possible: `ServiceLocator` then falls back to implicit defaults and
logs warnings, because implicit services can lead to hard-to-predict
code. The preferred way is to set services explicitly, either manually
or through some helper code, for example, through `Actor`. [See related
PR](apify/apify-sdk-python#576)
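
The set-once rule from the last bullet can be sketched roughly as follows. This is a toy model, not the actual `ServiceLocator`: `ToyServiceLocator`, `ToyConflictError`, and the dict-based configuration are hypothetical, and treating an implicit default as "already set" is an assumption of this sketch.

```python
from __future__ import annotations

import logging

logger = logging.getLogger('toy_service_locator')


class ToyConflictError(RuntimeError):
    """Raised when a service that is already set would be replaced."""


class ToyServiceLocator:
    """Illustrative set-once holder for a single service (a configuration dict)."""

    def __init__(self) -> None:
        self._configuration: dict | None = None

    def set_configuration(self, configuration: dict) -> None:
        if self._configuration is not None:
            raise ToyConflictError('Configuration is already set.')
        self._configuration = configuration

    def get_configuration(self) -> dict:
        if self._configuration is None:
            # Using a service before setting it falls back to an implicit default.
            logger.warning('No configuration set, using an implicit default.')
            self._configuration = {}
        return self._configuration


# Preferred: set the service explicitly, exactly once.
explicit = ToyServiceLocator()
explicit.set_configuration({'storage_dir': 'path1'})

# Implicit default: works, but logs a warning and blocks later changes.
implicit = ToyServiceLocator()
default_configuration = implicit.get_configuration()

try:
    implicit.set_configuration({'storage_dir': 'path2'})
    conflict_raised = False
except ToyConflictError:
    conflict_raised = True
```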

#### Implementation notes:
- Storage caching now supports all relevant ways to distinguish storage
instances. Apart from generic parameters like `name`, `id`,
`storage_type`, and `storage_client_type`, there is also an
`additional_cache_key`. A `StorageClient` can use it to define its own
way of distinguishing between two similar but different instances. For
example, `FileSystemStorageClient` depends on
`Configuration.storage_dir`, so `storage_dir` is included in its custom
cache key. This is not true for `MemoryStorageClient`, because
`storage_dir` is not relevant for it. (This `additional_cache_key` could
possibly also be used for caching non-default unnamed (NDU) storages in
#1401.) See the example:
```python
from crawlee.configuration import Configuration
from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient
from crawlee.storages import Dataset

# File system client: storage_dir is part of the cache key.
storage_client = FileSystemStorageClient()
d1 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))
d2 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path2"))
d3 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))

assert d2 is not d1
assert d3 is d1

# Memory client: storage_dir is irrelevant, so both calls return the same instance.
storage_client_2 = MemoryStorageClient()
d4 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path1"))
d5 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path2"))
assert d4 is d5
```
- Each crawler creates its own instance of `ServiceLocator`. It uses
either the services passed explicitly to the crawler's init
(configuration, storage client, event manager) or, as implicit defaults,
the services from the global `service_locator`. This allows multiple
differently configured crawlers to work in the same code. For example:
```python
from crawlee.configuration import Configuration
from crawlee.crawlers import BasicCrawler
from crawlee.events import LocalEventManager
from crawlee.storage_clients import MemoryStorageClient

# Each crawler gets its own set of services.
custom_configuration_1 = Configuration()
custom_event_manager_1 = LocalEventManager.from_config(custom_configuration_1)
custom_storage_client_1 = MemoryStorageClient()

custom_configuration_2 = Configuration()
custom_event_manager_2 = LocalEventManager.from_config(custom_configuration_2)
custom_storage_client_2 = MemoryStorageClient()

crawler_1 = BasicCrawler(
    configuration=custom_configuration_1,
    event_manager=custom_event_manager_1,
    storage_client=custom_storage_client_1,
)

crawler_2 = BasicCrawler(
    configuration=custom_configuration_2,
    event_manager=custom_event_manager_2,
    storage_client=custom_storage_client_2,
)

# use crawlers without runtime crash...
```
- `ServiceLocator` is now much stricter when it comes to setting the
services. Previously, it allowed changing services until some service
had its `_was_retrieved` flag set to `True`. Then it would throw a
runtime error. This led to hard-to-predict code, as the global
`service_locator` could be changed as a side effect from many places.
Now the services in `ServiceLocator` can be set only once, and the side
effects of attempting to change the services are limited as much as
possible. Such side effects are also accompanied by warning messages to
draw attention to code that could cause a `RuntimeError`.
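
The caching behavior described in the first implementation note can be modeled in a few lines. The sketch below is a self-contained toy that assumes each client exposes a method returning its extra cache key; `ToyConfiguration`, `ToyFileSystemClient`, `ToyMemoryClient`, `open_dataset`, and the method name `additional_cache_key` as used here are hypothetical stand-ins, not the real crawlee classes or signatures.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyConfiguration:
    storage_dir: str


class ToyFileSystemClient:
    def additional_cache_key(self, configuration: ToyConfiguration) -> str:
        # The directory decides where data lives, so it must be part of the key.
        return configuration.storage_dir


class ToyMemoryClient:
    def additional_cache_key(self, configuration: ToyConfiguration) -> str:
        # storage_dir is irrelevant for in-memory storage, so it is left out.
        return ''


class ToyDataset:
    """Hypothetical stand-in for an opened storage instance."""


_cache: dict[tuple[str, str, str], ToyDataset] = {}


def open_dataset(name: str, client, configuration: ToyConfiguration) -> ToyDataset:
    # Generic parts of the key plus the client-specific addition.
    key = (name, type(client).__name__, client.additional_cache_key(configuration))
    if key not in _cache:
        _cache[key] = ToyDataset()
    return _cache[key]


fs_client = ToyFileSystemClient()
d1 = open_dataset('default', fs_client, ToyConfiguration('path1'))
d2 = open_dataset('default', fs_client, ToyConfiguration('path2'))
d3 = open_dataset('default', fs_client, ToyConfiguration('path1'))

memory_client = ToyMemoryClient()
d4 = open_dataset('default', memory_client, ToyConfiguration('path1'))
d5 = open_dataset('default', memory_client, ToyConfiguration('path2'))
```

The toy reproduces the same identity relations as the example in that note: `d3 is d1`, `d2 is not d1`, and `d4 is d5`.
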
### Issues

Closes: #1379
Connected to: 
- #1354 (through necessary changes in `StorageInstanceManager`)
- apify/apify-sdk-python#513 (through
necessary changes in `StorageInstanceManager` and storage
clients/configuration related changes in `service_locator`)

### Testing

- New unit tests were added.
- Tested on the `Apify` platform together with SDK changes in [related
PR](apify/apify-sdk-python#576)

---------

Co-authored-by: Vlada Dusek <[email protected]>
Labels
adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Development

Successfully merging this pull request may close these issues.

Add support for non-default unnamed storages
4 participants