-
Notifications
You must be signed in to change notification settings - Fork 488
feat: Add support for NDU storages #1401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0a2262b
to
73d4ee6
Compare
Can you add some docs/examples please? 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
I extended the Storages guide. |
cb120b3
to
50dde42
Compare
Should |
I don't think so. On the Apify platform, the distinction between global scope and run scope storage is based just on naming - unnamed versus named. The alias is just for us (FS and Apify clients use it, or will use it). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets merge it and improve if needed based on real usage feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good to me! I have some nitpicky comments, plus I'd like to ask if you already tried implementing this in ApifyStorageClient
. It would be good to see that before merging this part (not mandatory I guess...).
*, | ||
id: str | None, | ||
name: str | None, | ||
alias: str | None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
um, shouldn't this actually be, you know, used somewhere in the method? I imagine you could just do if alias: name = alias
, but I don't see that here - am I missing something?
I guess this might work that to the StorageInstanceManager
doing the actual work needed to distinguish the instances, but it's kinda hard to see at first, if that's the case. Also we might want to put the alias
in the metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed on Slack, the alias does not have any effect on the memory storage client implementation. I added additional explanatory text to the docstrings.
Also we might want to put the alias in the metadata?
We might, but I'm not 100% sure about it. I would say we can better add it later than remove it later. So I would suggest keeping it as it is and adding it later if we have any reasoning/request for it.
# Clean up | ||
await dataset_1.drop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the storage_client
fixture take care of this? And if it doesn't, it definitely should...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StorageClient
class does not have access, and so cannot drop the specific storage clients that were created using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I guess it's fine then, even though you could theoretically set up some monkey patching to get that access
I've got it working, it just needs some more polishing and tests. Anyway, there were no issues with the implementation, and it's pretty straightforward. Also, no BCs, as expected. |
…ervices (#1386) ### Description This is a collection of closely related changes that are hard to separate from one another. The main purpose is to enable flexible storage use across the code base without unexpected limitations and limit unexpected side effects in global services. #### Top-level changes: - There can be multiple crawlers with different storage clients, configurations, or event managers. (Previously, this would cause `ServiceConflictError`) - `StorageInstanceManager` allows for similar but different storage instances to be used at the same time(Previously, similar storage instances could be incorrectly retrieved instead of creating a new storage instance). - Differently configured storages can be used at the same time, even the storages that are using the same `StorageClient` and are different only by using different `Configuration`. - `Crawler` can no longer cause side effects in the global service_locator (apart from adding new instances to `StorageInstanceManager`). - Global `service_locator` can be used at the same time as local instances of `ServiceLocator` (for example, each Crawler has its own `ServiceLocator` instance, which does not interfere with the global service_locator.) - Services in `ServiceLocator` can be set only once. Any attempt to reset them will throw an Error. Not setting the services and using them is possible. That will set services in `ServiceLocator` to some implicit default, and it will log warnings as implicit services can lead to hard-to-predict code. The preferred way is to set services explicitly. Either manually or through some helper code, for example, through `Actor`. [See related PR](apify/apify-sdk-python#576) #### Implementation notes: - Storage caching now supports all relevant ways to distinguish storage instances. Apart from generic parameters like `name`, `id`, `storage_type`, `storage_client_type`, there is also an `additional_cache_key`. This can be used by the `StorageClient` to define a unique way to distinguish between two similar but different instances. For example, `FileSystemStorageClient` depends on `Configuration.storage_dir`, which is included in the custom cache key for `FileSystemStorageClient`, but this is not true for `MemoryStorageClient` as the `storage_dir` is not relevant for it, see example: (This `additional_cache_key` could possibly be used for caching of NDU in #1401) ```python storage_client = FileSystemStorageClient() d1= await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1")) d2= await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path2")) d3= await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1")) assert d2 is not d1 assert d3 is d1 storage_client_2 =MemoryStorageClient() d4= await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path1")) d5= await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path2")) assert d4 is d5 ``` - Each crawler will create its own instance of `ServiceLocator`. It will either use explicitly passed services(configuration, storage client, event_manager) to crawler init or services from the global `service_locator` as implicit defaults. This allows multiple differently configured crawlers to work in the same code. For example: ```python custom_configuration_1 = Configuration() custom_event_manager_1 = LocalEventManager.from_config(custom_configuration_1) custom_storage_client_1 = MemoryStorageClient() custom_configuration_2 = Configuration() custom_event_manager_2 = LocalEventManager.from_config(custom_configuration_2) custom_storage_client_2 = MemoryStorageClient() crawler_1 = BasicCrawler( configuration=custom_configuration_1, event_manager=custom_event_manager_1, storage_client=custom_storage_client_1, ) crawler_2 = BasicCrawler( configuration=custom_configuration_2, event_manager=custom_event_manager_2, storage_client=custom_storage_client_2, ) # use crawlers without runtime crash... ``` - `ServiceLocator` is now way more strict when it comes to setting the services. Previously, it allowed changing services until some service had `_was_retrieved` flag set to `True`. Then it would throw a runtime error. This led to hard-to-predict code as the global `service_locator` could be changed as a side effect from many places. Now the services in `ServiceLocator` can be set only once, and the side effects of attempting to change the services are limited as much as possible. Such side effects are also accompanied by warning messages to draw attention to code that could cause RuntimeError. ### Issues Closes: #1379 Connected to: - #1354 (through necessary changes in `StorageInstanceManagaer`) - apify/apify-sdk-python#513 (through necessary changes in `StorageInstanceManagaer` and storage clients/configuration related changes in `service_locator`) ### Testing - New unit tests were added. - Tested on the `Apify` platform together with SDK changes in [related PR](apify/apify-sdk-python#576) --------- Co-authored-by: Vlada Dusek <[email protected]>
Description
alias='default'
is the default unnamed storage.Issues
Testing
Checklist