Conversation

Contributor

@Pijukatel Pijukatel commented Sep 2, 2025

Description

  • All relevant parts of Actor are initialized in async init, not in __init__.
  • Actor is considered finalized after Actor.init was run. This also means that the same configuration used by the Actor is set in the global service_locator.
  • There are three valid scenarios for setting up the configuration:
    • Setting the global configuration in service_locator before calling Actor.init().
    • Having no configuration set in service_locator, passing one through Actor(configuration=...), and then running Actor.init().
    • Having no configuration set in service_locator and passing none to the Actor, in which case an implicit default configuration is created and set.
  • Properly set ApifyFileSystemStorageClient as local client to support pre-existing input file.
  • Depends on refactor!: Refactor storage creation and caching, configuration and services crawlee-python#1386
  • Enable caching of ApifyStorageClient based on token and api_public_url and update NDU storage handling.
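
The three configuration scenarios listed above can be sketched with a toy model. This is a minimal illustration, not the SDK's code: `ToyConfiguration`, `ToyServiceLocator`, and `resolve_configuration` are hypothetical names, and raising on a conflicting configuration is an assumption about how the remaining combinations are rejected.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfiguration:
    """Hypothetical stand-in for the Actor configuration."""

    token: Optional[str] = None


class ToyServiceLocator:
    """Hypothetical stand-in for the global service_locator."""

    def __init__(self) -> None:
        self._configuration: Optional[ToyConfiguration] = None

    def set_configuration(self, configuration: ToyConfiguration) -> None:
        # Assumption: re-setting the same object is harmless, a different one is an error.
        if self._configuration is not None and self._configuration is not configuration:
            raise RuntimeError('A different configuration is already set.')
        self._configuration = configuration

    def get_configuration(self) -> ToyConfiguration:
        if self._configuration is None:
            # Scenario 3: nothing was set, so a default is created implicitly.
            warnings.warn('No configuration set, implicitly creating and using default Configuration.')
            self._configuration = ToyConfiguration()
        return self._configuration


def resolve_configuration(
    locator: ToyServiceLocator,
    passed: Optional[ToyConfiguration] = None,
) -> ToyConfiguration:
    """Pick the configuration a toy Actor.init() would end up using."""
    if passed is not None:
        # Scenario 2: the configuration passed to the Actor becomes the global one.
        locator.set_configuration(passed)
        return passed
    # Scenario 1 (already set globally) or scenario 3 (implicit default).
    return locator.get_configuration()
```

In all three cases the object returned by `resolve_configuration` is the same one the locator holds afterwards, which mirrors the statement that the Actor's configuration is also set globally.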

Issues

Related to: #513, #590

Testing

@github-actions github-actions bot added this to the 122nd sprint - Tooling team milestone Sep 2, 2025
@github-actions github-actions bot added the labels t-tooling (Issues with this label are in the ownership of the tooling team.) and tested (Temporary label used only programmatically for some analytics.) Sep 2, 2025
@Pijukatel Pijukatel changed the title from "refactor!: Test with crawlee branch storage-clients-and-configurations-2" to "refactor!: Make Actor initialization stricter and more predictable" Sep 11, 2025
@Pijukatel Pijukatel marked this pull request as ready for review September 11, 2025 08:29
await proxy_configuration.initialize()

- assert len(caplog.records) == 1
+ assert len(caplog.records) == 2
Contributor

Soo what was added here? 😁

Contributor Author

It will throw this warning:
"'No configuration set, implicitly creating and using default Configuration.'"

Which is fine. Normally, the proxy configuration would be used within the context of the Actor and it would set the configuration.

It still works this way; it just warns that no configuration was set beforehand, so it is now setting one implicitly.

- config = Configuration.get_global_configuration()

- async with ApifyEventManager(config):
+ async with ApifyEventManager(Configuration()):
Contributor

Why is this change necessary?

Contributor Author

Not necessary, but I wanted to avoid Configuration.get_global_configuration(), because it warns that it is setting the implicit default configuration when no configuration was set explicitly.
I can also properly set the configuration before the test, or just pass the configuration directly and avoid the service_locator-related side effects.

Contributor

@vdusek vdusek left a comment

first batch

Contributor

@vdusek vdusek left a comment

Another batch (from our discussion), I'll wait for your updates now.

name: str | None = None,
alias: str | None = None,
- configuration: Configuration | None = None,
+ configuration: CrawleeConfiguration | None = None,
Contributor

I guess I might just be out of the loop, but wasn't this supposed to work with some "stored" configuration so that we could just remove this parameter? Is there some reason why we need it?

Contributor Author

I tried that approach initially, but it led to more problems with the order in which services can be set in the service locator: one service can depend on another, yet they can be set independently.
So imagine you do:
service_locator.set_configuration(Configuration1)
service_locator.set_storage_client(SomeStorageClient(Configuration2))
What now? Better to prevent this from happening at all.
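
The ambiguity described in this reply can be illustrated with a toy model. All names below (`ToyConfiguration`, `ToyStorageClient`, `ToyServiceLocator`, `configurations_disagree`) are hypothetical and only mimic the shape of the two calls above; they are not the crawlee API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyConfiguration:
    """Hypothetical configuration with a single setting."""

    storage_dir: str


class ToyStorageClient:
    """Hypothetical service that carries its own configuration."""

    def __init__(self, configuration: ToyConfiguration) -> None:
        self.configuration = configuration


class ToyServiceLocator:
    """Hypothetical locator that lets both services be set independently."""

    def __init__(self) -> None:
        self.configuration: Optional[ToyConfiguration] = None
        self.storage_client: Optional[ToyStorageClient] = None

    def set_configuration(self, configuration: ToyConfiguration) -> None:
        self.configuration = configuration

    def set_storage_client(self, storage_client: ToyStorageClient) -> None:
        self.storage_client = storage_client


def configurations_disagree(locator: ToyServiceLocator) -> bool:
    """Return True when the stored client was built from another configuration."""
    if locator.configuration is None or locator.storage_client is None:
        return False
    return locator.storage_client.configuration != locator.configuration


locator = ToyServiceLocator()
locator.set_configuration(ToyConfiguration(storage_dir='./storage_a'))
locator.set_storage_client(ToyStorageClient(ToyConfiguration(storage_dir='./storage_b')))

# Nothing stopped the two calls, yet it is now unclear which directory storages should use.
print(configurations_disagree(locator))  # prints True
```

Keeping an explicit configuration parameter on the call that opens a storage is one way to make the choice visible at the call site instead of relying on whichever service was set last.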

return cls.from_configuration(global_configuration)

@classmethod
def from_configuration(cls, configuration: CrawleeConfiguration) -> Configuration:
Contributor

The result should be cached so that two calls to get_global_configuration always return the exact same object.

Contributor Author

I added a warning to the only path where this could potentially cause a problem. If someone ends up there, it is not intentional; for the sake of backward compatibility, it will still work as expected in most cases.
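
For comparison, the caching suggested in the review could look roughly like the sketch below. It is not what the PR does (the author added a warning instead). `ToyCrawleeConfiguration`, `ToyApifyConfiguration`, and `get_apify_configuration` are hypothetical stand-ins, and the global configuration is passed in as an argument only to keep the example self-contained.

```python
from typing import Optional, Tuple


class ToyCrawleeConfiguration:
    """Hypothetical base configuration."""

    def __init__(self, purge_on_start: bool = True) -> None:
        self.purge_on_start = purge_on_start


class ToyApifyConfiguration(ToyCrawleeConfiguration):
    """Hypothetical subclass that can be built from the base configuration."""

    @classmethod
    def from_configuration(cls, configuration: ToyCrawleeConfiguration) -> 'ToyApifyConfiguration':
        # Copy the base fields into a new subclass instance.
        return cls(purge_on_start=configuration.purge_on_start)


# Remembers the last converted pair: (source object, converted object).
_cache: Optional[Tuple[ToyCrawleeConfiguration, ToyApifyConfiguration]] = None


def get_apify_configuration(global_configuration: ToyCrawleeConfiguration) -> ToyApifyConfiguration:
    """Return the same converted object for repeated calls with the same source."""
    global _cache
    if isinstance(global_configuration, ToyApifyConfiguration):
        # Already the right type, so no conversion and no caching is needed.
        return global_configuration
    if _cache is not None and _cache[0] is global_configuration:
        return _cache[1]
    converted = ToyApifyConfiguration.from_configuration(global_configuration)
    _cache = (global_configuration, converted)
    return converted
```

Without the cache, every call would build a fresh object, so a change made through one returned instance would not be visible through another.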


@cached_property
def charging_manager_implementation(self) -> ChargingManagerImplementation:
    return ChargingManagerImplementation(self.config, self.apify_client)
Contributor

I believe this should call _raise_if_not_initialized too

"""Retrieve the charging manager to access granular pricing information."""
self._raise_if_not_initialized()
- return self._charging_manager
+ return self.charging_manager_implementation
Contributor

Do we really need this method now?

Contributor Author

Public use of charging_manager should be allowed only once the Actor is initialized, and the Actor is not yet initialized at the point where await self._charging_manager_implementation.__aenter__() runs.
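
The guard described in this reply can be sketched as follows. `ToyActor` and `ToyChargingManager` are hypothetical simplifications; only the names `_raise_if_not_initialized`, `charging_manager`, and `_charging_manager_implementation` come from the discussion, and the error message is an assumption.

```python
import asyncio
from functools import cached_property


class ToyChargingManager:
    """Hypothetical async context manager standing in for the real implementation."""

    def __init__(self) -> None:
        self.entered = False

    async def __aenter__(self) -> 'ToyChargingManager':
        self.entered = True
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        self.entered = False


class ToyActor:
    def __init__(self) -> None:
        # Nothing heavy happens here; real setup is deferred to init().
        self._is_initialized = False

    @cached_property
    def _charging_manager_implementation(self) -> ToyChargingManager:
        # Created lazily on first access and reused afterwards.
        return ToyChargingManager()

    def _raise_if_not_initialized(self) -> None:
        if not self._is_initialized:
            raise RuntimeError('The Actor was not initialized!')

    @property
    def charging_manager(self) -> ToyChargingManager:
        """Public accessor, usable only after init()."""
        self._raise_if_not_initialized()
        return self._charging_manager_implementation

    async def init(self) -> None:
        # The private accessor is used here because the public one would still raise.
        await self._charging_manager_implementation.__aenter__()
        self._is_initialized = True
```

Calling `ToyActor().charging_manager` before `init()` raises `RuntimeError`, while `init()` itself can still enter the manager through the private attribute.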

Contributor

@vdusek vdusek left a comment

LGTM

@Pijukatel Pijukatel merged commit 912222a into master Sep 18, 2025
23 checks passed
@Pijukatel Pijukatel deleted the test-new-storage-services-creation-2 branch September 18, 2025 15:08