Skip to content
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
Show all changes
55 commits
Select commit Hold shift + click to select a range
8369ff9
Test with crawlee branch `storage-clients-and-configurations`
Pijukatel Aug 29, 2025
d9137aa
Add debug
Pijukatel Aug 29, 2025
cf1ee6f
Update config handling
Pijukatel Sep 1, 2025
d6b85ac
Add many configuration based tests
Pijukatel Sep 1, 2025
ea8e085
Add storage tests
Pijukatel Sep 1, 2025
9c3e7b1
Do Pydantic workaround
Pijukatel Sep 1, 2025
a2825bf
Wip, TODO: Solve patching of service_locator from Crawlee
Pijukatel Sep 2, 2025
0b96454
Update lock
Pijukatel Sep 2, 2025
432c79c
Remove any monkey patching from Configuration
Pijukatel Sep 10, 2025
a4a046e
Move all relevant initialization for Actor from __init__ to init to e…
Pijukatel Sep 10, 2025
2a52cdc
Update inits
Pijukatel Sep 10, 2025
841d89a
Update tests
Pijukatel Sep 10, 2025
8bd59fd
Fix failing tests
Pijukatel Sep 10, 2025
c4b5d48
Remove leftover edits
Pijukatel Sep 11, 2025
19ea5c7
Update init regarding the implicit config finalization
Pijukatel Sep 11, 2025
54a3523
Finalize tests
Pijukatel Sep 11, 2025
c89fd73
Merge remote-tracking branch 'origin/master' into test-new-storage-se…
Pijukatel Sep 11, 2025
b4efbff
Properly set implicit ApifyFileSystemStorageClient
Pijukatel Sep 11, 2025
6a6ab98
Update test
Pijukatel Sep 12, 2025
f1ce0d1
Review feedback
Pijukatel Sep 12, 2025
8347eb6
Merge remote-tracking branch 'origin/master' into test-new-storage-se…
Pijukatel Sep 12, 2025
b7101a4
Master related update
Pijukatel Sep 12, 2025
19b79f1
Add upgrading guide
Pijukatel Sep 12, 2025
b256876
Add migration test
Pijukatel Sep 12, 2025
6fbb5f4
Ensure proper storoage client init when is_at_home to avoid unnecesar…
Pijukatel Sep 12, 2025
5bf51f7
Add warning for usage of FileSystemStorageClient in Actor context
Pijukatel Sep 12, 2025
14c5395
Add special caching for ApifyClient
Pijukatel Sep 15, 2025
f7c9a58
Remove line that is no longer necessary
Pijukatel Sep 15, 2025
4450bf8
Update lock
Pijukatel Sep 15, 2025
f28fcd7
Merge remote-tracking branch 'origin/master' into test-new-storage-se…
Pijukatel Sep 16, 2025
c2c8ca5
Update NDU creation logic based on updated Crawlee
Pijukatel Sep 17, 2025
7911c48
Update tests
Pijukatel Sep 17, 2025
e68bdef
Update lock
Pijukatel Sep 17, 2025
04e74bc
Review comments
Pijukatel Sep 17, 2025
1cb295f
Do not attempt to deal with limited retention for alias storages locally
Pijukatel Sep 17, 2025
4b5946f
Add more docstrings
Pijukatel Sep 17, 2025
70890e7
Review call changes
Pijukatel Sep 17, 2025
2d61f1e
Review call changes 2
Pijukatel Sep 17, 2025
3813ea3
Add docs and compute_short_hash for additional cache key
Pijukatel Sep 18, 2025
5608b4d
Remove Actor.config
Pijukatel Sep 18, 2025
4b6c414
crawler actor reboot test
Pijukatel Sep 18, 2025
79b0ff7
Move test_apify_storages
Pijukatel Sep 18, 2025
b424c9b
Update test_configuration.py
Pijukatel Sep 18, 2025
cae107e
Update typing
Pijukatel Sep 18, 2025
177dbb2
Fix naming in failing test
Pijukatel Sep 18, 2025
698c089
Add warning to potential missuse of Configuration
Pijukatel Sep 18, 2025
3b7634a
Update caplog test
Pijukatel Sep 18, 2025
0de39ad
Revert lock changes
Pijukatel Sep 18, 2025
b19ea8c
Review comments
Pijukatel Sep 18, 2025
85bf04d
Remove unused self._charging_manager
Pijukatel Sep 18, 2025
47f84eb
Apply suggestions from code review
Pijukatel Sep 18, 2025
7d1fc84
Review comments
Pijukatel Sep 18, 2025
35ad0dd
Remove difference in storage clients
Pijukatel Sep 18, 2025
28ebd8d
inline finalize
vdusek Sep 18, 2025
f884356
Order methods in AliasResolver
Pijukatel Sep 18, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ keywords = [
dependencies = [
"apify-client>=2.0.0,<3.0.0",
"apify-shared>=2.0.0,<3.0.0",
"crawlee==1.0.0rc1",
"crawlee @ git+https://github.com/apify/crawlee-python.git@storage-clients-and-configurations-2",
"cachetools>=5.5.0",
"cryptography>=42.0.0",
"impit>=0.5.3",
Expand Down
159 changes: 100 additions & 59 deletions src/apify/_actor.py

Large diffs are not rendered by default.

35 changes: 30 additions & 5 deletions src/apify/_configuration.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
from pydantic import AliasChoices, BeforeValidator, Field, model_validator
from typing_extensions import Self, deprecated

from crawlee import service_locator
from crawlee._utils.models import timedelta_ms
from crawlee._utils.urls import validate_http_url
from crawlee.configuration import Configuration as CrawleeConfiguration
Expand Down Expand Up @@ -424,11 +425,35 @@ def disable_browser_sandbox_on_platform(self) -> Self:
def get_global_configuration(cls) -> Configuration:
"""Retrieve the global instance of the configuration.

Mostly for the backwards compatibility. It is recommended to use the `service_locator.get_configuration()`
instead.
This method ensures that ApifyConfigration is returned, even if CrawleeConfiguration was set in the
service locator.
"""
return cls()
global_configuration = service_locator.get_configuration()

if isinstance(global_configuration, Configuration):
# If Apify configuration was already stored in service locator, return it.
return global_configuration

# Monkey-patch the base class so that it works with the extended configuration
CrawleeConfiguration.get_global_configuration = Configuration.get_global_configuration # type: ignore[method-assign]
return cls.from_configuration(global_configuration)

@classmethod
def from_configuration(cls, configuration: CrawleeConfiguration) -> Configuration:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result should be cached so that two calls to get_global_configuration always return the exact same object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a warning to the only potential path where this could cause a problem. If someone gets there, it is not by intention, but for the sake of backward compatibility, it will still work in most cases as expected.

"""Create Apify Configuration from existing Crawlee Configuration.

Args:
configuration: The existing Crawlee Configuration.

Returns:
The created Apify Configuration.
"""
apify_configuration = cls()

# Ensure the returned configuration is of type Apify Configuration.
# Most likely crawlee configuration was already set. Create Apify configuration from it.
# Due to known Pydantic issue https://github.com/pydantic/pydantic/issues/9516, creating new instance of
# Configuration from existing one in situation where environment can have some fields set by alias is very
# unpredictable. Use the stable workaround.
for name in configuration.model_fields:
setattr(apify_configuration, name, getattr(configuration, name))

return apify_configuration
2 changes: 1 addition & 1 deletion tests/integration/actor_source_base/requirements.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# The test fixture will put the Apify SDK wheel path on the next line
APIFY_SDK_WHEEL_PLACEHOLDER
uvicorn[standard]
crawlee[parsel]==1.0.0rc1
crawlee[parsel] @ git+https://github.com/apify/crawlee-python.git@storage-clients-and-configurations-2
10 changes: 1 addition & 9 deletions tests/integration/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -58,18 +58,10 @@ def _prepare_test_env() -> None:
service_locator._configuration = None
service_locator._event_manager = None
service_locator._storage_client = None
service_locator._storage_instance_manager = None

# Reset the retrieval flags.
service_locator._configuration_was_retrieved = False
service_locator._event_manager_was_retrieved = False
service_locator._storage_client_was_retrieved = False
service_locator.storage_instance_manager.clear_cache()

# Verify that the test environment was set up correctly.
assert os.environ.get(ApifyEnvVars.LOCAL_STORAGE_DIR) == str(tmp_path)
assert service_locator._configuration_was_retrieved is False
assert service_locator._storage_client_was_retrieved is False
assert service_locator._event_manager_was_retrieved is False

return _prepare_test_env

Expand Down
3 changes: 3 additions & 0 deletions tests/unit/actor/test_actor_lifecycle.py
Original file line number Diff line number Diff line change
Expand Up @@ -150,6 +150,9 @@ async def handler(websocket: websockets.asyncio.server.ServerConnection) -> None
port: int = ws_server.sockets[0].getsockname()[1] # type: ignore[index]
monkeypatch.setenv(ActorEnvVars.EVENTS_WEBSOCKET_URL, f'ws://localhost:{port}')

# Make sure there is a charging manager that can be mocked
Actor._get_charging_manager_implementation()

mock_run_client = Mock()
mock_run_client.run.return_value.get = AsyncMock(
side_effect=lambda: {
Expand Down
188 changes: 186 additions & 2 deletions tests/unit/actor/test_configuration.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,15 @@
from pathlib import Path

import pytest

from apify import Configuration
from crawlee import Request, service_locator
from crawlee._types import BasicCrawlingContext
from crawlee.configuration import Configuration as CrawleeConfiguration
from crawlee.crawlers import BasicCrawler
from crawlee.errors import ServiceConflictError

from apify import Actor
from apify import Configuration as ApifyConfiguration


@pytest.mark.parametrize(
Expand All @@ -16,6 +25,181 @@ def test_disable_browser_sandbox(
*, is_at_home: bool, disable_browser_sandbox_in: bool, disable_browser_sandbox_out: bool
) -> None:
assert (
Configuration(is_at_home=is_at_home, disable_browser_sandbox=disable_browser_sandbox_in).disable_browser_sandbox
ApifyConfiguration(
is_at_home=is_at_home, disable_browser_sandbox=disable_browser_sandbox_in
).disable_browser_sandbox
== disable_browser_sandbox_out
)


def test_apify_configuration_is_always_used() -> None:
"""Set Crawlee Configuration in Actor and verify that Apify Configuration is used."""
max_used_cpu_ratio = 0.123456 # Some unique value to verify configuration

service_locator.set_configuration(CrawleeConfiguration(max_used_cpu_ratio=max_used_cpu_ratio))

assert Actor.config.max_used_cpu_ratio == max_used_cpu_ratio
assert isinstance(Actor.config, ApifyConfiguration)


async def test_existing_apify_config_respected_by_actor() -> None:
"""Set Apify Configuration in service_locator and verify that Actor respects it."""
max_used_cpu_ratio = 0.123456 # Some unique value to verify configuration
apify_config = ApifyConfiguration(max_used_cpu_ratio=max_used_cpu_ratio)
service_locator.set_configuration(apify_config)
async with Actor:
pass

returned_config = service_locator.get_configuration()
assert returned_config is apify_config


async def test_existing_crawlee_config_respected_by_actor() -> None:
"""Set Crawlee Configuration in service_locator and verify that Actor respects it."""
max_used_cpu_ratio = 0.123456 # Some unique value to verify configuration
crawlee_config = CrawleeConfiguration(max_used_cpu_ratio=max_used_cpu_ratio)
service_locator.set_configuration(crawlee_config)
async with Actor:
pass

assert Actor.config is not crawlee_config
assert isinstance(Actor.config, ApifyConfiguration)
# Make sure the Crawlee Configuration was used to create ApifyConfiguration in Actor
assert Actor.config.max_used_cpu_ratio == max_used_cpu_ratio


async def test_existing_apify_config_throws_error_when_set_in_actor() -> None:
"""Test that passing explicit configuration to actor after service locator configuration was already set,
raises exception."""
service_locator.set_configuration(ApifyConfiguration())
with pytest.raises(ServiceConflictError):
async with Actor(configuration=ApifyConfiguration()):
pass


async def test_setting_config_after_actor_raises_exception() -> None:
"""Test that setting configuration in service locator after actor was created raises an exception."""
async with Actor():
with pytest.raises(ServiceConflictError):
service_locator.set_configuration(ApifyConfiguration())


async def test_actor_using_input_configuration() -> None:
"""Test that setting configuration in service locator after actor was created raises an exception."""
apify_config = ApifyConfiguration()
async with Actor(configuration=apify_config):
pass

assert service_locator.get_configuration() is apify_config


async def test_crawler_implicit_configuration_through_actor() -> None:
"""Test that crawler uses Actor configuration unless explicit configuration was passed to it."""
apify_config = ApifyConfiguration()
async with Actor(configuration=apify_config):
crawler = BasicCrawler()

assert crawler._service_locator.get_configuration() is apify_config
assert service_locator.get_configuration() is apify_config


async def test_crawler_implicit_configuration() -> None:
"""Test that crawler and Actor use implicit service_locator based configuration unless explicit configuration
was passed to them."""
async with Actor():
crawler = BasicCrawler()

assert Actor.config is service_locator.get_configuration() is crawler._service_locator.get_configuration()


async def test_crawlers_own_configuration(tmp_path: Path) -> None:
"""Test that crawlers can use own configurations without crashing."""
config_actor = ApifyConfiguration()
dir_1 = tmp_path / 'dir_1'
dir_2 = tmp_path / 'dir_2'
config_crawler_1 = ApifyConfiguration()
config_actor.storage_dir = str(dir_1)
config_crawler_2 = ApifyConfiguration()
config_crawler_2.storage_dir = str(dir_2)

async with Actor(configuration=config_actor):

async def request_handler(context: BasicCrawlingContext) -> None:
Actor.log.info(f'Processing: {context.request.url}')

crawler_1 = BasicCrawler(configuration=config_crawler_1, request_handler=request_handler)
crawler_2 = BasicCrawler(configuration=config_crawler_2, request_handler=request_handler)
await crawler_1.add_requests([Request.from_url(url='http://example.com/1')])
await crawler_2.add_requests(
[Request.from_url(url='http://example.com/2'), Request.from_url(url='http://example.com/3')]
)

await crawler_1.run()
await crawler_2.run()

assert service_locator.get_configuration() is config_actor
assert crawler_1._service_locator.get_configuration() is config_crawler_1
assert crawler_2._service_locator.get_configuration() is config_crawler_2

assert crawler_1.statistics.state.requests_total == 1
assert crawler_2.statistics.state.requests_total == 2


async def test_crawler_global_configuration() -> None:
"""Test that crawler and Actor use service_locator based configuration unless explicit configuration
was passed to them."""
config_global = ApifyConfiguration()
service_locator.set_configuration(config_global)

async with Actor():
crawler = BasicCrawler()

assert service_locator.get_configuration() is config_global
assert crawler._service_locator.get_configuration() is config_global


async def test_crawler_uses_implicit_apify_config() -> None:
"""Test that Actor is using implicit ApifyConfiguration in Actor context."""
async with Actor:
assert isinstance(Actor.config, ApifyConfiguration)


async def test_storage_retrieved_is_different_with_different_config(tmp_path: Path) -> None:
"""Test that retrieving storage depends on used configuration."""
dir_1 = tmp_path / 'dir_1'
dir_2 = tmp_path / 'dir_2'
config_actor = ApifyConfiguration()
config_actor.storage_dir = str(dir_1)
config_crawler = ApifyConfiguration()
config_crawler.storage_dir = str(dir_2)

async with Actor(configuration=config_actor):
actor_kvs = await Actor.open_key_value_store()
crawler = BasicCrawler(configuration=config_crawler)
crawler_kvs = await crawler.get_key_value_store()

assert actor_kvs is not crawler_kvs


async def test_storage_retrieved_is_same_with_equivalent_config() -> None:
"""Test that retrieving storage depends on used configuration. If two equivalent configuration(even if they are
different instances) are used it returns same storage."""
config_actor = ApifyConfiguration()
apify_crawler = ApifyConfiguration()

async with Actor(configuration=config_actor):
actor_kvs = await Actor.open_key_value_store()
crawler = BasicCrawler(configuration=apify_crawler)
crawler_kvs = await crawler.get_key_value_store()

assert actor_kvs is crawler_kvs


async def test_storage_retrieved_is_same_with_same_config() -> None:
"""Test that retrieving storage is same if same configuration is used."""
async with Actor():
actor_kvs = await Actor.open_key_value_store()
crawler = BasicCrawler()
crawler_kvs = await crawler.get_key_value_store()

assert actor_kvs is crawler_kvs
10 changes: 1 addition & 9 deletions tests/unit/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,18 +70,10 @@ def _prepare_test_env() -> None:
service_locator._configuration = None
service_locator._event_manager = None
service_locator._storage_client = None
service_locator._storage_instance_manager = None

# Reset the retrieval flags.
service_locator._configuration_was_retrieved = False
service_locator._event_manager_was_retrieved = False
service_locator._storage_client_was_retrieved = False
service_locator.storage_instance_manager.clear_cache()

# Verify that the test environment was set up correctly.
assert os.environ.get(ApifyEnvVars.LOCAL_STORAGE_DIR) == str(tmp_path)
assert service_locator._configuration_was_retrieved is False
assert service_locator._storage_client_was_retrieved is False
assert service_locator._event_manager_was_retrieved is False

return _prepare_test_env

Expand Down
3 changes: 1 addition & 2 deletions tests/unit/events/test_apify_event_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,9 +24,8 @@

async def test_lifecycle_local(caplog: pytest.LogCaptureFixture) -> None:
caplog.set_level(logging.DEBUG, logger='apify')
config = Configuration.get_global_configuration()

async with ApifyEventManager(config):
async with ApifyEventManager(Configuration()):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary, but I wanted to avoid Configuration.get_global_configuration() as it warns that it is setting the implicit default configuration as no configuration was set explicitly.
I can also properly set the configuration before the test or just pass the configuration directly and avoid the service_locator related side effects

pass

assert len(caplog.records) == 1
Expand Down
4 changes: 2 additions & 2 deletions tests/unit/test_proxy_configuration.py
Original file line number Diff line number Diff line change
Expand Up @@ -532,9 +532,9 @@ async def test_initialize_when_status_page_unavailable(

await proxy_configuration.initialize()

assert len(caplog.records) == 1
assert len(caplog.records) == 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Soo what was added here? 😁

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will throw this warning:
"'No configuration set, implicitly creating and using default Configuration.'"

Which is fine. Normally, the proxy configuration would be used within the context of the Actor and it would set the configuration.

It still works even this way, it just warns that no one set the configuration before and thus it is doing it now implicitly.

assert caplog.records[0].levelname == 'WARNING'
assert 'Apify Proxy access check timed out' in caplog.records[0].message
assert 'Apify Proxy access check timed out' in caplog.records[1].message


async def test_initialize_with_non_apify_proxy(
Expand Down
19 changes: 7 additions & 12 deletions uv.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading