Merged
55 commits
- `8369ff9` Test with crawlee branch `storage-clients-and-configurations` (Pijukatel, Aug 29, 2025)
- `d9137aa` Add debug (Pijukatel, Aug 29, 2025)
- `cf1ee6f` Update config handling (Pijukatel, Sep 1, 2025)
- `d6b85ac` Add many configuration based tests (Pijukatel, Sep 1, 2025)
- `ea8e085` Add storage tests (Pijukatel, Sep 1, 2025)
- `9c3e7b1` Do Pydantic workaround (Pijukatel, Sep 1, 2025)
- `a2825bf` Wip, TODO: Solve patching of service_locator from Crawlee (Pijukatel, Sep 2, 2025)
- `0b96454` Update lock (Pijukatel, Sep 2, 2025)
- `432c79c` Remove any monkey patching from Configuration (Pijukatel, Sep 10, 2025)
- `a4a046e` Move all relevant initialization for Actor from __init__ to init to e… (Pijukatel, Sep 10, 2025)
- `2a52cdc` Update inits (Pijukatel, Sep 10, 2025)
- `841d89a` Update tests (Pijukatel, Sep 10, 2025)
- `8bd59fd` Fix failing tests (Pijukatel, Sep 10, 2025)
- `c4b5d48` Remove leftover edits (Pijukatel, Sep 11, 2025)
- `19ea5c7` Update init regarding the implicit config finalization (Pijukatel, Sep 11, 2025)
- `54a3523` Finalize tests (Pijukatel, Sep 11, 2025)
- `c89fd73` Merge remote-tracking branch 'origin/master' into test-new-storage-se… (Pijukatel, Sep 11, 2025)
- `b4efbff` Properly set implicit ApifyFileSystemStorageClient (Pijukatel, Sep 11, 2025)
- `6a6ab98` Update test (Pijukatel, Sep 12, 2025)
- `f1ce0d1` Review feedback (Pijukatel, Sep 12, 2025)
- `8347eb6` Merge remote-tracking branch 'origin/master' into test-new-storage-se… (Pijukatel, Sep 12, 2025)
- `b7101a4` Master related update (Pijukatel, Sep 12, 2025)
- `19b79f1` Add upgrading guide (Pijukatel, Sep 12, 2025)
- `b256876` Add migration test (Pijukatel, Sep 12, 2025)
- `6fbb5f4` Ensure proper storoage client init when is_at_home to avoid unnecesar… (Pijukatel, Sep 12, 2025)
- `5bf51f7` Add warning for usage of FileSystemStorageClient in Actor context (Pijukatel, Sep 12, 2025)
- `14c5395` Add special caching for ApifyClient (Pijukatel, Sep 15, 2025)
- `f7c9a58` Remove line that is no longer necessary (Pijukatel, Sep 15, 2025)
- `4450bf8` Update lock (Pijukatel, Sep 15, 2025)
- `f28fcd7` Merge remote-tracking branch 'origin/master' into test-new-storage-se… (Pijukatel, Sep 16, 2025)
- `c2c8ca5` Update NDU creation logic based on updated Crawlee (Pijukatel, Sep 17, 2025)
- `7911c48` Update tests (Pijukatel, Sep 17, 2025)
- `e68bdef` Update lock (Pijukatel, Sep 17, 2025)
- `04e74bc` Review comments (Pijukatel, Sep 17, 2025)
- `1cb295f` Do not attempt to deal with limited retention for alias storages locally (Pijukatel, Sep 17, 2025)
- `4b5946f` Add more docstrings (Pijukatel, Sep 17, 2025)
- `70890e7` Review call changes (Pijukatel, Sep 17, 2025)
- `2d61f1e` Review call changes 2 (Pijukatel, Sep 17, 2025)
- `3813ea3` Add docs and compute_short_hash for additional cache key (Pijukatel, Sep 18, 2025)
- `5608b4d` Remove Actor.config (Pijukatel, Sep 18, 2025)
- `4b6c414` crawler actor reboot test (Pijukatel, Sep 18, 2025)
- `79b0ff7` Move test_apify_storages (Pijukatel, Sep 18, 2025)
- `b424c9b` Update test_configuration.py (Pijukatel, Sep 18, 2025)
- `cae107e` Update typing (Pijukatel, Sep 18, 2025)
- `177dbb2` Fix naming in failing test (Pijukatel, Sep 18, 2025)
- `698c089` Add warning to potential missuse of Configuration (Pijukatel, Sep 18, 2025)
- `3b7634a` Update caplog test (Pijukatel, Sep 18, 2025)
- `0de39ad` Revert lock changes (Pijukatel, Sep 18, 2025)
- `b19ea8c` Review comments (Pijukatel, Sep 18, 2025)
- `85bf04d` Remove unused self._charging_manager (Pijukatel, Sep 18, 2025)
- `47f84eb` Apply suggestions from code review (Pijukatel, Sep 18, 2025)
- `7d1fc84` Review comments (Pijukatel, Sep 18, 2025)
- `35ad0dd` Remove difference in storage clients (Pijukatel, Sep 18, 2025)
- `28ebd8d` inline finalize (vdusek, Sep 18, 2025)
- `f884356` Order methods in AliasResolver (Pijukatel, Sep 18, 2025)
32 changes: 32 additions & 0 deletions docs/04_upgrading/upgrading_to_v3.md
@@ -9,6 +9,38 @@ This page summarizes the breaking changes between Apify Python SDK v2.x and v3.0

Support for Python 3.9 has been dropped. The Apify Python SDK v3.x now requires Python 3.10 or later. Make sure your environment is running a compatible version before upgrading.

## Actor initialization and ServiceLocator changes

`Actor` initialization and the setup of global `service_locator` services are now stricter and more predictable:

- Services in `Actor` can't be changed after calling `Actor.init`, after entering the `async with Actor` context manager, or after requesting them from the `Actor`.
- Services in `Actor` can be different from the services used by a crawler.

**Now (v3.0):**

```python
from crawlee.crawlers import BasicCrawler
from crawlee.storage_clients import MemoryStorageClient
from crawlee.configuration import Configuration
from crawlee.events import LocalEventManager

from apify import Actor


async def main():
    async with Actor():
        # This crawler will use the same services as the Actor and the global service_locator.
        crawler_1 = BasicCrawler()

        # This crawler will use custom services.
        custom_configuration = Configuration()
        custom_event_manager = LocalEventManager.from_config(custom_configuration)
        custom_storage_client = MemoryStorageClient()
        crawler_2 = BasicCrawler(
            configuration=custom_configuration,
            event_manager=custom_event_manager,
            storage_client=custom_storage_client,
        )
```

## Storages

<!-- TODO -->
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -36,7 +36,7 @@ keywords = [
dependencies = [
"apify-client>=2.0.0,<3.0.0",
"apify-shared>=2.0.0,<3.0.0",
"crawlee==1.0.0rc1",
"crawlee @ git+https://github.com/apify/crawlee-python.git@storage-clients-and-configurations-2",
"cachetools>=5.5.0",
"cryptography>=42.0.0",
"impit>=0.5.3",
205 changes: 137 additions & 68 deletions src/apify/_actor.py

Large diffs are not rendered by default.

35 changes: 30 additions & 5 deletions src/apify/_configuration.py
@@ -8,6 +8,7 @@
from pydantic import AliasChoices, BeforeValidator, Field, model_validator
from typing_extensions import Self, deprecated

from crawlee import service_locator
from crawlee._utils.models import timedelta_ms
from crawlee._utils.urls import validate_http_url
from crawlee.configuration import Configuration as CrawleeConfiguration
@@ -424,11 +425,35 @@ def disable_browser_sandbox_on_platform(self) -> Self:
def get_global_configuration(cls) -> Configuration:
"""Retrieve the global instance of the configuration.

Mostly for the backwards compatibility. It is recommended to use the `service_locator.get_configuration()`
instead.
This method ensures that an ApifyConfiguration is returned, even if a CrawleeConfiguration was set in the
service locator.
"""
return cls()
global_configuration = service_locator.get_configuration()

if isinstance(global_configuration, Configuration):
# If Apify configuration was already stored in service locator, return it.
return global_configuration

# Monkey-patch the base class so that it works with the extended configuration
CrawleeConfiguration.get_global_configuration = Configuration.get_global_configuration # type: ignore[method-assign]
return cls.from_configuration(global_configuration)

@classmethod
def from_configuration(cls, configuration: CrawleeConfiguration) -> Configuration:
> **Reviewer:** The result should be cached so that two calls to `get_global_configuration` always return the exact same object.
>
> **Author (Pijukatel):** I added a warning to the only potential path where this could cause a problem. If someone gets there, it is not by intention, but for the sake of backward compatibility it will still work in most cases as expected.
"""Create Apify Configuration from existing Crawlee Configuration.

Args:
configuration: The existing Crawlee Configuration.

Returns:
The created Apify Configuration.
"""
apify_configuration = cls()

# Ensure the returned configuration is of type Apify Configuration.
# Most likely, a Crawlee configuration was already set; create the Apify configuration from it.
# Due to the known Pydantic issue https://github.com/pydantic/pydantic/issues/9516, creating a new instance of
# Configuration from an existing one is unpredictable when the environment can have some fields set by alias.
# Use this stable workaround instead.
for name in configuration.model_fields:
setattr(apify_configuration, name, getattr(configuration, name))

return apify_configuration
49 changes: 24 additions & 25 deletions src/apify/storage_clients/_apify/_storage_client.py
@@ -9,72 +9,71 @@
from ._dataset_client import ApifyDatasetClient
from ._key_value_store_client import ApifyKeyValueStoreClient
from ._request_queue_client import ApifyRequestQueueClient
from apify._configuration import Configuration as ApifyConfiguration
from apify._utils import docs_group

if TYPE_CHECKING:
from crawlee.configuration import Configuration
from collections.abc import Hashable

from crawlee.configuration import Configuration as CrawleeConfiguration


@docs_group('Storage clients')
class ApifyStorageClient(StorageClient):
"""Apify storage client."""

# This class breaches the Liskov Substitution Principle: it requires a specialized Configuration compared to its parent.
_lsp_violation_error_message_template = (
'Expected "configuration" to be an instance of "apify.Configuration", but got {} instead.'
)

@override
def get_additional_cache_key(self, configuration: CrawleeConfiguration) -> Hashable:
if isinstance(configuration, ApifyConfiguration):
return f'{configuration.api_base_url},{configuration.token}'
raise TypeError(self._lsp_violation_error_message_template.format(type(configuration).__name__))

@override
async def create_dataset_client(
self,
*,
id: str | None = None,
name: str | None = None,
configuration: Configuration | None = None,
alias: str | None = None,
configuration: CrawleeConfiguration | None = None,
> **Reviewer:** I guess I might just be out of the loop, but wasn't this supposed to work with some "stored" configuration so that we could just remove this parameter? Is there some reason why we need it?
>
> **Author (Pijukatel):** I tried that approach initially, but that led to more problems related to the order in which services can be set in the service locator. Since one service is dependent on another service, but they can be set independently... So imagine you do:
> `service_locator.set_configuration(Configuration1)`
> `service_locator.set_storage_client(SomeStorageClient(Configuration2))`
> What now? Better to prevent this from even happening.

) -> ApifyDatasetClient:
# Import here to avoid circular imports.
from apify import Configuration as ApifyConfiguration # noqa: PLC0415

configuration = configuration or ApifyConfiguration.get_global_configuration()
if isinstance(configuration, ApifyConfiguration):
return await ApifyDatasetClient.open(id=id, name=name, configuration=configuration)

raise TypeError(
f'Expected "configuration" to be an instance of "apify.Configuration", '
f'but got {type(configuration).__name__} instead.'
)
raise TypeError(self._lsp_violation_error_message_template.format(type(configuration).__name__))

@override
async def create_kvs_client(
self,
*,
id: str | None = None,
name: str | None = None,
configuration: Configuration | None = None,
alias: str | None = None,
configuration: CrawleeConfiguration | None = None,
) -> ApifyKeyValueStoreClient:
# Import here to avoid circular imports.
from apify import Configuration as ApifyConfiguration # noqa: PLC0415

configuration = configuration or ApifyConfiguration.get_global_configuration()
if isinstance(configuration, ApifyConfiguration):
return await ApifyKeyValueStoreClient.open(id=id, name=name, configuration=configuration)

raise TypeError(
f'Expected "configuration" to be an instance of "apify.Configuration", '
f'but got {type(configuration).__name__} instead.'
)
raise TypeError(self._lsp_violation_error_message_template.format(type(configuration).__name__))

@override
async def create_rq_client(
self,
*,
id: str | None = None,
name: str | None = None,
configuration: Configuration | None = None,
alias: str | None = None,
configuration: CrawleeConfiguration | None = None,
) -> ApifyRequestQueueClient:
# Import here to avoid circular imports.
from apify import Configuration as ApifyConfiguration # noqa: PLC0415

configuration = configuration or ApifyConfiguration.get_global_configuration()
if isinstance(configuration, ApifyConfiguration):
return await ApifyRequestQueueClient.open(id=id, name=name, configuration=configuration)

raise TypeError(
f'Expected "configuration" to be an instance of "apify.Configuration", '
f'but got {type(configuration).__name__} instead.'
)
raise TypeError(self._lsp_violation_error_message_template.format(type(configuration).__name__))
5 changes: 4 additions & 1 deletion src/apify/storage_clients/_file_system/_storage_client.py
@@ -27,9 +27,12 @@ async def create_kvs_client(
*,
id: str | None = None,
name: str | None = None,
alias: str | None = None,
configuration: Configuration | None = None,
) -> FileSystemKeyValueStoreClient:
configuration = configuration or Configuration.get_global_configuration()
client = await ApifyFileSystemKeyValueStoreClient.open(id=id, name=name, configuration=configuration)
client = await ApifyFileSystemKeyValueStoreClient.open(
id=id, name=name, alias=alias, configuration=configuration
)
await self._purge_if_needed(client, configuration)
return client
2 changes: 1 addition & 1 deletion tests/integration/actor_source_base/requirements.txt
@@ -1,4 +1,4 @@
# The test fixture will put the Apify SDK wheel path on the next line
APIFY_SDK_WHEEL_PLACEHOLDER
uvicorn[standard]
crawlee[parsel]==1.0.0rc1
crawlee[parsel] @ git+https://github.com/apify/crawlee-python.git@storage-clients-and-configurations-2
10 changes: 1 addition & 9 deletions tests/integration/conftest.py
@@ -58,18 +58,10 @@ def _prepare_test_env() -> None:
service_locator._configuration = None
service_locator._event_manager = None
service_locator._storage_client = None
service_locator._storage_instance_manager = None

# Reset the retrieval flags.
service_locator._configuration_was_retrieved = False
service_locator._event_manager_was_retrieved = False
service_locator._storage_client_was_retrieved = False
service_locator.storage_instance_manager.clear_cache()

# Verify that the test environment was set up correctly.
assert os.environ.get(ApifyEnvVars.LOCAL_STORAGE_DIR) == str(tmp_path)
assert service_locator._configuration_was_retrieved is False
assert service_locator._storage_client_was_retrieved is False
assert service_locator._event_manager_was_retrieved is False

return _prepare_test_env

48 changes: 48 additions & 0 deletions tests/integration/test_actor_migration.py
@@ -0,0 +1,48 @@
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from .conftest import MakeActorFunction, RunActorFunction


async def test_migration_through_reboot(make_actor: MakeActorFunction, run_actor: RunActorFunction) -> None:
    """Test that the actor works as expected after a migration, by testing behavior after a reboot.

    Handle two requests; migrate in between the two requests.
    """

    async def main() -> None:
        from crawlee._types import BasicCrawlingContext, ConcurrencySettings
        from crawlee.crawlers import BasicCrawler

        from apify import Actor

        async with Actor:
            crawler = BasicCrawler(concurrency_settings=ConcurrencySettings(max_concurrency=1))
            requests = ['https://example.com/1', 'https://example.com/2']

            run = await Actor.apify_client.run(Actor.config.actor_run_id or '').get()
            assert run
            first_run = run.get('stats', {}).get('rebootCount', 0) == 0
            Actor.log.warning(run)

            @crawler.router.default_handler
            async def default_handler(context: BasicCrawlingContext) -> None:
                context.log.info(f'Processing {context.request.url} ...')

                # Simulate migration through reboot.
                if context.request.url == requests[1] and first_run:
                    context.log.info(f'Reclaiming {context.request.url} ...')
                    rq = await crawler.get_request_manager()
                    await rq.reclaim_request(context.request)
                    await Actor.reboot()

            await crawler.run(requests)

            # Each time, one request is finished.
            assert crawler.statistics.state.requests_finished == 1

    actor = await make_actor(label='migration', main_func=main)
    run_result = await run_actor(actor)

    assert run_result.status == 'SUCCEEDED'
34 changes: 17 additions & 17 deletions tests/unit/actor/test_actor_lifecycle.py
@@ -6,6 +6,7 @@
import sys
from datetime import datetime, timezone
from typing import TYPE_CHECKING, Any, cast
from unittest import mock
from unittest.mock import AsyncMock, Mock

import pytest
@@ -179,25 +180,24 @@ async def handler(websocket: websockets.asyncio.server.ServerConnection) -> None
}
)

monkeypatch.setattr(Actor._charging_manager, '_client', mock_run_client)

async with Actor:
Actor.on(Event.PERSIST_STATE, log_persist_state)
await asyncio.sleep(2)

for socket in ws_server.connections:
await socket.send(
json.dumps(
{
'name': 'migrating',
'data': {
'isMigrating': True,
},
}
with mock.patch.object(Actor, 'new_client', return_value=mock_run_client):
async with Actor:
Actor.on(Event.PERSIST_STATE, log_persist_state)
await asyncio.sleep(2)

for socket in ws_server.connections:
await socket.send(
json.dumps(
{
'name': 'migrating',
'data': {
'isMigrating': True,
},
}
)
)
)

await asyncio.sleep(1)
await asyncio.sleep(1)

assert len(persist_state_events_data) >= 3

54 changes: 54 additions & 0 deletions tests/unit/actor/test_apify_storage.py
@@ -0,0 +1,54 @@
from unittest import mock
from unittest.mock import AsyncMock

import pytest

from crawlee.storages import Dataset, KeyValueStore, RequestQueue
from crawlee.storages._base import Storage

from apify import Configuration
from apify.storage_clients import ApifyStorageClient
from apify.storage_clients._apify import ApifyDatasetClient, ApifyKeyValueStoreClient, ApifyRequestQueueClient


@pytest.mark.parametrize(
    ('storage', '_storage_client'),
    [
        (Dataset, ApifyDatasetClient),
        (KeyValueStore, ApifyKeyValueStoreClient),
        (RequestQueue, ApifyRequestQueueClient),
    ],
)
async def test_get_additional_cache_key(
    storage: Storage, _storage_client: ApifyDatasetClient | ApifyKeyValueStoreClient | ApifyRequestQueueClient
) -> None:
    """Test that storages based on `ApifyStorageClient` include `token` and `api_base_url` in the additional cache key."""
    storage_names = iter(['1', '2', '3', '1', '3'])

    apify_storage_client = ApifyStorageClient()

    config_1 = Configuration(token='a')
    config_2 = Configuration(token='b')
    config_3 = Configuration(token='a', api_base_url='https://super_custom_api.com')

    config_4 = Configuration(token='a')
    config_5 = Configuration(token='a', api_base_url='https://super_custom_api.com')

    mocked_open = AsyncMock(spec=_storage_client.open)
    mocked_open.get_metadata = AsyncMock(storage_names)

    with mock.patch.object(_storage_client, 'open', mocked_open):
        storage_1 = await storage.open(storage_client=apify_storage_client, configuration=config_1)
        storage_2 = await storage.open(storage_client=apify_storage_client, configuration=config_2)
        storage_3 = await storage.open(storage_client=apify_storage_client, configuration=config_3)
        storage_4 = await storage.open(storage_client=apify_storage_client, configuration=config_4)
        storage_5 = await storage.open(storage_client=apify_storage_client, configuration=config_5)

    # Different configurations result in different storage clients.
    assert storage_1 is not storage_2
    assert storage_1 is not storage_3
    assert storage_2 is not storage_3

    # Equivalent configurations result in the same storage clients.
    assert storage_1 is storage_4
    assert storage_3 is storage_5
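The caching semantics this test asserts can be sketched as a plain dict keyed by the storage name plus an additional cache key derived from the configuration. This is a rough illustration of the idea behind `get_additional_cache_key`, not the real storage instance manager; the class and method names are invented for the example:

```python
def additional_cache_key(api_base_url: str, token: str) -> str:
    # Mirrors the idea behind ApifyStorageClient.get_additional_cache_key:
    # clients pointing at different accounts or API endpoints must never be shared.
    return f'{api_base_url},{token}'


class StorageCacheSketch:
    """Illustrative cache keyed by (name, additional cache key)."""

    def __init__(self) -> None:
        self._cache: dict[tuple[str, str], object] = {}

    def open(self, name: str, *, api_base_url: str = 'https://api.apify.com', token: str = '') -> object:
        key = (name, additional_cache_key(api_base_url, token))
        if key not in self._cache:
            self._cache[key] = object()  # stands in for a real storage client
        return self._cache[key]
```

Two opens with equivalent credentials return the same object, while a different token or API base URL yields a distinct one, which is exactly what the assertions above check.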