Merged
55 commits
8369ff9
Test with crawlee branch `storage-clients-and-configurations`
Pijukatel Aug 29, 2025
d9137aa
Add debug
Pijukatel Aug 29, 2025
cf1ee6f
Update config handling
Pijukatel Sep 1, 2025
d6b85ac
Add many configuration based tests
Pijukatel Sep 1, 2025
ea8e085
Add storage tests
Pijukatel Sep 1, 2025
9c3e7b1
Do Pydantic workaround
Pijukatel Sep 1, 2025
a2825bf
Wip, TODO: Solve patching of service_locator from Crawlee
Pijukatel Sep 2, 2025
0b96454
Update lock
Pijukatel Sep 2, 2025
432c79c
Remove any monkey patching from Configuration
Pijukatel Sep 10, 2025
a4a046e
Move all relevant initialization for Actor from __init__ to init to e…
Pijukatel Sep 10, 2025
2a52cdc
Update inits
Pijukatel Sep 10, 2025
841d89a
Update tests
Pijukatel Sep 10, 2025
8bd59fd
Fix failing tests
Pijukatel Sep 10, 2025
c4b5d48
Remove leftover edits
Pijukatel Sep 11, 2025
19ea5c7
Update init regarding the implicit config finalization
Pijukatel Sep 11, 2025
54a3523
Finalize tests
Pijukatel Sep 11, 2025
c89fd73
Merge remote-tracking branch 'origin/master' into test-new-storage-se…
Pijukatel Sep 11, 2025
b4efbff
Properly set implicit ApifyFileSystemStorageClient
Pijukatel Sep 11, 2025
6a6ab98
Update test
Pijukatel Sep 12, 2025
f1ce0d1
Review feedback
Pijukatel Sep 12, 2025
8347eb6
Merge remote-tracking branch 'origin/master' into test-new-storage-se…
Pijukatel Sep 12, 2025
b7101a4
Master related update
Pijukatel Sep 12, 2025
19b79f1
Add upgrading guide
Pijukatel Sep 12, 2025
b256876
Add migration test
Pijukatel Sep 12, 2025
6fbb5f4
Ensure proper storage client init when is_at_home to avoid unnecesar…
Pijukatel Sep 12, 2025
5bf51f7
Add warning for usage of FileSystemStorageClient in Actor context
Pijukatel Sep 12, 2025
14c5395
Add special caching for ApifyClient
Pijukatel Sep 15, 2025
f7c9a58
Remove line that is no longer necessary
Pijukatel Sep 15, 2025
4450bf8
Update lock
Pijukatel Sep 15, 2025
f28fcd7
Merge remote-tracking branch 'origin/master' into test-new-storage-se…
Pijukatel Sep 16, 2025
c2c8ca5
Update NDU creation logic based on updated Crawlee
Pijukatel Sep 17, 2025
7911c48
Update tests
Pijukatel Sep 17, 2025
e68bdef
Update lock
Pijukatel Sep 17, 2025
04e74bc
Review comments
Pijukatel Sep 17, 2025
1cb295f
Do not attempt to deal with limited retention for alias storages locally
Pijukatel Sep 17, 2025
4b5946f
Add more docstrings
Pijukatel Sep 17, 2025
70890e7
Review call changes
Pijukatel Sep 17, 2025
2d61f1e
Review call changes 2
Pijukatel Sep 17, 2025
3813ea3
Add docs and compute_short_hash for additional cache key
Pijukatel Sep 18, 2025
5608b4d
Remove Actor.config
Pijukatel Sep 18, 2025
4b6c414
crawler actor reboot test
Pijukatel Sep 18, 2025
79b0ff7
Move test_apify_storages
Pijukatel Sep 18, 2025
b424c9b
Update test_configuration.py
Pijukatel Sep 18, 2025
cae107e
Update typing
Pijukatel Sep 18, 2025
177dbb2
Fix naming in failing test
Pijukatel Sep 18, 2025
698c089
Add warning to potential misuse of Configuration
Pijukatel Sep 18, 2025
3b7634a
Update caplog test
Pijukatel Sep 18, 2025
0de39ad
Revert lock changes
Pijukatel Sep 18, 2025
b19ea8c
Review comments
Pijukatel Sep 18, 2025
85bf04d
Remove unused self._charging_manager
Pijukatel Sep 18, 2025
47f84eb
Apply suggestions from code review
Pijukatel Sep 18, 2025
7d1fc84
Review comments
Pijukatel Sep 18, 2025
35ad0dd
Remove difference in storage clients
Pijukatel Sep 18, 2025
28ebd8d
inline finalize
vdusek Sep 18, 2025
f884356
Order methods in AliasResolver
Pijukatel Sep 18, 2025
32 changes: 32 additions & 0 deletions docs/04_upgrading/upgrading_to_v3.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,38 @@ This page summarizes the breaking changes between Apify Python SDK v2.x and v3.0

Support for Python 3.9 has been dropped. The Apify Python SDK v3.x now requires Python 3.10 or later. Make sure your environment is running a compatible version before upgrading.

## Actor initialization and ServiceLocator changes

`Actor` initialization and the setup of the global `service_locator` services are now stricter and more predictable.
- Services in `Actor` can't be changed after calling `Actor.init`, entering the `async with Actor` context manager, or requesting them from the `Actor`.
- Services in `Actor` can differ from the services used by a crawler.

**Now (v3.0):**

```python
from crawlee.crawlers import BasicCrawler
from crawlee.storage_clients import MemoryStorageClient
from crawlee.configuration import Configuration
from crawlee.events import LocalEventManager
from apify import Actor

async def main():

async with Actor():
        # This crawler will use the same services as the Actor and the global service_locator
crawler_1 = BasicCrawler()

# This crawler will use custom services
custom_configuration = Configuration()
custom_event_manager = LocalEventManager.from_config(custom_configuration)
custom_storage_client = MemoryStorageClient()
crawler_2 = BasicCrawler(
configuration=custom_configuration,
event_manager=custom_event_manager,
storage_client=custom_storage_client,
)
```

## Storages

<!-- TODO -->
Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Expand Up @@ -36,7 +36,7 @@ keywords = [
dependencies = [
"apify-client>=2.0.0,<3.0.0",
"apify-shared>=2.0.0,<3.0.0",
"crawlee==0.6.13b37",
"crawlee @ git+https://github.com/apify/crawlee-python.git@master",
"cachetools>=5.5.0",
"cryptography>=42.0.0",
"impit>=0.6.1",
Expand Down
219 changes: 143 additions & 76 deletions src/apify/_actor.py

Large diffs are not rendered by default.

35 changes: 30 additions & 5 deletions src/apify/_configuration.py
Expand Up @@ -8,6 +8,7 @@
from pydantic import AliasChoices, BeforeValidator, Field, model_validator
from typing_extensions import Self, deprecated

from crawlee import service_locator
from crawlee._utils.models import timedelta_ms
from crawlee._utils.urls import validate_http_url
from crawlee.configuration import Configuration as CrawleeConfiguration
Expand Down Expand Up @@ -424,11 +425,35 @@ def disable_browser_sandbox_on_platform(self) -> Self:
def get_global_configuration(cls) -> Configuration:
"""Retrieve the global instance of the configuration.

Mostly for the backwards compatibility. It is recommended to use the `service_locator.get_configuration()`
instead.
This method ensures that ApifyConfiguration is returned, even if CrawleeConfiguration was set in the
service locator.
"""
return cls()
global_configuration = service_locator.get_configuration()

if isinstance(global_configuration, Configuration):
# If Apify configuration was already stored in service locator, return it.
return global_configuration

# Monkey-patch the base class so that it works with the extended configuration
CrawleeConfiguration.get_global_configuration = Configuration.get_global_configuration # type: ignore[method-assign]
return cls.from_configuration(global_configuration)

@classmethod
def from_configuration(cls, configuration: CrawleeConfiguration) -> Configuration:
Contributor:
The result should be cached so that two calls to get_global_configuration always return the exact same object.

Contributor Author:
I added a warning to the only potential path where this could cause a problem. If someone ends up there, it is not intentional, but for the sake of backward compatibility it will still work as expected in most cases.

"""Create Apify Configuration from existing Crawlee Configuration.

Args:
configuration: The existing Crawlee Configuration.

Returns:
The created Apify Configuration.
"""
apify_configuration = cls()

# Ensure the returned configuration is of type Apify Configuration.
# Most likely a Crawlee Configuration was already set. Create an Apify Configuration from it.
# Due to a known Pydantic issue, https://github.com/pydantic/pydantic/issues/9516, creating a new instance of
# Configuration from an existing one is very unpredictable when the environment can have some fields set by
# alias. Use the stable workaround instead.
for name in configuration.model_fields:
setattr(apify_configuration, name, getattr(configuration, name))

return apify_configuration
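The field-by-field copy used in `from_configuration` above can be sketched in isolation with generic Pydantic v2 models (the model and field names here are invented for illustration). The point is that values populated via an alias survive the copy, since each field is read from the source instance and assigned on the new one:

```python
from pydantic import BaseModel, Field


class BaseConfig(BaseModel):
    """Stand-in for a Crawlee-style configuration with an aliased field."""

    model_config = {'populate_by_name': True}

    # This field can be populated via the alias 'maxItems' (e.g. from env vars).
    max_items: int = Field(default=10, alias='maxItems')


class ExtendedConfig(BaseConfig):
    """Stand-in for an extended (Apify-style) configuration subclass."""

    extra_flag: bool = False


def extend_config(base: BaseConfig) -> ExtendedConfig:
    # Copy every field value from the existing instance onto a fresh subclass
    # instance, instead of re-validating through the constructor.
    extended = ExtendedConfig()
    for name in type(base).model_fields:
        setattr(extended, name, getattr(base, name))
    return extended


base = BaseConfig.model_validate({'maxItems': 42})
extended = extend_config(base)
print(extended.max_items)  # 42
```

Re-validating through `ExtendedConfig(**base.model_dump())` is the more obvious route, but with alias-populated fields it is exactly the path affected by the linked Pydantic issue, hence the explicit attribute copy.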
27 changes: 14 additions & 13 deletions src/apify/storage_clients/_apify/_dataset_client.py
Expand Up @@ -11,8 +11,9 @@
from crawlee._utils.file import json_dumps
from crawlee.storage_clients._base import DatasetClient
from crawlee.storage_clients.models import DatasetItemsListPage, DatasetMetadata
from crawlee.storages import Dataset

from ._utils import resolve_alias_to_id, store_alias_mapping
from apify.storage_clients._apify._utils import Alias

if TYPE_CHECKING:
from collections.abc import AsyncIterator
Expand Down Expand Up @@ -126,19 +127,19 @@ async def open(
# Normalize 'default' alias to None
alias = None if alias == 'default' else alias

# Handle alias resolution
if alias:
# Try to resolve alias to existing storage ID
resolved_id = await resolve_alias_to_id(alias, 'dataset', configuration)
if resolved_id:
id = resolved_id
else:
# Create a new storage and store the alias mapping
new_storage_metadata = DatasetMetadata.model_validate(
await apify_datasets_client.get_or_create(),
)
id = new_storage_metadata.id
await store_alias_mapping(alias, 'dataset', id, configuration)
# Check if there is pre-existing alias mapping in the default KVS.
async with Alias(storage_type=Dataset, alias=alias, configuration=configuration) as _alias:
id = await _alias.resolve_id()

# There was no pre-existing alias in the mapping.
# Create a new unnamed storage and store the mapping.
if id is None:
new_storage_metadata = DatasetMetadata.model_validate(
await apify_datasets_client.get_or_create(),
)
id = new_storage_metadata.id
await _alias.store_mapping(storage_id=id)

# If name is provided, get or create the storage by name.
elif name:
Expand Down
28 changes: 15 additions & 13 deletions src/apify/storage_clients/_apify/_key_value_store_client.py
Expand Up @@ -10,9 +10,10 @@
from apify_client import ApifyClientAsync
from crawlee.storage_clients._base import KeyValueStoreClient
from crawlee.storage_clients.models import KeyValueStoreRecord, KeyValueStoreRecordMetadata
from crawlee.storages import KeyValueStore

from ._models import ApifyKeyValueStoreMetadata, KeyValueStoreListKeysPage
from ._utils import resolve_alias_to_id, store_alias_mapping
from ._utils import Alias
from apify._crypto import create_hmac_signature

if TYPE_CHECKING:
Expand Down Expand Up @@ -117,19 +118,20 @@ async def open(
# Normalize 'default' alias to None
alias = None if alias == 'default' else alias

# Handle alias resolution
if alias:
# Try to resolve alias to existing storage ID
resolved_id = await resolve_alias_to_id(alias, 'kvs', configuration)
if resolved_id:
id = resolved_id
else:
# Create a new storage and store the alias mapping
new_storage_metadata = ApifyKeyValueStoreMetadata.model_validate(
await apify_kvss_client.get_or_create(),
)
id = new_storage_metadata.id
await store_alias_mapping(alias, 'kvs', id, configuration)
# Check if there is pre-existing alias mapping in the default KVS.
async with Alias(storage_type=KeyValueStore, alias=alias, configuration=configuration) as _alias:
id = await _alias.resolve_id()

# There was no pre-existing alias in the mapping.
# Create a new unnamed storage and store the mapping.
if id is None:
new_storage_metadata = ApifyKeyValueStoreMetadata.model_validate(
await apify_kvss_client.get_or_create(),
)
id = new_storage_metadata.id
await _alias.store_mapping(storage_id=id)

# If name is provided, get or create the storage by name.
elif name:
Expand Down
30 changes: 14 additions & 16 deletions src/apify/storage_clients/_apify/_request_queue_client.py
Expand Up @@ -16,9 +16,10 @@
from crawlee._utils.crypto import crypto_random_object_id
from crawlee.storage_clients._base import RequestQueueClient
from crawlee.storage_clients.models import AddRequestsResponse, ProcessedRequest, RequestQueueMetadata
from crawlee.storages import RequestQueue

from ._models import CachedRequest, ProlongRequestLockResponse, RequestQueueHead
from ._utils import resolve_alias_to_id, store_alias_mapping
from ._utils import Alias
from apify import Request

if TYPE_CHECKING:
Expand Down Expand Up @@ -192,22 +193,19 @@ async def open(
)
apify_rqs_client = apify_client_async.request_queues()

# Normalize 'default' alias to None
alias = None if alias == 'default' else alias

# Handle alias resolution
if alias:
# Try to resolve alias to existing storage ID
resolved_id = await resolve_alias_to_id(alias, 'rq', configuration)
if resolved_id:
id = resolved_id
else:
# Create a new storage and store the alias mapping
new_storage_metadata = RequestQueueMetadata.model_validate(
await apify_rqs_client.get_or_create(),
)
id = new_storage_metadata.id
await store_alias_mapping(alias, 'rq', id, configuration)
# Check if there is pre-existing alias mapping in the default KVS.
async with Alias(storage_type=RequestQueue, alias=alias, configuration=configuration) as _alias:
id = await _alias.resolve_id()

# There was no pre-existing alias in the mapping.
# Create a new unnamed storage and store the mapping.
if id is None:
new_storage_metadata = RequestQueueMetadata.model_validate(
await apify_rqs_client.get_or_create(),
)
id = new_storage_metadata.id
await _alias.store_mapping(storage_id=id)

# If name is provided, get or create the storage by name.
elif name:
Expand Down
49 changes: 24 additions & 25 deletions src/apify/storage_clients/_apify/_storage_client.py
Expand Up @@ -9,36 +9,47 @@
from ._dataset_client import ApifyDatasetClient
from ._key_value_store_client import ApifyKeyValueStoreClient
from ._request_queue_client import ApifyRequestQueueClient
from ._utils import Alias
from apify._configuration import Configuration as ApifyConfiguration
from apify._utils import docs_group

if TYPE_CHECKING:
from crawlee.configuration import Configuration
from collections.abc import Hashable

from crawlee.configuration import Configuration as CrawleeConfiguration


@docs_group('Storage clients')
class ApifyStorageClient(StorageClient):
"""Apify storage client."""

# This class breaches Liskov Substitution Principle. It requires specialized Configuration compared to its parent.
_lsp_violation_error_message_template = (
'Expected "configuration" to be an instance of "apify.Configuration", but got {} instead.'
)

@override
def get_additional_cache_key(self, configuration: CrawleeConfiguration) -> Hashable:
if isinstance(configuration, ApifyConfiguration):
if configuration.api_base_url is None or configuration.token is None:
raise ValueError("'Configuration.api_base_url' and 'Configuration.token' must be set.")
return Alias.get_additional_cache_key(configuration)
raise TypeError(self._lsp_violation_error_message_template.format(type(configuration).__name__))

@override
async def create_dataset_client(
self,
*,
id: str | None = None,
name: str | None = None,
alias: str | None = None,
configuration: Configuration | None = None,
configuration: CrawleeConfiguration | None = None,
Contributor:
I guess I might just be out of the loop, but wasn't this supposed to work with some "stored" configuration so that we could just remove this parameter? Is there some reason why we need it?

Contributor Author:
I tried that approach initially, but it led to more problems related to the order in which services can be set in the service locator: one service depends on another, yet they can be set independently.
So imagine you do:
service_locator.set_configuration(Configuration1)
service_locator.set_storage_client(SomeStorageClient(Configuration2))
What now? Better to prevent this from even happening

) -> ApifyDatasetClient:
# Import here to avoid circular imports.
from apify import Configuration as ApifyConfiguration # noqa: PLC0415

configuration = configuration or ApifyConfiguration.get_global_configuration()
if isinstance(configuration, ApifyConfiguration):
return await ApifyDatasetClient.open(id=id, name=name, alias=alias, configuration=configuration)

raise TypeError(
f'Expected "configuration" to be an instance of "apify.Configuration", '
f'but got {type(configuration).__name__} instead.'
)
raise TypeError(self._lsp_violation_error_message_template.format(type(configuration).__name__))

@override
async def create_kvs_client(
Expand All @@ -47,19 +58,13 @@ async def create_kvs_client(
id: str | None = None,
name: str | None = None,
alias: str | None = None,
configuration: Configuration | None = None,
configuration: CrawleeConfiguration | None = None,
) -> ApifyKeyValueStoreClient:
# Import here to avoid circular imports.
from apify import Configuration as ApifyConfiguration # noqa: PLC0415

configuration = configuration or ApifyConfiguration.get_global_configuration()
if isinstance(configuration, ApifyConfiguration):
return await ApifyKeyValueStoreClient.open(id=id, name=name, alias=alias, configuration=configuration)

raise TypeError(
f'Expected "configuration" to be an instance of "apify.Configuration", '
f'but got {type(configuration).__name__} instead.'
)
raise TypeError(self._lsp_violation_error_message_template.format(type(configuration).__name__))

@override
async def create_rq_client(
Expand All @@ -68,16 +73,10 @@ async def create_rq_client(
id: str | None = None,
name: str | None = None,
alias: str | None = None,
configuration: Configuration | None = None,
configuration: CrawleeConfiguration | None = None,
) -> ApifyRequestQueueClient:
# Import here to avoid circular imports.
from apify import Configuration as ApifyConfiguration # noqa: PLC0415

configuration = configuration or ApifyConfiguration.get_global_configuration()
if isinstance(configuration, ApifyConfiguration):
return await ApifyRequestQueueClient.open(id=id, name=name, alias=alias, configuration=configuration)

raise TypeError(
f'Expected "configuration" to be an instance of "apify.Configuration", '
f'but got {type(configuration).__name__} instead.'
)
raise TypeError(self._lsp_violation_error_message_template.format(type(configuration).__name__))
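The ordering problem discussed in the review thread (services set independently in the locator can end up disagreeing) can be shown with a minimal sketch. The class names are invented stand-ins, not the actual crawlee `service_locator` API:

```python
class Configuration:
    def __init__(self, label: str) -> None:
        self.label = label


class StorageClient:
    def __init__(self, configuration: Configuration) -> None:
        self.configuration = configuration


class ServiceLocator:
    def __init__(self) -> None:
        self.configuration: Configuration | None = None
        self.storage_client: StorageClient | None = None


locator = ServiceLocator()
locator.configuration = Configuration('config-1')
locator.storage_client = StorageClient(Configuration('config-2'))

# The locator's configuration and the one baked into the storage client now
# disagree -- which one should the storage client use? Passing the
# configuration explicitly to create_*_client sidesteps this ambiguity.
print(locator.configuration.label)                 # config-1
print(locator.storage_client.configuration.label)  # config-2
```

This is why the `create_*_client` methods in this diff keep an explicit `configuration` parameter instead of always reading it back from the locator.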