Skip to content

Conversation

Pijukatel
Copy link
Collaborator

Description

  • Reduce the amount of global side effects in service_locator by using an explicit KVS factory in RecoverableState.
  • Fix KeyValueStore.auto_saved_value not working properly if the global storage_client was different from the current kvs storage client.
  • Improve test isolation.

Issues

Testing

  • Added tests for some edge cases.

Explicit kvs to RecoverableState
@github-actions github-actions bot added this to the 124th sprint - Tooling team milestone Sep 30, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Sep 30, 2025
@Pijukatel
Copy link
Collaborator Author

This PR replaces #1368. That PR included many changes that were already released in Crawlee v1.0, and continuing in the previous PR would not be good for review.

@Pijukatel Pijukatel force-pushed the minimize-global-service-locator-side-effects branch from 97b11d4 to 537ed1a Compare September 30, 2025 12:33
@Pijukatel Pijukatel marked this pull request as ready for review September 30, 2025 13:14
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't be afraid to call it a fix if it's (according to the description) fixing something, and then it should be included in the changelog.

But also, I'm not 100% I understand why we're doing this:

Reduce the amount of global side effects in service_locator by using an explicit KVS factory in RecoverableState.

What side effects?

Fix KeyValueStore.auto_saved_value not working properly if the global storage_client was different from the current kvs storage client.

👍

metadata=metadata,
path_to_rq=path_to_rq,
lock=asyncio.Lock(),
recoverable_state=await cls._create_recoverable_state(id=metadata.id, configuration=configuration),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating it three times, we can create it once, store it in a variable, and just pass it where needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That creation requires metadata to get the RQ.id, so we have to repeat this call, as in all three branches, we get metadata in a different way.

from crawlee.storage_clients import FileSystemStorageClient # noqa: PLC0415 avoid circular import
from crawlee.storages import KeyValueStore # noqa: PLC0415 avoid circular import

return await KeyValueStore.open(storage_client=FileSystemStorageClient(), configuration=configuration)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a fresh filesystem storage client to open a request queue feels wrong - at this point, we can be pretty sure that another one already exists. Is there a specific reason to do this or is it just because you don't have access to the existing one?

Copy link
Collaborator Author

@Pijukatel Pijukatel Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At that point, we are not sure it exists. It could have been created through a class method without the client and even when created with the help of client, it is out of scope:
await FileSystemRequestQueueClient.open(...)

And why not open the KVS through such a class method as well? Because that way, you bypass the storage instance manager - and that is generally something we do not want.

FileSystemStorageClient is just a helper factory class, which is mainly for convenience and for registering the storage instance manager.

assert request_data['url'].startswith('https://example.com/')


async def test_opening_rq_does_not_have_side_effect_on_service_locator(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some explanation here? Also, inlining the rq_client fixture could lead to better readable code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok


# Reset global class variables to ensure test isolation.
KeyValueStore._autosaved_values = {}
Statistics._Statistics__next_id = 0 # type:ignore[attr-defined] # Mangled attribute
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this contribute towards test isolation? Is there anything that depends on the persist state key that is derived from the ID?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The answer is I am not sure. But we have so many tests, so I think it is best if we restore all we can to the same state at the beginning of the test. This reduces the chance of some weird behavior based on the order of the test execution.

@Pijukatel
Copy link
Collaborator Author

I wouldn't be afraid to call it a fix if it's (according to the description) fixing something, and then it should be included in the changelog.

But also, I'm not 100% I understand why we're doing this:

Reduce the amount of global side effects in service_locator by using an explicit KVS factory in RecoverableState.

What side effects?

Fix KeyValueStore.auto_saved_value not working properly if the global storage_client was different from the current kvs storage client.

👍

Each time you use RecoverableState and you do not pass an explicit persist_state_kvs_factory the service_locator will be used to get the current global storage_client. If there is no global storage_client set yet, then it is created and set - and this is the side effect I want to minimize. In many cases, a specific RecoverableState does not really want to use the global storage_client as in many cases that does not make sense at all.

There are different cases, where I am not sure, for example, Statistics, so I did not touch those; I changed only those that I am 100% sure about.

@Pijukatel Pijukatel changed the title refactor: Minimize global service locator side effects fix: Fix KeyValueStore.auto_saved_value failing in some scenarios Oct 2, 2025
@Pijukatel Pijukatel requested review from janbuchar and vdusek October 2, 2025 07:52
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

All crawlee persistence names should be valid names for Apify platform
3 participants