fix: Fix `KeyValueStore.auto_saved_value` failing in some scenarios #1438

Pijukatel · 2025-09-30T12:30:56Z

Description

Reduce the amount of global side effects in service_locator by using an explicit KVS factory in RecoverableState.
Fix KeyValueStore.auto_saved_value not working properly if the global storage_client was different from the current kvs storage client.
Improve test isolation.

Issues

Closes: All crawlee persistence names should be valid names for Apify platform #1354

Testing

Added tests for some edge cases.

Explicit kvs to RecoverableState

Pijukatel · 2025-09-30T12:31:52Z

This PR replaces #1368. That PR included many changes that were already released in Crawlee v1.0, and continuing in the previous PR would not be good for review.

vdusek

I wouldn't be afraid to call it a fix if it's (according to the description) fixing something, and then it should be included in the changelog.

But also, I'm not 100% I understand why we're doing this:

Reduce the amount of global side effects in service_locator by using an explicit KVS factory in RecoverableState.

What side effects?

Fix KeyValueStore.auto_saved_value not working properly if the global storage_client was different from the current kvs storage client.

👍

vdusek · 2025-10-01T12:14:07Z

src/crawlee/storage_clients/_file_system/_request_queue_client.py

                    metadata=metadata,
                    path_to_rq=path_to_rq,
                    lock=asyncio.Lock(),
+                    recoverable_state=await cls._create_recoverable_state(id=metadata.id, configuration=configuration),


Instead of creating it three times, we can create it once, store it in a variable, and just pass it where needed.

That creation requires metadata to get the RQ.id, so we have to repeat this call, as in all three branches, we get metadata in a different way.

janbuchar · 2025-10-01T14:49:13Z

src/crawlee/storage_clients/_file_system/_request_queue_client.py

+            from crawlee.storage_clients import FileSystemStorageClient  # noqa: PLC0415 avoid circular import
+            from crawlee.storages import KeyValueStore  # noqa: PLC0415 avoid circular import
+
+            return await KeyValueStore.open(storage_client=FileSystemStorageClient(), configuration=configuration)


Creating a fresh filesystem storage client to open a request queue feels wrong - at this point, we can be pretty sure that another one already exists. Is there a specific reason to do this or is it just because you don't have access to the existing one?

At that point, we are not sure it exists. It could have been created through a class method without the client and even when created with the help of client, it is out of scope:
await FileSystemRequestQueueClient.open(...)

And why not open the KVS through such a class method as well? Because that way, you bypass the storage instance manager - and that is generally something we do not want.

FileSystemStorageClient is just a helper factory class, which is mainly for convenience and for registering the storage instance manager.

janbuchar · 2025-10-01T14:51:33Z

tests/unit/storage_clients/_file_system/test_fs_rq_client.py

            assert request_data['url'].startswith('https://example.com/')


+async def test_opening_rq_does_not_have_side_effect_on_service_locator(


Can you add some explanation here? Also, inlining the rq_client fixture could lead to better readable code

janbuchar · 2025-10-01T14:55:11Z

tests/unit/conftest.py


+        # Reset global class variables to ensure test isolation.
+        KeyValueStore._autosaved_values = {}
+        Statistics._Statistics__next_id = 0  # type:ignore[attr-defined] # Mangled attribute


How does this contribute towards test isolation? Is there anything that depends on the persist state key that is derived from the ID?

The answer is I am not sure. But we have so many tests, so I think it is best if we restore all we can to the same state at the beginning of the test. This reduces the chance of some weird behavior based on the order of the test execution.

Pijukatel · 2025-10-02T07:50:16Z

I wouldn't be afraid to call it a fix if it's (according to the description) fixing something, and then it should be included in the changelog.

But also, I'm not 100% I understand why we're doing this:

Reduce the amount of global side effects in service_locator by using an explicit KVS factory in RecoverableState.

What side effects?

Fix KeyValueStore.auto_saved_value not working properly if the global storage_client was different from the current kvs storage client.

👍

Each time you use RecoverableState and you do not pass an explicit persist_state_kvs_factory the service_locator will be used to get the current global storage_client. If there is no global storage_client set yet, then it is created and set - and this is the side effect I want to minimize. In many cases, a specific RecoverableState does not really want to use the global storage_client as in many cases that does not make sense at all.

There are different cases, where I am not sure, for example, Statistics, so I did not touch those; I changed only those that I am 100% sure about.

vdusek

LGTM

Draft of minimizing side effects

ac9c95f

Explicit kvs to RecoverableState

github-actions bot assigned Pijukatel Sep 30, 2025

github-actions bot added this to the 124th sprint - Tooling team milestone Sep 30, 2025

github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Sep 30, 2025

Pijukatel mentioned this pull request Sep 30, 2025

feat: Update RecoverableState and StorageInstanceManager to ensure proper persistence #1368

Closed

Use factory method istead of the explicit kvs

537ed1a

Pijukatel force-pushed the minimize-global-service-locator-side-effects branch from 97b11d4 to 537ed1a Compare September 30, 2025 12:33

Pijukatel requested review from vdusek and janbuchar September 30, 2025 12:36

Pijukatel marked this pull request as ready for review September 30, 2025 13:14

vdusek reviewed Oct 1, 2025

View reviewed changes

vdusek mentioned this pull request Oct 1, 2025

fix: Fix memory leak in PlaywrightCrawler on browser context creation #1446

Merged

janbuchar reviewed Oct 1, 2025

View reviewed changes

Make test more readable

ab62d6c

Pijukatel changed the title ~~refactor: Minimize global service locator side effects~~ fix: Fix KeyValueStore.auto_saved_value failing in some scenarios Oct 2, 2025

Pijukatel requested review from janbuchar and vdusek October 2, 2025 07:52

vdusek approved these changes Oct 2, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Fix `KeyValueStore.auto_saved_value` failing in some scenarios #1438

fix: Fix `KeyValueStore.auto_saved_value` failing in some scenarios #1438

Uh oh!

Pijukatel commented Sep 30, 2025

Uh oh!

Pijukatel commented Sep 30, 2025

Uh oh!

vdusek left a comment

Uh oh!

vdusek Oct 1, 2025

Uh oh!

Pijukatel Oct 2, 2025

Uh oh!

janbuchar Oct 1, 2025

Uh oh!

Pijukatel Oct 2, 2025 •

edited

Loading

Uh oh!

janbuchar Oct 1, 2025

Uh oh!

Pijukatel Oct 2, 2025

Uh oh!

janbuchar Oct 1, 2025

Uh oh!

Pijukatel Oct 2, 2025

Uh oh!

Pijukatel commented Oct 2, 2025

Uh oh!

vdusek left a comment

Uh oh!

Uh oh!

		assert request_data['url'].startswith('https://example.com/')


		async def test_opening_rq_does_not_have_side_effect_on_service_locator(

fix: Fix KeyValueStore.auto_saved_value failing in some scenarios #1438

Are you sure you want to change the base?

fix: Fix KeyValueStore.auto_saved_value failing in some scenarios #1438

Uh oh!

Conversation

Pijukatel commented Sep 30, 2025

Description

Issues

Testing

Uh oh!

Pijukatel commented Sep 30, 2025

Uh oh!

vdusek left a comment

Choose a reason for hiding this comment

Uh oh!

vdusek Oct 1, 2025

Choose a reason for hiding this comment

Uh oh!

Pijukatel Oct 2, 2025

Choose a reason for hiding this comment

Uh oh!

janbuchar Oct 1, 2025

Choose a reason for hiding this comment

Uh oh!

Pijukatel Oct 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

janbuchar Oct 1, 2025

Choose a reason for hiding this comment

Uh oh!

Pijukatel Oct 2, 2025

Choose a reason for hiding this comment

Uh oh!

janbuchar Oct 1, 2025

Choose a reason for hiding this comment

Uh oh!

Pijukatel Oct 2, 2025

Choose a reason for hiding this comment

Uh oh!

Pijukatel commented Oct 2, 2025

Uh oh!

vdusek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

fix: Fix `KeyValueStore.auto_saved_value` failing in some scenarios #1438

fix: Fix `KeyValueStore.auto_saved_value` failing in some scenarios #1438

Pijukatel Oct 2, 2025 •

edited

Loading