-
Notifications
You must be signed in to change notification settings - Fork 487
fix: Fix KeyValueStore.auto_saved_value
failing in some scenarios
#1438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Explicit kvs to RecoverableState
This PR replaces #1368. That PR included many changes that were already released in Crawlee v1.0, and continuing in the previous PR would not be good for review. |
97b11d4
to
537ed1a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't be afraid to call it a fix if it's (according to the description) fixing something, and then it should be included in the changelog.
But also, I'm not 100% I understand why we're doing this:
Reduce the amount of global side effects in service_locator by using an explicit KVS factory in RecoverableState.
What side effects?
Fix KeyValueStore.auto_saved_value not working properly if the global storage_client was different from the current kvs storage client.
👍
metadata=metadata, | ||
path_to_rq=path_to_rq, | ||
lock=asyncio.Lock(), | ||
recoverable_state=await cls._create_recoverable_state(id=metadata.id, configuration=configuration), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of creating it three times, we can create it once, store it in a variable, and just pass it where needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That creation requires metadata to get the RQ.id, so we have to repeat this call, as in all three branches, we get metadata in a different way.
from crawlee.storage_clients import FileSystemStorageClient # noqa: PLC0415 avoid circular import | ||
from crawlee.storages import KeyValueStore # noqa: PLC0415 avoid circular import | ||
|
||
return await KeyValueStore.open(storage_client=FileSystemStorageClient(), configuration=configuration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a fresh filesystem storage client to open a request queue feels wrong - at this point, we can be pretty sure that another one already exists. Is there a specific reason to do this or is it just because you don't have access to the existing one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At that point, we are not sure it exists. It could have been created through a class method without the client and even when created with the help of client, it is out of scope:
await FileSystemRequestQueueClient.open(...)
And why not open the KVS through such a class method as well? Because that way, you bypass the storage instance manager - and that is generally something we do not want.
FileSystemStorageClient is just a helper factory class, which is mainly for convenience and for registering the storage instance manager.
assert request_data['url'].startswith('https://example.com/') | ||
|
||
|
||
async def test_opening_rq_does_not_have_side_effect_on_service_locator( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some explanation here? Also, inlining the rq_client
fixture could lead to better readable code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
|
||
# Reset global class variables to ensure test isolation. | ||
KeyValueStore._autosaved_values = {} | ||
Statistics._Statistics__next_id = 0 # type:ignore[attr-defined] # Mangled attribute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this contribute towards test isolation? Is there anything that depends on the persist state key that is derived from the ID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The answer is I am not sure. But we have so many tests, so I think it is best if we restore all we can to the same state at the beginning of the test. This reduces the chance of some weird behavior based on the order of the test execution.
Each time you use There are different cases, where I am not sure, for example, |
KeyValueStore.auto_saved_value
failing in some scenarios
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
service_locator
by using an explicit KVS factory inRecoverableState
.KeyValueStore.auto_saved_value
not working properly if the global storage_client was different from the current kvs storage client.Issues
Testing