fix: Fix RQ usage in Scrapy scheduler #385
Changes from 3 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,6 +3,8 @@ | |
import traceback | ||
from typing import TYPE_CHECKING | ||
|
||
from crawlee.storage_clients import MemoryStorageClient | ||
|
||
from apify._configuration import Configuration | ||
from apify.apify_storage_client import ApifyStorageClient | ||
|
||
|
@@ -52,8 +54,15 @@ def open(self, spider: Spider) -> None: # this has to be named "open" | |
self.spider = spider | ||
|
||
async def open_queue() -> RequestQueue: | ||
custom_loop_apify_client = ApifyStorageClient(configuration=Configuration.get_global_configuration()) | ||
return await RequestQueue.open(storage_client=custom_loop_apify_client) | ||
config = Configuration.get_global_configuration() | ||
|
||
# Use the ApifyStorageClient if the Actor is running on the Apify platform, | ||
# otherwise use the MemoryStorageClient. | ||
storage_client = ( | ||
ApifyStorageClient.from_config(config) if config.is_at_home else MemoryStorageClient.from_config(config) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is supposed to happen in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because of the nested event loop, otherwise, it will result in:
when using Apify client. |
||
) | ||
|
||
return await RequestQueue.open(storage_client=storage_client) | ||
|
||
try: | ||
self._rq = nested_event_loop.run_until_complete(open_queue()) | ||
|
Like... universally, everywhere? I don't mind it, it just seems weird.
Maybe it is a problem of Apify proxies, I don't know, but it results in the following:
Humph. But the connect call should happen way before the path part of the URL matters, right?
Yep, that's strange. I'm not sure why we can't connect when it comes to robots.txt, while other URLs work. I've reverted the changes and kept only the storage client fix.
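For anyone reading the storage client fix without the SDK at hand, here is a minimal, stdlib-only sketch of the pattern in the diff: pick the client from the `is_at_home` flag, and do so inside the coroutine that is run on a separate event loop. `FakeConfig`, `FakePlatformClient`, `FakeMemoryClient`, `pick_storage_client`, and `open_on_dedicated_loop` are hypothetical stand-ins invented for illustration; they are not the Apify or Crawlee APIs, and the real `open_queue` awaits `RequestQueue.open(storage_client=...)` instead of returning a name.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeConfig:
    """Hypothetical stand-in for the SDK configuration; only the flag used here."""

    is_at_home: bool


class FakePlatformClient:
    """Hypothetical stand-in for ApifyStorageClient."""

    name = "platform"


class FakeMemoryClient:
    """Hypothetical stand-in for MemoryStorageClient."""

    name = "memory"


def pick_storage_client(config: FakeConfig):
    # Same conditional shape as in the diff: platform client when running
    # on the platform, in-memory client otherwise.
    return FakePlatformClient() if config.is_at_home else FakeMemoryClient()


async def open_queue(config: FakeConfig) -> str:
    # The client is created inside the coroutine, so it is constructed
    # while the dedicated loop is running (assumed to be the point of the
    # "nested event loop" remark in the review thread).
    client = pick_storage_client(config)
    await asyncio.sleep(0)  # placeholder for the real awaited open call
    return client.name


def open_on_dedicated_loop(config: FakeConfig) -> str:
    # Mirrors nested_event_loop.run_until_complete(open_queue()) from the
    # diff, using a fresh loop that is closed afterwards.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(open_queue(config))
    finally:
        loop.close()
```

Calling `open_on_dedicated_loop(FakeConfig(is_at_home=False))` takes the in-memory branch, which is the local-run case this PR is meant to fix.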