feat: Persist the SitemapRequestLoader state (#1347)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pull Request Overview
This PR adds persistence functionality to the SitemapRequestLoader by implementing state management through a new SitemapRequestLoaderState model and RecoverableState integration. The changes enable the loader to save and restore its internal state, allowing it to resume sitemap processing after interruptions.
- Added state persistence model with queue, progress tracking, and completion status
- Refactored internal data structures to use deques and sets that can be serialized
- Added context manager support for proper resource cleanup
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/crawlee/request_loaders/_sitemap_request_loader.py` | Core implementation of state persistence with new state model and async context manager |
| `src/crawlee/_utils/sitemap.py` | Added exception handling for SAXParseException during parser cleanup |
| `tests/unit/request_loaders/test_sitemap_request_loader.py` | Added asyncio.sleep calls to allow background loading time in tests |
| `docs/guides/code_examples/request_loaders/sitemap_example.py` | Updated example to include sleep for proper loader initialization |
```python
    ) as sitemap_loader,
):
    # Allow some time for the loader to fetch the sitemap and extract some URLs
    await asyncio.sleep(1)
```
What happens if we omit this? Won't the fetch_next_request just take a bit longer?
> What happens if we omit this? Won't the fetch_next_request just take a bit longer?
I decided not to wait inside fetch_next_request until there are links to process in the queue.
So it just returns None.
Neither the old nor the new behavior is a problem for Crawlee. But the old behavior with cyclic waiting may be unexpected for the user if they decide to use SitemapRequestLoader directly.
I think that is not the correct behavior. A 1-second wait does not guarantee that there will be something to process. Maybe it'll take less time, maybe more. How should a user know that?
If you want to check if there's something to process, you can always use is_empty.
Yes, I completely agree with that. This is poorly used in the example, I will update it.
I didn't make myself clear the last time. I meant that this:
> I decided not to wait inside fetch_next_request until there are links to process in the queue.
> So it just returns None.
is not the correct behavior. In the JS version, fetchNextRequest blocks until a link can be returned (or the sitemap is completely processed). What is the reason to do it differently here?
I understand your point, it is indeed different from JS. But in my opinion, this behavior is more consistent with the loader API
```python
@abstractmethod
async def fetch_next_request(self) -> Request | None:
    """Return the next request to be processed, or `null` if there are no more pending requests."""
```

The loader will return None if there are no requests ready to be processed at the moment.
But I won't insist that my interpretation of the API description is correct. 🙂
Given that 1) I wrote the docblock and 2) the JS version waits for requests to appear, I think "return None if and only if there will be no more requests, ever" is the correct interpretation 😁
I'll expand the docblock in the meantime and fix that "null", too
Updated 🙂
```python
logger = getLogger(__name__)


class SitemapRequestLoaderState(BaseModel):
```
Now that I see this from the outside, it would be great if you could write down how the persistence mechanism works in the docblock of this class.
Also, I don't see processed sitemap URLs being tracked in any way. Is that intentional?
> I don't see processed sitemap URLs being tracked in any way. Is that intentional?
Yes, I may be wrong, but I think that cyclic links are not expected in sitemaps. Thanks to this, we don't need to store links to processed sitemaps.
JS uses similar behavior - https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L108
I wouldn't be surprised to encounter a cyclic sitemap somewhere, but I don't have a real-world example 🤷
Cyclic sitemaps aside, can you please briefly describe how this state model is used by the loader? Like "The crawler processes one sitemap at a time. The current one is kept in `in_progress_sitemap_url`..."
Sure. 🙂
The crawler processes one sitemap at a time. The current sitemap is stored in `in_progress_sitemap_url`. The `parse_sitemap` function parses the sitemap and returns elements as an async iterator. Each element retrieved from the iterator is processed based on its type. If the element is a `NestedSitemap`, its URL is added to `pending_sitemap_urls`. If the element is a `SitemapUrl`, the system checks whether it already exists in `current_sitemap_processed_urls`. If it exists, the loader was restarted from a saved state and the URL is skipped.
If the URL is new, it is first added to `url_queue`, then to `current_sitemap_processed_urls`, and `total_count` is incremented by 1. When all elements from the current sitemap iterator have been processed, `in_progress_sitemap_url` is set to None and `current_sitemap_processed_urls` is cleared. The next sitemap is retrieved from `pending_sitemap_urls`. If `pending_sitemap_urls` is empty, `completed` is set to True.
When `fetch_next_request` is called, a URL is extracted from `url_queue` and placed in `in_progress`. When `mark_request_as_handled` is called for the extracted URL, it is removed from `in_progress` and `handled_count` is incremented by 1.
During initial startup or a restart after persistence, state validation occurs in `_get_state`. If both `pending_sitemap_urls` and `in_progress_sitemap_url` are empty and `completed` is False, this indicates a fresh start. In this case, `self._sitemap_urls` are moved to `pending_sitemap_urls`. Otherwise, the system is restarting from a persisted state. If `in_progress` contains any URLs, they are moved back to `url_queue` and `in_progress` is cleared.
Regarding cyclic sitemaps. According to sitemaps.org, only a sitemapindex should contain links to sitemaps, and a sitemapindex should not contain links to other sitemapindex files.
This means the hierarchy should be: a sitemapindex contains links to one or more sitemaps, where each sitemap contains one or more links to website pages.
If a website follows the recommended protocol, this should prevent circular sitemaps. However, a website might have multiple sitemapindex files that reference the same sitemap (I don't think search engines would like this, but it's possible).
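The hierarchy described above, as a minimal hypothetical example (two separate files shown in one snippet; URLs are placeholders):

```xml
<!-- sitemap-index.xml: links only to sitemaps, never to other index files -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>

<!-- sitemap-1.xml: links only to website pages -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1</loc></url>
</urlset>
```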
> Sure. 🙂 The crawler processes one sitemap at a time. The current sitemap is stored in...
Cool. Now put it inside the docblock please 😁
> However, a website might have multiple sitemapindex files that reference the same sitemap (I don't think search engines would like this, but it's possible).
I could imagine a website implementing something like this as a scraper tarpit.
> Cool. Now put it inside the docblock please 😁
Added 🙂
> I could imagine a website implementing something like this as a scraper tarpit.
I added `processed_sitemap_urls` so that we can ensure we don't reprocess the same sitemap. 🙂
LGTM, however, let's wait for @Pijukatel and/or @janbuchar as well
janbuchar left a comment
LGTM
### Description
- Persist the `SitemapRequestLoader` state

### Issues
- Closes: apify#1269