Conversation

@Mantisus (Collaborator)

Description

  • Persist the `SitemapRequestLoader` state

Issues

  • Closes: apify#1269

@Mantisus self-assigned this Aug 11, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds persistence functionality to the `SitemapRequestLoader` by implementing state management through a new `SitemapRequestLoaderState` model and `RecoverableState` integration. The changes enable the loader to save and restore its internal state, allowing it to resume sitemap processing after interruptions.

  • Added state persistence model with queue, progress tracking, and completion status
  • Refactored internal data structures to use deques and sets that can be serialized
  • Added context manager support for proper resource cleanup

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

  • src/crawlee/request_loaders/_sitemap_request_loader.py: core implementation of state persistence with the new state model and async context manager support
  • src/crawlee/_utils/sitemap.py: added exception handling for SAXParseException during parser cleanup (see the sketch after this list)
  • tests/unit/request_loaders/test_sitemap_request_loader.py: added asyncio.sleep calls to give background loading time in tests
  • docs/guides/code_examples/request_loaders/sitemap_example.py: updated the example to include a sleep for proper loader initialization
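
For the sitemap.py change, a minimal sketch of the kind of guard being described, assuming an incremental SAX parser is closed during cleanup; the helper name and logging are illustrative, not the actual code:

```python
import logging
from xml.sax import SAXParseException
from xml.sax.xmlreader import IncrementalParser

logger = logging.getLogger(__name__)


def close_parser_quietly(parser: IncrementalParser) -> None:
    """Close an incremental SAX parser, ignoring errors from incomplete documents."""
    try:
        parser.close()
    except SAXParseException:
        # Closing mid-document (e.g. when parsing is aborted early) raises
        # SAXParseException; during cleanup this is expected and safe to ignore.
        logger.debug('Ignoring SAXParseException raised while closing the parser.')
```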

@Mantisus force-pushed the persist-sitemap-loader branch 2 times, most recently from 62b743d to 146e805 (August 11, 2025 20:17)
@Mantisus force-pushed the persist-sitemap-loader branch from 146e805 to c3f0f68 (August 11, 2025 20:51)
@Mantisus requested a review from vdusek (August 12, 2025 14:13)
vdusek left a comment (marked as resolved).

@Mantisus requested a review from vdusek (August 12, 2025 23:59)
```python
) as sitemap_loader,
):
    # Allow some time for the loader to fetch the sitemap and extract some URLs
    await asyncio.sleep(1)
```
Collaborator

What happens if we omit this? Won't the fetch_next_request just take a bit longer?

Collaborator Author

> What happens if we omit this? Won't the fetch_next_request just take a bit longer?

I decided not to wait inside fetch_next_request until there are links to process in the queue.
So it just returns None.

Neither the old nor the new behavior is a problem for Crawlee. But the old behavior with cyclic waiting may be unexpected for the user if they decide to use SitemapRequestLoader directly.

Collaborator

I think that is not the correct behavior. A 1-second wait does not guarantee that there will be something to process. Maybe it'll take less time, maybe more. How should a user know that?

If you want to check if there's something to process, you can always use is_empty.
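
For illustration, a minimal sketch of how the docs example could drain the loader without a fixed sleep; the constructor arguments here are assumptions for the sketch, not the exact API:

```python
import asyncio

from crawlee.http_clients import HttpxHttpClient
from crawlee.request_loaders import SitemapRequestLoader


async def main() -> None:
    # Constructor arguments are placeholders for this sketch; check the real example for the exact API.
    async with SitemapRequestLoader(
        sitemap_urls=['https://example.com/sitemap.xml'],
        http_client=HttpxHttpClient(),
    ) as sitemap_loader:
        # Poll is_empty instead of sleeping for an arbitrary amount of time.
        while not await sitemap_loader.is_empty():
            request = await sitemap_loader.fetch_next_request()
            if request is None:
                continue
            print(f'Processing {request.url}')
            await sitemap_loader.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
```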

Collaborator Author

Yes, I completely agree with that. The sleep is used poorly in this example; I will update it.

Collaborator

I didn't make myself clear the last time. I meant that this:

> I decided not to wait inside fetch_next_request until there are links to process in the queue.
> So it just returns None.

is not the correct behavior. In the JS version, fetchNextRequest blocks until a link can be returned (or the sitemap is completely processed). What is the reason to do it differently here?

Collaborator Author (@Mantisus, Aug 25, 2025)

I understand your point; it is indeed different from JS. But in my opinion, this behavior is more consistent with the loader API:

```python
    @abstractmethod
    async def fetch_next_request(self) -> Request | None:
        """Return the next request to be processed, or `null` if there are no more pending requests."""
```

The loader will return None if there are no requests ready to be processed at the moment.

But I won't insist that my interpretation of the API description is correct. 🙂

Collaborator

Given that 1) I wrote the docblock and 2) the JS version waits for requests to appear, I think the "return None if and only if there will be no more requests, ever" interpretation is correct 😁

I'll expand the docblock in the meantime and fix that "null", too.

Collaborator Author

Updated 🙂
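
For reference, a rough sketch of the blocking semantics agreed on here; this is an illustration only, not the actual implementation, and the attribute names simply mirror the state fields discussed further down:

```python
import asyncio
from collections import deque

from crawlee import Request


class _BlockingLoaderSketch:
    """Illustration of the agreed-upon semantics, not the real SitemapRequestLoader."""

    def __init__(self) -> None:
        self.url_queue: deque[str] = deque()  # URLs extracted from sitemaps by the background task
        self.in_progress: set[str] = set()    # URLs handed out but not yet marked as handled
        self.completed = False                # set to True once every sitemap has been fully parsed

    async def fetch_next_request(self) -> Request | None:
        # Block until a URL is available; return None only when the sitemaps are
        # fully processed and the queue is drained, i.e. no more requests will ever come.
        while True:
            if self.url_queue:
                url = self.url_queue.popleft()
                self.in_progress.add(url)
                return Request.from_url(url)
            if self.completed:
                return None
            await asyncio.sleep(0.05)  # give the background sitemap parsing a chance to add URLs
```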

```python
logger = getLogger(__name__)


class SitemapRequestLoaderState(BaseModel):
```
Collaborator

Now that I see this from the outside, it would be great if you could write down how the persistence mechanism works in the docblock of this class.

Also, I don't see processed sitemap URLs being tracked in any way, is that intentional?

Collaborator Author

> I don't see processed sitemap URLs being tracked in any way, is that intentional?

Yes, I may be wrong, but I think that cyclic links are not expected in sitemaps. Because of this, we don't need to store links to processed sitemaps.

The JS version behaves similarly: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L108

Collaborator

I wouldn't be surprised to encounter a cyclic sitemap somewhere, but I don't have a real-world example 🤷

Collaborator

Cyclic sitemaps aside, can you please briefly describe how this state model is used by the loader? Like "The crawler processes one sitemap at a time. The current one is kept in `in_progress_sitemap_url`..."

Collaborator Author

Sure. 🙂

The crawler processes one sitemap at a time. The current sitemap is stored in `in_progress_sitemap_url`. The `parse_sitemap` function parses the sitemap and returns elements as an async iterator. Each element retrieved from the iterator is processed based on its type. If the element is a `NestedSitemap`, its URL is added to `pending_sitemap_urls`. If the element is a `SitemapUrl`, the system checks whether it already exists in `current_sitemap_processed_urls`. If it exists, the loader was restarted from a saved state and the URL is skipped.

If the URL is new, it is first added to `url_queue`, then to `current_sitemap_processed_urls`, and `total_count` is incremented by 1. When all elements from the current sitemap iterator have been processed, `in_progress_sitemap_url` is set to `None` and `current_sitemap_processed_urls` is cleared. The next sitemap is retrieved from `pending_sitemap_urls`. If `pending_sitemap_urls` is empty, `completed` is set to `True`.

When `fetch_next_request` is called, a URL is extracted from `url_queue` and placed in `in_progress`. When `mark_request_as_handled` is called for the extracted URL, it is removed from `in_progress` and `handled_count` is incremented by 1.

During initial startup or a restart after persistence, state validation occurs in `_get_state`. If both `pending_sitemap_urls` and `in_progress_sitemap_url` are empty and `completed` is `False`, this indicates a fresh start. In this case, `self._sitemap_urls` are moved to `pending_sitemap_urls`. Otherwise, the system is restarting from a persisted state. If `in_progress` contains any URLs, they are moved back to `url_queue` and `in_progress` is cleared.
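
A sketch of what the state model looks like with the fields from the description above; the types and defaults here are assumptions, not the exact implementation:

```python
from collections import deque

from pydantic import BaseModel, Field


class SitemapRequestLoaderState(BaseModel):
    """Persistable state of the SitemapRequestLoader (sketch based on the description above)."""

    url_queue: deque[str] = Field(default_factory=deque)  # page URLs waiting for fetch_next_request
    in_progress: set[str] = Field(default_factory=set)  # URLs handed out but not yet handled
    pending_sitemap_urls: deque[str] = Field(default_factory=deque)  # sitemaps discovered but not yet parsed
    in_progress_sitemap_url: str | None = None  # the sitemap currently being parsed
    current_sitemap_processed_urls: set[str] = Field(default_factory=set)  # URLs already taken from the current sitemap
    completed: bool = False  # True once all sitemaps have been fully parsed
    total_count: int = 0  # total number of URLs discovered so far
    handled_count: int = 0  # number of URLs marked as handled
```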

Collaborator Author (@Mantisus, Aug 25, 2025)

Regarding cyclic sitemaps. According to sitemaps.org, only a sitemapindex should contain links to sitemaps, and a sitemapindex should not contain links to other sitemapindex files.
This means the hierarchy should be: a sitemapindex contains links to one or more sitemaps, where each sitemap contains one or more links to website pages.
If a website follows the recommended protocol, this should prevent circular sitemaps. However, a website might have multiple sitemapindex files that reference the same sitemap (I don't think search engines would like this, but it's possible).

Collaborator

> Sure. 🙂 The crawler processes one sitemap at a time. The current sitemap is stored in...

Cool. Now put it inside the docblock please 😁

> However, a website might have multiple sitemapindex files that reference the same sitemap (I don't think search engines would like this, but it's possible).

I could imagine a website implementing something like this as a scraper tarpit.

Collaborator Author

> Cool. Now put it inside the docblock please 😁

Added 🙂

> I could imagine a website implementing something like this as a scraper tarpit.

I added `processed_sitemap_urls` so that we can ensure we don't reprocess the same sitemap. 🙂
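
Roughly, the guard looks like this; a sketch only, since in the real loader these collections live in the persisted state rather than being passed around as arguments:

```python
from collections import deque


def next_sitemap_to_parse(
    pending_sitemap_urls: deque[str],
    processed_sitemap_urls: set[str],
) -> str | None:
    """Pop pending sitemap URLs, skipping any sitemap that was already parsed."""
    while pending_sitemap_urls:
        sitemap_url = pending_sitemap_urls.popleft()
        if sitemap_url in processed_sitemap_urls:
            continue  # already parsed, e.g. referenced from two different sitemapindex files
        processed_sitemap_urls.add(sitemap_url)
        return sitemap_url
    return None
```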

@vdusek (Collaborator) left a comment

LGTM, however, let's wait for @Pijukatel and/or @janbuchar as well

@Mantisus force-pushed the persist-sitemap-loader branch from 0524cfc to 0600583 (August 21, 2025 20:04)
@janbuchar self-requested a review (August 25, 2025 15:18)
@janbuchar (Collaborator) left a comment

LGTM

@vdusek merged commit 27ef9ad into apify:master on Aug 29, 2025. 19 checks passed.
Mantisus added a commit to Mantisus/crawlee-python that referenced this pull request Aug 30, 2025