feat: Persist the SitemapRequestLoader state (#1347)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pull Request Overview
This PR adds persistence functionality to the SitemapRequestLoader by implementing state management through a new SitemapRequestLoaderState model and RecoverableState integration. The changes enable the loader to save and restore its internal state, allowing it to resume sitemap processing after interruptions.
- Added state persistence model with queue, progress tracking, and completion status
- Refactored internal data structures to use deques and sets that can be serialized
- Added context manager support for proper resource cleanup
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/crawlee/request_loaders/_sitemap_request_loader.py` | Core implementation of state persistence with new state model and async context manager |
| `src/crawlee/_utils/sitemap.py` | Added exception handling for SAXParseException during parser cleanup |
| `tests/unit/request_loaders/test_sitemap_request_loader.py` | Added asyncio.sleep calls to allow background loading time in tests |
| `docs/guides/code_examples/request_loaders/sitemap_example.py` | Updated example to include sleep for proper loader initialization |
```python
    ) as sitemap_loader,
):
    # Allow some time for the loader to fetch the sitemap and extract some URLs
    await asyncio.sleep(1)
```
What happens if we omit this? Won't the fetch_next_request just take a bit longer?
> What happens if we omit this? Won't the fetch_next_request just take a bit longer?
I decided not to wait inside fetch_next_request until there are links to process in the queue.
So it just returns None.
Neither the old nor the new behavior is a problem for Crawlee. But the old behavior with cyclic waiting may be unexpected for the user if they decide to use SitemapRequestLoader directly.
I think that is not the correct behavior. A 1-second wait does not guarantee that there will be something to process. Maybe it'll take less time, maybe more. How should a user know that?
If you want to check if there's something to process, you can always use is_empty.
Yes, I completely agree with that. This is poorly used in the example, I will update it.
I didn't make myself clear the last time. I meant that this:
> I decided not to wait inside fetch_next_request until there are links to process in the queue.
> So it just returns None.
is not the correct behavior. In the JS version, fetchNextRequest blocks until a link can be returned (or the sitemap is completely processed). What is the reason to do it differently here?
I understand your point, it is indeed different from JS. But in my opinion, this behavior is more consistent with the loader API
```python
@abstractmethod
async def fetch_next_request(self) -> Request | None:
    """Return the next request to be processed, or `null` if there are no more pending requests."""
```

The loader will return None if there are no requests ready to be processed at the moment.
But I won't insist that my interpretation of the API description is correct. 🙂
Given that 1) I wrote the docblock and 2) the JS version waits for requests to appear, I think "return None if and only if there will be no more requests, ever" is the correct interpretation 😁
I'll expand the docblock in the meantime and fix that "null", too
Updated 🙂
```python
logger = getLogger(__name__)


class SitemapRequestLoaderState(BaseModel):
```
Now that I see this from the outside, it would be great if you could write down how the persistence mechanism works in the docblock of this class.
Also, I don't see processed sitemap URLs being tracked in any way. Is that intentional?
> I don't see processed sitemap URLs being tracked in any way. Is that intentional?
Yes, I may be wrong, but I think that cyclic links are not expected in sitemaps. Thanks to this, we don't need to store links to processed sitemaps.
JS uses similar behavior - https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L108
I wouldn't be surprised to encounter a cyclic sitemap somewhere, but I don't have a real-world example 🤷
Cyclic sitemaps aside, can you please briefly describe how this state model is used by the loader? Like "The crawler processes one sitemap at a time. The current one is kept in `in_progress_sitemap_url`..."
Sure. 🙂
The crawler processes one sitemap at a time. The current sitemap is stored in `in_progress_sitemap_url`. The `parse_sitemap` function parses the sitemap and returns elements as an async iterator. Each element retrieved from the iterator is processed based on its type. If the element is a `NestedSitemap`, its URL is added to `pending_sitemap_urls`. If the element is a `SitemapUrl`, the system checks whether it already exists in `current_sitemap_processed_urls`. If it exists, the loader was restarted from a saved state and the URL is skipped.
If the URL is new, it is first added to `url_queue`, then to `current_sitemap_processed_urls`, and `total_count` is incremented by 1. When all elements from the current sitemap iterator have been processed, `in_progress_sitemap_url` is set to None and `current_sitemap_processed_urls` is cleared. The next sitemap is retrieved from `pending_sitemap_urls`. If `pending_sitemap_urls` is empty, `completed` is set to True.
When `fetch_next_request` is called, a URL is extracted from `url_queue` and placed in `in_progress`. When `mark_request_as_handled` is called for the extracted URL, it is removed from `in_progress` and `handled_count` is incremented by 1.
During initial startup or a restart after persistence, state validation occurs in `_get_state`. If both `pending_sitemap_urls` and `in_progress_sitemap_url` are empty and `completed` is False, this indicates a fresh start. In this case, `self._sitemap_urls` are moved to `pending_sitemap_urls`. Otherwise, the system is restarting from a persisted state. If `in_progress` contains any URLs, they are moved back to `url_queue` and `in_progress` is cleared.
Regarding cyclic sitemaps. According to sitemaps.org, only a sitemapindex should contain links to sitemaps, and a sitemapindex should not contain links to other sitemapindex files.
This means the hierarchy should be: a sitemapindex contains links to one or more sitemaps, where each sitemap contains one or more links to website pages.
If a website follows the recommended protocol, this should prevent circular sitemaps. However, a website might have multiple sitemapindex files that reference the same sitemap (I don't think search engines would like this, but it's possible).
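The hierarchy described above, as a minimal hypothetical example (two separate files shown in one snippet; URLs are placeholders):

```xml
<!-- sitemap-index.xml: links only to sitemaps, never to other index files -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>

<!-- sitemap-1.xml: links only to website pages -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1</loc></url>
</urlset>
```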
> Sure. 🙂 The crawler processes one sitemap at a time. The current sitemap is stored in...
Cool. Now put it inside the docblock please 😁
> However, a website might have multiple sitemapindex files that reference the same sitemap (I don't think search engines would like this, but it's possible).
I could imagine a website implementing something like this as a scraper tarpit.
> Cool. Now put it inside the docblock please 😁
Added 🙂
> I could imagine a website implementing something like this as a scraper tarpit.
I added `processed_sitemap_urls` so that we can ensure we don't reprocess the same sitemap. 🙂
LGTM, however, let's wait for @Pijukatel and/or @janbuchar as well
janbuchar left a comment
LGTM
### Description
- Persist the `SitemapRequestLoader` state

### Issues
- Closes: apify#1269